Motivation:
This quarter, we focus on how short-read sequence data is used to identify difficult-to-detect types of genetic variation, including short tandem repeats (STRs), mobile elements (MEs), and structural variants (SVs). It is important to note that genotypes are inferred from high-throughput sequencing data, not directly observed. Now imagine you have a pool of ~150bp sequences. On their own, they are not very informative, but if you can align them with each other and with a reference genome, you can identify small mismatches. Those mismatches can be evaluated and eventually called as variant positions. The number of copies of the variant allele, estimated from a single sample's data, can be used to call a genotype. If a region of the genome is repetitive (like a short tandem repeat) or not unique (like a mobile element), it is harder to tell whether a given 150bp fragment matches that region or another region with similar sequence. If a variant is much larger than 150bp, like a structural variant, it can be harder to identify because the information is spread over many fragments. This quarter, we explore different approaches for identifying and genotyping these kinds of variants with these kinds of data.
Background:
High-throughput sequencing data: A 5-minute video by Illumina that explains the technology generating most of the data represented in this quarter’s reading list can be found here. Dr. Rob Edwards offers a slightly longer video with a real professor and a marker board here. He has a whole series of short videos about computational genomics that you might find helpful, including how these sequencing reads can be assembled into genomes, exomes, and so on.
Alternatively, Dr. Eric Chow touches on several sequencing techniques in a 30-minute video here.
Short tandem repeats (STRs): Microsatellites, or STRs, are highly polymorphic sequences in which a short motif of a few base pairs is repeated in tandem many times. They are a popular choice for forensic identification and linkage analysis because genotypes from relatively few STRs can be very informative.
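To make the STR definition concrete, here is a minimal sketch that counts the longest uninterrupted run of a motif in a single read. The function name `longest_tandem_run`, the CAG motif, and the example read are all made up for illustration; this is not how any of the tools on the reading list work, since they also have to handle alignment, sequencing errors, and repeats longer than a read.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    # Match one or more consecutive copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Hypothetical read: flanking sequence, five CAG copies, then a lone CAG.
read = "TTGA" + "CAG" * 5 + "TT" + "CAG" + "A"
print(longest_tandem_run(read, "CAG"))  # prints 5
```

A read that ends inside the repeat only gives a lower bound on the repeat length, which is one reason STR genotyping from ~150bp reads is difficult.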
Mobile elements (MEs): These DNA sequences can move around the genome using either a copy-and-paste or a cut-and-paste mechanism. These variants have a wide variety of applications, including population genetics and medical genetics.
Structural variants (SVs): These are genetic variants involving large segments of DNA, including changes in copy number (CNVs), orientation, and location within the genome. This is an umbrella term that includes STRs and MEs, but many studies simply focus on large deletions or duplications. This heterogeneous class of variation is used for a wide variety of studies, including medical genomics and population genomics.
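As noted in the Motivation, genotypes are inferred from read data, not observed directly. The sketch below illustrates that idea with a toy binomial model: given counts of reads supporting the reference and alternate alleles at one site, it picks the diploid genotype with the highest likelihood. The function name `call_genotype` and the 1% error rate are assumptions chosen for illustration; real callers use richer models (base and mapping qualities, priors, and so on).

```python
from math import comb


def call_genotype(ref_reads: int, alt_reads: int, error_rate: float = 0.01) -> str:
    """Return the most likely diploid genotype under a simple binomial model."""
    n = ref_reads + alt_reads
    # Expected fraction of alt-supporting reads under each genotype (assumed values).
    alt_fraction = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    likelihoods = {
        gt: comb(n, alt_reads) * p**alt_reads * (1 - p) ** ref_reads
        for gt, p in alt_fraction.items()
    }
    return max(likelihoods, key=likelihoods.get)


print(call_genotype(30, 0))   # all reads match the reference
print(call_genotype(14, 16))  # roughly half the reads carry the variant
print(call_genotype(1, 29))   # nearly all reads carry the variant
```

The same logic gets harder for STRs, MEs, and SVs because deciding which reads "support" an allele is itself uncertain when reads map ambiguously or span only part of the variant.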
Overview of the reading list:
We start the quarter by
investigating multiple approaches to identify STR variation in high-throughput
sequencing data. These kinds of variants are fairly small and well-defined with
respect to their properties. Some approaches focus on correctly identifying the
repeat length of known pathogenic variants, while others take a wider approach.
We next move to MEs, because polymorphic subfamilies of MEs in humans share
high sequence identity or consensus regions which can aid in their
identification. We expand our focus to SVs in general, and conclude with a
benchmarking paper comparing different types of SV and ME callers and their
ability to identify known variants in real and simulated data.
Expectations:
This is the first quarter
our journal club goes online. This is a 1-credit-hour journal club, and I think
the move to Zoom will be pretty seamless.
Everyone is expected to read the paper prior to the seminar each week,
and each student (in pairs if possible) will present the seminar on one paper
during the quarter. I will send a Zoom meeting invitation to the class list for
our usual class time; the platform is pretty user friendly but let me know if
you have any questions. If you find that your internet access is limited or
you otherwise cannot attend the journal club “live”, please let me know and I will
work with you on an alternative (e.g., write a paragraph or two on the paper and
send it by the end of Tuesday for the week). The goal of this seminar is
to introduce you to a variety of ways to tackle a complicated problem, for you
to get a sense of the variables and assumptions involved, and an idea of which
tools might work better under different conditions.
Schedule:
March 31: organizational meeting
April 7: Dolzhenko et al. (2019) ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35(22): 4754-4756. *Note, a lot of the information in this paper is in the supplement. Presenter: Alan Min.
April 14: Dashnow et al. (2018) STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biology 19:121. Presenters: Charles Wolock and Jacob Alfieri.
April 21: Mousavi et al. (2019) Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Research 47(15): e90. Presenters: Seth Temple and Hong Xiao.
April 28: Thung et al. (2014) Mobster: accurate detection of mobile element insertions in next generation sequencing data. Genome Biology 15(10): 488. Presenter: Michael Goldberg.
May 5: Gardner et al. (2017) The Mobile Element Locator Tool (MELT): population-scale mobile element discovery and biology. Genome Research 27(11): 1916-1929. Presenters: Nandana Rao and Alyna Khan.
May 12: Guest lecture: Accurate genotyping of tandem repeats and identification of repeat expansions using long-read sequencing by Arvis Sulovari in Genome Sciences.
May 19: Chen et al. (2016) Manta: Rapid Detection of Structural Variants and Indels for Germline and Cancer Sequencing Applications. Bioinformatics 32(8): 1220-1222. *Note, a lot of the information in this paper is in the supplement. Presenter: Ruoyi Cai.
May 26: CANCELED. Cameron et al. (2017) GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Research 27(12): 2050-2060.
June 2: Kosugi et al. (2019) Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biology 20:117. *Note, this is part of a special issue on benchmarking with several other papers on related topics. Presenters: Cameron Haas and Joe Zhou.