Statistical Genetics Seminar, Spring 2020

Instructor: Liz Blue (em27@uw.edu). Also, Sharon Browning, Tim Thornton, and Ellen Wijsman.

Topic: Genotype calling of repeats and structural variants.

Motivation:

This quarter, we focus on how short-read sequence data is used to identify difficult-to-detect types of genetic variation, including short tandem repeats (STRs), mobile elements (MEs), and structural variants (SVs). It is important to note that genotypes are inferred from high-throughput sequencing data, not directly observed. Imagine you have a pool of ~150bp sequences. On their own, they are not very informative, but if you can align them to each other and to a reference genome, you can identify small mismatches. Those mismatches can be evaluated and eventually called as variant positions. The number of copies of the variant allele, estimated from the reads of a single sample, can then be used to call a genotype. If a region of the genome is repetitive (like a short tandem repeat) or not unique (like a mobile element), it is harder to tell whether a given 150bp fragment matches that region or another region with similar sequence. If a variant is much larger than 150bp, like a structural variant, it can be harder to identify because the information is spread over many fragments. This quarter, we explore different approaches for identifying and genotyping these kinds of variants with these kinds of data.
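
As a toy illustration of the last step in that pipeline, calling a genotype from the number of reads supporting each allele, here is a minimal Python sketch. It is not the method of any caller on the reading list. It assumes a diploid, biallelic site, independent reads, and a single fixed per-read error rate (`error_rate`, a made-up parameter for this example); real callers also use base qualities, mapping qualities, and priors.

```python
from math import comb, log

def genotype_log_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Binomial log-likelihood of the read counts under each diploid genotype."""
    n = ref_reads + alt_reads
    # Assumed probability that a read shows the alternate allele
    # when the sample carries 0, 1, or 2 copies of it.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    return {
        gt: log(comb(n, alt_reads)) + alt_reads * log(p) + ref_reads * log(1.0 - p)
        for gt, p in p_alt.items()
    }

def call_genotype(ref_reads, alt_reads, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    lls = genotype_log_likelihoods(ref_reads, alt_reads, error_rate)
    return max(lls, key=lls.get)

print(call_genotype(20, 0))   # all reads match the reference
print(call_genotype(10, 10))  # reads split evenly between alleles
print(call_genotype(1, 14))   # nearly all reads carry the variant
```

With few reads, or with reads that could have come from more than one place in the genome, these likelihoods become much less decisive, which is the difficulty the papers this quarter try to address.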

Background:

High-throughput sequencing data: A 5-minute video by Illumina that explains the technology generating most of the data represented in this quarter’s reading list can be found here. Dr. Rob Edwards offers a slightly longer video with a real professor and a marker board here. He has a whole series of short videos about computational genomics that you might find helpful, including how these sequencing reads can be assembled into genomes, exomes, and so on. Alternatively, Dr. Eric Chow touches on several sequencing techniques in a 30-minute video here.

Short tandem repeats (STRs): Microsatellites, or STRs, are highly polymorphic sequences in which a short motif of a few base pairs is repeated many times in tandem. They are a popular choice for forensic identification and linkage analysis because genotypes from relatively few STRs can be very informative.
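
To make the definition concrete, here is a minimal Python sketch that counts the longest uninterrupted run of a motif in a sequence. The function name and the example sequences are made up for illustration. Real STR callers also have to cope with sequencing errors, interrupted repeats, and reads that do not span the whole repeat.

```python
import re

def longest_tandem_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence."""
    # Match one or more consecutive copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)

# Two hypothetical alleles that differ only in the number of CAG repeats.
allele_a = "TT" + "CAG" * 4 + "GA"
allele_b = "TT" + "CAG" * 7 + "GA"
print(longest_tandem_run(allele_a, "CAG"), longest_tandem_run(allele_b, "CAG"))
```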

Mobile elements (MEs): These DNA sequences can move around the genome using either a copy-and-paste or a cut-and-paste mechanism. These variants have a wide variety of applications, including population genetics and medical genetics.

Structural variants (SVs): These are genetic variants involving large segments of DNA, including changes in copy number (CNVs), orientation, and location within the genome. This is a bucket term that includes STRs and MEs, but many studies simply focus on large deletions or duplications. This heterogeneous class of variation is used for a wide variety of studies, including medical genomics and population genomics.

Overview of the reading list:

We start the quarter by investigating multiple approaches to identify STR variation in high-throughput sequencing data. These kinds of variants are fairly small and have well-defined properties. Some approaches focus on correctly identifying the repeat length of known pathogenic variants, while others take a wider approach. We next move to MEs, because polymorphic subfamilies of MEs in humans share high sequence identity or consensus regions, which can aid in their identification. We then expand our focus to SVs in general, and conclude with a benchmarking paper comparing different types of SV and ME callers and their ability to identify known variants in real and simulated data.

Expectations:

This is the first quarter our journal club goes online. This is a 1-credit-hour journal club, and I think the move to Zoom will be pretty seamless. Everyone is expected to read the paper prior to the seminar each week, and each student (in pairs if possible) will present one paper during the quarter. I will send a Zoom meeting invitation to the class list for our usual class time; the platform is pretty user friendly, but let me know if you have any questions. If you find that your internet access is limited or you otherwise cannot attend the journal club “live”, please let me know and I will work with you on an alternative (e.g., write a paragraph or two on the paper and send it by the end of Tuesday that week). The goal of this seminar is to introduce you to a variety of ways to tackle a complicated problem, and to give you a sense of the variables and assumptions involved and an idea of which tools might work better under different conditions.

Schedule:

March 31: organizational meeting

April 7: Dolzhenko et al. (2019) ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35(22): 4754-4756. *Note, a lot of the information in this paper is in the supplement. Presenter: Alan Min.

April 14: Dashnow et al. (2018) STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biology 19:121. Presenters: Charles Wolock and Jacob Alfieri.

April 21: Mousavi et al. (2019) Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Research 47(15): e90. Presenters: Seth Temple and Hong Xiao.

April 28: Thung et al. (2014) Mobster: accurate detection of mobile element insertions in next generation sequencing data. Genome Biology 15(10): 488. Presenter: Michael Goldberg.

May 5: Gardner et al. (2017) The Mobile Element Locator Tool (MELT): population-scale mobile element discovery and biology. Genome Research 27(11): 1916-1929. Presenters: Nandana Rao and Alyna Khan.

May 12: Guest lecture: Accurate genotyping of tandem repeats and identification of repeat expansions using long-read sequencing by Arvis Sulovari in Genome Sciences.

May 19: Chen et al. (2016) Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32(8): 1220-2. *Note, a lot of the information in this paper is in the supplement. Presenter: Ruoyi Cai.

May 26: CANCELED. Cameron et al. (2017) GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Research 27(12): 2050-2060.

June 2: Kosugi et al. (2019) Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biology 20:117. *Note, this is part of a special issue on benchmarking with several other papers on related topics. Presenters: Cameron Haas and Joe Zhou.