Workshop in Biostatistics

DATE: October 6, 2016
TIME: 1:30 - 2:50 pm
LOCATION: Medical School Office Building, Rm x303
TITLE: Population assisted genome inference

Benedict Paten
Associate Research Scientist
Director, Computational Genomics Lab at the
UC Santa Cruz Genomics Institute

The human reference genome has transformed human genetics by providing a proxy for a universal coordinate system. However, the reference is but one genome and, as such, cannot contain all the variation present in the population. Analysis relative to it creates a so-called reference allele bias: when identifying the variants in a new sample by mapping against the reference, it is easy to find alleles present in the reference but hard, if not nearly impossible, to find alleles it does not contain. Adding variation to the reference genome naturally defines a graph structure, a genome graph, with the intersections between additional sequences defining vertices that connect myriad possible human genomes. This subtle extension opens numerous possibilities and forces us to redefine many basic concepts that the field has taken for granted. I will lay out our theoretical and empirical investigations of these issues and show our progress towards a holy grail: comprehensive genome inference conditioned not on a single genome but on a population.

Suggested readings:
A global reference for human genetic variation.  Nature.  2015.  526:68-74.

An integrated map of structural variation in 2,504 human genomes.  Nature.  2015.  526:75-81.

Benedict Paten, Adam Novak, David Haussler.  Mapping to a Reference Genome Structure.

Durbin R.  Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT).  Bioinformatics.  2014.  30(9):1266-72.