Workshop in Biostatistics

M112 Alway Building, Medical Center

(next to the Dean's courtyard)

DATE: April 6, 2017
TIME: 1:30 - 2:50 pm
TITLE: Haplotype Phasing and Community Recovery
SPEAKER: David Tse
Professor of Electrical Engineering
Electrical Engineering Department, Stanford


Humans have 23 pairs of homologous chromosomes which are identical except on single nucleotide polymorphisms (SNPs). A haplotype of an individual is the pair of sequences of SNPs on the two homologous chromosomes. Knowing the haplotypes of individuals can lead to a better understanding of the interplay of genetic variation and disease as well as better inference of human demographic history. In this talk,  we discuss the problem of inferring haplotypes from high-throughput sequencing data in the form of reads with long-range connectivity. We give a simple formula for the minimum number of reads needed to accurately recover a haplotype, as well as an optimal near linear-time algorithm that can achieve the information theoretic limit for exact recovery.

We evaluate the algorithm on sequencing data from 10X Genomics and compare its performance with state-of-the-art. Our work leverages connections between the phasing problem and  the problem of community recovery, where communities have to be inferred based on the friendship graph of users. Our solution leads to an optimal community recovery algorithm for general graphs with locality, where pairwise noisy measurements of whether two nodes are in the same community or different communities come mainly or exclusively from nearby nodes rather than uniformly sampled between all nodes pairs, as in most existing models.

Suggested readings:

Y. Chen, G. Kamath, C. Suh, D. Tse.  "Community Recovery in Graphs with Locality", International Conference on Machine Learning, 2016.

G. Kamath, E. Sasoglu, D. Tse, " Optimal Haplotype Assembly from High-throughput mate-pair reads", International Symposium on Information Theory, 2015.