Workshop in Biostatistics

DATE: October 27, 2016
TIME: 1:30 - 2:50 pm
LOCATION: Medical School Office Building, Rm x303
TITLE: Capture-recapture models for DNA sequencing experiments

Timothy Patrick Daley
Postdoctoral Scholar, Stanford Statistics and Bioengineering


With current technologies, DNA sequencing experiments involve sampling genomic fragments from a large pool, called a library. The library is constructed from a small initial amount of DNA using amplification procedures, so each original fragment exists in thousands or millions of copies. This amplification, although necessary to produce enough material for the experiment, can introduce large biases and implies that the properties of the library cannot be known beforehand. Our goal is to infer properties of the experiment based on a small initial sample of the library. The capture-recapture framework naturally fits this scenario; however, next-generation sequencing experiments produce data several orders of magnitude larger than traditional capture-recapture experiments. This gives rise to challenges in extrapolation but also to opportunities for methods that use the size of the data to make highly accurate inferences. We will discuss the application of non-parametric empirical Bayes models to predict critical aspects of sequencing experiments, allowing optimal allocation of sequencing resources in large-scale sequencing experiments.

Suggested readings:
"Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know?" by Efron & Thisted.

Predicting the molecular complexity of sequencing libraries
Modeling genome coverage in single-cell sequencing
Applications of species accumulation curves in large-scale biological data analysis