Workshop in Biostatistics

DATE: February 18, 2016
TIME: 1:30 - 3:00 pm
LOCATION: Medical School Office Building, Rm x303
TITLE: Statistical methods for identifying shared evolutionary history of human genes
Department of Statistics, Harvard University

Humans have more than 20,000 genes, but the functions of most of them remain uncharacterized. It has been observed that functionally associated genes tend to be gained and lost together during evolution, so identifying co-evolving genes can help predict gene function. Pioneering work by Pellegrini et al. (1999) introduced the concept of “phylogenetic profiling” to relate genes by their similar presence/absence profiles across species. Although more than 10 phylogenetic profiling algorithms have been developed, most use only simple metrics (e.g., Hamming distance, Pearson correlation) to measure the similarity between genes’ presence/absence patterns. We propose a tree-structured hidden Markov model (HMM) to model the stochastic gain/loss process of genes on a given phylogenetic tree, and a Bayesian Dirichlet process mixture of HMMs to group genes by their shared evolutionary history. We calculate the posterior distribution of the evolutionary history of genes and of co-evolved gene module assignments via dynamic programming and Markov chain Monte Carlo. Application of our method (named CLIME) to ~1,000 annotated human pathways reveals both known and unanticipated evolutionary modularity and co-evolving components.
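
For readers new to phylogenetic profiling, the simple metrics mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the CLIME method; it only shows the baseline similarity measures (Hamming distance and Pearson correlation) applied to binary presence/absence profiles. The gene names and the 8-species profiles below are hypothetical and made up for illustration.

```python
from math import sqrt


def hamming_distance(a, b):
    """Count the species in which the two presence/absence profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same set of species")
    return sum(x != y for x, y in zip(a, b))


def pearson_correlation(a, b):
    """Pearson correlation between two profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same set of species")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_a * var_b)


# Hypothetical profiles: 1 = gene present in a species, 0 = absent.
# Columns are 8 made-up species in a fixed order.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 1, 0, 1],
    "geneB": [1, 1, 0, 0, 1, 0, 0, 1],  # differs from geneA in one species
    "geneC": [0, 0, 1, 1, 0, 0, 1, 0],  # exact complement of geneA
}

if __name__ == "__main__":
    for other in ("geneB", "geneC"):
        d = hamming_distance(profiles["geneA"], profiles[other])
        r = pearson_correlation(profiles["geneA"], profiles[other])
        print(f"geneA vs {other}: Hamming = {d}, Pearson = {r:.3f}")
```

Both metrics treat species as independent columns and make no use of the phylogenetic tree, which is presumably the limitation that the tree-structured HMM described in the abstract is meant to address.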

Suggested reading:
Pellegrini, et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences (1999)

Li, et al. Expansion of biological pathways based on evolutionary inference. Cell (2014) (Main text and supplementary statistical method of this paper can be downloaded from