Please refer to my NIH biosketch:
We have a highly collaborative research program in the evolutionary genomics of cancer. We apply well-established principles of phylogenetics to cancer evolution on the basis of whole genome sequencing and functional genomics data of multiple tumor samples from the same patient. Introductions to our work and the concepts we apply are best found in the Newburger et al. paper in Genome Research (2013) and the Sidow and Spies review in TIGS (2015).
More information can be found here: http://www.sidowlab.org
Next-generation sequencing technologies are fueling a wave of new diagnostic tests. Progress on a key set of nine research challenge areas will help generate the knowledge required to advance these diagnostics effectively to the clinic.
View details for DOI 10.1126/scitranslmed.aaf7314
View details for Web of Science ID 000374412300003
View details for PubMedID 27099173
Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a novel methodology of using read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), captures the relationships among the short reads governed by the long read process via a Markov Random Field. We utilized a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants on the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and on an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies.
View details for DOI 10.1101/gr.191189.115
View details for PubMedID 26286554
The effects of genetic variation on gene regulation in the developing mammalian embryo remain largely unexplored. To globally quantify these effects, we crossed two divergent mouse strains and asked how genotype of the mother or of the embryo drives gene expression phenotype genomewide. Embryonic expression of 331 genes depends on the genotype of the mother. Embryonic genotype controls allele-specific expression of 1594 genes and a highly overlapping set of cis-expression quantitative trait loci (eQTL). A marked paucity of trans-eQTL suggests that the widespread expression differences do not propagate through the embryonic gene regulatory network. The cis-eQTL genes exhibit lower-than-average evolutionary conservation and are depleted for developmental regulators, consistent with purifying selection acting on expression phenotype of pattern formation genes. The widespread effect of maternal and embryonic genotype in conjunction with the purifying selection we uncovered suggests that embryogenesis is an important and understudied reservoir of phenotypic variation.
View details for DOI 10.7554/eLife.05538
View details for Web of Science ID 000373792400001
View details for PubMedID 25871848
View details for PubMedCentralID PMC4417935
Evolutionary mechanisms in cancer progression give tumors their individuality. Cancer evolution is different from organismal evolution, however, and we discuss where concepts from evolutionary genetics are useful or limited in facilitating an understanding of cancer. Based on these concepts we construct and apply the simplest plausible model of tumor growth and progression. Simulations using this simple model illustrate the importance of stochastic events early in tumorigenesis, highlight the dominance of exponential growth over linear growth and differentiation, and explain the clonal substructure of tumors.
View details for DOI 10.1016/j.tig.2015.02.001
View details for Web of Science ID 000353089500006
View details for PubMedID 25733351
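The role of early stochastic events described in the abstract above can be illustrated with a minimal sketch. This is not the model from the review: it assumes a discrete-generation branching process in which every cell either divides (probability `p_divide`) or dies, and the names `simulate_clone`, `survival_fraction`, and the `cap` threshold are hypothetical choices made for illustration.

```python
import random


def simulate_clone(p_divide, cap=1000, max_generations=1000, rng=None):
    """Follow one clone founded by a single cell.

    Assumed toy model: in each generation every cell independently
    divides into two (probability p_divide) or dies (otherwise).
    Returns True if the clone reaches `cap` cells, False if it dies out
    or fails to reach `cap` within `max_generations`.
    """
    rng = rng or random.Random()
    cells = 1
    for _ in range(max_generations):
        if cells == 0:
            return False
        if cells >= cap:
            return True
        dividing = sum(1 for _ in range(cells) if rng.random() < p_divide)
        cells = 2 * dividing
    return cells >= cap


def survival_fraction(p_divide, trials=500, seed=0):
    """Fraction of independent single-cell clones that reach the cap."""
    rng = random.Random(seed)
    survived = sum(simulate_clone(p_divide, rng=rng) for _ in range(trials))
    return survived / trials


if __name__ == "__main__":
    for p in (0.40, 0.55, 0.70):
        print(p, survival_fraction(p))
```

Under this assumed process, a clone with a growth advantage (`p_divide` above 0.5) still goes extinct with probability (1 - p)/p, so most clones with a modest advantage are lost while small. A clone that survives this phase then grows exponentially.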
All cells in an individual are related to one another by a bifurcating lineage tree, in which each node is an ancestral cell that divided into two, each branch connects two nodes, and the root is the zygote. When a somatic mutation occurs in an ancestral cell, all its descendants carry the mutation, which can then serve as a lineage marker for the phylogenetic reconstruction of tumor progression. Using this concept, we investigate cell lineage relationships and genetic heterogeneity of pre-invasive neoplasias compared to invasive carcinomas. We deeply sequenced over a thousand phylogenetically informative somatic variants in 66 morphologically independent samples from six patients that represent a spectrum of normal, early neoplasia, carcinoma in situ, and invasive carcinoma. For each patient, we obtained a highly resolved lineage tree that establishes the phylogenetic relationships among the pre-invasive lesions and with the invasive carcinoma. The trees reveal lineage heterogeneity of pre-invasive lesions, both within the same lesion, and between histologically similar ones. On the basis of the lineage trees, we identified a large number of independent recurrences of PIK3CA H1047 mutations in separate lesions in four of the six patients, often separate from the diagnostic carcinoma. Our analyses demonstrate that multi-sample phylogenetic inference provides insights on the origin of driver mutations, lineage heterogeneity of neoplastic proliferations, and the relationship of genomically aberrant neoplasias with the primary tumors. PIK3CA driver mutations may be comparatively benign inducers of cellular proliferation.
View details for DOI 10.1186/s13073-015-0146-2
View details for PubMedID 25918554
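The lineage-marker idea in the abstract above (a somatic mutation is inherited by all descendants of the cell in which it arose) can be sketched in idealized form. The code below is not the paper's pipeline: it assumes error-free presence/absence calls and no mutation loss, so that the sample sets of any two mutations are either nested or disjoint. The function name `build_lineage_tree` and the sample labels are hypothetical.

```python
def build_lineage_tree(mutations):
    """Arrange mutations into a tree from the samples that carry them.

    mutations: dict mapping mutation name -> set of sample names.
    Returns a dict mapping each mutation to its parent mutation, or to
    None when no other mutation's sample set contains it.
    Raises ValueError when two sample sets overlap without nesting,
    which the idealized model (no errors, no losses) cannot explain.
    """
    # Mutations shared by more samples arose earlier in the lineage.
    ordered = sorted(mutations, key=lambda m: (-len(mutations[m]), m))
    parent = {}
    for i, name in enumerate(ordered):
        parent[name] = None
        carriers = mutations[name]
        # Scan earlier mutations from the smallest sample set upward,
        # so the first superset found is the closest ancestor.
        for earlier in reversed(ordered[:i]):
            earlier_carriers = mutations[earlier]
            if carriers <= earlier_carriers:
                parent[name] = earlier
                break
            if carriers & earlier_carriers:
                raise ValueError(
                    f"{name} and {earlier} overlap without nesting"
                )
    return parent


if __name__ == "__main__":
    # Hypothetical samples: two early neoplasias (E1, E2), one carcinoma (C).
    calls = {
        "m1": {"E1", "E2", "C"},
        "m2": {"E2", "C"},
        "m3": {"C"},
        "m4": {"E1"},
    }
    print(build_lineage_tree(calls))
```

Real data contain sequencing noise and copy-number losses, so a practical method would need a probabilistic treatment rather than this strict nesting test.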
To investigate the epigenetic landscape at the interface between mother and fetus, we provide a comprehensive analysis of parent-of-origin bias in the mouse placenta. Using F1 interspecies hybrids between Mus musculus (C57BL/6J) and Mus musculus castaneus, we sequenced RNA from 23 individual midgestation placentas, five late stage placentas, and two yolk sac samples and then used SNPs to determine whether transcripts were preferentially generated from the maternal or paternal allele. In the placenta, we find 103 genes that show significant and reproducible parent-of-origin bias, of which 78 are novel candidates. Most (96%) show a strong maternal bias which we demonstrate, via multiple mathematical models, pyrosequencing, and FISH, is not due to maternal decidual contamination. Analysis of the X chromosome also reveals paternal expression of Xist and several genes that escape inactivation, most significantly Alas2, Fhl1, and Slc38a5. Finally, sequencing individual placentas allowed us to reveal notable expression similarity between littermates. In all, we observe a striking preference for maternal transcription in the midgestation mouse placenta and a dynamic imprinting landscape in extraembryonic tissues, reflecting the complex nature of epigenetic pathways in the placenta.
View details for DOI 10.1016/j.ydbio.2014.02.020
View details for PubMedID 24594094
We present the discovery of genes recurrently involved in structural variation in nasopharyngeal carcinoma (NPC) and the identification of a novel type of somatic structural variant. We identified the variants with high complexity mate-pair libraries and a novel computational algorithm specifically designed for tumor-normal comparisons, SMASH. SMASH combines signals from split reads and mate-pair discordance to detect somatic structural variants. We demonstrate a >90% validation rate and a breakpoint reconstruction accuracy of 3 bp by Sanger sequencing. Our approach identified three in-frame gene fusions (YAP1-MAML2, PTPLB-RSRC1, and SP3-PTK2) that had strong levels of expression in corresponding NPC tissues. We found two cases of a novel type of structural variant, which we call "coupled inversion," one of which produced the YAP1-MAML2 fusion. To investigate whether the identified fusion genes are recurrent, we performed fluorescent in situ hybridization (FISH) to screen 196 independent NPC cases. We observed recurrent rearrangements of MAML2 (three cases), PTK2 (six cases), and SP3 (two cases), corresponding to a combined rate of structural variation recurrence of 6% among tested NPC tissues.
View details for DOI 10.1101/gr.156224.113
View details for PubMedID 24214394
Next-generation sequencing technologies provide a powerful tool for studying genome evolution during progression of advanced diseases such as cancer. Although many recent studies have employed new sequencing technologies to detect mutations across multiple, genetically related tumors, current methods do not exploit available phylogenetic information to improve the accuracy of their variant calls. Here, we present a novel algorithm that uses somatic single-nucleotide variations (SNVs) in multiple, related tissue samples as lineage markers for phylogenetic tree reconstruction. Our method then leverages the inferred phylogeny to improve the accuracy of SNV discovery. Experimental analyses demonstrate that our method achieves up to 32% improvement for somatic SNV calling of multiple, related samples over the accuracy of GATK's Unified Genotyper, the state-of-the-art multisample SNV caller.
View details for DOI 10.1089/cmb.2013.0106
View details for PubMedID 24195709
High-occupancy target (HOT) regions are compact genome loci occupied by many different transcription factors (TFs). HOT regions were initially defined in invertebrate model organisms, and we here show that they are a ubiquitous feature of the human gene-regulation landscape. We identified HOT regions by a comprehensive analysis of ChIP-seq data from 96 DNA-associated proteins in 5 human cell lines. Most HOT regions co-localize with RNA polymerase II binding sites, but many are not near the promoters of annotated genes. At HOT promoters, TF occupancy is strongly predictive of transcription preinitiation complex recruitment and moderately predictive of initiating Pol II recruitment, but only weakly predictive of elongating Pol II and RNA transcript abundance. TF occupancy varies quantitatively within human HOT regions; we used this variation to discover novel associations between TFs. The sequence motif associated with any given TF's direct DNA binding is somewhat predictive of its empirical occupancy, but a great deal of occupancy occurs at sites without the TF's motif, implying indirect recruitment by another TF whose motif is present. Mammalian HOT regions are regulatory hubs that integrate the signals from diverse regulatory pathways to quantitatively tune the promoter for RNA polymerase II recruitment.
View details for DOI 10.1186/1471-2164-14-720
View details for Web of Science ID 000328633100002
View details for PubMedID 24138567
Cancer evolution involves cycles of genomic damage, epigenetic deregulation, and increased cellular proliferation that eventually culminate in the carcinoma phenotype. Early neoplasias, which are often found concurrently with carcinomas and are histologically distinguishable from normal breast tissue, are less advanced in phenotype than carcinomas and are thought to represent precursor stages. To elucidate their role in cancer evolution we performed comparative whole-genome sequencing of early neoplasias, matched normal tissue, and carcinomas from six patients, for a total of 31 samples. By using somatic mutations as lineage markers we built trees that relate the tissue samples within each patient. On the basis of these lineage trees we inferred the order, timing, and rates of genomic events. In four out of six cases, an early neoplasia and the carcinoma share a mutated common ancestor with recurring aneuploidies, and in all six cases evolution accelerated in the carcinoma lineage. Transition spectra of somatic mutations are stable and consistent across cases, suggesting that accumulation of somatic mutations is a result of increased ancestral cell division rather than specific mutational mechanisms. In contrast to highly advanced tumors that are the focus of much of the current cancer genome sequencing, neither the early neoplasia genomes nor the carcinomas are enriched with potentially functional somatic point mutations. Aneuploidies that occur in common ancestors of neoplastic and tumor cells are the earliest events that affect a large number of genes and may predispose breast tissue to eventual development of invasive carcinoma.
View details for DOI 10.1101/gr.151670.112
View details for Web of Science ID 000321119900007
View details for PubMedID 23568837
Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.
View details for DOI 10.1101/gr.148718.112
View details for PubMedID 23478400
View details for PubMedCentralID PMC3638132
The mechanisms by which the p53 tumor suppressor acts remain incompletely understood. To gain new insights into p53 biology, we used high-throughput sequencing to analyze global p53 transcriptional networks in primary mouse embryo fibroblasts in response to DNA damage. Chromatin immunoprecipitation sequencing reveals 4785 p53-bound sites in the genome located near 3193 genes involved in diverse biological processes. RNA sequencing analysis shows that only a subset of p53-bound genes is transcriptionally regulated, yielding a list of 432 p53-bound and regulated genes. Interestingly, we identify a host of autophagy genes as direct p53 target genes. While the autophagy program is regulated predominantly by p53, the p53 family members p63 and p73 contribute to activation of this autophagy gene network. Induction of autophagy genes in response to p53 activation is associated with enhanced autophagy in diverse settings and depends on p53 transcriptional activity. While p53-induced autophagy does not affect cell cycle arrest in response to DNA damage, it is important for both robust p53-dependent apoptosis triggered by DNA damage and transformation suppression by p53. Together, our data highlight an intimate connection between p53 and autophagy through a vast transcriptional network and indicate that autophagy contributes to p53-dependent apoptosis and cancer suppression.
View details for DOI 10.1101/gad.212282.112
View details for PubMedID 23651856
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
View details for DOI 10.1038/nature11247
View details for Web of Science ID 000308347000039
View details for PubMedID 22955616
View details for PubMedCentralID PMC3439153
Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
View details for DOI 10.1038/nature11245
View details for Web of Science ID 000308347000042
View details for PubMedID 22955619
Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.
View details for DOI 10.1101/gr.136184.111
View details for Web of Science ID 000308272800021
View details for PubMedID 22955991
View details for PubMedCentralID PMC3431496
Gene regulation at functional elements (e.g., enhancers, promoters, insulators) is governed by an interplay of nucleosome remodeling, histone modifications, and transcription factor binding. To enhance our understanding of gene regulation, the ENCODE Consortium has generated a wealth of ChIP-seq data on DNA-binding proteins and histone modifications. We additionally generated nucleosome positioning data on two cell lines, K562 and GM12878, by MNase digestion and high-depth sequencing. Here we relate 14 chromatin signals (12 histone marks, DNase, and nucleosome positioning) to the binding sites of 119 DNA-binding proteins across a large number of cell lines. We developed a new method for unsupervised pattern discovery, the Clustered AGgregation Tool (CAGT), which accounts for the inherent heterogeneity in signal magnitude, shape, and implicit strand orientation of chromatin marks. We applied CAGT on a total of 5084 data set pairs to obtain an exhaustive catalog of high-resolution patterns of histone modifications and nucleosome positioning signals around bound transcription factors. Our analyses reveal extensive heterogeneity in how histone modifications are deposited, and how nucleosomes are positioned around binding sites. With the exception of the CTCF/cohesin complex, asymmetry of nucleosome positioning is predominant. Asymmetry of histone modifications is also widespread, for all types of chromatin marks examined, including promoter, enhancer, elongation, and repressive marks. The fine-resolution signal shapes discovered by CAGT unveiled novel correlation patterns between chromatin marks, nucleosome positioning, and sequence content. Meta-analyses of the signal profiles revealed a common vocabulary of chromatin signals shared across multiple cell lines and binding proteins.
View details for DOI 10.1101/gr.136366.111
View details for Web of Science ID 000308272800015
View details for PubMedID 22955985
View details for PubMedCentralID PMC3431490
Centrosomes organize the bipolar mitotic spindle, and centrosomal defects cause chromosome instability. Protein phosphorylation modulates centrosome function, and we provide a comprehensive map of phosphorylation on intact yeast centrosomes (18 proteins). Mass spectrometry was used to identify 297 phosphorylation sites on centrosomes from different cell cycle stages. We observed different modes of phosphoregulation via specific protein kinases, phosphorylation site clustering, and conserved phosphorylated residues. Mutating all eight cyclin-dependent kinase (Cdk)-directed sites within the core component, Spc42, resulted in lethality and reduced centrosomal assembly. Alternatively, mutation of one conserved Cdk site within γ-tubulin (Tub4-S360D) caused mitotic delay and aberrant anaphase spindle elongation. Our work establishes the extent and complexity of this prominent posttranslational modification in centrosome biology and provides specific examples of phosphorylation control in centrosome function.
View details for DOI 10.1126/science.1205193
View details for Web of Science ID 000291990000045
View details for PubMedID 21700874
Nucleosomes are the basic packaging units of chromatin, modulating accessibility of regulatory proteins to DNA and thus influencing eukaryotic gene regulation. Elaborate chromatin remodelling mechanisms have evolved that govern nucleosome organization at promoters, regulatory elements, and other functional regions in the genome. Analyses of chromatin landscape have uncovered a variety of mechanisms, including DNA sequence preferences, that can influence nucleosome positions. To identify major determinants of nucleosome organization in the human genome, we used deep sequencing to map nucleosome positions in three primary human cell types and in vitro. A majority of the genome showed substantial flexibility of nucleosome positions, whereas a small fraction showed reproducibly positioned nucleosomes. Certain sites that position in vitro can anchor the formation of nucleosomal arrays that have cell type-specific spacing in vivo. Our results unveil an interplay of sequence-based nucleosome preferences and non-nucleosomal factors in determining nucleosome organization within mammalian cells.
View details for DOI 10.1038/nature10002
View details for Web of Science ID 000291939700050
View details for PubMedID 21602827
The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.
View details for DOI 10.1371/journal.pbio.1001046
View details for Web of Science ID 000289938900014
Computational efforts to identify functional elements within genomes leverage comparative sequence information by looking for regions that exhibit evidence of selective constraint. One way of detecting constrained elements is to follow a bottom-up approach by computing constraint scores for individual positions of a multiple alignment and then defining constrained elements as segments of contiguous, highly scoring nucleotide positions. Here we present GERP++, a new tool that uses maximum likelihood evolutionary rate estimation for position-specific scoring and, in contrast to previous bottom-up methods, a novel dynamic programming approach to subsequently define constrained elements. GERP++ evaluates a richer set of candidate element breakpoints and ranks them based on statistical significance, eliminating the need for biased heuristic extension techniques. Using GERP++ we identify over 1.3 million constrained elements spanning over 7% of the human genome. We predict a higher fraction than earlier estimates largely due to the annotation of longer constrained elements, which improves one to one correspondence between predicted elements with known functional sequences. GERP++ is an efficient and effective tool to provide both nucleotide- and element-level constraint scores within deep multiple sequence alignments.
View details for DOI 10.1371/journal.pcbi.1001025
View details for Web of Science ID 000285574600013
View details for PubMedID 21152010
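The bottom-up idea described in the abstract above (score each alignment position, then define elements as contiguous high-scoring segments) can be illustrated with a deliberately simple sketch. This is not the GERP++ algorithm, whose dynamic programming and significance ranking are not described here. It only finds the single contiguous segment with the highest summed score, using a standard maximum-sum scan. The function name `best_segment` and the example scores are hypothetical.

```python
def best_segment(scores):
    """Return (start, end, total) for the highest-scoring contiguous run.

    scores: per-position constraint scores (positive = constrained).
    `end` is exclusive, so the segment is scores[start:end].
    An empty input yields (0, 0, -inf).
    """
    best = (0, 0, float("-inf"))
    start, running = 0, 0.0
    for i, score in enumerate(scores):
        if running <= 0:
            # A non-positive prefix cannot help; restart the segment here.
            start, running = i, score
        else:
            running += score
        if running > best[2]:
            best = (start, i + 1, running)
    return best


if __name__ == "__main__":
    # Hypothetical per-position scores along a short alignment.
    print(best_segment([-1.0, 2.0, 3.0, -0.5, 4.0, -6.0, 1.0]))
```

Note that the segment tolerates an isolated low-scoring position (the -0.5 above) when the flanking positions score highly, which is the behavior one wants from an element-level call.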
Technological advances hold the promise of rapidly catalyzing the discovery of pathogenic variants for genetic disease. However, this possibility is tempered by limitations in interpreting the functional consequences of genetic variation at candidate loci. Here, we present a systematic approach, grounded on physiologically relevant assays, to evaluate the mutational content (125 alleles) of the 14 genes associated with Bardet-Biedl syndrome (BBS). A combination of in vivo assays with subsequent in vitro validation suggests that a significant fraction of BBS-associated mutations have a dominant-negative mode of action. Moreover, we find that a subset of common alleles, previously considered to be benign, are, in fact, detrimental to protein function and can interact with strong rare alleles to modulate disease presentation. These data represent a comprehensive evaluation of genetic load in a multilocus disease. Importantly, superimposition of these results to human genetics data suggests a previously underappreciated complexity in disease architecture that might be shared among diverse clinical phenotypes.
View details for DOI 10.1073/pnas.1000219107
View details for Web of Science ID 000278549300050
View details for PubMedID 20498079
Here, we demonstrate how comparative sequence analysis facilitates genome-wide base-pair-level interpretation of individual genetic variation and address two questions of importance for human personal genomics: first, whether an individual's functional variation comes mostly from noncoding or coding polymorphisms; and, second, whether population-specific or globally-present polymorphisms contribute more to functional variation in any given individual. Neither has been definitively answered by analyses of existing variation data because of a focus on coding polymorphisms, ascertainment biases in favor of common variation, and a lack of base-pair-level resolution for identifying functional variants. We resequenced 575 amplicons within 432 individuals at genomic sites enriched for evolutionary constraint and also analyzed variation within three published human genomes. We find that single-site measures of evolutionary constraint derived from mammalian multiple sequence alignments are strongly predictive of reductions in modern-day genetic diversity across a range of annotation categories and across the allele frequency spectrum from rare (<1%) to high frequency (>10% minor allele frequency). Furthermore, we show that putatively functional variation in an individual genome is dominated by polymorphisms that do not change protein sequence and that originate from our shared ancestral population and commonly segregate in human populations. These observations show that common, noncoding alleles contribute substantially to human phenotypes and that constraint-based analyses will be of value to identify phenotypically relevant variants in individual genomes.
View details for DOI 10.1101/gr.102210.109
View details for Web of Science ID 000275124600002
View details for PubMedID 20067941
ProPhylER (Protein Phylogeny and Evolutionary Rates) is a next-generation curated proteome resource that uses comparative sequence analysis to predict constraint and mutation impact for eukaryotic proteins. Its purpose is to inform any research program for which protein function and structure are relevant, by the predictive power of evolutionary constraint analyses. ProPhylER currently has nearly 9000 clusters of related proteins, including more than 200,000 sequences. It serves data via two interfaces. The "ProPhylER Interface" displays predictive analyses in sequence space; the "CrystalPainter" maps evolutionary constraints onto solved protein structures. Here we summarize ProPhylER's data content and analysis pipeline, demonstrate the use of ProPhylER's interfaces, and evaluate ProPhylER's unique regional analysis of evolutionary constraint. The high accuracy of ProPhylER's regional analysis complements the high resolution of its single-site analysis to effectively guide and inform structure-function investigations and predict the impact of polymorphisms.
View details for DOI 10.1101/gr.097121.109
View details for Web of Science ID 000273249500015
View details for PubMedID 19846609
Polycomb Repressive Complex 2 (PRC2) regulates key developmental genes in embryonic stem (ES) cells and during development. Here we show that Jarid2/Jumonji, a protein enriched in pluripotent cells and a founding member of the Jumonji C (JmjC) domain protein family, is a PRC2 subunit in ES cells. Genome-wide ChIP-seq analyses of Jarid2, Ezh2, and Suz12 binding reveal that Jarid2 and PRC2 occupy the same genomic regions. We further show that Jarid2 promotes PRC2 recruitment to the target genes while inhibiting PRC2 histone methyltransferase activity, suggesting that it acts as a "molecular rheostat" that finely calibrates PRC2 functions at developmental genes. Using Xenopus laevis as a model we demonstrate that Jarid2 knockdown impairs the induction of gastrulation genes in blastula embryos and results in failure of differentiation. Our findings illuminate a mechanism of histone methylation regulation in pluripotent cells and during early cell-fate transitions.
View details for DOI 10.1016/j.cell.2009.12.002
View details for Web of Science ID 000273048700017
View details for PubMedID 20064375
View details for PubMedCentralID PMC2911953
The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25-70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.
View details for DOI 10.1371/journal.pcbi.1000386
View details for Web of Science ID 000267081300009
View details for PubMedID 19461883
Molecular interactions between protein complexes and DNA mediate essential gene-regulatory functions. Uncovering such interactions by chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-Seq) has recently become the focus of intense interest. We here introduce quantitative enrichment of sequence tags (QuEST), a powerful statistical framework based on the kernel density estimation approach, which uses ChIP-Seq data to determine positions where protein complexes contact DNA. Using QuEST, we discovered several thousand binding sites for the human transcription factors SRF, GABP and NRSF at an average resolution of about 20 base pairs. MEME motif-discovery tool-based analyses of the QuEST-identified sequences revealed DNA binding by cofactors of SRF, providing evidence that cofactor binding specificity can be obtained from ChIP-Seq data. By combining QuEST analyses with Gene Ontology (GO) annotations and expression data, we illustrate how general functions of transcription factors can be inferred.
View details for DOI 10.1038/NMETH.1246
View details for Web of Science ID 000258912700017
View details for PubMedID 19160518
View details for PubMedCentralID PMC2917543
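The kernel density estimation idea behind the QuEST abstract above can be illustrated with a minimal sketch. It is not the QuEST implementation: the Gaussian kernel, the 30 bp bandwidth, the tag positions, and the function names `gaussian_kde_profile` and `call_peak` are all assumptions made for illustration, and strand handling and the statistical treatment of enrichment are omitted.

```python
import math

def gaussian_kde_profile(tag_positions, start, end, bandwidth=30.0):
    """Return one Gaussian kernel density value per base in [start, end)."""
    norm = 1.0 / (len(tag_positions) * bandwidth * math.sqrt(2.0 * math.pi))
    profile = []
    for x in range(start, end):
        # Each sequence tag contributes a bell-shaped bump centred on its position.
        total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in tag_positions)
        profile.append(norm * total)
    return profile

def call_peak(profile, start):
    """Return the coordinate at which the density profile is highest."""
    best_index = max(range(len(profile)), key=profile.__getitem__)
    return start + best_index

# Hypothetical tag positions: a cluster near coordinate 100 plus two background tags.
tags = [92, 95, 98, 100, 101, 103, 106, 109, 40, 170]
profile = gaussian_kde_profile(tags, start=0, end=200)
peak = call_peak(profile, start=0)
print("estimated binding position:", peak)
```

In this toy case the smoothed profile peaks inside the tag cluster, which is how a density estimate can localize a binding site more finely than the individual tags do.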
The urochordate Ciona savignyi is an emerging model organism for the study of chordate evolution, development, and gene regulation. The extreme level of polymorphism in its population has inspired novel approaches in genome assembly, which we here continue to develop. Specifically, we present the reconstruction of all of C. savignyi's chromosomes via the development of a comprehensive genetic map, without a physical map intermediate. The resulting genetic map is complete, having one linkage group for each one of the 14 chromosomes. Eighty-three percent of the reference genome sequence is covered. The chromosomal reconstruction allowed us to investigate the evolution of genome structure in highly polymorphic species, by comparing the genome of C. savignyi to its divergent sister species, Ciona intestinalis. Both genomes have been extensively reshaped by intrachromosomal rearrangements. Interchromosomal changes have been extremely rare. This is in striking contrast to what has been observed in vertebrates, where interchromosomal events are commonplace. These results, when considered in light of the neutral theory, suggest fundamentally different modes of evolution of animal species with large versus small population sizes.
View details for DOI 10.1101/gr.078576.108
View details for Web of Science ID 000258116100018
View details for PubMedID 18519652
Using the massively parallel technique of sequencing by oligonucleotide ligation and detection (SOLiD; Applied Biosystems), we have assessed the in vivo positions of more than 44 million putative nucleosome cores in the multicellular genetic model organism Caenorhabditis elegans. These analyses provide a global view of the chromatin architecture of a multicellular animal at extremely high density and resolution. While we observe some degree of reproducible positioning throughout the genome in our mixed stage population of animals, we note that the major chromatin feature in the worm is a diversity of allowed nucleosome positions at the vast majority of individual loci. While absolute positioning of nucleosomes can vary substantially, relative positioning of nucleosomes (in a repeated array structure likely to be maintained at least in part by steric constraints) appears to be a significant property of chromatin structure. The high density of nucleosomal reads enabled a substantial extension of previous analysis describing the usage of individual oligonucleotide sequences along the span of the nucleosome core and linker. We release this data set, via the UCSC Genome Browser, as a resource for the high-resolution analysis of chromatin conformation and DNA accessibility at individual loci within the C. elegans genome.
View details for DOI 10.1101/gr.076463.108
View details for Web of Science ID 000257249100005
View details for PubMedID 18477713
Otopetrin 1 (Otop1) encodes a multi-transmembrane domain protein with no homology to known transporters, channels, exchangers, or receptors. Otop1 is necessary for the formation of otoconia and otoliths, calcium carbonate biominerals within the inner ear of mammals and teleost fish that are required for the detection of linear acceleration and gravity. Vertebrate Otop1 and its paralogues Otop2 and Otop3 define a new gene family with homology to the invertebrate Domain of Unknown Function 270 genes (DUF270; pfam03189). Multi-species comparison of the predicted primary sequences and predicted secondary structures of 62 vertebrate otopetrin, and arthropod and nematode DUF270 proteins, has established that the genes encoding these proteins constitute a single family that we renamed the Otopetrin Domain Protein (ODP) gene family. Signature features of ODP proteins are three "Otopetrin Domains" that are highly conserved between vertebrates, arthropods and nematodes, and a highly constrained predicted loop structure. Our studies suggest a refined topologic model for ODP insertion into the lipid bilayer of 12 transmembrane domains, and highlight conserved amino-acid residues that will aid in the biochemical examination of ODP family function. The high degree of sequence and structural similarity of the ODP proteins may suggest a conserved role in the intracellular trafficking of calcium and the formation of biominerals.
View details for DOI 10.1186/1471-2148-8-41
View details for Web of Science ID 000254053700001
View details for PubMedID 18254951
A recent comparative analysis of the sequenced genomes of 12 Drosophila species (Drosophila 12 Genomes Consortium, 2007; Stark et al., 2007) reveals a comprehensive picture of the evolution of small animal genomes and greatly improves computational predictions of functional elements in the D. melanogaster reference sequence.
View details for DOI 10.1016/j.cell.2007.12.003
View details for Web of Science ID 000252217200009
View details for PubMedID 18160030
Transcriptional coexpression of interacting gene products is required for complex molecular processes; however, the function and evolution of cis-regulatory elements that orchestrate coexpression remain largely unexplored. We mutagenized 19 regulatory elements that drive coexpression of Ciona muscle genes and obtained quantitative estimates of the cis-regulatory activity of the 77 motifs that comprise these elements. We found that individual motif activity ranges broadly within and among elements, and among different instantiations of the same motif type. The activity of orthologous motifs is strongly constrained, although motif arrangement, type, and activity vary greatly among the elements of different co-regulated genes. Thus, the syntactical rules governing this regulatory function are flexible but become highly constrained evolutionarily once they are established in a particular element.
View details for DOI 10.1126/science.1145893
View details for Web of Science ID 000249467900044
View details for PubMedID 17872446
Agouti-related protein encodes a neuropeptide that stimulates food intake. Agrp expression in the brain is restricted to neurons in the arcuate nucleus of the hypothalamus and is elevated by states of negative energy balance. The molecular mechanisms underlying Agrp regulation, however, remain poorly defined. Using a combination of transgenic and comparative sequence analysis, we have previously identified a 760 bp conserved region upstream of Agrp which contains STAT binding elements that participate in Agrp transcriptional regulation. In this study, we attempt to improve the specificity for detecting conserved elements in this region by comparing genomic sequences from 10 mammalian species. Our analysis reveals a symmetrical organization of conserved sequences upstream of Agrp, which cluster into two inverted repeat elements. Conserved sequences within these elements suggest a role for homeodomain proteins in the regulation of Agrp and provide additional targets for functional evaluation.
View details for DOI 10.1371/journal.pone.0000702
View details for Web of Science ID 000207452400006
View details for PubMedID 17684549
As a consequence of the evolutionary process, data collected from related species tend to be similar. This similarity by descent can obscure subtler signals in the data such as the evidence of constraint on variation due to shared selective pressures. In comparative sequence analysis, for example, sequence similarity is often used to illuminate important regions of the genome, but if the comparison is between closely related species, then similarity is the rule rather than the interesting exception. Furthermore, and perhaps worse yet, the contribution of a divergent third species may be masked by the strong similarity between the other two. Here we propose a remedy that weighs the contribution of each species according to its phylogenetic placement. We first solve the problem of summarizing data related by phylogeny, and we explain why an average should operate on the entire evolutionary trajectory that relates the data. This perspective leads to a new approach in which we define the average in terms of the phylogeny, using the data and a stochastic model to obtain a probability on evolutionary trajectories. With the assumption that the data evolve according to a Brownian motion process on the tree, we show that our evolutionary average can be computed as a convex combination of the species data. Thus, our approach, called the BranchManager, defines both an average and a novel taxon weighting scheme. We compare the BranchManager to two other methods, demonstrating why it exhibits desirable properties. In doing so, we devise a framework for comparison and introduce the concept of a representative point at which the average is situated. The BranchManager uses as its representative point the phylogenetic center of mass, a choice which has both intuitive and practical appeal.
Because our average is intrinsic to both the dataset and to the phylogeny, we expect it and its corresponding weighting scheme to be useful in all sorts of studies where interspecies data need to be combined. Obvious applications include evolutionary studies of morphology, physiology or behaviour, but quantitative measures such as sequence hydrophobicity and gene expression level are amenable to our approach as well. Other areas of potential impact include motif discovery and vaccine design. A Java implementation of the BranchManager is available for download, as is a script written in the statistical language R.
View details for DOI 10.1186/1471-2105-8-222
View details for Web of Science ID 000248131500001
View details for PubMedID 17594490
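To make the idea of phylogeny-aware weights in the abstract above concrete, here is a minimal sketch of a related, standard construction. It is not the BranchManager itself, which situates its average at the phylogenetic center of mass. The sketch instead computes the generalized least-squares estimate at the root under Brownian motion, where the covariance between two species equals the branch length they share from the root. The three-species tree, the trait values, and all function names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def brownian_weights(shared_path_lengths):
    """Weights proportional to C^{-1} 1, normalized to sum to one."""
    ones = [1.0] * len(shared_path_lengths)
    raw = solve_linear_system(shared_path_lengths, ones)
    total = sum(raw)
    return [value / total for value in raw]

def weighted_average(values, weights):
    return sum(v * w for v, w in zip(values, weights))

# Hypothetical tree: species A and B share 0.9 units of branch length and then
# diverge for 0.1 each; species C is on its own branch of length 1.0.
covariance = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
weights = brownian_weights(covariance)
trait = [2.0, 2.2, 5.0]  # hypothetical measurements for A, B, C
average = weighted_average(trait, weights)
print(weights, average)
```

In this example the two closely related species split roughly half of the total weight between them, so the divergent third species is not masked by their similarity, and the weights form a convex combination of the species data.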
We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
View details for DOI 10.1038/nature05874
View details for Web of Science ID 000247207500034
View details for PubMedID 17571346
View details for PubMedCentralID PMC2212820
A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
View details for DOI 10.1101/gr.6034307
View details for Web of Science ID 000247226900009
View details for PubMedID 17567995
Whole-genome sequence data from samples of natural populations provide fertile grounds for analyses of intraspecific variation and tests of population genetic theory. We show that the urochordate Ciona savignyi, one of the species of ocean-dwelling broadcast spawners commonly known as sea squirts, exhibits the highest rates of single-nucleotide and structural polymorphism ever comprehensively quantified in a multicellular organism. We demonstrate that the cause for the extreme heterozygosity is a large effective population size, and, consistent with prediction by the neutral theory, we find evidence of strong purifying selection. These results constitute in-depth insight into the dynamics of highly polymorphic genomes and provide important empirical support of population genetic theory as it pertains to population size, heterozygosity, and natural selection.
View details for DOI 10.1073/pnas.0700890104
View details for Web of Science ID 000245331700079
View details for PubMedID 17372217
The sequence of Ciona savignyi was determined using a whole-genome shotgun strategy, but a high degree of polymorphism resulted in a fractured assembly wherein allelic sequences from the same genomic region assembled separately. We designed a multistep strategy to generate a nonredundant reference sequence from the original assembly by reconstructing and aligning the two 'haplomes' (haploid genomes). In the resultant 174 megabase reference sequence, each locus is represented once, misassemblies are corrected, and contiguity and continuity are dramatically improved.
View details for DOI 10.1186/gb-2007-8-3-r41
View details for Web of Science ID 000246081600014
View details for PubMedID 17374142
Agouti (ASIP) and Agouti-related protein (AgRP) are endogenous antagonists of melanocortin receptors that play critical roles in the regulation of pigmentation and energy balance, respectively, and which arose from a common ancestral gene early in vertebrate evolution. The N-terminal domain of ASIP facilitates antagonism by binding to an accessory receptor, but here we show that the N-terminal domain of AgRP has the opposite effect and acts as a prodomain that negatively regulates antagonist function. Computational analysis reveals similar patterns of evolutionary constraint in the ASIP and AgRP C-terminal domains, but fundamental differences between the N-terminal domains. These studies shed light on the relationships between regulation of pigmentation and body weight, and they illustrate how evolutionary structure function analysis can reveal both unique and common mechanisms of action for paralogous gene products.
View details for DOI 10.1016/j.chembiol.2006.10.006
View details for Web of Science ID 000243323600008
View details for PubMedID 17185225
We engage the experimental and computational challenges of de novo regulatory module discovery in a complex and largely unstudied metazoan genome. Our analysis is based on the comprehensive characterization of regulatory elements of 20 muscle genes in the chordate, Ciona savignyi. Three independent types of data we generate contribute to the characterization of a muscle-specific regulatory module: (1) Positive elements (PEs), short sequences sufficient for strong muscle expression that are identified in a high-resolution in vivo analysis; (2) CisModules (CMs), candidate regulatory modules defined by clusters of overrepresented motifs predicted de novo; and (3) Conserved elements (CEs), short noncoding sequences of strong conservation between C. savignyi and C. intestinalis. We estimate the accuracy of the computational predictions by an analysis of the intersection of these data. As final biological validation of the discovered muscle regulatory module, we implement a novel algorithm to search the genome for instances of the module and identify seven novel enhancers.
View details for DOI 10.1101/gr.4062605
View details for Web of Science ID 000232436800001
View details for PubMedID 16169925
Comparisons of orthologous genomic DNA sequences can be used to characterize regions that have been subject to purifying selection and are enriched for functional elements. We here present the results of such an analysis on an alignment of sequences from 29 mammalian species. The alignment captures approximately 3.9 neutral substitutions per site and spans approximately 1.9 Mbp of the human genome. We identify constrained elements from 3 bp to over 1 kbp in length, covering approximately 5.5% of the human locus. Our estimate for the total amount of nonexonic constraint experienced by this locus is roughly twice that for exonic constraint. Constrained elements tend to cluster, and we identify large constrained regions that correspond well with known functional elements. While constraint density inversely correlates with mobile element density, we also show the presence of unambiguously constrained elements overlapping mammalian ancestral repeats. In addition, we describe a number of elements in this region that have undergone intense purifying selection throughout mammalian evolution, and we show that these important elements are more numerous than previously thought. These results were obtained with Genomic Evolutionary Rate Profiling (GERP), a statistically rigorous and biologically transparent framework for constrained element identification. GERP identifies regions at high resolution that exhibit nucleotide substitution deficits, and measures these deficits as "rejected substitutions". Rejected substitutions reflect the intensity of past purifying selection and are used to rank and characterize constrained elements. We anticipate that GERP and the types of analyses it facilitates will provide further insights and improved annotation for the human genome as mammalian genome sequence data become richer.
View details for Web of Science ID 000230424000001
View details for PubMedID 15965027
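The notion of "rejected substitutions" in the abstract above, a deficit of observed substitutions relative to the neutral expectation, can be sketched as follows. This is a simplified illustration and not the GERP implementation: the per-column observed counts are made up, the neutral expectation of 3.9 substitutions per site is borrowed from the abstract and applied uniformly, and the rule used to group columns (runs above a fixed threshold) as well as the function names are assumptions.

```python
def rejected_substitutions(expected, observed):
    """Per-column deficit: expected neutral substitutions minus observed ones."""
    return [e - o for e, o in zip(expected, observed)]

def find_candidate_elements(rs_scores, threshold=0.0):
    """Return (start, end, summed_rs) for maximal runs of columns whose score
    exceeds the threshold (end is exclusive), ranked by summed score."""
    elements = []
    run_start = None
    for index, score in enumerate(rs_scores):
        if score > threshold:
            if run_start is None:
                run_start = index
        elif run_start is not None:
            elements.append((run_start, index, sum(rs_scores[run_start:index])))
            run_start = None
    if run_start is not None:
        elements.append((run_start, len(rs_scores), sum(rs_scores[run_start:])))
    return sorted(elements, key=lambda element: element[2], reverse=True)

# Hypothetical alignment of eight columns with a uniform neutral expectation.
expected = [3.9] * 8
observed = [4.1, 0.5, 0.2, 0.9, 3.8, 4.4, 1.0, 0.3]
scores = rejected_substitutions(expected, observed)
elements = find_candidate_elements(scores, threshold=1.0)
for start, end, total in elements:
    print(f"columns {start}-{end - 1}: {total:.1f} rejected substitutions")
```

Ranking runs by their summed deficit mirrors the abstract's statement that rejected substitutions are used to rank constrained elements, although the actual element-calling procedure may differ.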
We find that the degree of impairment of protein function by missense variants is predictable by comparative sequence analysis alone. The applicable range of impairment is not confined to binary predictions that distinguish normal from deleterious variants, but extends continuously from mild to severe effects. The accuracy of predictions is strongly dependent on sequence variation and is highest when diverse orthologs are available. High predictive accuracy is achieved by quantification of the physicochemical characteristics in each position of the protein, based on observed evolutionary variation. The strong relationship between physicochemical characteristics of a missense variant and impairment of protein function extends to human disease. By using four diverse proteins for which sufficient comparative sequence data are available, we show that grades of disease, or likelihood of developing cancer, correlate strongly with physicochemical constraint violation by causative amino acid variants.
View details for Web of Science ID 000230424000009
View details for PubMedID 15965030
The ability to discriminate between deleterious and neutral amino acid substitutions in the genes of patients remains a significant challenge in human genetics. The increasing availability of genomic sequence data from multiple vertebrate species allows sequence conservation and physicochemical properties of residues to be used for functional prediction. In this study, the RET receptor tyrosine kinase serves as a model disease gene in which a broad spectrum (≥116) of disease-associated mutations has been identified among patients with Hirschsprung disease and multiple endocrine neoplasia type 2. We report the alignment of the human RET protein sequence with the orthologous sequences of 12 non-human vertebrates (eight mammalian, one avian, and three teleost species), their comparative analysis, the evolutionary topology of the RET protein, and predicted tolerance for all published missense mutations. We show that, although evolutionary conservation alone provides significant information to predict the effect of a RET mutation, a model that combines comparative sequence data with analysis of physicochemical properties in a quantitative framework provides far greater accuracy. Although the ability to discern the impact of a mutation is imperfect, our analyses permit substantial discrimination between predicted functional classes of RET mutations and disease severity even for a multigenic disease such as Hirschsprung disease.
View details for Web of Science ID 000230049500031
View details for PubMedID 15956201
As whole-genome sequencing efforts extend beyond more traditional model organisms to include a deep diversity of species, comparative genomic analyses will be further empowered to reveal insights into the human genome and its evolution. The discovery and annotation of functional genomic elements is a necessary step toward a detailed understanding of our biology, and sequence comparisons have proven to be an integral tool for that task. This review is structured to broadly reflect the statistical challenges in discriminating these functional elements from the bulk of the genome that has evolved neutrally. Specifically, we review the comparative genomics literature in terms of specificity, sensitivity, and phylogenetic scope, as well as the trade-offs that relate these factors in standard analyses. We consider the impact of an expanding diversity of orthologous sequences on our ability to resolve functional elements. This impact is assessed through both recent comparative analyses of deep alignments and mathematical modeling.
View details for DOI 10.1146/annurev.genom.6.080604.162146
View details for Web of Science ID 000232441500008
View details for PubMedID 16124857
Alignment and comparison of related genome sequences is a powerful method to identify regions likely to contain functional elements. Such analyses are data intensive, requiring the inclusion of genomic multiple sequence alignments, sequence annotations, and scores describing regional attributes of columns in the alignment. Visualization and browsing of results can be difficult, and there are currently limited software options for performing this task. The Application for Browsing Constraints (ABC) is interactive Java software for intuitive and efficient exploration of multiple sequence alignments and data typically associated with alignments. It is used to move quickly from a summary view of the entire alignment via arbitrary levels of resolution to individual alignment columns. It allows for the simultaneous display of quantitative data (e.g., sequence similarity or evolutionary rates) and annotation data (e.g., the locations of genes, repeats, and constrained elements). It can be used to facilitate basic comparative sequence tasks, such as export of data in plain-text formats, visualization of phylogenetic trees, and generation of alignment summary graphics. The ABC is a lightweight, stand-alone, and flexible graphical user interface for browsing genomic multiple sequence alignments of specific loci, up to hundreds of kilobases or a few megabases in length. It is coded in Java for cross-platform use and the program and source code are freely available under the General Public License. Documentation and a sample data set are also available at http://mendel.stanford.edu/sidowlab/downloads.html.
View details for DOI 10.1186/1471-2105-5-192
View details for Web of Science ID 000226622100001
View details for PubMedID 15588288
We show that sequence comparisons at different levels of resolution can efficiently guide functional analyses of regulatory regions in the ascidians Ciona savignyi and Ciona intestinalis. Sequence alignments of several tissue-specific genes guided discovery of minimal regulatory regions that are active in whole-embryo reporter assays. Using the Troponin I (TnI) locus as a case study, we show that more refined local sequence analyses can then be used to reveal functional substructure within a regulatory region. A high-resolution saturation mutagenesis in conjunction with comparative sequence analyses defined essential sequence elements within the TnI regulatory region. Finally, we found a significant, quantitative relationship between function and sequence divergence of noncoding functional elements. This work demonstrates the power of comparative sequence analysis between the two Ciona species for guiding gene regulatory experiments.
View details for DOI 10.1101/gr.2964504
View details for Web of Science ID 000225550400009
View details for PubMedID 15545496
We present an analysis of rates and patterns of microevolutionary phenomena that have shaped the human, mouse, and rat genomes since their last common ancestor. We find evidence for a shift in the mutational spectrum between the mouse and rat lineages, with the net effect being a relative increase in GC content in the rat genome. Our estimate for the neutral point substitution rate separating the two rodents is 0.196 substitutions per site, and 0.65 substitutions per site for the tree relating all three mammals. Small insertions and deletions of 1-10 bp in length ("microindels") occur at approximately 5% of the point substitution rate. Inferred regional correlations in evolutionary rates between lineages and between types of sites support the idea that rates of evolution are influenced by local genomic or cell biological context. No substantial correlations between rates of point substitutions and rates of microindels are found, however, implying that the influences that affect these processes are distinct. Finally, we have identified those regions in the human genome that are evolving slowly, which are likely to include functional elements important to human biology. At least 5% of the human genome is under substantial constraint, most of which is noncoding.
View details for DOI 10.1101/gr.2034704
View details for Web of Science ID 000220629900005
View details for PubMedID 15059994
The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.
View details for DOI 10.1038/nature02426
View details for Web of Science ID 000220540100032
View details for PubMedID 15057822
We have built a whole-genome multiple alignment of the three currently available mammalian genomes using a fully automated pipeline that combines the local/global approach of the Berkeley Genome Pipeline and the LAGAN program. The strategy is based on progressive alignment and consists of two main steps: (1) alignment of the mouse and rat genomes, and (2) alignment of human to either the mouse-rat alignments from step 1, or the remaining unaligned mouse and rat sequences. The resulting alignments demonstrate high sensitivity, with 87% of all human gene-coding areas aligned in both mouse and rat. The specificity is also high: <7% of the rat contigs are aligned to multiple places in human, and 97% of all alignments with human sequence >100 kb agree with a three-way synteny map built independently, using predicted exons in the three genomes. At the nucleotide level <1% of the rat nucleotides are mapped to multiple places in the human sequence in the alignment, and 96.5% of human nucleotides within all alignments agree with the synteny map. The alignments are publicly available online, with visualization through the novel Multi-VISTA browser that we also present.
View details for DOI 10.1101/gr.2067704
View details for Web of Science ID 000220629900022
View details for PubMedID 15060011
View details for Web of Science ID 000224116500028
Comparative sequence analysis is contributing to the identification and characterization of genomic regulatory regions with functional roles. It is effective because functionally important regions tend to evolve at a slower rate than do less important regions. The choice of species for comparative analysis is crucial: shared ancestry of a clade of species facilitates the discovery of genomic features important to that clade, whereas increased sequence divergence improves the resolution at which features can be discovered. Recent studies suggest that comparative analyses are useful for all branches of life and that, in the near future, large-scale mammalian comparative sequence analysis will provide the best approach for the comprehensive discovery of human regulatory elements.
View details for DOI 10.1016/j.gde.2003.10.001
View details for Web of Science ID 000187248400009
View details for PubMedID 14638322
Comparative sequence analyses on a collection of carefully chosen mammalian genomes could facilitate identification of functional elements within the human genome and allow quantification of evolutionary constraint at the single nucleotide level. High-resolution quantification would be informative for determining the distribution of important positions within functional elements and for evaluating the relative importance of nucleotide sites that carry single nucleotide polymorphisms (SNPs). Because the level of resolution in comparative sequence analyses is a direct function of sequence diversity, we propose that the information content of a candidate mammalian genome be defined as the sequence divergence it would add relative to already-sequenced genomes. We show that reliable estimates of genomic sequence divergence can be obtained from small genomic regions. On the basis of a multiple sequence alignment of approximately 1.4 megabases each from eight mammals, we generate such estimates for five unsequenced mammals. Estimates of the neutral divergence in these data suggest that a small number of diverse mammalian genomes in addition to human, mouse, and rat would allow single nucleotide resolution in comparative sequence analyses.
View details for DOI 10.1101/gr.1064503
View details for Web of Science ID 000182645500007
View details for PubMedID 12727901
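The definition proposed in the abstract above, information content as the sequence divergence a candidate genome adds relative to already-sequenced genomes, can be illustrated with a small calculation on a tree. The following is a minimal sketch, not the authors' method: the tree shape, the species chosen, the branch lengths, and the function names `spanning_length` and `added_divergence` are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# child -> (parent, branch length in substitutions per site).
# Topology and lengths are invented for illustration only.
Tree = Dict[str, Tuple[str, float]]

TOY_TREE: Tree = {
    "human": ("primates", 0.02),
    "baboon": ("primates", 0.03),
    "primates": ("root", 0.10),
    "mouse": ("rodents", 0.08),
    "rat": ("rodents", 0.09),
    "rodents": ("root", 0.25),
    "dog": ("root", 0.20),
}


def spanning_length(tree: Tree, leaves: Iterable[str]) -> float:
    """Total branch length of the smallest subtree connecting `leaves`."""
    leaf_set = set(leaves)
    below: Dict[str, int] = {}  # edge (named by its child) -> leaves beneath it
    for leaf in leaf_set:
        node = leaf
        while node in tree:  # walk up until the root, which has no parent
            below[node] = below.get(node, 0) + 1
            node = tree[node][0]
    # An edge connects the chosen leaves only if some, but not all,
    # of them sit beneath it.
    return sum(tree[n][1] for n, k in below.items() if k < len(leaf_set))


def added_divergence(tree: Tree, sequenced: Iterable[str], candidate: str) -> float:
    """Branch length gained by adding `candidate` to the sequenced set."""
    sequenced = set(sequenced)
    return spanning_length(tree, sequenced | {candidate}) - spanning_length(tree, sequenced)


if __name__ == "__main__":
    have = ["human", "mouse", "rat"]
    for species in ("dog", "baboon"):
        print(species, round(added_divergence(TOY_TREE, have, species), 3))
```

Under these made-up branch lengths, a distant candidate contributes more new divergence than a close relative of an already-sequenced species, which is the intuition behind ranking candidate genomes by added divergence.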
To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. We present LAGAN, a system for rapid global alignment of two homologous genomic sequences, and Multi-LAGAN, a system for multiple global alignment of genomic sequences. We tested our systems on a data set consisting of greater than 12 Mb of high-quality sequence from 12 vertebrate species. All the sequence was derived from the genomic region orthologous to an approximately 1.5-Mb region on human chromosome 7q31.3. We found that both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu. Multi-LAGAN produced the most accurate alignments, while requiring just 75 minutes on a personal computer to obtain the multiple alignment of all 12 sequences. Multi-LAGAN is a practical method for generating multiple alignments of long genomic sequences at any evolutionary distance. Our systems are publicly available at http://lagan.stanford.edu.
View details for DOI 10.1101/gr.926603
View details for Web of Science ID 000182046300018
View details for PubMedID 12654723
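The abstract above does not describe how the aligner works internally. One general technique for making global alignment of long sequences tractable is to find local matches (anchors) first, select the best colinear chain of them, and align in detail only between consecutive anchors. The sketch below shows only the chaining step, under stated assumptions: the `Anchor` type, the `best_chain` function, and the scoring are hypothetical and are not taken from the LAGAN code.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    x: int        # start position in sequence A
    y: int        # start position in sequence B
    length: int   # length of the local match (must be positive)
    score: float  # score of the local match


def best_chain(anchors: List[Anchor]) -> List[Anchor]:
    """Return the highest-scoring chain of non-overlapping, colinear anchors.

    Simple O(n^2) dynamic programme; chain score is the sum of anchor scores.
    """
    if not anchors:
        return []
    ordered = sorted(anchors, key=lambda a: (a.x, a.y))
    best = [a.score for a in ordered]  # best chain score ending at each anchor
    prev = [-1] * len(ordered)         # back-pointers for traceback
    for j, b in enumerate(ordered):
        for i in range(j):
            a = ordered[i]
            # a may precede b only if it ends before b starts in both sequences
            if a.x + a.length <= b.x and a.y + a.length <= b.y:
                if best[i] + b.score > best[j]:
                    best[j] = best[i] + b.score
                    prev[j] = i
    j = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(ordered[j])
        j = prev[j]
    return chain[::-1]


if __name__ == "__main__":
    hits = [
        Anchor(0, 0, 10, 10.0),
        Anchor(15, 50, 10, 8.0),  # off-diagonal hit, inconsistent with the rest
        Anchor(20, 20, 10, 9.0),
        Anchor(40, 40, 5, 5.0),
    ]
    for anchor in best_chain(hits):
        print(anchor)
```

In an anchored aligner of this kind, the regions between consecutive chained anchors would then be aligned with a restricted dynamic programme, which keeps the cost far below that of a full quadratic alignment.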
The neurodegenerative disease Niemann-Pick Type C2 (NPC2) results from mutations in the NPC2 (HE1) gene that cause abnormally high cholesterol accumulation in cells. We find that purified NPC2, a secreted soluble protein, binds cholesterol specifically with a much higher affinity (K(d) = 30-50 nM) than previously reported. Genetic and biochemical studies identified single amino acid changes that prevent both cholesterol binding and the restoration of normal cholesterol levels in mutant cells. The amino acids that affect cholesterol binding surround a hydrophobic pocket in the NPC2 protein structure, identifying a candidate sterol-binding location. On the basis of evolutionary analysis and mutagenesis, three other regions of the NPC2 protein emerged as important, including one required for efficient secretion.
View details for DOI 10.1073/pnas.0530027100
View details for Web of Science ID 000181365000065
View details for PubMedID 12591949
Early vertebrate evolution is characterized by a significant increase of organismal complexity over a relatively short time span. We present quantitative evidence for a high rate of increase in morphological complexity during early vertebrate evolution. Possible molecular evolutionary mechanisms that underlie this increase in complexity fall into a small number of categories, one of which is gene duplication and subsequent structural or regulatory neofunctionalization. We discuss analyses of two gene families whose regulatory and structural evolution shed light on the connection between gene duplication and increases in organismal complexity.
View details for PubMedID 12836684
Comparative sequence analyses of eukaryotic genes and genomic regions are beginning to provide a wealth of information that is directly relevant to human biology. Functional changes that set us apart from apes are identifiable, as are functional constraints in proteins and genomic elements that arose in our relatively distant phylogenetic past.
View details for Web of Science ID 000178461900004
View details for PubMedID 12372296
Likelihood estimates of local rates of evolution within proteins reveal that selective constraints on structure and function are quantitatively stable over billions of years of divergence. The stability of constraints produces an intramolecular clock that gives each protein a characteristic pattern of evolutionary rates along its sequence. This pattern allows the identification of constrained regions and, because the rate of evolution is a quantitative measure of the strength of the constraint, of their functional importance. We show that results from such analyses, which require only sequence alignments, are consistent with experimental and mutational data. The methodology has significant predictive power and may be used to guide structure-function studies for any protein represented by a modest number of homologs in sequence databases.
View details for DOI 10.1073/pnas.042692299
View details for Web of Science ID 000174284600059
View details for PubMedID 11880638
Vertebrate genomes contain multiple copies of related genes that arose through gene duplication. In the past it has been proposed that these duplicated genes were retained because of acquisition of novel beneficial functions. A more recent model, the duplication-degeneration-complementation hypothesis (DDC), posits that the functions of a single gene may become separately allocated among the duplicated genes, rendering both duplicates essential. Thus far, empirical evidence for this model has been limited to the engrailed and sox family of developmental regulators, and it has been unclear whether it may also apply to ubiquitously expressed genes with essential functions for cell survival. Here we describe the cloning of three zebrafish alpha subunits of the Na(+),K(+)-ATPase and a comprehensive evolutionary analysis of this gene family. The predicted amino acid sequences are extremely well conserved among vertebrates. The evolutionary relationships and the map positions of these genes and of other alpha-like sequences indicate that both tandem and ploidy duplications contributed to the expansion of this gene family in the teleost lineage. The duplications are accompanied by acquisition of clear functional specialization, consistent with the DDC model of genome evolution.
View details for Web of Science ID 000171456000004
View details for PubMedID 11591639
The recessive aphakia (ak) mouse mutant is characterized by bilateral microphthalmia due to a failure of lens morphogenesis. We fine-mapped the ak locus to the interval between D19Umi1 and D19Mit9, developed new polymorphic markers, and mapped candidate genes by construction of a BAC contig. The Pitx3 gene, known to be expressed in lens primordia, shows zero recombination with the ak mutation on our intersubspecific intercross panel representing 1170 meioses. A recent report described a deletion in the intergenic region between Gbf1 and Pitx3 as the possible ak mutation. Our results differ in that we find not only the distant intergenic deletion, but also a much larger deletion directly in the Pitx3 gene, eliminating exon 1 and extending into intron 1 and the promoter region. Pitx3 transcript levels are severely reduced in ak/ak mice from E11.5 to newborn (5 ± 1% of the wild-type levels at E13.5), while an involvement of the flanking Gbf1 and Cig30 genes in the aberrant lens development is highly unlikely based on expression analysis. We conclude that the ak mutation consists of two deletions, the larger of which removes part of Pitx3, indicating a crucial role of this gene in early lens development.
View details for Web of Science ID 000167553700007
View details for PubMedID 11247667
Early outgrowth of the vertebrate embryonic limb requires signalling by the apical ectodermal ridge (AER) to the progress zone (PZ), which in response proliferates and lays down the pattern of the presumptive limb in a proximal to distal progression. Signals from the PZ maintain the AER until the anlagen for the distal phalanges have been formed. The semidominant mouse mutant dactylaplasia (Dac) disrupts the maintenance of the AER, leading to truncation of distal structures of the developing footplate, or autopod. Adult Dac homozygotes thus lack hands and feet except for malformed single digits, whereas heterozygotes lack phalanges of the three middle digits. Dac resembles the human autosomal dominant split hand/foot malformation (SHFM) diseases. One of these, SHFM3, maps to chromosome 10q24 (Refs 6,7), which is syntenic to the Dac region on chromosome 19, and may disrupt the orthologue of Dac. We report here the positional cloning of Dac and show that it belongs to the F-box/WD40 gene family, which encodes adapters that target specific proteins for destruction by presenting them to the ubiquitination machinery. In conjunction with recent biochemical studies, this report demonstrates the importance of this gene family in vertebrate embryonic development.
View details for Web of Science ID 000082337300026
View details for PubMedID 10471509