Research & Scholarship
Current Research and Scholarly Interests
Please refer to our web sites for current research:
Independent Studies (15)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Reading in Cancer Biology
CBIO 299 (Aut, Win, Spr, Sum)
- Directed Reading in Genetics
GENE 299 (Aut, Win, Spr, Sum)
- Directed Reading in Pathology
PATH 299 (Aut, Win, Spr, Sum)
- Early Clinical Experience in Pathology
PATH 280 (Aut, Win, Spr, Sum)
- Graduate Research
CBIO 399 (Aut, Win, Spr, Sum)
- Graduate Research
GENE 399 (Aut, Win, Spr, Sum)
- Graduate Research
PATH 399 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
GENE 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
PATH 370 (Aut, Win, Spr, Sum)
- Supervised Study
GENE 260 (Aut, Win, Spr, Sum)
- Undergraduate Research
GENE 199 (Aut, Win, Spr, Sum)
- Undergraduate Research
PATH 199 (Aut, Win, Spr, Sum)
Inference of Tumor Phylogenies with Improved Somatic Mutation Discovery
JOURNAL OF COMPUTATIONAL BIOLOGY
2013; 20 (11): 933-944
Next-generation sequencing technologies provide a powerful tool for studying genome evolution during progression of advanced diseases such as cancer. Although many recent studies have employed new sequencing technologies to detect mutations across multiple, genetically related tumors, current methods do not exploit available phylogenetic information to improve the accuracy of their variant calls. Here, we present a novel algorithm that uses somatic single-nucleotide variations (SNVs) in multiple, related tissue samples as lineage markers for phylogenetic tree reconstruction. Our method then leverages the inferred phylogeny to improve the accuracy of SNV discovery. Experimental analyses demonstrate that our method achieves up to 32% improvement for somatic SNV calling of multiple, related samples over the accuracy of GATK's Unified Genotyper, the state-of-the-art multisample SNV caller.
View details for DOI 10.1089/cmb.2013.0106
View details for Web of Science ID 000326577600008
View details for PubMedID 24195709
Transcription-factor occupancy at HOT regions quantitatively predicts RNA polymerase recruitment in five human cell lines
High-occupancy target (HOT) regions are compact genome loci occupied by many different transcription factors (TFs). HOT regions were initially defined in invertebrate model organisms, and we here show that they are a ubiquitous feature of the human gene-regulation landscape. We identified HOT regions by a comprehensive analysis of ChIP-seq data from 96 DNA-associated proteins in 5 human cell lines. Most HOT regions co-localize with RNA polymerase II binding sites, but many are not near the promoters of annotated genes. At HOT promoters, TF occupancy is strongly predictive of transcription preinitiation complex recruitment and moderately predictive of initiating Pol II recruitment, but only weakly predictive of elongating Pol II and RNA transcript abundance. TF occupancy varies quantitatively within human HOT regions; we used this variation to discover novel associations between TFs. The sequence motif associated with any given TF's direct DNA binding is somewhat predictive of its empirical occupancy, but a great deal of occupancy occurs at sites without the TF's motif, implying indirect recruitment by another TF whose motif is present. Mammalian HOT regions are regulatory hubs that integrate the signals from diverse regulatory pathways to quantitatively tune the promoter for RNA polymerase II recruitment.
View details for DOI 10.1186/1471-2164-14-720
View details for Web of Science ID 000328633100002
View details for PubMedID 24138567
Genome evolution during progression to breast cancer
2013; 23 (7): 1097-1108
Cancer evolution involves cycles of genomic damage, epigenetic deregulation, and increased cellular proliferation that eventually culminate in the carcinoma phenotype. Early neoplasias, which are often found concurrently with carcinomas and are histologically distinguishable from normal breast tissue, are less advanced in phenotype than carcinomas and are thought to represent precursor stages. To elucidate their role in cancer evolution we performed comparative whole-genome sequencing of early neoplasias, matched normal tissue, and carcinomas from six patients, for a total of 31 samples. By using somatic mutations as lineage markers we built trees that relate the tissue samples within each patient. On the basis of these lineage trees we inferred the order, timing, and rates of genomic events. In four out of six cases, an early neoplasia and the carcinoma share a mutated common ancestor with recurring aneuploidies, and in all six cases evolution accelerated in the carcinoma lineage. Transition spectra of somatic mutations are stable and consistent across cases, suggesting that accumulation of somatic mutations is a result of increased ancestral cell division rather than specific mutational mechanisms. In contrast to highly advanced tumors that are the focus of much of the current cancer genome sequencing, neither the early neoplasia genomes nor the carcinomas are enriched with potentially functional somatic point mutations. Aneuploidies that occur in common ancestors of neoplastic and tumor cells are the earliest events that affect a large number of genes and may predispose breast tissue to eventual development of invasive carcinoma.
View details for DOI 10.1101/gr.151670.112
View details for Web of Science ID 000321119900007
View details for PubMedID 23568837
The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes
2013; 23 (5): 749-761
Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.
View details for DOI 10.1101/gr.148718.112
View details for PubMedID 23478400
Architecture of the human regulatory network derived from ENCODE data
2012; 489 (7414): 91-100
Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
View details for DOI 10.1038/nature11245
View details for Web of Science ID 000308347000042
View details for PubMedID 22955619
An integrated encyclopedia of DNA elements in the human genome
2012; 489 (7414): 57-74
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
View details for DOI 10.1038/nature11247
View details for Web of Science ID 000308347000039
View details for PubMedID 22955616
ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia
2012; 22 (9): 1813-1831
Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.
View details for DOI 10.1101/gr.136184.111
View details for Web of Science ID 000308272800021
View details for PubMedID 22955991
Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements
2012; 22 (9): 1735-1747
Gene regulation at functional elements (e.g., enhancers, promoters, insulators) is governed by an interplay of nucleosome remodeling, histone modifications, and transcription factor binding. To enhance our understanding of gene regulation, the ENCODE Consortium has generated a wealth of ChIP-seq data on DNA-binding proteins and histone modifications. We additionally generated nucleosome positioning data on two cell lines, K562 and GM12878, by MNase digestion and high-depth sequencing. Here we relate 14 chromatin signals (12 histone marks, DNase, and nucleosome positioning) to the binding sites of 119 DNA-binding proteins across a large number of cell lines. We developed a new method for unsupervised pattern discovery, the Clustered AGgregation Tool (CAGT), which accounts for the inherent heterogeneity in signal magnitude, shape, and implicit strand orientation of chromatin marks. We applied CAGT on a total of 5084 data set pairs to obtain an exhaustive catalog of high-resolution patterns of histone modifications and nucleosome positioning signals around bound transcription factors. Our analyses reveal extensive heterogeneity in how histone modifications are deposited, and how nucleosomes are positioned around binding sites. With the exception of the CTCF/cohesin complex, asymmetry of nucleosome positioning is predominant. Asymmetry of histone modifications is also widespread, for all types of chromatin marks examined, including promoter, enhancer, elongation, and repressive marks. The fine-resolution signal shapes discovered by CAGT unveiled novel correlation patterns between chromatin marks, nucleosome positioning, and sequence content. Meta-analyses of the signal profiles revealed a common vocabulary of chromatin signals shared across multiple cell lines and binding proteins.
View details for DOI 10.1101/gr.136366.111
View details for Web of Science ID 000308272800015
View details for PubMedID 22955985
Determinants of nucleosome organization in primary human cells
2011; 474 (7352): 516-U148
Nucleosomes are the basic packaging units of chromatin, modulating accessibility of regulatory proteins to DNA and thus influencing eukaryotic gene regulation. Elaborate chromatin remodelling mechanisms have evolved that govern nucleosome organization at promoters, regulatory elements, and other functional regions in the genome. Analyses of chromatin landscape have uncovered a variety of mechanisms, including DNA sequence preferences, that can influence nucleosome positions. To identify major determinants of nucleosome organization in the human genome, we used deep sequencing to map nucleosome positions in three primary human cell types and in vitro. A majority of the genome showed substantial flexibility of nucleosome positions, whereas a small fraction showed reproducibly positioned nucleosomes. Certain sites that position in vitro can anchor the formation of nucleosomal arrays that have cell type-specific spacing in vivo. Our results unveil an interplay of sequence-based nucleosome preferences and non-nucleosomal factors in determining nucleosome organization within mammalian cells.
View details for DOI 10.1038/nature10002
View details for Web of Science ID 000291939700050
View details for PubMedID 21602827
Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++
PLOS COMPUTATIONAL BIOLOGY
2010; 6 (12)
Computational efforts to identify functional elements within genomes leverage comparative sequence information by looking for regions that exhibit evidence of selective constraint. One way of detecting constrained elements is to follow a bottom-up approach by computing constraint scores for individual positions of a multiple alignment and then defining constrained elements as segments of contiguous, highly scoring nucleotide positions. Here we present GERP++, a new tool that uses maximum likelihood evolutionary rate estimation for position-specific scoring and, in contrast to previous bottom-up methods, a novel dynamic programming approach to subsequently define constrained elements. GERP++ evaluates a richer set of candidate element breakpoints and ranks them based on statistical significance, eliminating the need for biased heuristic extension techniques. Using GERP++ we identify over 1.3 million constrained elements spanning over 7% of the human genome. We predict a higher fraction than earlier estimates largely due to the annotation of longer constrained elements, which improves one to one correspondence between predicted elements with known functional sequences. GERP++ is an efficient and effective tool to provide both nucleotide- and element-level constraint scores within deep multiple sequence alignments.
View details for DOI 10.1371/journal.pcbi.1001025
View details for Web of Science ID 000285574600013
View details for PubMedID 21152010
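The bottom-up idea in the abstract above (score each alignment position, then group contiguous high-scoring positions into an element) can be illustrated with a small sketch. This is not the GERP++ dynamic programming method, which is not described here in enough detail to reproduce. It is a plain maximum-sum contiguous segment search (Kadane's algorithm), and the function name `best_segment` and the per-position scores are hypothetical. The scores are assumed to have had a threshold subtracted already, so unconstrained positions are negative.

```python
def best_segment(scores):
    """Return (start, end, total) for the maximum-sum contiguous run.

    `end` is exclusive. `scores` is a hypothetical list of per-position
    constraint scores, not real GERP++ output.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    best_start, best_end, best_total = 0, 1, scores[0]
    cur_start, cur_total = 0, scores[0]
    for i in range(1, len(scores)):
        if cur_total < 0:
            # A negative running total can only hurt; restart the segment here.
            cur_start, cur_total = i, scores[i]
        else:
            cur_total += scores[i]
        if cur_total > best_total:
            best_start, best_end, best_total = cur_start, i + 1, cur_total
    return best_start, best_end, best_total


# Hypothetical scores for seven alignment columns.
example_scores = [-1.0, 2.0, 3.0, -0.5, 4.0, -6.0, 1.0]
segment = best_segment(example_scores)
```

The sketch returns only the single best segment. Enumerating and ranking many candidate elements by statistical significance, as the abstract describes, would need more machinery than is shown here.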
Evolutionary constraint facilitates interpretation of genetic variation in resequenced human genomes
2010; 20 (3): 301-310
Here, we demonstrate how comparative sequence analysis facilitates genome-wide base-pair-level interpretation of individual genetic variation and address two questions of importance for human personal genomics: first, whether an individual's functional variation comes mostly from noncoding or coding polymorphisms; and, second, whether population-specific or globally-present polymorphisms contribute more to functional variation in any given individual. Neither has been definitively answered by analyses of existing variation data because of a focus on coding polymorphisms, ascertainment biases in favor of common variation, and a lack of base-pair-level resolution for identifying functional variants. We resequenced 575 amplicons within 432 individuals at genomic sites enriched for evolutionary constraint and also analyzed variation within three published human genomes. We find that single-site measures of evolutionary constraint derived from mammalian multiple sequence alignments are strongly predictive of reductions in modern-day genetic diversity across a range of annotation categories and across the allele frequency spectrum from rare (<1%) to high frequency (>10% minor allele frequency). Furthermore, we show that putatively functional variation in an individual genome is dominated by polymorphisms that do not change protein sequence and that originate from our shared ancestral population and commonly segregate in human populations. These observations show that common, noncoding alleles contribute substantially to human phenotypes and that constraint-based analyses will be of value to identify phenotypically relevant variants in individual genomes.
View details for DOI 10.1101/gr.102210.109
View details for Web of Science ID 000275124600002
View details for PubMedID 20067941
ProPhylER: A curated online resource for protein function and structure based on evolutionary constraint analyses
2010; 20 (1): 142-154
ProPhylER (Protein Phylogeny and Evolutionary Rates) is a next-generation curated proteome resource that uses comparative sequence analysis to predict constraint and mutation impact for eukaryotic proteins. Its purpose is to inform any research program for which protein function and structure are relevant, by the predictive power of evolutionary constraint analyses. ProPhylER currently has nearly 9000 clusters of related proteins, including more than 200,000 sequences. It serves data via two interfaces. The "ProPhylER Interface" displays predictive analyses in sequence space; the "CrystalPainter" maps evolutionary constraints onto solved protein structures. Here we summarize ProPhylER's data content and analysis pipeline, demonstrate the use of ProPhylER's interfaces, and evaluate ProPhylER's unique regional analysis of evolutionary constraint. The high accuracy of ProPhylER's regional analysis complements the high resolution of its single-site analysis to effectively guide and inform structure-function investigations and predict the impact of polymorphisms.
View details for DOI 10.1101/gr.097121.109
View details for Web of Science ID 000273249500015
View details for PubMedID 19846609
Jarid2/Jumonji Coordinates Control of PRC2 Enzymatic Activity and Target Gene Occupancy in Pluripotent Cells
2009; 139 (7): 1290-1302
Polycomb Repressive Complex 2 (PRC2) regulates key developmental genes in embryonic stem (ES) cells and during development. Here we show that Jarid2/Jumonji, a protein enriched in pluripotent cells and a founding member of the Jumonji C (JmjC) domain protein family, is a PRC2 subunit in ES cells. Genome-wide ChIP-seq analyses of Jarid2, Ezh2, and Suz12 binding reveal that Jarid2 and PRC2 occupy the same genomic regions. We further show that Jarid2 promotes PRC2 recruitment to the target genes while inhibiting PRC2 histone methyltransferase activity, suggesting that it acts as a "molecular rheostat" that finely calibrates PRC2 functions at developmental genes. Using Xenopus laevis as a model we demonstrate that Jarid2 knockdown impairs the induction of gastrulation genes in blastula embryos and results in failure of differentiation. Our findings illuminate a mechanism of histone methylation regulation in pluripotent cells and during early cell-fate transitions.
View details for DOI 10.1016/j.cell.2009.12.002
View details for Web of Science ID 000273048700017
View details for PubMedID 20064375
Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data
2008; 5 (9): 829-834
Molecular interactions between protein complexes and DNA mediate essential gene-regulatory functions. Uncovering such interactions by chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-Seq) has recently become the focus of intense interest. We here introduce quantitative enrichment of sequence tags (QuEST), a powerful statistical framework based on the kernel density estimation approach, which uses ChIP-Seq data to determine positions where protein complexes contact DNA. Using QuEST, we discovered several thousand binding sites for the human transcription factors SRF, GABP and NRSF at an average resolution of about 20 base pairs. MEME motif-discovery tool-based analyses of the QuEST-identified sequences revealed DNA binding by cofactors of SRF, providing evidence that cofactor binding specificity can be obtained from ChIP-Seq data. By combining QuEST analyses with Gene Ontology (GO) annotations and expression data, we illustrate how general functions of transcription factors can be inferred.
View details for DOI 10.1038/NMETH.1246
View details for Web of Science ID 000258912700017
View details for PubMedID 19160518
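The abstract above describes QuEST as based on kernel density estimation over ChIP-Seq sequence tags. As a rough illustration of that general technique only, the sketch below computes a Gaussian kernel density estimate from tag positions and picks the coordinate with the highest density. The Gaussian kernel, the 30 bp bandwidth, the function names, and the tag positions are all assumptions made for illustration. This is not the QuEST implementation.

```python
import math


def tag_density(positions, x, bandwidth=30.0):
    """Gaussian kernel density estimate of tag positions, evaluated at x."""
    if not positions:
        raise ValueError("positions must be non-empty")
    norm = len(positions) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions)
    return total / norm


def densest_position(positions, start, end, bandwidth=30.0):
    """Return the integer coordinate in [start, end) with the highest density."""
    return max(range(start, end), key=lambda x: tag_density(positions, x, bandwidth))


# Hypothetical tag start coordinates: a cluster near 100 and one stray tag at 400.
example_tags = [90, 95, 100, 105, 110, 400]
peak = densest_position(example_tags, 0, 500)
```

The bandwidth controls how far each tag's influence spreads: a small value yields sharper but noisier peaks, and a large value smooths neighboring sites together.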
The C. savignyi genetic map and its integration with the reference sequence facilitates insights into chordate genome evolution
2008; 18 (8): 1369-1379
The urochordate Ciona savignyi is an emerging model organism for the study of chordate evolution, development, and gene regulation. The extreme level of polymorphism in its population has inspired novel approaches in genome assembly, which we here continue to develop. Specifically, we present the reconstruction of all of C. savignyi's chromosomes via the development of a comprehensive genetic map, without a physical map intermediate. The resulting genetic map is complete, having one linkage group for each one of the 14 chromosomes. Eighty-three percent of the reference genome sequence is covered. The chromosomal reconstruction allowed us to investigate the evolution of genome structure in highly polymorphic species, by comparing the genome of C. savignyi to its divergent sister species, Ciona intestinalis. Both genomes have been extensively reshaped by intrachromosomal rearrangements. Interchromosomal changes have been extremely rare. This is in striking contrast to what has been observed in vertebrates, where interchromosomal events are commonplace. These results, when considered in light of the neutral theory, suggest fundamentally different modes of evolution of animal species with large versus small population sizes.
View details for DOI 10.1101/gr.078576.108
View details for Web of Science ID 000258116100018
View details for PubMedID 18519652
A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning
2008; 18 (7): 1051-1063
Using the massively parallel technique of sequencing by oligonucleotide ligation and detection (SOLiD; Applied Biosystems), we have assessed the in vivo positions of more than 44 million putative nucleosome cores in the multicellular genetic model organism Caenorhabditis elegans. These analyses provide a global view of the chromatin architecture of a multicellular animal at extremely high density and resolution. While we observe some degree of reproducible positioning throughout the genome in our mixed stage population of animals, we note that the major chromatin feature in the worm is a diversity of allowed nucleosome positions at the vast majority of individual loci. While absolute positioning of nucleosomes can vary substantially, relative positioning of nucleosomes (in a repeated array structure likely to be maintained at least in part by steric constraints) appears to be a significant property of chromatin structure. The high density of nucleosomal reads enabled a substantial extension of previous analysis describing the usage of individual oligonucleotide sequences along the span of the nucleosome core and linker. We release this data set, via the UCSC Genome Browser, as a resource for the high-resolution analysis of chromatin conformation and DNA accessibility at individual loci within the C. elegans genome.
View details for DOI 10.1101/gr.076463.108
View details for Web of Science ID 000257249100005
View details for PubMedID 18477713
Fruit fly family fun
2007; 131 (7): 1222-1223
A recent comparative analysis of the sequenced genomes of 12 Drosophila species (Drosophila 12 Genomes Consortium, 2007; Stark et al., 2007) reveals a comprehensive picture of the evolution of small animal genomes and greatly improves computational predictions of functional elements in the D. melanogaster reference sequence.
View details for DOI 10.1016/j.cell.2007.12.003
View details for Web of Science ID 000252217200009
View details for PubMedID 18160030
Functional architecture and evolution of transcriptional elements that drive gene coexpression
2007; 317 (5844): 1557-1560
Transcriptional coexpression of interacting gene products is required for complex molecular processes; however, the function and evolution of cis-regulatory elements that orchestrate coexpression remain largely unexplored. We mutagenized 19 regulatory elements that drive coexpression of Ciona muscle genes and obtained quantitative estimates of the cis-regulatory activity of the 77 motifs that comprise these elements. We found that individual motif activity ranges broadly within and among elements, and among different instantiations of the same motif type. The activity of orthologous motifs is strongly constrained, although motif arrangement, type, and activity vary greatly among the elements of different co-regulated genes. Thus, the syntactical rules governing this regulatory function are flexible but become highly constrained evolutionarily once they are established in a particular element.
View details for DOI 10.1126/science.1145893
View details for Web of Science ID 000249467900044
View details for PubMedID 17872446
Mammalian Comparative Sequence Analysis of the Agrp Locus
2007; 2 (8)
Agouti-related protein encodes a neuropeptide that stimulates food intake. Agrp expression in the brain is restricted to neurons in the arcuate nucleus of the hypothalamus and is elevated by states of negative energy balance. The molecular mechanisms underlying Agrp regulation, however, remain poorly defined. Using a combination of transgenic and comparative sequence analysis, we have previously identified a 760 bp conserved region upstream of Agrp which contains STAT binding elements that participate in Agrp transcriptional regulation. In this study, we attempt to improve the specificity for detecting conserved elements in this region by comparing genomic sequences from 10 mammalian species. Our analysis reveals a symmetrical organization of conserved sequences upstream of Agrp, which cluster into two inverted repeat elements. Conserved sequences within these elements suggest a role for homeodomain proteins in the regulation of Agrp and provide additional targets for functional evaluation.
View details for DOI 10.1371/journal.pone.0000702
View details for Web of Science ID 000207452400006
View details for PubMedID 17684549
Extreme genomic variation in a natural population
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2007; 104 (13): 5698-5703
Whole-genome sequence data from samples of natural populations provide fertile grounds for analyses of intraspecific variation and tests of population genetic theory. We show that the urochordate Ciona savignyi, one of the species of ocean-dwelling broadcast spawners commonly known as sea squirts, exhibits the highest rates of single-nucleotide and structural polymorphism ever comprehensively quantified in a multicellular organism. We demonstrate that the cause for the extreme heterozygosity is a large effective population size, and, consistent with prediction by the neutral theory, we find evidence of strong purifying selection. These results constitute in-depth insight into the dynamics of highly polymorphic genomes and provide important empirical support of population genetic theory as it pertains to population size, heterozygosity, and natural selection.
View details for DOI 10.1073/pnas.0700890104
View details for Web of Science ID 000245331700079
View details for PubMedID 17372217
A haplome alignment and reference sequence of the highly polymorphic Ciona savignyi genome
2007; 8 (3)
The sequence of Ciona savignyi was determined using a whole-genome shotgun strategy, but a high degree of polymorphism resulted in a fractured assembly wherein allelic sequences from the same genomic region assembled separately. We designed a multistep strategy to generate a nonredundant reference sequence from the original assembly by reconstructing and aligning the two 'haplomes' (haploid genomes). In the resultant 174 megabase reference sequence, each locus is represented once, misassemblies are corrected, and contiguity and continuity are dramatically improved.
View details for DOI 10.1186/gb-2007-8-3-r41
View details for Web of Science ID 000246081600014
View details for PubMedID 17374142
De novo discovery of a tissue-specific gene regulatory module in a chordate
2005; 15 (10): 1315-1324
We engage the experimental and computational challenges of de novo regulatory module discovery in a complex and largely unstudied metazoan genome. Our analysis is based on the comprehensive characterization of regulatory elements of 20 muscle genes in the chordate, Ciona savignyi. Three independent types of data we generate contribute to the characterization of a muscle-specific regulatory module: (1) Positive elements (PEs), short sequences sufficient for strong muscle expression that are identified in a high-resolution in vivo analysis; (2) CisModules (CMs), candidate regulatory modules defined by clusters of overrepresented motifs predicted de novo; and (3) Conserved elements (CEs), short noncoding sequences of strong conservation between C. savignyi and C. intestinalis. We estimate the accuracy of the computational predictions by an analysis of the intersection of these data. As final biological validation of the discovered muscle regulatory module, we implement a novel algorithm to search the genome for instances of the module and identify seven novel enhancers.
View details for DOI 10.1101/gr.4062605
View details for Web of Science ID 000232436800001
View details for PubMedID 16169925
Distribution and intensity of constraint in mammalian genomic sequence
2005; 15 (7): 901-913
Comparisons of orthologous genomic DNA sequences can be used to characterize regions that have been subject to purifying selection and are enriched for functional elements. We here present the results of such an analysis on an alignment of sequences from 29 mammalian species. The alignment captures approximately 3.9 neutral substitutions per site and spans approximately 1.9 Mbp of the human genome. We identify constrained elements from 3 bp to over 1 kbp in length, covering approximately 5.5% of the human locus. Our estimate for the total amount of nonexonic constraint experienced by this locus is roughly twice that for exonic constraint. Constrained elements tend to cluster, and we identify large constrained regions that correspond well with known functional elements. While constraint density inversely correlates with mobile element density, we also show the presence of unambiguously constrained elements overlapping mammalian ancestral repeats. In addition, we describe a number of elements in this region that have undergone intense purifying selection throughout mammalian evolution, and we show that these important elements are more numerous than previously thought. These results were obtained with Genomic Evolutionary Rate Profiling (GERP), a statistically rigorous and biologically transparent framework for constrained element identification. GERP identifies regions at high resolution that exhibit nucleotide substitution deficits, and measures these deficits as "rejected substitutions". Rejected substitutions reflect the intensity of past purifying selection and are used to rank and characterize constrained elements. We anticipate that GERP and the types of analyses it facilitates will provide further insights and improved annotation for the human genome as mammalian genome sequence data become richer.
View details for Web of Science ID 000230424000001
View details for PubMedID 15965027
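As a rough illustration of the "rejected substitutions" idea in the abstract above, the deficit at each alignment column can be sketched as the expected number of neutral substitutions minus the number observed. The sketch below is not the GERP implementation. The function names, the use of a single expected rate for every column, and the simple rule of reporting runs of positive-deficit columns are all assumptions made for illustration; the published method applies its own statistical criteria to call elements.

```python
from typing import List, Tuple


def rejected_substitutions(expected: List[float], observed: List[float]) -> List[float]:
    """Per-column deficit: expected neutral substitutions minus observed ones."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed must have the same length")
    return [e - o for e, o in zip(expected, observed)]


def constrained_segments(rs: List[float], min_len: int = 3) -> List[Tuple[int, int, float]]:
    """Find runs of columns with a positive deficit (hypothetical simplification).

    Returns (start, end_exclusive, summed deficit) for each run of at least
    min_len columns, ranked by summed deficit, largest first.
    """
    segments = []
    start = None
    # A trailing 0.0 acts as a sentinel that closes a run ending at the last column.
    for i, value in enumerate(rs + [0.0]):
        if value > 0 and start is None:
            start = i
        elif value <= 0 and start is not None:
            if i - start >= min_len:
                segments.append((start, i, sum(rs[start:i])))
            start = None
    return sorted(segments, key=lambda seg: seg[2], reverse=True)


# Toy input: every column is assumed to expect 3.9 neutral substitutions,
# the approximate per-site figure quoted in the abstract. Observed counts are invented.
expected = [3.9] * 8
observed = [4.0, 1.0, 0.5, 0.9, 4.2, 3.0, 2.0, 3.9]
rs = rejected_substitutions(expected, observed)
elements = constrained_segments(rs, min_len=3)
```

Here only columns 1 to 3 form a run long enough to be reported; the shorter run at columns 5 to 6 is dropped by the `min_len` threshold, which mirrors the 3 bp lower bound on element length mentioned in the abstract.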
Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity
2005; 15 (7): 978-986
We find that the degree of impairment of protein function by missense variants is predictable by comparative sequence analysis alone. The applicable range of impairment is not confined to binary predictions that distinguish normal from deleterious variants, but extends continuously from mild to severe effects. The accuracy of predictions is strongly dependent on sequence variation and is highest when diverse orthologs are available. High predictive accuracy is achieved by quantification of the physicochemical characteristics in each position of the protein, based on observed evolutionary variation. The strong relationship between physicochemical characteristics of a missense variant and impairment of protein function extends to human disease. By using four diverse proteins for which sufficient comparative sequence data are available, we show that grades of disease, or likelihood of developing cancer, correlate strongly with physicochemical constraint violation by causative amino acid variants.
View details for Web of Science ID 000230424000009
View details for PubMedID 15965030
Trade-offs in detecting evolutionarily constrained sequence by comparative genomics
ANNUAL REVIEW OF GENOMICS AND HUMAN GENETICS
2005; 6: 143-164
As whole-genome sequencing efforts extend beyond more traditional model organisms to include a deep diversity of species, comparative genomic analyses will be further empowered to reveal insights into the human genome and its evolution. The discovery and annotation of functional genomic elements is a necessary step toward a detailed understanding of our biology, and sequence comparisons have proven to be an integral tool for that task. This review is structured to broadly reflect the statistical challenges in discriminating these functional elements from the bulk of the genome that has evolved neutrally. Specifically, we review the comparative genomics literature in terms of specificity, sensitivity, and phylogenetic scope, as well as the trade-offs that relate these factors in standard analyses. We consider the impact of an expanding diversity of orthologous sequences on our ability to resolve functional elements. This impact is assessed through both recent comparative analyses of deep alignments and mathematical modeling.
View details for DOI 10.1146/annurev.genom.6.080604.162146
View details for Web of Science ID 000232441500008
View details for PubMedID 16124857
ABC: software for interactive browsing of genomic multiple sequence alignment data
Alignment and comparison of related genome sequences is a powerful method to identify regions likely to contain functional elements. Such analyses are data intensive, requiring the inclusion of genomic multiple sequence alignments, sequence annotations, and scores describing regional attributes of columns in the alignment. Visualization and browsing of results can be difficult, and there are currently limited software options for performing this task. The Application for Browsing Constraints (ABC) is interactive Java software for intuitive and efficient exploration of multiple sequence alignments and data typically associated with alignments. It is used to move quickly from a summary view of the entire alignment via arbitrary levels of resolution to individual alignment columns. It allows for the simultaneous display of quantitative data (e.g., sequence similarity or evolutionary rates) and annotation data (e.g., the locations of genes, repeats, and constrained elements). It can be used to facilitate basic comparative sequence tasks, such as export of data in plain-text formats, visualization of phylogenetic trees, and generation of alignment summary graphics. The ABC is a lightweight, stand-alone, and flexible graphical user interface for browsing genomic multiple sequence alignments of specific loci, up to hundreds of kilobases or a few megabases in length. It is coded in Java for cross-platform use and the program and source code are freely available under the General Public License. Documentation and a sample data set are also available at http://mendel.stanford.edu/sidowlab/downloads.html.
View details for DOI 10.1186/1471-2105-5-192
View details for Web of Science ID 000226622100001
View details for PubMedID 15588288
Noncoding regulatory sequences of Ciona exhibit strong correspondence between evolutionary constraint and functional importance
2004; 14 (12): 2448-2456
We show that sequence comparisons at different levels of resolution can efficiently guide functional analyses of regulatory regions in the ascidians Ciona savignyi and Ciona intestinalis. Sequence alignments of several tissue-specific genes guided discovery of minimal regulatory regions that are active in whole-embryo reporter assays. Using the Troponin I (TnI) locus as a case study, we show that more refined local sequence analyses can then be used to reveal functional substructure within a regulatory region. A high-resolution saturation mutagenesis in conjunction with comparative sequence analyses defined essential sequence elements within the TnI regulatory region. Finally, we found a significant, quantitative relationship between function and sequence divergence of noncoding functional elements. This work demonstrates the power of comparative sequence analysis between the two Ciona species for guiding gene regulatory experiments.
View details for DOI 10.1101/gr.2964504
View details for Web of Science ID 000225550400009
View details for PubMedID 15545496
Characterization of evolutionary rates and constraints in three mammalian genomes
2004; 14 (4): 539-548
We present an analysis of rates and patterns of microevolutionary phenomena that have shaped the human, mouse, and rat genomes since their last common ancestor. We find evidence for a shift in the mutational spectrum between the mouse and rat lineages, with the net effect being a relative increase in GC content in the rat genome. Our estimate for the neutral point substitution rate separating the two rodents is 0.196 substitutions per site, and 0.65 substitutions per site for the tree relating all three mammals. Small insertions and deletions of 1-10 bp in length ("microindels") occur at approximately 5% of the point substitution rate. Inferred regional correlations in evolutionary rates between lineages and between types of sites support the idea that rates of evolution are influenced by local genomic or cell biological context. No substantial correlations between rates of point substitutions and rates of microindels are found, however, implying that the influences that affect these processes are distinct. Finally, we have identified those regions in the human genome that are evolving slowly, which are likely to include functional elements important to human biology. At least 5% of the human genome is under substantial constraint, most of which is noncoding.
View details for DOI 10.1101/gr.2034704
View details for Web of Science ID 000220629900005
View details for PubMedID 15059994
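The rates quoted in the abstract above can be combined in a short worked calculation. The sketch below assumes that the "approximately 5%" microindel fraction applies uniformly to both quoted point substitution rates, which the abstract does not state explicitly; the constant and function names are illustrative.

```python
# Figures quoted in the abstract (substitutions per site).
POINT_RATE_RODENTS = 0.196   # neutral rate separating mouse and rat
POINT_RATE_TREE = 0.65       # neutral rate for the human-mouse-rat tree
MICROINDEL_FRACTION = 0.05   # microindels occur at roughly 5% of the point rate


def microindel_rate(point_rate: float, fraction: float = MICROINDEL_FRACTION) -> float:
    """Approximate microindel events per site implied by a point substitution rate."""
    return point_rate * fraction


rodent_microindels = microindel_rate(POINT_RATE_RODENTS)  # about 0.0098 per site
tree_microindels = microindel_rate(POINT_RATE_TREE)       # about 0.0325 per site
```

Under that assumption, the two rodents would be separated by roughly one microindel per hundred neutral sites.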
Automated whole-genome multiple alignment of rat, mouse, and human
2004; 14 (4): 685-692
We have built a whole-genome multiple alignment of the three currently available mammalian genomes using a fully automated pipeline that combines the local/global approach of the Berkeley Genome Pipeline and the LAGAN program. The strategy is based on progressive alignment and consists of two main steps: (1) alignment of the mouse and rat genomes, and (2) alignment of human to either the mouse-rat alignments from step 1, or the remaining unaligned mouse and rat sequences. The resulting alignments demonstrate high sensitivity, with 87% of all human gene-coding areas aligned in both mouse and rat. The specificity is also high: <7% of the rat contigs are aligned to multiple places in human, and 97% of all alignments with human sequence >100 kb agree with a three-way synteny map built independently, using predicted exons in the three genomes. At the nucleotide level <1% of the rat nucleotides are mapped to multiple places in the human sequence in the alignment, and 96.5% of human nucleotides within all alignments agree with the synteny map. The alignments are publicly available online, with visualization through the novel Multi-VISTA browser that we also present.
View details for DOI 10.1101/gr.2067704
View details for Web of Science ID 000220629900022
View details for PubMedID 15060011
Genome sequence of the brown Norway rat yields insights into mammalian evolution
NATURE
2004; 428
Genomic regulatory regions: insights from comparative sequence analysis
CURRENT OPINION IN GENETICS & DEVELOPMENT
2003; 13 (6): 604-610
Comparative sequence analysis is contributing to the identification and characterization of genomic regulatory regions with functional roles. It is effective because functionally important regions tend to evolve at a slower rate than do less important regions. The choice of species for comparative analysis is crucial: shared ancestry of a clade of species facilitates the discovery of genomic features important to that clade, whereas increased sequence divergence improves the resolution at which features can be discovered. Recent studies suggest that comparative analyses are useful for all branches of life and that, in the near future, large-scale mammalian comparative sequence analysis will provide the best approach for the comprehensive discovery of human regulatory elements.
View details for DOI 10.1016/j.gde.2003.10.001
View details for Web of Science ID 000187248400009
View details for PubMedID 14638322
Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes
2003; 13 (5): 813-820
Comparative sequence analyses on a collection of carefully chosen mammalian genomes could facilitate identification of functional elements within the human genome and allow quantification of evolutionary constraint at the single nucleotide level. High-resolution quantification would be informative for determining the distribution of important positions within functional elements and for evaluating the relative importance of nucleotide sites that carry single nucleotide polymorphisms (SNPs). Because the level of resolution in comparative sequence analyses is a direct function of sequence diversity, we propose that the information content of a candidate mammalian genome be defined as the sequence divergence it would add relative to already-sequenced genomes. We show that reliable estimates of genomic sequence divergence can be obtained from small genomic regions. On the basis of a multiple sequence alignment of approximately 1.4 megabases each from eight mammals, we generate such estimates for five unsequenced mammals. Estimates of the neutral divergence in these data suggest that a small number of diverse mammalian genomes in addition to human, mouse, and rat would allow single nucleotide resolution in comparative sequence analyses.
View details for DOI 10.1101/gr.1064503
View details for Web of Science ID 000182645500007
View details for PubMedID 12727901
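The abstract above proposes measuring a candidate genome's information content as the sequence divergence it would add to the genomes already sequenced. One way to sketch that idea, assuming divergence is summarized as total branch length on a phylogenetic tree, is to compare the branch length spanned by the sequenced species with and without the candidate. The tree encoding, the function names, and every branch length below are hypothetical placeholders, not values or methods from the paper.

```python
from typing import Dict, Iterable, Optional, Tuple

# child -> (parent, branch length in substitutions per site); the root has parent None.
Tree = Dict[str, Tuple[Optional[str], float]]


def spanned_divergence(tree: Tree, species: Iterable[str]) -> float:
    """Total length of the branches that separate at least two chosen species."""
    chosen = set(species)
    below = {node: 0 for node in tree}
    # Count, for every node, how many chosen species sit at or beneath it.
    for leaf in chosen:
        node: Optional[str] = leaf
        while node is not None:
            below[node] += 1
            node = tree[node][0]
    total = len(chosen)
    # A branch contributes only if it has chosen species on both sides.
    return sum(
        length
        for node, (parent, length) in tree.items()
        if parent is not None and 0 < below[node] < total
    )


def added_divergence(tree: Tree, sequenced: Iterable[str], candidate: str) -> float:
    """Divergence the candidate adds relative to the already-sequenced set."""
    base = set(sequenced)
    return spanned_divergence(tree, base | {candidate}) - spanned_divergence(tree, base)


# Toy tree with invented branch lengths.
toy_tree: Tree = {
    "root": (None, 0.0),
    "human": ("root", 0.10),
    "rodents": ("root", 0.20),
    "mouse": ("rodents", 0.08),
    "rat": ("rodents", 0.09),
    "candidate": ("root", 0.15),
}
gain = added_divergence(toy_tree, ["human", "mouse", "rat"], "candidate")
```

In this toy tree the candidate attaches at the root, so it contributes exactly its own branch length; a candidate that branched off inside an existing lineage could also bring internal branches into the spanned tree.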
LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA
2003; 13 (4): 721-731
To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. We present LAGAN, a system for rapid global alignment of two homologous genomic sequences, and Multi-LAGAN, a system for multiple global alignment of genomic sequences. We tested our systems on a data set consisting of greater than 12 Mb of high-quality sequence from 12 vertebrate species. All the sequence was derived from the genomic region orthologous to an approximately 1.5-Mb region on human chromosome 7q31.3. We found that both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu. Multi-LAGAN produced the most accurate alignments, while requiring just 75 minutes on a personal computer to obtain the multiple alignment of all 12 sequences. Multi-LAGAN is a practical method for generating multiple alignments of long genomic sequences at any evolutionary distance. Our systems are publicly available at http://lagan.stanford.edu.
View details for DOI 10.1101/gr.926603
View details for Web of Science ID 000182046300018
View details for PubMedID 12654723
The integrity of a cholesterol-binding pocket in Niemann-Pick C2 protein is necessary to control lysosome cholesterol levels
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2003; 100 (5): 2518-2525
The neurodegenerative disease Niemann-Pick Type C2 (NPC2) results from mutations in the NPC2 (HE1) gene that cause abnormally high cholesterol accumulation in cells. We find that purified NPC2, a secreted soluble protein, binds cholesterol specifically with a much higher affinity (K(d) = 30-50 nM) than previously reported. Genetic and biochemical studies identified single amino acid changes that prevent both cholesterol binding and the restoration of normal cholesterol levels in mutant cells. The amino acids that affect cholesterol binding surround a hydrophobic pocket in the NPC2 protein structure, identifying a candidate sterol-binding location. On the basis of evolutionary analysis and mutagenesis, three other regions of the NPC2 protein emerged as important, including one required for efficient secretion.
View details for DOI 10.1073/pnas.0530027100
View details for Web of Science ID 000181365000065
View details for PubMedID 12591949
Functional evolution in the ancestral lineage of vertebrates or when genomic complexity was wagging its morphological tail.
Journal of structural and functional genomics
2003; 3 (1-4): 45-52
Early vertebrate evolution is characterized by a significant increase of organismal complexity over a relatively short time span. We present quantitative evidence for a high rate of increase in morphological complexity during early vertebrate evolution. Possible molecular evolutionary mechanisms that underlie this increase in complexity fall into a small number of categories, one of which is gene duplication and subsequent structural or regulatory neofunctionalization. We discuss analyses of two gene families whose regulatory and structural evolution shed light on the connection between gene duplication and increases in organismal complexity.
View details for PubMedID 12836684
Sequence first. Ask questions later.
2002; 111 (1): 13-16
Comparative sequence analyses of eukaryotic genes and genomic regions are beginning to provide a wealth of information that is directly relevant to human biology. Functional changes that set us apart from apes are identifiable, as are functional constraints in proteins and genomic elements that arose in our relatively distant phylogenetic past.
View details for Web of Science ID 000178461900004
View details for PubMedID 12372296
Inference of functional regions in proteins by quantification of evolutionary constraints
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (5): 2912-2917
Likelihood estimates of local rates of evolution within proteins reveal that selective constraints on structure and function are quantitatively stable over billions of years of divergence. The stability of constraints produces an intramolecular clock that gives each protein a characteristic pattern of evolutionary rates along its sequence. This pattern allows the identification of constrained regions and, because the rate of evolution is a quantitative measure of the strength of the constraint, of their functional importance. We show that results from such analyses, which require only sequence alignments, are consistent with experimental and mutational data. The methodology has significant predictive power and may be used to guide structure-function studies for any protein represented by a modest number of homologs in sequence databases.
View details for DOI 10.1073/pnas.042692299
View details for Web of Science ID 000174284600059
View details for PubMedID 11880638
A novel member of the F-box/WD40 gene family, encoding dactylin, is disrupted in the mouse dactylaplasia mutant
1999; 23 (1): 104-107
Early outgrowth of the vertebrate embryonic limb requires signalling by the apical ectodermal ridge (AER) to the progress zone (PZ), which in response proliferates and lays down the pattern of the presumptive limb in a proximal to distal progression. Signals from the PZ maintain the AER until the anlagen for the distal phalanges have been formed. The semidominant mouse mutant dactylaplasia (Dac) disrupts the maintenance of the AER, leading to truncation of distal structures of the developing footplate, or autopod. Adult Dac homozygotes thus lack hands and feet except for malformed single digits, whereas heterozygotes lack phalanges of the three middle digits. Dac resembles the human autosomal dominant split hand/foot malformation (SHFM) diseases. One of these, SHFM3, maps to chromosome 10q24 (Refs 6,7), which is syntenic to the Dac region on chromosome 19, and may disrupt the orthologue of Dac. We report here the positional cloning of Dac and show that it belongs to the F-box/WD40 gene family, which encodes adapters that target specific proteins for destruction by presenting them to the ubiquitination machinery. In conjunction with recent biochemical studies, this report demonstrates the importance of this gene family in vertebrate embryonic development.
View details for Web of Science ID 000082337300026
View details for PubMedID 10471509