Dr. Carlos D. Bustamante is an internationally recognized leader in the application of data science and genomics technology to problems in medicine, agriculture, and biology. He received his Ph.D. in Biology and MS in Statistics from Harvard University (2001), was on the faculty at Cornell University (2002-9), and was named a MacArthur Fellow in 2010. He is currently Professor of Biomedical Data Science, Genetics, and (by courtesy) Biology at Stanford University. Dr. Bustamante has a passion for building new academic units, non-profits, and companies to solve pressing scientific challenges. He is Founding Director of the Stanford Center for Computational, Evolutionary, and Human Genomics (CEHG) and Inaugural Chair of the Department of Biomedical Data Science. He is the Owner and President of CDB Consulting, LTD. and also a Director at Eden Roc Biotech, founder of Arc-Bio (formerly IdentifyGenomics and BigData Bio), and an SAB member of Embark Veterinary, the Mars/IBM Food Safety Board, and Digital Ventures.

Academic Appointments

Administrative Appointments

  • Founding Director, Stanford Center for Computational, Evolutionary, and Human Genomics (CEHG) (2012 - 2017)
  • Inaugural Chair, Department of Biomedical Data Science (2015 - Present)

Honors & Awards

  • Marshall Sherfield Fellow, Marshall Aid Commemoration Commission (2001-2)
  • Sloan Research Fellow in Molecular Biology, Sloan Foundation (2007-9)
  • Provost Award for Distinguished Research, Cornell University (2008)
  • MacArthur Fellow, John D. and Catherine T. MacArthur Foundation (2010)

Boards, Advisory Committees, Professional Organizations

  • Editorial Board, Genome Research (2008 - Present)
  • Advisory Board, Slim Initiative for Genomic Medicine in the Americas (2010 - Present)
  • Editorial Board, Human Biology (2010 - Present)
  • Advisory Board, External Evaluation Committee NIDDK T2D GENES project (2011 - Present)
  • Advisory Board, National Human Genome Research Institute Council (2011 - Present)
  • Advisory Board, Online Mendelian Inheritance in Man (OMIM) (2013 - Present)
  • Advisory Board, NIH Council of Councils (2013 - Present)
  • Advisory Board, National Geographic Genographic Project (2013 - Present)
  • Editorial Board, American Journal of Human Genetics (2013 - Present)
  • Senior Editor, Evolution, PLoS Genetics (2013 - Present)

Professional Education

  • B.A., Harvard University, Biology (1997)
  • M.S., Harvard University, Statistics (2001)
  • Ph.D., Harvard University, Biology (2001)
  • Postdoc, University of Oxford, Mathematical Genetics (2002)

Research & Scholarship

Current Research and Scholarly Interests

My research focuses on analyzing genome-wide patterns of variation within and between species to address fundamental questions in biology, anthropology, and medicine. My group works on a variety of organisms and model systems ranging from humans and other primates to domesticated plants and animals. Much of our research is at the interface of computational biology, mathematical genetics, and evolutionary genomics.

Clinical Trials

  • Personal Genomics for Preventive Cardiology (Not Recruiting)

    The purpose of this study is to see if providing information to a person on their inherited (genetic) risk of cardiovascular disease (CVD) helps to motivate that person to change their diet, lifestyle or medication regimen to alter their risk.

    Stanford is currently not accepting patients for this trial. For more information, please contact Josh Knowles, 650-804-2526.

All Publications

  • Dispersals and genetic adaptation of Bantu-speaking populations in Africa and North America SCIENCE Patin, E., Lopez, M., Grollemund, R., Verdu, P., Harmant, C., Quach, H., Laval, G., Perry, G. H., Barreiro, L. B., Froment, A., Heyer, E., Massougbodji, A., Fortes-Lima, C., Migot-Nabias, F., Bellis, G., Dugoujon, J., Pereira, J. B., Fernandes, V., Pereira, L., Van der Veen, L., Mouguiama-Daouda, P., Bustamante, C. D., Hombert, J., Quintana-Murci, L. 2017; 356 (6337): 543-546


    Bantu languages are spoken by about 310 million Africans, yet the genetic history of Bantu-speaking populations remains largely unexplored. We generated genomic data for 1318 individuals from 35 populations in western central Africa, where Bantu languages originated. We found that early Bantu speakers first moved southward, through the equatorial rainforest, before spreading toward eastern and southern Africa. We also found that genetic adaptation of Bantu speakers was facilitated by admixture with local populations, particularly for the HLA and LCT loci. Finally, we identified a major contribution of western central African Bantu speakers to the ancestry of African Americans, whose genomes present no strong signals of natural selection. Together, these results highlight the contribution of Bantu-speaking peoples to the complex genetic history of Africans and African Americans.

    View details for DOI 10.1126/science.aal1988

    View details for Web of Science ID 000400545700039

    View details for PubMedID 28473590

  • Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data BIOINFORMATICS Shringarpure, S. S., Mathias, R. A., Hernandez, R. D., O'Connor, T. D., Szpiech, Z. A., Torres, R., De La Vega, F. M., Bustamante, C. D., Barnes, K. C., Taub, M. A. 2017; 33 (8): 1147-1153
  • Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations AMERICAN JOURNAL OF HUMAN GENETICS Martin, A. R., Gignoux, C. R., Walters, R. K., Wojcik, G. L., Neale, B. M., Gravel, S., Daly, M. J., Bustamante, C. D., Kenny, E. E. 2017; 100 (4): 635-649


    The vast majority of genome-wide association studies (GWASs) are performed in Europeans, and their transferability to other populations is dependent on many factors (e.g., linkage disequilibrium, allele frequencies, genetic architecture). As medical genomics studies become increasingly large and diverse, gaining insights into population history and consequently the transferability of disease risk measurement is critical. Here, we disentangle recent population history in the widely used 1000 Genomes Project reference panel, with an emphasis on populations underrepresented in medical studies. To examine the transferability of single-ancestry GWASs, we used published summary statistics to calculate polygenic risk scores for eight well-studied phenotypes. We identify directional inconsistencies in all scores; for example, height is predicted to decrease with genetic distance from Europeans, despite robust anthropological evidence that West Africans are as tall as Europeans on average. To gain deeper quantitative insights into GWAS transferability, we developed a complex trait coalescent-based simulation framework considering effects of polygenicity, causal allele frequency divergence, and heritability. As expected, correlations between true and inferred risk are typically highest in the population from which summary statistics were derived. We demonstrate that scores inferred from European GWASs are biased by genetic drift in other populations even when choosing the same causal variants and that biases in any direction are possible and unpredictable. This work cautions that summarizing findings from large-scale GWASs may have limited portability to other populations using standard approaches and highlights the need for generalized risk prediction methods and the inclusion of more diverse individuals in medical genomics.

    View details for DOI 10.1016/j.ajhg.2017.03.004

    View details for Web of Science ID 000398389600006

    View details for PubMedID 28366442

  • Population genetic analysis of the DARC locus (Duffy) reveals adaptation from standing variation associated with malaria resistance in humans PLOS GENETICS McManus, K. F., Taravella, A. M., Henn, B. M., Bustamante, C. D., Sikora, M., Cornejo, O. E. 2017; 13 (3)


    The human DARC (Duffy antigen receptor for chemokines) gene encodes a membrane-bound chemokine receptor crucial for the infection of red blood cells by Plasmodium vivax, a major causative agent of malaria. Of the three major allelic classes segregating in human populations, the FY*O allele has been shown to protect against P. vivax infection and is at near fixation in sub-Saharan Africa, while FY*B and FY*A are common in Europe and Asia, respectively. Due to the combination of strong geographic differentiation and association with malaria resistance, DARC is considered a canonical example of positive selection in humans. Despite this, details of the timing and mode of selection at DARC remain poorly understood. Here, we use sequencing data from over 1,000 individuals in twenty-one human populations, as well as ancient human genomes, to perform a fine-scale investigation of the evolutionary history of DARC. We estimate the time to most recent common ancestor (TMRCA) of the most common FY*O haplotype to be 42 kya (95% CI: 34-49 kya). We infer the FY*O null mutation swept to fixation in Africa from standing variation with very low initial frequency (0.1%) and a selection coefficient of 0.043 (95% CI: 0.011-0.18), which is among the strongest estimated in the human genome. We estimate the TMRCA of the FY*A mutation in non-Africans to be 57 kya (95% CI: 48-65 kya) and infer that, prior to the sweep of FY*O, all three alleles were segregating in Africa, as highly diverged populations from Asia and ǂKhomani San hunter-gatherers share the same FY*A haplotypes. We test multiple models of admixture that may account for this observation and reject recent Asian or European admixture as the cause.

    View details for DOI 10.1371/journal.pgen.1006560

    View details for Web of Science ID 000398043000036

    View details for PubMedID 28282382

  • Strategies for Enriching Variant Coverage in Candidate Disease Loci on a Multiethnic Genotyping Array PLOS ONE Bien, S. A., Wojcik, G. L., Zubair, N., Gignoux, C. R., Martin, A. R., Kocarnik, J. M., Martin, L. W., Buyske, S., Haessler, J., Walker, R. W., Cheng, I., Graff, M., Xia, L., Franceschini, N., Matise, T., James, R., Hindorff, L., Le Marchand, L., North, K. E., Haiman, C. A., Peters, U., Loos, R. J., Kooperberg, C. L., Bustamante, C. D., Kenny, E. E., Carlson, C. S. 2016; 11 (12)


    Investigating genetic architecture of complex traits in ancestrally diverse populations is imperative to understand the etiology of disease. However, the current paucity of genetic research in people of African and Latin American ancestry, Hispanic and indigenous peoples in the United States is likely to exacerbate existing health disparities for many common diseases. The Population Architecture using Genomics and Epidemiology, Phase II (PAGE II), Study was initiated in 2013 by the National Human Genome Research Institute to expand our understanding of complex trait loci in ethnically diverse and well characterized study populations. To meet this goal, the Multi-Ethnic Genotyping Array (MEGA) was designed to substantially improve fine-mapping and functional discovery by increasing variant coverage across multiple ethnicities at known loci for metabolic, cardiovascular, renal, inflammatory, anthropometric, and a variety of lifestyle traits. Studying the frequency distribution of clinically relevant mutations, putative risk alleles, and known functional variants across multiple populations will provide important insight into the genetic architecture of complex diseases and facilitate the discovery of novel, sometimes population-specific, disease associations. DNA samples from 51,650 self-identified African ancestry (17,328), Hispanic/Latino (22,379), Asian/Pacific Islander (8,640), and American Indian (653) and an additional 2,650 participants of either South Asian or European ancestry, and other reference panels have been genotyped on MEGA by PAGE II. MEGA was designed as a new resource for studying ancestrally diverse populations. Here, we describe the methodology for selecting trait-specific content for use in multi-ethnic populations and how enriching MEGA for this content may contribute to deeper biological understanding of the genetic etiology of complex disease.

    View details for DOI 10.1371/journal.pone.0167758

    View details for Web of Science ID 000392754300044

    View details for PubMedID 27973554

    View details for PubMedCentralID PMC5156387

  • A continuum of admixture in the Western Hemisphere revealed by the African Diaspora genome NATURE COMMUNICATIONS Mathias, R. A., Taub, M. A., Gignoux, C. R., Fu, W., Musharoff, S., O'Connor, T. D., Vergara, C., Torgerson, D. G., Pino-Yanes, M., Shringarpure, S. S., Huang, L., Rafaels, N., Boorgula, M. P., Johnston, H. R., Ortega, V. E., Levin, A. M., Song, W., Torres, R., Padhukasahasram, B., Eng, C., Mejia-Mejia, D., Ferguson, T., Qin, Z. S., Scott, A. F., Yazdanbakhsh, M., Wilson, J. G., Marrugo, J., Lange, L. A., Kumar, R., Avila, P. C., Williams, L. K., Watson, H., Ware, L. B., Olopade, C., Olopade, O., Oliveira, R., Ober, C., Nicolae, D. L., Meyers, D., Mayorga, A., Knight-Madden, J., Hartert, T., Hansel, N. N., Foreman, M. G., Ford, J. G., Faruque, M. U., Dunston, G. M., Caraballo, L., Burchard, E. G., Bleecker, E., Araujo, M. I., Herrera-Paz, E. F., Gietzen, K., Grus, W. E., Bamshad, M., Bustamante, C. D., Kenny, E. E., Hernandez, R. D., Beaty, T. H., Ruczinski, I., Akey, J., Barnes, K. C. 2016; 7


    The African Diaspora in the Western Hemisphere represents one of the largest forced migrations in history and had a profound impact on genetic diversity in modern populations. To date, the fine-scale population structure of descendants of the African Diaspora remains largely uncharacterized. Here we present genetic variation from deeply sequenced genomes of 642 individuals from North and South American, Caribbean and West African populations, substantially increasing the lexicon of human genomic variation and suggesting much variation remains to be discovered in African-admixed populations in the Americas. We summarize genetic variation in these populations, quantifying the postcolonial sex-biased European gene flow across multiple regions. Moreover, we refine estimates on the burden of deleterious variants carried across populations and how this varies with African ancestry. Our data are an important resource for empowering disease mapping studies in African-admixed individuals and will facilitate gene discovery for diseases disproportionately affecting individuals of African ancestry.

    View details for DOI 10.1038/ncomms12522

    View details for Web of Science ID 000385544300002

    View details for PubMedID 27725671

    View details for PubMedCentralID PMC5062574

  • REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants AMERICAN JOURNAL OF HUMAN GENETICS Ioannidis, N. M., Rothstein, J. H., Pejaver, V., Middha, S., McDonnell, S. K., Baheti, S., Musolf, A., Li, Q., Holzinger, E., Karyadi, D., Cannon-Albright, L. A., Teerlink, C. C., Stanford, J. L., Isaacs, W. B., Xu, J., Cooney, K. A., Lange, E. M., Schleutker, J., Carpten, J. D., Powell, I. J., Cussenot, O., Cancel-Tassin, G., Giles, G. G., MacInnis, R. J., Maier, C., Hsieh, C., Wiklund, F., Catalona, W. J., Foulkes, W. D., Mandal, D., Eeles, R. A., Kote-Jarai, Z., Bustamante, C. D., Schaid, D. J., Hastie, T., Ostrander, E. A., Bailey-Wilson, J. E., Radivojac, P., Thibodeau, S. N., Whittemore, A. S., Sieh, W. 2016; 99 (4): 877-885


    The vast majority of coding variants are rare, and assessment of the contribution of rare variants to complex traits is hampered by low statistical power and limited functional data. Improved methods for predicting the pathogenicity of rare coding variants are needed to facilitate the discovery of disease variants from exome sequencing studies. We developed REVEL (rare exome variant ensemble learner), an ensemble method for predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL was trained with recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. When applied to two independent test sets, REVEL had the best overall performance (p < 10^-12) as compared to any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies <0.5%. The area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants and 0.027-0.143 higher in an independent test set of 1,953 pathogenic and 2,406 benign variants recently reported in ClinVar than the AUCs for other ensemble methods. We provide pre-computed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale.

    View details for DOI 10.1016/j.ajhg.2016.08.016

    View details for Web of Science ID 000385333700007

    View details for PubMedID 27666373

    View details for PubMedCentralID PMC5065685

  • Multidimensional structure-function relationships in human beta-cardiac myosin from population-scale genetic variation PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Homburger, J. R., Green, E. M., Caleshu, C., Sunitha, M. S., Taylor, R. E., Ruppel, K. M., Metpally, R. P., Colan, S. D., Michels, M., Day, S. M., Olivotto, I., Bustamante, C. D., Dewey, F. E., Ho, C. Y., Spudich, J. A., Ashley, E. A. 2016; 113 (24): 6701-6706


    Myosin motors are the fundamental force-generating elements of muscle contraction. Variation in the human β-cardiac myosin heavy chain gene (MYH7) can lead to hypertrophic cardiomyopathy (HCM), a heritable disease characterized by cardiac hypertrophy, heart failure, and sudden cardiac death. How specific myosin variants alter motor function or clinical expression of disease remains incompletely understood. Here, we combine structural models of myosin from multiple stages of its chemomechanical cycle, exome sequencing data from two population cohorts of 60,706 and 42,930 individuals, and genetic and phenotypic data from 2,913 patients with HCM to identify regions of disease enrichment within β-cardiac myosin. We first developed computational models of the human β-cardiac myosin protein before and after the myosin power stroke. Then, using a spatial scan statistic modified to analyze genetic variation in protein 3D space, we found significant enrichment of disease-associated variants in the converter, a kinetic domain that transduces force from the catalytic domain to the lever arm to accomplish the power stroke. Focusing our analysis on surface-exposed residues, we identified a larger region significantly enriched for disease-associated variants that contains both the converter domain and residues on a single flat surface on the myosin head described as the myosin mesa. Notably, patients with HCM with variants in the enriched regions have earlier disease onset than patients who have HCM with variants elsewhere. Our study provides a model for integrating protein structure, large-scale genetic sequencing, and detailed phenotypic data to reveal insight into time-shifted protein structures and genetic disease.

    View details for DOI 10.1073/pnas.1606950113

    View details for Web of Science ID 000377948800046

    View details for PubMedID 27247418

    View details for PubMedCentralID PMC4914177

  • Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences NATURE GENETICS Poznik, G. D., Xue, Y., Mendez, F. L., Willems, T. F., Massaia, A., Sayres, M. A., Ayub, Q., McCarthy, S. A., Narechania, A., Kashin, S., Chen, Y., Banerjee, R., Rodriguez-Flores, J. L., Cerezo, M., Shao, H., Gymrek, M., Malhotra, A., Louzada, S., DeSalle, R., Ritchie, G. R., Cerveira, E., Fitzgerald, T. W., Garrison, E., Marcketta, A., Mittelman, D., Romanovitch, M., Zhang, C., Zheng-Bradley, X., Abecasis, G. R., McCarroll, S. A., Flicek, P., Underhill, P. A., Coin, L., Zerbino, D. R., Yang, F., Lee, C., Clarke, L., Auton, A., Erlich, Y., Handsaker, R. E., Bustamante, C. D., Tyler-Smith, C. 2016; 48 (6): 593-?


    We report the sequences of 1,244 human Y chromosomes randomly ascertained from 26 worldwide populations by the 1000 Genomes Project. We discovered more than 65,000 variants, including single-nucleotide variants, multiple-nucleotide variants, insertions and deletions, short tandem repeats, and copy number variants. Of these, copy number variants contribute the greatest predicted functional impact. We constructed a calibrated phylogenetic tree on the basis of binary single-nucleotide variants and projected the more complex variants onto it, estimating the number of mutations for each class. Our phylogeny shows bursts of extreme expansion in male numbers that have occurred independently among each of the five continental superpopulations examined, at times of known migrations and technological innovations.

    View details for DOI 10.1038/ng.3559

    View details for Web of Science ID 000376744200005

    View details for PubMedID 27111036

    View details for PubMedCentralID PMC4884158

  • Mechanisms Underlying Adaptation to Life in Hydrogen Sulfide-Rich Environments MOLECULAR BIOLOGY AND EVOLUTION Kelley, J. L., Arias-Rodriguez, L., Martin, D. P., Yee, M., Bustamante, C. D., Tobler, M. 2016; 33 (6): 1419-1434


    Hydrogen sulfide (H2S) is a potent toxicant interfering with oxidative phosphorylation in mitochondria and creating extreme environmental conditions in aquatic ecosystems. The mechanistic basis of adaptation to perpetual exposure to H2S remains poorly understood. We investigated evolutionarily independent lineages of livebearing fishes that have colonized and adapted to springs rich in H2S and compared their genome-wide gene expression patterns with closely related lineages from adjacent, nonsulfidic streams. Significant differences in gene expression were uncovered between all sulfidic and nonsulfidic population pairs. Variation in the number of differentially expressed genes among population pairs corresponded to differences in divergence times and rates of gene flow, which is consistent with neutral drift driving a substantial portion of gene expression variation among populations. Accordingly, there was little evidence for convergent evolution shaping large-scale gene expression patterns among independent sulfide spring populations. Nonetheless, we identified a small number of genes that was consistently differentially expressed in the same direction in all sulfidic and nonsulfidic population pairs. Functional annotation of shared differentially expressed genes indicated upregulation of genes associated with enzymatic H2S detoxification and transport of oxidized sulfur species, oxidative phosphorylation, energy metabolism, and pathways involved in responses to oxidative stress. Overall, our results suggest that modification of processes associated with H2S detoxification and toxicity likely complement each other to mediate elevated H2S tolerance in sulfide spring fishes. Our analyses allow for the development of novel hypotheses about biochemical and physiological mechanisms of adaptation to extreme environments.

    View details for DOI 10.1093/molbev/msw020

    View details for Web of Science ID 000376170300003

    View details for PubMedID 26861137

    View details for PubMedCentralID PMC4868117

  • Efficient analysis of large datasets and sex bias with ADMIXTURE BMC BIOINFORMATICS Shringarpure, S. S., Bustamante, C. D., Lange, K., Alexander, D. H. 2016; 17


    A number of large genomic datasets are being generated for studies of human ancestry and diseases. The ADMIXTURE program is commonly used to infer individual ancestry from genomic data. We describe two improvements to the ADMIXTURE software. The first enables ADMIXTURE to infer ancestry for a new set of individuals using cluster allele frequencies from a reference set of individuals. Using data from the 1000 Genomes Project, we show that this allows ADMIXTURE to infer ancestry for 10,920 individuals in a few hours (a 5× speedup). This mode also allows ADMIXTURE to correctly estimate individual ancestry and allele frequencies from a set of related individuals. The second modification allows ADMIXTURE to correctly handle X-chromosome (and other haploid) data from both males and females. We demonstrate increased power to detect sex-biased admixture in African-American individuals from the 1000 Genomes project using this extension. These modifications make ADMIXTURE more efficient and versatile, allowing users to extract more information from large genomic datasets.

    View details for DOI 10.1186/s12859-016-1082-x

    View details for Web of Science ID 000376258900001

    View details for PubMedID 27216439

    View details for PubMedCentralID PMC4877806

  • The Great Migration and African-American Genomic Diversity PLOS GENETICS Baharian, S., Barakatt, M., Gignoux, C. R., Shringarpure, S., Errington, J., Blot, W. J., Bustamante, C. D., Kenny, E. E., Williams, S. M., Aldrich, M. C., Gravel, S. 2016; 12 (5)


    We present a comprehensive assessment of genomic diversity in the African-American population by studying three genotyped cohorts comprising 3,726 African-Americans from across the United States that provide a representative description of the population across all US states and socioeconomic status. An estimated 82.1% of ancestors to African-Americans lived in Africa prior to the advent of transatlantic travel, 16.7% in Europe, and 1.2% in the Americas, with increased African ancestry in the southern United States compared to the North and West. Combining demographic models of ancestry and those of relatedness suggests that admixture occurred predominantly in the South prior to the Civil War and that ancestry-biased migration is responsible for regional differences in ancestry. We find that recent migrations also caused a strong increase in genetic relatedness among geographically distant African-Americans. Long-range relatedness among African-Americans and between African-Americans and European-Americans thus track north- and west-bound migration routes followed during the Great Migration of the twentieth century. By contrast, short-range relatedness patterns suggest comparable mobility of ∼15-16 km per generation for African-Americans and European-Americans, as estimated using a novel analytical model of isolation-by-distance.

    View details for DOI 10.1371/journal.pgen.1006059

    View details for Web of Science ID 000377197100057

    View details for PubMedID 27232753

    View details for PubMedCentralID PMC4883799

  • A research roadmap for next-generation sequencing informatics SCIENCE TRANSLATIONAL MEDICINE Altman, R. B., Prabhu, S., Sidow, A., Zook, J. M., Goldfeder, R., Litwack, D., Ashley, E., Asimenos, G., Bustamante, C. D., Donigan, K., Giacomini, K. M., Johansen, E., Khuri, N., Lee, E., Liang, X. S., Salit, M., Serang, O., Tezak, Z., Wall, D. P., Mansfield, E., Kass-Hout, T. 2016; 8 (335)


    Next-generation sequencing technologies are fueling a wave of new diagnostic tests. Progress on a key set of nine research challenge areas will help generate the knowledge required to advance effectively these diagnostics to the clinic.

    View details for DOI 10.1126/scitranslmed.aaf7314

    View details for Web of Science ID 000374412300003

    View details for PubMedID 27099173

  • The Time Scale of Recombination Rate Evolution in Great Apes MOLECULAR BIOLOGY AND EVOLUTION Stevison, L. S., Woerner, A. E., Kidd, J. M., Kelley, J. L., Veeramah, K. R., McManus, K. F., Bustamante, C. D., Hammer, M. F., Wall, J. D. 2016; 33 (4): 928-945


    We present three linkage-disequilibrium (LD)-based recombination maps generated using whole-genome sequence data from 10 Nigerian chimpanzees, 13 bonobos, and 15 western gorillas, collected as part of the Great Ape Genome Project (Prado-Martinez J, et al. 2013. Great ape genetic diversity and population history. Nature 499:471-475). We also identified species-specific recombination hotspots in each group using a modified LDhot framework, which greatly improves statistical power to detect hotspots at varying strengths. We show that fewer hotspots are shared among chimpanzee subspecies than within human populations, further narrowing the time scale of complete hotspot turnover. Further, using species-specific PRDM9 sequences to predict potential binding sites (PBS), we show higher predicted PRDM9 binding in recombination hotspots as compared to matched cold spot regions in multiple great ape species, including at least one chimpanzee subspecies. We found that correlations between broad-scale recombination rates decline more rapidly than nucleotide divergence between species. We also compared the skew of recombination rates at centromeres and telomeres between species and show a skew from chromosome means extending as far as 10-15 Mb from chromosome ends. Further, we examined broad-scale recombination rate changes near a translocation in gorillas and found minimal differences as compared to other great ape species perhaps because the coordinates relative to the chromosome ends were unaffected. Finally, on the basis of multiple linear regression analysis, we found that various correlates of recombination rate persist throughout the African great apes including repeats, diversity, and divergence. Our study is the first to analyze within- and between-species genome-wide recombination rate variation in several close relatives.

    View details for DOI 10.1093/molbev/msv331

    View details for PubMedID 26671457

  • Demographically-Based Evaluation of Genomic Regions under Selection in Domestic Dogs PLOS GENETICS Freedman, A. H., Schweizer, R. M., Ortega-Del Vecchyo, D., Han, E., Davis, B. W., Gronau, I., Silva, P. M., Galaverni, M., Fan, Z., Marx, P., Lorente-Galdos, B., Ramirez, O., Hormozdiari, F., Alkan, C., Vila, C., Squire, K., Geffen, E., Kusak, J., Boyko, A. R., Parker, H. G., Lee, C., Tadigotla, V., Siepel, A., Bustamante, C. D., Harkins, T. T., Nelson, S. F., Marques-Bonet, T., Ostrander, E. A., Wayne, R. K., Novembre, J. 2016; 12 (3)


    Controlling for background demographic effects is important for accurately identifying loci that have recently undergone positive selection. To date, the effects of demography have not yet been explicitly considered when identifying loci under selection during dog domestication. To investigate positive selection on the dog lineage early in domestication, we examined patterns of polymorphism in six canid genomes that were previously used to infer a demographic model of dog domestication. Using an inferred demographic model, we computed false discovery rates (FDR) and identified 349 outlier regions consistent with positive selection at a low FDR. The signals in the top 100 regions were frequently centered on candidate genes related to brain function and behavior, including LHFPL3, CADM2, GRIK3, SH3GL2, MBP, PDE7B, NTAN1, and GLRA1. These regions contained significant enrichments in behavioral ontology categories. The 3rd top hit, CCRN4L, plays a major role in lipid metabolism, which is supported by additional metabolism-related candidates revealed in our scan, including SCP2D1 and PDXC1. Comparing our method to an empirical outlier approach that does not directly account for demography, we found only modest overlaps between the two methods, with 60% of empirical outliers having no overlap with our demography-based outlier detection approach. Demography-aware approaches have lower rates of false discovery. Our top candidates for selection, in addition to expanding the set of neurobehavioral candidate genes, include genes related to lipid metabolism, suggesting a dietary target of selection that was important during the period when proto-dogs hunted and fed alongside hunter-gatherers.

    View details for DOI 10.1371/journal.pgen.1005851

    View details for Web of Science ID 000373268900006

    View details for PubMedID 26943675

    View details for PubMedCentralID PMC4778760

  • GBStools: A Statistical Method for Estimating Allelic Dropout in Reduced Representation Sequencing Data. PLoS genetics Cooke, T. F., Yee, M., Muzzio, M., Sockell, A., Bell, R., Cornejo, O. E., Kelley, J. L., Bailliet, G., Bravi, C. M., Bustamante, C. D., Kenny, E. E. 2016; 12 (2)


    Reduced representation sequencing methods such as genotyping-by-sequencing (GBS) enable low-cost measurement of genetic variation without the need for a reference genome assembly. These methods are widely used in genetic mapping and population genetics studies, especially with non-model organisms. Variant calling error rates, however, are higher in GBS than in standard sequencing, in particular due to restriction site polymorphisms, and few computational tools exist that specifically model and correct these errors. We developed a statistical method to remove errors caused by restriction site polymorphisms, implemented in the software package GBStools. We evaluated it in several simulated data sets, varying in number of samples, mean coverage and population mutation rate, and in two empirical human data sets (N = 8 and N = 63 samples). In our simulations, GBStools improved genotype accuracy more than commonly used filters such as Hardy-Weinberg equilibrium p-values. GBStools is most effective at removing genotype errors in data sets over 100 samples when coverage is 40X or higher, and the improvement is most pronounced in species with high genomic diversity. We also demonstrate the utility of GBS and GBStools for human population genetic inference in Argentine populations and reveal widely varying individual ancestry proportions and an excess of singletons, consistent with recent population growth.

    View details for DOI 10.1371/journal.pgen.1005631

    View details for PubMedID 26828719

    View details for PubMedCentralID PMC4734769

  • Distance from sub-Saharan Africa predicts mutational load in diverse human genomes PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Henn, B. M., Botigue, L. R., Peischl, S., Dupanloup, I., Lipatov, M., Maples, B. K., Martin, A. R., Musharoff, S., Cann, H., Snyder, M. P., Excoffier, L., Kidd, J. M., Bustamante, C. D. 2016; 113 (4): E440-E449
  • An Efficient Multiple-Testing Adjustment for eQTL Studies that Accounts for Linkage Disequilibrium between Variants AMERICAN JOURNAL OF HUMAN GENETICS Davis, J. R., Fresard, L., Knowles, D. A., Pala, M., Bustamante, C. D., Battle, A., Montgomery, S. B. 2016; 98 (1): 216-224
  • The African Turquoise Killifish Genome Provides Insights into Evolution and Genetic Architecture of Lifespan CELL Valenzano, D. R., Benayoun, B. A., Singh, P. P., Zhang, E., Etter, P. D., Hu, C., Clement-Ziza, M., Willemsen, D., Cui, R., Harel, I., Machado, B. E., Yee, M., Sharp, S. C., Bustamante, C. D., Beyer, A., Johnson, E. A., Brunet, A. 2015; 163 (6): 1539-1554


    Lifespan is a remarkably diverse trait ranging from a few days to several hundred years in nature, but the mechanisms underlying the evolution of lifespan differences remain elusive. Here we de novo assemble a reference genome for the naturally short-lived African turquoise killifish, providing a unique resource for comparative and experimental genomics. The identification of genes under positive selection in this fish reveals potential candidates to explain its compressed lifespan. Several aging genes are under positive selection in this short-lived fish and long-lived species, raising the intriguing possibility that the same gene could underlie evolution of both compressed and extended lifespans. Comparative genomics and linkage analysis identify candidate genes associated with lifespan differences between various turquoise killifish strains. Remarkably, these genes are clustered on the sex chromosome, suggesting that short lifespan might have co-evolved with sex determination. Our study provides insights into the evolutionary forces that shape lifespan in nature.

    View details for DOI 10.1016/j.cell.2015.11.008

    View details for Web of Science ID 000366044800024

    View details for PubMedID 26638078

    View details for PubMedCentralID PMC4684691

  • Genomic Insights into the Ancestry and Demographic History of South America. PLoS genetics Homburger, J. R., Moreno-Estrada, A., Gignoux, C. R., Nelson, D., Sanchez, E., Ortiz-Tello, P., Pons-Estel, B. A., Acevedo-Vasquez, E., Miranda, P., Langefeld, C. D., Gravel, S., Alarcón-Riquelme, M. E., Bustamante, C. D. 2015; 11 (12)


    South America has a complex demographic history shaped by multiple migration and admixture events in pre- and post-colonial times. Settled over 14,000 years ago by Native Americans, South America has experienced migrations of European and African individuals, similar to other regions in the Americas. However, the timing and magnitude of these events resulted in markedly different patterns of admixture throughout Latin America. We use genome-wide SNP data for 437 admixed individuals from 5 countries (Colombia, Ecuador, Peru, Chile, and Argentina) to explore the population structure and demographic history of South American Latinos. We combined these data with population reference panels from Africa, Asia, Europe and the Americas to perform global ancestry analysis and infer the subcontinental origin of the European and Native American ancestry components of the admixed individuals. By applying ancestry-specific PCA analyses we find that most of the European ancestry in South American Latinos is from the Iberian Peninsula; however, many individuals trace their ancestry back to Italy, especially within Argentina. We find a strong gradient in the Native American ancestry component of South American Latinos associated with country of origin and the geography of local indigenous populations. For example, Native American genomic segments in Peruvians show greater affinities with Andean indigenous peoples like Quechua and Aymara, whereas Native American haplotypes from Colombians tend to cluster with Amazonian and coastal tribes from northern South America. Using ancestry tract length analysis we modeled post-colonial South American migration history as the youngest in Latin America during European colonization (9-14 generations ago), with an additional strong pulse of European migration occurring between 3 and 9 generations ago. These genetic footprints can impact our understanding of population-level differences in biomedical traits and, thus, inform future medical genetic studies in the region.

    View details for DOI 10.1371/journal.pgen.1005602

    View details for PubMedID 26636962

    View details for PubMedCentralID PMC4670080

  • Chemically tunable mucin chimeras assembled on living cells PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Kramer, J. R., Onoa, B., Bustamante, C., Bertozzi, C. R. 2015; 112 (41): 12574-12579

    View details for DOI 10.1073/pnas.1516127112

    View details for Web of Science ID 000363130900023

    View details for PubMedID 26420872

  • Genomic evidence for the Pleistocene and recent population history of Native Americans SCIENCE Raghavan, M., Steinruecken, M., Harris, K., Schiffels, S., Rasmussen, S., DeGiorgio, M., Albrechtsen, A., Valdiosera, C., Avila-Arcos, M. C., Malaspinas, A., Eriksson, A., Moltke, I., Metspalu, M., Homburger, J. R., Wall, J., Cornejo, O. E., Moreno-Mayar, J. V., Korneliussen, T. S., Pierre, T., Rasmussen, M., Campos, P. F., Damgaard, P. D., Allentoft, M. E., Lindo, J., Metspalu, E., Rodriguez-Varela, R., Mansilla, J., Henrickson, C., Seguin-Orlando, A., Malmstrom, H., Stafford, T., Shringarpure, S. S., Moreno-Estrada, A., Karmin, M., Tambets, K., Bergstrom, A., Xue, Y., Warmuth, V., Friend, A. D., Singarayer, J., Valdes, P., Balloux, F., Leboreiro, I., Vera, J. L., Rangel-Villalobos, H., Pettener, D., Luiselli, D., Davis, L. G., Heyer, E., Zollikofer, C. P., de Leon, M. S., Smith, C. I., Grimes, V., Pike, K., Deal, M., Fuller, B. T., Arriaza, B., Standen, V., Luz, M. F., Ricaut, F., Guidon, N., Osipova, L., Voevoda, M. I., Posukh, O. L., Balanovsky, O., Lavryashina, M., Bogunov, Y., Khusnutdinova, E., Gubina, M., Balanovska, E., Fedorova, S., Litvinov, S., Malyarchuk, B., Derenko, M., Mosher, M. J., Archer, D., Cybulski, J., Petzelt, B., Mitchell, J., Worl, R., Norman, P. J., Parham, P., Kemp, B. M., Kivisild, T., Tyler-Smith, C., Sandhu, M. S., Crawford, M., Villems, R., Smith, D. G., Waters, M. R., Goebel, T., Johnson, J. R., Malhi, R. S., Jakobsson, M., Meltzer, D. J., Manica, A., Durbin, R., Bustamante, C. D., Song, Y. S., Nielsen, R., Willerslev, E. 2015; 349 (6250)
  • The ancestry and affiliations of Kennewick Man. Nature Rasmussen, M., Sikora, M., Albrechtsen, A., Korneliussen, T. S., Moreno-Mayar, J. V., Poznik, G. D., Zollikofer, C. P., Ponce De León, M. S., Allentoft, M. E., Moltke, I., Jónsson, H., Valdiosera, C., Malhi, R. S., Orlando, L., Bustamante, C. D., Stafford, T. W., Meltzer, D. J., Nielsen, R., Willerslev, E. 2015; 523 (7561): 455-458


    Kennewick Man, referred to as the Ancient One by Native Americans, is a male human skeleton discovered in Washington state (USA) in 1996 and initially radiocarbon dated to 8,340-9,200 calibrated years before present (BP). His population affinities have been the subject of scientific debate and legal controversy. Based on an initial study of cranial morphology it was asserted that Kennewick Man was neither Native American nor closely related to the claimant Plateau tribes of the Pacific Northwest, who claimed ancestral relationship and requested repatriation under the Native American Graves Protection and Repatriation Act (NAGPRA). The morphological analysis was important to judicial decisions that Kennewick Man was not Native American and that therefore NAGPRA did not apply. Instead of repatriation, additional studies of the remains were permitted. Subsequent craniometric analysis affirmed Kennewick Man to be more closely related to circumpacific groups such as the Ainu and Polynesians than he is to modern Native Americans. In order to resolve Kennewick Man's ancestry and affiliations, we have sequenced his genome to ∼1× coverage and compared it to worldwide genomic data including for the Ainu and Polynesians. We find that Kennewick Man is closer to modern Native Americans than to any other population worldwide. Among the Native American groups for whom genome-wide data are available for comparison, several seem to be descended from a population closely related to that of Kennewick Man, including the Confederated Tribes of the Colville Reservation (Colville), one of the five tribes claiming Kennewick Man. We revisit the cranial analyses and find that, as opposed to genome-wide comparisons, it is not possible on that basis to affiliate Kennewick Man to specific contemporary groups. We therefore conclude based on genetic comparisons that Kennewick Man shows continuity with Native North Americans over at least the last eight millennia.

    View details for DOI 10.1038/nature14625

    View details for PubMedID 26087396

    View details for PubMedCentralID PMC4878456

  • Achieving high-sensitivity for clinical applications using augmented exome sequencing GENOME MEDICINE Patwardhan, A., Harris, J., Leng, N., Bartha, G., Church, D. M., Luo, S., Haudenschild, C., Pratt, M., Zook, J., Salit, M., Tirch, J., Morra, M., Chervitz, S., Li, M., Clark, M., Garcia, S., Chandratillake, G., Kirk, S., Ashley, E., Snyder, M., Altman, R., Bustamante, C., Butte, A. J., West, J., Chen, R. 2015; 7

    View details for DOI 10.1186/s13073-015-0197-4

    View details for Web of Science ID 000359428300001

    View details for PubMedID 26269718

  • Inexpensive and Highly Reproducible Cloud-Based Variant Calling of 2,535 Human Genomes PLOS ONE Shringarpure, S. S., Carroll, A., De La Vega, F. M., Bustamante, C. D. 2015; 10 (6)

    View details for DOI 10.1371/journal.pone.0129277

    View details for Web of Science ID 000356933800023

    View details for PubMedID 26110529

  • ClinGen - The Clinical Genome Resource NEW ENGLAND JOURNAL OF MEDICINE Rehm, H. L., Berg, J. S., Brooks, L. D., Bustamante, C. D., Evans, J. P., Landrum, M. J., Ledbetter, D. H., Maglott, D. R., Martin, C. L., Nussbaum, R. L., Plon, S. E., Ramos, E. M., Sherry, S. T., Watson, M. S. 2015; 372 (23): 2235-2242

    View details for DOI 10.1056/NEJMsr1406261

    View details for Web of Science ID 000355955100013

    View details for PubMedID 26014595

  • Genome-wide association study and admixture mapping reveal new loci associated with total IgE levels in Latinos JOURNAL OF ALLERGY AND CLINICAL IMMUNOLOGY Pino-Yanes, M., Gignoux, C. R., Galanter, J. M., Levin, A. M., Campbell, C. D., Eng, C., Huntsman, S., Nishimura, K. K., Gourraud, P., Mohajeri, K., O'Roak, B. J., Hu, D., Mathias, R. A., Nguyen, E. A., Roth, L. A., Padhukasahasram, B., Moreno-Estrada, A., Sandoval, K., Winkler, C. A., Lurmann, F., Davis, A., Farber, H. J., Meade, K., Avila, P. C., Serebrisky, D., Chapela, R., Ford, J. G., LeNoir, M. A., Thyne, S. M., Brigino-Buenaventura, E., Borrell, L. N., Rodriguez-Cintron, W., Sen, S., Kumar, R., Rodriguez-Santana, J. R., Bustamante, C. D., Martinez, F. D., Raby, B. A., Weiss, S. T., Nicolae, D. L., Ober, C., Meyers, D. A., Bleecker, E. R., Mack, S. J., Hernandez, R. D., Eichler, E. E., Barnes, K. C., Williams, L. K., Torgerson, D. G., Burchard, E. G. 2015; 135 (6): 1502-1510


    IgE is a key mediator of allergic inflammation, and its levels are frequently increased in patients with allergic disorders. We sought to identify genetic variants associated with IgE levels in Latinos. We performed a genome-wide association study and admixture mapping of total IgE levels in 3334 Latinos from the Genes-environments & Admixture in Latino Americans (GALA II) study. Replication was evaluated in 454 Latinos, 1564 European Americans, and 3187 African Americans from independent studies. We confirmed associations of 6 genes identified by means of previous genome-wide association studies and identified a novel genome-wide significant association of a polymorphism in the zinc finger protein 365 gene (ZNF365) with total IgE levels (rs200076616, P = 2.3 × 10^-8). We next identified 4 admixture mapping peaks (6p21.32-p22.1, 13p22-31, 14q23.2, and 22q13.1) at which local African, European, and/or Native American ancestry was significantly associated with IgE levels. The most significant peak was 6p21.32-p22.1, where Native American ancestry was associated with lower IgE levels (P = 4.95 × 10^-8). All but 22q13.1 were replicated in an independent sample of Latinos, and 2 of the peaks were replicated in African Americans (6p21.32-p22.1 and 14q23.2). Fine mapping of 6p21.32-p22.1 identified 6 genome-wide significant single nucleotide polymorphisms in Latinos, 2 of which replicated in European Americans. Another single nucleotide polymorphism was peak-wide significant within 14q23.2 in African Americans (rs1741099, P = 3.7 × 10^-6) and replicated in non-African American samples (P = .011). We confirmed genetic associations at 6 genes and identified novel associations within ZNF365, HLA-DQA1, and 14q23.2. Our results highlight the importance of studying diverse multiethnic populations to uncover novel loci associated with total IgE levels.

    View details for DOI 10.1016/j.jaci.2014.10.033

    View details for Web of Science ID 000355933400013

    View details for PubMedID 25488688

    View details for PubMedCentralID PMC4458233

  • Beyond the reference genome. Nature biotechnology Bustamante, C. D., Rasmussen, M. 2015; 33 (6): 605-606

    View details for DOI 10.1038/nbt.3249

    View details for PubMedID 26057977

  • Genome-wide ancestry of 17th-century enslaved Africans from the Caribbean PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Schroeder, H., Avila-Arcos, M. C., Malaspinas, A., Poznik, G. D., Sandoval-Velasco, M., Carpenter, M. L., Moreno-Mayar, J. V., Sikora, M., Johnson, P. L., Allentoft, M. E., Samaniego, J. A., Haviser, J. B., Dee, M. W., Stafford, T. W., Salas, A., Orlando, L., Willerslev, E., Bustamante, C. D., Gilbert, M. T. 2015; 112 (12): 3669-3673


    Between 1500 and 1850, more than 12 million enslaved Africans were transported to the New World. The vast majority were shipped from West and West-Central Africa, but their precise origins are largely unknown. We used genome-wide ancient DNA analyses to investigate the genetic origins of three enslaved Africans whose remains were recovered on the Caribbean island of Saint Martin. We trace their origins to distinct subcontinental source populations within Africa, including Bantu-speaking groups from northern Cameroon and non-Bantu speakers living in present-day Nigeria and Ghana. To our knowledge, these findings provide the first direct evidence for the ethnic origins of enslaved Africans, at a time for which historical records are scarce, and demonstrate that genomic data provide another type of record that can shed new light on long-standing historical questions.

    View details for DOI 10.1073/pnas.1421784112

    View details for Web of Science ID 000351477000038

    View details for PubMedID 25755263

    View details for PubMedCentralID PMC4378422

  • Inexpensive and Highly Reproducible Cloud-Based Variant Calling of 2,535 Human Genomes. PloS one Shringarpure, S. S., Carroll, A., De La Vega, F. M., Bustamante, C. D. 2015; 10 (6)


    Population-scale sequencing of whole human genomes is becoming economically feasible; however, data management and analysis remain a formidable challenge for many research groups. Large sequencing studies, like the 1000 Genomes Project, have improved our understanding of human demography and the effect of rare genetic variation in disease. Variant calling on datasets of hundreds or thousands of genomes is time-consuming, expensive, and not easily reproducible given the myriad components of a variant calling pipeline. Here, we describe a cloud-based pipeline for joint variant calling in large samples using the Real Time Genomics population caller. We deployed the population caller on the Amazon cloud with the DNAnexus platform in order to achieve low-cost variant calling. Using our pipeline, we were able to identify 68.3 million variants in 2,535 samples from Phase 3 of the 1000 Genomes Project. By performing the variant calling in a parallel manner, the data was processed within 5 days at a compute cost of $7.33 per sample (a total cost of $18,590 for completed jobs and $21,805 for all jobs). Analysis of cost dependence and running time on the data size suggests that, given near linear scalability, cloud computing can be a cheap and efficient platform for analyzing even larger sequencing studies in the future.

    View details for DOI 10.1371/journal.pone.0129277

    View details for PubMedID 26110529

    View details for PubMedCentralID PMC4482534

  • Achieving high-sensitivity for clinical applications using augmented exome sequencing. Genome medicine Patwardhan, A., Harris, J., Leng, N., Bartha, G., Church, D. M., Luo, S., Haudenschild, C., Pratt, M., Zook, J., Salit, M., Tirch, J., Morra, M., Chervitz, S., Li, M., Clark, M., Garcia, S., Chandratillake, G., Kirk, S., Ashley, E., Snyder, M., Altman, R., Bustamante, C., Butte, A. J., West, J., Chen, R. 2015; 7 (1): 71


    Whole exome sequencing is increasingly used for the clinical evaluation of genetic disease, yet the variation of coverage and sensitivity over medically relevant parts of the genome remains poorly understood. Several sequencing-based assays continue to provide coverage that is inadequate for clinical assessment. Using sequence data obtained from the NA12878 reference sample and pre-defined lists of medically-relevant protein-coding and noncoding sequences, we compared the breadth and depth of coverage obtained among four commercial exome capture platforms and whole genome sequencing. In addition, we evaluated the performance of an augmented exome strategy, ACE, that extends coverage in medically relevant regions and enhances coverage in areas that are challenging to sequence. Leveraging reference call-sets, we also examined the effects of improved coverage on variant detection sensitivity. We observed coverage shortfalls with each of the conventional exome-capture and whole-genome platforms across several medically interpretable genes. These gaps included areas of the genome required for reporting recently established secondary findings (ACMG) and known disease-associated loci. The augmented exome strategy recovered many of these gaps, resulting in improved coverage in these areas. At clinically-relevant coverage levels (100% bases covered at ≥20×), ACE improved coverage among genes in the medically interpretable genome (>90% covered relative to 10-78% with other platforms), the set of ACMG secondary finding genes (91% covered relative to 4-75% with other platforms) and a subset of variants known to be associated with human disease (99% covered relative to 52-95% with other platforms). Improved coverage translated into improvements in sensitivity, with ACE variant detection sensitivities (>97.5% SNVs, >92.5% InDels) exceeding that observed with conventional whole-exome and whole-genome platforms. Clinicians should consider analytical performance when making clinical assessments, given that even a few missed variants can lead to reporting false negative results. An augmented exome strategy provides a level of coverage not achievable with other platforms, thus addressing concerns regarding the lack of sensitivity in clinically important regions. In clinical applications where comprehensive coverage of medically interpretable areas of the genome requires higher localized sequencing depth, an augmented exome approach offers both cost and performance advantages over other sequencing-based tests.

    View details for DOI 10.1186/s13073-015-0197-4

    View details for PubMedID 26269718

    View details for PubMedCentralID PMC4534066

  • Population Genomic Analysis Reveals a Rich Speciation and Demographic History of Orang-utans (Pongo pygmaeus and Pongo abelii) PLOS ONE Ma, X., Kelley, J. L., Eilertson, K., Musharoff, S., Degenhardt, J. D., Martins, A. L., Vinar, T., Kosiol, C., Siepel, A., Gutenkunst, R. N., Bustamante, C. D. 2013; 8 (10)

    View details for DOI 10.1371/journal.pone.0077175

    View details for Web of Science ID 000326037000047

    View details for PubMedID 24194868

  • Gene flow from North Africa contributes to differential human genetic diversity in southern Europe PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Botigue, L. R., Henn, B. M., Gravel, S., Maples, B. K., Gignoux, C. R., Corona, E., Atzmon, G., Burns, E., Ostrer, H., Flores, C., Bertranpetit, J., Comas, D., Bustamante, C. D. 2013; 110 (29): 11791-11796


    Human genetic diversity in southern Europe is higher than in other regions of the continent. This difference has been attributed to postglacial expansions, the demic diffusion of agriculture from the Near East, and gene flow from Africa. Using SNP data from 2,099 individuals in 43 populations, we show that estimates of recent shared ancestry between Europe and Africa are substantially increased when gene flow from North Africans, rather than Sub-Saharan Africans, is considered. The gradient of North African ancestry accounts for previous observations of low levels of sharing with Sub-Saharan Africa and is independent of recent gene flow from the Near East. The source of genetic diversity in southern Europe has important biomedical implications; we find that most disease risk alleles from genome-wide association studies follow expected patterns of divergence between Europe and North Africa, with the principal exception of multiple sclerosis.

    View details for DOI 10.1073/pnas.1306223110

    View details for Web of Science ID 000322086100040

    View details for PubMedID 23733930

  • Analysis of the genetic basis of disease in the context of worldwide human relationships and migration. PLoS genetics Corona, E., Chen, R., Sikora, M., Morgan, A. A., Patel, C. J., Ramesh, A., Bustamante, C. D., Butte, A. J. 2013; 9 (5)


    Genetic diversity across different human populations can enhance understanding of the genetic basis of disease. We calculated the genetic risk of 102 diseases in 1,043 unrelated individuals across 51 populations of the Human Genome Diversity Panel. We found that genetic risk for type 2 diabetes and pancreatic cancer decreased as humans migrated toward East Asia. In addition, biliary liver cirrhosis, alopecia areata, bladder cancer, inflammatory bowel disease, membranous nephropathy, systemic lupus erythematosus, systemic sclerosis, ulcerative colitis, and vitiligo have undergone genetic risk differentiation. This analysis represents a large-scale attempt to characterize genetic risk differentiation in the context of migration. We anticipate that our findings will enable detailed analysis pertaining to the driving forces behind genetic risk differentiation.

    View details for DOI 10.1371/journal.pgen.1003447

    View details for PubMedID 23717210

    View details for PubMedCentralID PMC3662561

  • Evolutionary and Population Genomics of the Cavity Causing Bacteria Streptococcus mutans MOLECULAR BIOLOGY AND EVOLUTION Cornejo, O. E., Lefebure, T., Bitar, P. D., Lang, P., Richards, V. P., Eilertson, K., Thuy Do, T., Beighton, D., Zeng, L., Ahn, S., Burne, R. A., Siepel, A., Bustamante, C. D., Stanhope, M. J. 2013; 30 (4): 881-893


    Streptococcus mutans is widely recognized as one of the key etiological agents of human dental caries. Despite its role in this important disease, our present knowledge of gene content variability across the species and its relationship to adaptation is minimal. Estimates of its demographic history are not available. In this study, we generated genome sequences of 57 S. mutans isolates, as well as representative strains of the most closely related species to S. mutans (S. ratti, S. macacae, and S. criceti), to identify the overall structure and potential adaptive features of the dispensable and core components of the genome. We also performed population genetic analyses on the core genome of the species aimed at understanding the demographic history and the impact of selection shaping its genetic variation. The maximum gene content divergence among strains was approximately 23%, with the majority of strains diverging by 5-15%. The core genome consisted of 1,490 genes and the pan-genome approximately 3,296. Maximum likelihood analysis of the synonymous site frequency spectrum (SFS) suggested that the S. mutans population started expanding exponentially approximately 10,000 years ago (95% confidence interval [CI]: 3,268-14,344 years ago), coincidental with the onset of human agriculture. Analysis of the replacement SFS indicated that a majority of these substitutions are under strong negative selection, and the remainder evolved neutrally. A set of 14 genes was identified as being under positive selection, most of which were involved in either sugar metabolism or acid tolerance. Analysis of the core genome suggested that among 73 genes present in all isolates of S. mutans but absent in other species of the mutans taxonomic group, the majority can be associated with metabolic processes that could have contributed to the successful adaptation of S. mutans to its new niche, the human mouth, and with the dietary changes that accompanied the origin of agriculture.

    View details for DOI 10.1093/molbev/mss278

    View details for Web of Science ID 000317002300017

    View details for PubMedID 23228887

  • High-throughput two-dimensional root system phenotyping platform facilitates genetic analysis of root growth and development PLANT CELL AND ENVIRONMENT Clark, R. T., Famoso, A. N., Zhao, K., Shaff, J. E., Craft, E. J., Bustamante, C. D., McCouch, S. R., Aneshansley, D. J., Kochian, L. V. 2013; 36 (2): 454-466


    High-throughput phenotyping of root systems requires a combination of specialized techniques and adaptable plant growth, root imaging and software tools. A custom phenotyping platform was designed to capture images of whole root systems, and novel software tools were developed to process and analyse these images. The platform and its components are adaptable to a wide range of root phenotyping studies using diverse growth systems (hydroponics, paper pouches, gel and soil) involving several plant species, including, but not limited to, rice, maize, sorghum, tomato and Arabidopsis. The RootReader2D software tool is free and publicly available and was designed with both user-guided and automated features that increase flexibility and enhance efficiency when measuring root growth traits from specific roots or entire root systems during large-scale phenotyping studies. To demonstrate the unique capabilities and high-throughput capacity of this phenotyping platform for studying root systems, genome-wide association studies on rice (Oryza sativa) and maize (Zea mays) root growth were performed and root traits related to aluminium (Al) tolerance were analysed on the parents of the maize nested association mapping (NAM) population.

    View details for DOI 10.1111/j.1365-3040.2012.02587.x

    View details for Web of Science ID 000312997700017

    View details for PubMedID 22860896

  • Population genomic analysis reveals a rich speciation and demographic history of orang-utans (Pongo pygmaeus and Pongo abelii). PloS one Ma, X., Kelley, J. L., Eilertson, K., Musharoff, S., Degenhardt, J. D., Martins, A. L., Vinar, T., Kosiol, C., Siepel, A., Gutenkunst, R. N., Bustamante, C. D. 2013; 8 (10)


    To gain insights into evolutionary forces that have shaped the history of Bornean and Sumatran populations of orang-utans, we compare patterns of variation across more than 11 million single nucleotide polymorphisms found by previous mitochondrial and autosomal genome sequencing of 10 wild-caught orang-utans. Our analysis of the mitochondrial data yields a far more ancient split time between the two populations (~3.4 million years ago) than estimates based on autosomal data (0.4 million years ago), suggesting a complex speciation process with moderate levels of primarily male migration. We find that the distribution of selection coefficients consistent with the observed frequency spectrum of autosomal non-synonymous polymorphisms in orang-utans is similar to the distribution in humans. Our analysis indicates that 35% of genes have evolved under detectable negative selection. Overall, our findings suggest that purifying natural selection, genetic drift, and a complex demographic history are the dominant drivers of genome evolution for the two orang-utan populations.

    View details for DOI 10.1371/journal.pone.0077175

    View details for PubMedID 24194868

  • SnIPRE: Selection Inference Using a Poisson Random Effects Model PLOS COMPUTATIONAL BIOLOGY Eilertson, K. E., Booth, J. G., Bustamante, C. D. 2012; 8 (12)


    We present an approach for identifying genes under natural selection using polymorphism and divergence data from synonymous and non-synonymous sites within genes. A generalized linear mixed model is used to model the genome-wide variability among categories of mutations and estimate its functional consequence. We demonstrate how the model's estimated fixed and random effects can be used to identify genes under selection. The parameter estimates from our generalized linear model can be transformed to yield population genetic parameter estimates for quantities including the average selection coefficient for new mutations at a locus, the synonymous and non-synonymous mutation rates, and species divergence times. Furthermore, our approach incorporates stochastic variation due to the evolutionary process and can be fit using standard statistical software. The model is fit in both the empirical Bayes and Bayesian settings using the lme4 package in R, and Markov chain Monte Carlo methods in WinBUGS. Using simulated data we compare our method to existing approaches for detecting genes under selection: the McDonald-Kreitman test, and two versions of the Poisson random field based method MKprf. Overall, we find our method universally outperforms existing methods for detecting genes subject to selection using polymorphism and divergence data.

    View details for DOI 10.1371/journal.pcbi.1002806

    View details for Web of Science ID 000312901500013

    View details for PubMedID 23236270

  • The Possibility of De Novo Assembly of the Genome and Population Genomics of the Mangrove Rivulus, Kryptolebias marmoratus Annual Meeting of the Society-for-Integrative-and-Comparative-Biology (SICB)/Symposium on Mangrove Killifish - An Exemplar of Integrative Biology Kelley, J. L., Yee, M., Lee, C., Levandowsky, E., Shah, M., Harkins, T., Earley, R. L., Bustamante, C. D. OXFORD UNIV PRESS INC. 2012: 737–42


    How organisms adapt to the range of environments they encounter is a fundamental question in biology. Elucidating the genetic basis of adaptation is a difficult task, especially when the targets of selection are not known. Emerging sequencing technologies and assembly algorithms facilitate the genomic dissection of adaptation and population differentiation in a vast array of organisms. Here we describe the attributes of Kryptolebias marmoratus, one of two known self-fertilizing hermaphroditic vertebrates that make this fish an attractive genetic system and a model for understanding the genomics of adaptation. Long periods of selfing have resulted in populations composed of many distinct naturally homozygous strains with a variety of identifiable, and apparently heritable, phenotypes. There also is strong population genetic structure across a diverse range of mangrove habitats, making this a tractable system in which to study differentiation both within and among populations. The ability to rear K. marmoratus in the laboratory contributes further to its value as a model for understanding the genetic drivers for adaptation. To date, microsatellite markers distinguish wild isogenic strains but the naturally high homozygosity improves the quality of de novo assembly of the genome and facilitates the identification of genetic variants associated with phenotypes. Gene annotation can be accomplished with RNA-sequencing data in combination with de novo genome assembly. By combining genomic information with extensive laboratory-based phenotyping, it becomes possible to map genetic variants underlying differences in behavioral, life-history, and other potentially adaptive traits. Emerging genomic technologies provide the required resources for establishing K. marmoratus as a new model organism for behavioral genetics and evolutionary genetics research.

    View details for DOI 10.1093/icb/ics094

    View details for Web of Science ID 000311645400003

    View details for PubMedID 22723055

    View details for PubMedCentralID PMC3501098

  • Limited Evidence for Classic Selective Sweeps in African Populations GENETICS Granka, J. M., Henn, B. M., Gignoux, C. R., Kidd, J. M., Bustamante, C. D., Feldman, M. W. 2012; 192 (3): 1049-?


    While hundreds of loci have been identified as reflecting strong-positive selection in human populations, connections between candidate loci and specific selective pressures often remain obscure. This study investigates broader patterns of selection in African populations, which are underrepresented despite their potential to offer key insights into human adaptation. We scan for hard selective sweeps using several haplotype and allele-frequency statistics with a data set of nearly 500,000 genome-wide single-nucleotide polymorphisms in 12 highly diverged African populations that span a range of environments and subsistence strategies. We find that positive selection does not appear to be a strong determinant of allele-frequency differentiation among these African populations. Haplotype statistics do identify putatively selected regions that are shared across African populations. However, as assessed by extensive simulations, patterns of haplotype sharing between African populations follow neutral expectations and suggest that tails of the empirical distributions contain false-positive signals. After highlighting several genomic regions where positive selection can be inferred with higher confidence, we use a novel method to identify biological functions enriched among populations' empirical tail genomic windows, such as immune response in agricultural groups. In general, however, it seems that current methods for selection scans are poorly suited to populations that, like the African populations in this study, are affected by ascertainment bias and have low levels of linkage disequilibrium, possibly old selective sweeps, and potentially reduced phasing accuracy. Additionally, population history can confound the interpretation of selection statistics, suggesting that greater care is needed in attributing broad genetic patterns to human adaptation.

    View details for DOI 10.1534/genetics.112.144071

    View details for Web of Science ID 000310793900019

    View details for PubMedID 22960214

  • North African Populations Carry the Signature of Admixture with Neandertals PLOS ONE Sanchez-Quinto, F., Botigue, L. R., Civit, S., Arenas, C., Avila-Arcos, M. C., Bustamante, C. D., Comas, D., Lalueza-Fox, C. 2012; 7 (10)


    One of the main findings derived from the analysis of the Neandertal genome was the evidence for admixture between Neandertals and non-African modern humans. An alternative scenario is that the ancestral population of non-Africans was closer to Neandertals than to Africans because of ancient population substructure. Thus, the study of North African populations is crucial for testing both hypotheses. We analyzed a total of 780,000 SNPs in 125 individuals representing seven different North African locations and searched for their ancestral/derived state in comparison to different human populations and Neandertals. We found that North African populations have a significant excess of derived alleles shared with Neandertals, when compared to sub-Saharan Africans. This excess is similar to that found in non-African humans, a fact that can be interpreted as a sign of Neandertal admixture. Furthermore, the Neandertal's genetic signal is higher in populations with a local, pre-Neolithic North African ancestry. Therefore, the detected ancient admixture is not due to recent Near Eastern or European migrations. Sub-Saharan populations are the only ones not affected by the admixture event with Neandertals.

    View details for DOI 10.1371/journal.pone.0047765

    View details for Web of Science ID 000311146900109

    View details for PubMedID 23082212

  • Population Genetic Inference from Personal Genome Data: Impact of Ancestry and Admixture on Human Genomic Variation AMERICAN JOURNAL OF HUMAN GENETICS Kidd, J. M., Gravel, S., Byrnes, J., Moreno-Estrada, A., Musharoff, S., Bryc, K., Degenhardt, J. D., Brisbin, A., Sheth, V., Chen, R., McLaughlin, S. F., Peckham, H. E., Omberg, L., Chung, C. A., Stanley, S., Pearlstein, K., Levandowsky, E., Acevedo-Acevedo, S., Auton, A., Keinan, A., Acuna-Alonzo, V., Barquera-Lozano, R., Canizales-Quinteros, S., Eng, C., Burchard, E. G., Russell, A., Reynolds, A., Clark, A. G., Reese, M. G., Lincoln, S. E., Butte, A. T., De La Vega, F. M., Bustamante, C. D. 2012; 91 (4): 660-671


    Full sequencing of individual human genomes has greatly expanded our understanding of human genetic variation and population history. Here, we present a systematic analysis of 50 human genomes from 11 diverse global populations sequenced at high coverage. Our sample includes 12 individuals who have admixed ancestry and who have varying degrees of recent (within the last 500 years) African, Native American, and European ancestry. We found over 21 million single-nucleotide variants that contribute to a 1.75-fold range in nucleotide heterozygosity across diverse human genomes. This heterozygosity ranged from a high of one heterozygous site per kilobase in west African genomes to a low of 0.57 heterozygous sites per kilobase in segments inferred to have diploid Native American ancestry from the genomes of Mexican and Puerto Rican individuals. We show evidence of all three continental ancestries in the genomes of Mexican, Puerto Rican, and African American populations, and the genome-wide statistics are highly consistent across individuals from a population once ancestry proportions have been accounted for. Using a generalized linear model, we identified subtle variations across populations in the proportion of neutral versus deleterious variation and found that genome-wide statistics vary in admixed populations even once ancestry proportions have been factored in. We further infer that multiple periods of gene flow shaped the diversity of admixed populations in the Americas: 70% of the European ancestry in today's African Americans dates back to European gene flow happening only 7-8 generations ago.

    View details for DOI 10.1016/j.ajhg.2012.08.025

    View details for Web of Science ID 000309568500008

    View details for PubMedID 23040495

  • The genetic prehistory of southern Africa NATURE COMMUNICATIONS Pickrell, J. K., Patterson, N., Barbieri, C., Berthold, F., Gerlach, L., Gueldemann, T., Kure, B., Mpoloka, S. W., Nakagawa, H., Naumann, C., Lipson, M., Loh, P., Lachance, J., Mountain, J., Bustamante, C. D., Berger, B., Tishkoff, S. A., Henn, B. M., Stoneking, M., Reich, D., Pakendorf, B. 2012; 3

    View details for DOI 10.1038/ncomms2140

    View details for Web of Science ID 000313514100053

  • North African Jewish and non-Jewish populations form distinctive, orthogonal clusters PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Campbell, C. L., Palamara, P. F., Dubrovsky, M., Botigue, L. R., Fellous, M., Atzmon, G., Oddoux, C., Pearlman, A., Hao, L., Henn, B. M., Burns, E., Bustamante, C. D., Comas, D., Friedman, E., Pe'er, I., Ostrer, H. 2012; 109 (34): 13865-13870


    North African Jews constitute the second largest Jewish Diaspora group. However, their relatedness to each other; to European, Middle Eastern, and other Jewish Diaspora groups; and to their former North African non-Jewish neighbors has not been well defined. Here, genome-wide analysis of five North African Jewish groups (Moroccan, Algerian, Tunisian, Djerban, and Libyan) and comparison with other Jewish and non-Jewish groups demonstrated distinctive North African Jewish population clusters with proximity to other Jewish populations and variable degrees of Middle Eastern, European, and North African admixture. Two major subgroups (Moroccan/Algerian and Djerban/Libyan) were identified by principal component, neighbor joining tree, and identity-by-descent analysis; they varied in their degree of European admixture. These populations showed a high degree of endogamy and were part of a larger Ashkenazi and Sephardic Jewish group. By principal component analysis, these North African groups were orthogonal to contemporary populations from North and South Morocco, Western Sahara, Tunisia, Libya, and Egypt. Thus, this study is compatible with the history of North African Jews: founding during Classical Antiquity with proselytism of local populations, followed by genetic isolation with the rise of Christianity and then Islam, and admixture following the emigration of Sephardic Jews during the Inquisition.

    View details for DOI 10.1073/pnas.1204840109

    View details for Web of Science ID 000308085200081

    View details for PubMedID 22869716

  • PCAdmix: Principal Components-Based Assignment of Ancestry along Each Chromosome in Individuals with Admixed Ancestry from Two or More Populations HUMAN BIOLOGY Brisbin, A., Bryc, K., Byrnes, J., Zakharia, F., Omberg, L., Degenhardt, J., Reynolds, A., Ostrer, H., Mezey, J. G., Bustamante, C. D. 2012; 84 (4): 343-364


    Identifying ancestry along each chromosome in admixed individuals provides a wealth of information for understanding the population genetic history of admixture events and is valuable for admixture mapping and identifying recent targets of selection. We present PCAdmix (available at ), a Principal Components-based algorithm for determining ancestry along each chromosome from a high-density, genome-wide set of phased single-nucleotide polymorphism (SNP) genotypes of admixed individuals. We compare our method to HAPMIX on simulated data from two ancestral populations, and we find high concordance between the methods. Our method also has better accuracy than LAMP when applied to three-population admixture, a situation as yet unaddressed by HAPMIX. Finally, we apply our method to a data set of four Latino populations with European, African, and Native American ancestry. We find evidence of assortative mating in each of the four populations, and we identify regions of shared ancestry that may be recent targets of selection and could serve as candidate regions for admixture-based association mapping.

    View details for Web of Science ID 000313648400001

    View details for PubMedID 23249312

  • Variation of BMP3 Contributes to Dog Breed Skull Diversity PLOS GENETICS Schoenebeck, J. J., Hutchinson, S. A., Byers, A., Beale, H. C., Carrington, B., Faden, D. L., Rimbault, M., Decker, B., Kidd, J. M., Sood, R., Boyko, A. R., Fondon, J. W., Wayne, R. K., Bustamante, C. D., Ciruna, B., Ostrander, E. A. 2012; 8 (8)


    Since the beginnings of domestication, the craniofacial architecture of the domestic dog has morphed and radiated to human whims. By beginning to define the genetic underpinnings of breed skull shapes, we can elucidate mechanisms of morphological diversification while presenting a framework for understanding human cephalic disorders. Using intrabreed association mapping with museum specimen measurements, we show that skull shape is regulated by at least five quantitative trait loci (QTLs). Our detailed analysis using whole-genome sequencing uncovers a missense mutation in BMP3. Validation studies in zebrafish show that Bmp3 function in cranial development is ancient. Our study reveals the causal variant for a canine QTL contributing to a major morphologic trait.

    View details for DOI 10.1371/journal.pgen.1002849

    View details for Web of Science ID 000308529300012

    View details for PubMedID 22876193

    View details for PubMedCentralID PMC3410846

  • Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes SCIENCE Tennessen, J. A., Bigham, A. W., O'Connor, T. D., Fu, W., Kenny, E. E., Gravel, S., Mcgee, S., Do, R., Liu, X., Jun, G., Kang, H. M., Jordan, D., Leal, S. M., Gabriel, S., Rieder, M. J., Abecasis, G., Altshuler, D., Nickerson, D. A., Boerwinkle, E., Sunyaev, S., Bustamante, C. D., Bamshad, M. J., Akey, J. M. 2012; 337 (6090): 64-69


    As a first step toward understanding how rare variants contribute to risk for complex diseases, we sequenced 15,585 human protein-coding genes to an average median depth of 111× in 2440 individuals of European (n = 1351) and African (n = 1088) ancestry. We identified over 500,000 single-nucleotide variants (SNVs), the majority of which were rare (86% with a minor allele frequency less than 0.5%), previously unknown (82%), and population-specific (82%). On average, 2.3% of the 13,595 SNVs each person carried were predicted to affect protein function of ~313 genes per genome, and ~95.7% of SNVs predicted to be functionally important were rare. This excess of rare functional variants is due to the combined effects of explosive, recent accelerated population growth and weak purifying selection. Furthermore, we show that large sample sizes will be required to associate rare variants with complex traits.

    View details for DOI 10.1126/science.1219240

    View details for Web of Science ID 000306053100043

    View details for PubMedID 22604720

  • Randomized Trial of Personal Genomics for Preventive Cardiology Design and Challenges CIRCULATION-CARDIOVASCULAR GENETICS Knowles, J. W., Assimes, T. L., Kiernan, M., Pavlovic, A., Goldstein, B. A., Yank, V., McConnell, M. V., Absher, D., Bustamante, C., Ashley, E. A., Ioannidis, J. P. 2012; 5 (3): 368-376
  • Melanesian Blond Hair Is Caused by an Amino Acid Change in TYRP1 SCIENCE Kenny, E. E., Timpson, N. J., Sikora, M., Yee, M., Moreno-Estrada, A., Eng, C., Huntsman, S., Burchard, E. G., Stoneking, M., Bustamante, C. D., Myles, S. 2012; 336 (6081): 554-554


    Naturally blond hair is rare in humans and found almost exclusively in Europe and Oceania. Here, we identify an arginine-to-cysteine change at a highly conserved residue in tyrosinase-related protein 1 (TYRP1) as a major determinant of blond hair in Solomon Islanders. This missense mutation is predicted to affect catalytic activity of TYRP1 and causes blond hair through a recessive mode of inheritance. The mutation is at a frequency of 26% in the Solomon Islands, is absent outside of Oceania, represents a strong common genetic effect on a complex human phenotype, and highlights the importance of examining genetic associations worldwide.

    View details for DOI 10.1126/science.1217849

    View details for Web of Science ID 000303498800036

    View details for PubMedID 22556244

    View details for PubMedCentralID PMC3481182

  • Type 2 Diabetes Risk Alleles Demonstrate Extreme Directional Differentiation among Human Populations, Compared to Other Diseases PLOS GENETICS Chen, R., Corona, E., Sikora, M., Dudley, J. T., Morgan, A. A., Moreno-Estrada, A., Nilsen, G. B., Ruau, D., Lincoln, S. E., Bustamante, C. D., Butte, A. J. 2012; 8 (4): 100-115


    Many disease-susceptible SNPs exhibit significant disparity in ancestral and derived allele frequencies across worldwide populations. While previous studies have examined population differentiation of alleles at specific SNPs, global ethnic patterns of ensembles of disease risk alleles across human diseases are unexamined. To examine these patterns, we manually curated ethnic disease association data from 5,065 papers on human genetic studies representing 1,495 diseases, recording the precise risk alleles and their measured population frequencies and estimated effect sizes. We systematically compared the population frequencies of cross-ethnic risk alleles for each disease across 1,397 individuals from 11 HapMap populations, 1,064 individuals from 53 HGDP populations, and 49 individuals with whole-genome sequences from 10 populations. Type 2 diabetes (T2D) demonstrated extreme directional differentiation of risk allele frequencies across human populations, compared with null distributions of European-frequency matched control genomic alleles and risk alleles for other diseases. Most T2D risk alleles share a consistent pattern of decreasing frequencies along human migration into East Asia. Furthermore, we show that these patterns contribute to disparities in predicted genetic risk across 1,397 HapMap individuals, T2D genetic risk being consistently higher for individuals in the African populations and lower in the Asian populations, irrespective of the ethnicity considered in the initial discovery of risk alleles. We observed a similar pattern in the distribution of T2D Genetic Risk Scores, which are associated with an increased risk of developing diabetes in the Diabetes Prevention Program cohort, for the same individuals. This disparity may be attributable to the promotion of energy storage and usage appropriate to environments and inconsistent energy intake. Our results indicate that the differential frequencies of T2D risk alleles may contribute to the observed disparity in T2D incidence rates across ethnic populations.

    View details for DOI 10.1371/journal.pgen.1002621

    View details for Web of Science ID 000303441800007

    View details for PubMedID 22511877

    View details for PubMedCentralID PMC3325177

  • Detecting and annotating genetic variations using the HugeSeq pipeline NATURE BIOTECHNOLOGY Lam, H. Y., Pan, C., Clark, M. J., Lacroute, P., Chen, R., Haraksingh, R., O'Huallachain, M., Gerstein, M. B., Kidd, J. M., Bustamante, C. D., Snyder, M. 2012; 30 (3): 226-229

    View details for Web of Science ID 000301303800013

    View details for PubMedID 22398614

  • New insights into the Tyrolean Iceman's origin and phenotype as inferred by whole-genome sequencing NATURE COMMUNICATIONS Keller, A., Graefen, A., Ball, M., Matzas, M., Boisguerin, V., Maixner, F., Leidinger, P., Backes, C., Khairat, R., Forster, M., Stade, B., Franke, A., Mayer, J., Spangler, J., McLaughlin, S., Shah, M., Lee, C., Harkins, T. T., Sartori, A., Moreno-Estrada, A., Henn, B., Sikora, M., Semino, O., Chiaroni, J., Rootsi, S., Myres, N. M., Cabrera, V. M., Underhill, P. A., Bustamante, C. D., Vigl, E. E., Samadelli, M., Cipollini, G., Haas, J., Katus, H., O'Connor, B. D., Carlson, M. R., Meder, B., Blin, N., Meese, E., Pusch, C. M., Zink, A. 2012; 3


    The Tyrolean Iceman, a 5,300-year-old Copper age individual, was discovered in 1991 on the Tisenjoch Pass in the Italian part of the Ötztal Alps. Here we report the complete genome sequence of the Iceman and show 100% concordance between the previously reported mitochondrial genome sequence and the consensus sequence generated from our genomic data. We present indications for recent common ancestry between the Iceman and present-day inhabitants of the Tyrrhenian Sea, that the Iceman probably had brown eyes, belonged to blood group O and was lactose intolerant. His genetic predisposition shows an increased risk for coronary heart disease and may have contributed to the development of previously reported vascular calcifications. Sequences corresponding to ~60% of the genome of Borrelia burgdorferi are indicative of the earliest human case of infection with the pathogen for Lyme borreliosis.

    View details for DOI 10.1038/ncomms1701

    View details for Web of Science ID 000302060100039

    View details for PubMedID 22426219

  • The genetic prehistory of southern Africa NATURE COMMUNICATIONS Pickrell, J. K., Patterson, N., Barbieri, C., Berthold, F., Gerlach, L., Güldemann, T., Kure, B., Mpoloka, S. W., Nakagawa, H., Naumann, C., Lipson, M., Loh, P., Lachance, J., Mountain, J., Bustamante, C. D., Berger, B., Tishkoff, S. A., Henn, B. M., Stoneking, M., Reich, D., Pakendorf, B. 2012; 3: 1143-?


    Southern and eastern African populations that speak non-Bantu languages with click consonants are known to harbour some of the most ancient genetic lineages in humans, but their relationships are poorly understood. Here, we report data from 23 populations analysed at over half a million single-nucleotide polymorphisms, using a genome-wide array designed for studying human history. The southern African Khoisan fall into two genetic groups, loosely corresponding to the northwestern and southeastern Kalahari, which we show separated within the last 30,000 years. We find that all individuals derive at least a few percent of their genomes from admixture with non-Khoisan populations that began ∼1,200 years ago. In addition, the East African Hadza and Sandawe derive a fraction of their ancestry from admixture with a population related to the Khoisan, supporting the hypothesis of an ancient link between southern and eastern Africa.

    View details for DOI 10.1038/ncomms2140

    View details for PubMedID 23072811

  • Genomic Ancestry of North Africans Supports Back-to-Africa Migrations PLOS GENETICS Henn, B. M., Botigue, L. R., Gravel, S., Wang, W., Brisbin, A., Byrnes, J. K., Fadhlaoui-Zid, K., Zalloua, P. A., Moreno-Estrada, A., Bertranpetit, J., Bustamante, C. D., Comas, D. 2012; 8 (1)


    North African populations are distinct from sub-Saharan Africans based on cultural, linguistic, and phenotypic attributes; however, the time and the extent of genetic divergence between populations north and south of the Sahara remain poorly understood. Here, we interrogate the multilayered history of North Africa by characterizing the effect of hypothesized migrations from the Near East, Europe, and sub-Saharan Africa on current genetic diversity. We present dense, genome-wide SNP genotyping array data (730,000 sites) from seven North African populations, spanning from Egypt to Morocco, and one Spanish population. We identify a gradient of likely autochthonous Maghrebi ancestry that increases from east to west across northern Africa; this ancestry is likely derived from "back-to-Africa" gene flow more than 12,000 years ago (ya), prior to the Holocene. The indigenous North African ancestry is more frequent in populations with historical Berber ethnicity. In most North African populations we also see substantial shared ancestry with the Near East, and to a lesser extent sub-Saharan Africa and Europe. To estimate the time of migration from sub-Saharan populations into North Africa, we implement a maximum likelihood dating method based on the distribution of migrant tracts. In order to first identify migrant tracts, we assign local ancestry to haplotypes using a novel, principal component-based analysis of three ancestral populations. We estimate that a migration of western African origin into Morocco began about 40 generations ago (approximately 1,200 ya); a migration of individuals with Nilotic ancestry into Egypt occurred about 25 generations ago (approximately 750 ya). Our genomic data reveal an extraordinarily complex history of migrations, involving at least five ancestral populations, into North Africa.

    View details for DOI 10.1371/journal.pgen.1002397

    View details for Web of Science ID 000300223400001

    View details for PubMedID 22253600

    View details for PubMedCentralID PMC3257290

  • Mutation Hot Spots in Yeast Caused by Long-Range Clustering of Homopolymeric Sequences CELL REPORTS Ma, X., Rogacheva, M. V., Nishant, K. T., Zanders, S., Bustamante, C. D., Alani, E. 2012; 1 (1): 36-42


    Evolutionary theory assumes that mutations occur randomly in the genome; however, studies performed in a variety of organisms indicate the existence of context-dependent mutation biases. Sources of mutagenesis variation across large genomic contexts (e.g., hundreds of bases) have not been identified. Here, we use high-coverage whole-genome sequencing of a conditional mismatch repair mutant line of diploid yeast to identify mutations that accumulated after 160 generations of growth. The vast majority of the mutations accumulated as insertion/deletions (in/dels) in homopolymeric [poly(dA:dT)] and repetitive DNA tracts. Surprisingly, the likelihood of an in/del mutation in a given poly(dA:dT) tract is increased by the presence of nearby poly(dA:dT) tracts in up to a 1,000 bp region centered on the given tract. Our work suggests that specific mutation hot spots can contribute disproportionately to the genetic variation that is introduced into populations and provides long-range genomic sequence context that contributes to mutagenesis.

    View details for DOI 10.1016/j.celrep.2011.10.003

    View details for Web of Science ID 000309709500006

    View details for PubMedID 22832106

  • An Aboriginal Australian Genome Reveals Separate Human Dispersals into Asia SCIENCE Rasmussen, M., Guo, X., Wang, Y., Lohmueller, K. E., Rasmussen, S., Albrechtsen, A., Skotte, L., Lindgreen, S., Metspalu, M., Jombart, T., Kivisild, T., Zhai, W., Eriksson, A., Manica, A., Orlando, L., De La Vega, F. M., Tridico, S., Metspalu, E., Nielsen, K., Avila-Arcos, M. C., Moreno-Mayar, J. V., Muller, C., Dortch, J., Gilbert, M. T., Lund, O., Wesolowska, A., Karmin, M., Weinert, L. A., Wang, B., Li, J., Tai, S., Xiao, F., Hanihara, T., van Driem, G., Jha, A. R., Ricaut, F., de Knijff, P., Migliano, A. B., Romero, I. G., Kristiansen, K., Lambert, D. M., Brunak, S., Forster, P., Brinkmann, B., Nehlich, O., Bunce, M., Richards, M., Gupta, R., Bustamante, C. D., Krogh, A., Foley, R. A., Lahr, M. M., Balloux, F., Sicheritz-Ponten, T., Villems, R., Nielsen, R., Wang, J., Willerslev, E. 2011; 334 (6052): 94-98


    We present an Aboriginal Australian genomic sequence obtained from a 100-year-old lock of hair donated by an Aboriginal man from southern Western Australia in the early 20th century. We detect no evidence of European admixture and estimate contamination levels to be below 0.5%. We show that Aboriginal Australians are descendants of an early human dispersal into eastern Asia, possibly 62,000 to 75,000 years ago. This dispersal is separate from the one that gave rise to modern Asians 25,000 to 38,000 years ago. We also find evidence of gene flow between populations of the two dispersal waves prior to the divergence of Native Americans from modern Asian ancestors. Our findings support the hypothesis that present-day Aboriginal Australians descend from the earliest humans to occupy Australia, likely representing one of the oldest continuous populations outside Africa.

    View details for DOI 10.1126/science.1211177

    View details for Web of Science ID 000295580300047

    View details for PubMedID 21940856

  • SnapShot: Human Biomedical Genomics CELL Kenny, E. E., Bustamante, C. D. 2011; 147 (1): 248-U262

    View details for DOI 10.1016/j.cell.2011.09.020

    View details for Web of Science ID 000295396700031

    View details for PubMedID 21962520

  • Reply to Ge and Sang: A single origin of domesticated rice PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Molina, J., Sikora, M., Garud, N., Flowers, J. M., Rubinstein, S., Reynolds, A., Huang, P., Jackson, S. A., Schaal, B. A., Bustamante, C. D., Boyko, A. R., Purugganan, M. D. 2011; 108 (39): E756-E756
  • Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence PLOS GENETICS Dewey, F. E., Chen, R., Cordero, S. P., Ormond, K. E., Caleshu, C., Karczewski, K. J., Whirl-Carrillo, M., Wheeler, M. T., Dudley, J. T., Byrnes, J. K., Cornejo, O. E., Knowles, J. W., Woon, M., Sangkuhl, K., Gong, L., Thorn, C. F., Hebert, J. M., Capriotti, E., David, S. P., Pavlovic, A., West, A., Thakuria, J. V., Ball, M. P., Zaranek, A. W., Rehm, H. L., Church, G. M., West, J. S., Bustamante, C. D., Snyder, M., Altman, R. B., Klein, T. E., Butte, A. J., Ashley, E. A. 2011; 7 (9)


    Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation. Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of genomes from a nuclear family with history of familial thrombophilia. We demonstrate that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites to the lowest median resolution demonstrated to date (< 1,000 base pairs). We use family inheritance state analysis to control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific, family-based approaches to interpretation of genetic variation are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.

    View details for DOI 10.1371/journal.pgen.1002280

    View details for Web of Science ID 000295419100031

    View details for PubMedID 21935354

    View details for PubMedCentralID PMC3174201

  • Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa NATURE COMMUNICATIONS Zhao, K., Tung, C., Eizenga, G. C., Wright, M. H., Ali, M. L., Price, A. H., Norton, G. J., Islam, M. R., Reynolds, A., Mezey, J., McClung, A. M., Bustamante, C. D., McCouch, S. R. 2011; 2


    Asian rice, Oryza sativa, is a cultivated, inbreeding species that feeds over half of the world's population. Understanding the genetic basis of diverse physiological, developmental, and morphological traits provides the basis for improving yield, quality and sustainability of rice. Here we show the results of a genome-wide association study based on genotyping 44,100 SNP variants across 413 diverse accessions of O. sativa collected from 82 countries that were systematically phenotyped for 34 traits. Using cross-population-based mapping strategies, we identified dozens of common variants influencing numerous complex traits. Significant heterogeneity was observed in the genetic architecture associated with subpopulation structure and response to environment. This work establishes an open-source translational research platform for genome-wide association studies in rice that directly links molecular variation in genes and metabolic pathways with the germplasm resources needed to accelerate varietal development and crop improvement.

    View details for DOI 10.1038/ncomms1467

    View details for Web of Science ID 000294807200009

    View details for PubMedID 21915109

    View details for PubMedCentralID PMC3195253

  • A genome-wide perspective on the evolutionary history of enigmatic wolf-like canids GENOME RESEARCH vonHoldt, B. M., Pollinger, J. P., Earl, D. A., Knowles, J. C., Boyko, A. R., Parker, H., Geffen, E., Pilot, M., Jedrzejewski, W., Jedrzejewska, B., Sidorovich, V., Greco, C., Randi, E., Musiani, M., Kays, R., Bustamante, C. D., Ostrander, E. A., Novembre, J., Wayne, R. K. 2011; 21 (8): 1294-1305


    High-throughput genotyping technologies developed for model species can potentially increase the resolution of demographic history and ancestry in wild relatives. We use a SNP genotyping microarray developed for the domestic dog to assay variation in over 48K loci in wolf-like species worldwide. Despite the high mobility of these large carnivores, we find distinct hierarchical population units within gray wolves and coyotes that correspond with geographic and ecologic differences among populations. Further, we test controversial theories about the ancestry of the Great Lakes wolf and red wolf using an analysis of haplotype blocks across all 38 canid autosomes. We find that these enigmatic canids are highly admixed varieties derived from gray wolves and coyotes, respectively. This divergent genomic history suggests that they do not have a shared recent ancestry as proposed by previous researchers. Interspecific hybridization, as well as the process of evolutionary divergence, may be responsible for the observed phenotypic distinction of both forms. Such admixture complicates decisions regarding endangered species restoration and protection.

    View details for DOI 10.1101/gr.116301.110

    View details for Web of Science ID 000293335700009

    View details for PubMedID 21566151

  • Genetic Architecture of Aluminum Tolerance in Rice (Oryza sativa) Determined through Genome-Wide Association Analysis and QTL Mapping PLOS GENETICS Famoso, A. N., Zhao, K., Clark, R. T., Tung, C., Wright, M. H., Bustamante, C., Kochian, L. V., McCouch, S. R. 2011; 7 (8)


    Aluminum (Al) toxicity is a primary limitation to crop productivity on acid soils, and rice has been demonstrated to be significantly more Al tolerant than other cereal crops. However, the mechanisms of rice Al tolerance are largely unknown, and no genes underlying natural variation have been reported. We screened 383 diverse rice accessions, conducted a genome-wide association (GWA) study, and conducted QTL mapping in two bi-parental populations using three estimates of Al tolerance based on root growth. Subpopulation structure explained 57% of the phenotypic variation, and the mean Al tolerance in Japonica was twice that of Indica. Forty-eight regions associated with Al tolerance were identified by GWA analysis, most of which were subpopulation-specific. Four of these regions co-localized with a priori candidate genes, and two highly significant regions co-localized with previously identified QTLs. Three regions corresponding to induced Al-sensitive rice mutants (ART1, STAR2, Nrat1) were identified through bi-parental QTL mapping or GWA to be involved in natural variation for Al tolerance. Haplotype analysis around the Nrat1 gene identified susceptible and tolerant haplotypes explaining 40% of the Al tolerance variation within the aus subpopulation, and sequence analysis of Nrat1 identified a trio of non-synonymous mutations predictive of Al sensitivity in our diversity panel. GWA analysis discovered more phenotype-genotype associations and provided higher resolution, but QTL mapping identified critical rare and/or subpopulation-specific alleles not detected by GWA analysis. Mapping using Indica/Japonica populations identified QTLs associated with transgressive variation where alleles from a susceptible aus or indica parent enhanced Al tolerance in a tolerant Japonica background. This work supports the hypothesis that selectively introgressing alleles across subpopulations is an efficient approach for trait enhancement in plant breeding programs and demonstrates the fundamental importance of subpopulation in interpreting and manipulating the genetics of complex traits in rice.

    View details for DOI 10.1371/journal.pgen.1002221

    View details for Web of Science ID 000294297000018

    View details for PubMedID 21829395

  • Demographic history and rare allele sharing among human populations PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Gravel, S., Henn, B. M., Gutenkunst, R. N., Indap, A. R., Marth, G. T., Clark, A. G., Yu, F., Gibbs, R. A., Bustamante, C. D. 2011; 108 (29): 11983-11988


    High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2-4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ∼1,000 sequenced chromosomes per population, whereas ∼2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.

    View details for DOI 10.1073/pnas.1019276108

    View details for Web of Science ID 000292876900056

    View details for PubMedID 21730125

  • Genomics for the world NATURE Bustamante, C. D., Burchard, E. G., De La Vega, F. M. 2011; 475 (7355): 163-165

    View details for Web of Science ID 000292690500024

    View details for PubMedID 21753830

  • Fast, Exact Linkage Analysis for Categorical Traits on Arbitrary Pedigree Designs GENETIC EPIDEMIOLOGY Brisbin, A., Cruickshank, J., Moise, N. S., Gunn, T., Bustamante, C. D., Mezey, J. G. 2011; 35 (5): 371-380


    Multi-symptom diseases without a consistent continuous measurement of severity may be best understood with a categorical interpretation. In this paper, we present LOCate v.2, a fast, exact algorithm for linkage analysis of all types of categorical traits, both ordinal and nominal. Our method is able to incorporate missing data and analyze complex genealogical structure, including inbreeding loops. LOCate v.2 computes exact likelihoods efficiently through an elimination algorithm, similar to that used by Superlink for binary traits. We compare LOCate v.2 to LOT and QTLlink, two existing methods of linkage analysis for ordinal traits. We find that LOCate v.2 outperforms both methods when used to analyze simulated nominal traits. In addition, LOCate v.2 performs as well as QTLlink on simulated ordinal traits, and better than LOT due to the necessity of cutting large pedigrees for analysis in LOT. To demonstrate the versatility of LOCate v.2, we conduct an ordinal and nominal linkage analysis of ventricular arrhythmias in a large, inbred pedigree of German Shepherd dogs. We find that a trichotomous ordinal or nominal interpretation strengthens the evidence in favor of linkage to a region on chromosome 6, and provides new evidence of linkage to a region on chromosome 11. LOCate v.2 is a unified, fast, and robust method for linkage analysis of ordinal and nominal traits which will be valuable to researchers interested in investigating any type of categorical trait.

    View details for DOI 10.1002/gepi.20585

    View details for Web of Science ID 000291591100009

    View details for PubMedID 21520271

  • On Identifying the Optimal Number of Population Clusters via the Deviance Information Criterion PLOS ONE Gao, H., Bryc, K., Bustamante, C. D. 2011; 6 (6)


    Inferring population structure using bayesian clustering programs often requires a priori specification of the number of subpopulations, K, from which the sample has been drawn. Here, we explore the utility of a common bayesian model selection criterion, the Deviance Information Criterion (DIC), for estimating K. We evaluate the accuracy of DIC, as well as other popular approaches, on datasets generated by coalescent simulations under various demographic scenarios. We find that DIC outperforms competing methods in many genetic contexts, validating its application in assessing population structure.

    View details for DOI 10.1371/journal.pone.0021014

    View details for Web of Science ID 000292142800008

    View details for PubMedID 21738600

  • Levels and Patterns of Nucleotide Variation in Domestication QTL Regions on Rice Chromosome 3 Suggest Lineage-Specific Selection PLOS ONE Xie, X., Molina, J., Hernandez, R., Reynolds, A., Boyko, A. R., Bustamante, C. D., Purugganan, M. D. 2011; 6 (6)


    Oryza sativa or Asian cultivated rice is one of the major cereal grass species domesticated for human food use during the Neolithic. Domestication of this species from the wild grass Oryza rufipogon was accompanied by changes in several traits, including seed shattering, percent seed set, tillering, grain weight, and flowering time. Quantitative trait locus (QTL) mapping has identified three genomic regions in chromosome 3 that appear to be associated with these traits. We would like to study whether these regions show signatures of selection and whether the same genetic basis underlies the domestication of different rice varieties. Fragments of 88 genes spanning these three genomic regions were sequenced from multiple accessions of two major varietal groups in O. sativa--indica and tropical japonica--as well as the ancestral wild rice species O. rufipogon. In tropical japonica, the levels of nucleotide variation in these three QTL regions are significantly lower compared to genome-wide levels, and coalescent simulations based on a complex demographic model of rice domestication indicate that these patterns are consistent with selection. In contrast, there is no significant reduction in nucleotide diversity in the homologous regions in indica rice. These results suggest that there are differences in the genetic and selective basis for domestication between these two Asian rice varietal groups.

    View details for DOI 10.1371/journal.pone.0020670

    View details for Web of Science ID 000291356400016

    View details for PubMedID 21674010

    View details for PubMedCentralID PMC3108957

  • Molecular evidence for a single evolutionary origin of domesticated rice PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Molina, J., Sikora, M., Garud, N., Flowers, J. M., Rubinstein, S., Reynolds, A., Huang, P., Jackson, S., Schaal, B. A., Bustamante, C. D., Boyko, A. R., Purugganan, M. D. 2011; 108 (20): 8351-8356


    Asian rice, Oryza sativa, is one of the world's oldest and most important crop species. Rice is believed to have been domesticated ∼9,000 y ago, although debate on its origin remains contentious. A single-origin model suggests that two main subspecies of Asian rice, indica and japonica, were domesticated from the wild rice O. rufipogon. In contrast, the multiple independent domestication model proposes that these two major rice types were domesticated separately and in different parts of the species range of wild rice. This latter view has gained much support from the observation of strong genetic differentiation between indica and japonica as well as several phylogenetic studies of rice domestication. We reexamine the evolutionary history of domesticated rice by resequencing 630 gene fragments on chromosomes 8, 10, and 12 from a diverse set of wild and domesticated rice accessions. Using patterns of SNPs, we identify 20 putative selective sweeps on these chromosomes in cultivated rice. Demographic modeling based on these SNP data and a diffusion-based approach provide the strongest support for a single domestication origin of rice. Bayesian phylogenetic analyses implementing the multispecies coalescent and using previously published phylogenetic sequence datasets also point to a single origin of Asian domesticated rice. Finally, we date the origin of domestication at ∼8,200-13,500 y ago, depending on the molecular clock estimate that is used, which is consistent with known archaeological data that suggests rice was first cultivated at around this time in the Yangtze Valley of China.

    View details for DOI 10.1073/pnas.1104686108

    View details for Web of Science ID 000290719600056

    View details for PubMedID 21536870

    View details for PubMedCentralID PMC3101000

  • Hunter-gatherer genomic diversity suggests a southern African origin for modern humans PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Henn, B. M., Gignoux, C. R., Jobin, M., Granka, J. M., Macpherson, J. M., Kidd, J. M., Rodriguez-Botigue, L., Ramachandran, S., Hon, L., Brisbin, A., Lin, A. A., Underhill, P. A., Comas, D., Kidd, K. K., Norman, P. J., Parham, P., Bustamante, C. D., Mountain, J. L., Feldman, M. W. 2011; 108 (13): 5154-5162


    Africa is inferred to be the continent of origin for all modern human populations, but the details of human prehistory and evolution in Africa remain largely obscure owing to the complex histories of hundreds of distinct populations. We present data for more than 580,000 SNPs for several hunter-gatherer populations: the Hadza and Sandawe of Tanzania, and the ≠Khomani Bushmen of South Africa, including speakers of the nearly extinct N|u language. We find that African hunter-gatherer populations today remain highly differentiated, encompassing major components of variation that are not found in other African populations. Hunter-gatherer populations also tend to have the lowest levels of genome-wide linkage disequilibrium among 27 African populations. We analyzed geographic patterns of linkage disequilibrium and population differentiation, as measured by F(ST), in Africa. The observed patterns are consistent with an origin of modern humans in southern Africa rather than eastern Africa, as is generally assumed. Additionally, genetic variation in African hunter-gatherer populations has been significantly affected by interaction with farmers and herders over the past 5,000 y, through both severe population bottlenecks and sex-biased migration. However, African hunter-gatherer populations continue to maintain the highest levels of genetic diversity in the world.

    View details for DOI 10.1073/pnas.1017511108

    View details for Web of Science ID 000288894800009

    View details for PubMedID 21383195

  • Detecting Directional Selection in the Presence of Recent Admixture in African-Americans GENETICS Lohmueller, K. E., Bustamante, C. D., Clark, A. G. 2011; 187 (3): 823-835


    We investigate the performance of tests of neutrality in admixed populations using plausible demographic models for African-American history as well as resequencing data from African and African-American populations. The analysis of both simulated and human resequencing data suggests that recent admixture does not result in an excess of false-positive results for neutrality tests based on the frequency spectrum after accounting for the population growth in the parental African population. Furthermore, when simulating positive selection, Tajima's D, Fu and Li's D, and haplotype homozygosity have lower power to detect population-specific selection using individuals sampled from the admixed population than from the nonadmixed population. Fay and Wu's H test, however, has more power to detect selection using individuals from the admixed population than from the nonadmixed population, especially when the selective sweep ended long ago. Our results have implications for interpreting recent genome-wide scans for positive selection in human populations.

    View details for DOI 10.1534/genetics.110.122739

    View details for Web of Science ID 000288457800016

    View details for PubMedID 21196524

  • Genetic structure and domestication history of the grape PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Myles, S., Boyko, A. R., Owens, C. L., Brown, P. J., Grassi, F., Aradhya, M. K., Prins, B., Reynolds, A., Chia, J., Ware, D., Bustamante, C. D., Buckler, E. S. 2011; 108 (9): 3530-3535


    The grape is one of the earliest domesticated fruit crops and, since antiquity, it has been widely cultivated and prized for its fruit and wine. Here, we characterize genome-wide patterns of genetic variation in over 1,000 samples of the domesticated grape, Vitis vinifera subsp. vinifera, and its wild relative, V. vinifera subsp. sylvestris from the US Department of Agriculture grape germplasm collection. We find support for a Near East origin of vinifera and present evidence of introgression from local sylvestris as the grape moved into Europe. High levels of genetic diversity and rapid linkage disequilibrium (LD) decay have been maintained in vinifera, which is consistent with a weak domestication bottleneck followed by thousands of years of widespread vegetative propagation. The considerable genetic diversity within vinifera, however, is contained within a complex network of close pedigree relationships that has been generated by crosses among elite cultivars. We show that first-degree relationships are rare between wine and table grapes and among grapes from geographically distant regions. Our results suggest that although substantial genetic diversity has been maintained in the grape subsequent to domestication, there has been a limited exploration of this diversity. We propose that the adoption of vegetative propagation was a double-edged sword: Although it provided a benefit by ensuring true breeding cultivars, it also discouraged the generation of unique cultivars through crosses. The grape currently faces severe pathogen pressures, and the long-term sustainability of the grape and wine industries will rely on the exploitation of the grape's tremendous natural genetic diversity.

    View details for DOI 10.1073/pnas.1009363108

    View details for Web of Science ID 000287844400021

    View details for PubMedID 21245334

  • A Population Genetic Approach to Mapping Neurological Disorder Genes Using Deep Resequencing PLOS GENETICS Myers, R. A., Casals, F., Gauthier, J., Hamdan, F. F., Keebler, J., Boyko, A. R., Bustamante, C. D., Piton, A. M., Spiegelman, D., Henrion, E., Zilversmit, M., Hussin, J., Quinlan, J., Yang, Y., Lafreniere, R. G., Griffing, A. R., Stone, E. A., Rouleau, G. A., Awadalla, P. 2011; 7 (2)


    Deep resequencing of functional regions in human genomes is key to identifying potentially causal rare variants for complex disorders. Here, we present the results from a large-sample resequencing (n = 285 patients) study of candidate genes coupled with population genetics and statistical methods to identify rare variants associated with Autism Spectrum Disorder and Schizophrenia. Three genes, MAP1A, GRIN2B, and CACNA1F, were consistently identified by different methods as having significant excess of rare missense mutations in either one or both disease cohorts. In a broader context, we also found that the overall site frequency spectrum of variation in these cases is best explained by population models of both selection and complex demography rather than neutral models or models accounting for complex demography alone. Mutations in the three disease-associated genes explained much of the difference in the overall site frequency spectrum among the cases versus controls. This study demonstrates that genes associated with complex disorders can be mapped using resequencing and analytical methods with sample sizes far smaller than those required by genome-wide association studies. Additionally, our findings support the hypothesis that rare mutations account for a proportion of the phenotypic variance of these complex disorders.

    View details for DOI 10.1371/journal.pgen.1001318

  • Comparative and demographic analysis of orang-utan genomes NATURE Locke, D. P., Hillier, L. W., Warren, W. C., Worley, K. C., Nazareth, L. V., Muzny, D. M., Yang, S., Wang, Z., Chinwalla, A. T., Minx, P., Mitreva, M., Cook, L., Delehaunty, K. D., Fronick, C., Schmidt, H., Fulton, L. A., Fulton, R. S., Nelson, J. O., Magrini, V., Pohl, C., Graves, T. A., Markovic, C., Cree, A., Dinh, H. H., Hume, J., Kovar, C. L., Fowler, G. R., Lunter, G., Meader, S., Heger, A., Ponting, C. P., Marques-Bonet, T., Alkan, C., Chen, L., Cheng, Z., Kidd, J. M., Eichler, E. E., White, S., Searle, S., Vilella, A. J., Chen, Y., Flicek, P., Ma, J., Raney, B., Suh, B., Burhans, R., Herrero, J., Haussler, D., Faria, R., Fernando, O., Darre, F., Farre, D., Gazave, E., Oliva, M., Navarro, A., Roberto, R., Capozzi, O., Archidiacono, N., Della Valle, G., Purgato, S., Rocchi, M., Konkel, M. K., Walker, J. A., Ullmer, B., Batzer, M. A., Smit, A. F., Hubley, R., Casola, C., Schrider, D. R., Hahn, M. W., Quesada, V., Puente, X. S., Ordonez, G. R., Lopez-Otin, C., Vinar, T., Brejova, B., Ratan, A., Harris, R. S., Miller, W., Kosiol, C., Lawson, H. A., Taliwal, V., Martins, A. L., Siepel, A., RoyChoudhury, A., Ma, X., Degenhardt, J., Bustamante, C. D., Gutenkunst, R. N., Mailund, T., Dutheil, J. Y., Hobolth, A., Schierup, M. H., Ryder, O. A., Yoshinaga, Y., de Jong, P. J., Weinstock, G. M., Rogers, J., Mardis, E. R., Gibbs, R. A., Wilson, R. K. 2011; 469 (7331): 529-533


    'Orang-utan' is derived from a Malay term meaning 'man of the forest' and aptly describes the southeast Asian great apes native to Sumatra and Borneo. The orang-utan species, Pongo abelii (Sumatran) and Pongo pygmaeus (Bornean), are the most phylogenetically distant great apes from humans, thereby providing an informative perspective on hominid evolution. Here we present a Sumatran orang-utan draft genome assembly and short read sequence data from five Sumatran and five Bornean orang-utan genomes. Our analyses reveal that, compared to other primates, the orang-utan genome has many unique features. Structural evolution of the orang-utan genome has proceeded much more slowly than other great apes, evidenced by fewer rearrangements, less segmental duplication, a lower rate of gene family turnover and surprisingly quiescent Alu repeats, which have played a major role in restructuring other primate genomes. We also describe a primate polymorphic neocentromere, found in both Pongo species, emphasizing the gradual evolution of orang-utan genome structure. Orang-utans have extremely low energy usage for a eutherian mammal, far lower than their hominid relatives. Adding their genome to the repertoire of sequenced primates illuminates new signals of positive selection in several pathways including glycolipid metabolism. From the population perspective, both Pongo species are deeply diverse; however, Sumatran individuals possess greater diversity than their Bornean counterparts, and more species-specific variation. Our estimate of Bornean/Sumatran speciation time, 400,000 years ago, is more recent than most previous studies and underscores the complexity of the orang-utan speciation process. Despite a smaller modern census population size, the Sumatran effective population size (N(e)) expanded exponentially relative to the ancestral N(e) after the split, while Bornean N(e) declined over the same period. Overall, the resources and analyses presented here offer new opportunities in evolutionary genomics, insights into hominid biology, and an extensive database of variation for conservation efforts.

  • The functional spectrum of low-frequency coding variation GENOME BIOLOGY Marth, G. T., Yu, F., Indap, A. R., Garimella, K., Gravel, S., Leong, W. F., Tyler-Smith, C., Bainbridge, M., Blackwell, T., Zheng-Bradley, X., Chen, Y., Challis, D., Clarke, L., Ball, E. V., Cibulskis, K., Cooper, D. N., Fulton, B., Hartl, C., Koboldt, D., Muzny, D., Smith, R., Sougnez, C., Stewart, C., Ward, A., Yu, J., Xue, Y., Altshuler, D., Bustamante, C. D., Clark, A. G., Daly, M., DePristo, M., Flicek, P., Gabriel, S., Mardis, E., Palotie, A., Gibbs, R. 2011; 12 (9)


    Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency. The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently afoot, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. According to the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants. This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.

  • GENOME-WIDE ASSOCIATION MAPPING AND RARE ALLELES: FROM POPULATION GENOMICS TO PERSONALIZED MEDICINE - Session Introduction. Pacific Symposium on Biocomputing De La Vega, F. M., Bustamante, C. D., Leal, S. M. 2011: 74-75


    Genome-wide associations studies (GWAS) have been very successful in identifying common genetic variation associated to numerous complex diseases [1]. However, most of the identified common genetic variants appear to confer modest risk and few causal alleles have been identified [2]. Furthermore, these associations account for a small portion of the total heritability of inherited disease variation [1]. This has led to the reexamination of the contribution of environment, gene-gene and gene-environment interactions, and rare genetic variants in complex diseases [1, 3, 4]. There is strong evidence that rare variants play an important role in complex disease etiology and may have larger genetic effects than common variants [2]. Currently, much of what we know regarding the contribution of rare genetic variants to disease risk is based on a limited number of phenotypes and candidate genes. However, rapid advancement of second generation sequencing technologies will invariably lead to widespread association studies comparing whole exome and eventually whole genome sequencing of cases and controls. A tremendous challenge for enabling these "next generation" medical genomic studies is developing statistical approaches for correlating rare genetic variants with disease outcome. The analysis of rare variants is challenging since methods used for common variants are woefully underpowered. Therefore, methods that can deal with genetic heterogeneity at the trait-associated locus have been developed to analyze rare variants. These methods instead analyzing individual variants analyze variants within a region/gene as a group and usually rely on collapsing. They can be applied to both in cases vs. controls and quantitative trait studies are needed. The paper of Bansal et al. in this volume describes the application of a number of statistical methods for testing associations between rare variants in two genes to obesity. 
The authors considered the relative merits of the different methods as well as important implementation details, such as the leveraging of genomic annotations and determining p-values. Knowledge of haplotypes can increase the power of GWAS studies and also highlight associations that are impossible to detect without haplotype phase (e.g. loss of heterozygosity). Even more complicated phase-dependent interactions of variants in linkage equilibrium have also been suggested as possible causes of missing heritability. In their work, Hallsorsson et al. formulate algorithmic strategies for haplotype phasing by multi-assembly of shared haplotypes from next-generation sequencing data. These methods would allow testing haplotypes harboring rare variants for association and potentially increase their explanatory power. Since single SNP tests are often underpowered in rare variant association analysis, Zeggini and Asimit propose a locus-based method that has high power in the presence of rare variants and that incorporate base quality scores available for sequencing data. Their results suggest that this multi-marker approach may be best suited for smaller regions, or after some filtering to reduce the number of SNPs that are jointly tested to reduce loss of power due to multiple-testing adjustments. Finally, the paper of Zhou et al., presents a penalized regression framework for association testing on sequence data, in the presence of both common and rare variants. This method also introduces the use of weights to incorporate available biological information on the variants. Although these tactics improve both false positive and false negative rates, they represent an incremental development and there is still significant room for improvement. With the development of sequencing technologies and methods to detect complex trait rare variant associations many new and exciting discovery are imminent. 
The analysis of rare variants is still in its infancy and the next few years promise to produce many new methods to meet the special demands of analyzing this type of data. Note from Publisher: This article contains the abstract and references.

    View details for PubMedID 21121034

    View details for PubMedCentralID PMC3278906

  • HUMAN ORIGINS Shadows of early migrations NATURE Bustamante, C. D., Henn, B. M. 2010; 468 (7327): 1044-1045

  • ALCHEMY: a reliable method for automated SNP genotype calling for small batch sizes and highly homozygous populations BIOINFORMATICS Wright, M. H., Tung, C., Zhao, K., Reynolds, A., McCouch, S. R., Bustamante, C. D. 2010; 26 (23): 2952-2960


    The development of new high-throughput genotyping products requires a significant investment in testing and training samples to evaluate and optimize the product before it can be used reliably on new samples. One reason for this is that current methods for automated calling of genotypes are based on clustering approaches which require a large number of samples to be analyzed simultaneously, or an extensive training dataset to seed clusters. In systems where inbred samples are of primary interest, current clustering approaches perform poorly due to the inability to clearly identify a heterozygote cluster. As part of the development of two custom single nucleotide polymorphism genotyping products for Oryza sativa (domestic rice), we have developed a new genotype calling algorithm called 'ALCHEMY' based on statistical modeling of the raw intensity data rather than modelless clustering. A novel feature of the model is the ability to estimate and incorporate inbreeding information on a per sample basis allowing accurate genotyping of both inbred and heterozygous samples even when analyzed simultaneously. Since clustering is not used explicitly, ALCHEMY performs well on small sample sizes with accuracy exceeding 99% with as few as 18 samples. ALCHEMY is available for both commercial and academic use free of charge and distributed under the GNU General Public License. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btq533

  • Fine-scale population structure and the era of next-generation sequencing HUMAN MOLECULAR GENETICS Henn, B. M., Gravel, S., Moreno-Estrada, A., Acevedo-Acevedo, S., Bustamante, C. D. 2010; 19: R221-R226


    Fine-scale population structure characterizes most continents and is especially pronounced in non-cosmopolitan populations. Roughly half of the world's population remains non-cosmopolitan and even populations within cities often assort along ethnic and linguistic categories. Barriers to random mating can be ecologically extreme, such as the Sahara Desert, or cultural, such as the Indian caste system. In either case, subpopulations accumulate genetic differences if the barrier is maintained over multiple generations. Genome-wide polymorphism data, initially with only a few hundred autosomal microsatellites, have clearly established differences in allele frequency not only among continental regions, but also within continents and within countries. We review recent evidence from the analysis of genome-wide polymorphism data for genetic boundaries delineating human population structure and the main demographic and genomic processes shaping variation, and discuss the implications of population structure for the distribution and discovery of disease-causing genetic variants, in the light of the imminent availability of sequencing data for a multitude of diverse human genomes.

    View details for DOI 10.1093/hmg/ddq403

  • Detection of Heterozygous Mutations in the Genome of Mismatch Repair Defective Diploid Yeast Using a Bayesian Approach GENETICS Zanders, S., Ma, X., RoyChoudhury, A., Hernandez, R. D., Demogines, A., Barker, B., Gu, Z., Bustamante, C. D., Alani, E. 2010; 186 (2): 493-503


    DNA replication errors that escape polymerase proofreading and mismatch repair (MMR) can lead to base substitution and frameshift mutations. Such mutations can disrupt gene function, reduce fitness, and promote diseases such as cancer and are also the raw material of molecular evolution. To analyze with limited bias genomic features associated with DNA polymerase errors, we performed a genome-wide analysis of mutations that accumulate in MMR-deficient diploid lines of Saccharomyces cerevisiae. These lines were derived from a common ancestor and were grown for 160 generations, with bottlenecks reducing the population to one cell every 20 generations. We sequenced to between 8- and 20-fold coverage one wild-type and three mutator lines using Illumina Solexa 36-bp reads. Using an experimentally aware Bayesian genotype caller developed to pool experimental data across sequencing runs for all strains, we detected 28 heterozygous single-nucleotide polymorphisms (SNPs) and 48 single-nt insertion/deletions (indels) from the data set. This method was evaluated on simulated data sets and found to have a very low false-positive rate (∼6 × 10⁻⁵) and a false-negative rate of 0.08 within the unique mapping regions of the genome that contained at least sevenfold coverage. The heterozygous mutations identified by the Bayesian genotype caller were confirmed by Sanger sequencing. All of the mutations were unique to a given line, except for a single-nt deletion mutation which occurred independently in two lines. All 48 indels, composed of 46 deletions and two insertions, occurred in homopolymer (HP) tracts [i.e., 47 poly(A) or (T) tracts, 1 poly(G) or (C) tract] between 5 and 13 bp long. Our findings are of interest because HP tracts are present at high levels in the yeast genome (>77,400 for 5- to 20-nt HP tracts), and frameshift mutations in these regions are likely to disrupt gene function. 
In addition, they demonstrate that the mutation pattern seen previously in mismatch repair defective strains using a limited number of reporters holds true for the entire genome.

    View details for DOI 10.1534/genetics.110.120105

    View details for Web of Science ID 000282807400005

  • Balancing Selection Maintains a Form of ERAP2 that Undergoes Nonsense-Mediated Decay and Affects Antigen Presentation PLOS GENETICS Andres, A. M., Dennis, M. Y., Kretzschmar, W. W., Cannons, J. L., Lee-Lin, S., Hurle, B., Schwartzberg, P. L., Williamson, S. H., Bustamante, C. D., Nielsen, R., Clark, A. G., Green, E. D. 2010; 6 (10)


    A remarkable characteristic of the human major histocompatibility complex (MHC) is its extreme genetic diversity, which is maintained by balancing selection. In fact, the MHC complex remains one of the best-known examples of natural selection in humans, with well-established genetic signatures and biological mechanisms for the action of selection. Here, we present genetic and functional evidence that another gene with a fundamental role in MHC class I presentation, endoplasmic reticulum aminopeptidase 2 (ERAP2), has also evolved under balancing selection and contains a variant that affects antigen presentation. Specifically, genetic analyses of six human populations revealed strong and consistent signatures of balancing selection affecting ERAP2. This selection maintains two highly differentiated haplotypes (Haplotype A and Haplotype B), with frequencies 0.44 and 0.56, respectively. We found that ERAP2 expressed from Haplotype B undergoes differential splicing and encodes a truncated protein, leading to nonsense-mediated decay of the mRNA. To investigate the consequences of ERAP2 deficiency on MHC presentation, we correlated surface MHC class I expression with ERAP2 genotypes in primary lymphocytes. Haplotype B homozygotes had lower levels of MHC class I expressed on the surface of B cells, suggesting that naturally occurring ERAP2 deficiency affects MHC presentation and immune response. Interestingly, an ERAP2 paralog, endoplasmic reticulum aminopeptidase 1 (ERAP1), also shows genetic signatures of balancing selection. Together, our findings link the genetic signatures of selection with an effect on splicing and a cellular phenotype. Although the precise selective pressure that maintains polymorphism is unknown, the demonstrated differences between the ERAP2 splice forms provide important insights into the potential mechanism for the action of selection.

    View details for DOI 10.1371/journal.pgen.1001157

  • The Baker's Yeast Diploid Genome Is Remarkably Stable in Vegetative Growth and Meiosis PLOS GENETICS Nishant, K. T., Wei, W., Mancera, E., Argueso, J. L., Schlattl, A., Delhomme, N., Ma, X., Bustamante, C. D., Korbel, J. O., Gu, Z., Steinmetz, L. M., Alani, E. 2010; 6 (9)


    Accurate estimates of mutation rates provide critical information to analyze genome evolution and organism fitness. We used whole-genome DNA sequencing, pulse-field gel electrophoresis, and comparative genome hybridization to determine mutation rates in diploid vegetative and meiotic mutation accumulation lines of Saccharomyces cerevisiae. The vegetative lines underwent only mitotic divisions while the meiotic lines underwent a meiotic cycle every ∼20 vegetative divisions. Similar base substitution rates were estimated for both lines. Given our experimental design, these measures indicated that the meiotic mutation rate is within the range of being equal to zero to being 55-fold higher than the vegetative rate. Mutations detected in vegetative lines were all heterozygous while those in meiotic lines were homozygous. A quantitative analysis of intra-tetrad mating events in the meiotic lines showed that inter-spore mating is primarily responsible for rapidly fixing mutations to homozygosity as well as for removing mutations. We did not observe 1-2 nt insertion/deletion (in-del) mutations in any of the sequenced lines and only one structural variant in a non-telomeric location was found. However, a large number of structural variations in subtelomeric sequences were seen in both vegetative and meiotic lines that did not affect viability. Our results indicate that the diploid yeast nuclear genome is remarkably stable during the vegetative and meiotic cell cycles and support the hypothesis that peripheral regions of chromosomes are more dynamic than gene-rich central sections where structural rearrangements could be deleterious. This work also provides an improved estimate for the mutational load carried by diploid organisms.

    View details for DOI 10.1371/journal.pgen.1001109

  • Bayesian Linkage Analysis of Categorical Traits for Arbitrary Pedigree Designs PLOS ONE Brisbin, A., Weissman, M. M., Fyer, A. J., Hamilton, S. P., Knowles, J. A., Bustamante, C. D., Mezey, J. G. 2010; 5 (8)


    Pedigree studies of complex heritable diseases often feature nominal or ordinal phenotypic measurements and missing genetic marker or phenotype data. We have developed a Bayesian method for Linkage analysis of Ordinal and Categorical traits (LOCate) that can analyze complex genealogical structure for family groups and incorporate missing data. LOCate uses a Gibbs sampling approach to assess linkage, incorporating a simulated tempering algorithm for fast mixing. While our treatment is Bayesian, we develop a LOD (log of odds) score estimator for assessing linkage from Gibbs sampling that is highly accurate for simulated data. LOCate is applicable to linkage analysis for ordinal or nominal traits, a versatility which we demonstrate by analyzing simulated data with a nominal trait, on which LOCate outperforms LOT, an existing method which is designed for ordinal traits. We additionally demonstrate our method's versatility by analyzing a candidate locus (D2S1788) for panic disorder in humans, in a dataset with a large amount of missing data, which LOT was unable to handle. LOCate's accuracy and applicability to both ordinal and nominal traits will prove useful to researchers interested in mapping loci for categorical traits.

    View details for DOI 10.1371/journal.pone.0012307

  • An ADAM9 mutation in canine cone-rod dystrophy 3 establishes homology with human cone-rod dystrophy 9 MOLECULAR VISION Goldstein, O., Mezey, J. G., Boyko, A. R., Gao, C., Wang, W., Bustamante, C. D., Anguish, L. J., Jordan, J. A., Pearce-Kelling, S. E., Aguirre, G. D., Acland, G. M. 2010; 16 (167-70): 1549-1569


    To identify the causative mutation in a canine cone-rod dystrophy (crd3) that segregates as an adult onset disorder in the Glen of Imaal Terrier breed of dog. Glen of Imaal Terriers were ascertained for crd3 phenotype by clinical ophthalmoscopic examination, and in selected cases by electroretinography. Blood samples from affected cases and non-affected controls were collected and used, after DNA extraction, to undertake a genome-wide association study using Affymetrix Version 2 Canine single nucleotide polymorphism chips and 250K Sty Assay protocol. Positional candidate gene analysis was undertaken for genes identified within the peak-association signal region. Retinal morphology of selected crd3-affected dogs was evaluated by light and electron microscopy. A peak association signal exceeding genome-wide significance was identified on canine chromosome 16. Evaluation of genes in this region suggested A Disintegrin And Metalloprotease domain, family member 9 (ADAM9), identified concurrently elsewhere as the cause of human cone-rod dystrophy 9 (CORD9), as a strong positional candidate for canine crd3. Sequence analysis identified a large genomic deletion (over 20 kb) that removed exons 15 and 16 from the ADAM9 transcript, introduced a premature stop, and would remove critical domains from the encoded protein. Light and electron microscopy established that, as in ADAM9 knockout mice, the primary lesion in crd3 appears to be a failure of the apical microvilli of the retinal pigment epithelium to appropriately invest photoreceptor outer segments. By electroretinography, retinal function appears normal in very young crd3-affected dogs, but by 15 months of age, cone dysfunction is present. 
Subsequently, both rod and cone function degenerate. Identification of this ADAM9 deletion in crd3-affected dogs establishes this canine disease as orthologous to CORD9 in humans, and offers opportunities for further characterization of the disease process, and potential for genetic therapeutic intervention.

  • A Simple Genetic Architecture Underlies Morphological Variation in Dogs PLOS BIOLOGY Boyko, A. R., Quignon, P., Li, L., Schoenebeck, J. J., Degenhardt, J. D., Lohmueller, K. E., Zhao, K., Brisbin, A., Parker, H. G., vonHoldt, B. M., Cargill, M., Auton, A., Reynolds, A., Elkahloun, A. G., Castelhano, M., Mosher, D. S., Sutter, N. B., Johnson, G. S., Novembre, J., Hubisz, M. J., Siepel, A., Wayne, R. K., Bustamante, C. D., Ostrander, E. A. 2010; 8 (8)


    Domestic dogs exhibit tremendous phenotypic diversity, including a greater variation in body size than any other terrestrial mammal. Here, we generate a high density map of canine genetic variation by genotyping 915 dogs from 80 domestic dog breeds, 83 wild canids, and 10 outbred African shelter dogs across 60,968 single-nucleotide polymorphisms (SNPs). Coupling this genomic resource with external measurements from breed standards and individuals as well as skeletal measurements from museum specimens, we identify 51 regions of the dog genome associated with phenotypic variation among breeds in 57 traits. The complex traits include average breed body size and external body dimensions and cranial, dental, and long bone shape and size with and without allometric scaling. In contrast to the results from association mapping of quantitative traits in humans and domesticated plants, we find that across dog breeds, a small number of quantitative trait loci (≤3) explain the majority of phenotypic variation for most of the traits we studied. In addition, many genomic regions show signatures of recent selection, with most of the highly differentiated regions being associated with breed-defining traits such as body size, coat characteristics, and ear floppiness. Our results demonstrate the efficacy of mapping multiple traits in the domestic dog using a database of genotyped individuals and highlight the important role human-directed selection has played in altering the genetic architecture of key traits in this important species.

    View details for DOI 10.1371/journal.pbio.1000451

  • Successful Computational Prediction of Novel Imprinted Genes from Epigenomic Features MOLECULAR AND CELLULAR BIOLOGY Brideau, C. M., Eilertson, K. E., Hagarman, J. A., Bustamante, C. D., Soloway, P. D. 2010; 30 (13): 3357-3370


    Approximately 100 mouse genes undergo genomic imprinting, whereby one of the two parental alleles is epigenetically silenced. Imprinted genes influence processes including development, X chromosome inactivation, obesity, schizophrenia, and diabetes, motivating the identification of all imprinted loci. Local sequence features have been used to predict candidate imprinted genes, but rigorous testing using reciprocal crosses validated only three, one of which resided in previously identified imprinting clusters. Here we show that specific epigenetic features in mouse cells correlate with imprinting status in mice, and we identify hundreds of additional genes predicted to be imprinted in the mouse. We used a multitiered approach to validate imprinted expression, including use of a custom single nucleotide polymorphism array and traditional molecular methods. Of 65 candidates subjected to molecular assays for allele-specific expression, we found 10 novel imprinted genes that were maternally expressed in the placenta.

    View details for DOI 10.1128/MCB.01355-09

  • The Effect of Recent Admixture on Inference of Ancient Human Population History GENETICS Lohmueller, K. E., Bustamante, C. D., Clark, A. G. 2010; 185 (2): 611-U327


    Despite the widespread study of genetic variation in admixed human populations, such as African-Americans, there has not been an evaluation of the effects of recent admixture on patterns of polymorphism or inferences about population demography. These issues are particularly relevant because estimates of the timing and magnitude of population growth in Africa have differed among previous studies, some of which examined African-American individuals. Here we use simulations and single-nucleotide polymorphism (SNP) data collected through direct resequencing and genotyping to investigate these issues. We find that when estimating the current population size and magnitude of recent growth in an ancestral population using the site frequency spectrum (SFS), it is possible to obtain reasonably accurate estimates of the parameters when using samples drawn from the admixed population under certain conditions. We also show that methods for demographic inference that use haplotype patterns are more sensitive to recent admixture than are methods based on the SFS. The analysis of human genetic variation data from the Yoruba people of Ibadan, Nigeria and African-Americans supports the predictions from the simulations. Our results have important implications for the evaluation of previous population genetic studies that have considered African-American individuals as a proxy for individuals from West Africa as well as for future population genetic studies of additional admixed populations.

    View details for DOI 10.1534/genetics.109.113761

  • Genomic Diversity and Introgression in O. sativa Reveal the Impact of Domestication and Breeding on the Rice Genome PLOS ONE Zhao, K., Wright, M., Kimball, J., Eizenga, G., McClung, A., Kovach, M., Tyagi, W., Ali, M. L., Tung, C., Reynolds, A., Bustamante, C. D., McCouch, S. R. 2010; 5 (5)


    The domestication of Asian rice (Oryza sativa) was a complex process punctuated by episodes of introgressive hybridization among and between subpopulations. Deep genetic divergence between the two main varietal groups (Indica and Japonica) suggests domestication from at least two distinct wild populations. However, genetic uniformity surrounding key domestication genes across divergent subpopulations suggests cultural exchange of genetic material among ancient farmers. In this study, we utilize a novel 1,536 SNP panel genotyped across 395 diverse accessions of O. sativa to study genome-wide patterns of polymorphism, to characterize population structure, and to infer the introgression history of domesticated Asian rice. Our population structure analyses support the existence of five major subpopulations (indica, aus, tropical japonica, temperate japonica and GroupV) consistent with previous analyses. Our introgression analysis shows that most accessions exhibit some degree of admixture, with many individuals within a population sharing the same introgressed segment due to artificial selection. Admixture mapping and association analysis of amylose content and grain length illustrate the potential for dissecting the genetic basis of complex traits in domesticated plant populations. Genes in these regions control a myriad of traits including plant stature, blast resistance, and amylose content. These analyses highlight the power of population genomics in agricultural systems to identify functionally important regions of the genome and to decipher the role of human-directed breeding in refashioning the genomes of a domesticated species.

    View details for Web of Science ID 000278034600010

    Hispanic/Latino populations possess a complex genetic structure that reflects recent admixture among and potentially ancient substructure within Native American, European, and West African source populations. Here, we quantify genome-wide patterns of SNP and haplotype variation among 100 individuals with ancestry from Ecuador, Colombia, Puerto Rico, and the Dominican Republic genotyped on the Illumina 610-Quad arrays and 112 Mexicans genotyped on Affymetrix 500K platform. Intersecting these data with previously collected high-density SNP data from 4,305 individuals, we use principal component analysis and clustering methods FRAPPE and STRUCTURE to investigate genome-wide patterns of African, European, and Native American population structure within and among Hispanic/Latino populations. Comparing autosomal, X and Y chromosome, and mtDNA variation, we find evidence of a significant sex bias in admixture proportions consistent with disproportionate contribution of European male and Native American female ancestry to present-day populations. We also find that patterns of linkage-disequilibria in admixed Hispanic/Latino populations are largely affected by the admixture dynamics of the populations, with faster decay of LD in populations of higher African ancestry. Finally, using the locus-specific ancestry inference method LAMP, we reconstruct fine-scale chromosomal patterns of admixture. We document moderate power to differentiate among potential subcontinental source populations within the Native American, European, and African segments of the admixed Hispanic/Latino genomes. Our results suggest future genome-wide association scans in Hispanic/Latino populations may require correction for local genomic ancestry at a subcontinental scale when associating differences in the genome with disease risk, progression, and drug efficacy, as well as for admixture mapping.

    View details for DOI 10.1073/pnas.0914618107

  • Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication NATURE vonHoldt, B. M., Pollinger, J. P., Lohmueller, K. E., Han, E., Parker, H. G., Quignon, P., Degenhardt, J. D., Boyko, A. R., Earl, D. A., Auton, A., Reynolds, A., Bryc, K., Brisbin, A., Knowles, J. C., Mosher, D. S., Spady, T. C., Elkahloun, A., Geffen, E., Pilot, M., Jedrzejewski, W., Greco, C., Randi, E., Bannasch, D., Wilton, A., Shearman, J., Musiani, M., Cargill, M., Jones, P. G., Qian, Z., Huang, W., Ding, Z., Zhang, Y., Bustamante, C. D., Ostrander, E. A., Novembre, J., Wayne, R. K. 2010; 464 (7290): 898-U109


    Advances in genome technology have facilitated a new understanding of the historical and genetic processes crucial to rapid phenotypic evolution under domestication. To understand the process of dog diversification better, we conducted an extensive genome-wide survey of more than 48,000 single nucleotide polymorphisms in dogs and their wild progenitor, the grey wolf. Here we show that dog breeds share a higher proportion of multi-locus haplotypes unique to grey wolves from the Middle East, indicating that they are a dominant source of genetic diversity for dogs rather than wolves from east Asia, as suggested by mitochondrial DNA sequence data. Furthermore, we find a surprising correspondence between genetic and phenotypic/functional breed groupings but there are exceptions that suggest phenotypic diversification depended in part on the repeated crossing of individuals with novel phenotypes. Our results show that Middle Eastern wolves were a critical source of genome diversity, although interbreeding with local wolf populations clearly occurred elsewhere in the early history of specific lineages. More recently, the evolution of modern dog breeds seems to have been an iterative process that drew on a limited genetic toolkit to create remarkable phenotypic diversity.

    View details for DOI 10.1038/nature08837

  • A genome-wide linkage scan in German shepherd dogs localizes canine platelet procoagulant deficiency (Scott syndrome) to canine chromosome 27 GENE Brooks, M., Etter, K., Catalfamo, J., Brisbin, A., Bustamante, C., Mezey, J. 2010; 450 (1-2): 70-75


    Scott syndrome is a rare hereditary bleeding disorder associated with an inability of stimulated platelets to externalize the negatively charged phospholipid, phosphatidylserine (PS). Canine Scott syndrome (CSS) is the only naturally occurring animal model of this defect and therefore represents a unique tool to discover a disease gene capable of producing this platelet phenotype. We undertook platelet function studies and linkage analyses in a pedigree of CSS-affected German shepherd dogs. Based on residual serum prothrombin and flow cytometric assays, CSS segregates as an autosomal recessive trait. An initial genome scan, performed by genotyping 48 dogs for 280 microsatellite markers, suggested linkage with markers on chromosome 27. Genotypes ultimately obtained for a total of 56 dogs at 11 markers on chromosome 27 revealed significant LOD scores for 2 markers near the centromere, with multipoint linkage indicating a CSS trait locus spanning approximately 14 cM. These results provide the basis for fine mapping studies to narrow the disease interval and target the evaluation of putative disease genes.

    View details for DOI 10.1016/j.gene.2009.09.016

  • Genome-wide patterns of population structure and admixture in West Africans and African Americans PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Bryc, K., Auton, A., Nelson, M. R., Oksenberg, J. R., Hauser, S. L., Williams, S., Froment, A., Bodo, J., Wambebe, C., Tishkoff, S. A., Bustamante, C. D. 2010; 107 (2): 786-791


    Quantifying patterns of population structure in Africans and African Americans illuminates the history of human populations and is critical for undertaking medical genomic studies on a global scale. To obtain a fine-scale genome-wide perspective of ancestry, we analyze Affymetrix GeneChip 500K genotype data from African Americans (n = 365) and individuals with ancestry from West Africa (n = 203 from 12 populations) and Europe (n = 400 from 42 countries). We find that population structure within the West African sample reflects primarily language and secondarily geographical distance, echoing the Bantu expansion. Among African Americans, analysis of genomic admixture by a principal component-based approach indicates that the median proportion of European ancestry is 18.5% (25th-75th percentiles: 11.6-27.7%), with very large variation among individuals. In the African-American sample as a whole, few autosomal regions showed exceptionally high or low mean African ancestry, but the X chromosome showed elevated levels of African ancestry, consistent with a sex-biased pattern of gene flow with an excess of European male and African female ancestry. We also find that genomic profiles of individual African Americans afford personalized ancestry reconstructions differentiating ancient vs. recent European and African ancestry. Finally, patterns of genetic similarity among inferred African segments of African-American genomes and genomes of contemporary African populations included in this study suggest African ancestry is most similar to non-Bantu Niger-Kordofanian-speaking populations, consistent with historical documents of the African Diaspora and trans-Atlantic slave trade.

    View details for DOI 10.1073/pnas.0909559107

  • SNP identification, verification, and utility for population genetics in a non-model genus. BMC genetics Williams, L. M., Ma, X., Boyko, A. R., Bustamante, C. D., Oleksiak, M. F. 2010; 11: 32


    By targeting SNPs contained in both coding and non-coding areas of the genome, we are able to identify genetic differences and characterize genome-wide patterns of variation among individuals, populations and species. We investigated the utility of 454 sequencing and MassARRAY genotyping for population genetics in natural populations of the teleost, Fundulus heteroclitus as well as closely related Fundulus species (F. grandis, F. majalis and F. similis). We used 454 pyrosequencing and MassARRAY genotyping technology to identify and type 458 genome-wide SNPs and determine genetic differentiation within and between populations and species of Fundulus. Specifically, pyrosequencing identified 96 putative SNPs across coding and non-coding regions of the F. heteroclitus genome: 88.8% were verified as true SNPs with MassARRAY. Additionally, putative SNPs identified in F. heteroclitus EST sequences were verified in most (86.5%) F. heteroclitus individuals; fewer were genotyped in F. grandis (74.4%), F. majalis (72.9%), and F. similis (60.7%) individuals. SNPs were polymorphic and showed latitudinal clinal variation separating northern and southern populations and established isolation by distance in F. heteroclitus populations. In F. grandis, SNPs were less polymorphic but still established isolation by distance. Markers differentiated species and populations. In total, these approaches were used to quickly determine differences within the Fundulus genome and provide markers for population genetic studies.

  • Targets of Balancing Selection in the Human Genome MOLECULAR BIOLOGY AND EVOLUTION Andres, A. M., Hubisz, M. J., Indap, A., Torgerson, D. G., Degenhardt, J. D., Boyko, A. R., Gutenkunst, R. N., White, T. J., Green, E. D., Bustamante, C. D., Clark, A. G., Nielsen, R. 2009; 26 (12): 2755-2764


    Balancing selection is potentially an important biological force for maintaining advantageous genetic diversity in populations, including variation that is responsible for long-term adaptation to the environment. By serving as a means to maintain genetic variation, it may be particularly relevant to maintaining phenotypic variation in natural populations. Nevertheless, its prevalence and specific targets in the human genome remain largely unknown. We have analyzed the patterns of diversity and divergence of 13,400 genes in two human populations using an unbiased single-nucleotide polymorphism data set, a genome-wide approach, and a method that incorporates demography in neutrality tests. We identified an unbiased catalog of genes with signatures of long-term balancing selection, which includes immunity genes as well as genes encoding keratins and membrane channels; the catalog also shows enrichment in functional categories involved in cellular structure. Patterns are mostly concordant in the two populations, with a small fraction of genes showing population-specific signatures of selection. Power considerations indicate that our findings represent a subset of all targets in the genome, suggesting that although balancing selection may not have an obvious impact on a large proportion of human genes, it is a key force affecting the evolution of a number of genes in humans.

    View details for DOI 10.1093/molbev/msp190

  • Coat Variation in the Domestic Dog Is Governed by Variants in Three Genes SCIENCE Cadieu, E., Neff, M. W., Quignon, P., Walsh, K., Chase, K., Parker, H. G., vonHoldt, B. M., Rhue, A., Boyko, A., Byers, A., Wong, A., Mosher, D. S., Elkahloun, A. G., Spady, T. C., Andre, C., Lark, K. G., Cargill, M., Bustamante, C. D., Wayne, R. K., Ostrander, E. A. 2009; 326 (5949): 150-153


    Coat color and type are essential characteristics of domestic dog breeds. Although the genetic basis of coat color has been well characterized, relatively little is known about the genes influencing coat growth pattern, length, and curl. We performed genome-wide association studies of more than 1000 dogs from 80 domestic breeds to identify genes associated with canine fur phenotypes. Taking advantage of both inter- and intrabreed variability, we identified distinct mutations in three genes, RSPO2, FGF5, and KRT71 (encoding R-spondin-2, fibroblast growth factor-5, and keratin-71, respectively), that together account for most coat phenotypes in purebred dogs in the United States. Thus, an array of varied and seemingly complex phenotypes can be reduced to the combinatorial effects of only a few genes.

    View details for DOI 10.1126/science.1177808

  • Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data PLOS GENETICS Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H., Bustamante, C. D. 2009; 5 (10)


    Demographic models built from genetic data play important roles in illuminating prehistorical events and serving as null models in genome scans for selection. We introduce an inference method based on the joint frequency spectrum of genetic variants within and between populations. For candidate models we numerically compute the expected spectrum using a diffusion approximation to the one-locus, two-allele Wright-Fisher process, involving up to three simultaneous populations. Our approach is a composite likelihood scheme, since linkage between neutral loci alters the variance but not the expectation of the frequency spectrum. We thus use bootstraps incorporating linkage to estimate uncertainties for parameters and significance values for hypothesis tests. Our method can also incorporate selection on single sites, predicting the joint distribution of selected alleles among populations experiencing a bevy of evolutionary forces, including expansions, contractions, migrations, and admixture. We model human expansion out of Africa and the settlement of the New World, using 5 Mb of noncoding DNA resequenced in 68 individuals from 4 populations (YRI, CHB, CEU, and MXL) by the Environmental Genome Project. We infer divergence between West African and Eurasian populations 140 thousand years ago (95% confidence interval: 40-270 kya). This is earlier than other genetic studies, in part because we incorporate migration. We estimate the European (CEU) and East Asian (CHB) divergence time to be 23 kya (95% c.i.: 17-43 kya), long after archeological evidence places modern humans in Europe. Finally, we estimate divergence between East Asians (CHB) and Mexican-Americans (MXL) of 22 kya (95% c.i.: 16.3-26.9 kya), and our analysis yields no evidence for subsequent migration. 
    Furthermore, combining our demographic model with a previously estimated distribution of selective effects among newly arising amino acid mutations accurately predicts the frequency spectrum of nonsynonymous variants across three continental populations (YRI, CHB, CEU).

    View details for DOI 10.1371/journal.pgen.1000695
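    The joint frequency spectrum that this abstract takes as its data summary can be tabulated in a few lines. The sketch below is only an illustration of that summary for two populations, not the authors' software; the function name, its arguments, and the toy counts are assumptions made for the example.

```python
def joint_sfs(derived_a, derived_b, n_a, n_b):
    """Tabulate a two-population joint site frequency spectrum.

    derived_a[k], derived_b[k]: derived-allele counts at SNP k in
    populations A and B (hypothetical input layout).
    n_a, n_b: number of sampled chromosomes in each population.
    Returns a (n_a + 1) x (n_b + 1) table where entry [i][j] is the
    number of SNPs with i derived copies in A and j in B.
    """
    if len(derived_a) != len(derived_b):
        raise ValueError("both populations must cover the same SNPs")
    table = [[0] * (n_b + 1) for _ in range(n_a + 1)]
    for i, j in zip(derived_a, derived_b):
        table[i][j] += 1
    return table


# Toy data: four SNPs, two chromosomes sampled per population.
spectrum = joint_sfs([1, 1, 0, 2], [0, 0, 1, 2], n_a=2, n_b=2)
```

    Model fitting as described in the abstract would then compare such an observed table with the spectrum expected under a candidate demographic model; that step is not shown here.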

  • An Expressed Fgf4 Retrogene Is Associated with Breed-Defining Chondrodysplasia in Domestic Dogs SCIENCE Parker, H. G., vonHoldt, B. M., Quignon, P., Margulies, E. H., Shao, S., Mosher, D. S., Spady, T. C., Elkahloun, A., Cargill, M., Jones, P. G., Maslen, C. L., Acland, G. M., Sutter, N. B., Kuroki, K., Bustamante, C. D., Wayne, R. K., Ostrander, E. A. 2009; 325 (5943): 995-998


    Retrotransposition of processed mRNAs is a common source of novel sequence acquired during the evolution of genomes. Although the vast majority of retroposed gene copies, or retrogenes, rapidly accumulate debilitating mutations that disrupt the reading frame, a small percentage become new genes that encode functional proteins. By using a multibreed association analysis in the domestic dog, we demonstrate that expression of a recently acquired retrogene encoding fibroblast growth factor 4 (fgf4) is strongly associated with chondrodysplasia, a short-legged phenotype that defines at least 19 dog breeds including dachshund, corgi, and basset hound. These results illustrate the important role of a single evolutionary event in constraining and directing phenotypic diversity in the domestic dog.

    View details for DOI 10.1126/science.1173275

  • Complex population structure in African village dogs and its implications for inferring dog domestication history PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Boyko, A. R., Boyko, R. H., Boyko, C. M., Parker, H. G., Castelhano, M., Corey, L., Degenhardt, J. D., Auton, A., Hedimbi, M., Kityo, R., Ostrander, E. A., Schoenebeck, J., Todhunter, R. J., Jones, P., Bustamante, C. D. 2009; 106 (33): 13903-13908


    High genetic diversity of East Asian village dogs has recently been used to argue for an East Asian origin of the domestic dog. However, global village dog genetic diversity and the extent to which semiferal village dogs represent distinct, indigenous populations instead of admixtures of various dog breeds has not been quantified. Understanding these issues is critical to properly reconstructing the timing, number, and locations of dog domestication. To address these questions, we sampled 318 village dogs from 7 regions in Egypt, Uganda, and Namibia, measuring genetic diversity >680 bp of the mitochondrial D-loop, 300 SNPs, and 89 microsatellite markers. We also analyzed breed dogs, including putatively African breeds (Afghan hounds, Basenjis, Pharaoh hounds, Rhodesian ridgebacks, and Salukis), Puerto Rican street dogs, and mixed breed dogs from the United States. Village dogs from most African regions appear genetically distinct from non-native breed and mixed-breed dogs, although some individuals cluster genetically with Puerto Rican dogs or United States breed mixes instead of with neighboring village dogs. Thus, African village dogs are a mosaic of indigenous dogs descended from early migrants to Africa, and non-native, breed-admixed individuals. Among putatively African breeds, Pharaoh hounds and Rhodesian ridgebacks clustered with non-native rather than indigenous African dogs, suggesting they have predominantly non-African origins. Surprisingly, we find similar mtDNA haplotype diversity in African and East Asian village dogs, potentially calling into question the hypothesis of an East Asian origin for dog domestication.

    View details for DOI 10.1073/pnas.0902129106

  • Evolutionary Processes Acting on Candidate cis-Regulatory Regions in Humans Inferred from Patterns of Polymorphism and Divergence PLOS GENETICS Torgerson, D. G., Boyko, A. R., Hernandez, R. D., Indap, A., Hu, X., White, T. J., Sninsky, J. J., Cargill, M., Adams, M. D., Bustamante, C. D., Clark, A. G. 2009; 5 (8)


    Analysis of polymorphism and divergence in the non-coding portion of the human genome yields crucial information about factors driving the evolution of gene regulation. Candidate cis-regulatory regions spanning more than 15,000 genes in 15 African Americans and 20 European Americans were re-sequenced and aligned to the chimpanzee genome in order to identify potentially functional polymorphism and to characterize and quantify departures from neutral evolution. Distortions of the site frequency spectra suggest a general pattern of selective constraint on conserved non-coding sites in the flanking regions of genes (CNCs). Moreover, there is an excess of fixed differences that cannot be explained by a Gamma model of deleterious fitness effects, suggesting the presence of positive selection on CNCs. Extensions of the McDonald-Kreitman test identified candidate cis-regulatory regions with high probabilities of positive and negative selection near many known human genes, the biological characteristics of which exhibit genome-wide trends that differ from patterns observed in protein-coding regions. Notably, there is a higher probability of positive selection in candidate cis-regulatory regions near genes expressed in the fetal brain, suggesting that a larger portion of adaptive regulatory changes has occurred in genes expressed during brain development. Overall we find that natural selection has played an important role in the evolution of candidate cis-regulatory regions throughout hominid evolution.

    View details for DOI 10.1371/journal.pgen.1000592

  • Genomewide SNP variation reveals relationships among landraces and modern varieties of rice PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA McNally, K. L., Childs, K. L., Bohnert, R., Davidson, R. M., Zhao, K., Ulat, V. J., Zeller, G., Clark, R. M., Hoen, D. R., Bureau, T. E., Stokowski, R., Ballinger, D. G., Frazer, K. A., Cox, D. R., Padhukasahasram, B., Bustamante, C. D., Weigel, D., Mackill, D. J., Bruskiewich, R. M., Raetsch, G., Buell, C. R., Leung, H., Leach, J. E. 2009; 106 (30): 12273-12278


    Rice, the primary source of dietary calories for half of humanity, is the first crop plant for which a high-quality reference genome sequence from a single variety was produced. We used resequencing microarrays to interrogate 100 Mb of the unique fraction of the reference genome for 20 diverse varieties and landraces that capture the impressive genotypic and phenotypic diversity of domesticated rice. Here, we report the distribution of 160,000 nonredundant SNPs. Introgression patterns of shared SNPs revealed the breeding history and relationships among the 20 varieties; some introgressed regions are associated with agronomic traits that mark major milestones in rice improvement. These comprehensive SNP data provide a foundation for deep exploration of rice diversity and gene-trait relationships and their use for future rice improvement.

    View details for DOI 10.1073/pnas.0900992106

  • Darwinian and demographic forces affecting human protein coding genes GENOME RESEARCH Nielsen, R., Hubisz, M. J., Hellmann, I., Torgerson, D., Andres, A. M., Albrechtsen, A., Gutenkunst, R., Adams, M. D., Cargill, M., Boyko, A., Indap, A., Bustamante, C. D., Clark, A. G. 2009; 19 (5): 838-849


    Past demographic changes can produce distortions in patterns of genetic variation that can mimic the appearance of natural selection unless the demographic effects are explicitly removed. Here we fit a detailed model of human demography that incorporates divergence, migration, admixture, and changes in population size to directly sequenced data from 13,400 protein coding genes from 20 European-American and 19 African-American individuals. Based on this demographic model, we use several new and established statistical methods for identifying genes with extreme patterns of polymorphism likely to be caused by Darwinian selection, providing the first genome-wide analysis of allele frequency distributions in humans based on directly sequenced data. The tests are based on observations of excesses of high frequency-derived alleles, excesses of low frequency-derived alleles, and excesses of differences in allele frequencies between populations. We detect numerous new genes with strong evidence of selection, including a number of genes related to psychiatric and other diseases. We also show that microRNA controlled genes evolve under extremely high constraints and are more likely to undergo negative selection than other genes. Furthermore, we show that genes involved in muscle development have been subject to positive selection during recent human history. In accordance with previous studies, we find evidence for negative selection against mutations in genes associated with Mendelian disease and positive selection acting on genes associated with several complex diseases.

    View details for DOI 10.1101/gr.088336.108

  • Global distribution of genomic diversity underscores rich complex history of continental human populations GENOME RESEARCH Auton, A., Bryc, K., Boyko, A. R., Lohmueller, K. E., Novembre, J., Reynolds, A., Indap, A., Wright, M. H., Degenhardt, J. D., Gutenkunst, R. N., King, K. S., Nelson, M. R., Bustamante, C. D. 2009; 19 (5): 795-803


    Characterizing patterns of genetic variation within and among human populations is important for understanding human evolutionary history and for careful design of medical genetic studies. Here, we analyze patterns of variation across 443,434 single nucleotide polymorphisms (SNPs) genotyped in 3845 individuals from four continental regions. This unique resource allows us to illuminate patterns of diversity in previously under-studied populations at the genome-wide scale including Latin America, South Asia, and Southern Europe. Key insights afforded by our analysis include quantifying the degree of admixture in a large collection of individuals from Guadalajara, Mexico; identifying language and geography as key determinants of population structure within India; and elucidating a north-south gradient in haplotype diversity within Europe. We also present a novel method for identifying long-range tracts of homozygosity indicative of recent common ancestry. Application of our approach suggests great variation within and among populations in the extent of homozygosity, suggesting both demographic history (such as population bottlenecks) and recent ancestry events (such as consanguinity) play an important role in patterning variation in large modern human populations.

    View details for DOI 10.1101/gr.088898.108

  • Methods for Human Demographic Inference Using Haplotype Patterns From Genomewide Single-Nucleotide Polymorphism Data GENETICS Lohmueller, K. E., Bustamante, C. D., Clark, A. G. 2009; 182 (1): 217-231


    We propose a novel approximate-likelihood method to fit demographic models to human genomewide single-nucleotide polymorphism (SNP) data. We divide the genome into windows of constant genetic map width and then tabulate the number of distinct haplotypes and the frequency of the most common haplotype for each window. We summarize the data by the genomewide joint distribution of these two statistics, termed the HCN statistic. Coalescent simulations are used to generate the expected HCN statistic for different demographic parameters. The HCN statistic provides additional information for disentangling complex demography beyond statistics based on single-SNP frequencies. Application of our method to simulated data shows it can reliably infer parameters from growth and bottleneck models, even in the presence of recombination hotspots when properly modeled. We also examined how practical problems with genomewide data sets, such as errors in the genetic map, haplotype phase uncertainty, and SNP ascertainment bias, affect our method. Several modifications of our method served to make it robust to these problems. We have applied our method to data collected by Perlegen Sciences and find evidence for a severe population size reduction in northwestern Europe starting 32,500-47,500 years ago.

    View details for DOI 10.1534/genetics.108.099275
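    The windowed tabulation described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: haplotypes are taken to be phased strings with one character per SNP, the most common haplotype is reported as a count rather than a proportion, and all function and variable names are hypothetical.

```python
from collections import Counter


def window_haplotype_stats(haplotypes, map_positions, window_width):
    """Per-window haplotype summaries.

    haplotypes: equal-length strings, one per sampled chromosome,
        one character per SNP (assumed phased).
    map_positions: genetic map position of each SNP, ascending.
    window_width: window size in the same genetic map units.
    Returns one (distinct haplotypes, count of most common haplotype)
    pair per window that contains at least one SNP.
    """
    if not map_positions:
        return []
    origin = map_positions[0]
    snps_by_window = {}
    for idx, pos in enumerate(map_positions):
        window = int((pos - origin) // window_width)
        snps_by_window.setdefault(window, []).append(idx)
    stats = []
    for window in sorted(snps_by_window):
        idxs = snps_by_window[window]
        counts = Counter(tuple(h[i] for i in idxs) for h in haplotypes)
        stats.append((len(counts), counts.most_common(1)[0][1]))
    return stats


def joint_distribution(stats):
    """Genome-wide joint distribution of the two per-window statistics."""
    return Counter(stats)


# Toy data: four chromosomes, four SNPs, two windows of width 1.0.
haps = ["0011", "0011", "0111", "1100"]
stats = window_haplotype_stats(haps, [0.0, 0.1, 1.0, 1.2], 1.0)
table = joint_distribution(stats)
```

    In the method the abstract describes, a table of this kind would be compared with tables generated by coalescent simulation under different demographic parameters; the simulation step is outside the scope of this sketch.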

  • Genome-Wide Survey of SNP Variation Uncovers the Genetic Structure of Cattle Breeds SCIENCE Gibbs, R. A., Taylor, J. F., Van Tassell, C. P., Barendse, W., Eversoie, K. A., Gill, C. A., Green, R. D., Hamernik, D. L., Kappes, S. M., Lien, S., Matukumalli, L. K., McEwan, J. C., Nazareth, L. V., Schnabel, R. D., Taylor, J. F., Weinstock, G. M., Wheeler, D. A., Ajmone-Marsan, P., Barendse, W., Boettcher, P. J., Caetano, A. R., Garcia, J. F., Hanotte, O., Mariani, P., Skow, L. C., Williams, J. L., Caetano, A. R., Diallo, B., Green, R. D., Hailemariam, L., Hanotte, O., Martinez, M. L., Morris, C. A., Silva, L. O., Spelman, R. J., Taylor, J. F., Mulatu, W., Zhao, K., Abbey, C. A., Agaba, M., Araujo, F. R., Bunch, R. J., Burton, J., Gill, C. A., Gorni, C., Olivier, H., Harrison, B. E., Luff, B., Machado, M. A., Mariani, P., Morris, C. A., Mwakaya, J., Plastow, G., Sim, W., Skow, L. C., Smith, T., Sonstegard, T. S., Spelman, R. J., Taylor, J. F., Thomas, M. B., Valentini, A., Williams, P., Womack, J., Wooliams, J. A., Liu, Y., Qin, X., Worley, K. C., Gao, C., Gill, C. A., Jiang, H., Liu, Y., Moore, S. S., Nazareth, L. V., Ren, Y., Song, X., Bustamante, C. D., Hernandez, R. D., Muzny, D. M., Nazareth, L. V., Patil, S., Lucas, A. S., Fu, Q., Kent, M. P., Moore, S. S., Vega, R., Abbey, C. A., Gao, C., Gill, C. A., Green, R. D., Matukumalli, L. K., Matukumalli, A., McWilliam, S., Schnabel, R. D., Sclep, G., Ajmone-Marsan, P., Bryc, K., Bustamante, C. D., Choi, J., Gao, H., Grefenstette, J. J., Murdoch, B., Stella, A., Villa-Angulo, R., Wright, M., Aerts, J., Jann, O., Negrini, R., Sonstegard, T. S., Williams, J. L., Taylor, J. F., Villa-Angulo, R., Goddard, M. E., Hayes, B. J., Barendse, W., Bradley, D. G., Boettcher, P. J., Bustamante, C. D., da Silva, M. B., Lau, L. P., Liu, G. E., Lynn, D. J., Panzitta, F., Sclep, G., Wright, M., Dodds, K. G. 2009; 324 (5926): 528-532


    The imprints of domestication and breed development on the genomes of livestock likely differ from those of companion animals. A deep draft sequence assembly of shotgun reads from a single Hereford female and comparative sequences sampled from six additional breeds were used to develop probes to interrogate 37,470 single-nucleotide polymorphisms (SNPs) in 497 cattle from 19 geographically and biologically diverse breeds. These data show that cattle have undergone a rapid recent decrease in effective population size from a very large ancestral population, possibly due to bottlenecks associated with domestication, selection, and breed formation. Domestication and artificial selection appear to have left detectable signatures of selection within the cattle genome, yet the current levels of diversity within breeds are at least as great as exists within humans.

    View details for DOI 10.1126/science.1167936

  • Linkage Disequilibrium and Demographic History of Wild and Domestic Canids GENETICS Gray, M. M., Granka, J. M., Bustamante, C. D., Sutter, N. B., Boyko, A. R., Zhu, L., Ostrander, E. A., Wayne, R. 2009; 181 (4): 1493-1505


    Assessing the extent of linkage disequilibrium (LD) in natural populations of a nonmodel species has been difficult due to the lack of available genomic markers. However, with advances in genotyping and genome sequencing, genomic characterization of natural populations has become feasible. Using sequence data and SNP genotypes, we measured LD and modeled the demographic history of wild canid populations and domestic dog breeds. In 11 gray wolf populations and one coyote population, we find that the extent of LD as measured by the distance at which r²=0.2 extends from <10 kb in outbred populations to >1.7 Mb in populations that have experienced significant founder events and bottlenecks. This large range in the extent of LD parallels that observed in 18 dog breeds where the r² value varies from approximately 20 kb to >5 Mb. Furthermore, in modeling demographic history under a composite-likelihood framework, we find that two of five wild canid populations exhibit evidence of a historical population contraction. Five domestic dog breeds display evidence for a minor population contraction during domestication and a more severe contraction during breed formation. Only a 5% reduction in nucleotide diversity was observed as a result of domestication, whereas the loss of nucleotide diversity with breed formation averaged 35%.

    View details for DOI 10.1534/genetics.108.098830
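    The LD measure named in this abstract, the squared correlation between alleles at two SNPs, has a standard textbook definition that can be computed directly from phased haplotypes. The sketch below shows that definition only; it assumes 0/1 allele coding and phased data, and it is not taken from the paper's analysis pipeline.

```python
def r_squared(alleles_a, alleles_b):
    """Squared allelic correlation between two biallelic SNPs.

    alleles_a[k], alleles_b[k]: 0/1 alleles carried by chromosome k
    at SNP A and SNP B (phased haplotypes are assumed).
    Returns None when either SNP is monomorphic in the sample,
    because the statistic is undefined in that case.
    """
    n = len(alleles_a)
    if n == 0 or n != len(alleles_b):
        raise ValueError("need two equal-length, non-empty allele lists")
    p_a = sum(alleles_a) / n
    p_b = sum(alleles_b) / n
    p_ab = sum(1 for a, b in zip(alleles_a, alleles_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom


perfect = r_squared([1, 1, 0, 0], [1, 1, 0, 0])      # alleles always co-occur
independent = r_squared([1, 1, 0, 0], [1, 0, 1, 0])  # no association
```

    Estimating the distance at which this value decays to 0.2, as reported in the abstract, would additionally require pairing SNPs by physical distance and averaging within distance bins.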

  • Molecular and Evolutionary History of Melanism in North American Gray Wolves SCIENCE Anderson, T. M., vonHoldt, B. M., Candille, S. I., Musiani, M., Greco, C., Stahler, D. R., Smith, D. W., Padhukasahasram, B., Randi, E., Leonard, J. A., Bustamante, C. D., Ostrander, E. A., Tang, H., Wayne, R. K., Barsh, G. S. 2009; 323 (5919): 1339-1343


    Morphological diversity within closely related species is an essential aspect of evolution and adaptation. Mutations in the Melanocortin 1 receptor (Mc1r) gene contribute to pigmentary diversity in natural populations of fish, birds, and many mammals. However, melanism in the gray wolf, Canis lupus, is caused by a different melanocortin pathway component, the K locus, that encodes a beta-defensin protein that acts as an alternative ligand for Mc1r. We show that the melanistic K locus mutation in North American wolves derives from past hybridization with domestic dogs, has risen to high frequency in forested habitats, and exhibits a molecular signature of positive selection. The same mutation also causes melanism in the coyote, Canis latrans, and in Italian gray wolves, and hence our results demonstrate how traits selected in domesticated species can influence the morphological diversity of their wild relatives.

    View details for DOI 10.1126/science.1165448

  • Improving the Fitness of High-Dimensional Biomechanical Models via Data-Driven Stochastic Exploration IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING Santos, V. J., Bustamante, C. D., Valero-Cuevas, F. J. 2009; 56 (3): 552-564


    The field of complex biomechanical modeling has begun to rely on Monte Carlo techniques to investigate the effects of parameter variability and measurement uncertainty on model outputs, search for optimal parameter combinations, and define model limitations. However, advanced stochastic methods to perform data-driven explorations, such as Markov chain Monte Carlo (MCMC), become necessary as the number of model parameters increases. Here, we demonstrate the feasibility and, what to our knowledge is, the first use of an MCMC approach to improve the fitness of realistically large biomechanical models. We used a Metropolis-Hastings algorithm to search increasingly complex parameter landscapes (3, 8, 24, and 36 dimensions) to uncover underlying distributions of anatomical parameters of a "truth model" of the human thumb on the basis of simulated kinematic data (thumbnail location, orientation, and linear and angular velocities) polluted by zero-mean, uncorrelated multivariate Gaussian "measurement noise." Driven by these data, ten Markov chains searched each model parameter space for the subspace that best fit the data (posterior distribution). As expected, the convergence time increased, more local minima were found, and marginal distributions broadened as the parameter space complexity increased. In the 36-D scenario, some chains found local minima but the majority of chains converged to the true posterior distribution (confirmed using a cross-validation dataset), thus demonstrating the feasibility and utility of these methods for realistically large biomechanical problems.

    View details for DOI 10.1109/TBME.2008.2006033

  • Copy Number Variation of CCL3-like Genes Affects Rate of Progression to Simian-AIDS in Rhesus Macaques (Macaca mulatta) PLOS GENETICS Degenhardt, J. D., de Candia, P., Chabot, A., Schwartz, S., Henderson, L., Ling, B., Hunter, M., Jiang, Z., Palermo, R. E., Katze, M., Eichler, E. E., Ventura, M., Rogers, J., Marx, P., Gilad, Y., Bustamante, C. D. 2009; 5 (1)


    Variation in genes underlying host immunity can lead to marked differences in susceptibility to HIV infection among humans. Despite heavy reliance on non-human primates as models for HIV/AIDS, little is known about which host factors are shared and which are unique to a given primate lineage. Here, we investigate whether copy number variation (CNV) at CCL3-like genes (CCL3L), a key genetic host factor for HIV/AIDS susceptibility and cell-mediated immune response in humans, is also a determinant of time until onset of simian-AIDS in rhesus macaques. Using a retrospective study of 57 rhesus macaques experimentally infected with SIVmac, we find that CCL3L CNV explains approximately 18% of the variance in time to simian-AIDS (p<0.001) with lower CCL3L copy number associating with more rapid disease course. We also find that CCL3L copy number varies significantly (p<10⁻⁶) among rhesus subpopulations, with Indian-origin macaques having, on average, half as many CCL3L gene copies as Chinese-origin macaques. Lastly, we confirm that CCL3L shows variable copy number in humans and chimpanzees and report on CCL3L CNV within and among three additional primate species. On the basis of our findings we suggest that (1) the difference in population level copy number may explain previously reported observations of longer post-infection survivorship of Chinese-origin rhesus macaques, (2) stratification by CCL3L copy number in rhesus SIV vaccine trials will increase power and reduce noise due to non-vaccine-related differences in survival, and (3) CCL3L CNV is an ancestral component of the primate immune response and, therefore, copy number variation has not been driven by HIV or SIV per se.

    View details for DOI 10.1371/journal.pgen.1000346

  • Evaluating signatures of sex-specific processes in the human genome NATURE GENETICS Bustamante, C. D., Ramachandran, S. 2009; 41 (1): 8-10

    View details for DOI 10.1038/ng0109-8

  • Genes mirror geography within Europe NATURE Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A. R., Auton, A., Indap, A., King, K. S., Bergmann, S., Nelson, M. R., Stephens, M., Bustamante, C. D. 2008; 456 (7218): 98-U5


    Understanding the genetic structure of human populations is of fundamental interest to medical, forensic and anthropological sciences. Advances in high-throughput genotyping technology have markedly improved our understanding of global patterns of human genetic variation and suggest the potential to use large samples to uncover variation among closely spaced populations. Here we characterize genetic variation in a sample of 3,000 European individuals genotyped at over half a million variable DNA sites in the human genome. Despite low average levels of genetic differentiation among Europeans, we find a close correspondence between genetic and geographic distances; indeed, a geographical map of Europe arises naturally as an efficient two-dimensional summary of genetic variation in Europeans. The results emphasize that when mapping the genetic basis of a disease phenotype, spurious associations can arise if genetic structure is not properly accounted for. In addition, the results are relevant to the prospects of genetic ancestry testing; an individual's DNA can be used to infer their geographic origin with surprising accuracy, often to within a few hundred kilometres.

    View details for DOI 10.1038/nature07331

  • The population reference sample, POPRES: A resource for population, disease, and pharmacological genetics research AMERICAN JOURNAL OF HUMAN GENETICS Nelson, M. R., Bryc, K., King, K. S., Indap, A., Boyko, A. R., Novembre, J., Briley, L. P., Maruyama, Y., Waterworth, D. M., Waeber, G., Vollenweider, P., Oksenberg, J. R., Hauser, S. L., Stirnadel, H. A., Kooner, J. S., Chambers, J. C., Jones, B., Mooser, V., Bustamante, C. D., Roses, A. D., Burns, D. K., Ehm, M. G., Lail, E. H. 2008; 83 (3): 347-358


    Technological and scientific advances, stemming in large part from the Human Genome and HapMap projects, have made large-scale, genome-wide investigations feasible and cost effective. These advances have the potential to dramatically impact drug discovery and development by identifying genetic factors that contribute to variation in disease risk as well as drug pharmacokinetics, treatment efficacy, and adverse drug reactions. In spite of the technological advancements, successful application in biomedical research would be limited without access to suitable sample collections. To facilitate exploratory genetics research, we have assembled a DNA resource from a large number of subjects participating in multiple studies throughout the world. This growing resource was initially genotyped with a commercially available genome-wide 500,000 single-nucleotide polymorphism panel. This project includes nearly 6,000 subjects of African-American, East Asian, South Asian, Mexican, and European origin. Seven informative axes of variation identified via principal-component analysis (PCA) of these data confirm the overall integrity of the data and highlight important features of the genetic structure of diverse populations. The potential value of such extensively genotyped collections is illustrated by selection of genetically matched population controls in a genome-wide analysis of abacavir-associated hypersensitivity reaction. We find that matching based on country of origin, identity-by-state distance, and multidimensional PCA do similarly well to control the type I error rate. The genotype and demographic data from this reference sample are freely available through the NCBI database of Genotypes and Phenotypes (dbGaP).

    View details for DOI 10.1016/j.ajhg.2008.08.005

  • Exclusion of ABCA-1 as a candidate gene for canine Scott syndrome JOURNAL OF THROMBOSIS AND HAEMOSTASIS Brooks, M. B., Catalfamo, J. L., Etter, K., Brisbin, A., Bustamante, C. D. 2008; 6 (9): 1608-U5

  • Patterns of Positive Selection in Six Mammalian Genomes PLOS GENETICS Kosiol, C., Vinar, T., da Fonseca, R. R., Hubisz, M. J., Bustamante, C. D., Nielsen, R., Siepel, A. 2008; 4 (8)


    Genome-wide scans for positively selected genes (PSGs) in mammals have provided insight into the dynamics of genome evolution, the genetic basis of differences between species, and the functions of individual genes. However, previous scans have been limited in power and accuracy owing to small numbers of available genomes. Here we present the most comprehensive examination of mammalian PSGs to date, using the six high-coverage genome assemblies now available for eutherian mammals. The increased phylogenetic depth of this dataset results in substantially improved statistical power, and permits several new lineage- and clade-specific tests to be applied. Of approximately 16,500 human genes with high-confidence orthologs in at least two other species, 400 genes showed significant evidence of positive selection (FDR<0.05), according to a standard likelihood ratio test. An additional 144 genes showed evidence of positive selection on particular lineages or clades. As in previous studies, the identified PSGs were enriched for roles in defense/immunity, chemosensory perception, and reproduction, but enrichments were also evident for more specific functions, such as complement-mediated immunity and taste perception. Several pathways were strongly enriched for PSGs, suggesting possible co-evolution of interacting genes. A novel Bayesian analysis of the possible "selection histories" of each gene indicated that most PSGs have switched multiple times between positive selection and nonselection, suggesting that positive selection is often episodic. A detailed analysis of Affymetrix exon array data indicated that PSGs are expressed at significantly lower levels, and in a more tissue-specific manner, than non-PSGs. Genes that are specifically expressed in the spleen, testes, liver, and breast are significantly enriched for PSGs, but no evidence was found for an enrichment for PSGs among brain-specific genes. 
    This study provides additional evidence for widespread positive selection in mammalian evolution and new genome-wide insights into the functional implications of positive selection.

    View details for DOI 10.1371/journal.pgen.1000144

  • Natural selection on genes that underlie human disease susceptibility CURRENT BIOLOGY Blekhman, R., Man, O., Herrmann, L., Boyko, A. R., Indap, A., Kosiol, C., Bustamante, C. D., Teshima, K. M., Przeworski, M. 2008; 18 (12): 883-889


    What evolutionary forces shape genes that contribute to the risk of human disease? Do similar selective pressures act on alleles that underlie simple versus complex disorders [1-3]? Answers to these questions will shed light onto the origin of human disorders (e.g., [4]) and help to predict the population frequencies of alleles that contribute to disease risk, with important implications for the efficient design of mapping studies [5-7]. As a first step toward addressing these questions, we created a hand-curated version of the Mendelian Inheritance in Man database (OMIM). We then examined selective pressures on Mendelian-disease genes, genes that contribute to complex-disease risk, and genes known to be essential in mouse by analyzing patterns of human polymorphism and of divergence between human and rhesus macaque. We found that Mendelian-disease genes appear to be under widespread purifying selection, especially when the disease mutations are dominant (rather than recessive). In contrast, the class of genes that influence complex-disease risk shows little signs of evolutionary conservation, possibly because this category includes targets of both purifying and positive selection.

    View details for DOI 10.1016/j.cub.2008.04.074

  • Assessing the evolutionary impact of amino acid mutations in the human genome PLOS GENETICS Boyko, A. R., Williamson, S. H., Indap, A. R., Degenhardt, J. D., Hernandez, R. D., Lohmueller, K. E., Adams, M. D., Schmidt, S., Sninsky, J. J., Sunyaev, S. R., White, T. J., Nielsen, R., Clark, A. G., Bustamante, C. D. 2008; 4 (5)


    Quantifying the distribution of fitness effects among newly arising mutations in the human genome is key to resolving important debates in medical and evolutionary genetics. Here, we present a method for inferring this distribution using Single Nucleotide Polymorphism (SNP) data from a population with non-stationary demographic history (such as that of modern humans). Application of our method to 47,576 coding SNPs found by direct resequencing of 11,404 protein coding-genes in 35 individuals (20 European Americans and 15 African Americans) allows us to assess the relative contribution of demographic and selective effects to patterning amino acid variation in the human genome. We find evidence of an ancient population expansion in the sample with African ancestry and a relatively recent bottleneck in the sample with European ancestry. After accounting for these demographic effects, we find strong evidence for great variability in the selective effects of new amino acid replacing mutations. In both populations, the patterns of variation are consistent with a leptokurtic distribution of selection coefficients (e.g., gamma or log-normal) peaked near neutrality. Specifically, we predict 27-29% of amino acid changing (nonsynonymous) mutations are neutral or nearly neutral (|s|<0.01%), 30-42% are moderately deleterious (0.01%<|s|<1%), and nearly all the remainder are highly deleterious or lethal (|s|>1%). Our results are consistent with 10-20% of amino acid differences between humans and chimpanzees having been fixed by positive selection with the remainder of differences being neutral or nearly neutral. Our analysis also predicts that many of the alleles identified via whole-genome association mapping may be selectively neutral or (formerly) positively selected, implying that deleterious genetic variation affecting disease phenotype may be missed by this widely used approach for mapping genes underlying complex traits.

    View details for DOI 10.1371/journal.pgen.1000083

  • Exploring population genetic models with recombination using efficient forward-time simulations GENETICS Padhukasahasram, B., Marjoram, P., Wall, J. D., Bustamante, C. D., Nordborg, M. 2008; 178 (4): 2417-2427


    We present an exact forward-in-time algorithm that can efficiently simulate the evolution of a finite population under the Wright-Fisher model. We used simulations based on this algorithm to verify the accuracy of the ancestral recombination graph approximation by comparing it to the exact Wright-Fisher scenario. We find that the recombination graph is generally a very good approximation for models with complete outcrossing, whereas, for models with self-fertilization, the approximation becomes slightly inexact for some combinations of selfing and recombination parameters.

    View details for DOI 10.1534/genetics.107.085332
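
As a rough illustration of what a forward-in-time Wright-Fisher simulation does, the sketch below tracks one allele at a single locus under pure genetic drift. It is not the algorithm from the paper above, which also handles recombination and is designed for efficiency; the function name `wright_fisher` and its parameters are hypothetical choices made for this example.

```python
import random


def wright_fisher(n_copies, start_count, generations, seed=0):
    """Follow the count of one allele among n_copies gene copies under drift.

    Assumptions (illustrative only): one locus, two alleles, no mutation,
    no selection, no recombination, constant population size.
    """
    rng = random.Random(seed)
    count = start_count
    trajectory = [count]
    for _ in range(generations):
        freq = count / n_copies
        # Every gene copy in the next generation picks its parent copy
        # uniformly at random, so the new count is Binomial(n_copies, freq).
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        trajectory.append(count)
        if count in (0, n_copies):
            # The allele has been lost or fixed; nothing further can change.
            break
    return trajectory


if __name__ == "__main__":
    print(wright_fisher(n_copies=100, start_count=50, generations=200, seed=1))
```

Each run is forward in time, one generation per loop iteration, in contrast to coalescent methods such as the ancestral recombination graph, which work backward from a sample.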

  • Proportionally more deleterious genetic variation in European than in African populations NATURE Lohmueller, K. E., Indap, A. R., Schmidt, S., Boyko, A. R., Hernandez, R. D., Hubisz, M. J., Sninsky, J. J., White, T. J., Sunyaev, S. R., Nielsen, R., Clark, A. G., Bustamante, C. D. 2008; 451 (7181): 994-U5


    Quantifying the number of deleterious mutations per diploid human genome is of crucial concern to both evolutionary and medical geneticists. Here we combine genome-wide polymorphism data from PCR-based exon resequencing, comparative genomic data across mammalian species, and protein structure predictions to estimate the number of functionally consequential single-nucleotide polymorphisms (SNPs) carried by each of 15 African American (AA) and 20 European American (EA) individuals. We find that AAs show significantly higher levels of nucleotide heterozygosity than do EAs for all categories of functional SNPs considered, including synonymous, non-synonymous, predicted 'benign', predicted 'possibly damaging' and predicted 'probably damaging' SNPs. This result is wholly consistent with previous work showing higher overall levels of nucleotide variation in African populations than in Europeans. EA individuals, in contrast, have significantly more genotypes homozygous for the derived allele at synonymous and non-synonymous SNPs and for the damaging allele at 'probably damaging' SNPs than AAs do. For SNPs segregating only in one population or the other, the proportion of non-synonymous SNPs is significantly higher in the EA sample (55.4%) than in the AA sample (47.0%; P < 2.3 x 10(-37)). We observe a similar proportional excess of SNPs that are inferred to be 'probably damaging' (15.9% in EA; 12.1% in AA; P < 3.3 x 10(-11)). Using extensive simulations, we show that this excess proportion of segregating damaging alleles in Europeans is probably a consequence of a bottleneck that Europeans experienced at about the time of the migration out of Africa.

    View details for DOI 10.1038/nature06611

  • Population genetics of polymorphism and divergence under fluctuating selection GENETICS Huerta-Sanchez, E., Durrett, R., Bustamante, C. D. 2008; 178 (1): 325-337


    Current methods for detecting fluctuating selection require time series data on genotype frequencies. Here, we propose an alternative approach that makes use of DNA polymorphism data from a sample of individuals collected at a single point in time. Our method uses classical diffusion approximations to model temporal fluctuations in the selection coefficients to find the expected distribution of mutation frequencies in the population. Using the Poisson random-field setting we derive the site-frequency spectrum (SFS) for three different models of fluctuating selection. We find that the general effect of fluctuating selection is to produce a more "U"-shaped site-frequency spectrum with an excess of high-frequency derived mutations at the expense of middle-frequency variants. We present likelihood-ratio tests, comparing the fluctuating selection models to the neutral model using SFS data, and use Monte Carlo simulations to assess their power. We find that we have sufficient power to reject a neutral hypothesis using samples on the order of a few hundred SNPs and a sample size of approximately 20 and power to distinguish between selection that varies in time and constant selection for a sample of size 20. We also find that fluctuating selection increases the probability of fixation of selected sites even if, on average, there is no difference in selection among a pair of alleles segregating at the locus. Fluctuating selection will, therefore, lead to an increase in the ratio of divergence to polymorphism similar to that observed under positive directional selection.

    View details for DOI 10.1534/genetics.107.073361

  • Recent and ongoing selection in the human genome NATURE REVIEWS GENETICS Nielsen, R., Hellmann, I., Hubisz, M., Bustamante, C., Clark, A. G. 2007; 8 (11): 857-868


    The recent availability of genome-scale genotyping data has led to the identification of regions of the human genome that seem to have been targeted by selection. These findings have increased our understanding of the evolutionary forces that affect the human genome, have augmented our knowledge of gene function and promise to increase our understanding of the genetic basis of disease. However, inferences of selection are challenged by several confounding factors, especially the complex demographic history of human populations, and concordance between studies is variable. Although such studies will always be associated with some uncertainty, steps can be taken to minimize the effects of confounding factors and improve our interpretation of their findings.

    View details for DOI 10.1038/nrg2187

  • Context-dependent mutation rates may cause spurious signatures of a fixation bias favoring higher GC-Content in humans MOLECULAR BIOLOGY AND EVOLUTION Hernandez, R. D., Williamson, S. H., Zhu, L., Bustamante, C. D. 2007; 24 (10): 2196-2202


    Understanding the proximate and ultimate causes underlying the evolution of nucleotide composition in mammalian genomes is of fundamental interest to the study of molecular evolution. Comparative genomics studies have revealed that many more substitutions occur from G and C nucleotides to A and T nucleotides than the reverse, suggesting that mammalian genomes are not at equilibrium for base composition. Analysis of human polymorphism data suggests that mutations that increase GC-content tend to be at much higher frequencies than those that decrease or preserve GC-content when the ancestral allele is inferred via parsimony using the chimpanzee genome. These observations have been interpreted as evidence for a fixation bias in favor of G and C alleles due to either positive natural selection or biased gene conversion. Here, we test the robustness of this interpretation to violations of the parsimony assumption using a data set of 21,488 noncoding single nucleotide polymorphisms (SNPs) discovered by the National Institute of Environmental Health Sciences (NIEHS) SNPs project via direct resequencing of n = 95 individuals. Applying standard nonparametric and parametric population genetic approaches, we replicate the signatures of a fixation bias in favor of G and C alleles when the ancestral base is assumed to be the base found in the chimpanzee outgroup. However, upon taking into account the probability of misidentifying the ancestral state of each SNP using a context-dependent mutation model, the corrected distribution of SNP frequencies for GC-content increasing SNPs are nearly indistinguishable from the patterns observed for other types of mutations, suggesting that the signature of fixation bias is a spurious artifact of the parsimony assumption.

    View details for DOI 10.1093/molbev/msm149

  • Genome-wide patterns of nucleotide polymorphism in domesticated rice PLOS GENETICS Caicedo, A. L., Williamson, S. H., Hernandez, R. D., Boyko, A., Fledel-Alon, A., York, T. L., Polato, N. R., Olsen, K. M., Nielsen, R., McCouch, S. R., Bustamante, C. D., Purugganan, M. D. 2007; 3 (9): 1745-1756


    Domesticated Asian rice (Oryza sativa) is one of the oldest domesticated crop species in the world, having fed more people than any other plant in human history. We report the patterns of DNA sequence variation in rice and its wild ancestor, O. rufipogon, across 111 randomly chosen gene fragments, and use these to infer the evolutionary dynamics that led to the origins of rice. There is a genome-wide excess of high-frequency derived single nucleotide polymorphisms (SNPs) in O. sativa varieties, a pattern that has not been reported for other crop species. We developed several alternative models to explain contemporary patterns of polymorphisms in rice, including a (i) selectively neutral population bottleneck model, (ii) bottleneck plus migration model, (iii) multiple selective sweeps model, and (iv) bottleneck plus selective sweeps model. We find that a simple bottleneck model, which has been the dominant demographic model for domesticated species, cannot explain the derived nucleotide polymorphism site frequency spectrum in rice. Instead, a bottleneck model that incorporates selective sweeps, or a more complex demographic model that includes subdivision and gene flow, are more plausible explanations for patterns of variation in domesticated rice varieties. If selective sweeps are indeed the explanation for the observed nucleotide data of domesticated rice, it suggests that strong selection can leave its imprint on genome-wide polymorphism patterns, contrary to expectations that selection results only in a local signature of variation.

    View details for DOI 10.1371/journal.pgen.0030163

  • Global dissemination of a single mutation conferring white pericarp in rice PLOS GENETICS Sweeney, M. T., Thomson, M. J., Cho, Y. G., Park, Y. J., Williamson, S. H., Bustamante, C. D., McCouch, S. R. 2007; 3 (8): 1418-1424


    Here we report that the change from the red seeds of wild rice to the white seeds of cultivated rice (Oryza sativa) resulted from the strong selective sweep of a single mutation, a frame-shift deletion within the Rc gene that is found in 97.9% of white rice varieties today. A second mutation, also within Rc, is present in less than 3% of white accessions surveyed. Haplotype analysis revealed that the predominant mutation originated in the japonica subspecies and crossed both geographic and sterility barriers to move into the indica subspecies. A little less than one Mb of japonica DNA hitchhiked with the rc allele into most indica varieties, suggesting that other linked domestication alleles may have been transferred from japonica to indica along with white pericarp color. Our finding provides evidence of active cultural exchange among ancient farmers over the course of rice domestication coupled with very strong, positive selection for a single white allele in both subspecies of O. sativa.

    View details for DOI 10.1371/journal.pgen.0030133

  • Context dependence, ancestral misidentification, and spurious signatures of natural selection MOLECULAR BIOLOGY AND EVOLUTION Hernandez, R. D., Williamson, S. H., Bustamante, C. D. 2007; 24 (8): 1792-1800


    Population genetic analyses often use polymorphism data from one species, and orthologous genomic sequences from closely related outgroup species. These outgroup sequences are frequently used to identify ancestral alleles at segregating sites and to compare the patterns of polymorphism and divergence. Inherent in such studies is the assumption of parsimony, which posits that the ancestral state of each single nucleotide polymorphism (SNP) is the allele that matches the orthologous site in the outgroup sequence, and that all nucleotide substitutions between species have been observed. This study tests the effect of violating the parsimony assumption when mutation rates vary across sites and over time. Using a context-dependent mutation model that accounts for elevated mutation rates at CpG dinucleotides, increased propensity for transitional versus transversional mutations, as well as other directional and contextual mutation biases estimated along the human lineage, we show (using both simulations and a theoretical model) that enough unobserved substitutions could have occurred since the divergence of human and chimpanzee to cause many statistical tests to spuriously reject neutrality. Moreover, using both the chimpanzee and rhesus macaque genomes to parsimoniously identify ancestral states causes a large fraction of the data to be removed while not completely alleviating the problem. By constructing a novel model of the context-dependent mutation process, we can correct polymorphism data for the effect of ancestral misidentification using a single outgroup.

    View details for DOI 10.1093/molbev/msm108

  • On the utility of linkage disequilibrium as a statistic for identifying targets of positive selection in nonequilibrium populations GENETICS Jensen, J. D., Thornton, K. R., Bustamante, C. D., Aquadro, C. E. 2007; 176 (4): 2371-2379


    A critically important challenge in empirical population genetics is distinguishing neutral nonequilibrium processes from selective forces that produce similar patterns of variation. We here examine the extent to which linkage disequilibrium (i.e., nonrandom associations between markers) improves this discrimination. We show that patterns of linkage disequilibrium recently proposed to be unique to hitchhiking models are replicated under nonequilibrium neutral models. We also demonstrate that jointly considering spatial patterns of association among variants alongside the site-frequency spectrum is nonetheless of value. Through a comparison of models of equilibrium neutrality, nonequilibrium neutrality, equilibrium hitchhiking, nonequilibrium hitchhiking, and recurrent hitchhiking, we evaluate a linkage disequilibrium (LD) statistic (omega(max)) that appears to have power to identify regions recently shaped by positive selection. Most notably, for demographic parameters relevant to non-African populations of Drosophila melanogaster, we demonstrate that selected loci are distinguishable from neutral loci using this statistic.

    View details for DOI 10.1534/genetics.106.069450

  • A Markov chain Monte Carlo approach for joint inference of population structure and inbreeding rates from multilocus genotype data GENETICS Gao, H., Williamson, S., Bustamante, C. D. 2007; 176 (3): 1635-1651


    Nonrandom mating induces correlations in allelic states within and among loci that can be exploited to understand the genetic structure of natural populations (Wright 1965). For many species, it is of considerable interest to quantify the contribution of two forms of nonrandom mating to patterns of standing genetic variation: inbreeding (mating among relatives) and population substructure (limited dispersal of gametes). Here, we extend the popular Bayesian clustering approach STRUCTURE (Pritchard et al. 2000) for simultaneous inference of inbreeding or selfing rates and population-of-origin classification using multilocus genetic markers. This is accomplished by eliminating the assumption of Hardy-Weinberg equilibrium within clusters and, instead, calculating expected genotype frequencies on the basis of inbreeding or selfing rates. We demonstrate the need for such an extension by showing that selfing leads to spurious signals of population substructure using the standard STRUCTURE algorithm with a bias toward spurious signals of admixture. We gauge the performance of our method using extensive coalescent simulations and demonstrate that our approach can correct for this bias. We also apply our approach to understanding the population structure of the wild relative of domesticated rice, Oryza rufipogon, an important partially selfing grass species. Using a sample of n = 16 individuals sequenced at 111 random loci, we find strong evidence for existence of two subpopulations, which correlates well with geographic location of sampling, and estimate selfing rates for both groups that are consistent with estimates from experimental data (s approximately 0.48-0.70).

    View details for DOI 10.1534/genetics.107.072371
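
The idea of replacing Hardy-Weinberg proportions with genotype frequencies that depend on an inbreeding or selfing rate can be sketched with the standard textbook formulas. The example below assumes a biallelic locus, an inbreeding coefficient F, and the equilibrium relation F = s / (2 - s) for a selfing rate s; the abstract does not give the exact parameterization used in the paper, so the function names and this formulation are assumptions made for illustration.

```python
def inbreeding_coefficient(selfing_rate):
    """Equilibrium inbreeding coefficient F = s / (2 - s) under partial selfing."""
    return selfing_rate / (2.0 - selfing_rate)


def genotype_frequencies(p, f):
    """Expected genotype frequencies at a biallelic locus.

    p is the frequency of allele A and f is the inbreeding coefficient.
    Setting f = 0 recovers Hardy-Weinberg proportions.
    """
    q = 1.0 - p
    return {
        "AA": p * p + f * p * q,        # extra homozygotes from inbreeding
        "Aa": 2.0 * p * q * (1.0 - f),  # heterozygote deficit
        "aa": q * q + f * p * q,
    }


if __name__ == "__main__":
    # Selfing rates spanning the range reported in the abstract above.
    for s in (0.0, 0.48, 0.70):
        f = inbreeding_coefficient(s)
        print(s, round(f, 3), genotype_frequencies(0.3, f))
```

The heterozygote deficit relative to Hardy-Weinberg expectations is what a clustering method can misread as population substructure when selfing is ignored.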

  • Localizing recent adaptive evolution in the human genome PLOS GENETICS Williamson, S. H., Hubisz, M. J., Clark, A. G., Payseur, B. A., Bustamante, C. D., Nielsen, R. 2007; 3 (6): 901-915


    Identifying genomic locations that have experienced selective sweeps is an important first step toward understanding the molecular basis of adaptive evolution. Using statistical methods that account for the confounding effects of population demography, recombination rate variation, and single-nucleotide polymorphism ascertainment, while also providing fine-scale estimates of the position of the selected site, we analyzed a genomic dataset of 1.2 million human single-nucleotide polymorphisms genotyped in African-American, European-American, and Chinese samples. We identify 101 regions of the human genome with very strong evidence (p < 10(-5)) of a recent selective sweep and where our estimate of the position of the selective sweep falls within 100 kb of a known gene. Within these regions, genes of biological interest include genes in pigmentation pathways, components of the dystrophin protein complex, clusters of olfactory receptors, genes involved in nervous system development and function, immune system genes, and heat shock genes. We also observe consistent evidence of selective sweeps in centromeric regions. In general, we find that recent adaptation is strikingly pervasive in the human genome, with as much as 10% of the genome affected by linkage to a selective sweep.

    View details for DOI 10.1371/journal.pgen.0030090

  • A mutation in the myostatin gene increases muscle mass and enhances racing performance in heterozygote dogs PLOS GENETICS Mosher, D. S., Quignon, P., Bustamante, C. D., Sutter, N. B., Mellersh, C. S., Parker, H. G., Ostrander, E. A. 2007; 3 (5): 779-786


    Double muscling is a trait previously described in several mammalian species including cattle and sheep and is caused by mutations in the myostatin (MSTN) gene (previously referred to as GDF8). Here we describe a new mutation in MSTN found in the whippet dog breed that results in a double-muscled phenotype known as the "bully" whippet. Individuals with this phenotype carry two copies of a two-base-pair deletion in the third exon of MSTN leading to a premature stop codon at amino acid 313. Individuals carrying only one copy of the mutation are, on average, more muscular than wild-type individuals (p = 7.43 x 10(-6); Kruskal-Wallis Test) and are significantly faster than individuals carrying the wild-type genotype in competitive racing events (Kendall's nonparametric measure, tau = 0.3619; p approximately 0.00028). These results highlight the utility of performance-enhancing polymorphisms, marking the first time a mutation in MSTN has been quantitatively linked to increased athletic performance.

    View details for DOI 10.1371/journal.pgen.0030079

  • Evolutionary and biomedical insights from the rhesus macaque genome SCIENCE Gibbs, R. A., Rogers, J., Katze, M. G., Bumgarner, R., Weinstock, G. M., Mardis, E. R., Remington, K. A., Strausberg, R. L., Venter, J. C., Wilson, R. K., Batzer, M. A., Bustamante, C. D., Eichler, E. E., Hahn, M. W., Hardison, R. C., Makova, K. D., Miller, W., Milosavljevic, A., Palermo, R. E., Siepel, A., Sikela, J. M., Attaway, T., Bell, S., Bernard, K. E., Buhay, C. J., Chandrabose, M. N., Dao, M., Davis, C., Delehaunty, K. D., Ding, Y., Dinh, H. H., Dugan-Rocha, S., Fulton, L. A., Gabisi, R. A., Garner, T. T., Godfrey, J., Hawes, A. C., Hernandez, J., Hines, S., Holder, M., Hume, J., Jhangiani, S. N., Joshi, V., Khan, Z. M., Kirkness, E. F., Cree, A., Fowler, R. G., Lee, S., Lewis, L. R., Li, Z., Liu, Y., Moore, S. M., Muzny, D., Nazareth, L. V., Ngo, D. N., Okwuonu, G. O., Pai, G., Parker, D., Paul, H. A., Pfannkoch, C., Pohl, C. S., Rogers, Y., Ruiz, S. J., Sabo, A., Santibanez, J., Schneider, B. W., Smith, S. M., Sodergren, E., Svatek, A. F., Utterback, T. R., Vattathil, S., Warren, W., White, C. S., Chinwalla, A. T., Feng, Y., Halpern, A. L., Hillier, L. W., Huang, X., Minx, P., Nelson, J. O., Pepin, K. H., Qin, X., Sutton, G. G., Venter, E., Walenz, B. P., Wallis, J. W., Worley, K. C., Yang, S., Jones, S. M., Marra, M. A., Rocchi, M., Schein, J. E., Baertsch, R., Clarke, L., Csuros, M., Glasscock, J., Harris, R. A., Haviak, P., Jackson, A. R., Jiang, H., Liu, Y., Messina, D. N., Shen, Y., Song, H. X., Wylie, T., Zhang, L., Birney, E., Han, K., Konkel, M. K., Lee, J., Smit, A. F., Ullmer, B., Wang, H., Xing, J., Burhans, R., Cheng, Z., Karro, J. E., Ma, J., Raney, B., She, X., Cox, M. J., Demuth, J. P., Dumas, L. J., Han, S., Hopkins, J., Karimpour-Fard, A., Kim, Y. H., Pollack, J. R., Vinar, T., Addo-Quaye, C., Degenhardt, J., Denby, A., Hubisz, M. J., Indap, A., Kosiol, C., Lahn, B. T., Lawson, H. A., Marklein, A., Nielsen, R., Vallender, E. J., Clark, A. G., Ferguson, B., Hernandez, R. D., Hirani, K., Kehrer-Sawatzki, H., Kolb, J., Patil, S., Pu, L., Ren, Y., Smith, D. G., Wheeler, D. A., Schenck, I., Ball, E. V., Chen, R., Cooper, D. N., Giardine, B., Hsu, F., Kent, W. J., Lesk, A., Nelson, D. L., O'Brien, W. E., Prufer, K., Stenson, P. D., Wallace, J. C., Ke, H., Liu, X., Wang, P., Xiang, A. P., Yang, F., Barber, G. P., Haussler, D., Karolchik, D., Kern, A. D., Kuhn, R. M., Smith, K. E., Zwieg, A. S. 2007; 316 (5822): 222-234


    The rhesus macaque (Macaca mulatta) is an abundant primate species that diverged from the ancestors of Homo sapiens about 25 million years ago. Because they are genetically and physiologically similar to humans, rhesus monkeys are the most widely used nonhuman primate in basic and applied biomedical research. We determined the genome sequence of an Indian-origin Macaca mulatta female and compared the data with chimpanzees and humans to reveal the structure of ancestral primate genomes and to identify evidence for positive selection and lineage-specific expansions and contractions of gene families. A comparison of sequences from individual animals was used to investigate their underlying genetic diversity. The complete description of the macaque genome blueprint enhances the utility of this animal model for biomedical research and improves our understanding of the basic biology of the species.

    View details for DOI 10.1126/science.1139247

  • Demographic histories and patterns of linkage disequilibrium in Chinese and Indian rhesus macaques SCIENCE Hernandez, R. D., Hubisz, M. J., Wheeler, D. A., Smith, D. G., Ferguson, B., Rogers, J., Nazareth, L., Indap, A., Bourquin, T., McPherson, J., Muzny, D., Gibbs, R., Nielsen, R., Bustamante, C. D. 2007; 316 (5822): 240-243


    To understand the demographic history of rhesus macaques (Macaca mulatta) and document the extent of linkage disequilibrium (LD) in the genome, we partially resequenced five Encyclopedia of DNA Elements regions in 9 Chinese and 38 captive-born Indian rhesus macaques. Population genetic analyses of the 1467 single-nucleotide polymorphisms discovered suggest that the two populations separated about 162,000 years ago, with the Chinese population tripling in size since then and the Indian population eventually shrinking by a factor of four. Using coalescent simulations, we confirmed that these inferred demographic events explain a much faster decay of LD in Chinese (r(2) approximately 0.15 at 10 kilobases) versus Indian (r(2) approximately 0.52 at 10 kilobases) macaque populations.

    View details for DOI 10.1126/science.1140462
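
For readers unfamiliar with the r(2) statistic quoted above, the sketch below computes the squared correlation between alleles at two biallelic sites from phased haplotypes. It is a generic textbook calculation, not the pipeline used in the study; the function name `r_squared` and the 0/1 haplotype encoding are assumptions made for this example.

```python
def r_squared(haplotypes):
    """Squared allelic correlation between two biallelic sites.

    haplotypes is a list of (a, b) pairs, where a and b are 0 or 1 and give
    the allele carried at the first and second site of one chromosome.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # allele-1 frequency, site 1
    p_b = sum(b for _, b in haplotypes) / n   # allele-1 frequency, site 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # linkage disequilibrium coefficient D
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("both sites must be polymorphic in the sample")
    return d * d / denom


if __name__ == "__main__":
    linked = [(0, 0), (1, 1)] * 5                      # alleles always co-occur
    unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5    # alleles independent
    print(r_squared(linked), r_squared(unlinked))
```

Values near 1 mean the two sites are almost perfectly correlated, and values near 0 mean they are close to independent, which is why a lower r(2) at a fixed distance indicates faster decay of LD.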

  • A single IGF1 allele is a major determinant of small size in dogs SCIENCE Sutter, N. B., Bustamante, C. D., Chase, K., Gray, M. M., Zhao, K., Zhu, L., Padhukasahasram, B., Karlins, E., Davis, S., Jones, P. G., Quignon, P., Johnson, G. S., Parker, H. G., Fretwell, N., Mosher, D. S., Lawler, D. F., Satyaraj, E., Nordborg, M., Lark, K. G., Wayne, R. K., Ostrander, E. A. 2007; 316 (5821): 112-115


    The domestic dog exhibits greater diversity in body size than any other terrestrial vertebrate. We used a strategy that exploits the breed structure of dogs to investigate the genetic basis of size. First, through a genome-wide scan, we identified a major quantitative trait locus (QTL) on chromosome 15 influencing size variation within a single breed. Second, we examined genetic variation in the 15-megabase interval surrounding the QTL in small and giant breeds and found marked evidence for a selective sweep spanning a single gene (IGF1), encoding insulin-like growth factor 1. A single IGF1 single-nucleotide polymorphism haplotype is common to all small breeds and nearly absent from giant breeds, suggesting that the same causal sequence variant is a major contributor to body size in all small dogs.

    View details for DOI 10.1126/science.1137045

  • Human Genome Variation 2006: emerging views on structural variation and large-scale SNP analysis NATURE GENETICS Abecasis, G., Tam, P. K., Bustamante, C. D., Ostrander, E. A., Scherer, S. W., Chanock, S. J., Kwok, P., Brookes, A. J. 2007; 39 (2): 153-155


    The eighth annual Human Genome Variation Meeting was held in September 2006 in the Hong Kong Special Administrative Region, China. The meeting highlighted recent advances in characterization of genetic variation, including genome-wide association studies and structural variation.

  • Selective sweep mapping of genes with large phenotypic effects GENOME RESEARCH Pollinger, J. P., Bustamante, C. D., Fledel-Alon, A., Schmutz, S., Gray, M. M., Wayne, R. K. 2005; 15 (12): 1809-1819


    Many domestic dog breeds have originated through fixation of discrete mutations by intense artificial selection. As a result of this process, markers in the proximity of genes influencing breed-defining traits will have reduced variation (a selective sweep) and will show divergence in allele frequency. Consequently, low-resolution genomic scans can potentially be used to identify regions containing genes that have a major influence on breed-defining traits. We model the process of breed formation and show that the probability of two or three adjacent marker loci showing a spurious signal of selection within at least one breed (i.e., Type I error or false-positive rate) is low if highly variable and moderately spaced markers are utilized. We also use simulations with selection to demonstrate that even a moderately spaced set of highly polymorphic markers (e.g., one every 0.8 cM) has high power to detect regions targeted by strong artificial selection in dogs. Further, we show that a gene responsible for black coat color in the Large Munsterlander has a 40-Mb region surrounding the gene that is very low in heterozygosity for microsatellite markers. Similarly, we survey 302 microsatellite markers in the Dachshund and find three linked monomorphic microsatellite markers all within a 10-Mb region on chromosome 3. This region contains the FGFR3 gene, which is responsible for achondroplasia in humans, but not in dogs. Consequently, our results suggest that the causative mutation is a gene or regulatory region closely linked to FGFR3.

    View details for DOI 10.1101/gr.4374505

  • Ascertainment bias in studies of human genome-wide polymorphism GENOME RESEARCH Clark, A. G., Hubisz, M. J., Bustamante, C. D., Williamson, S. H., Nielsen, R. 2005; 15 (11): 1496-1502


    Large-scale SNP genotyping studies rely on an initial assessment of nucleotide variation to identify sites in the DNA sequence that harbor variation among individuals. This "SNP discovery" sample may be quite variable in size and composition, and it has been well established that properties of the SNPs that are found are influenced by the discovery sampling effort. The International HapMap project relied on nearly any piece of information available to identify SNPs (including BAC end sequences, shotgun reads, and differences between public and private sequences) and even made use of chimpanzee data to confirm human sequence differences. In addition, the ascertainment criteria shifted from using only SNPs that had been validated in population samples, to double-hit SNPs, to finally accepting SNPs that were singletons in small discovery samples. In contrast, Perlegen's primary discovery was a resequencing-by-hybridization effort using the 24 people of diverse origin in the Polymorphism Discovery Resource. Here we take these two data sets and contrast two basic summary statistics, heterozygosity and F(ST), as well as the site frequency spectra, for 500-kb windows spanning the genome. The magnitude of disparity between these samples in these measures of variability indicates that population genetic analysis on the raw genotype data is ill advised. Given the knowledge of the discovery samples, we perform an ascertainment correction and show how the post-correction data are more consistent across these studies. However, discrepancies persist, suggesting that the heterogeneity in the SNP discovery process of the HapMap project resulted in a data set resistant to complete ascertainment correction. Ascertainment bias will likely erode the power of tests of association between SNPs and complex disorders, but the effect will likely be small, and perhaps more importantly, it is unlikely that the bias will introduce false-positive inferences.

    View details for DOI 10.1101/gr.4107905
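
    The two summary statistics contrasted in the abstract above, heterozygosity and F(ST), can be illustrated for a single biallelic SNP. This is a minimal sketch that assumes two equally weighted populations and known allele frequencies; the function names are hypothetical, and it does not reproduce the windowed estimators or the ascertainment correction used in the paper.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(p1, p2):
    """F_ST = (H_T - H_S) / H_T for two equally weighted populations (assumption)."""
    p_bar = (p1 + p2) / 2.0
    h_total = expected_heterozygosity(p_bar)  # heterozygosity of the pooled population
    h_within = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # monomorphic site: no differentiation can be measured; 0 by convention
    return (h_total - h_within) / h_total
```

    Identical frequencies give F(ST) = 0 and fixed differences give F(ST) = 1, so a discovery scheme that favors common, widely shared SNPs would be expected to shift both statistics.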

  • Genomic scans for selective sweeps using SNP data GENOME RESEARCH Nielsen, R., Williamson, S., Kim, Y., Hubisz, M. J., Clark, A. G., Bustamante, C. 2005; 15 (11): 1566-1575


    Detecting selective sweeps from genomic SNP data is complicated by the intricate ascertainment schemes used to discover SNPs, and by the confounding influence of the underlying complex demographics and varying mutation and recombination rates. Current methods for detecting selective sweeps have little or no robustness to the demographic assumptions and varying recombination rates, and provide no method for correcting for ascertainment biases. Here, we present several new tests aimed at detecting selective sweeps from genomic SNP data. Using extensive simulations, we show that a new parametric test, based on composite likelihood, has a high power to detect selective sweeps and is surprisingly robust to assumptions regarding recombination rates and demography (i.e., has low Type I error). Our new test also provides estimates of the location of the selective sweep(s) and the magnitude of the selection coefficient. To illustrate the method, we apply our approach to data from the Seattle SNP project and to Chromosome 2 data from the HapMap project. In Chromosome 2, the most extreme signal is found in the lactase gene, which previously has been shown to be undergoing positive selection. Evidence for selective sweeps is also found in many other regions, including genes known to be associated with disease risk such as DPP10 and COL4A3.

    View details for DOI 10.1101/gr.4252305

  • Natural selection on protein-coding genes in the human genome NATURE Bustamante, C. D., Fledel-Alon, A., Williamson, S., Nielsen, R., Hubisz, M. T., Glanowski, S., Tanenbaum, D. M., White, T. J., Sninsky, J. J., Hernandez, R. D., Civello, D., Adams, M. D., Cargill, M., Clark, A. G. 2005; 437 (7062): 1153-1157


    Comparisons of DNA polymorphism within species to divergence between species enable the discovery of molecular adaptation in evolutionarily constrained genes as well as the differentiation of weak from strong purifying selection. The extent to which weak negative and positive darwinian selection have driven the molecular evolution of different species varies greatly, with some species, such as Drosophila melanogaster, showing strong evidence of pervasive positive selection, and others, such as the selfing weed Arabidopsis thaliana, showing an excess of deleterious variation within local populations. Here we contrast patterns of coding sequence polymorphism identified by direct sequencing of 39 humans for over 11,000 genes to divergence between humans and chimpanzees, and find strong evidence that natural selection has shaped the recent molecular evolution of our species. Our analysis discovered 304 (9.0%) out of 3,377 potentially informative loci showing evidence of rapid amino acid evolution. Furthermore, 813 (13.5%) out of 6,033 potentially informative loci show a paucity of amino acid differences between humans and chimpanzees, indicating weak negative selection and/or balancing selection operating on mutations at these loci. We find that the distribution of negatively and positively selected genes varies greatly among biological processes and molecular functions, and that some classes, such as transcription factors, show an excess of rapidly evolving genes, whereas others, such as cytoskeletal proteins, show an excess of genes with extensive amino acid polymorphism within humans and yet little amino acid divergence between humans and chimpanzees.

    View details for DOI 10.1038/nature04240
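
    The comparison of polymorphism with divergence described above is often summarized per gene as a 2x2 table in the style of the McDonald-Kreitman test. The sketch below is an illustration under that assumption, not the Bayesian analysis used in the paper: it computes the neutrality index and a two-sided Fisher exact p-value with the standard library only. The function name and the example counts are hypothetical.

```python
from math import comb


def mk_test(dn, ds, pn, ps):
    """Return (neutrality_index, fisher_p) for a McDonald-Kreitman style table.

    dn, ds: nonsynonymous / synonymous fixed differences between species
    pn, ps: nonsynonymous / synonymous polymorphisms within species
    """
    # Neutrality index (pn/ps)/(dn/ds); undefined when a denominator count is zero.
    ni = (pn * ds) / (ps * dn) if ps > 0 and dn > 0 else float("nan")

    total = dn + ds + pn + ps
    row1 = dn + ds   # all fixed differences
    col1 = dn + pn   # all nonsynonymous changes

    def prob(a):
        # Hypergeometric probability of a nonsynonymous fixed differences.
        return comb(col1, a) * comb(total - col1, row1 - a) / comb(total, row1)

    p_obs = prob(dn)
    lo, hi = max(0, row1 - (total - col1)), min(row1, col1)
    # Two-sided p-value: sum over tables no more probable than the observed one.
    p_value = sum(prob(a) for a in range(lo, hi + 1) if prob(a) <= p_obs * (1 + 1e-9))
    return ni, min(p_value, 1.0)


# Hypothetical counts: an excess of nonsynonymous fixed differences.
ni, p = mk_test(dn=7, ds=17, pn=2, ps=42)
```

    A neutrality index below 1 with a small p-value points toward an excess of amino acid divergence, while an index above 1 points toward an excess of amino acid polymorphism.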

  • Distinguishing between selective sweeps and demography using DNA polymorphism data GENETICS Jensen, J. D., Kim, Y., DuMont, V. B., Aquadro, C. F., Bustamante, C. D. 2005; 170 (3): 1401-1410


    In 2002 Kim and Stephan proposed a promising composite-likelihood method for localizing and estimating the fitness advantage of a recently fixed beneficial mutation. Here, we demonstrate that their composite-likelihood-ratio (CLR) test comparing selective and neutral hypotheses is not robust to undetected population structure or a recent bottleneck, with some parameter combinations resulting in a false positive rate of nearly 90%. We also propose a goodness-of-fit test for discriminating rejections due to directional selection (true positive) from those due to population and demographic forces (false positives) and demonstrate that the new method has high sensitivity to differentiate the two classes of rejections.

    View details for DOI 10.1534/genetics.104.038224

  • A composite-likelihood approach for detecting directional selection from DNA sequence data GENETICS Zhu, L., Bustamante, C. D. 2005; 170 (3): 1411-1421


    We present a novel composite-likelihood-ratio test (CLRT) for detecting genes and genomic regions that are subject to recurrent natural selection (either positive or negative). The method uses the likelihood functions of Hartl et al. (1994) for inference in a Wright-Fisher genic selection model and corrects for nonindependence among sites by application of coalescent simulations with recombination. Here, we (1) characterize the distribution of the CLRT statistic (Lambda) as a function of the population recombination rate (R = 4N(e)r); (2) explore the effects of bias in estimation of R on the size (type I error) of the CLRT; (3) explore the robustness of the model to population growth, bottlenecks, and migration; (4) explore the power of the CLRT under varying levels of mutation, selection, and recombination; (5) explore the discriminatory power of the test in distinguishing negative selection from population growth; and (6) evaluate the performance of maximum composite-likelihood estimation (MCLE) of the selection coefficient. We find that the test has excellent power to detect weak negative selection and moderate power to detect positive selection. Moreover, the test is quite robust to bias in the estimate of local recombination rate, but not to certain demographic scenarios such as population growth or a recent bottleneck. Last, we demonstrate that the MCLE of the selection parameter has little bias for weak negative selection and has downward bias for positively selected mutations.

    View details for DOI 10.1534/genetics.104.035097

  • Detecting coevolving amino acid sites using Bayesian mutational mapping 13th International Conference on Intelligent Systems for Molecular Biology Dimmic, M. W., Hubisz, M. J., Bustamante, C. D., Nielsen, R. OXFORD UNIV PRESS. 2005: I126–I135


    The evolution of protein sequences is constrained by complex interactions between amino acid residues. Because harmful substitutions may be compensated for by other substitutions at neighboring sites, residues can coevolve. We describe a Bayesian phylogenetic approach to the detection of coevolving residues in protein families. This method, Bayesian mutational mapping (BMM), assigns mutations to the branches of the evolutionary tree stochastically, and then test statistics are calculated to determine whether a coevolutionary signal exists in the mapping. Posterior predictive P-values provide an estimate of significance, and specificity is maintained by integrating over uncertainty in the estimation of the tree topology, branch lengths and substitution rates. A coevolutionary Markov model for codon substitution is also described, and this model is used as the basis of several test statistics. Results on simulated coevolutionary data indicate that the BMM method can successfully detect nearly all coevolving sites when the model has been correctly specified, and that non-parametric statistics such as mutual information are generally less powerful than parametric statistics. On a dataset of eukaryotic proteins from the phosphoglycerate kinase (PGK) family, interdomain site contacts yield a significantly greater coevolutionary signal than interdomain non-contacts, an indication that the method provides information about interacting sites. Failure to account for the heterogeneity in rates across sites in PGK resulted in a less discriminating test, yielding a marked increase in the number of reported positives at both contact and non-contact sites.

    View details for DOI 10.1093/bioinformatics/bti1032

  • A scan for positively selected genes in the genomes of humans and chimpanzees PLOS BIOLOGY Nielsen, R., Bustamante, C., Clark, A. G., Glanowski, S., Sackton, T. B., Hubisz, M. J., Fledel-Alon, A., Tanenbaum, D. M., Civello, D., White, T. J., Sninsky, J. J., Adams, M. D., Cargill, M. 2005; 3 (6): 976-985


    Since the divergence of humans and chimpanzees about 5 million years ago, these species have undergone a remarkable evolution with drastic divergence in anatomy and cognitive abilities. At the molecular level, despite the small overall magnitude of DNA sequence divergence, we might expect such evolutionary changes to leave a noticeable signature throughout the genome. We here compare 13,731 annotated genes from humans to their chimpanzee orthologs to identify genes that show evidence of positive selection. Many of the genes that present a signature of positive selection tend to be involved in sensory perception or immune defenses. However, the group of genes that show the strongest evidence for positive selection also includes a surprising number of genes involved in tumor suppression and apoptosis, and of genes involved in spermatogenesis. We hypothesize that positive selection in some of these genes may be driven by genomic conflict due to apoptosis during spermatogenesis. Genes with maximal expression in the brain show little or no evidence for positive selection, while genes with maximal expression in the testis tend to be enriched with positively selected genes. Genes on the X chromosome also tend to show an elevated tendency for positive selection. We also present polymorphism data from 20 Caucasian Americans and 19 African Americans for the 50 annotated genes showing the strongest evidence for positive selection. The polymorphism analysis further supports the presence of positive selection in these genes by showing an excess of high-frequency derived nonsynonymous mutations.

    View details for DOI 10.1371/journal.pbio.0030170

  • Simultaneous inference of selection and population growth from patterns of variation in the human genome PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Williamson, S. H., Hernandez, R., Fledel-Alon, A., Zhu, L., Nielsen, R., Bustamante, C. D. 2005; 102 (22): 7882-7887


    Natural selection and demographic forces can have similar effects on patterns of DNA polymorphism. Therefore, to infer selection from samples of DNA sequences, one must simultaneously account for demographic effects. Here we take a model-based approach to this problem by developing predictions for patterns of polymorphism in the presence of both population size change and natural selection. If data are available from different functional classes of variation, and a priori information suggests that mutations in one of those classes are selectively neutral, then the putatively neutral class can be used to infer demographic parameters, and inferences regarding selection on other classes can be performed given demographic parameter estimates. This procedure is more robust to assumptions regarding the true underlying demography than previous approaches to detecting and analyzing selection. We apply this method to a large polymorphism data set from 301 human genes and find (i) widespread negative selection acting on standing nonsynonymous variation, (ii) that the fitness effects of nonsynonymous mutations are well predicted by several measures of amino acid exchangeability, especially site-specific methods, and (iii) strong evidence for very recent population growth.

    View details for DOI 10.1073/pnas.0502300102

  • A statistical characterization of consistent patterns of human immunodeficiency virus evolution within infected patients MOLECULAR BIOLOGY AND EVOLUTION Williamson, S., Perry, S. M., Bustamante, C. D., Orive, M. E., Stearns, M. N., Kelly, J. K. 2005; 22 (3): 456-468


    Within-patient HIV populations evolve rapidly because of a high mutation rate, short generation time, and strong positive selection pressures. Previous studies have identified "consistent patterns" of viral sequence evolution. Just before HIV infection progresses to AIDS, evolution seems to slow markedly, and the genetic diversity of the viral population drops. This evolutionary slowdown could be caused either by a reduction in the average viral replication rate or because selection pressures weaken with the collapse of the immune system. The former hypothesis (which we denote "cellular exhaustion") predicts a simultaneous reduction in both synonymous and nonsynonymous evolution, whereas the latter hypothesis (denoted "immune relaxation") predicts that only nonsynonymous evolution will slow. In this paper, we present a set of statistical procedures for distinguishing between these alternative hypotheses using DNA sequences sampled over the course of infection. The first component is a new method for estimating evolutionary rates that takes advantage of the temporal information in longitudinal DNA sequence samples. Second, we develop a set of probability models for the analysis of evolutionary rates in HIV populations in vivo. Application of these models to both synonymous and nonsynonymous evolution affords a comparison of the cellular-exhaustion and immune-relaxation hypotheses. We apply the procedures to longitudinal data sets in which sequences of the env gene were sampled over the entire course of infection. Our analyses (1) statistically confirm that an evolutionary slowdown occurs late in infection, (2) strongly support the immune-relaxation hypothesis, and (3) indicate that the cessation of nonsynonymous evolution is associated with disease progression.

    View details for Web of Science ID 000227163100012

    View details for PubMedID 15509726

  • Inferring SNP function using evolutionary, structural, and computational methods Pacific Symposium on Biocomputing Dimmic, M. W., Sunyaev, S., Bustamante, C. D. 2005: 382-384

  • Population genetics of polymorphism and divergence for diploid selection models with arbitrary dominance GENETICS Williamson, S., Fledel-Alon, A., Bustamante, C. D. 2004; 168 (1): 463-475


    We develop a Poisson random-field model of polymorphism and divergence that allows arbitrary dominance relations in a diploid context. This model provides a maximum-likelihood framework for estimating both selection and dominance parameters of new mutations using information on the frequency spectrum of sequence polymorphisms. This is the first DNA sequence-based estimator of the dominance parameter. Our model also leads to a likelihood-ratio test for distinguishing nongenic from genic selection; simulations indicate that this test is quite powerful when a large number of segregating sites are available. We also use simulations to explore the bias in selection parameter estimates caused by unacknowledged dominance relations. When inference is based on the frequency spectrum of polymorphisms, genic selection estimates of the selection parameter can be very strongly biased even for minor deviations from the genic selection model. Surprisingly, however, when inference is based on polymorphism and divergence (McDonald-Kreitman) data, genic selection estimates of the selection parameter are nearly unbiased, even for completely dominant or recessive mutations. Further, we find that weak overdominant selection can increase, rather than decrease, the substitution rate relative to levels of polymorphism. This nonintuitive result has major implications for the interpretation of several popular tests of neutrality.

    View details for DOI 10.1534/genetics.103.024745

  • Natural selection on the olfactory receptor gene family in humans and chimpanzees AMERICAN JOURNAL OF HUMAN GENETICS Gilad, Y., Bustamante, C. D., Lancet, D., Paabo, S. 2003; 73 (3): 489-501


    The olfactory receptor (OR) genes constitute the largest gene family in mammalian genomes. Humans have >1,000 OR genes, of which only approximately 40% have an intact coding region and are therefore putatively functional. In contrast, the fraction of intact OR genes in the genomes of the great apes is significantly greater (68%-72%), suggesting that selective pressures on the OR repertoire vary among these species. We have examined the evolutionary forces that shaped the OR gene family in humans and chimpanzees by resequencing 20 OR genes in 16 humans, 16 chimpanzees, and one orangutan. We compared the variation at the OR genes with that at intergenic regions. In both humans and chimpanzees, OR pseudogenes seem to evolve neutrally. In chimpanzees, patterns of variability are consistent with purifying selection acting on intact OR genes, whereas, in humans, there is suggestive evidence for positive selection acting on intact OR genes. These observations are likely due to differences in lifestyle, between humans and great apes, that have led to distinct sensory needs.

  • Maximum likelihood and Bayesian methods for estimating the distribution of selective effects among classes of mutations using DNA polymorphism data THEORETICAL POPULATION BIOLOGY Bustamante, C. D., Nielsen, R., Hartl, D. L. 2003; 63 (2): 91-103


    Maximum likelihood and Bayesian approaches are presented for analyzing hierarchical statistical models of natural selection operating on DNA polymorphism within a panmictic population. For analyzing Bayesian models, we present Markov chain Monte-Carlo (MCMC) methods for sampling from the joint posterior distribution of parameters. For frequentist analysis, an Expectation-Maximization (EM) algorithm is presented for finding the maximum likelihood estimate of the genome wide mean and variance in selection intensity among classes of mutations. The framework presented here provides an ideal setting for modeling mutations dispersed through the genome and, in particular, for the analysis of how natural selection operates on different classes of single nucleotide polymorphisms (SNPs).

    View details for DOI 10.1016/S0040-5809(02)00050-3

  • Selection on rapidly evolving proteins in the Arabidopsis genome GENETICS Barrier, M., Bustamante, C. D., Yu, J. Y., Purugganan, M. D. 2003; 163 (2): 723-733


    Genes that have undergone positive or diversifying selection are likely to be associated with adaptive divergence between species. One indicator of adaptive selection at the molecular level is an excess of amino acid replacement fixed differences per replacement site relative to the number of synonymous fixed differences per synonymous site (omega = K(a)/K(s)). We used an evolutionary expressed sequence tag (EST) approach to estimate the distribution of omega among 304 orthologous loci between Arabidopsis thaliana and A. lyrata to identify genes potentially involved in the adaptive divergence between these two Brassicaceae species. We find that 14 of 304 genes (approximately 5%) have an estimated omega > 1 and are candidates for genes with increased selection intensities. Molecular population genetic analyses of 6 of these rapidly evolving protein loci indicate that, despite their high levels of between-species nonsynonymous divergence, these genes do not have elevated levels of intraspecific replacement polymorphisms compared to previously studied genes. A hierarchical Bayesian analysis of protein-coding region evolution within and between species also indicates that the selection intensities of these genes are elevated compared to previously studied A. thaliana nuclear loci.
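
    The ratio omega = K(a)/K(s) defined above can be illustrated with a small calculation. This is a simplified sketch, not the estimation procedure used in the paper: it assumes the numbers of nonsynonymous and synonymous sites and differences have already been counted, and it applies a Jukes-Cantor correction for multiple hits. The function names and example counts are hypothetical.

```python
from math import log


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion p of differing sites (requires p < 0.75)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


def omega(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """K(a)/K(s) from pre-counted differences and sites (counting method is assumed)."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    if ks == 0.0:
        raise ValueError("K(s) is zero; omega is undefined")
    return ka / ks


# Hypothetical counts for one pair of orthologs.
w = omega(nonsyn_diffs=30, nonsyn_sites=300, syn_diffs=5, syn_sites=100)
```

    Values of omega above 1 flag a locus as a candidate for adaptive divergence, which is how the 14 candidate genes in the abstract are characterized.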

  • Bayesian analysis suggests that most amino acid replacements in Drosophila are driven by positive selection Meeting on Evolution, Genomics and Bioinformatics Sawyer, S. A., Kulathinal, R. J., Bustamante, C. D., Hartl, D. L. SPRINGER. 2003: S154–S164


    One of the principal goals of population genetics is to understand the processes by which genetic variation within species (polymorphism) becomes converted into genetic differences between species (divergence). In this transformation, selective neutrality, near neutrality, and positive selection may each play a role, differing from one gene to the next. Synonymous nucleotide sites are often used as a uniform standard of comparison across genes on the grounds that synonymous sites are subject to relatively weak selective constraints and so may, to a first approximation, be regarded as neutral. Synonymous sites are also interdigitated with nonsynonymous sites and so are affected equally by genomic context and demographic factors. Hence a comparison of levels of polymorphism and divergence between synonymous sites and amino acid replacement sites in a gene is potentially informative about the magnitude of selective forces associated with amino acid replacements. We have analyzed 56 genes in which polymorphism data from D. simulans are compared with divergence from a reference strain of D. melanogaster. The framework of the analysis is Bayesian and assumes that the distribution of selective effects (Malthusian fitnesses) is Gaussian with a mean that differs for each gene. In such a model, the average scaled selection intensity (gamma = N(e)s) of amino acid replacements eligible to become polymorphic or fixed is -7.31, and the standard deviation of selective effects within each locus is 6.79 (assuming homoscedasticity across loci). For newly arising mutations of this type that occur in autosomal or X-linked genes, the average proportion of beneficial mutations is 19.7%. Among the amino acid polymorphisms in the sample, the expected average proportion of beneficial mutations is 47.7%, and among amino acid replacements that become fixed the average proportion of beneficial mutations is 94.3%. The average scaled selection intensity of fixed mutations is +5.1. The presence of positive selection is pervasive with the single exception of kl-5, a Y-linked fertility gene. We find no evidence that a significant fraction of fixed amino acid replacements is neutral or nearly neutral or that positive selection drives amino acid replacements at only a subset of the loci. These results are model dependent and we discuss possible modifications of the model that might allow more neutral and nearly neutral amino acid replacements to be fixed.

    View details for DOI 10.1007/s00239-003-0022-3

  • The cost of inbreeding in Arabidopsis NATURE Bustamante, C. D., Nielsen, R., Sawyer, S. A., Olsen, K. M., Purugganan, M. D., Hartl, D. L. 2002; 416 (6880): 531-534


    Population geneticists have long sought to estimate the distribution of selection intensities among genes of diverse function across the genome. Only recently have DNA sequencing and analytical techniques converged to make this possible. Important advances have come from comparing genetic variation within species (polymorphism) with fixed differences between species (divergence). These approaches have been used to examine individual genes for evidence of selection. Here we use the fact that the time since species divergence allows combination of data across genes. In a comparison of amino-acid replacements among species of the mustard weed Arabidopsis with those among species of the fruitfly Drosophila, we find evidence for predominantly beneficial gene substitutions in Drosophila but predominantly detrimental substitutions in Arabidopsis. We attribute this difference to the Arabidopsis mating system of partial self-fertilization, which corroborates a prediction of population genetics theory that species with a high frequency of inbreeding are less efficient in eliminating deleterious mutations owing to their reduced effective population size.

  • A maximum likelihood method for analyzing pseudogene evolution: Implications for silent site evolution in humans and rodents MOLECULAR BIOLOGY AND EVOLUTION Bustamante, C. D., Nielsen, R., Hartl, D. L. 2002; 19 (1): 110-117


    We present a new likelihood method for detecting constrained evolution at synonymous sites and other forms of nonneutral evolution in putative pseudogenes. The model is applicable whenever the DNA sequence is available from a protein-coding functional gene, a pseudogene derived from the protein-coding gene, and an orthologous functional copy of the gene. Two nested likelihood ratio tests are developed to test the hypotheses that (1) the putative pseudogene has equal rates of silent and replacement substitutions; and (2) the rate of synonymous substitution in the functional gene equals the rate of substitution in the pseudogene. The method is applied to a data set containing 74 human processed-pseudogene loci, 25 mouse processed-pseudogene loci, and 22 rat processed-pseudogene loci. Using the informatics resources of the Human Genome Project, we localized 67 of the human-pseudogene pairs in the genome and estimated the GC content of a large surrounding genomic region for each. We find that, for pseudogenes deposited in GC regions similar to those of their paralogs, the assumption of equal rates of silent and replacement site evolution in the pseudogene is upheld; in these cases, the rate of silent site evolution in the functional genes is approximately 70% the rate of evolution in the pseudogene. On the other hand, for pseudogenes located in genomic regions of much lower GC than their functional gene, we see a sharp increase in the rate of silent site substitutions, leading to a large rate of rejection for the pseudogene equality likelihood ratio test.
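
    The nested likelihood ratio tests mentioned above follow a general recipe that can be sketched briefly. The example assumes the two models differ by a single free parameter and that the usual chi-square approximation with one degree of freedom applies; the function name and the log-likelihood values are hypothetical and are not taken from the paper.

```python
from math import erfc, sqrt


def lrt_pvalue_df1(loglik_null, loglik_alt):
    """Return (statistic, p_value) for a nested likelihood ratio test with 1 df.

    The statistic is 2 * (lnL_alt - lnL_null). For a chi-square variable with one
    degree of freedom, the upper-tail probability equals erfc(sqrt(stat / 2)).
    """
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)  # guard against tiny negative values
    return stat, erfc(sqrt(stat / 2.0))


# Hypothetical maximized log-likelihoods: constrained (null) and unconstrained model.
stat, p = lrt_pvalue_df1(loglik_null=-100.0, loglik_alt=-98.08)
```

    A statistic near 3.84 corresponds to the conventional 5% threshold, so a rejection of the null here would indicate that the constrained model (for example, equal silent and replacement rates) fits significantly worse.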

  • Directional selection and the site-frequency spectrum GENETICS Bustamante, C. D., Wakeley, J., Sawyer, S., Hartl, D. L. 2001; 159 (4): 1779-1788


    In this article we explore statistical properties of the maximum-likelihood estimates (MLEs) of the selection and mutation parameters in a Poisson random field population genetics model of directional selection at DNA sites. We derive the asymptotic variances and covariance of the MLEs and explore the power of the likelihood ratio tests (LRT) of neutrality for varying levels of mutation and selection as well as the robustness of the LRT to deviations from the assumption of free recombination among sites. We also discuss the coverage of confidence intervals on the basis of two standard-likelihood methods. We find that the LRT has high power to detect deviations from neutrality and that the maximum-likelihood estimation performs very well when the ancestral states of all mutations in the sample are known. When the ancestral states are not known, the test has high power to detect deviations from neutrality for negative selection but not for positive selection. We also find that the LRT is not robust to deviations from the assumption of independence among sites.
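
    The link between directional selection and the site-frequency spectrum can be made concrete by computing the expected spectrum under a Poisson random field model. The sketch below assumes one common parameterization, in which the expected density of sites at derived frequency x is theta * (1 - exp(-2*gamma*(1-x))) / ((1 - exp(-2*gamma)) * x * (1-x)); the scaling of gamma and the function name are assumptions, and the integral is evaluated with a simple midpoint rule rather than any method from the paper.

```python
from math import comb, expm1


def prf_expected_sfs(n, theta, gamma, steps=20000):
    """Expected number of sites with i = 1..n-1 derived alleles in a sample of size n."""

    def weight(x):
        # Selection weight relative to 1 / (x * (1 - x)); reduces to (1 - x) when gamma = 0,
        # which recovers the neutral density theta / x.
        if gamma == 0.0:
            return 1.0 - x
        return expm1(-2.0 * gamma * (1.0 - x)) / expm1(-2.0 * gamma)

    h = 1.0 / steps
    sfs = []
    for i in range(1, n):
        total = 0.0
        for k in range(steps):
            x = (k + 0.5) * h  # midpoint of the k-th subinterval of (0, 1)
            # Binomial sampling term x^i (1-x)^(n-i) divided by x (1-x), times the weight.
            total += x ** (i - 1) * (1.0 - x) ** (n - i - 1) * weight(x)
        sfs.append(theta * comb(n, i) * total * h)
    return sfs
```

    Under neutrality the result reduces to theta / i. Negative gamma raises the share of singletons and positive gamma raises the share of high-frequency derived sites, and these shifts in the spectrum are the signal that likelihood ratio tests of neutrality rely on.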

  • Chromosomal effects of rapid gene evolution in Drosophila melanogaster SCIENCE Nurminsky, D., De Aguiar, D., Bustamante, C. D., Hartl, D. L. 2001; 291 (5501): 128-130


    Rapid adaptive fixation of a new favorable mutation is expected to affect neighboring genes along the chromosome. Evolutionary theory predicts that the chromosomal region would show a reduced level of genetic variation and an excess of rare alleles. We have confirmed these predictions in a region of the X chromosome of Drosophila melanogaster that contains a newly evolved gene for a component of the sperm axoneme. In D. simulans, where the novel gene does not exist, the pattern of genetic variation is consistent with selection against recurrent deleterious mutations. These findings imply that the pattern of genetic variation along a chromosome may be useful for inferring its evolutionary history and for revealing regions in which recent adaptive fixations have taken place.

  • Solvent accessibility and purifying selection within proteins of Escherichia coli and Salmonella enterica MOLECULAR BIOLOGY AND EVOLUTION Bustamante, C. D., Townsend, J. P., Hartl, D. L. 2000; 17 (2): 301-308


    The neutral theory of molecular evolution predicts that variation within species is inversely related to the strength of purifying selection, but the strength of purifying selection itself must be related to physical constraints imposed by protein folding and function. In this paper, we analyzed five enzymes for which polymorphic sequence variation within Escherichia coli and/or Salmonella enterica was available, along with a protein structure. Single and multivariate logistic regression models are presented that evaluate amino acid size, physicochemical properties, solvent accessibility, and secondary structure as predictors of polymorphism. A model that contains a positive coefficient of association between polymorphism and solvent accessibility and separate intercepts for each secondary-structure element is sufficient to explain the observed variation in polymorphism between sites. The model predicts an increase in the probability of amino acid polymorphism with increasing solvent accessibility for each protein regardless of physicochemical properties, secondary-structure element, or size of the amino acid. This result, when compared with the distribution of synonymous polymorphism, which shows no association with solvent accessibility, suggests a strong decrease in purifying selection with increasing solvent accessibility.
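
    A single-predictor version of the logistic regression described above can be sketched as follows. The data are made-up toy values chosen only to show the mechanics, the fitting routine is plain gradient ascent on the log-likelihood, and the model omits the separate intercepts per secondary-structure element that the paper reports; all names are hypothetical.

```python
from math import exp


def fit_logistic(xs, ys, lr=0.5, iters=20000):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient with respect to the intercept
            g1 += (y - p) * x    # gradient with respect to the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Toy data (not from the paper): relative solvent accessibility per site and
# whether the site is polymorphic (1) or not (0).
accessibility = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
polymorphic = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]
intercept, slope = fit_logistic(accessibility, polymorphic)
```

    A positive fitted slope corresponds to the abstract's finding that the probability of amino acid polymorphism rises with solvent accessibility.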

