Trained in the French school of Data Analysis in Montpellier, Susan Holmes has been working in non parametric multivariate statistics applied to Biology since 1985. She has taught at MIT, Harvard and was an Associate Professor of Biometry at Cornell before moving to Stanford in 1998. She teaches the Thinking Matters class: Breaking Codes and Finding patterns and likes working on big messy data sets, mostly from the areas of Immunology, Cancer Biology and Microbial Ecology. Her theoretical interests include applied probability, MCMC (Monte Carlo Markov chains), Graph Limit Theory, Differential Geometry and the topology of the space of Phylogenetic Trees.

Administrative Appointments

  • CoDirector, Mathematical and Computational Sciences IDP (2002 - 2017)
  • Director, VIGRE program in Statistics (2005 - 2014)
  • Professor, Member, BioX (2003 - Present)

Honors & Awards

  • CASBS Fellow, Center for the Advanced study of the Behavioral Sciences (2017-2018)
  • Breiman Lecturer, NIPS (December, 2016)
  • Fellow, Fields Institute in Mathematical Sciences, Toronto, Canada (2015)
  • Director's Transformative Research Award, NIH (2013)
  • John Henry Samter University Fellow in Undergraduate Education, Stanford (2012)
  • Fellow of the Institute of Mathematical Statistics, IMS (2005)

Boards, Advisory Committees, Professional Organizations

  • Member and Chair (2010-2012), Science Board, NimBioS (2007 - 2012)
  • Science Boards, Fields Research Institute in the Mathematical Sciences (2012 - 2015)
  • Council Member, Institute of Mathematical Statistics (2018 - Present)

Research & Scholarship

Current Research and Scholarly Interests

Our work focuses on large heterogeneous multi-layer data analyses. Whether using image analysis and segmentation for the study of cancer and immune cell interactions, or brain imaging and DNA sequence analyses for the study of dependencies between genetic and neurological dynamics, all these statistical studies have involved large complex datasets of different types where dynamics of interactions between different components of a system are the key to understanding the underlying biology.

We have generalized methods such as Principal Components Analysis (PCA) to more diverse data incorporating spatial information as well as tree dependency structures. This has proved useful in the study of drug resistant mutations in HIV and in the study of the dynamics of bacterial communities in the Human Microbiome.

The statistical bases for these nonparametric methods are computer intensive methods using optimization and Kernels and we often find useful embeddings of high dimensional data in low dimensional structures, the extreme case being finding a natural ordering in high dimensional data. More general manifolds have also proved useful in one of our current projects, joint with Xavier Pennec of INRIA-SophiaAntiolis which focuses on the uses of differential geometry in computational anatomy and image processing.

In a long term collaboration with Professor David Relman (Stanford Medical School) we are developing a multi-table toolbox of non parametric methods that enable users to normalize and visualize the multiple facets of the microbiota in the human body under different classes of perturbations. The tools developed in this project are all open source packages developed in R and provide an example of reproducible research in action.


  • Study of the dynamics of the human microbiome., Stanford University (March 1, 2013 - 2018)

    We are using statistical multivariate methods we are devloping new methods for modeling and visualizing the dynamics of bacterial community networks in the human microbiome.


    United States


  • Hierarchical Testing, Stanford University (8/1/2012)

    Develop tools for testing evokutionary signals in bacterial data.




    • Alfred Spormann, Member, Bio-X, Independent Labs, Institutes, and Centers (Dean of Research)
  • Multivariate Data Analysis of Drug Resistance in HIV, Stanford University (9/1/2003)

    We use phylogenetic trees, networks and multivariate analyses to deompose the complexities of drug resistance in HIV.




  • Multiway Data Analysis for the Human Microbiome, Stanford University (11/1/2013 - 10/31/2018)

    Combining 16S sRNA data, metabolic and transcriptomic data to predict resilience in the human microbiome after perturbations.




    • David Relman, Professor, Stanford


2018-19 Courses

Stanford Advisees

Graduate and Fellowship Programs


All Publications

  • Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome Davis, N. M., Proctor, D. M., Holmes, S. P., Relman, D. A., Callahan, B. J. 2018; 6 (1): 226


    BACKGROUND: The accuracy of microbial community surveys based on marker-gene and metagenomic sequencing (MGS) suffers from the presence of contaminants-DNA sequences not truly present in the sample. Contaminants come from various sources, including reagents. Appropriate laboratory practices can reduce contamination, but do not eliminate it. Here we introduce decontam ( ), an open-source R package that implements a statistical classification procedure that identifies contaminants in MGS data based on two widely reproduced patterns: contaminants appear at higher frequencies in low-concentration samples and are often found in negative controls.RESULTS: Decontam classified amplicon sequence variants (ASVs) in a human oral dataset consistently with prior microscopic observations of the microbial taxa inhabiting that environment and previous reports of contaminant taxa. In metagenomics and marker-gene measurements of a dilution series, decontam substantially reduced technical variation arising from different sequencing protocols. The application of decontam to two recently published datasets corroborated and extended their conclusions that little evidence existed for an indigenous placenta microbiome and that some low-frequency taxa seemingly associated with preterm birth were contaminants.CONCLUSIONS: Decontam improves the quality of metagenomic and marker-gene sequencing by identifying and removing contaminant DNA sequences. Decontam integrates easily with existing MGS workflows and allows researchers to generate more accurate profiles of microbial communities at little to no additional cost.

    View details for PubMedID 30558668

  • Latent variable modeling for the microbiome. Biostatistics (Oxford, England) Sankaran, K., Holmes, S. P. 2018


    The human microbiome is a complex ecological system, and describing its structure and function under different environmental conditions is important from both basic scientific and medical perspectives. Viewed through a biostatistical lens, many microbiome analysis goals can be formulated as latent variable modeling problems. However, although probabilistic latent variable models are a cornerstone of modern unsupervised learning, they are rarely applied in the context of microbiome data analysis, in spite of the evolutionary, temporal, and count structure that could be directly incorporated through such models. We explore the application of probabilistic latent variable models to microbiome data, with a focus on Latent Dirichlet allocation, Non-negative matrix factorization, and Dynamic Unigram models. To develop guidelines for when different methods are appropriate, we perform a simulation study. We further illustrate and compare these techniques using the data of Dethlefsen and Relman (2011, Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation. Proceedings of the National Academy of Sciences108, 4554-4561), a study on the effects of antibiotics on bacterial community composition. Code and data for all simulations and case studies are available publicly.

    View details for PubMedID 29868846

  • Bayesian Nonparametric Ordination for the Analysis of Microbial Communities JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Ren, B., Bacallado, S., Favaro, S., Holmes, S., Trippa, L. 2017; 112 (520): 1430–42


    Human microbiome studies use sequencing technologies to measure the abundance of bacterial species or Operational Taxonomic Units (OTUs) in samples of biological material. Typically the data are organized in contingency tables with OTU counts across heterogeneous biological samples. In the microbial ecology community, ordination methods are frequently used to investigate latent factors or clusters that capture and describe variations of OTU counts across biological samples. It remains important to evaluate how uncertainty in estimates of each biological sample's microbial distribution propagates to ordination analyses, including visualization of clusters and projections of biological samples on low dimensional spaces. We propose a Bayesian analysis for dependent distributions to endow frequently used ordinations with estimates of uncertainty. A Bayesian nonparametric prior for dependent normalized random measures is constructed, which is marginally equivalent to the normalized generalized Gamma process, a well-known prior for nonparametric analyses. In our prior, the dependence and similarity between microbial distributions is represented by latent factors that concentrate in a low dimensional space. We use a shrinkage prior to tune the dimensionality of the latent factors. The resulting posterior samples of model parameters can be used to evaluate uncertainty in analyses routinely applied in microbiome studies. Specifically, by combining them with multivariate data analysis techniques we can visualize credible regions in ecological ordination plots. The characteristics of the proposed model are illustrated through a simulation study and applications in two microbiome datasets.

    View details for PubMedID 29430070

    View details for PubMedCentralID PMC5804367

  • DADA2: High-resolution sample inference from Illumina amplicon data. Nature methods Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J., Holmes, S. P. 2016; 13 (7): 581-583


    We present the open-source software package DADA2 for modeling and correcting Illumina-sequenced amplicon errors ( DADA2 infers sample sequences exactly and resolves differences of as little as 1 nucleotide. In several mock communities, DADA2 identified more real variants and output fewer spurious sequences than other methods. We applied DADA2 to vaginal samples from a cohort of pregnant women, revealing a diversity of previously undetected Lactobacillus crispatus variants.

    View details for DOI 10.1038/nmeth.3869

    View details for PubMedID 27214047

    View details for PubMedCentralID PMC4927377

  • Waste not, want not: why rarefying microbiome data is inadmissible. PLoS computational biology McMurdie, P. J., Holmes, S. 2014; 10 (4)


    Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.

    View details for DOI 10.1371/journal.pcbi.1003531

    View details for PubMedID 24699258

  • Metagenomic analysis with strain-level resolution reveals fine-scale variation in the human pregnancy microbiome. Genome research Goltsman, D. S., Sun, C. L., Proctor, D. M., DiGiulio, D. B., Robaczewska, A., Thomas, B. C., Shaw, G. M., Stevenson, D. K., Holmes, S. P., Banfield, J. F., Relman, D. A. 2018


    Recent studies suggest that the microbiome has an impact on gestational health and outcome. However, characterization of the pregnancy-associated microbiome has largely relied on 16S rRNA gene amplicon-based surveys. Here, we describe an assembly-driven, metagenomics-based, longitudinal study of the vaginal, gut, and oral microbiomes in 292 samples from 10 subjects sampled every three weeks throughout pregnancy. Nonhuman sequences in the amount of 1.53 Gb were assembled into scaffolds, and functional genes were predicted for gene- and pathway-based analyses. Vaginal assemblies were binned into 97 draft quality genomes. Redundancy analysis (RDA) of microbial community composition at all three body sites revealed gestational age to be a significant source of variation in patterns of gene abundance. In addition, health complications were associated with variation in community functional gene composition in the mouth and gut. The diversity of Lactobacillus iners-dominated communities in the vagina, unlike most other vaginal community types, significantly increased with gestational age. The genomes of co-occurring Gardnerella vaginalis strains with predicted distinct functions were recovered in samples from two subjects. In seven subjects, gut samples contained strains of the same Lactobacillus species that dominated the vaginal community of that same subject and not other Lactobacillus species; however, these within-host strains were divergent. CRISPR spacer analysis suggested shared phage and plasmid populations across body sites and individuals. This work underscores the dynamic behavior of the microbiome during pregnancy and suggests the potential importance of understanding the sources of this behavior for fetal development and gestational outcome.

    View details for PubMedID 30232199

  • Differential Induction of IFN-alpha and Modulation of CD112 and CD54 Expression Govern the Magnitude of NK Cell IFN-gamma Response to Influenza A Viruses. Journal of immunology (Baltimore, Md. : 1950) Kronstad, L. M., Seiler, C., Vergara, R., Holmes, S. P., Blish, C. A. 2018


    In human and murine studies, IFN-gamma is a critical mediator immunity to influenza. IFN-gamma production is critical for viral clearance and the development of adaptive immune responses, yet excessive production of IFN-gamma and other cytokines as part of a cytokine storm is associated with poor outcomes of influenza infection in humans. As NK cells are the main population of lung innate immune cells capable of producing IFN-gamma early in infection, we set out to identify the drivers of the human NK cell IFN-gamma response to influenza A viruses. We found that influenza triggers NK cells to secrete IFN-gamma in the absence of T cells and in a manner dependent upon signaling from both cytokines and receptor-ligand interactions. Further, we discovered that the pandemic A/California/07/2009 (H1N1) strain elicits a seven-fold greater IFN-gamma response than other strains tested, including a seasonal A/Victoria/361/2011 (H3N2) strain. These differential responses were independent of memory NK cells. Instead, we discovered that the A/Victoria/361/2011 influenza strain suppresses the NK cell IFN-gamma response by downregulating NK-activating ligands CD112 and CD54 and by repressing the type I IFN response in a viral replication-dependent manner. In contrast, the A/California/07/2009 strain fails to repress the type I IFN response or to downregulate CD54 and CD112 to the same extent, which leads to the enhanced NK cell IFN-gamma response. Our results indicate that influenza implements a strain-specific mechanism governing NK cell production of IFN-gamma and identifies a previously unrecognized influenza innate immune evasion strategy.

    View details for PubMedID 30143589

  • Characterization of the impact of daclizumab beta on circulating natural killer cells by mass cytometry Ranganath, T., Seiler, C., Vendrame, E., Le Gars, M., Fontenot, J. D., Fam, S., Holmes, S., Blish, C. LIPPINCOTT WILLIAMS & WILKINS. 2018
  • Topologically Constrained Template Estimation via Morse-Smale Complexes Controls Its Statistical Consistency SIAM JOURNAL ON APPLIED ALGEBRA AND GEOMETRY Miolane, N., Holmes, S., Pennec, X. 2018; 2 (2): 348–75

    View details for DOI 10.1137/17M1129222

    View details for Web of Science ID 000437369100007

  • A spatial gradient of bacterial diversity in the human oral cavity shaped by salivary flow. Nature communications Proctor, D. M., Fukuyama, J. A., Loomer, P. M., Armitage, G. C., Lee, S. A., Davis, N. M., Ryder, M. I., Holmes, S. P., Relman, D. A. 2018; 9 (1): 681


    Spatial and temporal patterns in microbial communities provide insights into the forces that shape them, their functions and roles in health and disease. Here, we used spatial and ecological statistics to analyze the role that saliva plays in structuring bacterial communities of the human mouth using >9000 dental and mucosal samples. We show that regardless of tissue type (teeth, alveolar mucosa, keratinized gingiva, or buccal mucosa), surface-associated bacterial communities vary along an ecological gradient from the front to the back of the mouth, and that on exposed tooth surfaces, the gradient is pronounced on lingual compared to buccal surfaces. Furthermore, our data suggest that this gradient is attenuated in individuals with low salivary flow due to Sjögren's syndrome. Taken together, our findings imply that salivary flow influences the spatial organization of microbial communities and that biogeographical patterns may be useful for understanding host physiological processes and for predicting disease.

    View details for PubMedID 29445174

    View details for PubMedCentralID PMC5813034

  • Gut microbiome transition across a lifestyle gradient in Himalaya. PLoS biology Jha, A. R., Davenport, E. R., Gautam, Y., Bhandari, D., Tandukar, S., Ng, K. M., Fragiadakis, G. K., Holmes, S., Gautam, G. P., Leach, J., Sherchand, J. B., Bustamante, C. D., Sonnenburg, J. L. 2018; 16 (11): e2005396


    The composition of the gut microbiome in industrialized populations differs from those living traditional lifestyles. However, it has been difficult to separate the contributions of human genetic and geographic factors from lifestyle. Whether shifts away from the foraging lifestyle that characterize much of humanity's past influence the gut microbiome, and to what degree, remains unclear. Here, we characterize the stool bacterial composition of four Himalayan populations to investigate how the gut community changes in response to shifts in traditional human lifestyles. These groups led seminomadic hunting-gathering lifestyles until transitioning to varying levels of agricultural dependence upon farming. The Tharu began farming 250-300 years ago, the Raute and Raji transitioned 30-40 years ago, and the Chepang retain many aspects of a foraging lifestyle. We assess the contributions of dietary and environmental factors on their gut-associated microbes and find that differences in the lifestyles of Himalayan foragers and farmers are strongly correlated with microbial community variation. Furthermore, the gut microbiomes of all four traditional Himalayan populations are distinct from that of the Americans, indicating that industrialization may further exacerbate differences in the gut community. The Chepang foragers harbor an elevated abundance of taxa associated with foragers around the world. Conversely, the gut microbiomes of the populations that have transitioned to farming are more similar to those of Americans, with agricultural dependence and several associated lifestyle and environmental factors correlating with the extent of microbiome divergence from the foraging population. The gut microbiomes of Raute and Raji reveal an intermediate state between the Chepang and Tharu, indicating that divergence from a stereotypical foraging microbiome can occur within a single generation. Our results also show that environmental factors such as drinking water source and solid cooking fuel are significantly associated with the gut microbiome. Despite the pronounced differences in gut bacterial composition across populations, we found little differences in alpha diversity across lifestyles. These findings in genetically similar populations living in the same geographical region establish the key role of lifestyle in determining human gut microbiome composition and point to the next challenging steps of determining how large-scale gut microbiome reconfiguration impacts human biology.

    View details for PubMedID 30439937

  • Multivariate Heteroscedasticity Models for Functional Brain Connectivity FRONTIERS IN NEUROSCIENCE Seiler, C., Holmes, S. 2017; 11: 696


    Functional brain connectivity is the co-occurrence of brain activity in different areas during resting and while doing tasks. The data of interest are multivariate timeseries measured simultaneously across brain parcels using resting-state fMRI (rfMRI). We analyze functional connectivity using two heteroscedasticity models. Our first model is low-dimensional and scales linearly in the number of brain parcels. Our second model scales quadratically. We apply both models to data from the Human Connectome Project (HCP) comparing connectivity between short and conventional sleepers. We find stronger functional connectivity in short than conventional sleepers in brain areas consistent with previous findings. This might be due to subjects falling asleep in the scanner. Consequently, we recommend the inclusion of average sleep duration as a covariate to remove unwanted variation in rfMRI studies. A power analysis using the HCP data shows that a sample size of 40 detects 50% of the connectivity at a false discovery rate of 20%. We provide implementations using R and the probabilistic programming language Stan.

    View details for PubMedID 29311777

  • Parallel imaging of Drosophila embryos for quantitative analysis of genetic perturbations of the Ras pathway. Disease models & mechanisms Goyal, Y., Levario, T. J., Mattingly, H. H., Holmes, S., Shvartsman, S. Y., Lu, H. 2017


    The Ras pathway patterns the poles of the Drosophila embryo by downregulating the levels and activity of a DNA-binding transcriptional repressor Capicua (Cic). We demonstrate that the spatiotemporal pattern of Cic during this signaling event can be harnessed for functional studies of the Ras-pathway mutations from human diseases. Our approach relies on a new microfluidic device that enables parallel imaging of Cic dynamics in dozens of live embryos. We found that although the pattern of Cic in early embryos is complex, it can be accurately approximated by a product of one spatial profile and one time-dependent amplitude. Analysis of these functions of space and time alone reveals the differential effects of mutations within the Ras pathway. Given the highly-conserved nature of Ras-dependent control of Cic, our approach opens a new way for functional analysis of multiple sequence variants from developmental abnormalities and cancers.

    View details for DOI 10.1242/dmm.030163

    View details for PubMedID 28495673

  • Mutational Correlates of Virological Failure in Individuals Receiving a WHO-Recommended Tenofovir-Containing First-Line Regimen: An International Collaboration. EBioMedicine Rhee, S., Varghese, V., Holmes, S. P., van Zyl, G. U., Steegen, K., Boyd, M. A., Cooper, D. A., Nsanzimana, S., Saravanan, S., Charpentier, C., de Oliveira, T., Etiebet, M. A., Garcia, F., Goedhals, D., Gomes, P., Günthard, H. F., Hamers, R. L., Hoffmann, C. J., Hunt, G., Jiamsakul, A., Kaleebu, P., Kanki, P., Kantor, R., Kerschberger, B., Marconi, V. C., D'amour Ndahimana, J., Ndembi, N., Ngo-Giang-Huong, N., Rokx, C., Santoro, M. M., Schapiro, J. M., Schmidt, D., Seu, L., Sigaloff, K. C., Sirivichayakul, S., Skhosana, L., Sunpath, H., Tang, M., Yang, C., Carmona, S., Gupta, R. K., Shafer, R. W. 2017; 18: 225-235


    Tenofovir disoproxil fumarate (TDF) genotypic resistance defined by K65R/N and/or K70E/Q/G occurs in 20% to 60% of individuals with virological failure (VF) on a WHO-recommended TDF-containing first-line regimen. However, the full spectrum of reverse transcriptase (RT) mutations selected in individuals with VF on such a regimen is not known. To identify TDF regimen-associated mutations (TRAMs), we compared the proportion of each RT mutation in 2873 individuals with VF on a WHO-recommended first-line TDF-containing regimen to its proportion in a cohort of 50,803 antiretroviral-naïve individuals. To identify TRAMs specifically associated with TDF-selection pressure, we compared the proportion of each TRAM to its proportion in a cohort of 5805 individuals with VF on a first-line thymidine analog-containing regimen. We identified 83 TRAMs including 33 NRTI-associated, 40 NNRTI-associated, and 10 uncommon mutations of uncertain provenance. Of the 33 NRTI-associated TRAMs, 12 - A62V, K65R/N, S68G/N/D, K70E/Q/T, L74I, V75L, and Y115F - were more common among individuals receiving a first-line TDF-containing compared to a first-line thymidine analog-containing regimen. These 12 TDF-selected TRAMs will be important for monitoring TDF-associated transmitted drug-resistance and for determining the extent of reduced TDF susceptibility in individuals with VF on a TDF-containing regimen.

    View details for DOI 10.1016/j.ebiom.2017.03.024

    View details for PubMedID 28365230

    View details for PubMedCentralID PMC5405160

  • Replication and refinement of a vaginal microbial signature of preterm birth in two racially distinct cohorts of US women. Proceedings of the National Academy of Sciences of the United States of America Callahan, B. J., DiGiulio, D. B., Goltsman, D. S., Sun, C. L., Costello, E. K., Jeganathan, P., Biggio, J. R., Wong, R. J., Druzin, M. L., Shaw, G. M., Stevenson, D. K., Holmes, S. P., Relman, D. A. 2017


    Preterm birth (PTB) is the leading cause of neonatal morbidity and mortality. Previous studies have suggested that the maternal vaginal microbiota contributes to the pathophysiology of PTB, but conflicting results in recent years have raised doubts. We conducted a study of PTB compared with term birth in two cohorts of pregnant women: one predominantly Caucasian (n = 39) at low risk for PTB, the second predominantly African American and at high-risk (n = 96). We profiled the taxonomic composition of 2,179 vaginal swabs collected prospectively and weekly during gestation using 16S rRNA gene sequencing. Previously proposed associations between PTB and lower Lactobacillus and higher Gardnerella abundances replicated in the low-risk cohort, but not in the high-risk cohort. High-resolution bioinformatics enabled taxonomic assignment to the species and subspecies levels, revealing that Lactobacillus crispatus was associated with low risk of PTB in both cohorts, while Lactobacillus iners was not, and that a subspecies clade of Gardnerella vaginalis explained the genus association with PTB. Patterns of cooccurrence between L. crispatus and Gardnerella were highly exclusive, while Gardnerella and L. iners often coexisted at high frequencies. We argue that the vaginal microbiota is better represented by the quantitative frequencies of these key taxa than by classifying communities into five community state types. Our findings extend and corroborate the association between the vaginal microbiota and PTB, demonstrate the benefits of high-resolution statistical bioinformatics in clinical microbiome studies, and suggest that previous conflicting results may reflect the different risk profile of women of black race.

    View details for PubMedID 28847941

  • Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. The ISME journal Callahan, B. J., McMurdie, P. J., Holmes, S. P. 2017


    Recent advances have made it possible to analyze high-throughput marker-gene sequencing data without resorting to the customary construction of molecular operational taxonomic units (OTUs): clusters of sequencing reads that differ by less than a fixed dissimilarity threshold. New methods control errors sufficiently such that amplicon sequence variants (ASVs) can be resolved exactly, down to the level of single-nucleotide differences over the sequenced gene region. The benefits of finer resolution are immediately apparent, and arguments for ASV methods have focused on their improved resolution. Less obvious, but we believe more important, are the broad benefits that derive from the status of ASVs as consistent labels with intrinsic biological meaning identified independently from a reference database. Here we discuss how these features grant ASVs the combined advantages of closed-reference OTUs-including computational costs that scale linearly with study size, simple merging between independently processed data sets, and forward prediction-and of de novo OTUs-including accurate measurement of diversity and applicability to communities lacking deep coverage in reference databases. We argue that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that ASVs should replace OTUs as the standard unit of marker-gene analysis and reporting.The ISME Journal advance online publication, 21 July 2017; doi:10.1038/ismej.2017.119.

    View details for PubMedID 28731476

  • Discussion of "50 Years of Data Science" JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Holmes, S., Josse, J. 2017; 26 (4): 768–69
  • STATISTICAL PROOF? THE PROBLEM OF IRREPRODUCIBILITY Bulletin of the American Mathematical Society Holmes, S. P. 2017

    View details for DOI 10.1090/bull/1597

  • Multidomain analyses of a longitudinal human microbiome intestinal cleanout perturbation experiment PLOS Computational Biology Fukuyama, J., Rumker, L., Sankaran, K., Jeganathan, P., Relman, D. A., Holmes, S. P. 2017; 13 (8): e1005706


    Our work focuses on the stability, resilience, and response to perturbation of the bacterial communities in the human gut. Informative flash flood-like disturbances that eliminate most gastrointestinal biomass can be induced using a clinically-relevant iso-osmotic agent. We designed and executed such a disturbance in human volunteers using a dense longitudinal sampling scheme extending before and after induced diarrhea. This experiment has enabled a careful multidomain analysis of a controlled perturbation of the human gut microbiota with a new level of resolution. These new longitudinal multidomain data were analyzed using recently developed statistical methods that demonstrate improvements over current practices. By imposing sparsity constraints we have enhanced the interpretability of the analyses and by employing a new adaptive generalized principal components analysis, incorporated modulated phylogenetic information and enhanced interpretation through scoring of the portions of the tree most influenced by the perturbation. Our analyses leverage the taxa-sample duality in the data to show how the gut microbiota recovers following this perturbation. Through a holistic approach that integrates phylogenetic, metagenomic and abundance information, we elucidate patterns of taxonomic and functional change that characterize the community recovery process across individuals. We provide complete code and illustrations of new sparse statistical methods for high-dimensional, longitudinal multidomain data that provide greater interpretability than existing methods.

    View details for DOI 10.1371/journal.pcbi.1005706

    View details for PubMedCentralID PMC5576755

  • Interactive Visualization of Hierarchically Structured Data Journal of Computational and Graphical Statistics Sankaran, K., Holmes, S. P. 2017
  • Template Shape Estimation: Correcting an Asymptotic Bias SIAM J Imaging Science Miolane, N., Holmes, S., Pennec, X. 2017; 10 (2): 808-844

    View details for DOI 10.1137/16M1084493

  • Bayesian Nonparametric Ordination for the Analysis of Microbial Communities Journal of the American Statistical Association Ren, B., Bacallado, S., Favaro, S., Holmes, S., Trippa, L. 2017
  • Mass Cytometry Analytical Approaches Reveal Cytokine-Induced Changes in Natural Killer Cells. Cytometry. Part B, Clinical cytometry Vendrame, E., Fukuyama, J., Strauss-Albee, D. M., Holmes, S., Blish, C. A. 2017; 92 (1): 57-67


    Natural killer (NK) cells have antiviral and antitumor activity that could be harnessed for the treatment of infections and malignancies. To maintain cell viability and enhance antiviral and antitumor effects, NK cells are frequently treated with cytokines. Here we performed an extensive assessment of the effects of cytokines on the phenotype and function of human NK cells.We used cytometry by time-of-flight (CyTOF) to evaluate NK cell repertoire changes after stimulation with interleukin (IL)-2, IL-15 or a combination of IL- 12/IL-15/IL-18. To analyze the high dimensional CyTOF data, we used several statistical and visualization tools, including viSNE (Visualization of t-Distributed Stochastic Neighbor Embedding), Citrus (Cluster identification, characterization, and regression), correspondence analysis, and the Friedman-Rafsky test.All three treatments (IL-2, IL-15, and IL-12/IL-15/IL-18) increase expression of CD56 and CD69. The effects of treatment with IL-2 and IL-15 are nearly indistinguishable and characterized principally by increased expression of surface markers including CD56, NKp30, NKp44, and increased expression of functional markers, such as perforin, granzyme B, and MIP-1β. The combination of IL-12/IL-15/IL- 18 induces a profound shift in the repertoire structure, decreasing expression of CD16, CD57, CD8, NKp30, NKp46, and NKG2D, and dramatically increasing expression of IFN-γ.CyTOF provides insights into the effects of cytokines on the phenotype and function of NK cells, which could inform future research efforts and approaches to NK cell immunotherapy. There are several analytical approaches to CyTOF data, and the appropriate method should be carefully selected based on which aspect of the dataset is being explored. This article is protected by copyright. All rights reserved.

    View details for DOI 10.1002/cyto.b.21500

    View details for PubMedID 27933717

  • Multi-Table Differential Correlation Analysis of Neuroanatomical and Cognitive Interactions in Turner Syndrome. Neuroinformatics Seiler, C., Green, T., Hong, D., Chromik, L., Huffman, L., Holmes, S., Reiss, A. L. 2017


    Girls and women with Turner syndrome (TS) have a completely or partially missing X chromosome. Extensive studies on the impact of TS on neuroanatomy and cognition have been conducted. The integration of neuroanatomical and cognitive information into one consistent analysis through multi-table methods is difficult and most standard tests are underpowered. We propose a new two-sample testing procedure that compares associations between two tables in two groups. The procedure combines multi-table methods with permutation tests. In particular, we construct cluster size test statistics that incorporate spatial dependencies. We apply our new procedure to a newly collected dataset comprising of structural brain scans and cognitive test scores from girls with TS and healthy control participants (age and sex matched). We measure neuroanatomy with Tensor-Based Morphometry (TBM) and cognitive function with Wechsler IQ and NEuroPSYchological tests (NEPSY-II). We compare our multi-table testing procedure to a single-table analysis. Our new procedure reports differential correlations between two voxel clusters and a wide range of cognitive tests whereas the single-table analysis reports no differences. Our findings are consistent with the hypothesis that girls with TS have a different brain-cognition association structure than healthy controls.

    View details for PubMedID 29270892

  • 1,2-Dichloroethane Exposure Alters the Population Structure, Metabolism, and Kinetics of a Trichloroethene-Dechlorinating Dehalococcoides mccartyi Consortium. Environmental science & technology Mayer-Blackwell, K., Fincker, M., Molenda, O., Callahan, B., Sewell, H., Holmes, S., Edwards, E. A., Spormann, A. M. 2016: -?


    Bioremediation of groundwater contaminated with chlorinated aliphatic hydrocarbons such as perchloroethene and trichloroethene can result in the accumulation of the undesirable intermediate vinyl chloride. Such accumulation can either be due to the absence of specific vinyl chloride respiring Dehalococcoides mccartyi or to the inhibition of such strains by the metabolism of other microorganisms. The fitness of vinyl chloride respiring Dehalococcoides mccartyi subpopulations is particularly uncertain in the presence of chloroethene/chloroethane cocontaminant mixtures, which are commonly found in contaminated groundwater. Therefore, we investigated the structure of Dehalococcoides populations in a continuously fed reactor system under changing chloroethene/ethane influent conditions. We observed that increasing the influent ratio of 1,2-dichloroethane to trichloroethene was associated with ecological selection of a tceA-containing Dehalococcoides population relative to a vcrA-containing Dehalococcoides population. Although both vinyl chloride and 1,2-dichloroethane could be simultaneously transformed to ethene, prolonged exposure to 1,2-dichloroethane diminished the vinyl chloride transforming capacity of the culture. Kinetic tests revealed that dechlorination of 1,2-dichloroethane by the consortium was strongly inhibited by cis-dichloroethene but not vinyl chloride. Native polyacrylamide gel electrophoresis and mass spectrometry revealed that a trichloroethene reductive dehalogenase (TceA) homologue was the most consistently expressed of four detectable reductive dehalogenases during 1,2-dichloroethane exposure, suggesting that it catalyzes the reductive dihaloelimination of 1,2-dichloroethane to ethene.

    View details for PubMedID 27809491

  • HIV-1 Protease, Reverse Transcriptase, and Integrase Variation JOURNAL OF VIROLOGY Rhee, S., Sankaran, K., Varghese, V., Winters, M. A., Hurt, C. B., Eron, J. J., Parkin, N., Holmes, S. P., Holodniy, M., Shafer, R. W. 2016; 90 (13): 6058-6070


    HIV-1 protease (PR), reverse transcriptase (RT), and integrase (IN) variability presents a challenge to laboratories performing genotypic resistance testing. This challenge will grow with increased sequencing of samples enriched for proviral DNA such as dried blood spots and increased use of next-generation sequencing (NGS) to detect low-abundance HIV-1 variants. We analyzed PR and RT sequences from >100,000 individuals and IN sequences from >10,000 individuals to characterize variation at each amino acid position, identify mutations indicating APOBEC-mediated G-to-A editing, and identify mutations resulting from selective drug pressure. Forty-seven percent of PR, 37% of RT and 34% of IN positions had one or more amino acid variants with a prevalence ≥1%. Seventy percent of PR, 60% of RT and 60% of IN positions had one or more variants with a prevalence ≥0.1%. Overall 201 PR, 636 RT and 346 IN variants had a prevalence ≥0.1%. The median inter-subtype prevalence-ratio was 2.9-, 2.1- and 1.9-fold for these PR, RT and IN variants, respectively. Only 5.0% of PR, 3.7% of RT and 2.0% of IN variants had a median inter-subtype prevalence-ratio ≥10-fold. Variants at lower prevalences were more likely to differ biochemically and to be part of an electrophoretic mixture compared to high prevalence variants. There were 209 mutations indicative of APOBEC-mediated G-to-A editing and 326 non-polymorphic treatment-selected mutations. Identifying viruses with a high number of APOBEC-associated mutations will facilitate the quality control of dried blood spot sequencing. Identifying sequences with a high proportion of rare mutations will facilitate the quality control of NGS.Most antiretroviral drugs target three HIV-1 proteins: PR, RT, and IN. These proteins are highly variable: many different amino acids can be present at the same position in viruses from different individuals. Some of the amino acid variants cause drug resistance and occur mainly in individuals receiving antiretroviral drugs. Some variants result from a human cellular defense mechanism called APOBEC-mediated hypermutation. Many variants result from naturally occurring mutation. Some variants may represent technical artifacts. We studied PR and RT sequences from >100,000 individuals and IN sequences from >10,000 individuals to quantify variation at each amino acid position in these three HIV-1 proteins. We performed analyses to determine which amino acid variants resulted from antiretroviral drug selection pressure, APOBEC-mediated editing, and naturally occurring variation. Our results provide information essential to clinical, research, and public health laboratories performing genotypic resistance testing by sequencing HIV-1 PR, RT, and IN.

    View details for DOI 10.1128/JVI.00495-16

    View details for Web of Science ID 000378340300019

    View details for PubMedID 27099321

  • REPRODUCIBLE RESEARCH WORKFLOW IN R FOR THE ANALYSIS OF PERSONALIZED HUMAN MICROBIOME DATA. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Callahan, B., Proctor, D., Relman, D., Fukuyama, J., Holmes, S. 2016; 21: 183-194


    This article presents a reproducible research workflow for amplicon-based microbiome studies in personalized medicine created using Bioconductor packages and the knitr markdown interface.We show that sometimes a multiplicity of choices and lack of consistent documentation at each stage of the sequential processing pipeline used for the analysis of microbiome data can lead to spurious results. We propose its replacement with reproducible and documented analysis using R packages dada2, knitr, and phyloseq. This workflow implements both key stages of amplicon analysis: the initial filtering and denoising steps needed to construct taxonomic feature tables from error-containing sequencing reads (dada2), and the exploratory and inferential analysis of those feature tables and associated sample metadata (phyloseq). This workow facilitates reproducible interrogation of the full set of choices required in microbiome studies. We present several examples in which we leverage existing packages for analysis in a way that allows easy sharing and modification by others, and give pointers to articles that depend on this reproducible workflow for the study of longitudinal and spatial series analyses of the vaginal microbiome in pregnancy and the oral microbiome in humans with healthy dentition and intra-oral tissues.

    View details for PubMedID 26776185

  • Interpreting Prevotella and Bacteroides as biomarkers of diet and lifestyle. Microbiome Gorvitovskaia, A., Holmes, S. P., Huse, S. M. 2016; 4 (1): 15-?


    In a series of studies of the gut microbiome, "enterotypes" have been used to classify gut microbiome samples that cluster together in ordination analyses. Initially, three distinct enterotypes were described, although later studies reduced this to two clusters, one dominated by Bacteroides or Clostridiales species found more commonly in Western (American and Western European) subjects and the other dominated by Prevotella more often associated with non-Western subjects. The two taxa, Bacteroides and Prevotella, have been presumed to represent consistent underlying microbial communities, but no one has demonstrated the presence of additional microbial taxa across studies that can define these communities.We analyzed the combined microbiome data from five previous studies with samples across five continents. We clearly demonstrate that there are no consistent bacterial taxa associated with either Bacteroides- or Prevotella-dominated communities across the studies. By increasing the number and diversity of samples, we found gradients of both Bacteroides and Prevotella and a lack of the distinct clusters in the principal coordinate plots originally proposed in the "enterotypes" hypothesis. The apparent segregation of the samples seen in many ordination plots is due to the differences in the samples' Prevotella and Bacteroides abundances and does not represent consistent microbial communities within the "enterotypes" and is not associated with other taxa across studies. The projections we see are consistent with a continuum of values created from a simple mixture of Bacteroides and Prevotella; these two biomarkers are significantly correlated to the projection axes. We suggest that previous findings citing Bacteroides- and Prevotella-dominated clusters are the result of an artifact caused by the greater relative abundance of these two taxa over other taxa in the human gut and the sparsity of Prevotella abundant samples.We believe that the term "enterotypes" is misleading because it implies both an underlying consistency of community taxa and a clear separation of sets of human gut samples, neither of which is supported by the broader data. We propose the use of "biomarker" as a more accurate description of these and other taxa that correlate with diet, lifestyle, and disease state.

    View details for DOI 10.1186/s40168-016-0160-7

    View details for PubMedID 27068581

    View details for PubMedCentralID PMC4828855

  • Measuring multivariate association and beyond. Statistics Surveys Josse, J., Holmes, S. 2016; 10: 132-167


    Simple correlation coefficients between two variables have been generalized to measure association between two matrices in many ways. Coefficients such as the RV coefficient, the distance covariance (dCov) coefficient and kernel based coefficients are being used by different research communities. Scientists use these coefficients to test whether two random vectors are linked. Once it has been ascertained that there is such association through testing, then a next step, often ignored, is to explore and uncover the association's underlying patterns. This article provides a survey of various measures of dependence between random vectors and tests of independence and emphasizes the connections and differences between the various approaches. After providing definitions of the coefficients and associated tests, we present the recent improvements that enhance their statistical properties and ease of interpretation. We summarize multi-table approaches and provide scenarii where the indices can provide useful summaries of heterogeneous multi-block data. We illustrate these different strategies on several examples of real data and suggest directions for future research.

    View details for DOI 10.1214/16-SS116

    View details for PubMedCentralID PMC5658146

  • Bioconductor workflow for microbiome data analysis: from raw reads to community analyses. F1000Research Callahan, B. J., Sankaran, K., Fukuyama, J. A., McMurdie, P. J., Holmes, S. P. 2016; 5: 1492-?


    High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or microbial composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, including both parameteric and nonparametric methods. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2 and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests, partial least squares and linear models as well as nonparametric testing using community networks and the ggnetwork package.

    View details for DOI 10.12688/f1000research.8986.1

    View details for PubMedID 27508062

    View details for PubMedCentralID PMC4955027

  • More effective drugs lead to harder selective sweeps in the evolution of drug resistance in HIV-1. eLife Feder, A. F., Rhee, S., Holmes, S. P., Shafer, R. W., Petrov, D. A., Pennings, P. S. 2016; 5


    In the early days of HIV treatment, drug resistance occurred rapidly and predictably in all patients, but under modern treatments, resistance arises slowly, if at all. The probability of resistance should be controlled by the rate of generation of resistance mutations. If many adaptive mutations arise simultaneously, then adaptation proceeds by soft selective sweeps in which multiple adaptive mutations spread concomitantly, but if adaptive mutations occur rarely in the population, then a single adaptive mutation should spread alone in a hard selective sweep. Here, we use 6717 HIV-1 consensus sequences from patients treated with first-line therapies between 1989 and 2013 to confirm that the transition from fast to slow evolution of drug resistance was indeed accompanied with the expected transition from soft to hard selective sweeps. This suggests more generally that evolution proceeds via hard sweeps if resistance is unlikely and via soft sweeps if it is likely.

    View details for DOI 10.7554/eLife.10670

    View details for PubMedID 26882502

    View details for PubMedCentralID PMC4764592

  • Marine mammals harbor unique microbiotas shaped by and yet distinct from the sea. Nature communications Bik, E. M., Costello, E. K., Switzer, A. D., Callahan, B. J., Holmes, S. P., Wells, R. S., Carlin, K. P., Jensen, E. D., Venn-Watson, S., Relman, D. A. 2016; 7: 10516-?


    Marine mammals play crucial ecological roles in the oceans, but little is known about their microbiotas. Here we study the bacterial communities in 337 samples from 5 body sites in 48 healthy dolphins and 18 healthy sea lions, as well as those of adjacent seawater and other hosts. The bacterial taxonomic compositions are distinct from those of other mammals, dietary fish and seawater, are highly diverse and vary according to body site and host species. Dolphins harbour 30 bacterial phyla, with 25 of them in the mouth, several abundant but poorly characterized Tenericutes species in gastric fluid and a surprisingly paucity of Bacteroidetes in distal gut. About 70% of near-full length bacterial 16S ribosomal RNA sequences from dolphins are unique. Host habitat, diet and phylogeny all contribute to variation in marine mammal distal gut microbiota composition. Our findings help elucidate the factors structuring marine mammal microbiotas and may enhance monitoring of marine mammal health.

    View details for DOI 10.1038/ncomms10516

    View details for PubMedID 26839246

    View details for PubMedCentralID PMC4742810

  • Variation in Taxonomic Composition of the Fecal Microbiota in an Inbred Mouse Strain across Individuals and Time PLOS ONE Hoy, Y. E., Bik, E. M., Lawley, T. D., Holmes, S. P., Monack, D. M., Theriot, J. A., Relman, D. A. 2015; 10 (11)


    Genetics, diet, and other environmental exposures are thought to be major factors in the development and composition of the intestinal microbiota of animals. However, the relative contributions of these factors in adult animals, as well as variation with time in a variety of important settings, are still not fully understood. We studied a population of inbred, female mice fed the same diet and housed under the same conditions. We collected fecal samples from 46 individual mice over two weeks, sampling four of these mice for periods as long as 236 days for a total of 190 samples, and determined the phylogenetic composition of their microbial communities after analyzing 1,849,990 high-quality pyrosequencing reads of the 16S rRNA gene V3 region. Even under these controlled conditions, we found significant inter-individual variation in community composition, as well as variation within an individual over time, including increases in alpha diversity during the first 2 months of co-habitation. Some variation was explained by mouse membership in different cage and vendor shipment groups. The differences among individual mice from the same shipment group and cage were still significant. Overall, we found that 23% of the variation in intestinal microbiota composition was explained by changes within the fecal microbiota of a mouse over time, 12% was explained by persistent differences among individual mice, 14% by cage, and 18% by shipment group. Our findings suggest that the microbiota of controlled populations of inbred laboratory animals may not be as uniform as previously thought, that animal rearing and handling may account for some variation, and that as yet unidentified factors may explain additional components of variation in the composition of the microbiota within populations and individuals over time. These findings have implications for the design and interpretation of experiments involving laboratory animals.

    View details for DOI 10.1371/journal.pone.0142825

    View details for Web of Science ID 000367628500058

    View details for PubMedCentralID PMC4643986

  • Temporal and spatial variation of the human microbiota during pregnancy PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA DiGiulio, D. B., Callahan, B. J., McMurdie, P. J., Costello, E. K., Lyell, D. J., Robaczewska, A., Sun, C. L., Goltsman, D. S., Wong, R. J., Shaw, G., Stevenson, D. K., Holmes, S. P., Relman, D. A. 2015; 112 (35): 11060-11065


    Despite the critical role of the human microbiota in health, our understanding of microbiota compositional dynamics during and after pregnancy is incomplete. We conducted a case-control study of 49 pregnant women, 15 of whom delivered preterm. From 40 of these women, we analyzed bacterial taxonomic composition of 3,767 specimens collected prospectively and weekly during gestation and monthly after delivery from the vagina, distal gut, saliva, and tooth/gum. Linear mixed-effects modeling, medoid-based clustering, and Markov chain modeling were used to analyze community temporal trends, community structure, and vaginal community state transitions. Microbiota community taxonomic composition and diversity remained remarkably stable at all four body sites during pregnancy (P > 0.05 for trends over time). Prevalence of a Lactobacillus-poor vaginal community state type (CST 4) was inversely correlated with gestational age at delivery (P = 0.0039). Risk for preterm birth was more pronounced for subjects with CST 4 accompanied by elevated Gardnerella or Ureaplasma abundances. This finding was validated with a set of 246 vaginal specimens from nine women (four of whom delivered preterm). Most women experienced a postdelivery disturbance in the vaginal community characterized by a decrease in Lactobacillus species and an increase in diverse anaerobes such as Peptoniphilus, Prevotella, and Anaerococcus species. This disturbance was unrelated to gestational age at delivery and persisted for up to 1 y. These findings have important implications for predicting premature labor, a major global health problem, and for understanding the potential impact of a persistent, altered postpartum microbiota on maternal health, including outcomes of pregnancies following short interpregnancy intervals.

    View details for DOI 10.1073/pnas.1502875112

    View details for Web of Science ID 000360383200068

  • Human NK cell repertoire diversity reflects immune experience and correlates with viral susceptibility SCIENCE TRANSLATIONAL MEDICINE Strauss-Albee, D. M., Fukuyama, J., Liang, E. C., Yao, Y., Jarrell, J. A., Drake, A. L., Kinuthia, J., Montgomery, R. R., John-Stewart, G., Holmes, S., Blish, C. A. 2015; 7 (297)


    Innate natural killer (NK) cells are diverse at the single-cell level because of variegated expressions of activating and inhibitory receptors, yet the developmental roots and functional consequences of this diversity remain unknown. Because NK cells are critical for antiviral and antitumor responses, a better understanding of their diversity could lead to an improved ability to harness them therapeutically. We found that NK diversity is lower at birth than in adults. During an antiviral response to either HIV-1 or West Nile virus, NK diversity increases, resulting in terminal differentiation and cytokine production at the cost of cell division and degranulation. In African women matched for HIV-1 exposure risk, high NK diversity is associated with increased risk of HIV-1 acquisition. Existing diversity may therefore decrease the flexibility of the antiviral response. Collectively, the data reveal that human NK diversity is a previously undefined metric of immune history and function that may be clinically useful in forecasting the outcomes of infection and malignancy.

    View details for DOI 10.1126/scitranslmed.aac5722

    View details for Web of Science ID 000358738700007

  • de Finetti Priors using Markov chain Monte Carlo computations STATISTICS AND COMPUTING Bacallado, S., Diaconis, P., Holmes, S. 2015; 25 (4): 797-808


    Recent advances in Monte Carlo methods allow us to revisit work by de Finetti who suggested the use of approximate exchangeability in the analyses of contingency tables. This paper gives examples of computational implementations using Metropolis Hastings, Langevin and Hamiltonian Monte Carlo to compute posterior distributions for test statistics relevant for testing independence, reversible or three way models for discrete exponential families using polynomial priors and Gröbner bases.

    View details for DOI 10.1007/s11222-015-9562-9

    View details for Web of Science ID 000356828600009

    View details for PubMedCentralID PMC4578810

  • Shiny-phyloseq: Web application for interactive microbiome analysis with provenance tracking. Bioinformatics McMurdie, P. J., Holmes, S. 2015; 31 (2): 282-283


    We have created a Shiny-based web application, called Shiny-phyloseq, for dynamic interaction with microbiome data that runs on any modern web-browser, and requires no programming to use - increasing the accessibility and decreasing the entrance requirement to using phyloseq and related R tools. Along with a data- and context-aware dynamic interface for exploring the effects of parameter and method choices, Shiny-phyloseq also records the complete user-input and subsequent graphical results of a user's session, allowing the user to archive, share, and reproduce the sequence of steps that created their result - without writing any new code themselves. Availability and Implementation: Shiny-phyloseq is implemented entirely in the R language. It can be hosted/launched by any system with R installed, including Windows, Mac OS, and most Linux distributions. Information technology administrators can also host Shiny-phyloseq from a remote server, in which case users need only have a web-browser installed. Shiny-phyloseq is provided free of charge under a GPL-3 open-source license through GitHub at

    View details for DOI 10.1093/bioinformatics/btu616

    View details for PubMedID 25262154

  • Enhanced natural killer-cell and T-cell responses to influenza A virus during pregnancy PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Kay, A. W., Fukuyama, J., Aziz, N., Dekker, C. L., Mackey, S., Swan, G. E., Davis, M. M., Holmes, S., Blish, C. A. 2014; 111 (40): 14506-14511


    Pregnant women experience increased morbidity and mortality after influenza infection, for reasons that are not understood. Although some data suggest that natural killer (NK)- and T-cell responses are suppressed during pregnancy, influenza-specific responses have not been previously evaluated. Thus, we analyzed the responses of women that were pregnant (n = 21) versus those that were not (n = 29) immediately before inactivated influenza vaccination (IIV), 7 d after vaccination, and 6 wk postpartum. Expression of CD107a (a marker of cytolysis) and production of IFN-γ and macrophage inflammatory protein (MIP) 1β were assessed by flow cytometry. Pregnant women had a significantly increased percentage of NK cells producing a MIP-1β response to pH1N1 virus compared with nonpregnant women pre-IIV [median, 6.66 vs. 0.90% (P = 0.0149)] and 7 d post-IIV [median, 11.23 vs. 2.81% (P = 0.004)], indicating a heightened chemokine response in pregnant women that was further enhanced by the vaccination. Pregnant women also exhibited significantly increased T-cell production of MIP-1β and polyfunctionality in NK and T cells to pH1N1 virus pre- and post-IIV. NK- and T-cell polyfunctionality was also enhanced in pregnant women in response to the H3N2 viral strain. In contrast, pregnant women had significantly reduced NK- and T-cell responses to phorbol 12-myristate 13-acetate and ionomycin. This type of stimulation led to the conclusion that NK- and T-cell responses during pregnancy are suppressed, but clearly this conclusion is not correct relative to the more biologically relevant assays described here. Robust cellular immune responses to influenza during pregnancy could drive pulmonary inflammation, explaining increased morbidity and mortality.

    View details for DOI 10.1073/pnas.1416569111

    View details for Web of Science ID 000342633900054

  • structSSI: Simultaneous and Selective Inference for Grouped or Hierarchically Structured Data JOURNAL OF STATISTICAL SOFTWARE Sankaran, K., Holmes, S. 2014; 59 (13)


    The 𝖱 package structSSI provides an accessible implementation of two recently developed simultaneous and selective inference techniques: the group Benjamini-Hochberg and hierarchical false discovery rate procedures. Unlike many multiple testing schemes, these methods specifically incorporate existing information about the grouped or hierarchical dependence between hypotheses under consideration while controlling the false discovery rate. Doing so increases statistical power and interpretability. Furthermore, these procedures provide novel approaches to the central problem of encoding complex dependency between hypotheses. We briefly describe the group Benjamini-Hochberg and hierarchical false discovery rate procedures and then illustrate them using two examples, one a measure of ecological microbial abundances and the other a global temperature time series. For both procedures, we detail the steps associated with the analysis of these particular data sets, including establishing the dependence structures, performing the test, and interpreting the results. These steps are encapsulated by 𝖱 functions, and we explain their applicability to general data sets.

    View details for Web of Science ID 000341807500001

    View details for PubMedCentralID PMC4764101

  • Connections and Extensions: A Discussion of the Paper by Girolami and Byrne SCANDINAVIAN JOURNAL OF STATISTICS Diaconis, P., Seiler, C., Holmes, S. 2014; 41 (1): 3–7

    View details for DOI 10.1111/sjos.12070

    View details for Web of Science ID 000331606500002

  • HIV-1 transmission networks in a small world. journal of infectious diseases Pennings, P. S., Holmes, S. P., Shafer, R. W. 2014; 209 (2): 180-182

    View details for DOI 10.1093/infdis/jit525

    View details for PubMedID 24151310

    View details for PubMedCentralID PMC3873789

  • Positive Curvature and Hamiltonian Monte Carlo Seiler, C., Rubinstein-Salzedo, S., Holmes, S., Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., Weinberger, K. Q. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2014
  • Nasal Microenvironments and Interspecific Interactions Influence Nasal Microbiota Complexity and S. aureus Carriage. Cell host & microbe Yan, M., Pamp, S. J., Fukuyama, J., Hwang, P. H., Cho, D., Holmes, S., Relman, D. A. 2013; 14 (6): 631-640


    The indigenous microbiota of the nasal cavity plays important roles in human health and disease. Patterns of spatial variation in microbiota composition may help explain Staphylococcus aureus colonization and reveal interspecies and species-host interactions. To assess the biogeography of the nasal microbiota, we sampled healthy subjects, representing both S. aureus carriers and noncarriers at three nasal sites (anterior naris, middle meatus, and sphenoethmoidal recess). Phylogenetic compositional and sparse linear discriminant analyses revealed communities that differed according to site epithelium type and S. aureus culture-based carriage status. Corynebacterium accolens and C. pseudodiphtheriticum were identified as the most important microbial community determinants of S. aureus carriage, and competitive interactions were only evident at sites with ciliated pseudostratified columnar epithelium. In vitro cocultivation experiments provided supporting evidence of interactions among these species. These results highlight spatial variation in nasal microbial communities and differences in community composition between S. aureus carriers and noncarriers.

    View details for DOI 10.1016/j.chom.2013.11.005

    View details for PubMedID 24331461

  • Detection of cytomegalovirus drug resistance mutations by next-generation sequencing. Journal of clinical microbiology Sahoo, M. K., Lefterova, M. I., Yamamoto, F., Waggoner, J. J., Chou, S., Holmes, S. P., Anderson, M. W., Pinsky, B. A. 2013; 51 (11): 3700-3710


    Antiviral therapy for cytomegalovirus (CMV) plays an important role in the clinical management of solid organ and hematopoietic stem cell transplant recipients. However, CMV antiviral therapy can be complicated by drug resistance associated with mutations in the phosphotransferase UL97 and the DNA polymerase UL54. We have developed an amplicon-based high-throughput sequencing strategy for detecting CMV drug resistance mutations in clinical plasma specimens using a microfluidics PCR platform for multiplexed library preparation and a benchtop next-generation sequencing instrument. Plasmid clones of the UL97 and UL54 genes were used to demonstrate the low overall empirical error rate of the assay (0.189%) and to develop a statistical algorithm for identifying authentic low-abundance variants. The ability of the assay to detect resistance mutations was tested with mixes of wild-type and mutant plasmids, as well as clinical CMV isolates and plasma samples that were known to contain mutations that confer resistance. Finally, 48 clinical plasma specimens with a range of viral loads (394 to 2,191,011 copies/ml plasma) were sequenced using multiplexing of up to 24 specimens per run. This led to the identification of seven resistance mutations, three of which were present in <20% of the sequenced population. Thus, this assay offers more sensitive detection of minor variants and a higher multiplexing capacity than current methods for the genotypic detection of CMV drug resistance mutations.

    View details for DOI 10.1128/JCM.01605-13

    View details for PubMedID 23985916

  • Genetically dictated change in host mucus carbohydrate landscape exerts a diet-dependent effect on the gut microbiota PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Kashyap, P. C., Marcobal, A., Ursell, L. K., Smits, S. A., Sonnenburg, E. D., Costello, E. K., Higginbottom, S. K., Domino, S. E., Holmes, S. P., Relman, D. A., Knight, R., Gordon, J. I., Sonnenburg, J. L. 2013; 110 (42): 17059-17064


    We investigate how host mucus glycan composition interacts with dietary carbohydrate content to influence the composition and expressed functions of a human gut community. The humanized gnotobiotic mice mimic humans with a nonsecretor phenotype due to knockout of their α1-2 fucosyltransferase (Fut2) gene. The fecal microbiota of Fut2(-) mice that lack fucosylated host glycans show decreased alpha diversity relative to Fut2(+) mice and exhibit significant differences in community composition. A glucose-rich plant polysaccharide-deficient (PD) diet exerted a strong effect on the microbiota membership but eliminated the effect of Fut2 genotype. Additionally fecal metabolites predicted host genotype in mice on a polysaccharide-rich standard diet but not on a PD diet. A more detailed mechanistic analysis of these interactions involved colonization of gnotobiotic Fut2(+) and Fut2(-) mice with Bacteroides thetaiotaomicron, a prominent member of the human gut microbiota known to adaptively forage host mucosal glycans when dietary polysaccharides are absent. Within Fut2(-) mice, the B. thetaiotaomicron fucose catabolic pathway was markedly down-regulated, whereas BT4241-4247, an operon responsive to terminal β-galactose, the precursor that accumulates in the Fut2(-) mice, was significantly up-regulated. These changes in B. thetaiotaomicron gene expression were only evident in mice fed a PD diet, wherein B. thetaiotaomicron relies on host mucus consumption. Furthermore, up-regulation of the BT4241-4247 operon was also seen in humanized Fut2(-) mice. Together, these data demonstrate that differences in host genotype that affect the carbohydrate landscape of the distal gut interact with diet to alter the composition and function of resident microbes in a diet-dependent manner.

    View details for DOI 10.1073/pnas.1306070110

    View details for Web of Science ID 000325634200076

    View details for PubMedID 24062455

    View details for PubMedCentralID PMC3800993

  • Prototypical Recombinant Multi-Protease-Inhibitor-Resistant Infectious Molecular Clones of Human Immunodeficiency Virus Type 1 ANTIMICROBIAL AGENTS AND CHEMOTHERAPY Varghese, V., Mitsuya, Y., Fessel, W. J., Liu, T. F., Melikian, G. L., Katzenstein, D. A., Schiffer, C. A., Holmes, S. P., Shafer, R. W. 2013; 57 (9): 4290-4299


    The many genetic manifestations of HIV-1 protease inhibitor (PI) resistance present challenges to research into the mechanisms of PI-resistance and the assessment of new PIs. To address these challenges, we created a panel of recombinant multi-PI resistant infectious molecular clones designed to represent the spectrum of clinically relevant multi-PI resistant viruses. To assess the representativeness of this panel, we examined the sequences of the panel's viruses in the context of a correlation network of PI-resistance amino acid substitutions in sequences from more than 10,000 patients. The panel of recombinant infectious molecular clones comprised 29 of 41 study-defined PI-resistance amino acid substitutions and 23 of the 27 tightest amino acid substitution clusters. Based on their phenotypic properties, the clones were classified into four groups with increasing cross-resistance to the PIs most commonly used for salvage therapy: lopinavir (LPV), tipranavir (TPV), and darunavir (DRV). The panel of recombinant infectious molecular clones has been made available without restriction through the NIH AIDS Research and Reference Reagent Program. The public availability of the panel makes it possible to compare the inhibitory activity of different PIs with one another. The diversity of the panel and the high-level PI resistance of its clones suggest that investigational PIs active against the clones in this panel will retain antiviral activity against most, if not all clinically relevant PI-resistant viruses.

    View details for DOI 10.1128/AAC.00614-13

    View details for Web of Science ID 000323285500025


    View details for DOI 10.1214/12-AAP884

    View details for Web of Science ID 000321678200014

  • Harvester ants use interactions to regulate forager activation and availability ANIMAL BEHAVIOUR Pinter-Wollman, N., Bala, A., Merrell, A., Queirolo, J., Stumpe, M. C., Holmes, S., Gordon, D. M. 2013; 86 (1): 197-207


    Social groups balance flexibility and robustness in their collective response to environmental changes using feedback between behavioural processes that operate at different timescales. Here we examine how behavioural processes operating at two timescales regulate the foraging activity of colonies of the harvester ant, Pogonomyrmex barbatus, allowing them to balance their response to food availability and predation. Previous work showed that the rate at which foragers return to the nest with food influences the rate at which foragers leave the nest. To investigate how interactions inside the nest link the rates of returning and outgoing foragers, we observed outgoing foragers inside the nest in field colonies using a novel observation method. We found that the interaction rate experienced by outgoing foragers inside the nest corresponded to forager return rate, and that the interactions of outgoing foragers were spatially clustered. Activation of a forager occurred on the timescale of seconds: a forager left the nest 3-8 s after a substantial increase in interactions with returning foragers. The availability of outgoing foragers to become activated was adjusted on the timescale of minutes: when forager return was interrupted for more than 4-5 min, available foragers waiting near the nest entrance went deeper into the nest. Thus, forager activation and forager availability both increased with the rate at which foragers returned to the nest. This process was checked by negative feedback between forager activation and forager availability. Regulation of foraging activation on the timescale of seconds provides flexibility in response to fluctuations in food abundance, whereas regulation of forager availability on the timescale of minutes provides robustness in response to sustained disturbance such as predation.

    View details for DOI 10.1016/j.anbehav.2013.05.012

    View details for Web of Science ID 000321758700026

    View details for PubMedCentralID PMC3767282

  • Nucleoside reverse transcriptase inhibitor resistance mutations associated with first-line stavudine-containing antiretroviral therapy: programmatic implications for countries phasing out stavudine. journal of infectious diseases Tang, M. W., Rhee, S., Bertagnolio, S., Ford, N., Holmes, S., Sigaloff, K. C., Hamers, R. L., de Wit, T. F., Fleury, H. J., Kanki, P. J., Ruxrungtham, K., Hawkins, C. A., Wallis, C. L., Stevens, W., van Zyl, G. U., Manosuthi, W., Hosseinipour, M. C., Ngo-Giang-Huong, N., Belec, L., Peeters, M., Aghokeng, A., Bunupuradah, T., Burda, S., Cane, P., Cappelli, G., Charpentier, C., Dagnra, A. Y., Deshpande, A. K., El-Katib, Z., Eshleman, S. H., Fokam, J., Gody, J., Katzenstein, D., Koyalta, D. D., Kumwenda, J. J., Lallemant, M., Lynen, L., Marconi, V. C., Margot, N. A., Moussa, S., Ndung'u, T., Nyambi, P. N., Orrell, C., Schapiro, J. M., Schuurman, R., Sirivichayakul, S., Smith, D., Zolfo, M., Jordan, M. R., Shafer, R. W. 2013; 207: S70-7


    Background The World Health Organization Antiretroviral Treatment Guidelines recommend phasing-out stavudine because of its risk of long-term toxicity. There are two mutational pathways of stavudine resistance with different implications for zidovudine and tenofovir cross-resistance, the primary candidates for replacing stavudine. However, because resistance testing is rarely available in resource-limited settings, it is critical to identify the cross-resistance patterns associated with first-line stavudine failure. Methods We analyzed HIV-1 resistance mutations following first-line stavudine failure from 35 publications comprising 1,825 individuals. We also assessed the influence of concomitant nevirapine vs. efavirenz, therapy duration, and HIV-1 subtype on the proportions of mutations associated with zidovudine vs. tenofovir cross-resistance. Results Mutations with preferential zidovudine activity, K65R or K70E, occurred in 5.3% of individuals. Mutations with preferential tenofovir activity, ≥two thymidine analog mutations (TAMs) or Q151M, occurred in 22% of individuals. Nevirapine increased the risk of TAMs, K65R, and Q151M. Longer therapy increased the risk of TAMs and Q151M but not K65R. Subtype C and CRF01_AE increased the risk of K65R, but only CRF01_AE increased the risk of K65R without Q151M. Conclusions Regardless of concomitant nevirapine vs. efavirenz, therapy duration, or subtype, tenofovir was more likely than zidovudine to retain antiviral activity following first-line d4T therapy.

    View details for DOI 10.1093/infdis/jit114

    View details for PubMedID 23687292

    View details for PubMedCentralID PMC3657117

  • Nucleoside reverse transcriptase inhibitor resistance mutations associated with first-line Stavudine-containing antiretroviral therapy: programmatic implications for countries phasing out Stavudine. journal of infectious diseases Tang, M. W., Rhee, S., Bertagnolio, S., Ford, N., Holmes, S., Sigaloff, K. C., Hamers, R. L., de Wit, T. F., Fleury, H. J., Kanki, P. J., Ruxrungtham, K., Hawkins, C. A., Wallis, C. L., Stevens, W., van Zyl, G. U., Manosuthi, W., Hosseinipour, M. C., Ngo-Giang-Huong, N., Belec, L., Peeters, M., Aghokeng, A., Bunupuradah, T., Burda, S., Cane, P., Cappelli, G., Charpentier, C., Dagnra, A. Y., Deshpande, A. K., El-Katib, Z., Eshleman, S. H., Fokam, J., Gody, J., Katzenstein, D., Koyalta, D. D., Kumwenda, J. J., Lallemant, M., Lynen, L., Marconi, V. C., Margot, N. A., Moussa, S., Ndung'u, T., Nyambi, P. N., Orrell, C., Schapiro, J. M., Schuurman, R., Sirivichayakul, S., Smith, D., Zolfo, M., Jordan, M. R., Shafer, R. W. 2013; 207: S70-7


    Background The World Health Organization Antiretroviral Treatment Guidelines recommend phasing-out stavudine because of its risk of long-term toxicity. There are two mutational pathways of stavudine resistance with different implications for zidovudine and tenofovir cross-resistance, the primary candidates for replacing stavudine. However, because resistance testing is rarely available in resource-limited settings, it is critical to identify the cross-resistance patterns associated with first-line stavudine failure. Methods We analyzed HIV-1 resistance mutations following first-line stavudine failure from 35 publications comprising 1,825 individuals. We also assessed the influence of concomitant nevirapine vs. efavirenz, therapy duration, and HIV-1 subtype on the proportions of mutations associated with zidovudine vs. tenofovir cross-resistance. Results Mutations with preferential zidovudine activity, K65R or K70E, occurred in 5.3% of individuals. Mutations with preferential tenofovir activity, ≥two thymidine analog mutations (TAMs) or Q151M, occurred in 22% of individuals. Nevirapine increased the risk of TAMs, K65R, and Q151M. Longer therapy increased the risk of TAMs and Q151M but not K65R. Subtype C and CRF01_AE increased the risk of K65R, but only CRF01_AE increased the risk of K65R without Q151M. Conclusions Regardless of concomitant nevirapine vs. efavirenz, therapy duration, or subtype, tenofovir was more likely than zidovudine to retain antiviral activity following first-line d4T therapy.

    View details for DOI 10.1093/infdis/jit114

    View details for PubMedID 23687292

    View details for PubMedCentralID PMC3657117

  • Interval Graph Limits ANNALS OF COMBINATORICS Diaconis, P., Holmes, S., Janson, S. 2013; 17 (1): 27-52


    We work out a graph limit theory for dense interval graphs. The theory developed departs from the usual description of a graph limit as a symmetric function W (x, y) on the unit square, with x and y uniform on the interval (0, 1). Instead, we fix a W and change the underlying distribution of the coordinates x and y. We find choices such that our limits are continuous. Connections to random interval graphs are given, including some examples. We also show a continuity result for the chromatic number and clique number of interval graphs. Some results on uniqueness of the limit description are given for general graph limits.

    View details for DOI 10.1007/s00026-012-0175-0

    View details for Web of Science ID 000319358600003

    View details for PubMedCentralID PMC4578824

  • phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PloS one McMurdie, P. J., Holmes, S. 2013; 8 (4)


    the analysis of microbial communities through dna sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high throughput microbiome census data.Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research.The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.

    View details for DOI 10.1371/journal.pone.0061217

    View details for PubMedID 23630581

    View details for PubMedCentralID PMC3632530

  • Sampling from a Manifold Festchrift for Joe Eaton Diaconis, P., Holmes, S., Shahshahani, M. IMS. 2013
  • Low-Level Persistence of Drug Resistance Mutations in Hepatitis B Virus-Infected Subjects with a Past History of Lamivudine Treatment ANTIMICROBIAL AGENTS AND CHEMOTHERAPY Margeridon-Thermet, S., Svarovskaia, E. S., Babrzadeh, F., Martin, R., Liu, T. F., Pacold, M., Reuman, E. C., Holmes, S. P., Borroto-Esoda, K., Shafer, R. W. 2013; 57 (1): 343-349


    We sought to determine the prevalence of hepatitis B virus (HBV) lamivudine (LAM)-resistant minority variants in subjects who once received LAM but had discontinued it prior to virus sampling. We performed direct PCR Sanger sequencing and ultradeep pyrosequencing (UDPS) of HBV reverse transcriptase (RT) of plasma viruses from 45 LAM-naive subjects and 46 LAM-experienced subjects who had discontinued LAM a median of 24 months earlier. UDPS was performed to a depth of ∼3,000 reads per nucleotide. Minority variants were defined as differences from the Sanger sequence present in ≥0.5% of UDPS reads in a sample. Sanger sequencing identified ≥1 LAM resistance mutations (rtL80I/V, rtM204I, and rtA181T) in samples from 5 (11%) of 46 LAM-experienced and none of 45 LAM-naive subjects (0%; P = 0.06). UDPS detected ≥1 LAM resistance mutations (rtL80I/V, rtV173L, rtL180M, rtA181T, and rtM204I/V) in 10 (22%) of the 46 LAM-experienced subjects, including 5 in whom LAM resistance mutations were not identified by Sanger sequencing. Overall, LAM resistance mutations were more likely to be present in LAM-experienced (10/46, 22%) than LAM-naive subjects (0/45, 0%; P = 0.001). The median time since LAM discontinuation was 12.8 months in the 10 subjects with a LAM resistance mutation compared to 30.5 months in the 36 LAM-experienced subjects without a LAM resistance mutation (P < 0.001). The likelihood of detecting a LAM resistance mutation was significantly increased using UDPS compared to Sanger sequencing and was inversely associated with the time since LAM discontinuation.

    View details for DOI 10.1128/AAC.01601-12

    View details for Web of Science ID 000312958400044

    View details for PubMedID 23114756

    View details for PubMedCentralID PMC3535911

  • Advancing Our Understanding of the Human Microbiome Using QIIME. Methods in enzymology Navas-Molina, J. A., Peralta-Sánchez, J. M., González, A., McMurdie, P. J., Vázquez-Baeza, Y., Xu, Z., Ursell, L. K., Lauber, C., Zhou, H., Song, S. J., Huntley, J., Ackermann, G. L., Berg-Lyons, D., Holmes, S., Caporaso, J. G., Knight, R. 2013; 531: 371-444


    High-throughput DNA sequencing technologies, coupled with advanced bioinformatics tools, have enabled rapid advances in microbial ecology and our understanding of the human microbiome. QIIME (Quantitative Insights Into Microbial Ecology) is an open-source bioinformatics software package designed for microbial community analysis based on DNA sequence data, which provides a single analysis framework for analysis of raw sequence data through publication-quality statistical analyses and interactive visualizations. In this chapter, we demonstrate the use of the QIIME pipeline to analyze microbial communities obtained from several sites on the bodies of transgenic and wild-type mice, as assessed using 16S rRNA gene sequences generated on the Illumina MiSeq platform. We present our recommended pipeline for performing microbial community analysis and provide guidelines for making critical choices in the process. We present examples of some of the types of analyses that are enabled by QIIME and discuss how other tools, such as phyloseq and R, can be applied to expand upon these analyses.

    View details for DOI 10.1016/B978-0-12-407863-5.00019-8

    View details for PubMedID 24060131

  • PRC2/EED-EZH2 Complex Is Up-Regulated in Breast Cancer Lymph Node Metastasis Compared to Primary Tumor and Correlates with Tumor Proliferation In Situ PLOS ONE Yu, H., Simons, D. L., Segall, I., Carcamo-Cavazos, V., Schwartz, E. J., Yan, N., Zuckerman, N. S., Dirbas, F. M., Johnson, D. L., Holmes, S. P., Lee, P. P. 2012; 7 (12)


    Lymph node metastasis is a key event in the progression of breast cancer. Therefore it is important to understand the underlying mechanisms which facilitate regional lymph node metastatic progression.We performed gene expression profiling of purified tumor cells from human breast tumor and lymph node metastasis. By microarray network analysis, we found an increased expression of polycomb repression complex 2 (PRC2) core subunits EED and EZH2 in lymph node metastatic tumor cells over primary tumor cells which were validated through real-time PCR. Additionally, immunohistochemical (IHC) staining and quantitative image analysis of whole tissue sections showed a significant increase of EZH2 expressing tumor cells in lymph nodes over paired primary breast tumors, which strongly correlated with tumor cell proliferation in situ. We further explored the mechanisms of PRC2 gene up-regulation in metastatic tumor cells and found up-regulation of E2F genes, MYC targets and down-regulation of tumor suppressor gene E-cadherin targets in lymph node metastasis through GSEA analyses. Using IHC, the expression of potential EZH2 target, E-cadherin was examined in paired primary/lymph node samples and was found to be significantly decreased in lymph node metastases over paired primary tumors.This study identified an over expression of the epigenetic silencing complex PRC2/EED-EZH2 in breast cancer lymph node metastasis as compared to primary tumor and its positive association with tumor cell proliferation in situ. Concurrently, PRC2 target protein E-cadherin was significant decreased in lymph node metastases, suggesting PRC2 promotes epithelial mesenchymal transition (EMT) in lymph node metastatic process through repression of E-cadherin. These results indicate that epigenetic regulation mediated by PRC2 proteins may provide additional advantage for the outgrowth of metastatic tumor cells in lymph nodes. This opens up epigenetic drug development possibilities for the treatment and prevention of lymph node metastasis in breast cancer.

    View details for DOI 10.1371/journal.pone.0051239

    View details for PubMedID 23251464

  • Denoising PCR-amplified metagenome data BMC BIOINFORMATICS Rosen, M. J., Callahan, B. J., Fisher, D. S., Holmes, S. P. 2012; 13


    PCR amplification and high-throughput sequencing theoretically enable the characterization of the finest-scale diversity in natural microbial and viral populations, but each of these methods introduces random errors that are difficult to distinguish from genuine biological diversity. Several approaches have been proposed to denoise these data but lack either speed or accuracy.We introduce a new denoising algorithm that we call DADA (Divisive Amplicon Denoising Algorithm). Without training data, DADA infers both the sample genotypes and error parameters that produced a metagenome data set. We demonstrate performance on control data sequenced on Roche's 454 platform, and compare the results to the most accurate denoising software currently available, AmpliconNoise.DADA is more accurate and over an order of magnitude faster than AmpliconNoise. It eliminates the need for training data to establish error parameters, fully utilizes sequence-abundance information, and enables inclusion of context-dependent PCR error rates. It should be readily extensible to other sequencing platforms such as Illumina.

    View details for DOI 10.1186/1471-2105-13-283

    View details for Web of Science ID 000314687600001

    View details for PubMedID 23113967

    View details for PubMedCentralID PMC3563472

  • Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Chakerian, J., Holmes, S. 2012; 21 (3): 581-599
  • Nest site and weather affect the personality of harvester ant colonies BEHAVIORAL ECOLOGY Pinter-Wollman, N., Gordon, D. M., Holmes, S. 2012; 23 (5): 1022-1029


    Environmental conditions and physical constraints both influence an animal's behavior. We investigate whether behavioral variation among colonies of the black harvester ant, Messor andrei, remains consistent across foraging and disturbance situations and ask whether consistent colony behavior is affected by nest site and weather. We examined variation among colonies in responsiveness to food baits and to disturbance, measured as a change in numbers of active ants, and in the speed with which colonies retrieved food and removed debris. Colonies differed consistently, across foraging and disturbance situations, in both responsiveness and speed. Increased activity in response to food was associated with a smaller decrease in response to alarm. Speed of retrieving food was correlated with speed of removing debris. In all colonies, speed was greater in dry conditions, reducing the amount of time ants spent outside the nest. While a colony occupied a certain nest site, its responsiveness was consistent in both foraging and disturbance situations, suggesting that nest structure influences colony personality.

    View details for DOI 10.1093/beheco/ars066

    View details for Web of Science ID 000308228200017

    View details for PubMedCentralID PMC3431114

  • The Molecular Architecture of the Eukaryotic Chaperonin TRiC/CCT STRUCTURE Leitner, A., Joachimiak, L. A., Bracher, A., Moenkemeyer, L., Walzthoeni, T., Chen, B., Pechmann, S., Holmes, S., Cong, Y., Ma, B., Ludtke, S., Chiu, W., Hartl, F. U., Aebersold, R., Frydman, J. 2012; 20 (5): 814-825


    TRiC/CCT is a highly conserved and essential chaperonin that uses ATP cycling to facilitate folding of approximately 10% of the eukaryotic proteome. This 1 MDa hetero-oligomeric complex consists of two stacked rings of eight paralogous subunits each. Previously proposed TRiC models differ substantially in their subunit arrangements and ring register. Here, we integrate chemical crosslinking, mass spectrometry, and combinatorial modeling to reveal the definitive subunit arrangement of TRiC. In vivo disulfide mapping provided additional validation for the crosslinking-derived arrangement as the definitive TRiC topology. This subunit arrangement allowed the refinement of a structural model using existing X-ray diffraction data. The structure described here explains all available crosslink experiments, provides a rationale for previously unexplained structural features, and reveals a surprising asymmetry of charges within the chaperonin folding chamber.

    View details for DOI 10.1016/j.str.2012.03.007

    View details for Web of Science ID 000304214400008

    View details for PubMedID 22503819

    View details for PubMedCentralID PMC3350567

  • Phyloseq: a bioconductor package for handling and analysis of high-throughput phylogenetic sequence data. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing McMurdie, P. J., Holmes, S. 2012: 235-246


    We present a detailed description of a new Bioconductor package, phyloseq, for integrated data and analysis of taxonomically-clustered phylogenetic sequencing data in conjunction with related data types. The phyloseq package integrates abundance data, phylogenetic information and covariates so that exploratory transformations, plots, and confirmatory testing and diagnostic plots can be carried out seamlessly. The package is built following the S4 object-oriented framework of the R language so that once the data have been input the user can easily transform, plot and analyze the data. We present some examples that highlight the methods and the ease with which we can leverage existing packages.

    View details for PubMedID 22174279

  • Comparisons of distance methods for combining covariates and abundances in microbiome studies. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Fukuyama, J., McMurdie, P. J., Dethlefsen, L., Relman, D. A., Holmes, S. 2012: 213-224


    This article compares different methods for combining abundance data, phylogenetic trees and clinical covariates in a nonparametric setting. In particular we study the output from the principal coordinates analysis on UNIFRAC and WEIGHTED UNIFRAC distances and the output from a double principal coordinate analyses DPCOA using distances computed on the phylogenetic tree. We also present power comparisons for some of the standard tests of phylogenetic signal between different types of samples. These methods are compared both on simulated and real data sets. Our study shows that DPCoA is less robust to outliers, and more robust to small noisy fluctuations around zero.

    View details for PubMedID 22174277

  • A multifaceted analysis of HIV-1 protease multidrug resistance phenotypes BMC BIOINFORMATICS Doherty, K. M., Nakka, P., King, B. M., Rhee, S., Holmes, S. P., Shafer, R. W., Radhakrishnan, M. L. 2011; 12


    Great strides have been made in the effective treatment of HIV-1 with the development of second-generation protease inhibitors (PIs) that are effective against historically multi-PI-resistant HIV-1 variants. Nevertheless, mutation patterns that confer decreasing susceptibility to available PIs continue to arise within the population. Understanding the phenotypic and genotypic patterns responsible for multi-PI resistance is necessary for developing PIs that are active against clinically-relevant PI-resistant HIV-1 variants.In this work, we use globally optimal integer programming-based clustering techniques to elucidate multi-PI phenotypic resistance patterns using a data set of 398 HIV-1 protease sequences that have each been phenotyped for susceptibility toward the nine clinically-approved HIV-1 PIs. We validate the information content of the clusters by evaluating their ability to predict the level of decreased susceptibility to each of the available PIs using a cross validation procedure. We demonstrate the finding that as a result of phenotypic cross resistance, the considered clinical HIV-1 protease isolates are confined to ~6% or less of the clinically-relevant phenotypic space. Clustering and feature selection methods are used to find representative sequences and mutations for major resistance phenotypes to elucidate their genotypic signatures. We show that phenotypic similarity does not imply genotypic similarity, that different PI-resistance mutation patterns can give rise to HIV-1 isolates with similar phenotypic profiles.Rather than characterizing HIV-1 susceptibility toward each PI individually, our study offers a unique perspective on the phenomenon of PI class resistance by uncovering major multidrug-resistant phenotypic patterns and their often diverse genotypic determinants, providing a methodology that can be applied to understand clinically-relevant phenotypic patterns to aid in the design of novel inhibitors that target other rapidly evolving molecular targets as well.

    View details for DOI 10.1186/1471-2105-12-477

    View details for Web of Science ID 000301382000001

    View details for PubMedID 22172090

    View details for PubMedCentralID PMC3305535



    Today's data-heavy research environment requires the integration of different sources of information into structured datasets that can not be analyzed as simple matrices. We introduce an old technique, known in the European data analyses circles as the Duality Diagram Approach, put to new uses through the use of a variety of metrics and ways of combining different diagrams together. This issue of the Annals of Applied Statistics contains contemporary examples of how this approach provides solutions to hard problems in data integration. We present here the genesis of the technique and how it can be seen as a precursor of the modern kernel based approaches.

    View details for DOI 10.1214/10-AOAS408

    View details for Web of Science ID 000300382800002

    View details for PubMedCentralID PMC3265363

  • The effect of individual variation on the structure and function of interaction networks in harvester ants JOURNAL OF THE ROYAL SOCIETY INTERFACE Pinter-Wollman, N., Wollman, R., Guetz, A., Holmes, S., Gordon, D. M. 2011; 8 (64): 1562-1573


    Social insects exhibit coordinated behaviour without central control. Local interactions among individuals determine their behaviour and regulate the activity of the colony. Harvester ants are recruited for outside work, using networks of brief antennal contacts, in the nest chamber closest to the nest exit: the entrance chamber. Here, we combine empirical observations, image analysis and computer simulations to investigate the structure and function of the interaction network in the entrance chamber. Ant interactions were distributed heterogeneously in the chamber, with an interaction hot-spot at the entrance leading further into the nest. The distribution of the total interactions per ant followed a right-skewed distribution, indicating the presence of highly connected individuals. Numbers of ant encounters observed positively correlated with the duration of observation. Individuals varied in interaction frequency, even after accounting for the duration of observation. An ant's interaction frequency was explained by its path shape and location within the entrance chamber. Computer simulations demonstrate that variation among individuals in connectivity accelerates information flow to an extent equivalent to an increase in the total number of interactions. Individual variation in connectivity, arising from variation among ants in location and spatial behaviour, creates interaction centres, which may expedite information flow.

    View details for DOI 10.1098/rsif.2011.0059

    View details for Web of Science ID 000295211200003

    View details for PubMedID 21490001

    View details for PubMedCentralID PMC3177612

  • Adaptive importance sampling for network growth models ANNALS OF OPERATIONS RESEARCH Guetz, A. N., Holmes, S. P. 2011; 189 (1): 187-203


    Network Growth Models such as Preferential Attachment and Duplication/Divergence are popular generative models with which to study complex networks in biology, sociology, and computer science. However, analyzing them within the framework of model selection and statistical inference is often complicated and computationally difficult, particularly when comparing models that are not directly related or nested. In practice, ad hoc methods are often used with uncertain results. If possible, the use of standard likelihood-based statistical model selection techniques is desirable. With this in mind, we develop an Adaptive Importance Sampling algorithm for estimating likelihoods of Network Growth Models. We introduce the use of the classic Plackett-Luce model of rankings as a family of importance distributions. Updates to importance distributions are performed iteratively via the Cross-Entropy Method with an additional correction for degeneracy/over-fitting inspired by the Minimum Description Length principle. This correction can be applied to other estimation problems using the Cross-Entropy method for integration/approximate counting, and it provides an interpretation of Adaptive Importance Sampling as iterative model selection. Empirical results for the Preferential Attachment model are given, along with a comparison to an alternative established technique, Annealed Importance Sampling.

    View details for DOI 10.1007/s10479-010-0685-2

    View details for Web of Science ID 000294689000010

    View details for PubMedCentralID PMC4863242

  • Colonic Contribution to Uremic Solutes JOURNAL OF THE AMERICAN SOCIETY OF NEPHROLOGY Aronov, P. A., Luo, F. J., Plummer, N. S., Quan, Z., Holmes, S., Hostetter, T. H., Meyer, T. W. 2011; 22 (9): 1769-1776


    Microbes in the colon produce compounds, normally excreted by the kidneys, which are potential uremic toxins. Although p-cresol sulfate and indoxyl sulfate are well studied examples, few other compounds are known. Here, we compared plasma from hemodialysis patients with and without colons to identify and further characterize colon-derived uremic solutes. HPLC confirmed the colonic origin of p-cresol sulfate and indoxyl sulfate, but levels of hippurate, methylamine, and dimethylamine were not significantly lower in patients without colons. High-resolution mass spectrometry detected more than 1000 features in predialysis plasma samples. Hierarchical clustering based on these features clearly separated dialysis patients with and without colons. Compared with patients with colons, we identified more than 30 individual features in patients without colons that were either absent or present in lower concentration. Almost all of these features were more prominent in plasma from dialysis patients than normal subjects, suggesting that they represented uremic solutes. We used a panel of indole and phenyl standards to identify five colon-derived uremic solutes: α-phenylacetyl-l-glutamine, 5-hydroxyindole, indoxyl glucuronide, p-cresol sulfate, and indoxyl sulfate. However, compounds with accurate mass values matching most of the colon-derived solutes could not be found in standard metabolomic databases. These results suggest that colonic microbes may produce an important portion of uremic solutes, most of which remain unidentified.

    View details for DOI 10.1681/ASN.2010121220

    View details for Web of Science ID 000295705800024

    View details for PubMedID 21784895

    View details for PubMedCentralID PMC3171947

  • Site-Specific Mobilization of Vinyl Chloride Respiration Islands by a Mechanism Common in Dehalococcoides BMC GENOMICS McMurdie, P. J., Hug, L. A., Edwards, E. A., Holmes, S., Spormann, A. M. 2011; 12


    Vinyl chloride is a widespread groundwater pollutant and Group 1 carcinogen. A previous comparative genomic analysis revealed that the vinyl chloride reductase operon, vcrABC, of Dehalococcoides sp. strain VS is embedded in a horizontally-acquired genomic island that integrated at the single-copy tmRNA gene, ssrA.We targeted conserved positions in available genomic islands to amplify and sequence four additional vcrABC -containing genomic islands from previously-unsequenced vinyl chloride respiring Dehalococcoides enrichments. We identified a total of 31 ssrA-specific genomic islands from Dehalococcoides genomic data, accounting for 47 reductive dehalogenase homologous genes and many other non-core genes. Sixteen of these genomic islands contain a syntenic module of integration-associated genes located adjacent to the predicted site of integration, and among these islands, eight contain vcrABC as genetic 'cargo'. These eight vcrABC -containing genomic islands are syntenic across their ~12 kbp length, but have two phylogenetically discordant segments that unambiguously differentiate the integration module from the vcrABC cargo. Using available Dehalococcoides phylogenomic data we estimate that these ssrA-specific genomic islands are at least as old as the Dehalococcoides group itself, which in turn is much older than human civilization.The vcrABC -containing genomic islands are a recently-acquired subset of a diverse collection of ssrA-specific mobile elements that are a major contributor to strain-level diversity in Dehalococcoides, and may have been throughout its evolution. The high similarity between vcrABC sequences is quantitatively consistent with recent horizontal acquisition driven by ~100 years of industrial pollution with chlorinated ethenes.

    View details for DOI 10.1186/1471-2164-12-287

    View details for Web of Science ID 000293280200001

    View details for PubMedID 21635780

    View details for PubMedCentralID PMC3146451

  • Colony variation in the collective regulation of foraging by harvester ants BEHAVIORAL ECOLOGY Gordon, D. M., Guetz, A., Greene, M. J., Holmes, S. 2011; 22 (2): 429-435


    This study investigates variation in collective behavior in a natural population of colonies of the harvester ant, Pogonomyrmex barbatus. Harvester ant colonies regulate foraging activity to adjust to current food availability; the rate at which inactive foragers leave the nest on the next trip depends on the rate at which successful foragers return with food. This study investigates differences among colonies in foraging activity and how these differences are associated with variation among colonies in the regulation of foraging. Colonies differ in the baseline rate at which patrollers leave the nest, without stimulation from returning ants. This baseline rate predicts a colony's foraging activity, suggesting there is a colony-specific activity level that influences how quickly any ant leaves the nest. When a colony's foraging activity is high, the colony is more likely to regulate foraging. Moreover, colonies differ in the propensity to adjust the rate of outgoing foragers to the rate of forager return. Naturally occurring variation in the regulation of foraging may lead to variation in colony survival and reproductive success.

    View details for DOI 10.1093/beheco/arq218

    View details for Web of Science ID 000289299500033

    View details for PubMedCentralID PMC3071749

  • PhyloChip microarray analysis reveals altered gastrointestinal microbial communities in a rat model of colonic hypersensitivity NEUROGASTROENTEROLOGY AND MOTILITY Nelson, T. A., Holmes, S., ALEKSEYENKO, A. V., Shenoy, M., DeSantis, T., Wu, C. H., Andersen, G. L., Winston, J., Sonnenburg, J., Pasricha, P. J., Spormann, A. 2011; 23 (2)


    Irritable bowel syndrome (IBS) is a chronic, episodic gastrointestinal disorder that is prevalent in a significant fraction of western human populations; and changes in the microbiota of the large bowel have been implicated in the pathology of the disease.Using a novel comprehensive, high-density DNA microarray (PhyloChip) we performed a phylogenetic analysis of the microbial community of the large bowel in a rat model in which intracolonic acetic acid in neonates was used to induce long lasting colonic hypersensitivity and decreased stool water content and frequency, representing the equivalent of human constipation-predominant IBS.Our results revealed a significantly increased compositional difference in the microbial communities in rats with neonatal irritation as compared with controls. Even more striking was the dramatic change in the ratio of Firmicutes relative to Bacteroidetes, where neonatally irritated rats were enriched more with Bacteroidetes and also contained a different composition of species within this phylum. Our study also revealed differences at the level of bacterial families and species.The PhyloChip is a useful and convenient method to study enteric microflora. Further, this rat model system may be a useful experimental platform to study the causes and consequences of changes in microbial community composition associated with IBS.

    View details for DOI 10.1111/j.1365-2982.2010.01637.x

    View details for Web of Science ID 000286211600017

    View details for PubMedID 21129126

    View details for PubMedCentralID PMC3353725

  • Visualization and statistical comparisons of microbial communities using R packages on Phylochip data. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Holmes, S., Alekseyenko, A., Timme, A., Nelson, T., Pasricha, P. J., Spormann, A. 2011: 142-153


    This article explains the statistical and computational methodology used to analyze species abundances collected using the LNBL Phylochip in a study of Irritable Bowel Syndrome (IBS) in rats. Some tools already available for the analysis of ordinary microarray data are useful in this type of statistical analysis. For instance in correcting for multiple testing we use Family Wise Error rate control and step-down tests (available in the multtest package). Once the most significant species are chosen we use the hypergeometric tests familiar for testing GO categories to test specific phyla and families. We provide examples of normalization, multivariate projections, batch effect detection and integration of phylogenetic covariation, as well as tree equalization and robustification methods.

    View details for PubMedID 21121042

  • Bioinformatics and Data Analysis Handbook of Molecular and Cellular Methods in Biology and Medicine de La Cruz, O., Holmes, S. CRC. 2011; 3rd: 409–436
  • A classification model for G-to-A hypermutation in hepatitis B virus ultra-deep pyrosequencing reads BIOINFORMATICS Reuman, E. C., Margeridon-Thermet, S., Caudill, H. B., Liu, T., Borroto-Esoda, K., Svarovskaia, E. S., Holmes, S. P., Shafer, R. W. 2010; 26 (23): 2929-2932


    G → A hypermutation is an innate antiviral defense mechanism, mediated by host enzymes, which leads to the mutational impairment of viruses. Sensitive and specific identification of host-mediated G → A hypermutation is a novel sequence analysis challenge, particularly for viral deep sequencing studies. For example, two of the most common hepatitis B virus (HBV) reverse transcriptase (RT) drug-resistance mutations, A181T and M204I, arise from G → A changes and are routinely detected as low-abundance variants in nearly all HBV deep sequencing samples.We developed a classification model using measures of G → A excess and predicted indicators of lethal mutation and applied this model to 325 920 unique deep sequencing reads from plasma virus samples from 45 drug treatment-naïve HBV-infected individuals. The 2.9% of sequence reads that were classified as hypermutated by our model included most of the reads with A181T and/or M204I, indicating the usefulness of this model for distinguishing viral adaptive changes from host-mediated viral editing.Source code and sequence data are available at data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btq570

    View details for Web of Science ID 000284430900001

    View details for PubMedID 20937597

    View details for PubMedCentralID PMC2982158

  • Quantitative, Architectural Analysis of Immune Cell Subsets in Tumor-Draining Lymph Nodes from Breast Cancer Patients and Healthy Lymph Nodes PLOS ONE Setiadi, A. F., Ray, N. C., Kohrt, H. E., Kapelner, A., Carcamo-Cavazos, V., Levic, E. B., Yadegarynia, S., van der Loos, C. M., Schwartz, E. J., Holmes, S., Lee, P. P. 2010; 5 (8)


    To date, pathological examination of specimens remains largely qualitative. Quantitative measures of tissue spatial features are generally not captured. To gain additional mechanistic and prognostic insights, a need for quantitative architectural analysis arises in studying immune cell-cancer interactions within the tumor microenvironment and tumor-draining lymph nodes (TDLNs).We present a novel, quantitative image analysis approach incorporating 1) multi-color tissue staining, 2) high-resolution, automated whole-section imaging, 3) custom image analysis software that identifies cell types and locations, and 4) spatial statistical analysis. As a proof of concept, we applied this approach to study the architectural patterns of T and B cells within tumor-draining lymph nodes from breast cancer patients versus healthy lymph nodes. We found that the spatial grouping patterns of T and B cells differed between healthy and breast cancer lymph nodes, and this could be attributed to the lack of B cell localization in the extrafollicular region of the TDLNs.Our integrative approach has made quantitative analysis of complex visual data possible. Our results highlight spatial alterations of immune cells within lymph nodes from breast cancer patients as an independent variable from numerical changes. This opens up new areas of investigations in research and medicine. Future application of this approach will lead to a better understanding of immune changes in the tumor microenvironment and TDLNs, and how they affect clinical outcomes.

    View details for DOI 10.1371/journal.pone.0012420

    View details for Web of Science ID 000281234700034

    View details for PubMedID 20811638

    View details for PubMedCentralID PMC2928294

  • Constrained patterns of covariation and clustering of HIV-1 non-nucleoside reverse transcriptase inhibitor resistance mutations JOURNAL OF ANTIMICROBIAL CHEMOTHERAPY Reuman, E. C., Rhee, S., Holmes, S. P., Shafer, R. W. 2010; 65 (7): 1477-1485


    We characterized pairwise and higher order patterns of non-nucleoside reverse transcriptase inhibitor (NNRTI)-selected mutations because multiple mutations are usually required for clinically significant resistance to second-generation NNRTIs.We analysed viruses from 13 039 individuals with sequences containing at least one of 52 published NNRTI-selected mutations, including 1133 viruses from individuals who received efavirenz but no other NNRTI and 1510 viruses from individuals who received nevirapine but no other NNRTI. Of the 17 reported etravirine resistance-associated mutations (RAMs), Y181C/I/V, L100I, K101P and M230L were considered major based on published in vitro susceptibility data.Efavirenz preferentially selected for 16 mutations, including L100I (14% versus 0.1%, P < 0.001), K101P (3.3% versus 0.4%, P < 0.001) and M230L (2.8% versus 1.3%, P = 0.004), whereas nevirapine preferentially selected for 12 mutations, including Y181C/I/V (48% versus 6.9%, P < 0.001). Twenty-nine pairs of NNRTI-selected mutations covaried significantly, including Y181C with seven other mutations (A98G, K101E/H, V108I, G190A/S and H221Y), L100I with K103N, and K101P with K103S. Two pairs (Y181C + V179F and Y181C + G190S) were predicted to confer >10-fold decreased etravirine susceptibility. Seventeen percent of sequences had three or more NNRTI-selected mutations, mostly in clusters of covarying mutations. Many clusters had Y181C plus a non-major etravirine RAM; few had more than one major etravirine RAM.Although major etravirine RAMs rarely occur in combination, 2 of 29 pairs of covarying mutations were associated with >10-fold decreased etravirine susceptibility. Viruses with three or more NNRTI-selected mutations often contained Y181C in combination with one or more minor etravirine RAMs; however, phenotypic and clinical correlates for most of these higher order combinations have not been published.

    View details for DOI 10.1093/jac/dkq140

    View details for Web of Science ID 000279926500027

    View details for PubMedID 20462946

    View details for PubMedCentralID PMC2882873

  • Localized Plasticity in the Streamlined Genomes of Vinyl Chloride Respiring Dehalococcoides PLOS GENETICS McMurdie, P. J., Behrens, S. F., Mueller, J. A., Goeke, J., Ritalahti, K. M., Wagner, R., Goltsman, E., Lapidus, A., Holmes, S., Loeffler, F. E., Spormann, A. M. 2009; 5 (11)


    Vinyl chloride (VC) is a human carcinogen and widespread priority pollutant. Here we report the first, to our knowledge, complete genome sequences of microorganisms able to respire VC, Dehalococcoides sp. strains VS and BAV1. Notably, the respective VC reductase encoding genes, vcrAB and bvcAB, were found embedded in distinct genomic islands (GEIs) with different predicted integration sites, suggesting that these genes were acquired horizontally and independently by distinct mechanisms. A comparative analysis that included two previously sequenced Dehalococcoides genomes revealed a contextually conserved core that is interrupted by two high plasticity regions (HPRs) near the Ori. These HPRs contain the majority of GEIs and strain-specific genes identified in the four Dehalococcoides genomes, an elevated number of repeated elements including insertion sequences (IS), as well as 91 of 96 rdhAB, genes that putatively encode terminal reductases in organohalide respiration. Only three core rdhA orthologous groups were identified, and only one of these groups is supported by synteny. The low number of core rdhAB, contrasted with the high rdhAB numbers per genome (up to 36 in strain VS), as well as their colocalization with GEIs and other signatures for horizontal transfer, suggests that niche adaptation via organohalide respiration is a fundamental ecological strategy in Dehalococccoides. This adaptation has been exacted through multiple mechanisms of recombination that are mainly confined within HPRs of an otherwise remarkably stable, syntenic, streamlined genome among the smallest of any free-living microorganism.

    View details for DOI 10.1371/journal.pgen.1000714

    View details for Web of Science ID 000272419500010

    View details for PubMedID 19893622

    View details for PubMedCentralID PMC2764846

  • Nonpolymorphic Human Immunodeficiency Virus Type 1 Protease and Reverse Transcriptase Treatment-Selected Mutations ANTIMICROBIAL AGENTS AND CHEMOTHERAPY Shahriar, R., Rhee, S., Liu, T. F., Fessel, W. J., Scarsella, A., Towner, W., Holmes, S. P., Zolopa, A. R., Shafer, R. W. 2009; 53 (11): 4869-4878


    The spectrum of human immunodeficiency virus type 1 (HIV-1) protease and reverse transcriptase (RT) mutations selected by antiretroviral (ARV) drugs requires ongoing reassessment as ARV treatment patterns evolve and increasing numbers of protease and RT sequences of different viral subtypes are published. Accordingly, we compared the prevalences of protease and RT mutations in HIV-1 group M sequences from individuals with and without a history of previous treatment with protease inhibitors (PIs) or RT inhibitors (RTIs). Mutations in protease sequences from 26,888 individuals and in RT sequences from 25,695 individuals were classified according to whether they were nonpolymorphic in untreated individuals and whether their prevalence increased fivefold with ARV therapy. This analysis showed that 88 PI-selected and 122 RTI-selected nonpolymorphic mutations had a prevalence that was fivefold higher in individuals receiving ARVs than in ARV-naïve individuals. This was an increase of 47% and 77%, respectively, compared with the 60 PI- and 69 RTI-selected mutations identified in a similar analysis that we published in 2005 using subtype B sequences obtained from one-fourth as many individuals. In conclusion, many nonpolymorphic mutations in protease and RT are under ARV selection pressure. The spectrum of treatment-selected mutations is changing as data for more individuals are collected, treatment exposures change, and the number of available sequences from non-subtype B viruses increases.

    View details for DOI 10.1128/AAC.00592-09

    View details for Web of Science ID 000270881200040

    View details for PubMedID 19721070

    View details for PubMedCentralID PMC2772298

  • Impaired interferon signaling is a common immune defect in human cancer PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Critchley-Thorne, R. J., Simons, D. L., Yan, N., Miyahira, A. K., Dirbas, F. M., Johnson, D. L., Swetter, S. M., Carlson, R. W., Fisher, G. A., Koong, A., Holmes, S., Lee, P. P. 2009; 106 (22): 9010-9015


    Immune dysfunction develops in patients with many cancer types and may contribute to tumor progression and failure of immunotherapy. Mechanisms underlying cancer-associated immune dysfunction are not fully understood. Efficient IFN signaling is critical to lymphocyte function; animals rendered deficient in IFN signaling develop cancer at higher rates. We hypothesized that altered IFN signaling may be a key mechanism of immune dysfunction common to cancer. To address this, we assessed the functional responses to IFN in peripheral blood lymphocytes from patients with 3 major cancers: breast cancer, melanoma, and gastrointestinal cancer. Type-I IFN (IFN-alpha)-induced signaling was reduced in T cells and B cells from all 3 cancer-patient groups compared to healthy controls. Type-II IFN (IFN-gamma)-induced signaling was reduced in B cells from all 3 cancer patient groups, but not in T cells or natural killer cells. Impaired-IFN signaling was equally evident in stage II, III, and IV breast cancer patients, and downstream functional defects in T cell activation were identified. Taken together, these findings indicate that defects in lymphocyte IFN signaling arise in patients with breast cancer, melanoma, and gastrointestinal cancer, and these defects may represent a common cancer-associated mechanism of immune dysfunction.

    View details for DOI 10.1073/pnas.0901329106

    View details for PubMedID 19451644

  • An Interactive Java Statistical Image Segmentation System: GemIdent JOURNAL OF STATISTICAL SOFTWARE Holmes, S., Kapelner, A., Lee, P. P. 2009; 30 (10): 1-20
  • How site fidelity leads to individual differences in the foraging activity of harvester ants BEHAVIORAL ECOLOGY Beverly, B. D., McLendon, H., Nacu, S., Holmes, S., Gordon, D. M. 2009; 20 (3): 633-638
  • Ultra-Deep Pyrosequencing of Hepatitis B Virus Quasispecies from Nucleoside and Nucleotide Reverse-Transcriptase Inhibitor (NRTI)-Treated Patients and NRTI-Naive Patients 15th Conference on Retroviruses and Opportunistic Infections Margeridon-Thermet, S., Shulman, N. S., Ahmed, A., Shahriar, R., Liu, T., Wang, C., Holmes, S. P., Babrzadeh, F., Gharizadeh, B., Hanczaruk, B., Simen, B. B., Egholm, M., Shafer, R. W. OXFORD UNIV PRESS INC. 2009: 1275–85


    The dynamics of emerging nucleoside and nucleotide reverse-transcriptase inhibitor (NRTI) resistance in hepatitis B virus (HBV) are not well understood because standard dideoxynucleotide direct polymerase chain reaction (PCR) sequencing assays detect drug-resistance mutations only after they have become dominant. To obtain insight into NRTI resistance, we used a new sequencing technology to characterize the spectrum of low-prevalence NRTI-resistance mutations in HBV obtained from 20 plasma samples from 11 NRTI-treated patients and 17 plasma samples from 17 NRTI-naive patients, by using standard direct PCR sequencing and ultra-deep pyrosequencing (UDPS). UDPS detected drug-resistance mutations that were not detected by PCR in 10 samples from 5 NRTI-treated patients, including the lamivudine-resistance mutation V173L (in 5 samples), the entecavir-resistance mutations T184S (in 2 samples) and S202G (in 1 sample), the adefovir-resistance mutation N236T (in 1 sample), and the lamivudine and adefovir-resistance mutations V173L, L180M, A181T, and M204V (in 1 sample). G-to-A hypermutation mediated by the apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like family of cytidine deaminases was estimated to be present in 0.6% of reverse-transcriptase genes. Genotype A coinfection was detected by UDPS in each of 3 patients in whom genotype G virus was detected by direct PCR sequencing. UDPS detected low-prevalence HBV variants with NRTI-resistance mutations, G-to-A hypermutation, and low-level dual genotype infection with a sensitivity not previously possible.

    View details for DOI 10.1086/597808

    View details for Web of Science ID 000265035500007

    View details for PubMedID 19301976

    View details for PubMedCentralID PMC3353721

  • Minority Human Immunodeficiency Virus Type 1 Variants in Antiretroviral-Naive Persons with Reverse Transcriptase Codon 215 Revertant Mutations JOURNAL OF VIROLOGY Mitsuya, Y., Varghese, V., Wang, C., Liu, T. F., Holmes, S. P., Jayakumar, P., Gharizadeh, B., Ronaghi, M., Klein, D., Fessel, W. J., Shafer, R. W. 2008; 82 (21): 10747-10755


    T215 revertant mutations such as T215C/D/E/S that evolve from the nucleoside reverse transcriptase (RT) inhibitor mutations T215Y/F have been found in about 3% of human immunodeficiency virus type 1 (HIV-1) isolates from newly diagnosed HIV-1-infected persons. We used a newly developed sequencing method-ultradeep pyrosequencing (UDPS; 454 Life Sciences)--to determine the frequency with which T215Y/F or other RT inhibitor resistance mutations could be detected as minority variants in samples from untreated persons that contain T215 revertants ("revertant" samples) compared with samples from untreated persons that lack such revertants ("control" samples). Among the 22 revertant and 29 control samples, UDPS detected a mean of 3.8 and 4.8 additional RT amino acid mutations, respectively. In 6 of 22 (27%) revertant samples and in 4 of 29 control samples (14%; P = 0.4), UDPS detected one or more RT inhibitor resistance mutations. T215Y or T215F was not detected in any of the revertant or control samples; however, 4 of 22 revertant samples had one or more T215 revertants that were detected by UDPS but not by direct PCR sequencing. The failure to detect viruses with T215Y/F in the 22 revertant samples in this study may result from the overwhelming replacement of transmitted T215Y variants by the more fit T215 revertants or from the primary transmission of a T215 revertant in a subset of persons with T215 revertants.

    View details for DOI 10.1128/JVI.01827-07

    View details for Web of Science ID 000260109600041

    View details for PubMedID 18715933

    View details for PubMedCentralID PMC2573178


    View details for DOI 10.1214/08-AOAS165

    View details for Web of Science ID 000261057900001

  • Natural variation of HIV-1 group M integrase: Implications for a new class of antiretroviral inhibitors RETROVIROLOGY Rhee, S., Liu, T. F., Kiuchi, M., Zioni, R., Gifford, R. J., Holmes, S. P., Shafer, R. W. 2008; 5


    HIV-1 integrase is the third enzymatic target of antiretroviral (ARV) therapy. However, few data have been published on the distribution of naturally occurring amino acid variation in this enzyme. We therefore characterized the distribution of integrase variants among more than 1,800 published group M HIV-1 isolates from more than 1,500 integrase inhibitor (INI)-naïve individuals. Polymorphism rates equal or above 0.5% were found for 34% of the central core domain positions, 42% of the C-terminal domain positions, and 50% of the N-terminal domain positions. Among 727 ARV-naïve individuals in whom the complete pol gene was sequenced, integrase displayed significantly decreased inter- and intra-subtype diversity and a lower Shannon's entropy than protease or RT. All primary INI-resistance mutations with the exception of E157Q--which was present in 1.1% of sequences--were nonpolymorphic. Several accessory INI-resistance mutations including L74M, T97A, V151I, G163R, and S230N were also polymorphic with polymorphism rates ranging between 0.5% to 2.0%.

    View details for DOI 10.1186/1742-4690-5-74

    View details for Web of Science ID 000259485200001

    View details for PubMedID 18687142

    View details for PubMedCentralID PMC2546438

  • Genomic interrogation of ancestral Mycobacterium tuberculosis from south India 8th International Congress on Molecular Epidemiology and Evolutionary Genetics of Infectious Diseases Narayanan, S., Gagneux, S., Hari, L., Tsolaki, A. G., Rajasekhar, S., Narayanan, P. R., Small, P. M., Holmes, S., DeRiemer, K. ELSEVIER SCIENCE BV. 2008: 474–83


    Mycobacterium tuberculosis is a very important global pathogen. One quarter of the world's TB cases occur in India. The tuberculosis strains isolated from south Indian patients exhibit certain phenotypic characteristics like low virulence in guinea-pigs, resistance to isoniazid, thiophene-2-carboxylic acid hydrazide (TCH) and para-amino salicylic acid (PAS), and enhanced susceptibility to H2O2. Besides this, a large percentage of the isolates harbor only a single copy of IS 6110 which makes these strains distinct. Hence, we have studied the genotypic characteristics of these strains by using advanced techniques like Deletion Micro array, deletion PCR, allelic discrimination RT-PCR using several lineage specific markers and KatG G1388T (non-synonymous) polymorphism along with spoligotyping. The analysis of 1215 tuberculosis patient isolates from south India revealed that 85.2% belonged to the ancestral lineage of M. tuberculosis. Comparative whole-genome hybridization identified six new genomic regions within this lineage that were variably deleted.

    View details for DOI 10.1016/j.meegid.2007.09.007

    View details for Web of Science ID 000257001400012

    View details for PubMedID 18024233

  • Threshold Graph Limits and Random Threshold Graphs. Internet mathematics Diaconis, P., Holmes, S., Janson, S. 2008; 5 (3): 267-320


    We study the limit theory of large threshold graphs and apply this to a variety of models for random threshold graphs. The results give a nice set of examples for the emerging theory of graph limits.

    View details for PubMedID 20811581

  • The short-term regulation of foraging in harvester ants BEHAVIORAL ECOLOGY Gordon, D. M., Holmes, S., Nacu, S. 2008; 19 (1): 217-222
  • Dynamical bias in the coin toss SIAM REVIEW Diaconis, P., Holmes, S., Montgomery, R. 2007; 49 (2): 211-235
  • HIV-1 subtype B protease and reverse transcriptase amino acid covariation PLOS COMPUTATIONAL BIOLOGY Rhee, S., Liu, T. F., Holmes, S. P., Shafer, R. W. 2007; 3 (5): 836-843


    Despite the high degree of HIV-1 protease and reverse transcriptase (RT) mutation in the setting of antiretroviral therapy, the spectrum of possible virus variants appears to be limited by patterns of amino acid covariation. We analyzed patterns of amino acid covariation in protease and RT sequences from more than 7,000 persons infected with HIV-1 subtype B viruses obtained from the Stanford HIV Drug Resistance Database ( In addition, we examined the relationship between conditional probabilities associated with a pair of mutations and the order in which those mutations developed in viruses for which longitudinal sequence data were available. Patterns of RT covariation were dominated by the distinct clustering of Type I and Type II thymidine analog mutations and the Q151M-associated mutations. Patterns of protease covariation were dominated by the clustering of nelfinavir-associated mutations (D30N and N88D), two main groups of protease inhibitor (PI)-resistance mutations associated either with V82A or L90M, and a tight cluster of mutations associated with decreased susceptibility to amprenavir and the most recently approved PI darunavir. Different patterns of covariation were frequently observed for different mutations at the same position including the RT mutations T69D versus T69N, L74V versus L74I, V75I versus V75M, T215F versus T215Y, and K219Q/E versus K219N/R, and the protease mutations M46I versus M46L, I54V versus I54M/L, and N88D versus N88S. Sequence data from persons with correlated mutations in whom earlier sequences were available confirmed that the conditional probabilities associated with correlated mutation pairs could be used to predict the order in which the mutations were likely to have developed. Whereas accessory nucleoside RT inhibitor-resistance mutations nearly always follow primary nucleoside RT inhibitor-resistance mutations, accessory PI-resistance mutations often preceded primary PI-resistance mutations.

    View details for DOI 10.1371/journal.pcbi.0030087

    View details for Web of Science ID 000249105100007

    View details for PubMedID 17500586

    View details for PubMedCentralID PMC1866358

  • Down-regulation of the interferon signaling pathway in T lymphocytes from patients with metastatic melanoma PLOS MEDICINE Critchley-Thorne, R. J., Yan, N., Nacu, S., Weber, J., Holmes, S. P., Lee, P. P. 2007; 4 (5): 897-911


    Dysfunction of the immune system has been documented in many types of cancers. The precise nature and molecular basis of immune dysfunction in the cancer state are not well defined.To gain insights into the molecular mechanisms of immune dysfunction in cancer, gene expression profiles of pure sorted peripheral blood lymphocytes from 12 patients with melanoma were compared to 12 healthy controls. Of 25 significantly altered genes in T cells and B cells from melanoma patients, 17 are interferon (IFN)-stimulated genes. These microarray findings were further confirmed by quantitative PCR and functional responses to IFNs. The median percentage of lymphocytes that phosphorylate STAT1 in response to interferon-alpha was significantly reduced (Delta = 16.8%; 95% confidence interval, 0.98% to 33.35%) in melanoma patients (n = 9) compared to healthy controls (n = 9) in Phosflow analysis. The Phosflow results also identified two subgroups of patients with melanoma: IFN-responsive (33%) and low-IFN-response (66%). The defect in IFN signaling in the melanoma patient group as a whole was partially overcome at the level of expression of IFN-stimulated genes by prolonged stimulation with the high concentration of IFN-alpha that is achievable only in IFN therapy used in melanoma. The lowest responders to IFN-alpha in the Phosflow assay also showed the lowest gene expression in response to IFN-alpha. Finally, T cells from low-IFN-response patients exhibited functional abnormalities, including decreased expression of activation markers CD69, CD25, and CD71; TH1 cytokines interleukin-2, IFN-gamma, and tumor necrosis factor alpha, and reduced survival following stimulation with anti-CD3/CD28 antibodies compared to controls.Defects in interferon signaling represent novel, dominant mechanisms of immune dysfunction in cancer. These findings may be used to design therapies to counteract immune dysfunction in melanoma and to improve cancer immunotherapy.

    View details for DOI 10.1371/journal.pmed.0040176

    View details for Web of Science ID 000246889700018

    View details for PubMedID 17488182

    View details for PubMedCentralID PMC1865558

  • Unusual codon bias in vinyl chloride reductase genes of Dehalococcoides species APPLIED AND ENVIRONMENTAL MICROBIOLOGY McMurdie, P. J., Behrens, S. F., Holmes, S., Spormann, A. M. 2007; 73 (8): 2744-2747


    Vinyl chloride reductases (VC-RDase) are the key enzymes for complete microbial reductive dehalogenation of chloroethenes, including the groundwater pollutants tetrachloroethene and trichloroethene. Analysis of the codon usage of the VC-RDase genes vcrA and bvcA showed that these genes are highly unusual and are characterized by a low G+C fraction at the third position. The third position of codons in VC-RDase genes is biased toward the nucleotide T, even though available Dehalococcoides genome sequences indicate the absence of any tRNAs matching codons that end in T. The comparatively high level of abnormality in the codon usage of VC-RDase genes suggests an evolutionary history that is different from that of most other Dehalococcoides genes.

    View details for DOI 10.1128/AEM.02768-06

    View details for Web of Science ID 000246542400039

    View details for PubMedID 17308190

    View details for PubMedCentralID PMC1855607

  • Gene expression network analysis and applications to immunology BIOINFORMATICS Nacu, S., Critchley-Thorne, R., Lee, P., Holmes, S. 2007; 23 (7): 850-858


    We address the problem of using expression data and prior biological knowledge to identify differentially expressed pathways or groups of genes. Following an idea of Ideker et al. (2002), we construct a gene interaction network and search for high-scoring subnetworks. We make several improvements in terms of scoring functions and algorithms, resulting in higher speed and accuracy and easier biological interpretation. We also assign significance levels to our results, adjusted for multiple testing. Our methods are successfully applied to three human microarray data sets, related to cancer and the immune system, retrieving several known and potential pathways. The method, denoted by the acronym GXNA (Gene eXpression Network Analysis) is implemented in software that is publicly available and can be used on virtually any microarray data set.The source code and executable for the software, as well as certain supplemental materials, can be downloaded from

    View details for DOI 10.1093/bioinformatics/btm019

    View details for Web of Science ID 000246120400010

    View details for PubMedID 17267429

  • An interactive statistical image segmentation and visualization system 4th International Conference on Medical Information Visualisation Kapelner, A., Lee, P. R., Holmes, S. IEEE COMPUTER SOC. 2007: 81–86
  • Differences in the temporal emergence of primary and secondary NRTI and PI drug-resistance mutations (DRMs) Rhee, S. Y., Liu, T. F., Holmes, S. P., Shafe, R. W. INT MEDICAL PRESS LTD. 2007: S67
  • Differences in the temporal emergence of primary and secondary NRTI and PI drug-resistance mutations (DRMs) 16th International HIV Drug Resistance Workshop Rhee, S. Y., Liu, T. F., Holmes, S. P., Shafer, R. W. INT MEDICAL PRESS LTD. 2007: S67–S67
  • Forager activation and food availability in harvester ants ANIMAL BEHAVIOUR Schafer, R. J., Holmes, S., Gordon, D. M. 2006; 71: 815-822
  • Constraints on Yukawa-type deviations from Newtonian gravity at 20 microns PHYSICAL REVIEW D Smullin, S. J., Geraci, A. A., Weld, D. M., Chiaverini, J., Holmes, S., Kapitulnik, A. 2005; 72 (12)
  • Profile of immune cells in axillary lymph nodes predicts disease-free survival in breast cancer PLOS MEDICINE Kohrt, H. E., Nouri, N., Nowels, K., Johnson, D., Holmes, S., Lee, P. P. 2005; 2 (9): 904-919


    While lymph node metastasis is among the strongest predictors of disease-free and overall survival for patients with breast cancer, the immunological nature of tumor-draining lymph nodes is often ignored, and may provide additional prognostic information on clinical outcome.We performed immunohistochemical analysis of 47 sentinel and 104 axillary (nonsentinel) nodes from 77 breast cancer patients with 5 y of follow-up to determine if alterations in CD4, CD8, and CD1a cell populations predict nodal metastasis or disease-free survival. Sentinel and axillary node CD4 and CD8 T cells were decreased in breast cancer patients compared to control nodes. CD1a dendritic cells were also diminished in sentinel and tumor-involved axillary nodes, but increased in tumor-free axillary nodes. Axillary node, but not sentinel node, CD4 T cell and dendritic cell populations were highly correlated with disease-free survival, independent of axillary metastasis. Immune profiling of ALN from a test set of 48 patients, applying CD4 T cell and CD1a dendritic cell population thresholds of CD4 > or = 7.0% and CD1a > or = 0.6%, determined from analysis of a learning set of 29 patients, provided significant risk stratification into favorable and unfavorable prognostic groups superior to clinicopathologic characteristics including tumor size, extent or size of nodal metastasis (CD4, p < 0.001 and CD1a, p < 0.001). Moreover, axillary node CD4 T cell and CD1a dendritic cell populations allowed more significant stratification of disease-free survival of patients with T1 (primary tumor size 2 cm or less) and T2 (5 cm or larger) tumors than all other patient characteristics. Finally, sentinel node immune profiles correlated primarily with the presence of infiltrating tumor cells, while axillary node immune profiles appeared largely independent of nodal metastases, raising the possibility that, within axillary lymph nodes, immune profile changes and nodal metastases represent independent processes.These findings demonstrate that the immune profile of tumor-draining lymph nodes is of novel biologic and clinical importance for patients with early stage breast cancer.

    View details for DOI 10.1371/journal.pmed.0020284

    View details for Web of Science ID 000232433600019

    View details for PubMedID 16124834

    View details for PubMedCentralID PMC1198041

  • Rapid assessment of recognition efficiency and functional capacity of antigen-specific T-cell responses JOURNAL OF IMMUNOTHERAPY Kohrt, H. E., Shu, C. T., Stuge, T. B., Holmes, S. P., Weber, J., Lee, P. P. 2005; 28 (4): 297-305


    It is increasingly recognized that cells within an antigen-specific CD8 T-cell population may be diverse in recognition efficiency for target, which may significantly affect the overall efficacy of the response in clinical settings such as viral infections and cancer. CD8 T cells with seemingly identical antigen specificity, particularly those elicited by cancer vaccines, may be heterogeneous for sensitivity and recognition efficiency for the cognate peptide and functional state in vivo. Analysis of individual T-cell clones derived from an antigen-specific T-cell population would provide an accurate assessment of the overall response; however, this is time- and labor-intensive, preventing rapid and routine assessment of patient samples from clinical trials. By stimulating antigen-specific T cells that otherwise appear homogeneous on tetramer staining with graded amounts of cognate peptides, the authors show that individual cells downmodulate surface T-cell receptors (TCR) and thus lose tetramer reactivity with variable dynamics within the T-cell population. The dynamics of TCR downregulation represent an accurate assessment of an individual cell's antigen sensitivity, recognition efficiency, and relative functional state within an antigen-specific population and have direct correlation to killing capacity by chromium release as well as degranulation by CD107 mobilization. Furthermore, despite correlation of average T-cell function by all three techniques, TCR downregulation uncovered heterogeneity in T-cell responses after vaccination among patient samples directly ex vivo. When examined using this novel technique, antigen-specific T cells elicited by vaccination with heteroclitic peptides exhibited significantly different recognition efficiencies for the heteroclitic versus native peptides, translating into differences in functional responses. With advancing cancer vaccine trials, the capacity to detect and functionally characterize antigen-specific T-cell responses in detail is critical. Techniques, as presented here, that rapidly assess the overall antigen sensitivity, recognition efficiency, and functional status of patients' T-cell responses will guide future vaccine trials and immunotherapies.

    View details for Web of Science ID 000230219200004

    View details for PubMedID 16000947

  • Memory T cells have gene expression patterns intermediate between naive and effector PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Holmes, S., He, M., Xu, T., Lee, P. P. 2005; 102 (15): 5519-5523


    The biological basis underlying differentiation of naive (NAI) T cells into effector (EFFE) and memory (MEM) cells is incompletely understood. Furthermore, whether NAI T cells serially differentiate into EFFE and then MEM cells (linear differentiation) or whether they concurrently differentiate into either EFFE or MEM cells (parallel differentiation) remains unresolved. We isolated NAI, EFFE, and MEM CD8(+) T cell subsets from human peripheral blood and analyzed their gene expression by using microarrays. We identified 156 genes that strongly differentiate NAI, EFFE, and MEM CD8(+) T cells; these genes provide previously unrecognized markers to help identify each cell type. Using several statistical approaches to analyze and group the data (standard heat-map and hierarchical clustering, a unique circular representation, multivariate analyses based on principal components, and a clustering method based on phylogenetic parsimony analysis), we assessed the lineage relationships between these subsets and showed that MEM cells have gene expression patterns intermediate between NAI and EFFE T cells. Our analysis suggests a common differentiation pathway to an intermediate state followed by a split into EFFE or MEM cells, hence supporting the parallel differentiation model. As such, conditions under which NAI T cells are activated may determine the magnitude of both EFFE and MEM cells, which arise subsequently. A better understanding of these conditions may be very useful in the design of future vaccine strategies to maximize MEM cell generation.

    View details for DOI 10.1073/pnas.0501437102

    View details for Web of Science ID 000228376600042

    View details for PubMedID 15809420

    View details for PubMedCentralID PMC556264

  • Sequential Monte Carlo methods for statistical analysis of tables JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Chen, Y. G., Diaconis, P., Holmes, S. R., Liu, J. S. 2005; 100 (469): 109-120
  • Immune profile in tumor-draining nodes predicts disease-free survival in breast cancer Kohrt, H. E., Nowels, K., Holmes, S., Lee, P. P., Johnson, D. SPRINGER. 2005: S17
  • Error distribution for gene expression data STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY Purdom, E., Holmes, S. P. 2005; 4


    We present a new instance of Laplace's second Law of Errors and show how it can be used in the analysis of data from microarray experiments. This error distribution is shown to fit microarray expression data much better than a normal distribution. The use of this distribution in a parametric bootstrap leads to more powerful tests as we show that the t-test is conservative in this setting. We propose a biological explanations for this distribution based on the Pareto distribution of the variables used to compute the log ratios.

    View details for Web of Science ID 000238478100022

    View details for PubMedID 16646833

  • Gene expression diversity among Mycobacterium tuberculosis clinical isolates MICROBIOLOGY-SGM Gao, Q., Kripke, K. E., Saldanha, A. J., Yan, W. H., Holmes, S., Small, P. M. 2005; 151: 5-14


    Intraspecies genetic diversity has been demonstrated to be important in the pathogenesis and epidemiology of several pathogens, such as HIV, influenza, Helicobacter and Salmonella. It is also important to consider strain-to-strain variation when identifying drug targets and vaccine antigens and developing tools for molecular diagnostics. Here, the authors present a description of the variability in gene expression patterns among ten clinical isolates of Mycobacterium tuberculosis, plus the laboratory strains H37Rv and H37Ra, growing in liquid culture. They identified 527 genes (15 % of those tested) that are variably expressed among the isolates studied. The remaining genes were divided into three categories based on their expression levels: unexpressed (38 %), low to undetectable expression (31 %) and consistently expressed (16 %). The expression categories were compared with functional categories and three biologically interesting gene lists: genes that are deleted among clinical isolates, T-cell antigens and essential genes. There were significant associations between expression variability and the classification of genes as T-cell antigens, involved in lipid metabolism, PE/PPE, insertion sequences and phages, and deleted among clinical isolates. This survey of mRNA expression among clinical isolates of M. tuberculosis demonstrates that genes with important functions can vary in their expression levels between strains grown under identical conditions.

    View details for Web of Science ID 000226352800002

    View details for PubMedID 15632420

  • Heterogeneity within antigen-specific T cell responses revealed by differential dynamics of TCR downregulation. 46th Annual Meeting of the American-Society-of-Hematology Kohrt, H. E., Shu, C. T., Holmes, S. P., Weber, J., Lee, P. P. AMER SOC HEMATOLOGY. 2004: 50B–50B
  • Diversity and recognition efficiency of T cell responses to cancer PLOS MEDICINE Stuge, T. B., Holmes, S. P., Saharan, S., Tuettenberg, A., Roederer, M., Weber, J. S., Lee, P. P. 2004; 1 (2): 149-160


    Melanoma patients vaccinated with tumor-associated antigens frequently develop measurable peptide-specific CD8+ T cell responses; however, such responses often do not confer clinical benefit. Understanding why vaccine-elicited responses are beneficial in some patients but not in others will be important to improve targeted cancer immunotherapies.We analyzed peptide-specific CD8+ T cell responses in detail, by generating and characterizing over 200 cytotoxic T lymphocyte clones derived from T cell responses to heteroclitic peptide vaccination, and compared these responses to endogenous anti-tumor T cell responses elicited naturally (a heteroclitic peptide is a modification of a native peptide sequence involving substitution of an amino acid at an anchor residue to enhance the immunogenicity of the peptide). We found that vaccine-elicited T cells are diverse in T cell receptor variable chain beta expression and exhibit a different recognition profile for heteroclitic versus native peptide. In particular, vaccine-elicited T cells respond to native peptide with predominantly low recognition efficiency--a measure of the sensitivity of a T cell to different cognate peptide concentrations for stimulation--and, as a result, are inefficient in tumor lysis. In contrast, endogenous tumor-associated-antigen-specific T cells show a predominantly high recognition efficiency for native peptide and efficiently lyse tumor targets.These results suggest that factors that shape the peptide-specific T cell repertoire after vaccination may be different from those that affect the endogenous response. Furthermore, our findings suggest that current heteroclitic peptide vaccination protocols drive expansion of peptide-specific T cells with a diverse range of recognition efficiencies, a significant proportion of which are unable to respond to melanoma cells. Therefore, it is critical that the recognition efficiency of vaccine-elicited T cells be measured, with the goal of advancing those modalities that elicit T cells with the greatest potential of tumor reactivity.

    View details for DOI 10.1371/journal.pmed.0010028

    View details for Web of Science ID 000227378800015

    View details for PubMedID 15578105

    View details for PubMedCentralID PMC529423

  • Microarray analysis reveals differences in gene expression of circulating CD8+ T cells in melanoma patients and healthy donors CANCER RESEARCH Xu, T., Shu, C. T., Purdom, E., Dang, D., Ilsley, D., Guo, Y., Weber, J., Holmes, S. P., Lee, P. P. 2004; 64 (10): 3661-3667


    Circulating T cells from many cancer patients are known to be dysfunctional and undergo spontaneous apoptosis. We used microarray technology to determine whether gene expression differences exist in T cells from melanoma patients versus healthy subjects, which may underlie these abnormalities. To maximize the resolution of our data, we sort purified CD8(+) subsets and amplified the extracted RNA for microarray analysis. These analyses show subtle but statistically significant expression differences for 10 genes in T cells from melanoma patients versus healthy controls, which were additionally confirmed by quantitative real-time PCR analysis. Whereas none of these genes are members of the classical apoptosis pathways, several may be linked to apoptosis. To additionally investigate the significance of these 10 genes, we combined them into a classifier and found that they provide a much better discrimination between melanoma and healthy T cells as compared with a classifier built uniquely with classical apoptosis-related genes. These results suggest the possible engagement of an alternative apoptosis pathway in circulating T cells from cancer patients.

    View details for Web of Science ID 000221419100043

    View details for PubMedID 15150126

  • Bioinformatics and management science: Some common tools and techniques OPERATIONS RESEARCH Abbas, A. E., Holmes, S. R. 2004; 52 (2): 165-190
  • An in vitro human cell-based assay to rank the relative immunogenicity of proteins TOXICOLOGICAL SCIENCES Stickler, M., Rochanayon, N., Razo, O. J., Mucha, J., Gebel, W., Faravashi, N., Chin, R., Holmes, S., Harding, F. A. 2004; 77 (2): 280-289


    A method to rank proteins based on their relative immunogenicity has been devised. A statistical analysis of peptide-specific responses in large human donor pools provides a structure index value metric that ranked four industrial enzymes in the order determined by both mouse and guinea pig exposure models. The ranking method also compared favorably with human sensitization rates measured in occupationally exposed workers. Structure index values for other proteins known to cause immune responses in humans were also determined and found to be higher than the value determined for human beta2-microglobulin. Using values from known immunogenic and putative nonimmunogenic proteins, a cut-off value was established. The structure index value calculation provides a comparative method to predict subsequent immunogenicity on a human population basis without the need to use animal models. Information provided by this assay can be used in the early development of protein therapies and other protein-based applications to select or create reduced immunogenicity variants.

    View details for DOI 10.1093/toxsci/kfh021

    View details for Web of Science ID 000188987000013

    View details for PubMedID 14691215

  • Stein's method: expository lectures and applications Institute of Mathematical Statistics Lecture Notes—Monograph Series. edited by Diaconis, P., Holmes, S. Inst. Math. Statist. . 2004
  • Probability by surprise Mathematical Adventures for Students and Amateurs Holmes, S. edited by Shubin, T., Hayes, D. MAA. 2004: 135–153
  • Human population-based identification of CD4(+) T-cell peptide epitope determinants JOURNAL OF IMMUNOLOGICAL METHODS Stickler, M., Chin, R., Faravashi, N., Gebel, W., Razo, O. J., Rochanayon, N., Power, S., Valdes, A. M., Holmes, S., Harding, F. A. 2003; 281 (1-2): 95-108


    A human cell-based method to identify functional CD4(+) T-cell epitopes in any protein has been developed. Proteins are tested as synthetic 15-mer peptides offset by three amino acids. Percent responses within a large donor population are tabulated for each peptide in the set. Peptide epitope regions are designated by difference in response frequency from the overall background response rate for the compiled dataset. Epitope peptide responses are reproducible, with a median coefficient of variance of 21% when tested on multiple random-donor sets. The overall average response rate within the dataset increases with increasing putative human population antigenic exposure to a given protein. The background rate was high for HPV16 E6, and was low for human-derived cytokine proteins. The assay identified recall epitope regions within the donor population for the protein staphylokinase. For an industrial protease with minimal presumed population exposure, immunodominant epitope peptides were identified that were found to bind promiscuously to many HLA class II molecules in vitro. The peptide epitope regions identified in presumably unexposed donors represent a subset of the total recall epitopes. Finally, as a negative control, the assay found no peptide epitope regions in human beta2-microglobulin. This method identifies functional CD4(+) T-cell epitopes in any protein without pre-selection for HLA class II, suggests whether a donor population is pre-exposed to a protein of interest, and does not require sensitized donors for in vitro testing.

    View details for DOI 10.1016/S0022-1759(03)00279-5

    View details for Web of Science ID 000186487500009

    View details for PubMedID 14580884

  • Bootstrapping phylogenetic trees: Theory and methods STATISTICAL SCIENCE Holmes, S. 2003; 18 (2): 241-255
  • Bradley Efron: A conversation with good friends STATISTICAL SCIENCE Holmes, S., Morris, C., Tibshirani, R. 2003; 18 (2): 268–81
  • The i-mune assay for the identification of functional CD4+T cell epitopes in any protein: population-based immunodominant primary and memory epitope regions Harding, F. A., Chin, R., Faravashi, N., Gebel, W., Razo, J., Rochanayon, N., Holmes, S., Valdes, A. M., Stickler, M. M. FEDERATION AMER SOC EXP BIOL. 2003: C308
  • Statistics for phylogenetic trees THEORETICAL POPULATION BIOLOGY Holmes, S. 2003; 63 (1): 17-32


    This paper poses the problem of estimating and validating phylogenetic trees in statistical terms. The problem is hard enough to warrant several tacks: we reason by analogy to rounding real numbers, and dealing with ranking data. These are both cases where, as in phylogeny the parameters of interest are not real numbers. Then we pose the problem in geometrical terms, using distances and measures on a natural space of trees. We do not solve the problems of inference on tree space, but suggest some coherent ways of tackling them.

    View details for Web of Science ID 000180151400002

    View details for PubMedID 12464492

  • A back migration from Asia to sub-Saharan Africa is supported by high-resolution analysis of human Y-chromosome haplotypes AMERICAN JOURNAL OF HUMAN GENETICS Cruciani, F., Santolamazza, P., Shen, P. D., Macaulay, V., Moral, P., Olckers, A., Modiano, D., Holmes, S., Destro-Bisol, G., Coia, V., Wallace, D. C., Oefner, P. J., Torroni, A., Cavalli-Sforza, L. L., Scozzari, R., Underhill, P. A. 2002; 70 (5): 1197-1214


    The variation of 77 biallelic sites located in the nonrecombining portion of the Y chromosome was examined in 608 male subjects from 22 African populations. This survey revealed a total of 37 binary haplotypes, which were combined with microsatellite polymorphism data to evaluate internal diversities and to estimate coalescence ages of the binary haplotypes. The majority of binary haplotypes showed a nonuniform distribution across the continent. Analysis of molecular variance detected a high level of interpopulation diversity (PhiST=0.342), which appears to be partially related to the geography (PhiCT=0.230). In sub-Saharan Africa, the recent spread of a set of haplotypes partially erased pre-existing diversity, but a high level of population (PhiST=0.332) and geographic (PhiCT=0.179) structuring persists. Correspondence analysis shows that three main clusters of populations can be identified: northern, eastern, and sub-Saharan Africans. Among the latter, the Khoisan, the Pygmies, and the northern Cameroonians are clearly distinct from a tight cluster formed by the Niger-Congo-speaking populations from western, central western, and southern Africa. Phylogeographic analyses suggest that a large component of the present Khoisan gene pool is eastern African in origin and that Asia was the source of a back migration to sub-Saharan Africa. Haplogroup IX Y chromosomes appear to have been involved in such a migration, the traces of which can now be observed mostly in northern Cameroon.

    View details for Web of Science ID 000175012400010

    View details for PubMedID 11910562

    View details for PubMedCentralID PMC447595

  • Geometry of the space of phylogenetic trees ADVANCES IN APPLIED MATHEMATICS Billera, L. J., Holmes, S. P., Vogtmann, K. 2001; 27 (4): 733-767
  • Statistical problems involving permutations with restricted positions Symposium on State of the Art in Probability and Statistics: Festschrift for Willem R VanZwet Diaconis, P., Graham, R., Holmes, S. P. INST MATHEMATICAL STATISTICS. 2001: 195–222
  • Analysis of a nonreversible Markov chain sampler ANNALS OF APPLIED PROBABILITY Diaconis, P., Holmes, S., Neal, R. M. 2000; 10 (3): 726-752
  • Matchings and phylogenetic trees PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Diaconis, P. W., Holmes, S. P. 1998; 95 (25): 14600-14602


    This paper presents a natural coordinate system for phylogenetic trees using a correspondence with the set of perfect matchings in the complete graph. This correspondence produces a distance between phylogenetic trees, and a way of enumerating all trees in a minimal step order. It is useful in randomized algorithms because it enables moves on the space of trees that make random optimization strategies "mix" quickly. It also promises a generalization to intermediary trees when data are not decisive as to their choice of tree, and a new way of constructing Bayesian priors on tree space.

    View details for Web of Science ID 000077436700005

    View details for PubMedID 9843935

  • The next linear collider test accelerator's rf pulse compression and transmission systems 17th Particle Accelerator Conference Tantawi, S. G., Adolphsen, C., Holmes, S., LAVINE, T., Loewen, R. J., Nantista, C., Pearson, C., Pope, R., Rifkin, J., Ruth, R. D., Vlieks, A. E. IEEE. 1998: 3192–3194
  • Are there still things to do in Bayesian statistics? Conference on Probability, Dynamics and Causality Diaconis, P., Holmes, S. KLUWER ACADEMIC PUBL. 1997: 5–18
  • Bootstrap confidence levels for phylogenetic trees (vol 93, pg 7085, 1996) PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Efron, B., Halloran, E., Holmes, S. 1996; 93 (23): 13429-13434


    Evolutionary trees are often estimated from DNA or RNA sequence data. How much confidence should we have in the estimated trees? In 1985, Felsenstein [Felsenstein, J. (1985) Evolution 39, 783-791] suggested the use of the bootstrap to answer this question. Felsenstein's method, which in concept is a straightforward application of the bootstrap, is widely used, but has been criticized as biased in the genetics literature. This paper concerns the use of the bootstrap in the tree problem. We show that Felsenstein's method is not biased, but that it can be corrected to better agree with standard ideas of confidence levels and hypothesis testing. These corrections can be made by using the more elaborate bootstrap method presented here, at the expense of considerably more computation.

    View details for Web of Science ID A1996VT05400135

    View details for PubMedID 8917608

    View details for PubMedCentralID PMC24110