Education & Certifications
BA, Hendrix College, Interdisciplinary Studies: Bioinformatics (2011)
Findings from clinical and biological studies are often not reproducible when tested in independent cohorts. Due to the testing of a large number of hypotheses and relatively small sample sizes, results from whole-genome expression studies in particular are often not reproducible. Compared to single-study analysis, gene expression meta-analysis can improve reproducibility by integrating data from multiple studies. However, there are multiple choices in designing and carrying out a meta-analysis. Yet, clear guidelines on best practices are scarce. Here, we hypothesized that studying subsets of very large meta-analyses would allow for systematic identification of best practices to improve reproducibility. We therefore constructed three very large gene expression meta-analyses from clinical samples, and then examined meta-analyses of subsets of the datasets (all combinations of datasets with up to N/2 samples and K/2 datasets) compared to a 'silver standard' of differentially expressed genes found in the entire cohort. We tested three random-effects meta-analysis models using this procedure. We showed relatively greater reproducibility with more-stringent effect size thresholds with relaxed significance thresholds; relatively lower reproducibility when imposing extraneous constraints on residual heterogeneity; and an underestimation of actual false positive rate by Benjamini-Hochberg correction. In addition, multivariate regression showed that the accuracy of a meta-analysis increased significantly with more included datasets even when controlling for sample size.
View details for DOI 10.1093/nar/gkw797
View details for PubMedID 27634930
View details for PubMedCentralID PMC5224496
A major contributor to the scientific reproducibility crisis has been that the results from homogeneous, single-center studies do not generalize to heterogeneous, real world populations. Multi-cohort gene expression analysis has helped to increase reproducibility by aggregating data from diverse populations into a single analysis. To make the multi-cohort analysis process more feasible, we have assembled an analysis pipeline which implements rigorously studied meta-analysis best practices. We have compiled and made publicly available the results of our own multi-cohort gene expression analysis of 103 diseases, spanning 615 studies and 36,915 samples, through a novel and interactive web application. As a result, we have made both the process of and the results from multi-cohort gene expression analysis more approachable for non-technical users.
View details for PubMedID 27896970
View details for PubMedCentralID PMC5167529
Life science technologies generate a deluge of data that hold the keys to unlocking the secrets of important biological functions and disease mechanisms. We present DEAP, Differential Expression Analysis for Pathways, which capitalizes on information about biological pathways to identify important regulatory patterns from differential expression data. DEAP makes significant improvements over existing approaches by including information about pathway structure and discovering the most differentially expressed portion of the pathway. On simulated data, DEAP significantly outperformed traditional methods: with high differential expression, DEAP increased power by two orders of magnitude; with very low differential expression, DEAP doubled the power. DEAP performance was illustrated on two different gene and protein expression studies. DEAP discovered fourteen important pathways related to chronic obstructive pulmonary disease and interferon treatment that existing approaches omitted. On the interferon study, DEAP guided focus towards a four protein path within the 26 protein Notch signalling pathway.
View details for DOI 10.1371/journal.pcbi.1002967
View details for Web of Science ID 000316864200038
View details for PubMedID 23516350
View details for PubMedCentralID PMC3597535
Type I interferon (IFN) signaling is a central pathogenic pathway in systemic lupus erythematosus (SLE), and therapeutics targeting type I IFN signaling are in development. Multiple proteins with overlapping functions play a role in IFN signaling, but the signaling events downstream of receptor engagement are unclear. This study was undertaken to investigate the roles of the type I and type II IFN signaling components IFN-?/?/? receptor 2 (IFNAR-2), IFN regulatory factor 9 (IRF-9), and STAT-1 in a mouse model of SLE.We used immunohistochemical staining and highly multiplexed assays to characterize pathologic changes in histology, autoantibody production, cytokine/chemokine profiles, and STAT phosphorylation in order to investigate the individual roles of IFNAR-2, IRF-9, and STAT-1 in MRL/lpr mice.We found that STAT-1(-/-) mice, but not IRF-9(-/-) or IFNAR-2(-/-) mice, developed interstitial nephritis characterized by infiltration with retinoic acid receptor-related orphan nuclear receptor ?t-positive lymphocytes, macrophages, and eosinophils. Despite pronounced interstitial kidney disease and abnormal kidney function, STAT-1(-/-) mice had decreased proteinuria, glomerulonephritis, and autoantibody production. Phosphospecific flow cytometry revealed shunting of STAT phosphorylation from STAT-1 to STAT-3/4.We describe unique contributions of STAT-1 to pathology in different kidney compartments in a mouse model, and provide potentially novel insight into tubulointerstitial nephritis, a poorly understood complication that predicts end-stage kidney disease in SLE patients.
View details for DOI 10.1002/art.39535
View details for Web of Science ID 000375551600023
View details for PubMedID 26636548
View details for PubMedCentralID PMC4848130
This case study evaluates and tracks vitality of a city (Seattle), based on a data-driven approach, using strategic, robust, and sustainable metrics. This case study was collaboratively conducted by the Downtown Seattle Association (DSA) and CDO Analytics teams. The DSA is a nonprofit organization focused on making the city of Seattle and its Downtown a healthy and vibrant place to Live, Work, Shop, and Play. DSA primarily operates through public policy advocacy, community and business development, and marketing. In 2010, the organization turned to CDO Analytics ( cdoanalytics.org ) to develop a process that can guide and strategically focus DSA efforts and resources for maximal benefit to the city of Seattle and its Downtown. CDO Analytics was asked to develop clear, easily understood, and robust metrics for a baseline evaluation of the health of the city, as well as for ongoing monitoring and comparisons of the vitality, sustainability, and growth. The DSA and CDO Analytics teams strategized on how to effectively assess and track the vitality of Seattle and its Downtown. The two teams filtered a variety of data sources, and evaluated the veracity of multiple diverse metrics. This iterative process resulted in the development of a small number of strategic, simple, reliable, and sustainable metrics across four pillars of activity: Live, Work, Shop, and Play. Data during the 5 years before 2010 were used for the development of the metrics and model and its training, and data during the 5 years from 2010 and on were used for testing and validation. This work enabled DSA to routinely track these strategic metrics, use them to monitor the vitality of Downtown Seattle, prioritize improvements, and identify new value-added programs. As a result, the four-pillar approach became an integral part of the data-driven decision-making and execution of the Seattle community's improvement activities. The approach described in this case study is actionable, robust, inexpensive, and easy to adopt and sustain. It can be applied to cities, districts, counties, regions, states, or countries, enabling cross-comparisons and improvements of vitality, sustainability, and growth.
View details for DOI 10.1089/big.2015.0043
View details for PubMedID 27441585
Individuals who suffer from schizophrenia comprise I percent of the United States population and are four times more likely to die of suicide than the general US population. Identification of at-risk individuals with schizophrenia is challenging when they do not seek treatment. Microblogging platforms allow users to share their thoughts and emotions with the world in short snippets of text. In this work, we leveraged the large corpus of Twitter posts and machine-learning methodologies to detect individuals with schizophrenia. Using features from tweets such as emoticon use, posting time of day, and dictionary terms, we trained, built, and validated several machine learning models. Our support vector machine model achieved the best performance with 92% precision and 71% recall on the held-out test set. Additionally, we built a web application that dynamically displays summary statistics between cohorts. This enables outreach to undiagnosed individuals, improved physician diagnoses, and destigmatization of schizophrenia.
View details for PubMedID 26306253
View details for PubMedCentralID PMC4525233
The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is an expanding proteomics resource to enable biological and biomedical discoveries. MOPED aggregates simple, standardized and consistently processed summaries of protein expression and metadata from proteomics (mass spectrometry) experiments from human and model organisms (mouse, worm, and yeast). The latest version of MOPED adds new estimates of protein abundance and concentration as well as relative (differential) expression data. MOPED provides a new updated query interface that allows users to explore information by organism, tissue, localization, condition, experiment, or keyword. MOPED supports the Human Proteome Project's efforts to generate chromosome- and diseases-specific proteomes by providing links from proteins to chromosome and disease information as well as many complementary resources. MOPED supports a new omics metadata checklist to harmonize data integration, analysis, and use. MOPED's development is driven by the user community, which spans 90 countries and guides future development that will transform MOPED into a multiomics resource. MOPED encourages users to submit data in a simple format. They can use the metadata checklist to generate a data publication for this submission. As a result, MOPED will provide even greater insights into complex biological processes and systems and enable deeper and more comprehensive biological and biomedical discoveries.
View details for DOI 10.1021/pr400884c
View details for Web of Science ID 000329472700012
View details for PubMedID 24350770
View details for PubMedCentralID PMC4039175
The integrative personal omics profile (iPOP) is a pioneering study that combines genomics, transcriptomics, proteomics, metabolomics and autoantibody profiles from a single individual over a 14-month period. The observation period includes two episodes of viral infection: a human rhinovirus and a respiratory syncytial virus. The profile studies give an informative snapshot into the biological functioning of an organism. We hypothesize that pathway expression levels are associated with disease status. To test this hypothesis, we use biological pathways to integrate metabolomics and proteomics iPOP data. The approach computes the pathways' differential expression levels at each time point, while taking into account the pathway structure and the longitudinal design. The resulting pathway levels show strong association with the disease status. Further, we identify temporal patterns in metabolite expression levels. The changes in metabolite expression levels also appear to be consistent with the disease status. The results of the integrative analysis suggest that changes in biological pathways may be used to predict and monitor the disease. The iPOP experimental design, data acquisition and analysis issues are discussed within the broader context of personal profiling.
View details for DOI 10.3390/metabo3030741
View details for PubMedID 24958148
View details for PubMedCentralID PMC3901289
Large numbers of mass spectrometry proteomics studies are being conducted to understand all types of biological processes. The size and complexity of proteomics data hinders efforts to easily share, integrate, query and compare the studies. The Model Organism Protein Expression Database (MOPED, htttp://moped.proteinspire.org) is a new and expanding proteomics resource that enables rapid browsing of protein expression information from publicly available studies on humans and model organisms. MOPED is designed to simplify the comparison and sharing of proteomics data for the greater research community. MOPED uniquely provides protein level expression data, meta-analysis capabilities and quantitative data from standardized analysis. Data can be queried for specific proteins, browsed based on organism, tissue, localization and condition and sorted by false discovery rate and expression. MOPED empowers users to visualize their own expression data and compare it with existing studies. Further, MOPED links to various protein and pathway databases, including GeneCards, Entrez, UniProt, KEGG and Reactome. The current version of MOPED contains over 43,000 proteins with at least one spectral match and more than 11 million high certainty spectra.
View details for DOI 10.1093/nar/gkr1177
View details for PubMedID 22139914
View details for PubMedCentralID PMC3245040
In high-throughput mass spectrometry proteomics, peptides and proteins are not simply identified as present or not present in a sample, rather the identifications are associated with differing levels of confidence. The false discovery rate (FDR) has emerged as an accepted means for measuring the confidence associated with identifications. We have developed the Systematic Protein Investigative Research Environment (SPIRE) for the purpose of integrating the best available proteomics methods. Two successful approaches to estimating the FDR for MS protein identifications are the MAYU and our current SPIRE methods. We present here a method to combine these two approaches to estimating the FDR for MS protein identifications into an integrated protein model (IPM). We illustrate the high quality performance of this IPM approach through testing on two large publicly available proteomics datasets. MAYU and SPIRE show remarkable consistency in identifying proteins in these datasets. Still, IPM results in a more robust FDR estimation approach and additional identifications, particularly among low abundance proteins. IPM is now implemented as a part of the SPIRE system.
View details for DOI 10.1016/j.jprot.2011.06.003
View details for Web of Science ID 000298710400012
View details for PubMedID 21718813
The SPIRE (Systematic Protein Investigative Research Environment) provides web-based experiment-specific mass spectrometry (MS) proteomics analysis (https://www.proteinspire.org). Its emphasis is on usability and integration of the best analytic tools. SPIRE provides an easy to use web-interface and generates results in both interactive and simple data formats. In contrast to run-based approaches, SPIRE conducts the analysis based on the experimental design. It employs novel methods to generate false discovery rates and local false discovery rates (FDR, LFDR) and integrates the best and complementary open-source search and data analysis methods. The SPIRE approach of integrating X!Tandem, OMSSA and SpectraST can produce an increase in protein IDs (52-88%) over current combinations of scoring and single search engines while also providing accurate multi-faceted error estimation. One of SPIRE's primary assets is combining the results with data on protein function, pathways and protein expression from model organisms. We demonstrate some of SPIRE's capabilities by analyzing mitochondrial proteins from the wild type and 3 mutants of C. elegans. SPIRE also connects results to publically available proteomics data through its Model Organism Protein Expression Database (MOPED). SPIRE can also provide analysis and annotation for user supplied protein ID and expression data.
View details for DOI 10.1016/j.jprot.2011.05.009
View details for Web of Science ID 000298710400013
View details for PubMedID 21609792
To address the monumental challenge of assigning function to millions of sequenced proteins, we completed the first of a kind all-versus-all sequence alignments using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences resulting in over 1 million proteins being assign to existing KOG groups and the remainder clustered into 100,000 functional groups.
View details for DOI 10.1089/omi.2011.0101
View details for Web of Science ID 000293440600010
View details for PubMedID 21809957
To gauge the current commitment to scientific research in the United States of America (US), we compared federal research funding (FRF) with the US gross domestic product (GDP) and industry research spending during the past six decades. In order to address the recent globalization of scientific research, we also focused on four key indicators of research activities: research and development (R&D) funding, total science and engineering doctoral degrees, patents, and scientific publications. We compared these indicators across three major population and economic regions: the US, the European Union (EU) and the People's Republic of China (China) over the past decade. We discovered a number of interesting trends with direct relevance for science policy. The level of US FRF has varied between 0.2% and 0.6% of the GDP during the last six decades. Since the 1960s, the US FRF contribution has fallen from twice that of industrial research funding to roughly equal. Also, in the last two decades, the portion of the US government R&D spending devoted to research has increased. Although well below the US and the EU in overall funding, the current growth rate for R&D funding in China greatly exceeds that of both. Finally, the EU currently produces more science and engineering doctoral graduates and scientific publications than the US in absolute terms, but not per capita. This study's aim is to facilitate a serious discussion of key questions by the research community and federal policy makers. In particular, our results raise two questions with respect to: a) the increasing globalization of science: "What role is the US playing now, and what role will it play in the future of international science?"; and b) the ability to produce beneficial innovations for society: "How will the US continue to foster its strengths?"
View details for DOI 10.1371/journal.pone.0012203
View details for Web of Science ID 000280968000026
View details for PubMedID 20808949
View details for PubMedCentralID PMC2922381
Large amounts of mass spectrometry (MS) proteomics data are now publicly available; however, little attention has been given to how to best combine these data and assess the error rates for protein identification. The objective of this article is to show how variation in the type and amount of data included with each study impacts coverage of the yeast proteome and estimation of the false discovery rate (FDR). Our analysis of a subset of the publicly available yeast data showed that failure to reevaluate the FDR when combining protein IDs from different experiments resulted in an underestimation of the FDR by approximately threefold. A worst-case approximation of the FDR was only slightly larger than estimating the FDR by randomized database matches. The use of a weighted model to emphasize the most informative experimental data provided an increase in the number of IDs at a 1% FDR when compared to other meta-analysis approaches. Also, using an FDR higher than 1% results in a very high rate of false discoveries for IDs above the 1% threshold. Ideally, raw MS data will be made publicly available for complete and consistent reanalysis. In the circumstance that raw data is not available, determining a combined FDR on the basis of the worst-case estimation provides a reasonable approximation of the FDR. When combining experimental results, adding additional experiments results in diminishing and in some cases negative returns on protein identifications. It may be beneficial to include only those experiments generating the most unique identifications due to solid experimental design and sensitive instrumentation.
View details for DOI 10.1089/omi.2010.0034
View details for Web of Science ID 000279046800009
View details for PubMedID 20569183
View details for PubMedCentralID PMC3133781