Education & Certifications
BA, Hendrix College, Interdisciplinary Studies: Bioinformatics (2011)
Individuals who suffer from schizophrenia comprise I percent of the United States population and are four times more likely to die of suicide than the general US population. Identification of at-risk individuals with schizophrenia is challenging when they do not seek treatment. Microblogging platforms allow users to share their thoughts and emotions with the world in short snippets of text. In this work, we leveraged the large corpus of Twitter posts and machine-learning methodologies to detect individuals with schizophrenia. Using features from tweets such as emoticon use, posting time of day, and dictionary terms, we trained, built, and validated several machine learning models. Our support vector machine model achieved the best performance with 92% precision and 71% recall on the held-out test set. Additionally, we built a web application that dynamically displays summary statistics between cohorts. This enables outreach to undiagnosed individuals, improved physician diagnoses, and destigmatization of schizophrenia.
View details for PubMedID 26306253
The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is an expanding proteomics resource to enable biological and biomedical discoveries. MOPED aggregates simple, standardized and consistently processed summaries of protein expression and metadata from proteomics (mass spectrometry) experiments from human and model organisms (mouse, worm, and yeast). The latest version of MOPED adds new estimates of protein abundance and concentration as well as relative (differential) expression data. MOPED provides a new updated query interface that allows users to explore information by organism, tissue, localization, condition, experiment, or keyword. MOPED supports the Human Proteome Project's efforts to generate chromosome- and diseases-specific proteomes by providing links from proteins to chromosome and disease information as well as many complementary resources. MOPED supports a new omics metadata checklist to harmonize data integration, analysis, and use. MOPED's development is driven by the user community, which spans 90 countries and guides future development that will transform MOPED into a multiomics resource. MOPED encourages users to submit data in a simple format. They can use the metadata checklist to generate a data publication for this submission. As a result, MOPED will provide even greater insights into complex biological processes and systems and enable deeper and more comprehensive biological and biomedical discoveries.
View details for DOI 10.1021/pr400884c
View details for Web of Science ID 000329472700012
View details for PubMedID 24350770
Life science technologies generate a deluge of data that hold the keys to unlocking the secrets of important biological functions and disease mechanisms. We present DEAP, Differential Expression Analysis for Pathways, which capitalizes on information about biological pathways to identify important regulatory patterns from differential expression data. DEAP makes significant improvements over existing approaches by including information about pathway structure and discovering the most differentially expressed portion of the pathway. On simulated data, DEAP significantly outperformed traditional methods: with high differential expression, DEAP increased power by two orders of magnitude; with very low differential expression, DEAP doubled the power. DEAP performance was illustrated on two different gene and protein expression studies. DEAP discovered fourteen important pathways related to chronic obstructive pulmonary disease and interferon treatment that existing approaches omitted. On the interferon study, DEAP guided focus towards a four protein path within the 26 protein Notch signalling pathway.
View details for DOI 10.1371/journal.pcbi.1002967
View details for Web of Science ID 000316864200038
View details for PubMedID 23516350
The integrative personal omics profile (iPOP) is a pioneering study that combines genomics, transcriptomics, proteomics, metabolomics and autoantibody profiles from a single individual over a 14-month period. The observation period includes two episodes of viral infection: a human rhinovirus and a respiratory syncytial virus. The profile studies give an informative snapshot into the biological functioning of an organism. We hypothesize that pathway expression levels are associated with disease status. To test this hypothesis, we use biological pathways to integrate metabolomics and proteomics iPOP data. The approach computes the pathways' differential expression levels at each time point, while taking into account the pathway structure and the longitudinal design. The resulting pathway levels show strong association with the disease status. Further, we identify temporal patterns in metabolite expression levels. The changes in metabolite expression levels also appear to be consistent with the disease status. The results of the integrative analysis suggest that changes in biological pathways may be used to predict and monitor the disease. The iPOP experimental design, data acquisition and analysis issues are discussed within the broader context of personal profiling.
View details for DOI 10.3390/metabo3030741
View details for PubMedID 24958148
In high-throughput mass spectrometry proteomics, peptides and proteins are not simply identified as present or not present in a sample, rather the identifications are associated with differing levels of confidence. The false discovery rate (FDR) has emerged as an accepted means for measuring the confidence associated with identifications. We have developed the Systematic Protein Investigative Research Environment (SPIRE) for the purpose of integrating the best available proteomics methods. Two successful approaches to estimating the FDR for MS protein identifications are the MAYU and our current SPIRE methods. We present here a method to combine these two approaches to estimating the FDR for MS protein identifications into an integrated protein model (IPM). We illustrate the high quality performance of this IPM approach through testing on two large publicly available proteomics datasets. MAYU and SPIRE show remarkable consistency in identifying proteins in these datasets. Still, IPM results in a more robust FDR estimation approach and additional identifications, particularly among low abundance proteins. IPM is now implemented as a part of the SPIRE system.
View details for DOI 10.1016/j.jprot.2011.06.003
View details for Web of Science ID 000298710400012
View details for PubMedID 21718813
The SPIRE (Systematic Protein Investigative Research Environment) provides web-based experiment-specific mass spectrometry (MS) proteomics analysis (https://www.proteinspire.org). Its emphasis is on usability and integration of the best analytic tools. SPIRE provides an easy to use web-interface and generates results in both interactive and simple data formats. In contrast to run-based approaches, SPIRE conducts the analysis based on the experimental design. It employs novel methods to generate false discovery rates and local false discovery rates (FDR, LFDR) and integrates the best and complementary open-source search and data analysis methods. The SPIRE approach of integrating X!Tandem, OMSSA and SpectraST can produce an increase in protein IDs (52-88%) over current combinations of scoring and single search engines while also providing accurate multi-faceted error estimation. One of SPIRE's primary assets is combining the results with data on protein function, pathways and protein expression from model organisms. We demonstrate some of SPIRE's capabilities by analyzing mitochondrial proteins from the wild type and 3 mutants of C. elegans. SPIRE also connects results to publically available proteomics data through its Model Organism Protein Expression Database (MOPED). SPIRE can also provide analysis and annotation for user supplied protein ID and expression data.
View details for DOI 10.1016/j.jprot.2011.05.009
View details for Web of Science ID 000298710400013
View details for PubMedID 21609792
To address the monumental challenge of assigning function to millions of sequenced proteins, we completed the first of a kind all-versus-all sequence alignments using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences resulting in over 1 million proteins being assign to existing KOG groups and the remainder clustered into 100,000 functional groups.
View details for DOI 10.1089/omi.2011.0101
View details for Web of Science ID 000293440600010
View details for PubMedID 21809957
To gauge the current commitment to scientific research in the United States of America (US), we compared federal research funding (FRF) with the US gross domestic product (GDP) and industry research spending during the past six decades. In order to address the recent globalization of scientific research, we also focused on four key indicators of research activities: research and development (R&D) funding, total science and engineering doctoral degrees, patents, and scientific publications. We compared these indicators across three major population and economic regions: the US, the European Union (EU) and the People's Republic of China (China) over the past decade. We discovered a number of interesting trends with direct relevance for science policy. The level of US FRF has varied between 0.2% and 0.6% of the GDP during the last six decades. Since the 1960s, the US FRF contribution has fallen from twice that of industrial research funding to roughly equal. Also, in the last two decades, the portion of the US government R&D spending devoted to research has increased. Although well below the US and the EU in overall funding, the current growth rate for R&D funding in China greatly exceeds that of both. Finally, the EU currently produces more science and engineering doctoral graduates and scientific publications than the US in absolute terms, but not per capita. This study's aim is to facilitate a serious discussion of key questions by the research community and federal policy makers. In particular, our results raise two questions with respect to: a) the increasing globalization of science: "What role is the US playing now, and what role will it play in the future of international science?"; and b) the ability to produce beneficial innovations for society: "How will the US continue to foster its strengths?"
View details for DOI 10.1371/journal.pone.0012203
View details for Web of Science ID 000280968000026
View details for PubMedID 20808949
Large amounts of mass spectrometry (MS) proteomics data are now publicly available; however, little attention has been given to how to best combine these data and assess the error rates for protein identification. The objective of this article is to show how variation in the type and amount of data included with each study impacts coverage of the yeast proteome and estimation of the false discovery rate (FDR). Our analysis of a subset of the publicly available yeast data showed that failure to reevaluate the FDR when combining protein IDs from different experiments resulted in an underestimation of the FDR by approximately threefold. A worst-case approximation of the FDR was only slightly larger than estimating the FDR by randomized database matches. The use of a weighted model to emphasize the most informative experimental data provided an increase in the number of IDs at a 1% FDR when compared to other meta-analysis approaches. Also, using an FDR higher than 1% results in a very high rate of false discoveries for IDs above the 1% threshold. Ideally, raw MS data will be made publicly available for complete and consistent reanalysis. In the circumstance that raw data is not available, determining a combined FDR on the basis of the worst-case estimation provides a reasonable approximation of the FDR. When combining experimental results, adding additional experiments results in diminishing and in some cases negative returns on protein identifications. It may be beneficial to include only those experiments generating the most unique identifications due to solid experimental design and sensitive instrumentation.
View details for DOI 10.1089/omi.2010.0034
View details for Web of Science ID 000279046800009
View details for PubMedID 20569183