Atul Butte, MD, PhD is the new Director of the newly formed Institute for Computational Health Sciences (ICHS) at the University of California, San Francisco, and a Professor of Pediatrics. Dr. Butte trained in Computer Science at Brown University, worked as a software engineer at Apple and Microsoft, received his MD at Brown University, trained in Pediatrics and Pediatric Endocrinology at Children's Hospital Boston, then received his PhD from Harvard Medical School and MIT. Dr. Butte has authored nearly 200 publications, with research repeatedly featured in Wired Magazine, the New York Times, and the Wall Street Journal. Dr. Butte is also the principal investigator of ImmPort, the archival and dissemination repository for clinical and molecular datasets funded by the National Institute of Allergy and Infectious Diseases. In 2013, Dr. Butte was recognized by the White House as an Open Science Champion of Change for promoting science through publicly available data. Other recent awards include the 2014 E. Mead Johnson Award for Research in Pediatrics, the 2013 induction into the American Society for Clinical Investigation, the 2012 FierceBiotech IT “Top 10 Biotech Techies”, and the 2011 National Human Genome Research Institute Genomic Advance of the Month. Dr. Butte is also a founder of three investor-backed data-driven companies: Personalis, providing clinical interpretation of whole genome sequences; Carmenta, discovering diagnostics for pregnancy complications; and NuMedii, finding new uses for drugs through open molecular data.

Administrative Appointments

  • Director, Institute for Computational Health Sciences, University of California, San Francisco (2015 - Present)
  • External Scientific Advisory Board, Geisinger Health System (2011 - Present)
  • Scientific Program Chair, Big Data in Biomedicine Conference, Stanford University (2012 - 2013)
  • Principal Investigator, ImmPort Bioinformatics Support Contract (2012 - Present)
  • Division Chief, Division of Systems Medicine, Department of Pediatrics (2011 - Present)
  • External Scientific Advisory Board, Center for Human Immunology, National Institutes of Health (2009 - 2011)
  • Associate Director, CTSA Translational Informatics Program, Stanford Center for Clinical and Translational Education and Research (2008 - 2012)
  • Chair, External Advisory Board, Department of Biomedical Informatics, University of Pittsburgh (2007 - Present)
  • Director, Biomedical Informatics Scholarly Area (2007 - 2011)
  • Board of Directors, American Medical Informatics Association (2007 - 2012)
  • Scientific Program Committee, American Medical Informatics Association (2007 - 2007)
  • Study Section Reviewer, NIH Biomedical Computing and Health Informatics Study Section (2005 - 2009)
  • Scientific Program Committee, American Medical Informatics Association (2005 - 2005)
  • Study Section Reviewer, Special Emphasis Panel, National Heart, Lung and Blood Institute (2002 - 2002)
  • Study Section Reviewer, Special Emphasis Panel reviewing Planning Grants, NIH National Programs of Excellence in Biomedical Computing (2001 - 2002)
  • Scientific Program Committee, American Medical Informatics Association (2000 - 2002)
  • Informatics Committee Member, National Heart, Lung and Blood Institute Program of Genomic Applications (2000 - 2003)
  • Study Section Reviewer (Chartered), Biomedical Library and Informatics Review Committee (BLIRC), National Library of Medicine, NIH (2013 - Present)

Honors & Awards

  • E. Mead Johnson Award, Society for Pediatric Research (2014)
  • Elected member, American Society for Clinical Investigation (ASCI) (2013 - Present)
  • Invited Fellow for the Indonesian-American Symposium, National Academy of Sciences and Kavli Frontiers of Science (2013)
  • White House Champion of Change in Open Science, Office of Science and Technology Policy, White House (2013)
  • Outstanding Scientific Accomplishment recognized by NIH Director (Wed Afternoon Lecture Series), National Institutes of Health (2012)
  • “Top 10 Biotech Techies”, FierceBiotech IT (2012)
  • Genomic Advance of the Month, National Human Genome Research Institute (NHGRI) (2011)
  • Young Investigator Award, Society for Pediatric Research (2010)
  • Elected Fellow, American College of Medical Informatics (2009)
  • New Investigator Award, American Medical Informatics Association (2008)
  • Award for Outstanding Short Course, Society for Medical Decision Making (2007)
  • Tomorrow's Principal Investigator, Genome Technology Magazine (2007)
  • HHMI Physician-Scientist Early Career Award, Howard Hughes Medical Institute (2006-2011)
  • Research Starter Grant in Informatics, Pharmaceutical Research and Manufacturers of America Foundation (2006-2008)
  • Outstanding Speaker Award, American Association for Clinical Chemistry (2003)
  • Pathology Residents' Choice Award, Emory University School of Medicine (2003)
  • Outstanding Speaker Award, American Association for Clinical Chemistry (2002)
  • Travel Grant Award for exceptional research presented at the 84th Annual Meeting, The Endocrine Society (2002)
  • Clinical Scholar Award, Lawson Wilkins Pediatric Endocrine Society and NovoNordisk (2001)
  • Scholar-In-Training Award, American Association for Cancer Research and Pharmacia (2001)
  • Best Paper Finalist, American Medical Informatics Association (2000)
  • Best Student Paper, Third Place, American Medical Informatics Association (2000)
  • Fellow, Merck and Massachusetts Institute of Technology (2000)
  • Farley Fellow, Children's Hospital, Boston (1999)
  • Associate member, Sigma Xi (1995)
  • Research Training Fellowship for Medical Students, Howard Hughes Medical Institute (1994)
  • Research Scholars Program, Howard Hughes Medical Institute / National Institutes of Health (1993)
  • Eagle Scout, Boy Scouts of America (1984)

Professional Education

  • Ph.D., HST: MIT and Harvard Medical School, Health Sciences and Technology (2004)
  • M.S., MIT, Medical Informatics (2002)
  • Fellowship, Children's Hospital Boston, Pediatric Endocrinology (2001)
  • Residency, Children's Hospital Boston, Pediatrics (1998)
  • M.D., Brown Univ. School of Medicine (1995)
  • M.S., Brown Univ. School of Medicine, Medical Science (1995)
  • B.A. Honors, Brown University, Computer Science (1991)

Research & Scholarship

Current Research and Scholarly Interests

Atul Butte, MD, PhD is Chief of the Division of Systems Medicine and Associate Professor of Pediatrics, Medicine, and by courtesy, Computer Science, at Stanford University and Lucile Packard Children's Hospital. Dr. Butte trained in Computer Science at Brown University, worked as a software engineer at Apple and Microsoft, received his MD at Brown University, trained in Pediatrics and Pediatric Endocrinology at Children's Hospital Boston, then received his PhD in Health Sciences and Technology from Harvard Medical School and MIT.

The Butte Laboratory at Stanford builds and applies tools that convert more than 300 billion points of molecular, clinical, and epidemiological data measured by researchers and clinicians over the past decade into diagnostics, therapeutics, and new insights into disease. The Butte Laboratory has developed bioinformatics methods to take genomic, genetic, and phenotypic data from multiple sources and diseases, and reason over these data to create novel diagnostics and therapeutics and to discover novel molecular mechanisms of disease. Examples of this method include work on cancer drug discovery published in the Proceedings of the National Academy of Sciences (2000), on type 2 diabetes published in the Proceedings of the National Academy of Sciences (2003), on fat cell formation published in Nature Cell Biology (2005), on obesity published in Bioinformatics (2007), and on transplantation published in the Proceedings of the National Academy of Sciences (2009). To facilitate this, the Butte Lab has developed tools to automatically index and find genomic data sets based on the phenotypic and contextual details of each experiment, published in Nature Biotechnology (2006), to re-map microarray data, published in Nature Methods (2007), and to deconvolve multi-cellular samples, published in Nature Methods (2010). The Butte Lab has also been developing novel methods for comparing clinical data from electronic health record systems with gene expression data, as described in Science (2008), and was part of the team performing the first clinical annotation of a patient presenting with a whole genome, as described in the Lancet (2010). The Butte Laboratory has been funded by HHMI and under sixteen NIH grants.

Dr. Butte has authored more than 100 publications and delivered more than 120 invited presentations in personalized and systems medicine, biomedical informatics, and molecular diabetes, including 20 at the National Institutes of Health or NIH-related meetings. Dr. Butte's research has been featured in the New York Times Science Times and the International Herald Tribune (2008), Wall Street Journal (2010 and 2011), and San Jose Mercury News (2010). Dr. Butte's recent awards include the 2010 Society for Pediatric Research Young Investigator Award, induction into the American College of Medical Informatics in 2009, the 2008 AMIA New Investigator Award, the 2007 Genome Technology "Tomorrow's Principal Investigator" Award, the 2007 Society for Medical Decision Making Award for Outstanding Short Course, the 2006 Howard Hughes Medical Institute Early Career Award, the 2006 PhRMA Foundation Research Starter Grant in Informatics, and the 2002 and 2003 American Association for Clinical Chemistry Outstanding Speaker Award. Dr. Butte also co-authored one of the first books on microarray analysis titled "Microarrays for an Integrative Genomics" published by MIT Press.

Clinical Trials

  • Genome, Proteome and Tissue Microarray in Childhood Acute Leukemia (Recruiting)

    We will study gene and protein expression in leukemia cells of children diagnosed with acute leukemia. We hope to identify genes or proteins which can help us grade leukemia at diagnosis in order to: (a) develop better means of diagnosis and (b) more accurately choose the best therapy for each patient.

  • Phase IIa Desipramine in Small Cell Lung Cancer and Other High-Grade Neuroendocrine Tumors (Not Recruiting)

    Intrapatient dose escalation of desipramine. Start at 75 mg daily. Increase by 75 mg weekly to maximum of 450 mg daily. Taper desipramine upon disease progression, unacceptable toxicity or patient withdrawal from study.

    Stanford is currently not accepting patients for this trial. For more information, please contact CCTO, 650-498-7061.


Journal Articles

  • Disease risk factors identified through shared genetic architecture and electronic medical records. SCIENCE TRANSLATIONAL MEDICINE Li, L., Ruau, D. J., Patel, C. J., Weber, S. C., Chen, R., Tatonetti, N. P., Dudley, J. T., Butte, A. J. 2014; 6 (234): 234ra57


    Genome-wide association studies have identified genetic variants for thousands of diseases and traits. We evaluated the relationships between specific risk factors (for example, blood cholesterol level) and diseases on the basis of their shared genetic architecture in a comprehensive human disease-single-nucleotide polymorphism association database (VARIMED), analyzing the findings from 8962 published association studies. Similarity between traits and diseases was statistically evaluated on the basis of their association with shared gene variants. We identified 120 disease-trait pairs that were statistically similar, and of these, we tested and validated five previously unknown disease-trait associations by searching electronic medical records (EMRs) from three independent medical centers for evidence of the trait appearing in patients within 1 year of first diagnosis of the disease. We validated that the mean corpuscular volume is elevated before diagnosis of acute lymphoblastic leukemia; both have associated variants in the gene IKZF1. Platelet count is decreased before diagnosis of alcohol dependence; both are associated with variants in the gene C12orf51. Alkaline phosphatase level is elevated in patients with venous thromboembolism; both share variants in ABO. Similarly, we found that prostate-specific antigen and serum magnesium levels were altered before the diagnosis of lung cancer and gastric cancer, respectively. Disease-trait associations identify traits that could serve as future prognostics, if validated through EMR and subsequent prospective trials.

    DOI: 10.1126/scitranslmed.3007191
    PubMedID: 24786325

  • Ethnic Differences in the Relationship Between Insulin Sensitivity and Insulin Response: A systematic review and meta-analysis DIABETES CARE Kodama, K., Tojjar, D., Yamada, S., Toda, K., Patel, C. J., Butte, A. J. 2013; 36 (6): 1789-1796


    OBJECTIVE: Human blood glucose levels have likely evolved toward their current point of stability over hundreds of thousands of years. The robust population stability of this trait is called canalization. It has been represented by a hyperbolic function of two variables: insulin sensitivity and insulin response. Environmental changes due to global migration may have pushed some human subpopulations to different points of stability. We hypothesized that there may be ethnic differences in the optimal states in the relationship between insulin sensitivity and insulin response. RESEARCH DESIGN AND METHODS: We identified studies that measured the insulin sensitivity index (SI) and acute insulin response to glucose (AIRg) in three major ethnic groups: Africans, Caucasians, and East Asians. We identified 74 study cohorts comprising 3,813 individuals (19 African cohorts, 31 Caucasian, and 24 East Asian). We calculated the hyperbolic relationship using the mean values of SI and AIRg in the healthy cohorts with normal glucose tolerance. RESULTS: We found that Caucasian subpopulations were located around the middle point of the hyperbola, while African and East Asian subpopulations were located around unstable extreme points, where a small change in one variable is associated with a large nonlinear change in the other variable. CONCLUSIONS: Our findings suggest that the genetic background of Africans and East Asians makes them more and differentially susceptible to diabetes than Caucasians. This ethnic stratification could be implicated in the different natural courses of diabetes onset.

    DOI: 10.2337/dc12-1235
    Web of Science ID: 000321472600056
    PubMedID: 23704681

  • Systematic identification of interaction effects between genome- and environment-wide associations in type 2 diabetes mellitus HUMAN GENETICS Patel, C. J., Chen, R., Kodama, K., Ioannidis, J. P., Butte, A. J. 2013; 132 (5): 495-508


    Diseases such as type 2 diabetes (T2D) result from environmental and genetic factors, and risk varies considerably in the population. T2D-related genetic loci discovered to date explain only a small portion of the T2D heritability. Some heritability may be due to gene-environment interactions. However, documenting these interactions has been difficult due to low availability of concurrent genetic and environmental measures, selection bias, and challenges in controlling for multiple hypothesis testing. Through genome-wide association studies (GWAS), investigators have identified over 90 single nucleotide polymorphisms (SNPs) associated with T2D. Using a method analogous to GWAS [environment-wide association study (EWAS)], we found five environmental factors associated with the disease. By focusing on risk factors that emerge from GWAS and EWAS, it is possible to overcome difficulties in uncovering gene-environment interactions. Using data from the National Health and Nutrition Examination Survey (NHANES), we screened 18 SNPs and 5 serum-based environmental factors for interaction in association with T2D. We controlled for multiple hypotheses using false discovery rate (FDR) and Bonferroni correction and found four interactions with FDR < 20%. The interaction between rs13266634 (SLC30A8) and trans-β-carotene withstood Bonferroni correction (corrected p = 0.006, FDR < 1.5%). The per-risk-allele effect sizes in subjects with low levels of trans-β-carotene were 40% greater than the marginal effect size [odds ratio (OR) 1.8, 95% CI 1.3-2.6]. We hypothesize that impaired function driven by rs13266634 increases T2D risk when combined with serum levels of nutrients. Unbiased consideration of environmental and genetic factors may help identify larger and more relevant effect sizes for disease associations.

    DOI: 10.1007/s00439-012-1258-z
    Web of Science ID: 000317691100002
    PubMedID: 23334806

  • Analysis of the Genetic Basis of Disease in the Context of Worldwide Human Relationships and Migration PLOS GENETICS Corona, E., Chen, R., Sikora, M., Morgan, A. A., Patel, C. J., Ramesh, A., Bustamante, C. D., Butte, A. J. 2013; 9 (5)


    Genetic diversity across different human populations can enhance understanding of the genetic basis of disease. We calculated the genetic risk of 102 diseases in 1,043 unrelated individuals across 51 populations of the Human Genome Diversity Panel. We found that genetic risk for type 2 diabetes and pancreatic cancer decreased as humans migrated toward East Asia. In addition, biliary liver cirrhosis, alopecia areata, bladder cancer, inflammatory bowel disease, membranous nephropathy, systemic lupus erythematosus, systemic sclerosis, ulcerative colitis, and vitiligo have undergone genetic risk differentiation. This analysis represents a large-scale attempt to characterize genetic risk differentiation in the context of migration. We anticipate that our findings will enable detailed analysis pertaining to the driving forces behind genetic risk differentiation.

    DOI: 10.1371/journal.pgen.1003447
    Web of Science ID: 000320030000003
    PubMedID: 23717210

  • Expression-based genome-wide association study links the receptor CD44 in adipose tissue with type 2 diabetes PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Kodama, K., Horikoshi, M., Toda, K., Yamada, S., Hara, K., Irie, J., Sirota, M., Morgan, A. A., Chen, R., Ohtsu, H., Maeda, S., Kadowaki, T., Butte, A. J. 2012; 109 (18): 7049-7054


    Type 2 diabetes (T2D) is a complex, polygenic disease affecting nearly 300 million people worldwide. T2D is primarily characterized by insulin resistance, and growing evidence has indicated a causative link between adipose tissue inflammation and the development of insulin resistance. Genetic association studies have successfully revealed a number of important genes consistently associated with T2D to date. However, these robust T2D-associated genes do not fully elucidate the mechanisms underlying the development and progression of the disease. Here, we report an alternative approach, gene expression-based genome-wide association study (eGWAS): searching for genes repeatedly implicated in functional microarray experiments (often publicly available). We performed an eGWAS across 130 independent experiments (1,175 T2D case-control microarrays in total) to find additional genes implicated in the molecular pathogenesis of T2D and identified the immune-cell receptor CD44 as our top candidate (P = 8.5 × 10^-20). We found that CD44 deficiency in a diabetic mouse model ameliorates insulin resistance and adipose tissue inflammation, and also found that anti-CD44 antibody treatment decreases blood glucose levels and adipose tissue macrophage accumulation in a high-fat diet-fed mouse model. Further, in humans, we observed that CD44 is expressed in inflammatory cells in obese adipose tissue and discovered that serum CD44 levels were positively correlated with insulin resistance and glycemic control. CD44 likely plays a causative role in the development of adipose tissue inflammation and insulin resistance in rodents and humans. Genes repeatedly implicated in publicly available experimental data may have unique functionally important roles in T2D and other complex diseases.

    DOI: 10.1073/pnas.1114513109
    Web of Science ID: 000303602100060
    PubMedID: 22499789

  • Personal Omics Profiling Reveals Dynamic Molecular and Medical Phenotypes CELL Chen, R., Mias, G. I., Li-Pook-Than, J., Jiang, L., Lam, H. Y., Chen, R., Miriami, E., Karczewski, K. J., Hariharan, M., Dewey, F. E., Cheng, Y., Clark, M. J., Im, H., Habegger, L., Balasubramanian, S., O'Huallachain, M., Dudley, J. T., Hillenmeyer, S., Haraksingh, R., Sharon, D., Euskirchen, G., Lacroute, P., Bettinger, K., Boyle, A. P., Kasowski, M., Grubert, F., Seki, S., Garcia, M., Whirl-Carrillo, M., Gallardo, M., Blasco, M. A., Greenberg, P. L., Snyder, P., Klein, T. E., Altman, R. B., Butte, A. J., Ashley, E. A., Gerstein, M., Nadeau, K. C., Tang, H., Snyder, M. 2012; 148 (6): 1293-1307


    Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here, we present an integrative personal omics profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14 month period. Our iPOP analysis revealed various medical risks, including type 2 diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high-coverage genomic and transcriptomic data, which provide the basis of our iPOP, revealed extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity.

    DOI: 10.1016/j.cell.2012.02.009
    Web of Science ID: 000301889500023
    PubMedID: 22424236

  • Performance comparison of exome DNA sequencing technologies NATURE BIOTECHNOLOGY Clark, M. J., Chen, R., Lam, H. Y., Karczewski, K. J., Chen, R., Euskirchen, G., Butte, A. J., Snyder, M. 2011; 29 (10): 908-U206


    Whole exome sequencing by high-throughput sequencing of target-enriched genomic DNA (exome-seq) has become common in basic and translational research as a means of interrogating the interpretable part of the human genome at relatively low cost. We present a comparison of three major commercial exome sequencing platforms from Agilent, Illumina and Nimblegen applied to the same human blood sample. Our results suggest that the Nimblegen platform, which is the only one to use high-density overlapping baits, covers fewer genomic regions than the other platforms but requires the least amount of sequencing to sensitively detect small variants. Agilent and Illumina are able to detect a greater total number of variants with additional sequencing. Illumina captures untranslated regions, which are not targeted by the Nimblegen and Agilent platforms. We also compare exome sequencing and whole genome sequencing (WGS) of the same sample, demonstrating that exome sequencing can detect additional small variants missed by WGS.

    DOI: 10.1038/nbt.1975
    Web of Science ID: 000296273000017
    PubMedID: 21947028

  • Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data SCIENCE TRANSLATIONAL MEDICINE Sirota, M., Dudley, J. T., Kim, J., Chiang, A. P., Morgan, A. A., Sweet-Cordero, A., Sage, J., Butte, A. J. 2011; 3 (96)


    The application of established drug compounds to new therapeutic indications, known as drug repositioning, offers several advantages over traditional drug development, including reduced development costs and shorter paths to approval. Recent approaches to drug repositioning use high-throughput experimental approaches to assess a compound's potential therapeutic qualities. Here, we present a systematic computational approach to predict novel therapeutic indications on the basis of comprehensive testing of molecular signatures in drug-disease pairs. We integrated gene expression measurements from 100 diseases and gene expression measurements on 164 drug compounds, yielding predicted therapeutic potentials for these drugs. We recovered many known drug and disease relationships using computationally derived therapeutic potentials and also predict many new indications for these 164 drugs. We experimentally validated a prediction for the antiulcer drug cimetidine as a candidate therapeutic in the treatment of lung adenocarcinoma, and demonstrate its efficacy both in vitro and in vivo using mouse xenograft models. This computational method provides a systematic approach for repositioning established drugs to treat a wide range of human diseases.

    DOI: 10.1126/scitranslmed.3001318
    Web of Science ID: 000293953100005
    PubMedID: 21849665

  • Computational Repositioning of the Anticonvulsant Topiramate for Inflammatory Bowel Disease SCIENCE TRANSLATIONAL MEDICINE Dudley, J. T., Sirota, M., Shenoy, M., Pai, R. K., Roedder, S., Chiang, A. P., Morgan, A. A., Sarwal, M. M., Pasricha, P. J., Butte, A. J. 2011; 3 (96)


    Inflammatory bowel disease (IBD) is a chronic inflammatory disorder of the gastrointestinal tract for which there are few safe and effective therapeutic options for long-term treatment and disease maintenance. Here, we applied a computational approach to discover new drug therapies for IBD in silico, using publicly available molecular data reporting gene expression in IBD samples and 164 small-molecule drug compounds. Among the top compounds predicted to be therapeutic for IBD by our approach were prednisolone, a corticosteroid used to treat IBD, and topiramate, an anticonvulsant drug not previously described to have efficacy for IBD or any related disorders of inflammation or the gastrointestinal tract. Using a trinitrobenzenesulfonic acid (TNBS)-induced rodent model of IBD, we experimentally validated our topiramate prediction in vivo. Oral administration of topiramate significantly reduced gross pathological signs and microscopic damage in primary affected colon tissue in the TNBS-induced rodent model of IBD. These findings suggest that topiramate might serve as a therapeutic option for IBD in humans and support the use of public molecular data and computational approaches to discover new therapeutic options for disease.

    DOI: 10.1126/scitranslmed.3002648
    Web of Science ID: 000293953100004
    PubMedID: 21849664

  • Differentially Expressed RNA from Public Microarray Data Identifies Serum Protein Biomarkers for Cross-Organ Transplant Rejection and Other Conditions PLOS COMPUTATIONAL BIOLOGY Chen, R., Sigdel, T. K., Li, L., Kambham, N., Dudley, J. T., Hsieh, S., Klassen, R. B., Chen, A., Caohuu, T., Morgan, A. A., Valantine, H. A., Khush, K. K., Sarwal, M. M., Butte, A. J. 2010; 6 (9)


    Serum proteins are routinely used to diagnose diseases, but are hard to find due to low sensitivity in screening the serum proteome. Public repositories of microarray data, such as the Gene Expression Omnibus (GEO), contain RNA expression profiles for more than 16,000 biological conditions, covering more than 30% of United States mortality. We hypothesized that genes coding for serum- and urine-detectable proteins, and showing differential expression of RNA in disease-damaged tissues would make ideal diagnostic protein biomarkers for those diseases. We showed that predicted protein biomarkers are significantly enriched for known diagnostic protein biomarkers in 22 diseases, with enrichment significantly higher in diseases for which at least three datasets are available. We then used this strategy to search for new biomarkers indicating acute rejection (AR) across different types of transplanted solid organs. We integrated three biopsy-based microarray studies of AR from pediatric renal, adult renal and adult cardiac transplantation and identified 45 genes upregulated in all three. From this set, we chose 10 proteins for serum ELISA assays in 39 renal transplant patients, and discovered three that were significantly higher in AR. Interestingly, all three proteins were also significantly higher during AR in the 63 cardiac transplant recipients studied. Our best marker, serum PECAM1, identified renal AR with 89% sensitivity and 75% specificity, and also showed increased expression in AR by immunohistochemistry in renal, hepatic and cardiac transplant biopsies. Our results demonstrate that integrating gene expression microarray measurements from disease samples and even publicly-available data sets can be a powerful, fast, and cost-effective strategy for the discovery of new diagnostic serum protein biomarkers.

    DOI: 10.1371/journal.pcbi.1000940
    Web of Science ID: 000282372600010
    PubMedID: 20885780

  • Extreme Evolutionary Disparities Seen in Positive Selection across Seven Complex Diseases PLOS ONE Corona, E., Dudley, J. T., Butte, A. J. 2010; 5 (8)


    Positive selection is known to occur when the environment that an organism inhabits is suddenly altered, as is the case across recent human history. Genome-wide association studies (GWASs) have successfully illuminated disease-associated variation. However, whether human evolution is heading towards or away from disease susceptibility in general remains an open question. The genetic basis of common complex disease may partially be caused by positive selection events, which simultaneously increased fitness and susceptibility to disease. We analyze seven diseases studied by the Wellcome Trust Case Control Consortium to compare evidence for selection at every locus associated with disease. We take a large set of the most strongly associated SNPs in each GWA study in order to capture more hidden associations at the cost of introducing false positives into our analysis. We then search for signs of positive selection in this inclusive set of SNPs. There are striking differences between the seven studied diseases. We find alleles increasing susceptibility to Type 1 Diabetes (T1D), Rheumatoid Arthritis (RA), and Crohn's Disease (CD) underwent recent positive selection. There is more selection in alleles increasing, rather than decreasing, susceptibility to T1D. In the 80 SNPs most associated with T1D (p-value < 7.01 × 10^-5) showing strong signs of positive selection, 58 alleles associated with disease susceptibility show signs of positive selection, while only 22 associated with disease protection show signs of positive selection. Alleles increasing susceptibility to RA are under selection as well. In contrast, selection in SNPs associated with CD favors protective alleles. These results inform the current understanding of disease etiology, shed light on potential benefits associated with the genetic basis of disease, and aid in the efforts to identify causal genetic factors underlying complex disease.

    DOI: 10.1371/journal.pone.0012236
    Web of Science ID: 000280968100028
    PubMedID: 20808933

  • An Environment-Wide Association Study (EWAS) on Type 2 Diabetes Mellitus PLOS ONE Patel, C. J., Bhattacharya, J., Butte, A. J. 2010; 5 (5)


    Type 2 Diabetes (T2D) and other chronic diseases are caused by a complex combination of many genetic and environmental factors. Few methods are available to comprehensively associate specific physical environmental factors with disease. We conducted a pilot Environment-Wide Association Study (EWAS), in which epidemiological data are comprehensively and systematically interpreted in a manner analogous to a Genome Wide Association Study (GWAS). We performed multiple cross-sectional analyses associating 266 unique environmental factors with clinical status for T2D defined by fasting blood sugar (FBG) concentration ≥ 126 mg/dL. We utilized available Centers for Disease Control (CDC) National Health and Nutrition Examination Survey (NHANES) cohorts from years 1999 to 2006. Within-cohort sample numbers ranged from 503 to 3,318. Logistic regression models were adjusted for age, sex, body mass index (BMI), ethnicity, and an estimate of socioeconomic status (SES). As in GWAS, multiple comparisons were controlled and significant findings were validated with other cohorts. We discovered significant associations for the pesticide-derivative heptachlor epoxide (adjusted OR in three combined cohorts of 1.7 for a 1 SD change in exposure amount; p < 0.001), and the vitamin gamma-tocopherol (adjusted OR 1.5; p < 0.001). Higher concentrations of polychlorinated biphenyls (PCBs) such as PCB170 (adjusted OR 2.2; p < 0.001) were also found. Protective factors associated with T2D included beta-carotenes (adjusted OR 0.6; p < 0.001). Despite difficulty in ascertaining causality, the potential for novel factors of large effect associated with T2D justifies the use of EWAS to create hypotheses regarding the broad contribution of the environment to disease. Even in this study based on prior collected epidemiological measures, environmental factors can be found with effect sizes comparable to the best loci yet found by GWAS.

    View details for DOI 10.1371/journal.pone.0010746

    View details for Web of Science ID 000278017300017

    View details for PubMedID 20505766

  • Clinical assessment incorporating a personal genome LANCET Ashley, E. A., Butte, A. J., Wheeler, M. T., Chen, R., Klein, T. E., Dewey, F. E., Dudley, J. T., Ormond, K. E., Pavlovic, A., Morgan, A. A., Pushkarev, D., Neff, N. F., Hudgins, L., Gong, L., Hodges, L. M., Berlin, D. S., Thorn, C. F., Sangkuhl, K., Hebert, J. M., Woon, M., Sagreiya, H., Whaley, R., Knowles, J. W., Chou, M. F., Thakuria, J. V., Rosenbaum, A. M., Zaranek, A. W., Church, G. M., Greely, H. T., Quake, S. R., Altman, R. B. 2010; 375 (9725): 1525-1535


    The cost of genomic information has fallen steeply, but the clinical translation of genetic risk estimates remains unclear. We aimed to undertake an integrated analysis of a complete human genome in a clinical context. We assessed a patient with a family history of vascular disease and early sudden death. Clinical assessment included analysis of this patient's full genome sequence, risk prediction for coronary artery disease, screening for causes of sudden cardiac death, and genetic counselling. Genetic analysis included the development of novel methods for the integration of whole genome and clinical risk. Disease and risk analysis focused on prediction of genetic risk of variants associated with mendelian disease, recognised drug responses, and pathogenicity for novel variants. We queried disease-specific mutation databases and pharmacogenomics databases to identify genes and mutations with known associations with disease and drug response. We estimated post-test probabilities of disease by applying likelihood ratios derived from integration of multiple common variants to age-appropriate and sex-appropriate pre-test probabilities. We also accounted for gene-environment interactions and conditionally dependent risks. Analysis of 2.6 million single nucleotide polymorphisms and 752 copy number variations showed increased genetic risk for myocardial infarction, type 2 diabetes, and some cancers. We discovered rare variants in three genes that are clinically associated with sudden cardiac death: TMEM43, DSP, and MYBPC3. A variant in LPA was consistent with a family history of coronary artery disease. The patient had a heterozygous null mutation in CYP2C19 suggesting probable clopidogrel resistance, several variants associated with a positive response to lipid-lowering therapy, and variants in CYP4F2 and VKORC1 that suggest he might have a low initial dosing requirement for warfarin. Many variants of uncertain importance were reported. Although challenges remain, our results suggest that whole-genome sequencing can yield useful and clinically relevant information for individual patients. Funding: National Institute of General Medical Sciences; National Heart, Lung And Blood Institute; National Human Genome Research Institute; Howard Hughes Medical Institute; National Library of Medicine; Lucile Packard Foundation for Children's Health; Hewlett Packard Foundation; Breetwor Family Foundation.

    View details for Web of Science ID 000277655100025

    View details for PubMedID 20435227
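
    The abstract above describes estimating post-test probabilities by applying likelihood ratios to pre-test probabilities. A minimal sketch of that general calculation follows; it is not the authors' implementation. The function name and the example numbers are hypothetical, and the sketch assumes the likelihood ratios are independent, whereas the study states that it also accounted for conditionally dependent risks.

```python
def post_test_probability(pre_test_probability, likelihood_ratios):
    """Update a pre-test probability with a sequence of likelihood ratios.

    Assumption (not from the paper): the likelihood ratios are treated as
    independent, so they can simply be multiplied on the odds scale.
    """
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must be strictly between 0 and 1")

    # Convert probability to odds: odds = p / (1 - p)
    odds = pre_test_probability / (1.0 - pre_test_probability)

    # Each likelihood ratio scales the odds multiplicatively
    for lr in likelihood_ratios:
        if lr <= 0:
            raise ValueError("likelihood ratios must be positive")
        odds *= lr

    # Convert odds back to a probability: p = odds / (1 + odds)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical values for illustration only: a 10% pre-test probability
    # and three variant-derived likelihood ratios.
    updated = post_test_probability(0.10, [1.2, 0.9, 1.5])
    print(f"post-test probability: {updated:.3f}")
```

    Working on the odds scale keeps each update a single multiplication, which is why likelihood ratios rather than raw relative risks are convenient for combining many common variants.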

  • Cell type-specific gene expression differences in complex tissues NATURE METHODS Shen-Orr, S. S., Tibshirani, R., Khatri, P., Bodian, D. L., Staedtler, F., Perry, N. M., Hastie, T., Sarwal, M. M., Davis, M. M., Butte, A. J. 2010; 7 (4): 287-289


    We describe cell type-specific significance analysis of microarrays (csSAM) for analyzing differential gene expression for each cell type in a biological sample from microarray data and relative cell-type frequencies. First, we validated csSAM with predesigned mixtures and then applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.

    View details for DOI 10.1038/NMETH.1439

    View details for Web of Science ID 000276150600017

    View details for PubMedID 20208531

  • Translational Bioinformatics: Coming of Age JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Butte, A. J. 2008; 15 (6): 709-714


    The American Medical Informatics Association (AMIA) recently augmented the scope of its activities to encompass translational bioinformatics as a third major domain of informatics. The AMIA has defined translational bioinformatics as "... the development of storage, analytic, and interpretive methods to optimize the transformation of increasingly voluminous biomedical data into proactive, predictive, preventative, and participatory health." In this perspective, I will list eight reasons why this is an excellent time to be studying translational bioinformatics, including the significant increase in funding opportunities available for informatics from the United States National Institutes of Health, and the explosion of publicly-available data sets of molecular measurements. I end with the significant challenges we face in building a community of future investigators in Translational Bioinformatics.

    View details for DOI 10.1197/jamia.M2824

    View details for Web of Science ID 000260905500001

    View details for PubMedID 18755990

  • Medicine - The ultimate model organism SCIENCE Butte, A. J. 2008; 320 (5874): 325-327

    View details for DOI 10.1126/science.1158343

    View details for Web of Science ID 000255026100028

    View details for PubMedID 18420921

  • The use and analysis of microarray data NATURE REVIEWS DRUG DISCOVERY Butte, A. 2002; 1 (12): 951-960


    Functional genomics is the study of gene function through the parallel expression measurements of genomes, most commonly using the technologies of microarrays and serial analysis of gene expression. Microarray usage in drug discovery is expanding, and its applications include basic research and target discovery, biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests and disease-subclass determination. This article reviews the different ways to analyse large sets of microarray data, including the questions that can be asked and the challenges in interpreting the measurements.

    View details for DOI 10.1038/nrd.961

    View details for Web of Science ID 000179554800014

    View details for PubMedID 12461517

  • Organ Size Control Is Dominant over Rb Family Inactivation to Restrict Proliferation In Vivo. Cell reports Ehmer, U., Zmoos, A., Auerbach, R. K., Vaka, D., Butte, A. J., Kay, M. A., Sage, J. 2014; 8 (2): 371-381


    In mammals, a cell's decision to divide is thought to be under the control of the Rb/E2F pathway. We previously found that inactivation of the Rb family of cell cycle inhibitors (Rb, p107, and p130) in quiescent liver progenitors leads to uncontrolled division and cancer initiation. Here, we show that, in contrast, deletion of the entire Rb gene family in mature hepatocytes is not sufficient for their long-term proliferation. The cell cycle block in Rb family mutant hepatocytes is independent of the Arf/p53/p21 checkpoint but can be abrogated upon decreasing liver size. At the molecular level, we identify YAP, a transcriptional regulator involved in organ size control, as a factor required for the sustained expression of cell cycle genes in hepatocytes. These experiments identify a higher level of regulation of the cell cycle in vivo in which signals regulating organ size are dominant regulators of the core cell cycle machinery.

    View details for DOI 10.1016/j.celrep.2014.06.025

    View details for PubMedID 25017070

  • Investigation of maternal environmental exposures in association with self-reported preterm birth REPRODUCTIVE TOXICOLOGY Patel, C. J., Yang, T., Hu, Z., Wen, Q., Sung, J., El-Sayed, Y. Y., Cohen, H., Gould, J., Stevenson, D. K., Shaw, G. M., Ling, X. B., Butte, A. J. 2014; 45: 1-7


    Identification of maternal environmental factors influencing preterm birth risks is important to understand the reasons for the increase in prematurity since 1990. Here, we utilized a health survey, the US National Health and Nutrition Examination Survey (NHANES), to search for personal environmental factors associated with preterm birth. 201 urine and blood markers of environmental factors, such as allergens, pollutants, and nutrients, were assayed in mothers (range of N: 49-724) who answered questions about any children born preterm (delivery <37 weeks). We screened each of the 201 factors for association with any child born preterm, adjusting for age, race/ethnicity, education, and household income. We attempted to verify the top finding, urinary bisphenol A, in an independent study of pregnant women attending Lucile Packard Children's Hospital. We conclude that the association between maternal urinary levels of bisphenol A and preterm birth should be evaluated in a larger epidemiological investigation.

    View details for DOI 10.1016/j.reprotox.2013.12.005

    View details for Web of Science ID 000336415800001

  • Whole-Exome Sequencing Reveals TopBP1 as a Novel Gene in Idiopathic Pulmonary Arterial Hypertension AMERICAN JOURNAL OF RESPIRATORY AND CRITICAL CARE MEDICINE Perez, V. A., Yuan, K., Lyuksyutova, M. A., Dewey, F., Orcholski, M. E., Shuffle, E. M., Mathur, M., Yancy, L., Rojas, V., Li, C. G., Cao, A., Alastalo, T., Khazeni, N., Cimprich, K. A., Butte, A. J., Ashley, E., Zamanian, R. T. 2014; 189 (10): 1260-1272


    Rationale: Idiopathic pulmonary arterial hypertension (IPAH) is a life-threatening disorder characterized by progressive loss of pulmonary microvessels. While mutations in the bone morphogenetic receptor (BMPR) 2 are found in 80% of heritable and approximately 15% of IPAH patients, their low penetrance (approximately 20%) suggests that other as-yet unidentified genetic modifiers are required for manifestation of the disease phenotype. Use of whole exome sequencing (WES) has recently led to the discovery of novel susceptibility genes in heritable PAH but whether WES can also accelerate gene discovery in IPAH remains unknown. Objectives: To determine whether WES can help identify novel gene modifiers in IPAH patients. Methods and Measurements: Exome capture and sequencing was performed on genomic DNA isolated from 12 unrelated IPAH patients lacking BMPR2 mutations. Observed genetic variants were prioritized according to their pathogenic potential using ANNOVAR. Main Results: A total of 10 genes were identified as high priority candidates. Our top hit was TopBP1, a gene involved in the response to DNA damage and replication stress. We found that TopBP1 expression was reduced in vascular lesions and pulmonary endothelial cells isolated from IPAH patients. While TopBP1 deficiency made endothelial cells susceptible to DNA damage and apoptosis in response to hydroxyurea, its restoration resulted in less DNA damage and improved cell survival. Conclusions: WES led to the discovery of TopBP1, a gene whose deficiency may increase susceptibility to small vessel loss in IPAH. We predict that use of WES will help identify gene modifiers that influence an individual's risk of developing IPAH.

    View details for DOI 10.1164/rccm.201310-1749OC

    View details for Web of Science ID 000336017200018

  • A Meta-analysis of Lung Cancer Gene Expression Identifies PTK7 as a Survival Gene in Lung Adenocarcinoma CANCER RESEARCH Chen, R., Khatri, P., Mazur, P. K., Polin, M., Zheng, Y., Vaka, D., Hoang, C. D., Shrager, J., Xu, Y., Vicent, S., Butte, A. J., Sweet-Cordero, E. A. 2014; 74 (10): 2892-2902


    Lung cancer remains the most common cause of cancer-related death worldwide and it continues to lack effective treatment. The increasingly large and diverse public databases of lung cancer gene expression constitute a rich source of candidate oncogenic drivers and therapeutic targets. To define novel targets for lung adenocarcinoma, we conducted a large-scale meta-analysis of genes specifically overexpressed in adenocarcinoma. We identified an 11-gene signature that was overexpressed consistently in adenocarcinoma specimens relative to normal lung tissue. Six genes in this signature were specifically overexpressed in adenocarcinoma relative to other subtypes of non-small cell lung cancer (NSCLC). Among these genes was the little studied protein tyrosine kinase PTK7. Immunohistochemical analysis confirmed that PTK7 is highly expressed in primary adenocarcinoma patient samples. RNA interference-mediated attenuation of PTK7 decreased cell viability and increased apoptosis in a subset of adenocarcinoma cell lines. Further, loss of PTK7 activated the MKK7-JNK stress response pathway and impaired tumor growth in xenotransplantation assays. Our work defines PTK7 as a highly and specifically expressed gene in adenocarcinoma and a potential therapeutic target in this subset of NSCLC.

    View details for DOI 10.1158/0008-5472.CAN-13-2775

    View details for Web of Science ID 000336720700024

    View details for PubMedID 24654231

  • Disease Risk Factors Identified Through Shared Genetic Architecture and Electronic Medical Records SCIENCE TRANSLATIONAL MEDICINE Li, L., Ruau, D. J., Patel, C. J., Weber, S. C., Chen, R., Tatonetti, N. P., Dudley, J. T., Butte, A. J. 2014; 6 (234)
  • Clinical interpretation and implications of whole-genome sequencing. JAMA-the journal of the American Medical Association Dewey, F. E., Grove, M. E., Pan, C., Goldstein, B. A., Bernstein, J. A., Chaib, H., Merker, J. D., Goldfeder, R. L., Enns, G. M., David, S. P., Pakdaman, N., Ormond, K. E., Caleshu, C., Kingham, K., Klein, T. E., Whirl-Carrillo, M., Sakamoto, K., Wheeler, M. T., Butte, A. J., Ford, J. M., Boxer, L., Ioannidis, J. P., Yeung, A. C., Altman, R. B., Assimes, T. L., Snyder, M., Ashley, E. A., Quertermous, T. 2014; 311 (10): 1035-1045


    Importance: Whole-genome sequencing (WGS) is increasingly applied in clinical medicine and is expected to uncover clinically significant findings regardless of sequencing indication. Objectives: To examine coverage and concordance of clinically relevant genetic variation provided by WGS technologies; to quantitate inherited disease risk and pharmacogenomic findings in WGS data and resources required for their discovery and interpretation; and to evaluate clinical action prompted by WGS findings. Design, Setting, and Participants: An exploratory study of 12 adult participants recruited at Stanford University Medical Center who underwent WGS between November 2011 and March 2012. A multidisciplinary team reviewed all potentially reportable genetic findings. Five physicians proposed initial clinical follow-up based on the genetic findings. Main Outcomes and Measures: Genome coverage and sequencing platform concordance in different categories of genetic disease risk, person-hours spent curating candidate disease-risk variants, interpretation agreement between trained curators and disease genetics databases, burden of inherited disease risk and pharmacogenomic findings, and burden and interrater agreement of proposed clinical follow-up. Results: Depending on sequencing platform, 10% to 19% of inherited disease genes were not covered to accepted standards for single nucleotide variant discovery. Genotype concordance was high for previously described single nucleotide genetic variants (99%-100%) but low for small insertion/deletion variants (53%-59%). Curation of 90 to 127 genetic variants in each participant required a median of 54 minutes (range, 5-223 minutes) per genetic variant, resulted in moderate classification agreement between professionals (Gross κ, 0.52; 95% CI, 0.40-0.64), and reclassified 69% of genetic variants cataloged as disease causing in mutation databases to variants of uncertain or lesser significance. Two to 6 personal disease-risk findings were discovered in each participant, including 1 frameshift deletion in the BRCA1 gene implicated in hereditary breast and ovarian cancer. Physician review of sequencing findings prompted consideration of a median of 1 to 3 initial diagnostic tests and referrals per participant, with fair interrater agreement about the suitability of WGS findings for clinical follow-up (Fleiss κ, 0.24; P < .001). Conclusions and Relevance: In this exploratory study of 12 volunteer adults, the use of WGS was associated with incomplete coverage of inherited disease genes, low reproducibility of detection of genetic variation with the highest potential clinical effects, and uncertainty about clinically reportable findings. In certain cases, WGS will identify clinically actionable genetic variants warranting early medical intervention. These issues should be considered when determining the role of WGS in clinical medicine.

    View details for DOI 10.1001/jama.2014.1717

    View details for PubMedID 24618965

  • Collaborative biomedicine in the age of big data: the case of cancer. Journal of medical Internet research Shaikh, A. R., Butte, A. J., Schully, S. D., Dalton, W. S., Khoury, M. J., Hesse, B. W. 2014; 16 (4): e101


    Biomedicine is undergoing a revolution driven by high throughput and connective computing that is transforming medical research and practice. Using oncology as an example, the speed and capacity of genomic sequencing technologies is advancing the utility of individual genetic profiles for anticipating risk and targeting therapeutics. The goal is to enable an era of "P4" medicine that will become increasingly more predictive, personalized, preemptive, and participative over time. This vision hinges on leveraging potentially innovative and disruptive technologies in medicine to accelerate discovery and to reorient clinical practice for patient-centered care. Based on a panel discussion at the Medicine 2.0 conference in Boston with representatives from the National Cancer Institute, Moffitt Cancer Center, and Stanford University School of Medicine, this paper explores how emerging sociotechnical frameworks, informatics platforms, and health-related policy can be used to encourage data liquidity and innovation. This builds on the Institute of Medicine's vision for a "rapid learning health care system" to enable an open source, population-based approach to cancer prevention and control.

    View details for DOI 10.2196/jmir.2496

    View details for PubMedID 24711045

  • Mutations in NGLY1 cause an inherited disorder of the endoplasmic reticulum-associated degradation pathway. Genetics in medicine : official journal of the American College of Medical Genetics Enns, G. M., Shashi, V., Bainbridge, M., Gambello, M. J., Zahir, F. R., Bast, T., Crimian, R., Schoch, K., Platt, J., Cox, R., Bernstein, J. A., Scavina, M., Walter, R. S., Bibb, A., Jones, M., Hegde, M., Graham, B. H., Need, A. C., Oviedo, A., Schaaf, C. P., Boyle, S., Butte, A. J., Chen, R., Clark, M. J., Haraksingh, R., Cowan, T. M., He, P., Langlois, S., Zoghbi, H. Y., Snyder, M., Gibbs, R. A., Freeze, H. H., Goldstein, D. B. 2014


    Purpose: The endoplasmic reticulum-associated degradation pathway is responsible for the translocation of misfolded proteins across the endoplasmic reticulum membrane into the cytosol for subsequent degradation by the proteasome. To define the phenotype associated with a novel inherited disorder of cytosolic endoplasmic reticulum-associated degradation pathway dysfunction, we studied a series of eight patients with deficiency of N-glycanase 1. Methods: Whole-genome, whole-exome, or standard Sanger sequencing techniques were employed. Retrospective chart reviews were performed in order to obtain clinical data. Results: All patients had global developmental delay, a movement disorder, and hypotonia. Other common findings included hypolacrima or alacrima (7/8), elevated liver transaminases (6/7), microcephaly (6/8), diminished reflexes (6/8), hepatocyte cytoplasmic storage material or vacuolization (5/6), and seizures (4/8). The nonsense mutation c.1201A>T (p.R401X) was the most common deleterious allele. Conclusion: NGLY1 deficiency is a novel autosomal recessive disorder of the endoplasmic reticulum-associated degradation pathway associated with neurological dysfunction, abnormal tear production, and liver disease. The majority of patients detected to date carry a specific nonsense mutation that appears to be associated with severe disease. The phenotypic spectrum is likely to enlarge as cases with a broader range of mutations are detected.

    View details for DOI 10.1038/gim.2014.22

    View details for PubMedID 24651605

  • A Drug Repositioning Approach Identifies Tricyclic Antidepressants as Inhibitors of Small Cell Lung Cancer and Other Neuroendocrine Tumors CANCER DISCOVERY Jahchan, N. S., Dudley, J. T., Mazur, P. K., Flores, N., Yang, D., Palmerton, A., Zmoos, A., Vaka, D., Tran, K. Q., Zhou, M., Krasinska, K., Riess, J. W., Neal, J. W., Khatri, P., Park, K. S., Butte, A. J., Sage, J. 2013; 3 (12): 1364-1377


    Small cell lung cancer (SCLC) is an aggressive neuroendocrine subtype of lung cancer with high mortality. We used a systematic drug repositioning bioinformatics approach querying a large compendium of gene expression profiles to identify candidate U.S. Food and Drug Administration (FDA)-approved drugs to treat SCLC. We found that tricyclic antidepressants and related molecules potently induce apoptosis in both chemonaïve and chemoresistant SCLC cells in culture, in mouse and human SCLC tumors transplanted into immunocompromised mice, and in endogenous tumors from a mouse model for human SCLC. The candidate drugs activate stress pathways and induce cell death in SCLC cells, at least in part by disrupting autocrine survival signals involving neurotransmitters and their G protein-coupled receptors. The candidate drugs inhibit the growth of other neuroendocrine tumors, including pancreatic neuroendocrine tumors and Merkel cell carcinoma. These experiments identify novel targeted strategies that can be rapidly evaluated in patients with neuroendocrine tumors through the repurposing of approved drugs. Our work shows the power of bioinformatics-based drug approaches to rapidly repurpose FDA-approved drugs and identifies a novel class of molecules to treat patients with SCLC, a cancer for which no effective novel systemic treatments have been identified in several decades. In addition, our experiments highlight the importance of novel autocrine mechanisms in promoting the growth of neuroendocrine tumor cells.

    View details for DOI 10.1158/2159-8290.CD-13-0183

    View details for Web of Science ID 000328257500023

    View details for PubMedID 24078773

  • Network Medicine in Disease Analysis and Therapeutics CLINICAL PHARMACOLOGY & THERAPEUTICS Chen, B., Butte, A. J. 2013; 94 (6): 627-629


    Two parallel trends are occurring in drug discovery. The first is that we are moving away from a symptom-based disease classification system to a system based on molecules and molecular states. The second is that we are shifting from targeting a single molecule toward targeting multiple molecules, pathways, or networks. Network medicine is an approach to understanding disease and discovering therapeutics looking at many molecules and how they interrelate, and it may play a critical role in the adoption of both trends.

    View details for DOI 10.1038/clpt.2013.181

    View details for Web of Science ID 000327168400016

    View details for PubMedID 24241637

  • Systematic evaluation of personal genome services for Japanese individuals JOURNAL OF HUMAN GENETICS Kido, T., Kawashima, M., Nishino, S., Swan, M., Kamatani, N., Butte, A. J. 2013; 58 (11): 734-741


    Disease risk prediction (DRP) is one of the most important challenges in personal genome research. Although many direct-to-consumer genetic test (DTC) companies have begun to offer personal genome services for DRP, there is still no consensus on what constitutes a gold-standard service. Here, we systematically evaluated the distributions of DRPs from three DTC companies, that is, 23andMe, Navigenics and deCODEme, for 22 diseases using three Japanese samples. We systematically quantified and analyzed the differences between each DTC company's DRPs. Our independency test showed that the overall prediction results were correlated with each other, but not perfectly matched; less than one-third mismatching of the opposite direction occurred in eight diseases. Moreover, we found that the differences could mainly be attributed to four factors: (1) single nucleotide polymorphism (SNP) selection, (2) average risk estimation, (3) the disease risk calculation algorithm and (4) ethnicity adjustment. In particular, only 7.1% of SNPs over 22 diseases were reviewed by all three companies. Therefore, development of a universal core SNPs list for non-Caucasian samples will be important for achieving better prediction capacity for Japanese samples. This systematic methodology provides useful insights for improving the capacity of DRPs in future personal genome services.

    View details for DOI 10.1038/jhg.2013.96

    View details for Web of Science ID 000327598100005

    View details for PubMedID 24067293

  • A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation JOURNAL OF EXPERIMENTAL MEDICINE Khatri, P., Roedder, S., Kimura, N., De Vusser, K., Morgan, A. A., Gong, Y., Fischbein, M. P., Robbins, R. C., Naesens, M., Butte, A. J., Sarwal, M. M. 2013; 210 (11): 2205-2221


    Using meta-analysis of eight independent transplant datasets (236 graft biopsy samples) from four organs, we identified a common rejection module (CRM) consisting of 11 genes that were significantly overexpressed in acute rejection (AR) across all transplanted organs. The CRM genes could diagnose AR with high specificity and sensitivity in three additional independent cohorts (794 samples). In another two independent cohorts (151 renal transplant biopsies), the CRM genes correlated with the extent of graft injury and predicted future injury to a graft using protocol biopsies. Inferred drug mechanisms from the literature suggested that two FDA-approved drugs (atorvastatin and dasatinib), approved for nontransplant indications, could regulate specific CRM genes and reduce the number of graft-infiltrating cells during AR. We treated mice with HLA-mismatched mouse cardiac transplant with atorvastatin and dasatinib and showed reduction of the CRM genes, significant reduction of graft-infiltrating cells, and extended graft survival. We further validated the beneficial effect of atorvastatin on graft survival by retrospective analysis of electronic medical records of a single-center cohort of 2,515 renal transplant patients followed for up to 22 yr. In conclusion, we identified a CRM in transplantation that provides new opportunities for diagnosis, drug repositioning, and rational drug design.

    View details for DOI 10.1084/jem.20122709

    View details for Web of Science ID 000325997600007

    View details for PubMedID 24127489

  • Whole genome sequencing in support of wellness and health maintenance GENOME MEDICINE Patel, C. J., Sivadas, A., Tabassum, R., Preeprem, T., Zhao, J., Arafat, D., Chen, R., Morgan, A. A., Martin, G. S., Brigham, K. L., Butte, A. J., Gibson, G. 2013; 5

    View details for DOI 10.1186/gm462

    View details for Web of Science ID 000328544000001

  • Systematic identification of DNA variants associated with ultraviolet radiation using a novel Geographic-Wide Association Study (GeoWAS) BMC MEDICAL GENETICS Hsu, I., Chen, R., Ramesh, A., Corona, E., Kang, H. P., Ruau, D., Butte, A. J. 2013; 14
  • Systematic functional regulatory assessment of disease-associated variants PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Karczewski, K. J., Dudley, J. T., Kukurba, K. R., Chen, R., Butte, A. J., Montgomery, S. B., Snyder, M. 2013; 110 (23): 9607-9612


    Genome-wide association studies have discovered many genetic loci associated with disease traits, but the functional molecular basis of these associations is often unresolved. Genome-wide regulatory and gene expression profiles measured across individuals and diseases reflect downstream effects of genetic variation and may allow for functional assessment of disease-associated loci. Here, we present a unique approach for systematic integration of genetic disease associations, transcription factor binding among individuals, and gene expression data to assess the functional consequences of variants associated with hundreds of human diseases. In an analysis of genome-wide binding profiles of NFκB, we find that disease-associated SNPs are enriched in NFκB binding regions overall, and specifically for inflammatory-mediated diseases, such as asthma, rheumatoid arthritis, and coronary artery disease. Using genome-wide variation in transcription factor-binding data, we find that NFκB binding is often correlated with disease-associated variants in a genotype-specific and allele-specific manner. Furthermore, we show that this binding variation is often related to expression of nearby genes, which are also found to have altered expression in independent profiling of the variant-associated disease condition. Thus, using this integrative approach, we provide a unique means to assign putative function to many disease-associated SNPs.

    View details for DOI 10.1073/pnas.1219099110

    View details for Web of Science ID 000320503000086

  • Transdisciplinary translational science and the case of preterm birth JOURNAL OF PERINATOLOGY Stevenson, D. K., Shaw, G. M., Wise, P. H., Norton, M. E., Druzin, M. L., Valantine, H. A., McFarland, D. A. 2013; 33 (4): 251-258


    Medical researchers have called for new forms of translational science that can solve complex medical problems. Mainstream science has made complementary calls for heterogeneous teams of collaborators who conduct transdisciplinary research so as to solve complex social problems. Is transdisciplinary translational science what the medical community needs? What challenges must the medical community overcome to successfully implement this new form of translational science? This article makes several contributions. First, it clarifies the concept of transdisciplinary research and distinguishes it from other forms of collaboration. Second, it presents an example of a complex medical problem and a concrete effort to solve it through transdisciplinary collaboration: for example, the problem of preterm birth and the March of Dimes effort to form a transdisciplinary research center that synthesizes knowledge on it. The presentation of this example grounds discussion on new medical research models and reveals potential means by which they can be judged and evaluated. Third, this article identifies the challenges to forming transdisciplines and the practices that overcome them. Departments, universities and disciplines tend to form intellectual silos and adopt reductionist approaches. Forming a more integrated (or 'constructionist'), problem-based science reflective of transdisciplinary research requires the adoption of novel practices to overcome these obstacles.

    View details for DOI 10.1038/jp.2012.133

    View details for Web of Science ID 000316833300001

  • Altering physiological networks using drugs: steps towards personalized physiology. BMC medical genomics Grossman, A. D., Cohen, M. J., Manley, G. T., Butte, A. J. 2013; 6 Suppl 2: S7


    The rise of personalized medicine has reminded us that each patient must be treated as an individual. One factor in making treatment decisions is the physiological state of each patient, but definitions of relevant states and methods to visualize state-related physiologic changes are scarce. We constructed correlation networks from physiologic data to demonstrate changes associated with pressor use in the intensive care unit. We collected 29 physiological variables at one-minute intervals from nineteen trauma patients in the intensive care unit of an academic hospital and grouped each minute of data as receiving or not receiving pressors. For each group we constructed Spearman correlation networks of pairs of physiologic variables. To visualize drug-associated changes we split the networks into three components: an unchanging network, a network of connections with changing correlation sign, and a network of connections only present in one group. Out of a possible 406 connections between the 29 physiological measures, 64, 39, and 48 were present in the three component networks, respectively. The static network confirms expected physiological relationships while the network of associations with changed correlation sign suggests putative changes due to the drugs. The network of associations present only with pressors suggests new relationships that could be worthy of study. We demonstrated that visualizing physiological relationships using correlation networks provides insight into underlying physiologic states while also showing that many of these relationships change when the state is defined by the presence of drugs. This method applied to targeted experiments could change the way critical care patients are monitored and treated.

    View details for PubMedID 23819503
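
    The three-way network split described in the abstract above can be illustrated with a minimal sketch. It is not the authors' code: it assumes the edges of each group's correlation network have already been selected (the selection threshold is not given in the abstract), and the variable names (HR, MAP, SpO2, CVP) and correlation values are hypothetical. It also checks that 29 variables give 406 possible pairwise connections.

```python
from math import comb


def split_networks(edges_a, edges_b):
    """Split two correlation networks into three components.

    edges_a, edges_b: dicts mapping frozenset({u, v}) to the signed
    correlation of an edge retained in that group's network.
    Returns (unchanged, sign_changed, only_one_group) as sets of edges.
    """
    shared = edges_a.keys() & edges_b.keys()
    # Shared edges whose correlation has the same sign in both groups.
    unchanged = {e for e in shared if (edges_a[e] > 0) == (edges_b[e] > 0)}
    # Shared edges whose correlation sign differs between groups.
    sign_changed = shared - unchanged
    # Edges present in exactly one of the two networks.
    only_one_group = edges_a.keys() ^ edges_b.keys()
    return unchanged, sign_changed, only_one_group


# Number of possible undirected connections among 29 variables.
possible_edges = comb(29, 2)

# Hypothetical toy networks (values invented for illustration only).
no_pressors = {
    frozenset({"HR", "MAP"}): 0.6,
    frozenset({"HR", "SpO2"}): -0.4,
    frozenset({"MAP", "CVP"}): 0.5,
}
pressors = {
    frozenset({"HR", "MAP"}): 0.7,
    frozenset({"HR", "SpO2"}): 0.3,
    frozenset({"CVP", "SpO2"}): 0.5,
}

unchanged, sign_changed, only_one_group = split_networks(no_pressors, pressors)
print(possible_edges, len(unchanged), len(sign_changed), len(only_one_group))
```

    In practice the signed correlations would come from a Spearman rank correlation computed per group; that step is omitted here to keep the sketch dependency-free.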

  • Integrating multiple 'omics' analyses identifies serological protein biomarkers for preeclampsia. BMC medicine Liu, L. Y., Yang, T., Ji, J., Wen, Q., Morgan, A. A., Jin, B., Chen, G., Lyell, D. J., Stevenson, D. K., Ling, X. B., Butte, A. J. 2013; 11: 236


    Preeclampsia (PE) is a pregnancy-related vascular disorder which is the leading cause of maternal morbidity and mortality. We sought to identify novel serological protein markers to diagnose PE with a multi-'omics' based discovery approach. Seven previous placental expression studies were combined for a multiplex analysis, and in parallel, two-dimensional gel electrophoresis was performed to compare serum proteomes in PE and control subjects. The combined biomarker candidates were validated with available ELISA assays using gestational age-matched PE (n=32) and control (n=32) samples. With the validated biomarkers, a genetic algorithm was then used to construct and optimize biomarker panels in PE assessment. In addition to the previously identified biomarkers, the angiogenic and antiangiogenic factors (soluble fms-like tyrosine kinase (sFlt-1) and placental growth factor (PIGF)), we found 3 up-regulated and 6 down-regulated biomarkers in PE sera. Two optimal biomarker panels were developed for early and late onset PE assessment, respectively. Both early and late onset PE diagnostic panels, constructed with our PE biomarkers, were superior to the sFlt-1/PIGF ratio in PE discrimination. The functional significance of these PE biomarkers and their associated pathways was analyzed, which may provide new insights into the pathogenesis of PE.

    View details for DOI 10.1186/1741-7015-11-236

    View details for PubMedID 24195779

  • Systematic identification of interaction effects between validated genome- and environment-wide associations on Type 2 Diabetes Mellitus. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science Patel, C. J., Chen, R., Kodama, K., Ioannidis, J. P., Butte, A. J. 2013; 2013: 135

    View details for PubMedID 24303322

  • Making it personal: translational bioinformatics. Journal of the American Medical Informatics Association : JAMIA Butte, A. J., Ohno-Machado, L. 2013; 20 (4): 595-6

    View details for PubMedID 23757438

  • Peptidomic Identification of Serum Peptides Diagnosing Preeclampsia. PloS one Wen, Q., Liu, L. Y., Yang, T., Alev, C., Wu, S., Stevenson, D. K., Sheng, G., Butte, A. J., Ling, X. B. 2013; 8 (6): e65571


    We sought to identify serological markers capable of diagnosing preeclampsia (PE). We performed serum peptide analysis (liquid chromatography mass spectrometry) of 62 unique samples from 31 PE patients and 31 healthy pregnant controls, with two-thirds used as a training set and the other third as a testing set. Differential serum peptide profiling identified 52 significant serum peptides, and a 19-peptide panel collectively discriminating PE in training sets (n = 21 PE, n = 21 control; specificity = 85.7% and sensitivity = 100%) and testing sets (n = 10 PE, n = 10 control; specificity = 80% and sensitivity = 100%). The panel peptides were derived from 6 different protein precursors: 13 from fibrinogen alpha (FGA), 1 from alpha-1-antitrypsin (A1AT), 1 from apolipoprotein L1 (APO-L1), 1 from inter-alpha-trypsin inhibitor heavy chain H4 (ITIH4), 2 from kininogen-1 (KNG1), and 1 from thymosin beta-4 (TMSB4). We concluded that serum peptides can accurately discriminate active PE. Measurement of a 19-peptide panel could be performed quickly and in a quantitative mass spectrometric platform available in clinical laboratories. This serum peptide panel quantification could provide clinical utility in predicting PE or differential diagnosis of PE from confounding chronic hypertension.

    View details for PubMedID 23840341
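
    The figures reported in the abstract above can be cross-checked with a short sketch. The true-negative and false-positive counts below are not stated in the abstract; they are inferred here from the reported percentages and group sizes (for example, 18 of 21 controls gives 85.7% specificity), so treat them as assumptions.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of PE cases correctly identified."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of controls correctly identified."""
    return true_neg / (true_neg + false_pos)


# Counts inferred from the reported percentages (assumption, not source data).
training = {"tp": 21, "fn": 0, "tn": 18, "fp": 3}   # 21 PE, 21 control
testing = {"tp": 10, "fn": 0, "tn": 8, "fp": 2}     # 10 PE, 10 control

train_sens = round(100 * sensitivity(training["tp"], training["fn"]), 1)
train_spec = round(100 * specificity(training["tn"], training["fp"]), 1)
test_sens = round(100 * sensitivity(testing["tp"], testing["fn"]), 1)
test_spec = round(100 * specificity(testing["tn"], testing["fp"]), 1)

# Peptides per protein precursor, as listed in the abstract.
panel = {"FGA": 13, "A1AT": 1, "APO-L1": 1, "ITIH4": 1, "KNG1": 2, "TMSB4": 1}
panel_size = sum(panel.values())

print(train_sens, train_spec, test_sens, test_spec, len(panel), panel_size)
```

    Under these inferred counts, the percentages and the 19-peptide, 6-precursor totals are internally consistent with the abstract.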

  • Immune response profiling identifies autoantibodies specific to Moyamoya patients. Orphanet journal of rare diseases Sigdel, T. K., Shoemaker, L. D., Chen, R., Li, L., Butte, A. J., Sarwal, M. M., Steinberg, G. K. 2013; 8: 45


    Moyamoya Disease is a rare, devastating cerebrovascular disorder characterized by stenosis/occlusion of supraclinoid internal carotid arteries and development of fragile collateral vessels. Moyamoya Disease is typically diagnosed by angiography after clinical presentation of cerebral hemorrhage or ischemia. Despite unclear etiology, previous reports suggest there may be an immunological component. To explore the role of autoimmunity in moyamoya disease, we used high-density protein arrays to profile IgG autoantibodies from the sera of angiographically-diagnosed Moyamoya Disease patients and compared these to healthy controls. Protein array data analysis followed by bioinformatics analysis yielded a number of auto-antibodies which were further validated by ELISA for an independent group of MMD patients (n = 59) and control patients with other cerebrovascular diseases including carotid occlusion, carotid stenosis and arteriovenous malformation. We identified 165 significantly (p < 0.05) elevated autoantibodies in Moyamoya Disease, including those against CAMK2A, CD79A and EFNA3. Pathway analysis associated these autoantibodies with post-translational modification, neurological disease, inflammatory response, and DNA damage repair and maintenance. Using the novel functional interpolating single-nucleotide polymorphisms bioinformatics approach, we identified 6 Moyamoya Disease-associated autoantibodies against APP, GPS1, STRA13, CTNNB1, ROR1 and EDIL3. The expression of these 6 autoantibodies was validated by custom-designed reverse ELISAs for an independent group of Moyamoya Disease patients compared to patients with other cerebrovascular diseases. We report the first high-throughput analysis of autoantibodies in Moyamoya Disease, the results of which may provide valuable insight into the immune-related pathology of Moyamoya Disease and may potentially advance diagnostic clinical tools.

    View details for DOI 10.1186/1750-1172-8-45

    View details for PubMedID 23518061

  • Systematic identification of risk factors for Alzheimer's disease through shared genetic architecture and electronic medical records. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Li, L., Ruau, D., Chen, R., Weber, S., Butte, A. J. 2013: 224-235


    Alzheimer's disease (AD) is one of the leading causes of death for older people in the US, with rapidly increasing incidence. AD irreversibly and progressively damages the brain, but there are treatments in clinical trials to potentially slow the development of AD. We hypothesize that the presence of clinical traits, sharing common genetic variants with AD, could be used as a non-invasive means to predict AD or trigger for administration of preventative therapeutics. We developed a method to compare the genetic architecture between AD and traits from prior GWAS studies. Six clinical traits were significantly associated with AD, capturing 5 known risk factors and 1 novel association: erythrocyte sedimentation rate (ESR). The association of ESR with AD was then validated using Electronic Medical Records (EMR) collected from Stanford Hospital and Clinics. We found that female patients and with abnormally elevated ESR were significantly associated with higher risk of AD diagnosis (OR: 1.85 [1.32-2.61], p=0.003), within 1 year prior to AD diagnosis (OR: 2.31 [1.06-5.01], p=0.032), and within 1 year after AD diagnosis (OR: 3.49 [1.93-6.31], p<0.0001). Additionally, significantly higher ESR values persist for all time courses analyzed. Our results suggest that ESR should be tested in a specific longitudinal study for association with AD diagnosis, and if positive, could be used as a prognostic marker.

    View details for PubMedID 23424127

  • Relating Genes to Function: Identifying Enriched Transcription Factors using the ENCODE ChIP-Seq Significance Tool. Bioinformatics (Oxford, England) Auerbach, R. K., Chen, B., Butte, A. J. 2013


    MOTIVATION: Biological analysis has shifted from identifying genes to mapping these genes to biological function. The ENCODE Project has generated hundreds of ChIP-Seq experiments spanning multiple transcription factors and cell lines for public use, but tools for a biomedical scientist to analyze these data are either non-existent or tailored to narrow biological questions. We present the ENCODE ChIP-Seq Significance Tool, a flexible web application leveraging public ENCODE data to identify enriched transcription factors in a gene or transcript list for comparative analyses. IMPLEMENTATION: The ENCODE ChIP-Seq Significance Tool is written in JavaScript on the client side and has been tested on Google Chrome, Apple Safari, and Mozilla Firefox browsers. Server-side scripts are written in PHP and leverage R and a MySQL database. The tool is available at

    View details for PubMedID 23732275

  • Whole genome sequencing in support of wellness and health maintenance. Genome medicine Patel, C. J., Sivadas, A., Tabassum, R., Preeprem, T., Zhao, J., Arafat, D., Chen, R., Morgan, A. A., Martin, G. S., Brigham, K. L., Butte, A. J., Gibson, G. 2013; 5 (6): 58


    Whole genome sequencing is poised to revolutionize personalized medicine, providing the capacity to classify individuals into risk categories for a wide range of diseases. Here we begin to explore how whole genome sequencing (WGS) might be incorporated alongside traditional clinical evaluation as a part of preventive medicine. The present study illustrates novel approaches for integrating genotypic and clinical information for assessment of generalized health risks and to assist individuals in the promotion of wellness and maintenance of good health. Whole genome sequences and longitudinal clinical profiles are described for eight middle-aged Caucasian participants (four men and four women) from the Center for Health Discovery and Well Being (CHDWB) at Emory University in Atlanta. We report multivariate genotypic risk assessments derived from common variants reported by genome-wide association studies (GWAS), single rare homozygous deleterious variants, and clinical measures in the domains of immune, metabolic, cardiovascular, musculoskeletal and mental health. Polygenic risk is assessed for each participant for over 100 diseases and reported relative to baseline population prevalence. Two approaches for combining clinical and genetic profiles for the purposes of health assessment are then discussed. First we propose conditioning individual disease risk assessments on observed clinical status for type 2 diabetes, coronary artery disease, hypertriglyceridemia and hypertension, and obesity. An excess of concordance between genetic prediction and observed sub-clinical disease is observed. Subsequently, we show how more holistic combination of genetic, clinical and family history data can be achieved by visualizing risk in eight sub-classes of disease. Having identified where their profiles are broadly concordant or discordant, an individual can focus on individual clinical results or genotypes as they develop personalized health action plans in consultation with a health partner. The CHDWB will facilitate longitudinal evaluation of wellness-focused medical care based on comprehensive self-knowledge of medical risks.

    View details for DOI 10.1186/gm462

    View details for PubMedID 23806097

  • Database integration of 4923 publicly-available samples of breast cancer molecular and clinical data. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science Planey, C. R., Butte, A. J. 2013; 2013: 138-142


    We outline a paradigm for meta-microarray database creation and integration with clinical variables. We use as our implementation example a breast cancer database linking RNA expression measurements (by microarray) and clinical variables, such as survival metrics and tumor size. Such an endeavor involves integrating across different microarray datasets as well as clinical parameters. To this end, we created a data curation and processing pipeline, formal database ontology, and SQL schema to optimally query, analyze and visualize data from over 30 publicly available breast cancer microarray studies listed in the Gene Expression Omnibus (GEO). We demonstrate several pilot examples using this database. This methodology serves as a model for future meta-analyses of complex public clinical datasets, in particular those in the field of cancer.

    View details for PubMedID 24303324

  • Systematic identification of DNA variants associated with ultraviolet radiation using a novel Geographic-Wide Association Study (GeoWAS). BMC medical genetics Hsu, I., Chen, R., Ramesh, A., Corona, E., Kang, H. P., Ruau, D., Butte, A. J. 2013; 14: 62


    Long-term environmental variables are widely understood to play important roles in DNA variation. Previously, clinical studies examining the impacts of these variables on the human genome were localized to a single country, and used preselected DNA variants. Furthermore, clinical studies or surveys are either not available or difficult to carry out for developing countries. A systematic approach utilizing bioinformatics to identify associations among environmental variables, genetic variation, and diseases across various geographical locations is needed but has been lacking. Using a novel Geographic-Wide Association Study (GeoWAS) methodology, we identified Single Nucleotide Polymorphisms (SNPs) in the Human Genome Diversity Project (HGDP) with population allele frequencies associated with geographical ultraviolet radiation exposure, and then assessed the diseases known to be associated with these SNPs. 2,857 radiation SNPs were identified from over 650,000 SNPs in 52 indigenous populations across the world. Using a quantitative disease-SNP database curated from 5,065 human genetic papers, we identified disease associations with those radiation SNPs. The correlation of the rs16891982 SNP in the SLC45A2 gene with melanoma was used as a case study for analysis of disease risk, and the results were consistent with the incidence and mortality rates of melanoma in published scientific literature. Finally, by analyzing the ontology of genes in which the radiation SNPs were significantly enriched, potential associations between SNPs and neurological disorders such as Alzheimer's disease were hypothesized. A systematic approach using GeoWAS has enabled us to identify DNA variation associated with ultraviolet radiation and their connections to diseases such as skin cancers. Our analyses have led to a better understanding at the genetic level of why certain diseases are more predominant in specific geographical locations, due to the interactions between environmental variables such as ultraviolet radiation and the population types in those regions. The hypotheses proposed in GeoWAS can lead to future testing and interdisciplinary research.

    View details for DOI 10.1186/1471-2350-14-62

    View details for PubMedID 23786662

  • FoxO6 regulates memory consolidation and synaptic function GENES & DEVELOPMENT Salih, D. A., Rashid, A. J., Colas, D., de la Torre-Ubieta, L., Zhu, R. P., Morgan, A. A., Santo, E. E., Ucar, D., Devarajan, K., Cole, C. J., Madison, D. V., Shamloo, M., Butte, A. J., Bonni, A., Josselyn, S. A., Brunet, A. 2012; 26 (24): 2780-2801


    The FoxO family of transcription factors is known to slow aging downstream from the insulin/IGF (insulin-like growth factor) signaling pathway. The most recently discovered FoxO isoform in mammals, FoxO6, is highly enriched in the adult hippocampus. However, the importance of FoxO factors in cognition is largely unknown. Here we generated mice lacking FoxO6 and found that these mice display normal learning but impaired memory consolidation in contextual fear conditioning and novel object recognition. Using stereotactic injection of viruses into the hippocampus of adult wild-type mice, we found that FoxO6 activity in the adult hippocampus is required for memory consolidation. Genome-wide approaches revealed that FoxO6 regulates a program of genes involved in synaptic function upon learning in the hippocampus. Consistently, FoxO6 deficiency results in decreased dendritic spine density in hippocampal neurons in vitro and in vivo. Thus, FoxO6 may promote memory consolidation by regulating a program coordinating neuronal connectivity in the hippocampus, which could have important implications for physiological and pathological age-dependent decline in memory.

    View details for DOI 10.1101/gad.208926.112

    View details for Web of Science ID 000312775700011

    View details for PubMedID 23222102

  • A Nutrient-Wide Association Study on Blood Pressure CIRCULATION Tzoulaki, I., Patel, C. J., Okamura, T., Chan, Q., Brown, I. J., Miura, K., Ueshima, H., Zhao, L., Van Horn, L., Daviglus, M. L., Stamler, J., Butte, A. J., Ioannidis, J. P., Elliott, P. 2012; 126 (21): 2456-2464


    A nutrient-wide approach may be useful to comprehensively test and validate associations between nutrients (derived from foods and supplements) and blood pressure (BP) in an unbiased manner. Data from 4680 participants aged 40 to 59 years in the cross-sectional International Study of Macro/Micronutrients and Blood Pressure (INTERMAP) were stratified randomly into training and testing sets. US National Health and Nutrition Examination Survey (NHANES) four cross-sectional cohorts (1999-2000, 2001-2002, 2003-2004, 2005-2006) were used for external validation. We performed multiple linear regression analyses associating each of 82 nutrients and 3 urine electrolytes with systolic and diastolic BP in the INTERMAP training set. Significant findings were validated in the INTERMAP testing set and further in the NHANES cohorts (false discovery rate <5% in training, P<0.05 for internal and external validation). Among the validated nutrients, alcohol and urinary sodium-to-potassium ratio were directly associated with systolic BP, and dietary phosphorus, magnesium, iron, thiamin, folacin, and riboflavin were inversely associated with systolic BP. In addition, dietary folacin and riboflavin were inversely associated with diastolic BP. The absolute effect sizes in the validation data (NHANES) ranged from 0.97 mm Hg lower systolic BP (phosphorus) to 0.39 mm Hg lower systolic BP (thiamin) per 1-SD difference in nutrient variable. Inclusion of nutrient intake from supplements in addition to foods gave similar results for some nutrients, though it attenuated the associations of folacin, thiamin, and riboflavin intake with BP. We identified significant inverse associations between B vitamins and BP, relationships hitherto poorly investigated. Our analyses represent a systematic unbiased approach to the evaluation and validation of nutrient-BP associations.

    View details for DOI 10.1161/CIRCULATIONAHA.112.114058

    View details for Web of Science ID 000311342600010

    View details for PubMedID 23093587

  • Cross-Species Functional Analysis of Cancer-Associated Fibroblasts Identifies a Critical Role for CLCF1 and IL-6 in Non-Small Cell Lung Cancer In Vivo CANCER RESEARCH Vicent, S., Sayles, L. C., Vaka, D., Khatri, P., Gevaert, O., Chen, R., Zheng, Y., Gillespie, A. K., Clarke, N., Xu, Y., Shrager, J., Hoang, C. D., Plevritis, S., Butte, A. J., Sweet-Cordero, E. A. 2012; 72 (22): 5744-5756


    Cancer-associated fibroblasts (CAF) have been reported to support tumor progression by a variety of mechanisms. However, their role in the progression of non-small cell lung cancer (NSCLC) remains poorly defined. In addition, the extent to which specific proteins secreted by CAFs contribute directly to tumor growth is unclear. To study the role of CAFs in NSCLCs, a cross-species functional characterization of mouse and human lung CAFs was conducted. CAFs supported the growth of lung cancer cells in vivo by secretion of soluble factors that directly stimulate the growth of tumor cells. Gene expression analysis comparing normal mouse lung fibroblasts and mouse lung CAFs identified multiple genes that correlate with the CAF phenotype. A gene signature of secreted genes upregulated in CAFs was an independent marker of poor survival in patients with NSCLC. This secreted gene signature was upregulated in normal lung fibroblasts after long-term exposure to tumor cells, showing that lung fibroblasts are "educated" by tumor cells to acquire a CAF-like phenotype. Functional studies identified important roles for CLCF1-CNTFR and interleukin (IL)-6-IL-6R signaling in promoting growth of NSCLCs. This study identifies novel soluble factors contributing to the CAF protumorigenic phenotype in NSCLCs and suggests new avenues for the development of therapeutic strategies.

    View details for DOI 10.1158/0008-5472.CAN-12-1097

    View details for Web of Science ID 000311141300012

    View details for PubMedID 22962265

  • Population Genetic Inference from Personal Genome Data: Impact of Ancestry and Admixture on Human Genomic Variation AMERICAN JOURNAL OF HUMAN GENETICS Kidd, J. M., Gravel, S., Byrnes, J., Moreno-Estrada, A., Musharoff, S., Bryc, K., Degenhardt, J. D., Brisbin, A., Sheth, V., Chen, R., McLaughlin, S. F., Peckham, H. E., Omberg, L., Chung, C. A., Stanley, S., Pearlstein, K., Levandowsky, E., Acevedo-Acevedo, S., Auton, A., Keinan, A., Acuna-Alonzo, V., Barquera-Lozano, R., Canizales-Quinteros, S., Eng, C., Burchard, E. G., Russell, A., Reynolds, A., Clark, A. G., Reese, M. G., Lincoln, S. E., Butte, A. J., De La Vega, F. M., Bustamante, C. D. 2012; 91 (4): 660-671


    Full sequencing of individual human genomes has greatly expanded our understanding of human genetic variation and population history. Here, we present a systematic analysis of 50 human genomes from 11 diverse global populations sequenced at high coverage. Our sample includes 12 individuals who have admixed ancestry and who have varying degrees of recent (within the last 500 years) African, Native American, and European ancestry. We found over 21 million single-nucleotide variants that contribute to a 1.75-fold range in nucleotide heterozygosity across diverse human genomes. This heterozygosity ranged from a high of one heterozygous site per kilobase in west African genomes to a low of 0.57 heterozygous sites per kilobase in segments inferred to have diploid Native American ancestry from the genomes of Mexican and Puerto Rican individuals. We show evidence of all three continental ancestries in the genomes of Mexican, Puerto Rican, and African American populations, and the genome-wide statistics are highly consistent across individuals from a population once ancestry proportions have been accounted for. Using a generalized linear model, we identified subtle variations across populations in the proportion of neutral versus deleterious variation and found that genome-wide statistics vary in admixed populations even once ancestry proportions have been factored in. We further infer that multiple periods of gene flow shaped the diversity of admixed populations in the Americas: 70% of the European ancestry in today's African Americans dates back to European gene flow happening only 7-8 generations ago.

    View details for DOI 10.1016/j.ajhg.2012.08.025

    View details for Web of Science ID 000309568500008

    View details for PubMedID 23040495

  • A Peripheral Blood Diagnostic Test for Acute Rejection in Renal Transplantation AMERICAN JOURNAL OF TRANSPLANTATION Li, L., Khatri, P., Sigdel, T. K., Tran, T., Ying, L., Vitalone, M. J., Chen, A., Hsieh, S., Dai, H., Zhang, M., Naesens, M., Zarkhin, V., Sansanwal, P., Chen, R., Mindrinos, M., Xiao, W., Benfield, M., Ettenger, R. B., Dharnidharka, V., Mathias, R., Portale, A., McDonald, R., Harmon, W., Kershaw, D., Vehaskari, V. M., Kamil, E., Baluarte, H. J., Warady, B., Davis, R., Butte, A. J., Salvatierra, O., Sarwal, M. M. 2012; 12 (10): 2710-2718


    Monitoring of renal graft status through peripheral blood (PB) rather than invasive biopsy is important as it will lessen the risk of infection and other stresses, while reducing the costs of rejection diagnosis. Blood gene biomarker panels were discovered by microarrays at a single center and subsequently validated and cross-validated by QPCR in the NIH SNSO1 randomized study from 12 US pediatric transplant programs. A total of 367 unique human PB samples, each paired with a graft biopsy for centralized, blinded phenotype classification, were analyzed (115 acute rejection (AR), 180 stable and 72 other causes of graft injury). Of the differentially expressed genes by microarray, Q-PCR analysis of a five gene-set (DUSP1, PBEF1, PSEN1, MAPK9 and NKTR) classified AR with high accuracy. A logistic regression model was built on independent training-set (n = 47) and validated on independent test-set (n = 198) samples, discriminating AR from STA with 91% sensitivity and 94% specificity and AR from all other non-AR phenotypes with 91% sensitivity and 90% specificity. The 5-gene set can diagnose AR, potentially avoiding the need for invasive renal biopsy. These data support the conduct of a prospective study to validate the clinical predictive utility of this diagnostic tool.

    View details for DOI 10.1111/j.1600-6143.2012.04253.x

    View details for Web of Science ID 000309180000018

    View details for PubMedID 23009139

  • Evolutionary Meta-Analysis of Association Studies Reveals Ancient Constraints Affecting Disease Marker Discovery MOLECULAR BIOLOGY AND EVOLUTION Dudley, J. T., Chen, R., Sanderford, M., Butte, A. J., Kumar, S. 2012; 29 (9): 2087-2094


    Genome-wide disease association studies contrast genetic variation between disease cohorts and healthy populations to discover single nucleotide polymorphisms (SNPs) and other genetic markers revealing underlying genetic architectures of human diseases. Despite scores of efforts over the past decade, many reproducible genetic variants that explain substantial proportions of the heritable risk of common human diseases remain undiscovered. We have conducted a multispecies genomic analysis of 5,831 putative human risk variants for more than 230 disease phenotypes reported in 2,021 studies. We find that the current approaches show a propensity for discovering disease-associated SNPs (dSNPs) at conserved genomic positions because the effect size (odds ratio) and allelic P value of genetic association of an SNP relates strongly to the evolutionary conservation of their genomic position. We propose a new measure for ranking SNPs that integrates evolutionary conservation scores and the P value (E-rank). Using published data from a large case-control study, we demonstrate that E-rank method prioritizes SNPs with a greater likelihood of bona fide and reproducible genetic disease associations, many of which may explain greater proportions of genetic variance. Therefore, long-term evolutionary histories of genomic positions offer key practical utility in reassessing data from existing disease association studies, and in the design and analysis of future studies aimed at revealing the genetic basis of common human diseases.

    View details for DOI 10.1093/molbev/mss079

    View details for Web of Science ID 000308851600001

    View details for PubMedID 22389448

  • Sequencing and analysis of a South Asian-Indian personal genome BMC GENOMICS Gupta, R., Ratan, A., Rajesh, C., Chen, R., Kim, H. L., Burhans, R., Miller, W., Santhosh, S., Davuluri, R. V., Butte, A. J., Schuster, S. C., Seshagiri, S., Thomas, G. 2012; 13


    With over 1.3 billion people, India is estimated to contain three times more genetic diversity than does Europe. Next-generation sequencing technologies have facilitated the understanding of diversity by enabling whole genome sequencing at greater speed and lower cost. While genomes from people of European and Asian descent have been sequenced, only recently has a single male genome from the Indian subcontinent been published at sufficient depth and coverage. In this study we have sequenced and analyzed the genome of a South Asian Indian female (SAIF) from the Indian state of Kerala. We identified over 3.4 million SNPs in this genome including over 89,873 private variations. Comparison of the SAIF genome with several published personal genomes revealed that this individual shared ~50% of the SNPs with each of these genomes. Analysis of the SAIF mitochondrial genome showed that it was closely related to the U1 haplogroup which has been previously observed in Kerala. We assessed the SAIF genome for SNPs with health and disease consequences and found that the individual was at a higher risk for multiple sclerosis and a few other diseases. In analyzing SNPs that modulate drug response, we found a variation that predicts a favorable response to metformin, a drug used to treat diabetes. SNPs predictive of adverse reaction to warfarin indicated that the SAIF individual is not at risk for bleeding if treated with typical doses of warfarin. In addition, we report the presence of several additional SNPs of medical relevance. This is the first study to report the complete whole genome sequence of a female from the state of Kerala in India. The availability of this complete genome and variants will further aid studies aimed at understanding genetic diversity, identifying clinically relevant changes and assessing disease burden in the Indian population.

    View details for DOI 10.1186/1471-2164-13-440

    View details for Web of Science ID 000312952300001

    View details for PubMedID 22938532

  • Integration of disease-specific single nucleotide polymorphisms, expression quantitative trait loci and coexpression networks reveal novel candidate genes for type 2 diabetes DIABETOLOGIA Kang, H. P., Yang, X., Chen, R., Zhang, B., Corona, E., Schadt, E. E., Butte, A. J. 2012; 55 (8): 2205-2213


    While genome-wide association studies (GWASs) have been successful in identifying novel variants associated with various diseases, it has been much more difficult to determine the biological mechanisms underlying these associations. Expression quantitative trait loci (eQTL) provide another dimension to these data by associating single nucleotide polymorphisms (SNPs) with gene expression. We hypothesised that integrating SNPs known to be associated with type 2 diabetes with eQTLs and coexpression networks would enable the discovery of novel candidate genes for type 2 diabetes. We selected 32 SNPs associated with type 2 diabetes in two or more independent GWASs. We used previously described eQTLs mapped from genotype and gene expression data collected from 1,008 morbidly obese patients to find genes with expression associated with these SNPs. We linked these genes to coexpression modules, and ranked the other genes in these modules using an inverse sum score. We found 62 genes with expression associated with type 2 diabetes SNPs. We validated our method by linking highly ranked genes in the coexpression modules back to SNPs through a combined eQTL dataset. We showed that the eQTLs highlighted by this method are significantly enriched for association with type 2 diabetes in data from the Wellcome Trust Case Control Consortium (WTCCC, p = 0.026) and the Gene Environment Association Studies (GENEVA, p = 0.042), validating our approach. Many of the highly ranked genes are also involved in the regulation or metabolism of insulin, glucose or lipids. We have devised a novel method, involving the integration of datasets of different modalities, to discover novel candidate genes for type 2 diabetes.

    View details for DOI 10.1007/s00125-012-2568-3

    View details for Web of Science ID 000306122600016

    View details for PubMedID 22584726

  • Human genomic disease variants: A neutral evolutionary explanation GENOME RESEARCH Dudley, J. T., Kim, Y., Liu, L., Markov, G. J., Gerold, K., Chen, R., Butte, A. J., Kumar, S. 2012; 22 (8): 1383-1394


    Many perspectives on the role of evolution in human health include nonempirical assumptions concerning the adaptive evolutionary origins of human diseases. Evolutionary analyses of the increasing wealth of clinical and population genomic data have begun to challenge these presumptions. In order to systematically evaluate such claims, the time has come to build a common framework for an empirical and intellectual unification of evolution and modern medicine. We review the emerging evidence and provide a supporting conceptual framework that establishes the classical neutral theory of molecular evolution (NTME) as the basis for evaluating disease- associated genomic variations in health and medicine. For over a decade, the NTME has already explained the origins and distribution of variants implicated in diseases and has illuminated the power of evolutionary thinking in genomic medicine. We suggest that a majority of disease variants in modern populations will have neutral evolutionary origins (previously neutral), with a relatively smaller fraction exhibiting adaptive evolutionary origins (previously adaptive). This pattern is expected to hold true for common as well as rare disease variants. Ultimately, a neutral evolutionary perspective will provide medicine with an informative and actionable framework that enables objective clinical assessment beyond convenient tendencies to invoke past adaptive events in human history as a root cause of human disease.

    View details for DOI 10.1101/gr.133702.111

    View details for Web of Science ID 000307090300001

    View details for PubMedID 22665443

  • Leveraging models of cell regulation and GWAS data in integrative network-based association studies NATURE GENETICS Califano, A., Butte, A. J., Friend, S., Ideker, T., Schadt, E. 2012; 44 (8): 841-847

    View details for DOI 10.1038/ng.2355

    View details for Web of Science ID 000306854700006

    View details for PubMedID 22836096

  • Identification of Cell Surface Targets through Meta-analysis of Microarray Data NEOPLASIA Haeberle, H., Dudley, J. T., Liu, J. T., Butte, A. J., Contag, C. H. 2012; 14 (7): 666-669


    High-resolution image guidance for resection of residual tumor cells would enable more precise and complete excision for more effective treatment of cancers, such as medulloblastoma, the most common pediatric brain cancer. Numerous studies have shown that brain tumor patient outcomes correlate with the precision of resection. To enable guided resection with molecular specificity and cellular resolution, molecular probes that effectively delineate brain tumor boundaries are essential. Therefore, we developed a bioinformatics approach to analyze micro-array datasets for the identification of transcripts that encode candidate cell surface biomarkers that are highly enriched in medulloblastoma. The results identified 380 genes with greater than a two-fold increase in the expression in the medulloblastoma compared with that in the normal cerebellum. To enrich for targets with accessibility for extracellular molecular probes, we further refined this list by filtering it with gene ontology to identify genes with protein localization on, or within, the plasma membrane. To validate this meta-analysis, the top 10 candidates were evaluated with immunohistochemistry. We identified two targets, fibrillin 2 and EphA3, which specifically stain medulloblastoma. These results demonstrate a novel bioinformatics approach that successfully identified cell surface and extracellular candidate markers enriched in medulloblastoma versus adjacent cerebellum. These two proteins are high-value targets for the development of tumor-specific probes in medulloblastoma. This bioinformatics method has broad utility for the identification of accessible molecular targets in a variety of cancers and will enable probe development for guided resection.

    View details for DOI 10.1593/neo.12634

    View details for Web of Science ID 000308489500010

    View details for PubMedID 22904683

  • Data-driven integration of epidemiological and toxicological data to select candidate interacting genes and environmental factors in association with disease BIOINFORMATICS Patel, C. J., Chen, R., Butte, A. J. 2012; 28 (12): I121-I126


    Complex diseases, such as Type 2 Diabetes Mellitus (T2D), result from the interplay of both environmental and genetic factors. However, most studies investigate either the genetics or the environment, and few study their possible interaction in the context of disease. One key challenge in documenting interactions between genes and environment is choosing which of each to test jointly. Here, we attempt to address this challenge through a data-driven integration of epidemiological and toxicological studies. Specifically, we derive lists of candidate interacting genetic and environmental factors by integrating findings from genome-wide and environment-wide association studies. Next, we search for evidence of toxicological relationships between these genetic and environmental factors that may have an etiological role in the disease. We illustrate our method by selecting candidate interacting factors for T2D.

    View details for DOI 10.1093/bioinformatics/bts229

    View details for Web of Science ID 000305419800016

    View details for PubMedID 22689751

  • Integrative Approach to Pain Genetics Identifies Pain Sensitivity Loci across Diseases PLOS COMPUTATIONAL BIOLOGY Ruau, D., Dudley, J. T., Chen, R., Phillips, N. G., Swan, G. E., Lazzeroni, L. C., Clark, J. D., Butte, A. J., Angst, M. S. 2012; 8 (6)


    Identifying human genes relevant for the processing of pain requires difficult-to-conduct and expensive large-scale clinical trials. Here, we examine a novel integrative paradigm for data-driven discovery of pain gene candidates, taking advantage of the vast amount of existing disease-related clinical literature and gene expression microarray data stored in large international repositories. First, thousands of diseases were ranked according to a disease-specific pain index (DSPI), derived from Medical Subject Heading (MESH) annotations in MEDLINE. Second, gene expression profiles of 121 of these human diseases were obtained from public sources. Third, genes with expression variation significantly correlated with DSPI across diseases were selected as candidate pain genes. Finally, selected candidate pain genes were genotyped in an independent human cohort and prospectively evaluated for significant association between variants and measures of pain sensitivity. The strongest signal was with rs4512126 (5q32, ABLIM3, P = 1.3×10⁻¹⁰) for the sensitivity to cold pressor pain in males, but not in females. Significant associations were also observed with rs12548828, rs7826700 and rs1075791 on 8q22.2 within NCALD (P = 1.7×10⁻⁴, 1.8×10⁻⁴, and 2.2×10⁻⁴ respectively). Our results demonstrate the utility of a novel paradigm that integrates publicly available disease-specific gene expression data with clinical data curated from MEDLINE to facilitate the discovery of pain-relevant genes. This data-derived list of pain gene candidates enables additional focused and efficient biological studies validating additional candidates.

    View details for DOI 10.1371/journal.pcbi.1002538

    View details for Web of Science ID 000305965300012

    View details for PubMedID 22685391

  • Clinical utility of sequence-based genotype compared with that derivable from genotyping arrays JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Morgan, A. A., Chen, R., Butte, A. J. 2012; 19 (E1): E21-E27


    We investigated the common-disease relevant information obtained from sequencing compared with that reported from genotyping arrays. Using 187 publicly available individual human genomes, we constructed genomic disease risk summaries based on 55 common diseases with reported gene-disease associations in the research literature using two different risk models, one based on the product of likelihood ratios and the other on the allelic variant with the maximum associated disease risk. We also constructed risk profiles based on the single nucleotide polymorphisms (SNPs) of these individuals that could be measured or imputed from two common genotyping array platforms. We show that the model risk predictions derived from sequencing differ substantially from those obtained from the SNPs measured on commercially available genotyping arrays for several different non-monogenic diseases, although high density genotyping arrays give identical results for many diseases. Conclusions: Our approach may be used to compare the ability of different platforms to probe known genetic risks disease by disease.

    View details for DOI 10.1136/amiajnl-2011-000737

    View details for Web of Science ID 000314151400005

    View details for PubMedID 22718036
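
    The two risk models named in the abstract above can be illustrated with a minimal sketch. It assumes the usual pre-test odds formulation (post-test odds = pre-test odds × likelihood ratio) and reads "the allelic variant with the maximum associated disease risk" as taking the single largest likelihood ratio; both readings are assumptions, not the paper's confirmed implementation. The function names and all numbers are hypothetical.

```python
import math

def post_test_probability(pretest_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pretest_prob / (1.0 - pretest_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

def product_model(pretest_prob, likelihood_ratios):
    """Model 1 (assumed form): multiply the likelihood ratios of all variants."""
    return post_test_probability(pretest_prob, math.prod(likelihood_ratios))

def max_risk_model(pretest_prob, likelihood_ratios):
    """Model 2 (assumed form): use only the variant with the largest ratio."""
    return post_test_probability(pretest_prob, max(likelihood_ratios))

# Hypothetical individual: 10% pre-test probability, two genotyped variants.
lrs = [2.0, 0.5]
print(product_model(0.10, lrs))   # protective and risk variants cancel out
print(max_risk_model(0.10, lrs))  # driven by the single risk variant
```

    The example shows why the two models can disagree: a protective variant offsets a risk variant under the product model but is ignored under the maximum model.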

  • Systematic evaluation of environmental factors: persistent pollutants and nutrients correlated with serum lipid levels INTERNATIONAL JOURNAL OF EPIDEMIOLOGY Patel, C. J., Cullen, M. R., Ioannidis, J. P., Butte, A. J. 2012; 41 (3): 828-843


    Both genetic and environmental factors contribute to triglyceride, low-density lipoprotein-cholesterol (LDL-C), and high-density lipoprotein-cholesterol (HDL-C) levels. Although genome-wide association studies are currently testing the genetic factors systematically, testing and reporting one or a few factors at a time can lead to fragmented literature for environmental chemical factors. We screened for correlation between environmental factors and lipid levels, utilizing four independent surveys with information on 188 environmental factors from the Centers for Disease Control, National Health and Nutrition Examination Survey, collected between 1999 and 2006. We used linear regression to correlate each environmental chemical factor to triglycerides, LDL-C and HDL-C adjusting for age, age², sex, ethnicity, socio-economic status and body mass index. Final estimates were adjusted for waist circumference, diabetes status, blood pressure and survey. Multiple comparisons were controlled for by estimating the false discovery rate and significant findings were tentatively validated in an independent survey. We identified and validated 29, 9 and 17 environmental factors correlated with triglycerides, LDL-C and HDL-C levels, respectively. Findings include hydrocarbons and nicotine associated with lower HDL-C and vitamin E (γ-tocopherol) associated with unfavourable lipid levels. Higher triglycerides and lower HDL-C were correlated with higher levels of fat-soluble contaminants (e.g. polychlorinated biphenyls and dibenzofurans). Nutrients and vitamin markers (e.g. vitamins B, D and carotenes), were associated with favourable triglyceride and HDL-C levels. Our systematic association study has enabled us to postulate about broad environmental correlation to lipid levels. Although subject to confounding and reverse causality bias, these findings merit evaluation in additional cohorts.

    View details for DOI 10.1093/ije/dys003

    View details for Web of Science ID 000306417300030

    View details for PubMedID 22421054
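
    The abstract above controls multiple comparisons by estimating the false discovery rate but does not name the estimator. As an illustration only, the sketch below uses the Benjamini-Hochberg adjustment, which is a swapped-in assumption and may differ from what the study used. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical p-values from four factor-by-lipid regressions.
p = [0.01, 0.04, 0.03, 0.50]
q = benjamini_hochberg(p)
significant = [i for i, value in enumerate(q) if value <= 0.10]
print(q, significant)
```

    Factors whose q-value falls under the chosen threshold would then be carried forward to an independent survey for validation, as the abstract describes.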

  • Translational Bioinformatics: Data-driven Drug Discovery and Development CLINICAL PHARMACOLOGY & THERAPEUTICS Butte, A. J., Ito, S. 2012; 91 (6): 949-952


    Internet-accessible computing power and data-sharing mandates now enable researchers to interrogate thousands of publicly available databases containing molecular, clinical, and epidemiological data. With emerging new approaches, translational bioinformatics can now provide answers to previously untouchable questions, ranging from detecting population signals of adverse drug reactions to clinical interpretation of the whole genome. There are challenges, including lack of access to some data sources and software, but there are also overwhelming doses of hopes and expectations.

    View details for DOI 10.1038/clpt.2012.55

    View details for Web of Science ID 000304245800001

    View details for PubMedID 22609903

  • Type 2 Diabetes Risk Alleles Demonstrate Extreme Directional Differentiation among Human Populations, Compared to Other Diseases PLOS GENETICS Chen, R., Corona, E., Sikora, M., Dudley, J. T., Morgan, A. A., Moreno-Estrada, A., Nilsen, G. B., Ruau, D., Lincoln, S. E., Bustamante, C. D., Butte, A. J. 2012; 8 (4): 100-115


    Many disease-susceptible SNPs exhibit significant disparity in ancestral and derived allele frequencies across worldwide populations. While previous studies have examined population differentiation of alleles at specific SNPs, global ethnic patterns of ensembles of disease risk alleles across human diseases are unexamined. To examine these patterns, we manually curated ethnic disease association data from 5,065 papers on human genetic studies representing 1,495 diseases, recording the precise risk alleles and their measured population frequencies and estimated effect sizes. We systematically compared the population frequencies of cross-ethnic risk alleles for each disease across 1,397 individuals from 11 HapMap populations, 1,064 individuals from 53 HGDP populations, and 49 individuals with whole-genome sequences from 10 populations. Type 2 diabetes (T2D) demonstrated extreme directional differentiation of risk allele frequencies across human populations, compared with null distributions of European-frequency matched control genomic alleles and risk alleles for other diseases. Most T2D risk alleles share a consistent pattern of decreasing frequencies along human migration into East Asia. Furthermore, we show that these patterns contribute to disparities in predicted genetic risk across 1,397 HapMap individuals, T2D genetic risk being consistently higher for individuals in the African populations and lower in the Asian populations, irrespective of the ethnicity considered in the initial discovery of risk alleles. We observed a similar pattern in the distribution of T2D Genetic Risk Scores, which are associated with an increased risk of developing diabetes in the Diabetes Prevention Program cohort, for the same individuals. This disparity may be attributable to the promotion of energy storage and usage appropriate to environments and inconsistent energy intake. Our results indicate that the differential frequencies of T2D risk alleles may contribute to the observed disparity in T2D incidence rates across ethnic populations.

    View details for DOI 10.1371/journal.pgen.1002621

    View details for Web of Science ID 000303441800007

    View details for PubMedID 22511877

  • Sex Differences in Reported Pain Across 11,000 Patients Captured in Electronic Medical Records JOURNAL OF PAIN Ruau, D., Liu, L. Y., Clark, J. D., Angst, M. S., Butte, A. J. 2012; 13 (3): 228-234


    Clinically recorded pain scores are abundant in patient health records but are rarely used in research. The use of this information could help improve clinical outcomes. For example, a recent report by the Institute of Medicine stated that ineffective use of clinical information contributes to undertreatment of patient subpopulations--especially women. This study used diagnosis-associated pain scores from a large hospital database to document sex differences in reported pain. We used de-identified electronic medical records from Stanford Hospital and Clinics for more than 72,000 patients. Each record contained at least 1 disease-associated pain score. We found over 160,000 pain scores in more than 250 primary diagnoses, and analyzed differences in disease-specific pain reported by men and women. After filtering for diagnoses with minimum encounter numbers, we found diagnosis-specific sex differences in reported pain. The most significant differences occurred in patients with disorders of the musculoskeletal, circulatory, respiratory and digestive systems, followed by infectious diseases, and injury and poisoning. We also discovered sex-specific differences in pain intensity in previously unreported diseases, including disorders of the cervical region, and acute sinusitis (P = .01, .017, respectively). Pain scores were collected during hospital encounters. No information about the use of pre-encounter over-the-counter medications was available. To our knowledge, this is the largest data-driven study documenting sex differences of disease-associated pain. It highlights the utility of electronic medical record data to corroborate and expand on results of smaller clinical studies. Our findings emphasize the need for future research examining the mechanisms underlying differences in pain. This article highlights the potential of electronic medical records to conduct large-scale pain studies. Our results are consistent with previous studies reporting pain differences between sexes and also suggest that clinicians should pay increased attention to this idea.

    View details for DOI 10.1016/j.jpain.2011.11.002

    View details for Web of Science ID 000301612900003

    View details for PubMedID 22245360

  • Sex differences in disease risk from reported genome-wide association study findings HUMAN GENETICS Liu, L. Y., Schaub, M. A., Sirota, M., Butte, A. J. 2012; 131 (3): 353-364


    Men and women differ in susceptibility to many diseases and in responses to treatment. Recent advances in genome-wide association studies (GWAS) provide a wealth of data for associating genetic profiles with disease risk; however, in general, these data have not been systematically probed for sex differences in gene-disease associations. Incorporating sex into the analysis of GWAS results can elucidate new relationships between single nucleotide polymorphisms (SNPs) and human disease. In this study, we performed a sex-differentiated analysis on significant SNPs from GWAS data of the seven common diseases studied by the Wellcome Trust Case Control Consortium. We employed and compared three methods: logistic regression, Woolf's test of heterogeneity, and a novel statistical metric that we developed called permutation method to assess sex effects (PMASE). After correction for false discovery, PMASE finds SNPs that are significantly associated with disease in only one sex. These sexually dimorphic SNP-disease associations occur in Coronary Artery Disease and Crohn's Disease. GWAS analyses that fail to consider sex-specific effects may miss discovering sexual dimorphism in SNP-disease associations that give new insights into differences in disease mechanism between men and women.

    View details for DOI 10.1007/s00439-011-1081-y

    View details for Web of Science ID 000300252700004

    View details for PubMedID 21858542
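
    Of the three methods compared in the abstract above, Woolf's test of heterogeneity is a standard procedure and can be sketched briefly; PMASE is not reproduced here because the abstract does not define it. The sketch assumes one 2×2 allele-count table per sex and uses only the standard library. The counts are invented for illustration.

```python
import math

def woolf_heterogeneity(tables):
    """Woolf's test for heterogeneity of odds ratios across strata.

    Each table is (a, b, c, d): risk-allele cases, other-allele cases,
    risk-allele controls, other-allele controls. Returns (statistic, df).
    """
    log_ors, weights = [], []
    for a, b, c, d in tables:
        if min(a, b, c, d) <= 0:
            raise ValueError("all cell counts must be positive")
        log_ors.append(math.log((a * d) / (b * c)))
        # Inverse of the variance of the log odds ratio.
        weights.append(1.0 / (1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d))
    pooled = sum(w * l for w, l in zip(weights, log_ors)) / sum(weights)
    stat = sum(w * (l - pooled) ** 2 for w, l in zip(weights, log_ors))
    return stat, len(tables) - 1

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical allele counts for one SNP, stratified by sex.
females = (320, 180, 250, 250)
males = (270, 230, 250, 250)
stat, df = woolf_heterogeneity([females, males])
print(stat, df, chi2_sf_1df(stat))  # two strata give df = 1
```

    With two strata the statistic has one degree of freedom, so the closed-form tail probability above suffices; more strata would need a general chi-square survival function.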

  • Multiplex meta-analysis of RNA expression to identify genes with variants associated with immune dysfunction JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Morgan, A. A., Pyrgos, V. J., Nadeau, K. C., Williamson, P. R., Butte, A. J. 2012; 19 (2): 284-288


    We demonstrate a genome-wide method for the integration of many studies of gene expression of phenotypically similar disease processes, a method of multiplex meta-analysis. We use immune dysfunction as an example disease process. We use a heterogeneous collection of datasets across human and mice samples from a range of tissues and different forms of immunodeficiency. We developed a method that uses Tibshirani's modified t-test (SAM) to interrogate differential expression within a study and Fisher's method for omnibus meta-analysis to identify differentially expressed genes across studies. The ability of this overall gene expression profile to prioritize disease associated genes is evaluated by comparing against the results of a recent genome wide association study for common variable immunodeficiency (CVID). Our approach is able to prioritize genes associated with immunodeficiency in general (area under the ROC curve = 0.713) and CVID in particular (area under the ROC curve = 0.643). This approach may be used to investigate a larger range of failures of the immune system. Our method may be extended to other disease processes, using RNA levels to prioritize genes likely to contain disease associated DNA variants.

    View details for DOI 10.1136/amiajnl-2011-000657

    View details for Web of Science ID 000300768100023

    View details for PubMedID 22319178
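
    The cross-study step named in the abstract above, Fisher's method, combines k independent p-values through the statistic X = -2 Σ ln p, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below is a minimal standard-library illustration of that step only; it does not reproduce SAM or the paper's pipeline, and the gene names and p-values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values for one gene with Fisher's method."""
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    # Chi-square tail with 2k degrees of freedom has a closed form:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical per-study p-values for two genes across three studies.
per_study = {
    "GENE_A": [0.01, 0.04, 0.20],
    "GENE_B": [0.60, 0.45, 0.70],
}
combined = {gene: fisher_combined_p(ps) for gene, ps in per_study.items()}
ranked = sorted(combined, key=combined.get)
print(combined, ranked)
```

    Ranking genes by the combined p-value gives one overall expression profile across studies, which is the kind of list the abstract then evaluates against GWAS results.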

  • Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges PLOS COMPUTATIONAL BIOLOGY Khatri, P., Sirota, M., Butte, A. J. 2012; 8 (2)


    Pathway analysis has become the first choice for gaining insight into the underlying biology of differentially expressed genes and proteins, as it reduces complexity and has increased explanatory power. We discuss the evolution of knowledge base-driven pathway analysis over its first decade, distinctly divided into three generations. We also discuss the limitations that are specific to each generation, and how they are addressed by successive generations of methods. We identify a number of annotation challenges that must be addressed to enable development of the next generation of pathway analysis methods. Furthermore, we identify a number of methodological challenges that the next generation of methods must tackle to take advantage of the technological advances in genomics and proteomics in order to improve specificity, sensitivity, and relevance of pathway analysis.

    View details for DOI 10.1371/journal.pcbi.1002375

    View details for Web of Science ID 000300729900019

    View details for PubMedID 22383865

  • Transmission distortion in Crohn's disease risk gene ATG16L1 leads to sex difference in disease association INFLAMMATORY BOWEL DISEASES Liu, L. Y., Schaub, M. A., Sirota, M., Butte, A. J. 2012; 18 (2): 312-322


    Crohn's disease (CD), an inflammatory disease of the bowel, affects millions of people around the world. Evidence suggests that disease onset and pathogenesis differ between males and females. Yet no comprehensive efforts exist to assess the sex-specific genetic architecture of CD. We used genotyping data from a cohort of 1748 CD cases and 2938 controls to investigate 71 meta-analysis-confirmed CD risk loci for sex differences in disease risk. We further validated the significant results in separate cohorts of 968 CD cases and 2809 controls, and performed a meta-analysis across datasets. The single nucleotide polymorphism (SNP) rs3792106 (C/T) in ATG16L1 showed a significant sex effect with P-value 6.9 × 10⁻¹³ and allelic odds ratio 1.48 in females, and P-value 0.013 and odds ratio 1.22 in males (odds ratio heterogeneity P-value 0.037). Surprisingly, the difference was found to arise from a discrepancy in allele frequencies between male and female controls (P-value 0.0045) rather than cases. We found similar results for this SNP in the separate validation datasets. Using 155 HapMap 3 trios, we detected significant maternal overtransmission of the T allele at rs3792106 (P-value 0.027). Our results indicate that different transmission patterns between sexes may sustain the disparate allele frequencies at rs3792106 in healthy populations, and furthermore that a virus-risk variant mechanism implicated in CD alters the distribution in diseased patients. To our knowledge, this is the first report of sex-specific CD association in ATG16L1. The possible implications in CD and basic human biology present interesting areas for future investigation.

    View details for DOI 10.1002/ibd.21781

    View details for Web of Science ID 000298957800016

    View details for PubMedID 21618365

  • Performance comparison of whole-genome sequencing platforms NATURE BIOTECHNOLOGY Lam, H. Y., Clark, M. J., Chen, R., Chen, R., Natsoulis, G., O'Huallachain, M., Dewey, F. E., Habegger, L., Ashley, E. A., Gerstein, M. B., Butte, A. J., Ji, H. P., Snyder, M. 2012; 30 (1): 78-U118

    View details for DOI 10.1038/nbt.2065

    View details for Web of Science ID 000299110600023

  • Quantifying multi-ethnic representation in genetic studies of high mortality diseases. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science Chen, R., Dudley, J. T., Ruau, D., Butte, A. J. 2012; 2012: 11-18


    Most GWASs were performed using study populations with Caucasian ethnicity or ancestry, and findings from one ethnic subpopulation might not always translate to another. We curated 4,573 genetic studies on 763 human diseases and identified 3,461 disease-susceptible SNPs with genome-wide significance; only 10% of these had been validated in at least two different ethnic populations. SNPs for autoimmune diseases demonstrated the lowest percentage of cross-ethnicity validation. We used the mortality data from the Centers for Disease Control and Prevention and identified 19 diseases killing over 10,000 Americans per year that were still lacking publications of even a single cross-ethnic SNP. Fifteen of these diseases had never been studied in large GWAS in non-Caucasian populations, including chronic liver diseases and cirrhosis, leukemia, and non-Hodgkin's lymphoma. Our results demonstrate that diseases killing most Americans are still lacking genetic studies across ethnicities.

    View details for PubMedID 22779041

  • Coanalysis of GWAS with eQTLs reveals disease-tissue associations. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science Kang, H. P., Morgan, A. A., Chen, R., Schadt, E. E., Butte, A. J. 2012; 2012: 35-41


    Expression quantitative trait loci (eQTL), or genetic variants associated with changes in gene expression, have the potential to assist in interpreting results of genome-wide association studies (GWAS). eQTLs also have varying degrees of tissue specificity. By correlating the statistical significance of eQTLs mapped in various tissue types to their odds ratios reported in a large GWAS by the Wellcome Trust Case Control Consortium (WTCCC), we discovered that there is a significant association between diseases studied genetically and their relevant tissues. This suggests that eQTL data sets can be used to determine tissues that play a role in the pathogenesis of a disease, thereby highlighting these tissue types for further post-GWAS functional studies.

    View details for PubMedID 22779046

  • Neonatal Informatics: Transforming Neonatal Care Through Translational Bioinformatics. NeoReviews Palma, J. P., Benitz, W. E., Tarczy-Hornoch, P., Butte, A. J., Longhurst, C. A. 2012; 13 (5): e281-e284


    The future of neonatal informatics will be driven by the availability of increasingly vast amounts of clinical and genetic data. The field of translational bioinformatics is concerned with linking and learning from these data and applying new findings to clinical care to transform the data into proactive, predictive, preventive, and participatory health. As a result of advances in translational informatics, the care of neonates will become more data driven, evidence based, and personalized.

    View details for PubMedID 22924023

  • Gene expression deconvolution in linear space reply NATURE METHODS Shen-Orr, S. S., Tibshirani, R., Butte, A. J. 2012; 9 (1): 9-9

    View details for DOI 10.1038/nmeth.1831

    View details for Web of Science ID 000298667000004

  • Progressive histological damage in renal allografts is associated with expression of innate and adaptive immunity genes KIDNEY INTERNATIONAL Naesens, M., Khatri, P., Li, L., Sigdel, T. K., Vitalone, M. J., Chen, R., Butte, A. J., Salvatierra, O., Sarwal, M. M. 2011; 80 (12): 1364-1376


    The degree of progressive chronic histological damage is associated with long-term renal allograft survival. In order to identify promising molecular targets for timely intervention, we examined renal allograft protocol and indication biopsies from 120 low-risk pediatric and adolescent recipients by whole-genome microarray expression profiling. In data-driven analysis, we found a highly regulated pattern of adaptive and innate immune gene expression that correlated with established or ongoing histological chronic injury, and also with development of future chronic histological damage, even in histologically pristine kidneys. Hence, histologically unrecognized immunological injury at a molecular level sets the stage for the development of chronic tissue injury, while the same molecular response is accentuated during established and worsening chronic allograft damage. Irrespective of the hypothesized immune or nonimmune trigger for chronic allograft injury, a highly orchestrated regulation of innate and adaptive immune responses was found in the graft at the molecular level. This occurred months before histologic lesions appear, and quantitatively below the diagnostic threshold of classic T-cell or antibody-mediated rejection. Thus, measurement of specific immune gene expression in protocol biopsies may be warranted to predict the development of subsequent chronic injury in histologically quiescent grafts and as a means to titrate immunosuppressive therapy.

    View details for DOI 10.1038/ki.2011.245

    View details for Web of Science ID 000297541900014

    View details for PubMedID 21881554

  • ProfileChaser: searching microarray repositories based on genome-wide patterns of differential expression BIOINFORMATICS Engreitz, J. M., Chen, R., Morgan, A. A., Dudley, J. T., Mallelwar, R., Butte, A. J. 2011; 27 (23): 3317-3318


    We introduce ProfileChaser, a web server that allows for querying the Gene Expression Omnibus based on genome-wide patterns of differential expression. Using a novel, content-based approach, ProfileChaser retrieves expression profiles that match the differentially regulated transcriptional programs in a user-supplied experiment. This analysis identifies statistical links to similar expression experiments from the vast array of publicly available data on diseases, drugs, phenotypes and other experimental conditions. Availability: http://profilechaser.stanford.edu. Contact: abutte@stanford.edu. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btr548

    View details for Web of Science ID 000297352100015

    View details for PubMedID 21967760

  • Comparison of automated and human assignment of MeSH terms on publicly-available molecular datasets. Journal of biomedical informatics Ruau, D., Mbagwu, M., Dudley, J. T., Krishnan, V., Butte, A. J. 2011; 44: S39-43


    Publicly available molecular datasets can be used for independent verification or investigative repurposing, but this depends on the presence, consistency and quality of descriptive annotations. Annotation and indexing of molecular datasets using well-defined controlled vocabularies or ontologies enables accurate and systematic data discovery, yet the majority of molecular datasets available through public data repositories lack such annotations. A number of automated annotation methods have been developed; however, few systematic evaluations of the quality of annotations supplied by application of these methods have been performed using annotations from standing public data repositories. Here, we compared manually-assigned Medical Subject Heading (MeSH) annotations associated with experiments by data submitters in the PRoteomics IDEntification (PRIDE) proteomics data repository to automated MeSH annotations derived through the National Center for Biomedical Ontology Annotator and National Library of Medicine MetaMap programs. These programs were applied to free-text annotations for experiments in PRIDE. As many submitted datasets were referenced in publications, we used the manually curated MeSH annotations of those linked publications in MEDLINE as a "gold standard". Annotator and MetaMap exhibited recall performance 3-fold greater than that of the manual annotations. We connected PRIDE experiments in a network topology according to shared MeSH annotations and found 373 distinct clusters, many of which were found to be biologically coherent by network analysis. The results of this study suggest that both Annotator and MetaMap are capable of annotating public molecular datasets with a quality comparable to, and often exceeding, that of the actual data submitters, highlighting a continuous need to improve and apply automated methods to molecular datasets in public data repositories to maximize their value and utility.

    View details for DOI 10.1016/j.jbi.2011.03.007

    View details for PubMedID 21420508
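
    The recall comparison described in the abstract above can be illustrated with a minimal sketch. It assumes each annotation source is reduced to a set of MeSH terms per experiment; the function name `precision_recall` and the example terms are hypothetical and are not taken from the study.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted terms against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical MeSH terms for one experiment (illustrative only).
gold = {"Proteomics", "Liver", "Mass Spectrometry", "Humans"}
manual = {"Liver"}                                   # submitter-assigned terms
automated = {"Liver", "Proteomics", "Humans", "Mice"}  # tool-assigned terms

manual_scores = precision_recall(manual, gold)
automated_scores = precision_recall(automated, gold)
```

    In this toy case the automated set recovers three of the four gold-standard terms while the manual set recovers one, which is the kind of recall gap the abstract reports in aggregate.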

  • Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence PLOS GENETICS Dewey, F. E., Chen, R., Cordero, S. P., Ormond, K. E., Caleshu, C., Karczewski, K. J., Whirl-Carrillo, M., Wheeler, M. T., Dudley, J. T., Byrnes, J. K., Cornejo, O. E., Knowles, J. W., Woon, M., Sangkuhl, K., Gong, L., Thorn, C. F., Hebert, J. M., Capriotti, E., David, S. P., Pavlovic, A., West, A., Thakuria, J. V., Ball, M. P., Zaranek, A. W., Rehm, H. L., Church, G. M., West, J. S., Bustamante, C. D., Snyder, M., Altman, R. B., Klein, T. E., Butte, A. J., Ashley, E. A. 2011; 7 (9)


    Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation. Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of genomes from a nuclear family with history of familial thrombophilia. We demonstrate that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites to the lowest median resolution demonstrated to date (< 1,000 base pairs). We use family inheritance state analysis to control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific, family-based approaches to interpretation of genetic variation are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.

    View details for DOI 10.1371/journal.pgen.1002280

    View details for Web of Science ID 000295419100031

    View details for PubMedID 21935354

  • Applications of Translational Bioinformatics in Transplantation CLINICAL PHARMACOLOGY & THERAPEUTICS Khatri, P., Sarwal, M. M., Butte, A. J. 2011; 90 (2): 323-327

    View details for DOI 10.1038/clpt.2011.120

    View details for Web of Science ID 000292974900027

    View details for PubMedID 21716268

  • Identification of an IFN-gamma/mast cell axis in a mouse model of chronic asthma JOURNAL OF CLINICAL INVESTIGATION Yu, M., Eckart, M. R., Morgan, A. A., Mukai, K., Butte, A. J., Tsai, M., Galli, S. J. 2011; 121 (8): 3133-3143


    Asthma is considered a Th2 cell–associated disorder. Despite this, both the Th1 cell–associated cytokine IFN-γ and airway neutrophilia have been implicated in severe asthma. To investigate the relative contributions of different immune system components to the pathogenesis of asthma, we previously developed a model that exhibits several features of severe asthma in humans, including airway neutrophilia and increased lung IFN-γ. In the present studies, we tested the hypothesis that IFN-γ regulates mast cell function in our model of chronic asthma. Engraftment of mast cell–deficient KitW(-sh/W-sh) mice, which develop markedly attenuated features of disease, with wild-type mast cells restored disease pathology in this model of chronic asthma. However, disease pathology was not fully restored by engraftment with either IFN-γ receptor 1–null (Ifngr1–/–) or Fcε receptor 1γ–null (Fcer1g–/–) mast cells. Additional analysis, including gene array studies, showed that mast cell expression of IFN-γR contributed to the development of many FcεRIγ-dependent and some FcεRIγ-independent features of disease in our model, including airway hyperresponsiveness, neutrophilic and eosinophilic inflammation, airway remodeling, and lung expression of several cytokines, chemokines, and markers of an alternatively activated macrophage response. These findings identify a previously unsuspected IFN-γ/mast cell axis in the pathology of chronic allergic inflammation of the airways in mice.

    View details for DOI 10.1172/JCI43598

    View details for Web of Science ID 000293495500024

    View details for PubMedID 21737883

  • The role of bioinformatics in studying rheumatic and autoimmune disorders NATURE REVIEWS RHEUMATOLOGY Sirota, M., Butte, A. J. 2011; 7 (8): 489-494


    In the past decade, the availability and abundance of individual-level molecular data, such as gene expression, proteomics and sequence data, has enabled the use of integrative computational approaches to pose and answer novel questions about disease. In this article, we discuss several examples of applications of bioinformatics techniques to study autoimmune and rheumatic disorders. We focus our discussion on how integrative techniques can be applied to analyze gene expression and genetic variation data across different diseases, and discuss the implications of such analyses. We also outline current challenges and future directions of these approaches. We show that integrative computational methods are essential for translational research and provide a powerful opportunity to improve human health by refining the current knowledge about diagnostics, therapeutics and mechanisms of disease pathogenesis.

    View details for DOI 10.1038/nrrheum.2011.87

    View details for Web of Science ID 000293468700009

    View details for PubMedID 21691330

  • Predicting Adverse Drug Reactions Using Publicly Available PubChem BioAssay Data CLINICAL PHARMACOLOGY & THERAPEUTICS Pouliot, Y., Chiang, A. P., Butte, A. J. 2011; 90 (1): 90-99


    Adverse drug reactions (ADRs) can have severe consequences, and therefore the ability to predict ADRs prior to market introduction of a drug is desirable. Computational approaches applied to preclinical data could be one way to inform drug labeling and marketing with respect to potential ADRs. Based on the premise that some of the molecular actors of ADRs involve interactions that are detectable in large, and increasingly public, compound screening campaigns, we generated logistic regression models that correlate postmarketing ADRs with screening data from the PubChem BioAssay database. These models analyze ADRs at the level of organ systems, using the system organ classes (SOCs). Of the 19 SOCs under consideration, nine were found to be significantly correlated with preclinical screening data. With regard to six of the eight established drugs for which we could retropredict SOC-specific ADRs, prior knowledge was found that supports these predictions. We conclude this paper by predicting that SOC-specific ADRs will be associated with three unapproved or recently introduced drugs.

    View details for DOI 10.1038/clpt.2011.81

    View details for Web of Science ID 000291853800018

    View details for PubMedID 21613989

  • Translational bioinformatics: linking knowledge across biological and clinical realms JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Sarkar, I. N., Butte, A. J., Lussier, Y. A., Tarczy-Hornoch, P., Ohno-Machado, L. 2011; 18 (4): 354-357


    Nearly a decade since the completion of the first draft of the human genome, the biomedical community is positioned to usher in a new era of scientific inquiry that links fundamental biological insights with clinical knowledge. Accordingly, holistic approaches are needed to develop and assess hypotheses that incorporate genotypic, phenotypic, and environmental knowledge. This perspective presents translational bioinformatics as a discipline that builds on the successes of bioinformatics and health informatics for the study of complex diseases. The early successes of translational bioinformatics are indicative of the potential to achieve the promise of the Human Genome Project for gaining deeper insights to the genetic underpinnings of disease and progress toward the development of a new generation of therapies.

    View details for DOI 10.1136/amiajnl-2011-000245

    View details for Web of Science ID 000292061700003

    View details for PubMedID 21561873

  • Exploiting drug-disease relationships for computational drug repositioning BRIEFINGS IN BIOINFORMATICS Dudley, J. T., Deshpande, T., Butte, A. J. 2011; 12 (4): 303-311


    Finding new uses for existing drugs, or drug repositioning, has been used as a strategy for decades to get drugs to more patients. As the ability to measure molecules in high-throughput ways has improved over the past decade, it is logical that such data might be useful for enabling drug repositioning through computational methods. Many computational predictions for new indications have been borne out in cellular model systems, though extensive animal model and clinical trial-based validation are still pending. In this review, we show that computational methods for drug repositioning can be classified in two axes: drug based, where discovery initiates from the chemical perspective, or disease based, where discovery initiates from the clinical perspective of disease or its pathology. Newer algorithms for computational drug repositioning will likely span these two axes, will take advantage of newer types of molecular measurements, and will certainly play a role in reducing the global burden of disease.

    View details for DOI 10.1093/bib/bbr013

    View details for Web of Science ID 000293078100002

    View details for PubMedID 21690101

  • Computationally translating molecular discoveries into tools for medicine: translational bioinformatics articles now featured in JAMIA JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Butte, A. J., Shah, N. H. 2011; 18 (4): 352-353

    View details for DOI 10.1136/amiajnl-2011-000343

    View details for Web of Science ID 000292061700002

    View details for PubMedID 21672904

  • Protein Microarrays Discover Angiotensinogen and PRKRIP1 as Novel Targets for Autoantibodies in Chronic Renal Disease MOLECULAR & CELLULAR PROTEOMICS Butte, A. J., Sigdel, T. K., Wadia, P. P., Miklos, D. B., Sarwal, M. M. 2011; 10 (3)


    Biomarkers for early detection of chronic kidney disease are needed, as millions of patients suffer from chronic diseases predisposing them to kidney failure. Protein microarrays may also hold utility in the discovery of auto-antibodies in other conditions not commonly considered auto-immune diseases. We hypothesized that proteins are released as a consequence of damage at a cellular level during end-organ damage from renal injury, not otherwise recognized as self-antigens, and an adaptive humoral immune response to these proteins might be detected in the blood, as a noninvasive tracker of this injury. The resultant antibodies (Ab) detected in the blood would serve as effective biomarkers for occult renal injury, enabling earlier clinical detection of chronic kidney disease than currently possible, because of the redundancy of the serum creatinine as a biomarker for early kidney injury. To screen for novel autoantibodies in chronic kidney disease, 24 protein microarrays were used to compare serum Ab from patients with chronic kidney disease against matched controls. From a panel of 38 antigens with increased Ab binding, four were validated in 71 individuals, with (n=50) and without (n=21) renal insufficiency. Significant elevations in the titer of novel auto-Ab were noted against angiotensinogen and PRKRIP1 in renal insufficiency. Current validation is underway to evaluate if these auto-Ab can provide means to follow the evolution of chronic kidney disease in patients with early stages of renal insufficiency, and if these rising titers of these auto-Ab correlate with the rate of progression of chronic kidney disease.

    View details for DOI 10.1074/mcp.M110.000497

    View details for Web of Science ID 000287847200001

    View details for PubMedID 21183621

  • Computational prediction and experimental validation associating FABP-1 and pancreatic adenocarcinoma with diabetes BMC GASTROENTEROLOGY Sharaf, R. N., Butte, A. J., Montgomery, K. D., Pai, R., Dudley, J. T., Pasricha, P. J. 2011; 11


    Pancreatic cancer, composed principally of pancreatic adenocarcinoma (PaC), is the fourth leading cause of cancer death in the United States. PaC-associated diabetes may be a marker of early disease. We sought to identify molecules associated with PaC and PaC with diabetes (PaC-DM) using a novel translational bioinformatics approach. We identified fatty acid binding protein-1 (FABP-1) as one of several candidates. The primary aim of this pilot study was to experimentally validate the predicted association between FABP-1 with PaC and PaC with diabetes. We searched public microarray measurements for genes that were specifically highly expressed in PaC. We then filtered for proteins with known involvement in diabetes. Validation of FABP-1 was performed via antibody immunohistochemistry on formalin-fixed paraffin embedded pancreatic tissue microarrays (FFPE TMA). FFPE TMA were constructed using 148 cores of pancreatic tissue from 134 patients collected between 1995 and 2002 from patients who underwent pancreatic surgery. Primary analysis was performed on 21 normal and 60 pancreatic adenocarcinoma samples, stratified for diabetes. Clinical data on samples was obtained via retrospective chart review. Serial sections were cut per standard protocol. Antibody staining was graded by an experienced pathologist on a scale of 0-3. Bivariate and multivariate analyses were conducted to assess FABP-1 staining and clinical characteristics. Normal samples were significantly more likely to come from younger patients. PaC samples were significantly more likely to stain for FABP-1, when FABP-1 staining was considered a binary variable. Compared to normals, there was significantly increased staining in diabetic PaC samples (p = 0.004) and there was a trend towards increased staining in the non-diabetic PaC group (p = 0.07). In logistic regression modeling, FABP-1 staining was significantly associated with diagnosis of PaC (OR 8.6 95% CI 1.1-68, p = 0.04), though age was a confounder. Compared to normal controls, there was a significant positive association between FABP-1 staining and PaC on FFPE-TMA, strengthened by the presence of diabetes. Further studies with closely phenotyped patient samples are required to understand the true relationship between FABP-1, PaC and PaC-associated diabetes. A translational bioinformatics approach has potential to identify novel disease associations and potential biomarkers in gastroenterology.

    View details for DOI 10.1186/1471-230X-11-5

    View details for Web of Science ID 000287235100001

    View details for PubMedID 21251264

  • Matching cancer genomes to established cell lines for personalized oncology. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Dudley, J. T., Chen, R., Butte, A. J. 2011: 243-252


    The diagnosis and treatment of cancers, which rank among the leading causes of mortality in developed nations, presents substantial clinical challenges. The genetic and epigenetic heterogeneity of tumors can lead to differential response to therapy and gross disparities in patient outcomes, even for tumors originating from similar tissues. High-throughput DNA sequencing technologies hold promise to improve the diagnosis and treatment of cancers through efficient and economical profiling of complete tumor genomes, paving the way for approaches to personalized oncology that consider the unique genetic composition of the patient's tumor. Here we present a novel method to leverage the information provided by cancer genome sequencing to match an individual tumor genome with commercial cell lines, which might be leveraged as clinical surrogates to inform prognosis or therapeutic strategy. We evaluate the method using a published lung cancer genome and genetic profiles of commercial cancer cell lines. The results support the general plausibility of this matching approach, thereby offering a first step in translational bioinformatics approaches to personalized oncology using established cancer cell lines.

    View details for PubMedID 21121052

  • The reference human genome demonstrates high risk of type 1 diabetes and other disorders. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Chen, R., Butte, A. J. 2011: 231-242


    Personal genome resequencing has provided a promising lead toward personalized medicine. However, due to the limited samples and the lack of case/control design, current interpretation of personal genome sequences has been mainly focused on the identification and functional annotation of the DNA variants that are different from the reference genome. The reference genome was deduced from a collection of DNAs from anonymous individuals, some of whom might be carriers of disease risk alleles. We queried the reference genome against a large high-quality disease-SNP association database and found 3,556 disease-susceptible variants, including 15 rare variants. We assessed the likelihood ratio for risk for the reference genome on 104 diseases and found high risk for type 1 diabetes (T1D) and hypertension. We further demonstrated that the risk of T1D was significantly higher in the reference genome than in a healthy patient with a whole human genome sequence. We found that the high T1D risk was mainly driven by a R260W mutation in PTPN22 in the reference genome. Therefore, we recommend that the disease-susceptible variants in the reference genome should be taken into consideration and future genome sequences should be interpreted with curated and predicted disease-susceptible loci to assess personal disease risk.

    View details for PubMedID 21121051

  • Content-based microarray search using differential expression profiles BMC BIOINFORMATICS Engreitz, J. M., Morgan, A. A., Dudley, J. T., Chen, R., Thathoo, R., Altman, R. B., Butte, A. J. 2010; 11


    With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations. We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3. Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.

    View details for DOI 10.1186/1471-2105-11-603

    View details for Web of Science ID 000286192100001

    View details for PubMedID 21172034
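
    As a rough illustration of comparing two differential expression profiles with a weighted correlation, the sketch below implements a generic weighted Pearson correlation. The abstract does not define how its p-weighted variant derives the weights, so the per-gene weights here are an assumption (for example, larger weights for genes with stronger evidence of differential expression), and the function name `weighted_pearson` is hypothetical.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of per-gene scores x and y with weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted means of each profile.
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances.
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant profile; report 0
    return cov / math.sqrt(var_x * var_y)


# Toy profiles: the fourth gene disagrees but carries zero weight (assumed weighting).
query = [1.0, 2.0, 3.0, 10.0]
candidate = [1.0, 2.0, 3.0, -5.0]
weights = [1.0, 1.0, 1.0, 0.0]
score = weighted_pearson(query, candidate, weights)
```

    Down-weighting unreliable genes lets the agreeing genes dominate the score, which is one plausible motivation for a weighted measure when no differential expression threshold is applied.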

  • In silico research in the era of cloud computing NATURE BIOTECHNOLOGY Dudley, J. T., Butte, A. J. 2010; 28 (11): 1181-1185

    View details for DOI 10.1038/nbt1110-1181

    View details for Web of Science ID 000283924100019

    View details for PubMedID 21057489

  • Non-Synonymous and Synonymous Coding SNPs Show Similar Likelihood and Effect Size of Human Disease Association PLOS ONE Chen, R., Davydov, E. V., Sirota, M., Butte, A. J. 2010; 5 (10)


    Many DNA variants have been identified on more than 300 diseases and traits using Genome-Wide Association Studies (GWASs). Some have been validated using deep sequencing, but many fewer have been validated functionally, primarily focused on non-synonymous coding SNPs (nsSNPs). It is an open question whether synonymous coding SNPs (sSNPs) and other non-coding SNPs can lead to as high odds ratios as nsSNPs. We conducted a broad survey across 21,429 disease-SNP associations curated from 2,113 publications studying human genetic association, and found that nsSNPs and sSNPs shared similar likelihood and effect size for disease association. The enrichment of disease-associated SNPs around the 80th base in the first introns might provide an effective way to prioritize intronic SNPs for functional studies. We further found that the likelihood of disease association was positively associated with the effect size across different types of SNPs, and SNPs in the 3' untranslated regions, such as the microRNA binding sites, might be under-investigated. Our results suggest that sSNPs are just as likely to be involved in disease mechanisms, so we recommend that sSNPs discovered from GWAS should also be examined with functional studies.

    View details for DOI 10.1371/journal.pone.0013574

    View details for Web of Science ID 000283419100014

    View details for PubMedID 21042586

  • Dynamic MicroRNA Expression Programs During Cardiac Differentiation of Human Embryonic Stem Cells: Role for miR-499 CIRCULATION-CARDIOVASCULAR GENETICS Wilson, K. D., Hu, S., Venkatasubrahmanyam, S., Fu, J., Sun, N., Abilez, O. J., Baugh, J. J., Jia, F., Ghosh, Z., Li, R. A., Butte, A. J., Wu, J. C. 2010; 3 (5): 426-U97


    MicroRNAs (miRNAs) are a newly discovered endogenous class of small, noncoding RNAs that play important posttranscriptional regulatory roles by targeting messenger RNAs for cleavage or translational repression. Human embryonic stem cells are known to express miRNAs that are often undetectable in adult organs, and a growing body of evidence has implicated miRNAs as important arbiters of heart development and disease. To better understand the transition between the human embryonic and cardiac "miRNA-omes," we report here the first miRNA profiling study of cardiomyocytes derived from human embryonic stem cells. Analyzing 711 unique miRNAs, we have identified several interesting miRNAs, including miR-1, -133, and -208, that have been previously reported to be involved in cardiac development and disease and that show surprising patterns of expression across our samples. We also identified novel miRNAs, such as miR-499, that are strongly associated with cardiac differentiation and that share many predicted targets with miR-208. Overexpression of miR-499 and -1 resulted in upregulation of important cardiac myosin heavy-chain genes in embryoid bodies; miR-499 overexpression also caused upregulation of the cardiac transcription factor MEF2C. Taken together, our data give significant insight into the regulatory networks that govern human embryonic stem cell differentiation and highlight the ability of miRNAs to perturb, and even control, the genes that are involved in cardiac specification of human embryonic stem cells.

    View details for DOI 10.1161/CIRCGENETICS.109.934281

    View details for Web of Science ID 000283163100006

    View details for PubMedID 20733065

  • An Optimistic Prognosis for the Clinical Utility of Laboratory Test Data ANESTHESIA AND ANALGESIA Zheng, M., Ravindran, P., Wang, J., Epstein, R. H., Chen, D. P., Butte, A. J., Peltz, G. 2010; 111 (4): 1026-1035


    It is hoped that anesthesiologists and other clinicians will be able to increasingly rely upon laboratory test data to improve the perioperative care of patients. However, it has been suggested that in order for a laboratory test to have clinically useful diagnostic performance characteristics (sensitivity and specificity), its performance must be considerably better than those that have been evaluated in most etiologic or epidemiologic studies. This pessimism about the clinical utility of laboratory tests is based upon the untested assumption that laboratory data are normally distributed within case and control populations. We evaluated the data distribution for 700 commonly ordered laboratory tests, and found that the vast majority (99%) do not have a normal distribution. The deviation from normal was most pronounced at extreme values, which had a large quantitative effect on laboratory test performance. At the sensitivity and specificity values required for diagnostic utility, the minimum required odds ratios for laboratory tests with a nonnormal data distribution were significantly smaller (by orders of magnitude) than for tests with a normal distribution. By evaluating the effect that the data distribution has on laboratory test performance, we have arrived at the more optimistic outlook that it is feasible to produce laboratory tests with diagnostically useful performance characteristics. We also show that moderate errors in the classification of outcome variables (e.g., death vs. survival at a specified end point) have a small impact on test performance, which is of importance for outcomes research that uses anesthesia information management systems. Because these analyses typically seek to identify factors associated with an undesirable outcome, the data distributions of the independent variables need to be considered when interpreting the odds ratios obtained from such investigations.

    View details for DOI 10.1213/ANE.0b013e3181efff0c

    View details for Web of Science ID 000282310200033
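
    To make the link between sensitivity, specificity, and odds ratios in the abstract above concrete, the sketch below uses the standard diagnostic odds ratio for a binary test. It is a textbook relationship offered for orientation, not the distribution-based analysis performed in the study, and the function names are hypothetical.

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """Odds ratio implied by a binary test with the given sensitivity and specificity."""
    if not (0 < sensitivity < 1 and 0 < specificity < 1):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    # Odds of a positive test among cases divided by odds of a positive test among controls.
    return (sensitivity * specificity) / ((1 - sensitivity) * (1 - specificity))


def odds_ratio_from_counts(tp, fn, fp, tn):
    """The same quantity computed from a 2x2 table of test results versus outcome."""
    if fn == 0 or fp == 0:
        raise ValueError("odds ratio is undefined when fn or fp is zero")
    return (tp * tn) / (fp * fn)


# A test with 90% sensitivity and 90% specificity on 100 cases and 100 controls.
from_rates = diagnostic_odds_ratio(0.9, 0.9)
from_counts = odds_ratio_from_counts(tp=90, fn=10, fp=10, tn=90)
```

    Both calculations give an odds ratio of about 81 for a 90%/90% binary test, far above the odds ratios typically reported in etiologic studies; this gap is the basis of the pessimism that the abstract revisits for nonnormally distributed laboratory data.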

  • Drug Discovery in a Multidimensional World: Systems, Patterns, and Networks JOURNAL OF CARDIOVASCULAR TRANSLATIONAL RESEARCH Dudley, J. T., Schadt, E., Sirota, M., Butte, A. J., Ashley, E. 2010; 3 (5): 438-447


    Despite great strides in revealing and understanding the physiological and molecular bases of cardiovascular disease, efforts to translate this understanding into needed therapeutic interventions continue to lag far behind the initial discoveries. Although pharmaceutical companies continue to increase investments into research and development, the number of drugs gaining federal approval is in decline. Many factors underlie these trends, and a vast number of technological and scientific innovations are being sought through efforts to reinvigorate drug discovery pipelines. Recent advances in molecular profiling technologies and development of sophisticated computational approaches for analyzing these data are providing new, systems-oriented approaches towards drug discovery. Unlike the traditional approach to drug discovery, which is typified by a one-drug-one-target mindset, systems-oriented approaches to drug discovery leverage the parallelism and high-dimensionality of the molecular data to construct more comprehensive molecular models that aim to model broader biomolecular systems. These models offer a means to explore complex molecular states (e.g., disease) where thousands to millions of molecular entities comprising multiple molecular data types (e.g., proteomics and gene expression) can be evaluated simultaneously as components of a cohesive biomolecular system. In this paper, we discuss emerging approaches towards systems-oriented drug discovery and contrast these efforts with the traditional, unidimensional approach to drug discovery. We also highlight several applications of these system-oriented approaches across various aspects of drug discovery, including target discovery, drug repositioning and drug toxicity. When available, specific applications to cardiovascular drug discovery are highlighted and discussed.

    View details for DOI 10.1007/s12265-010-9214-6

    View details for Web of Science ID 000284694700003

    View details for PubMedID 20677029

  • Getting from Genes to Function in Lung Disease: A National Heart, Lung, and Blood Institute Workshop Report AMERICAN JOURNAL OF RESPIRATORY AND CRITICAL CARE MEDICINE Ober, C., Butte, A. J., Elias, J. A., Lusis, A. J., Gan, W., Banks-Schlegel, S., Schwartz, D. 2010; 182 (6): 732-737


    Genome-wide association studies (GWAS) have revealed novel genes and pathways involved in lung disease, many of which are potential targets for therapy. However, despite numerous successes, a large proportion of the genetic variance in disease risk remains unexplained, and the function of the associated genetic variations identified by GWAS and the mechanisms by which they alter individual risk for disease or pathogenesis are still largely unknown. The National Heart, Lung, and Blood Institute (NHLBI) convened a 2-day workshop to address these shortcomings and to make recommendations for future research areas that will move the scientific community beyond gene discovery. Topics of individual sessions ranged from data integration and systems genetics to functional validation of genetic variations in humans and model systems. There was broad consensus among the participants for five high-priority areas for future research, including the following: (1) integrated approaches to characterize the function of genetic variations, (2) studies on the role of environment and mechanisms of transcriptional and post-transcriptional regulation, (3) development of model systems to study gene function in complex biological systems, (4) comparative phenomic studies across lung diseases, and (5) training in and applications of bioinformatic approaches for comprehensive mining of existing data sets. Last, it was agreed that future research on lung diseases should integrate approaches across "-omic" technologies and include ethnically/racially diverse populations in human studies of lung disease whenever possible.

    View details for DOI 10.1164/rccm.201002-0180PP

    View details for Web of Science ID 000282162100005

    View details for PubMedID 20558629

  • Biomarker and Drug Discovery for Gastroenterology Through Translational Bioinformatics GASTROENTEROLOGY Dudley, J. T., Butte, A. J. 2010; 139 (3): 735-U66

    View details for DOI 10.1053/j.gastro.2010.07.024

    View details for Web of Science ID 000281365500014

    View details for PubMedID 20650279

  • Validating pathophysiological models of aging using clinical electronic medical records JOURNAL OF BIOMEDICAL INFORMATICS Chen, D. P., Morgan, A. A., Butte, A. J. 2010; 43 (3): 358-364


    Bioinformatics methods that leverage the vast amounts of clinical data promises to provide insights into underlying molecular mechanisms that help explain human physiological processes. One of these processes is adolescent development. The utility of predictive aging models generated from cross-sectional cohorts and their applicability to separate populations, including the clinical population, has yet to be completely explored. In order to address this, we built regression models predictive of adolescent chronological age from 2001 to 2002 National Health and Nutrition Examination Survey (NHANES) data and validated them against independent 2003-2004 NHANES data and clinical data from an academic tertiary-care pediatric hospital. The results indicate distinct differences between male and female models with both alkaline phosphatase and creatinine as predictive biomarkers for both genders, hematocrit and mean cell volume for males, and total serum globulin for females. We also suggest that the models are generalizable, are clinically relevant, and imply underlying molecular and clinical differences between males and females that may affect prediction accuracy. The integration of both epidemiological and clinical data promises to create more robust models that shed new light on physiological processes.

    View details for DOI 10.1016/j.jbi.2009.11.007

    View details for Web of Science ID 000278780800002

    View details for PubMedID 19958842

  • Current methodologies for translational bioinformatics JOURNAL OF BIOMEDICAL INFORMATICS Lussier, Y. A., Butte, A. J., Hunter, L. 2010; 43 (3): 355-357

    View details for DOI 10.1016/j.jbi.2010.05.002

    View details for Web of Science ID 000278780800001

    View details for PubMedID 20470899

  • Challenges in the clinical application of whole-genome sequencing LANCET Ormond, K. E., Wheeler, M. T., Hudgins, L., Klein, T. E., Butte, A. J., Altman, R. B., Ashley, E. A., Greely, H. T. 2010; 375 (9727): 1749-1751

  • Predicting environmental chemical factors associated with disease-related gene expression data BMC MEDICAL GENOMICS Patel, C. J., Butte, A. J. 2010; 3


    Many common diseases arise from an interaction between environmental and genetic factors. Our knowledge regarding environment and gene interactions is growing, but frameworks to build an association between gene-environment interactions and disease using preexisting, publicly available data has been lacking. Integrating freely-available environment-gene interaction and disease phenotype data would allow hypothesis generation for potential environmental associations to disease. We integrated publicly available disease-specific gene expression microarray data and curated chemical-gene interaction data to systematically predict environmental chemicals associated with disease. We derived chemical-gene signatures for 1,338 chemical/environmental chemicals from the Comparative Toxicogenomics Database (CTD). We associated these chemical-gene signatures with differentially expressed genes from datasets found in the Gene Expression Omnibus (GEO) through an enrichment test. We were able to verify our analytic method by accurately identifying chemicals applied to samples and cell lines. Furthermore, we were able to predict known and novel environmental associations with prostate, lung, and breast cancers, such as estradiol and bisphenol A. We have developed a scalable and statistical method to identify possible environmental associations with disease using publicly available data and have validated some of the associations in the literature.

    View details for DOI 10.1186/1755-8794-3-17

    View details for Web of Science ID 000278191100002

    View details for PubMedID 20459635

  • Antibodies specifically target AML antigen NuSAP1 after allogeneic bone marrow transplantation BLOOD Wadia, P. P., Coram, M., Armstrong, R. J., Mindrinos, M., Butte, A. J., Miklos, D. B. 2010; 115 (10): 2077-2087


    Identifying the targets of immune response after allogeneic hematopoietic cell transplantation (HCT) promises to provide relevant immune therapy candidate proteins. We used protein microarrays to serologically identify nucleolar and spindle-associated protein 1 (NuSAP1) and chromatin assembly factor 1, subunit B (p60; CHAF1b) as targets of new antibody responses that developed after allogeneic HCT. Western blots and enzyme-linked immunosorbent assays (ELISA) validated their post-HCT recognition and enabled ELISA testing of 120 other patients with various malignancies who underwent allo-HCT. CHAF1b-specific antibodies were predominantly detected in patients with acute myeloid leukemia (AML), whereas NuSAP1-specific antibodies were exclusively detected in patients with AML 1 year after transplantation (P < .001). Complete genomic exon sequencing failed to identify a nonsynonymous single nucleotide polymorphism (SNP) for NuSAP1 and CHAF1b between the donor and recipient cells. Expression profiles and reverse transcriptase-polymerase chain reaction (RT-PCR) showed NuSAP1 was predominately expressed in the bone marrow CD34(+)CD90(+) hematopoietic stem cells, leukemic cell lines, and B lymphoblasts compared with other tissues or cells. Thus, NuSAP1 is recognized as an immunogenic antigen in 65% of patients with AML following allogeneic HCT and suggests a tumor antigen role.

    View details for DOI 10.1182/blood-2009-03-211375

    View details for Web of Science ID 000275751300033

    View details for PubMedID 20053754

  • Network-Based Elucidation of Human Disease Similarities Reveals Common Functional Modules Enriched for Pluripotent Drug Targets PLOS COMPUTATIONAL BIOLOGY Suthram, S., Dudley, J. T., Chiang, A. P., Chen, R., Hastie, T. J., Butte, A. J. 2010; 6 (2)


    Current work in elucidating relationships between diseases has largely been based on pre-existing knowledge of disease genes. Consequently, these studies are limited in their discovery of new and unknown disease relationships. We present the first quantitative framework to compare and contrast diseases by an integrated analysis of disease-related mRNA expression data and the human protein interaction network. We identified 4,620 functional modules in the human protein network and provided a quantitative metric to record their responses in 54 diseases leading to 138 significant similarities between diseases. Fourteen of the significant disease correlations also shared common drugs, supporting the hypothesis that similar diseases can be treated by the same drugs, allowing us to make predictions for new uses of existing drugs. Finally, we also identified 59 modules that were dysregulated in at least half of the diseases, representing a common disease-state "signature". These modules were significantly enriched for genes that are known to be drug targets. Interestingly, drugs known to target these genes/proteins are already known to treat significantly more diseases than drugs targeting other genes/proteins, highlighting the importance of these core modules as prime therapeutic opportunities.

    View details for DOI 10.1371/journal.pcbi.1000662

    View details for Web of Science ID 000275260000026

    View details for PubMedID 20140234

  • Dynamism in gene expression across multiple studies PHYSIOLOGICAL GENOMICS Morgan, A. A., Dudley, J. T., Deshpande, T., Butte, A. J. 2010; 40 (3): 128-140


    In this study we develop methods of examining gene expression dynamics, how and when genes change expression, and demonstrate their application in a meta-analysis involving over 29,000 microarrays. By defining measures across many experimental conditions, we have a new way of characterizing dynamics, complementary to measures looking at changes in absolute variation or breadth of tissues showing expression. We show conservation in overall patterns of dynamism across three species (human, mouse, and rat) and show associations with known disease-related genes. We discuss the enriched functional properties of the sets of genes showing different patterns of dynamics and show that the differences in expression dynamics is associated with the variety of different transcription factor regulatory sites. These results can influence thinking about the selection of genes for microarray design and the analysis of measurements of mRNA expression variation in a global context of expression dynamics across many conditions, as genes that are rarely differentially expressed between experimental conditions may be the subject of increased scrutiny when they significantly vary in expression between experimental subsets.

    View details for DOI 10.1152/physiolgenomics.90403.2008

    View details for Web of Science ID 000274287000002

    View details for PubMedID 19920211

  • Identification of complex metabolic states in critically injured patients using bioinformatic cluster analysis CRITICAL CARE Cohen, M. J., Grossman, A. D., Morabito, D., Knudson, M. M., Butte, A. J., Manley, G. T. 2010; 14 (1)


    Advances in technology have made extensive monitoring of patient physiology the standard of care in intensive care units (ICUs). While many systems exist to compile these data, there has been no systematic multivariate analysis and categorization across patient physiological data. The sheer volume and complexity of these data make pattern recognition or identification of patient state difficult. Hierarchical cluster analysis allows visualization of high dimensional data and enables pattern recognition and identification of physiologic patient states. We hypothesized that processing of multivariate data using hierarchical clustering techniques would allow identification of otherwise hidden patient physiologic patterns that would be predictive of outcome. Multivariate physiologic and ventilator data were collected continuously using a multimodal bioinformatics system in the surgical ICU at San Francisco General Hospital. These data were incorporated with non-continuous data and stored on a server in the ICU. A hierarchical clustering algorithm grouped each minute of data into 1 of 10 clusters. Clusters were correlated with outcome measures including incidence of infection, multiple organ failure (MOF), and mortality. We identified 10 clusters, which we defined as distinct patient states. While patients transitioned between states, they spent significant amounts of time in each. Clusters were enriched for our outcome measures: 2 of the 10 states were enriched for infection, 6 of 10 were enriched for MOF, and 3 of 10 were enriched for death. Further analysis of correlations between pairs of variables within each cluster reveals significant differences in physiology between clusters. Here we show for the first time the feasibility of clustering physiological measurements to identify clinically relevant patient states after trauma. These results demonstrate that hierarchical clustering techniques can be useful for visualizing complex multivariate data and may provide new insights for the care of critically injured patients.

    View details for DOI 10.1186/cc8864

    View details for Web of Science ID 000276989800044

    View details for PubMedID 20122274

  • Translational bioinformatics in the cloud: an affordable alternative GENOME MEDICINE Dudley, J. T., Pouliot, Y., Chen, R., Morgan, A. A., Butte, A. J. 2010; 2 (8): 51


    With the continued exponential expansion of publicly available genomic data and access to low-cost, high-throughput molecular technologies for profiling patient populations, computational technologies and informatics are becoming vital considerations in genomic medicine. Although cloud computing technology is being heralded as a key enabling technology for the future of genomic research, available case studies are limited to applications in the domain of high-throughput sequence data analysis. The goal of this study was to evaluate the computational and economic characteristics of cloud computing in performing a large-scale data integration and analysis representative of research problems in genomic medicine. We find that the cloud-based analysis compares favorably in both performance and cost in comparison to a local computational cluster, suggesting that cloud computing technologies might be a viable resource for facilitating large-scale translational research in genomic medicine.

    View details for DOI 10.1186/gm172

    View details for Web of Science ID 000208627100051

    View details for PubMedID 20691073

  • Likelihood ratios for genome medicine GENOME MEDICINE Morgan, A. A., Chen, R., Butte, A. J. 2010; 2 (5): 30


    Patients are beginning to present to healthcare providers with the results of high-throughput individualized genotyping, and interpreting these results in the context of the explosive growth of literature linking individual variants with disease may seem daunting. However, we suggest that results of a personal genomic analysis may be viewed as a panel of many tests for multiple diseases. By using well-established methods of evidence based medicine, these very many parallel tests may be combined using likelihood ratios to report a post-test probability of disease for use in patient assessment.

    View details for DOI 10.1186/gm151

    View details for Web of Science ID 000208627100030

    View details for PubMedID 20497613

  • Autoimmune Disease Classification by Inverse Association with SNP Alleles PLOS GENETICS Sirota, M., Schaub, M. A., Batzoglou, S., Robinson, W. H., Butte, A. J. 2009; 5 (12)


    With multiple genome-wide association studies (GWAS) performed across autoimmune diseases, there is a great opportunity to study the homogeneity of genetic architectures across autoimmune disease. Previous approaches have been limited in the scope of their analysis and have failed to properly incorporate the direction of allele-specific disease associations for SNPs. In this work, we refine the notion of a genetic variation profile for a given disease to capture strength of association with multiple SNPs in an allele-specific fashion. We apply this method to compare genetic variation profiles of six autoimmune diseases: multiple sclerosis (MS), ankylosing spondylitis (AS), autoimmune thyroid disease (ATD), rheumatoid arthritis (RA), Crohn's disease (CD), and type 1 diabetes (T1D), as well as five non-autoimmune diseases. We quantify pair-wise relationships between these diseases and find two broad clusters of autoimmune disease where SNPs that make an individual susceptible to one class of autoimmune disease also protect from diseases in the other autoimmune class. We find that RA and AS form one such class, and MS and ATD another. We identify specific SNPs and genes with opposite risk profiles for these two classes. We furthermore explore individual SNPs that play an important role in defining similarities and differences between disease pairs. We present a novel, systematic, cross-platform approach to identify allele-specific relationships between disease pairs based on genetic variation as well as the individual SNPs which drive the relationships. While recognizing similarities between diseases might lead to identifying novel treatment options, detecting differences between diseases previously thought to be similar may point to key novel disease-specific genes and pathways.

    View details for DOI 10.1371/journal.pgen.1000792

    View details for Web of Science ID 000273469700042

    View details for PubMedID 20041220

  • A Quick Guide for Developing Effective Bioinformatics Programming Skills PLOS COMPUTATIONAL BIOLOGY Dudley, J. T., Butte, A. J. 2009; 5 (12)

    View details for DOI 10.1371/journal.pcbi.1000589

    View details for Web of Science ID 000274229000007

    View details for PubMedID 20041221

  • Protein microarrays identify antibodies to protein kinase C zeta that are associated with a greater risk of allograft loss in pediatric renal transplant recipients KIDNEY INTERNATIONAL Sutherland, S. M., Li, L., Sigdel, T. K., Wadia, P. P., Miklos, D. B., Butte, A. J., Sarwal, M. M. 2009; 76 (12): 1277-1283


    Antibodies to human leukocyte antigens (HLAs) are a risk factor for acute renal allograft rejection and loss. The role of non-HLAs and their significance to allograft rejection have gained recent attention. Here, we applied protein microarray technology, with the capacity to simultaneously identify 5056 potential antigen targets, to assess non-HLA antibody formation in 15 pediatric renal transplant recipients during allograft rejection. Comparison of the pre- and post-transplant serum identified de novo antibodies to 229 non-HLA targets, 36 of which were present in multiple patients at allograft rejection. On the basis of its reactivity, protein kinase Czeta (PKCzeta) was selected for confirmatory testing and clinical study. Immunohistochemical analysis found PKCzeta both within the renal tissue and infiltrating lymphocytes at rejection. Patients who had an elevated anti-PKCzeta titer developed rejection, which was significantly more likely to result in graft loss. The absence of C4d deposition in patients with high anti-PKCzeta titers suggests that it is a marker of severe allograft injury rather than itself being pathogenic. Presumably, critical renal injury and inflammation associated with this rejection subtype lead to the immunological exposure of PKCzeta with resultant antibody formation. Prospective assessment of serum anti-PKCzeta levels at allograft rejection will be needed to confirm these results.

    View details for DOI 10.1038/ki.2009.384

    View details for Web of Science ID 000272230400009

    View details for PubMedID 19812540

  • Relationship of differential gene expression profiles in CD34(+) myelodysplastic syndrome marrow cells to disease subtype and progression BLOOD Sridhar, K., Ross, D. T., Tibshirani, R., Butte, A. J., Greenberg, P. L. 2009; 114 (23): 4847-4858


    Microarray analysis with 40 000 cDNA gene chip arrays determined differential gene expression profiles (GEPs) in CD34(+) marrow cells from myelodysplastic syndrome (MDS) patients compared with healthy persons. Using focused bioinformatics analyses, we found 1175 genes significantly differentially expressed by MDS versus normal, requiring a minimum of 39 genes to separately classify these patients. Major GEP differences were demonstrated between healthy and MDS patients and between several MDS subgroups: (1) those whose disease remained stable and those who subsequently transformed (tMDS) to acute myeloid leukemia; (2) between del(5q) and other MDS patients. A 6-gene "poor risk" signature was defined, which was associated with acute myeloid leukemia transformation and provided additive prognostic information for International Prognostic Scoring System Intermediate-1 patients. Overexpression of genes generating ribosomal proteins and for other signaling pathways was demonstrated in the tMDS patients. Comparison of del(5q) with the remaining MDS patients showed 1924 differentially expressed genes, with underexpression of 1014 genes, 11 of which were within the 5q31-32 commonly deleted region. These data demonstrated (1) GEPs distinguishing MDS patients from healthy and between those with differing clinical outcomes (tMDS vs those whose disease remained stable) and cytogenetics [eg, del(5q)]; and (2) molecular criteria refining prognostic categorization and associated biologic processes in MDS.

    View details for DOI 10.1182/blood-2009-08-236422

    View details for Web of Science ID 000272190700014

    View details for PubMedID 19801443

  • FoxO3 Regulates Neural Stem Cell Homeostasis CELL STEM CELL Renault, V. M., Rafalski, V. A., Morgan, A. A., Salih, D. A., Brett, J. O., Webb, A. E., Villeda, S. A., Thekkat, P. U., Guillerey, C., Denko, N. C., Palmer, T. D., Butte, A. J., Brunet, A. 2009; 5 (5): 527-539


    In the nervous system, neural stem cells (NSCs) are necessary for the generation of new neurons and for cognitive function. Here we show that FoxO3, a member of a transcription factor family known to extend lifespan in invertebrates, regulates the NSC pool. We find that adult FoxO3(-/-) mice have fewer NSCs in vivo than wild-type counterparts. NSCs isolated from adult FoxO3(-/-) mice have decreased self-renewal and an impaired ability to generate different neural lineages. Identification of the FoxO3-dependent gene expression profile in NSCs suggests that FoxO3 regulates the NSC pool by inducing a program of genes that preserves quiescence, prevents premature differentiation, and controls oxygen metabolism. The ability of FoxO3 to prevent the premature depletion of NSCs might have important implications for counteracting brain aging in long-lived species.

    View details for DOI 10.1016/j.stem.2009.09.014

    View details for Web of Science ID 000272019500014

    View details for PubMedID 19896443

  • Systematic Evaluation of Drug-Disease Relationships to Identify Leads for Novel Drug Uses CLINICAL PHARMACOLOGY & THERAPEUTICS Chiang, A. P., Butte, A. J. 2009; 86 (5): 507-510


    Drug repositioning refers to the discovery of alternative uses for drugs--uses that are different from that for which the drugs were originally intended. One challenge in this effort lies in choosing the indication for which a drug of interest could be prospectively tested. We systematically evaluated a drug treatment-based view of diseases in order to address this challenge. Suggestions for novel drug uses were generated using a "guilt by association" approach. When compared with a control group of drug uses, the suggested novel drug uses generated by this approach were significantly enriched with respect to previous and ongoing clinical trials.

    View details for DOI 10.1038/clpt.2009.103

    View details for Web of Science ID 000271186000018

    View details for PubMedID 19571805

  • Wiskott-Aldrich syndrome protein is an effector of Kit signaling BLOOD Mani, M., Venkatasubrahmanyam, S., Sanyal, M., Levy, S., Butte, A., Weinberg, K., Jahn, T. 2009; 114 (14): 2900-2908


    The pleiotropic receptor tyrosine kinase Kit can provide cytoskeletal signals that define cell shape, positioning, and migration, but the underlying mechanisms are less well understood. In this study, we provide evidence that Kit signals through Wiskott-Aldrich syndrome protein (WASP), the central hematopoietic actin nucleation-promoting factor and regulator of the cytoskeleton. Kit ligand (KL) stimulation resulted in transient tyrosine phosphorylation of WASP, as well as interacting proteins WASP-interacting protein and Arp2/3. KL-induced filopodia in bone marrow-derived mast cells (BMMCs) were significantly decreased in number and size in the absence of WASP. KL-dependent regulation of intracellular Ca(2+) levels was aberrant in WASP-deficient BMMCs. When BMMCs were derived from WASP-heterozygous female mice using KL as a growth factor, the cultures eventually developed from a mixture of WASP-positive and -negative populations into a homogenous WASP-positive culture derived from the WASP-positive progenitors. Thus, WASP expression conferred a selective advantage to the development of Kit-dependent hematopoiesis consistent with the selective advantage of WASP-positive hematopoietic cells observed in WAS-heterozygous female humans. Finally, KL-mediated gene expression in wild-type and WASP-deficient BMMCs was compared and revealed that approximately 30% of all Kit-induced changes were WASP dependent. The results indicate that Kit signaling through WASP is necessary for normal Kit-mediated filopodia formation, cell survival, and gene expression, and provide new insight into the mechanism in which WASP exerts a strong selective pressure in hematopoiesis.

    View details for DOI 10.1182/blood-2009-01-200733

    View details for Web of Science ID 000270387100013

    View details for PubMedID 19643989

  • Disease signatures are robust across tissues and experiments MOLECULAR SYSTEMS BIOLOGY Dudley, J. T., Tibshirani, R., Deshpande, T., Butte, A. J. 2009; 5


    Meta-analyses combining gene expression microarray experiments offer new insights into the molecular pathophysiology of disease not evident from individual experiments. Although the established technical reproducibility of microarrays serves as a basis for meta-analysis, pathophysiological reproducibility across experiments is not well established. In this study, we carried out a large-scale analysis of disease-associated experiments obtained from NCBI GEO, and evaluated their concordance across a broad range of diseases and tissue types. On evaluating 429 experiments, representing 238 diseases and 122 tissues from 8435 microarrays, we find evidence for a general, pathophysiological concordance between experiments measuring the same disease condition. Furthermore, we find that the molecular signature of disease across tissues is overall more prominent than the signature of tissue expression across diseases. The results offer new insight into the quality of public microarray data using pathophysiological metrics, and support new directions in meta-analysis that include characterization of the commonalities of disease irrespective of tissue, as well as the creation of multi-tissue systems models of disease pathology using public data.

    View details for DOI 10.1038/msb.2009.66

    View details for Web of Science ID 000270456400006

    View details for PubMedID 19756046

  • Expression of Complement Components Differs Between Kidney Allografts from Living and Deceased Donors JOURNAL OF THE AMERICAN SOCIETY OF NEPHROLOGY Naesens, M., Li, L., Ying, L., Sansanwal, P., Sigdel, T. K., Hsieh, S., Kambham, N., Lerut, E., Salvatierra, O., Butte, A. J., Sarwal, M. M. 2009; 20 (8): 1839-1851


    A disparity remains between graft survival of renal allografts from deceased donors and from living donors. A better understanding of the molecular mechanisms that underlie this disparity may allow the development of targeted therapies to enhance graft survival. Here, we used microarrays to examine whole genome expression profiles using tissue from 53 human renal allograft protocol biopsies obtained both at implantation and after transplantation. The gene expression profiles of living-donor kidneys and pristine deceased-donor kidneys (normal histology, young age) were significantly different before reperfusion at implantation. Deceased-donor kidneys exhibited a significant increase in renal expression of complement genes; posttransplantation biopsies from well-functioning, nonrejecting kidneys, regardless of donor source, also demonstrated a significant increase in complement expression. Peritransplantation phenomena, such as donor death and possibly cold ischemia time, contributed to differences in complement pathway gene expression. In addition, complement gene expression at the time of implantation was associated with both early and late graft function. These data suggest that complement-modulating therapy may improve graft outcomes in renal transplantation.

    View details for DOI 10.1681/ASN.2008111145

    View details for Web of Science ID 000268903200028

    View details for PubMedID 19443638

  • A Classifier-based approach to identify genetic similarities between diseases BIOINFORMATICS Schaub, M. A., Kaplow, I. M., Sirota, M., Do, C. B., Butte, A. J., Batzoglou, S. 2009; 25 (12): I21-I29


    Genome-wide association studies are commonly used to identify possible associations between genetic variations and diseases. These studies mainly focus on identifying individual single nucleotide polymorphisms (SNPs) potentially linked with one disease of interest. In this work, we introduce a novel methodology that identifies similarities between diseases using information from a large number of SNPs. We separate the diseases for which we have individual genotype data into one reference disease and several query diseases. We train a classifier that distinguishes between individuals that have the reference disease and a set of control individuals. This classifier is then used to classify the individuals that have the query diseases. We can then rank query diseases according to the average classification of the individuals in each disease set, and identify which of the query diseases are more similar to the reference disease. We repeat these classification and comparison steps so that each disease is used once as reference disease. We apply this approach using a decision tree classifier to the genotype data of seven common diseases and two shared control sets provided by the Wellcome Trust Case Control Consortium. We show that this approach identifies the known genetic similarity between type 1 diabetes and rheumatoid arthritis, and identifies a new putative similarity between bipolar disease and hypertension.

    View details for DOI 10.1093/bioinformatics/btp226

    View details for Web of Science ID 000266498300004

    View details for PubMedID 19477990

  • C3 Polymorphisms and Outcomes of Renal Allografts NEW ENGLAND JOURNAL OF MEDICINE Naesens, M., Butte, A. J., Sarwal, M. M. 2009; 360 (23): 2478-2478

    View details for Web of Science ID 000266590500027

    View details for PubMedID 19504761

  • MicroRNA Profiling of Human-Induced Pluripotent Stem Cells STEM CELLS AND DEVELOPMENT Wilson, K. D., Venkatasubrahmanyam, S., Jia, F., Sun, N., Butte, A. J., Wu, J. C. 2009; 18 (5): 749-757


    MicroRNAs (miRNAs) are a newly discovered endogenous class of small noncoding RNAs that play important posttranscriptional regulatory roles by targeting mRNAs for cleavage or translational repression. Accumulating evidence now supports the importance of miRNAs for human embryonic stem cell (hESC) self-renewal, pluripotency, and differentiation. However, with respect to induced pluripotent stem cells (iPSC), in which embryonic-like cells are reprogrammed from adult cells using defined factors, the role of miRNAs during reprogramming has not been well-characterized. Determining the miRNAs that are associated with reprogramming should yield significant insight into the specific miRNA expression patterns that are required for pluripotency. To address this lack of knowledge, we use miRNA microarrays to compare the "microRNA-omes" of human iPSCs, hESCs, and fetal fibroblasts. We confirm the presence of a signature group of miRNAs that is up-regulated in both iPSCs and hESCs, such as the miR-302 and 17-92 clusters. We also highlight differences between the two pluripotent cell types, as in expression of the miR-371/372/373 cluster. In addition to histone modifications, promoter methylation, transcription factors, and other regulatory control elements, we believe these miRNA signatures of pluripotent cells likely represent another layer of regulatory control over cell fate decisions, and should prove important for the cellular reprogramming field.

    View details for DOI 10.1089/scd.2008.0247

    View details for Web of Science ID 000266237000009

    View details for PubMedID 19284351

  • Use of Bayesian networks to probabilistically model and improve the likelihood of validation of microarray findings by RT-PCR JOURNAL OF BIOMEDICAL INFORMATICS English, S. B., Shih, S., Ramoni, M. F., Smith, L. E., Butte, A. J. 2009; 42 (2): 287-295


    Though genome-wide technologies, such as microarrays, are widely used, data from these methods are considered noisy; there is still varied success in downstream biological validation. We report a method that increases the likelihood of successfully validating microarray findings using real time RT-PCR, including genes at low expression levels and with small differences. We use a Bayesian network to identify the most relevant sources of noise based on the successes and failures in validation for an initial set of selected genes, and then improve our subsequent selection of genes for validation based on eliminating these sources of noise. The network displays the significant sources of noise in an experiment, and scores the likelihood of validation for every gene. We show how the method can significantly increase validation success rates. In conclusion, in this study, we have successfully added a new automated step to determine the contributory sources of noise that determine successful or unsuccessful downstream biological validation.

    View details for DOI 10.1016/j.jbi.2008.08.009

    View details for Web of Science ID 000264958800009

    View details for PubMedID 18790084

  • Identifying compartment-specific non-HLA targets after renal transplantation by integrating transcriptome and "antibodyome" measures PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Li, L., Wadia, P., Chen, R., Kambham, N., Naesens, M., Sigdel, T. K., Miklos, D. B., Sarwal, M. M., Butte, A. J. 2009; 106 (11): 4148-4153


    We have conducted an integrative genomics analysis of serological responses to non-HLA targets after renal transplantation, with the aim of identifying the tissue specificity and types of immunogenic non-HLA antigenic targets after transplantation. Posttransplant antibody responses were measured by paired comparative analysis of pretransplant and posttransplant serum samples from 18 pediatric renal transplant recipients, measured against 5,056 unique protein targets on the ProtoArray platform. The specificity of antibody responses was measured against gene expression levels specific to the kidney, and 2 other randomly selected organs (heart and pancreas), by integrated genomics, employing the mapping of transcription and ProtoArray platform measures, using AILUN. The likelihood of posttransplant non-HLA targets being recognized preferentially in any of 7 microdissected kidney compartments was also examined. In addition to HLA targets, non-HLA immune responses, including anti-MICA antibodies, were detected against kidney compartment-specific antigens, with highest posttransplant recognition for renal pelvis and cortex specific antigens. The compartment specificity of selected antibodies was confirmed by IHC. In conclusion, this study provides an immunogenic and anatomic roadmap of the most likely non-HLA antigens that can generate serological responses after renal transplantation. Correlation of the most significant non-HLA antibody responses with transplant health and dysfunction is currently underway.

    View details for DOI 10.1073/pnas.0900563106

    View details for Web of Science ID 000264278800020

    View details for PubMedID 19251643

  • Data-Driven Methods to Discover Molecular Determinants of Serious Adverse Drug Events CLINICAL PHARMACOLOGY & THERAPEUTICS Chiang, A. P., Butte, A. J. 2009; 85 (3): 259-268


    The dangers of serious adverse drug reactions (SADRs) are well known to clinicians, pharmacologists, and the lay public. Efforts to elucidate the molecular mechanisms behind SADRs have made significant progress through genetics and gene expression measurements. However, as the field of pharmacology adopts the same novel higher-density measurement modalities that have proven successful in other areas of biology, one wonders whether there can be more ways to benefit from the explosion of data created by these tools. The development of analytic tools and algorithms to interpret these biological data to create tools for medicine is central to the field of translational bioinformatics. In this review we introduce some of the types of SADR predictors that are required, and we discuss several databases that are publicly available for the study of SADRs, ranging from clinical to molecular measurements. We also describe recent examples of how bioinformatics methods coupled with data repositories can advance the science of SADRs.

    View details for DOI 10.1038/clpt.2008.274

    View details for Web of Science ID 000263606500012

    View details for PubMedID 19177064

  • The "etiome": identification and clustering of human disease etiological factors BMC BIOINFORMATICS Liu, Y. I., Wise, P. H., Butte, A. J. 2009; 10


    Both genetic and environmental factors contribute to human diseases. Most common diseases are influenced by a large number of genetic and environmental factors, most of which individually have only a modest effect on the disease. Though genetic contributions are relatively well characterized for some monogenetic diseases, there has been no effort at curating the extensive list of environmental etiological factors. From a comprehensive search of the MeSH annotation of MEDLINE articles, we identified 3,342 environmental etiological factors associated with 3,159 diseases. We also identified 1,100 genes associated with 1,034 complex diseases from the NIH Genetic Association Database (GAD), a database of genetic association studies. 863 diseases have both genetic and environmental etiological factors available. Integrating genetic and environmental factors results in the "etiome", which we define as the comprehensive compendium of disease etiology. Clustering of environmental factors may alert clinicians of the risks of added exposures, or synergy in interventions to alter these factors. Clustering of both genetic and environmental etiological factors puts genes in the context of environment in a quantitative manner. In this paper, we obtained a comprehensive list of associations between disease and environmental factors using MeSH annotation of MEDLINE articles. It serves as a summary of current knowledge between etiological factors and diseases. By combining the environmental etiological factors and genetic factors from GAD, we computed the "etiome" profile for 863 diseases. Comparing diseases across these profiles may have utility for clinical medicine, basic science research, and population-based science.

    View details for DOI 10.1186/1471-2105-10-S2-S14

    View details for Web of Science ID 000265602500015

    View details for PubMedID 19208189

  • Ontology-driven indexing of public datasets for translational bioinformatics BMC BIOINFORMATICS Shah, N. H., Jonquet, C., Chiang, A. P., Butte, A. J., Chen, R., Musen, M. A. 2009; 10


    The volume of publicly available genomic scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. These annotations are not mapped to concepts in any ontology, making it difficult to integrate these datasets across repositories. We have previously developed methods to map text-annotations of tissue microarrays to concepts in the NCI thesaurus and SNOMED-CT. In this work we generalize our methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate the utility of our methods by processing annotations of datasets in the Gene Expression Omnibus. We demonstrate that we enable ontology-based querying and integration of tissue and gene expression microarray data. We enable identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data. Based on this work we have built a prototype system for ontology based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts to annotate and index them with concepts from appropriate ontologies. The key functionality of this system is to enable users to locate biomedical data resources related to particular ontology concepts.

    View details for DOI 10.1186/1471-2105-10-S2-S1

    View details for Web of Science ID 000265602500002

    View details for PubMedID 19208184

  • Selected proceedings of the First Summit on Translational Bioinformatics 2008 Introduction BMC BIOINFORMATICS Butte, A. J., Sarkar, I. N., Ramoni, M., Lussier, Y., Troyanskaya, O. 2009; 10
  • Report on EU-USA Workshop: How Systems Biology Can Advance Cancer Research (27 October 2008) MOLECULAR ONCOLOGY Aebersold, R., Auffray, C., Baney, E., Barillot, E., Brazma, A., Brett, C., Brunak, S., Butte, A., Califano, A., Celis, J., Cufer, T., Ferrell, J., Galas, D., Gallahan, D., Gatenby, R., Goldbeter, A., Hace, N., Henney, A., Hood, L., Iyengar, R., Jackson, V., Kallioniemi, O., Klingmueller, U., Kolar, P., Kolch, W., Kyriakopoulou, C., Laplace, F., Lehrach, H., Marcus, F., Matrisian, L., Nolan, G., Pelkmans, L., Potti, A., Sander, C., Seljak, M., Singer, D., Sorger, P., Stunnenberg, H., Superti-Furga, G., Uhlen, M., Vidal, M., Weinstein, J., Wigle, D., Williams, M., Wolkenhauer, O., Zhivotousky, B., Zinovyev, A., Zupan, B. 2009; 3 (1): 9-17


    The main conclusion is that systems biology approaches can indeed advance cancer research, having already proved successful in a very wide variety of cancer-related areas, and are likely to prove superior to many current research strategies. Major points include: Systems biology and computational approaches can make important contributions to research and development in key clinical aspects of cancer and of cancer treatment, and should be developed for understanding and application to diagnosis, biomarkers, cancer progression, drug development and treatment strategies. Development of new measurement technologies is central to successful systems approaches, and should be strongly encouraged. The systems view of disease combined with these new technologies and novel computational tools will over the next 5-20 years lead to medicine that is predictive, personalized, preventive and participatory (P4 medicine). Major initiatives are in progress to gather extremely wide ranges of data for both somatic and germ-line genetic variations, as well as gene, transcript, protein and metabolite expression profiles that are cancer-relevant. Electronic databases and repositories play a central role to store and analyze these data. These resources need to be developed and sustained. Understanding cellular pathways is crucial in cancer research, and these pathways need to be considered in the context of the progression of cancer at various stages. At all stages of cancer progression, major areas require modelling via systems and developmental biology methods including immune system reactions, angiogenesis and tumour progression. A number of mathematical models of an analytical or computational nature have been developed that can give detailed insights into the dynamics of cancer-relevant systems. These models should be further integrated across multiple levels of biological organization in conjunction with analysis of laboratory and clinical data. Biomarkers represent major tools in determining the presence of cancer, its progression and the responses to treatments. There is a need for sets of high-quality annotated clinical samples, enabling comparisons across different diseases and the quantitative simulation of major pathways leading to biomarker development and analysis of drug effects. Education is recognized as a key component in the success of any systems biology programme, especially for applications to cancer research. It is recognized that a balance needs to be found between the need to be interdisciplinary and the necessity of having extensive specialist knowledge in particular areas. A proposal from this workshop is to explore one or more types of cancer over the full scale of their progression, for example glioblastoma or colon cancer. Such an exemplar project would require all the experimental and computational tools available for the generation and analysis of quantitative data over the entire hierarchy of biological information. These tools and approaches could be mobilized to understand, detect and treat cancerous processes and establish methods applicable across a wide range of cancers.

    View details for DOI 10.1016/j.molonc.2008.11.003

    View details for Web of Science ID 000264094700003

    View details for PubMedID 19383362

  • TOWARDS A CYTOKINE-CELL INTERACTION KNOWLEDGEBASE OF THE ADAPTIVE IMMUNE SYSTEM PACIFIC SYMPOSIUM ON BIOCOMPUTING 2009 Shen-Orr, S. S., Goldberger, O., Garten, Y., Rosenberg-Hasson, Y., Lovelace, P. A., Hirschberg, D. L., Altman, R. B., Davis, M. M., Butte, A. J. 2009: 439-450


    The immune system of higher organisms is, by any standard, complex. To date, using reductionist techniques, immunologists have elucidated many of the basic principles of how the immune system functions, yet our understanding is still far from complete. In an era of high throughput measurements, it is already clear that the scientific knowledge we have accumulated has itself grown larger than our ability to cope with it, and thus it is increasingly important to develop bioinformatics tools with which to navigate the complexity of the information that is available to us. Here, we describe ImmuneXpresso, an information extraction system, tailored for parsing the primary literature of immunology and relating it to experimental data. The immune system is very much dependent on the interactions of various white blood cells with each other, either in synaptic contacts, at a distance using cytokines or chemokines, or both. Therefore, as a first approximation, we used ImmuneXpresso to create a literature derived network of interactions between cells and cytokines. Integration of cell-specific gene expression data facilitates cross-validation of cytokine mediated cell-cell interactions and suggests novel interactions. We evaluate the performance of our automatically generated multi-scale model against existing manually curated data, and show how this system can be used to guide experimentalists in interpreting multi-scale, experimental data. Our methodology is scalable and can be generalized to other systems.

    View details for Web of Science ID 000263639700041

    View details for PubMedID 19209721

  • Translational bioinformatics applications in genome medicine GENOME MEDICINE Butte, A. J. 2009; 1

    View details for DOI 10.1186/gm64

    View details for Web of Science ID 000208627000064

  • Translational bioinformatics applications in genome medicine. Genome medicine Butte, A. J. 2009; 1 (6): 64-?


    Although investigators using methodologies in bioinformatics have always been useful in genomic experimentation in analytic, engineering, and infrastructure support roles, only recently have bioinformaticians been able to have a primary scientific role in asking and answering questions on human health and disease. Here, I argue that this shift in role towards asking questions in medicine is now the next step needed for the field of bioinformatics. I outline four reasons why bioinformaticians are newly enabled to drive the questions in primary medical discovery: public availability of data, intersection of data across experiments, commoditization of methods, and streamlined validation. I also list four recommendations for bioinformaticians wishing to get more involved in translational research.

    View details for DOI 10.1186/gm64

    View details for PubMedID 19566916

  • Selected proceedings of the First Summit on Translational Bioinformatics 2008. BMC bioinformatics Butte, A. J., Sarkar, I. N., Ramoni, M., Lussier, Y., Troyanskaya, O. 2009; 10: I1-?

    View details for DOI 10.1186/1471-2105-10-S2-I1

    View details for PubMedID 19208183

  • Infection in the intensive care unit alters physiological networks BMC BIOINFORMATICS Grossman, A. D., Cohen, M. J., Manley, G. T., Butte, A. J. 2009; 10


    Physicians use clinical and physiological data to treat patients every day, and it is essential for treating a patient appropriately. However, medical sources of clinical physiological data are only now starting to find use in bioinformatics research. We collected 29 types of physiological and clinical data on a minute-by-minute basis from trauma patients in the intensive care unit along with whether they contracted an infection during their stay. Dividing the patients into two groups based on this criterion, we determined that the correlational network amongst pairs of physiological variables changes based on whether the patient contracted an infection. Examining the variable pairs with the largest change in correlation across groups reveals potential changes in the way our treatments affect the patient's physiology and in how our bodies react to physiological insults. These findings highlight the usefulness of physiological informatics and suggest new relationships to study while also validating previously reported relationships.

    View details for DOI 10.1186/1471-2105-10-S9-S4

    View details for Web of Science ID 000270371700005

    View details for PubMedID 19761574
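
    The comparison described in the abstract above (pairwise correlations among physiological variables, computed separately per patient group and ranked by how much they change) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the dictionary layout, and the use of plain Pearson correlation are assumptions made for illustration only.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def correlation_changes(group_a, group_b):
    """Rank variable pairs by the absolute change in correlation.

    group_a, group_b: hypothetical dicts mapping a variable name to its
    list of measurements in that patient group (same variables in both).
    Returns tuples (abs_change, var_1, var_2, r_in_a, r_in_b),
    largest change first.
    """
    changes = []
    for u, v in combinations(sorted(group_a), 2):
        r_a = pearson(group_a[u], group_a[v])
        r_b = pearson(group_b[u], group_b[v])
        changes.append((abs(r_a - r_b), u, v, r_a, r_b))
    return sorted(changes, reverse=True)
```

    The pairs at the top of the returned list are the ones whose relationship differs most between the two groups, which is the kind of pair the abstract singles out for closer examination.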



    There is a strong clinical imperative to identify discerning molecular biomarkers of disease to inform diagnosis, prognosis, and treatment. Ideally, such biomarkers would be drawn from peripheral sources non-invasively to reduce costs and lower potential for complication. Advances in high-throughput genomics and proteomics have vastly increased the space of prospective molecular biomarkers. Consequently, the elucidation of molecular biomarkers of clinical importance often entails a genome- or proteome-wide search for candidates. Here we present a novel framework for the identification of disease-specific protein biomarkers through the integration of biofluid proteomes and inter-disease genomic relationships using a network paradigm. We created a blood plasma biomarker network by linking expression-based genomic profiles from 136 diseases to 1,028 detectable blood plasma proteins. We also created a urine biomarker network by linking genomic profiles from 127 diseases to 577 proteins detectable in urine. Through analysis of these molecular biomarker networks, we find that the majority (> 80%) of putative protein biomarkers are linked to multiple disease conditions. Thus, prospective disease-specific protein biomarkers are found in only a small subset of the biofluids proteomes. These findings illustrate the importance of considering shared molecular pathology across diseases when evaluating biomarker specificity. The proposed framework is amenable to integration with complementary network models of biology, which could further constrain the biomarker candidate space, and establish a role for the understanding of multi-scale, inter-disease genomic relationships in biomarker discovery.

    View details for Web of Science ID 000263639700004

    View details for PubMedID 19209693

  • GeneChaser: Identifying all biological and clinical conditions in which genes of interest are differentially expressed BMC BIOINFORMATICS Chen, R., Mallelwar, R., Thosar, A., Venkatasubrahmanyam, S., Butte, A. J. 2008; 9


    The amount of gene expression data in the public repositories, such as NCBI Gene Expression Omnibus (GEO), has grown exponentially, and provides a gold mine for bioinformaticians, but has not been easily accessible by biologists and clinicians. We developed an automated approach to annotate and analyze all GEO data sets, including 1,515 GEO data sets from 231 microarray types across 42 species, and performed 12,658 group versus group comparisons of 24 GEO-specified types. We then built GeneChaser, a web server that enables biologists and clinicians without bioinformatics skills to easily identify biological and clinical conditions in which a gene or set of genes was differentially expressed. GeneChaser displays these conditions in graphs, gives statistical comparisons, allows sort/filter functions and provides access to the original studies. We performed a single gene search for Nanog and a multiple gene search for Nanog, Oct4, Sox2 and LIN28, confirmed their roles in embryonic stem cell development, identified several drugs that regulate their expression, and suggested their potential roles in sex determination, abnormal sperm morphology, malaria infection, and cancer. We demonstrated that GeneChaser is a powerful tool to elucidate information on function, transcriptional regulation, drug-response and clinical implications for genes of interest.

    View details for DOI 10.1186/1471-2105-9-548

    View details for Web of Science ID 000262999600001

    View details for PubMedID 19094235

  • Tissue- and age-specific changes in gene expression during disease induction and progression in NOD mice CLINICAL IMMUNOLOGY Kodama, K., Butte, A. J., Creusot, R. J., Su, L., Sheng, D., Hartnett, M., Iwai, H., Soares, L. R., Fathman, C. G. 2008; 129 (2): 195-201


    Whole genome oligo-microarrays were used to characterize age-dependent and tissue-specific changes in gene expression in pancreatic lymph nodes, spleen, and peripheral blood cells, obtained from up to 8 individual NOD mice at 6 different time points (1.5 to 20 weeks of age), compared to NOD.B10 tissue controls. "Milestone Genes" are genes whose expression was significantly changed (approximately 3 fold) as the result of splicing or changes in transcript level. Milestone Genes were identified among genes within type one diabetes (T1D) susceptibility regions (Idd). Milestone Genes showing uniform patterns of changes in expression at various time points were identified, but the patterns of distribution and kinetics of expression were unique to each tissue. Potential T1D candidate genes were identified among Milestone Genes within Idd regions and/or hierarchical clusters. These studies identified tissue- and age-specific changes in gene expression that may play an important role in the inductive or destructive events of T1D.

    View details for DOI 10.1016/j.clim.2008.07.028

    View details for Web of Science ID 000260553600003

    View details for PubMedID 18801706

  • Expression-based Pathway Signature Analysis (EPSA): Mining publicly available microarray data for insight into human disease BMC MEDICAL GENOMICS Tenenbaum, J. D., Walker, M. G., Utz, P. J., Butte, A. J. 2008; 1


    Publicly available data repositories facilitate the sharing of an ever-increasing amount of microarray data. However, these datasets remain highly underutilized. Reutilizing the data could offer insights into questions and diseases entirely distinct from those considered in the original experimental design. We first analyzed microarray datasets derived from known perturbations of specific pathways using the samr package in R to identify specific patterns of change in gene expression. We refer to these patterns of gene expression alteration as "pathway signatures." We then used Spearman's rank correlation coefficient, a non-parametric measure of correlation, to determine similarities between pathway signatures and disease profiles, and permutation analysis to evaluate false discovery rate. This enabled detection of statistically significant similarity between these pathway signatures and corresponding changes observed in human disease. Finally, we evaluated pathway activation, as indicated by correlation with the pathway signature, as a risk factor for poor prognosis using multiple unrelated, publicly available datasets. We have developed a novel method, Expression-based Pathway Signature Analysis (EPSA). We demonstrate that EPSA is a rigorous computational approach for statistically evaluating the degree of similarity between highly disparate sources of microarray expression data. We also show how EPSA can be used in a number of cases to stratify patients with differential disease prognosis. EPSA can be applied to many different types of datasets in spite of different platforms, different experimental designs, and different species. Applying this method can yield new insights into human disease progression. EPSA enables the use of publicly available data for an entirely new, translational purpose to enable the identification of potential pathways of dysregulation in human disease, as well as potential leads for therapeutic molecular targets.

    View details for DOI 10.1186/1755-8794-1-51

    View details for Web of Science ID 000272706500001

    View details for PubMedID 18937865
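
    The core comparison named in the EPSA abstract above, Spearman's rank correlation between a pathway signature and a disease profile followed by a permutation analysis, can be sketched as follows. This is an illustrative sketch, not the EPSA implementation: all function names are hypothetical, and the permutation step here yields a simple two-sided empirical p-value for a single comparison rather than the false discovery rate estimate the paper describes.

```python
import random


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(signature, profile, n_perm=1000, seed=0):
    """Empirical p-value for |rho| obtained by shuffling the profile."""
    rng = random.Random(seed)
    observed = abs(spearman(signature, profile))
    shuffled = list(profile)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(signature, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

    Here `signature` and `profile` would hold per-gene expression changes for the same ordered list of genes. Working on ranks rather than raw values is what lets measurements from different platforms be compared on a common footing.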

  • Hematopoietic Stem Cell Quiescence Is Maintained by Compound Contributions of the Retinoblastoma Gene Family CELL STEM CELL Viatour, P., Somervaille, T. C., Venkatasubrahmanyam, S., Kogan, S., McLaughlin, M. E., Weissman, I. L., Butte, A. J., Passegue, E., Sage, J. 2008; 3 (4): 416-428


    Individual members of the retinoblastoma (Rb) tumor suppressor gene family serve critical roles in the control of cellular proliferation and differentiation, but the extent of their contributions is masked by redundant and compensatory mechanisms. Here we employed a conditional knockout strategy to simultaneously inactivate all three members, Rb, p107, and p130, in adult hematopoietic stem cells (HSCs). Rb family triple knockout (TKO) mice develop a cell-intrinsic myeloproliferation that originates from hyperproliferative early hematopoietic progenitors and is accompanied by increased apoptosis in lymphoid progenitor populations. Loss of quiescence in the TKO HSC pool is associated with an expansion of these mutant stem cells but also with an enhanced mobilization and an impaired reconstitution potential upon transplantation. The presence of a single p107 allele is sufficient to largely rescue these defects. Thus, Rb family members collectively maintain HSC quiescence and the balance between lymphoid and myeloid cell fates in the hematopoietic system.

    View details for DOI 10.1016/j.stem.2008.07.009

    View details for Web of Science ID 000260149800012

    View details for PubMedID 18940733

  • Enabling integrative genomic analysis of high-impact human diseases through text mining. Pacific Symposium on Biocomputing Dudley, J., Butte, A. J. 2008: 580-591


    Our limited ability to perform large-scale translational discovery and analysis of disease characterizations from public genomic data repositories remains a major bottleneck in efforts to translate genomics experiments to medicine. Through comprehensive, integrative genomic analysis of all available human disease characterizations we gain crucial insight into the molecular phenomena underlying pathogenesis as well as intra- and inter-disease differentiation. Such knowledge is crucial in the development of improved clinical diagnostics and the identification of molecular targets for novel therapeutics. In this study we build on our previous work to realize the next important step in large-scale translational discovery and analysis, which is to automatically identify those genomic experiments in which a disease state is compared to a normal control state. We present an automated text mining method that employs Natural Language Processing (NLP) techniques to automatically identify disease-related experiments in the NCBI Gene Expression Omnibus (GEO) that include measurements for both disease and normal control states. In this manner, we find that 62% of disease-related experiments contain sample subsets that can be automatically identified as normal controls. Furthermore, we calculate that the identified experiments characterize diseases that contribute to 30% of all human disease-related mortality in the United States. This work demonstrates that we now have the necessary tools and methods to initiate large-scale translational bioinformatics inquiry across the broad spectrum of high-impact human disease.

    View details for PubMedID 18229717

  • Novel integration of hospital electronic medical records and gene expression measurements to identify genetic markers of maturation. Pacific Symposium on Biocomputing Chen, D. P., Weber, S. C., Constantinou, P. S., Ferris, T. A., Lowe, H. J., Butte, A. J. 2008: 243-254


    Traditionally, the elucidation of genes involved in maturation and aging has been studied in a temporal fashion by examining gene expression at different time points in an organism's life as well as by knocking out, knocking in, and mutating genes thought to be involved. Here, we propose an in silico method to combine clinical electronic medical record (EMR) data and gene expression measurements in the context of disease to identify genes that may be involved in the process of human maturation and aging. First we show that absolute lymphocyte count may serve as a biomarker for maturation by using statistical methods to compare trends among different clinical laboratory tests in response to an increase in age. We then propose using the rate of decay for absolute lymphocyte count across 12 diseases as a proxy for differences in aging. We correlate the differing rates with gene expression across the same diseases to find maturation/aging related genes. Among the 53 genes with strongest correlations between expression profile and change in rate of decay, we found genes previously implicated in the process of aging, including MGMT (DNA repair), TERF2 (telomere stability), POLD1 (DNA replication and repair), and POLG (mtDNA replication).

    View details for PubMedID 18229690

  • FitSNPs: highly differentially expressed genes are more likely to have variants associated with disease GENOME BIOLOGY Chen, R., Morgan, A. A., Dudley, J., Deshpande, T., Li, L., Kodama, K., Chiang, A. P., Butte, A. J. 2008; 9 (12)


    Candidate single nucleotide polymorphisms (SNPs) from genome-wide association studies (GWASs) were often selected for validation based on their functional annotation, which was inadequate and biased. We propose to use the more than 200,000 microarray studies in the Gene Expression Omnibus to systematically prioritize candidate SNPs from GWASs. We analyzed all human microarray studies from the Gene Expression Omnibus, and calculated the observed frequency of differential expression, which we called differential expression ratio, for every human gene. Analysis conducted in a comprehensive list of curated disease genes revealed a positive association between differential expression ratio values and the likelihood of harboring disease-associated variants. By considering highly differentially expressed genes, we were able to rediscover disease genes with 79% specificity and 37% sensitivity. We successfully distinguished true disease genes from false positives in multiple GWASs for multiple diseases. We then derived a list of functionally interpolating SNPs (fitSNPs) to analyze the top seven loci of Wellcome Trust Case Control Consortium type 1 diabetes mellitus GWASs, rediscovered all type 1 diabetes mellitus genes, and predicted a novel gene (KIAA1109) for an unexplained locus 4q27. We suggest that fitSNPs would work equally well for both Mendelian and complex diseases (being more effective for cancer) and proposed candidate genes to sequence for their association with 597 syndromes with unknown molecular basis. Our study demonstrates that highly differentially expressed genes are more likely to harbor disease-associated DNA variants. FitSNPs can serve as an effective tool to systematically prioritize candidate SNPs from GWASs.

    View details for DOI 10.1186/gb-2008-9-12-r170

    View details for Web of Science ID 000263074100009

    View details for PubMedID 19061490

  • Evaluation and integration of 49 genome-wide experiments and the prediction of previously unknown obesity-related genes BIOINFORMATICS English, S. B., Butte, A. J. 2007; 23 (21): 2910-2917


    Genome-wide experiments only rarely show resounding success in yielding genes associated with complex polygenic disorders. We evaluate 49 obesity-related genome-wide experiments with publicly available findings including microarray, genetics, proteomics and gene knock-down from human, mouse, rat and worm, in terms of their ability to rediscover a comprehensive set of genes previously found to be causally associated or having variants associated with obesity. Individual experiments show poor predictive ability for rediscovering known obesity-associated genes. We show that intersecting the results of experiments significantly improves the sensitivity, specificity and precision of the prediction of obesity-associated genes. We create an integrative model that statistically significantly outperforms all 49 individual genome-wide experiments. We find that genes known to be associated with obesity are significantly implicated in more obesity-related experiments and use this to provide a list of genes that we predict to have the highest likelihood of association for obesity. The approach described here can include any number and type of genome-wide experiments and might be useful for other complex polygenic disorders as well.

    View details for DOI 10.1093/bioinformatics/btm483

    View details for Web of Science ID 000251197700015

    View details for PubMedID 17921495

  • AILUN: reannotating gene expression data automatically NATURE METHODS Chen, R., Li, L., Butte, A. J. 2007; 4 (11): 879-879

    View details for Web of Science ID 000250575700002

    View details for PubMedID 17971777

  • Clinical arrays of laboratory measures, or "clinarrays", built from an electronic health record enable disease subtyping by severity. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Chen, D. P., Weber, S. C., Constantinou, P. S., Ferris, T. A., Lowe, H. J., Butte, A. J. 2007: 115-119


    The severity of diseases has often been assigned by direct observation of a patient and by pathological examination after symptoms have appeared. As we move into the genomic era, the ability to predict disease severity prior to manifestation has improved dramatically due to genomic sequencing and analysis of gene expression microarrays. However, as the severity of diseases can be exacerbated by non genetic factors, the ability to predict disease severity by examining gene expression alone may be inadequate. We propose the creation of a "clinarray" to examine phenotypic expression in the form of clinical laboratory measurements. We demonstrate that the clinarray can be used to distinguish between the severities of patients with cystic fibrosis and those with Crohn's disease by applying unsupervised clustering methods that have been previously applied to microarrays.

    View details for PubMedID 18693809

  • Methodologies for extracting functional pharmacogenomic experiments from international repository. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Lin, Y., Chiang, A., Lin, R., Yao, P., Chen, R., Butte, A. J. 2007: 463-467


    Pharmacogenomic studies are studies designed to elucidate the relationships between drugs and genes on the genomic scale. Given the rapidly increasing amount of microarray data in international repositories, and the implicit drug information contained in PubMed, MeSH and UMLS, we propose automatic methods for identifying drug-related microarray experiments from NCBI GEO by the semantic connections between these data resources. In our study, we find that 51.5% of microarray experiments are associated with at least one PubMed identifier, 22.1% of these contain a MeSH term that relates to the UMLS Pharmacologic Substances semantic sub-tree. Our work shows an abundance of publicly available gene expression data available to enable the discovery of novel drug indications, drug classifications and other pharmacogenomic studies.

    View details for PubMedID 18693879

  • Multiplexed protein array platforms for analysis of autoimmune diseases ANNUAL REVIEW OF IMMUNOLOGY Balboni, I., Chan, S. M., Kattah, M., Tenenbaum, J. D., Butte, A. J., Utz, P. J. 2006; 24: 391-418


    Several proteomics platforms have emerged in the past decade that show great promise for filling in the many gaps that remain from earlier studies of the genome and from the sequencing of the human genome itself. This review describes applications of proteomics technologies to the study of autoimmune diseases. We focus largely on biased technology platforms that are capable of analyzing a large panel of known analytes, as opposed to techniques such as two-dimensional gel electrophoresis (2DIGE) or mass spectroscopy that represent unbiased approaches (as reviewed in 1). At present, the main analytes that can be systematically studied in autoimmunity include autoantibodies, cytokines and chemokines, components of signaling pathways, and cell-surface receptors. We review the most commonly used platforms for such studies, citing important discoveries and limitations that exist. We conclude by reviewing advances in biomedical informatics that will eventually allow the human proteome to be deciphered.

    View details for DOI 10.1146/annurev.immunol.24.021605.090709

    View details for Web of Science ID 000237583300013

    View details for PubMedID 16551254

  • Creation and implications of a phenome-genome network NATURE BIOTECHNOLOGY Butte, A. J., Kohane, I. S. 2006; 24 (1): 55-62


    Although gene and protein measurements are increasing in quantity and comprehensiveness, they do not characterize a sample's entire phenotype in an environmental or experimental context. Here we comprehensively consider associations between components of phenotype, genotype and environment to identify genes that may govern phenotype and responses to the environment. Context from the annotations of gene expression data sets in the Gene Expression Omnibus is represented using the Unified Medical Language System, a compendium of biomedical vocabularies with nearly 1-million concepts. After showing how data sets can be clustered by annotative concepts, we find a network of relations between phenotypic, disease, environmental and experimental contexts as well as genes with differential expression associated with these concepts. We identify novel genes related to concepts such as aging. Comprehensively identifying genes related to phenotype and environment is a step toward the Human Phenome Project.

    View details for DOI 10.1038/nbt1150

    View details for Web of Science ID 000234555800024

    View details for PubMedID 16404398

  • Finding disease-related genomic experiments within an international repository: first steps in translational bioinformatics. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Butte, A. J., Chen, R. 2006: 106-110


    The amount of gene expression data in international repositories has grown exponentially. An important first step in translating the results of genomic experiments into medicine is to relate these genomic experiments to the human diseases they have studied. Unfortunately, repositories for expression data store the crucial annotative details only as free-text, making it manually intractable to link these with human disease. In this study, we sought to find experiments in NCBI GEO that are related to human diseases by making use of annotations relating these experiments with PUBMED identifiers representing the publication in which each experiment was published. In this manner, we find that 35% of PUBMED-associated genomic experiments can be related to a human disease, and that publicly-available data from these genomic experiments can already be related to over 270 human diseases and conditions. This represents an important first step in bridging the world of nucleotides, transcripts and expression with the afflictions of us all.

    View details for PubMedID 17238312

  • Genome-wide analysis of host responses to the Pseudomonas aeruginosa type III secretion system yields synergistic effects CELLULAR MICROBIOLOGY Ichikawa, J. K., English, S. B., Wolfgang, M. C., Jackson, R., Butte, A. J., Lory, S. 2005; 7 (11): 1635-1646


    The type III secretion system (TTSS) is a dedicated bacterial pathogen protein targeting system that directly affects host cell signalling and response pathways. Our goal was to identify host responses to the Pseudomonas aeruginosa effectors, introduced into target cells utilizing the TTSS. We carried out expression profiling of a human lung pneumocyte cell line A549 exposed to isogenic mutants of P. aeruginosa PAK lacking individual or a combination of TTSS components. We then devised a data analysis method to isolate the key responses to specific secreted bacterial effector proteins as well as components of the TTSS machinery. Individually, the effector proteins elicited host responses consistent with their known functions, many of which were cell cycle-related. However, our analysis has shown that the effector proteins elicit a distinct host transcriptional response when present in combination, suggesting a synergistic effect. Furthermore, the pattern of host transcriptional responses is consistent with the pore forming ability of the TTSS needle complex. This study shows that the individual components of the TTSS define an integrated system and that a systems biology approach is required to fully understand the complex interplay between pathogen and host.

    View details for DOI 10.1111/j.1462-5822.2005.00581.x

    View details for Web of Science ID 000232391500011

    View details for PubMedID 16207250

  • Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks BMC BIOINFORMATICS Wolfe, C. J., Kohane, I. S., Butte, A. J. 2005; 6


    Biological processes are carried out by coordinated modules of interacting molecules. As clustering methods demonstrate that genes with similar expression display increased likelihood of being associated with a common functional module, networks of coexpressed genes provide one framework for assigning gene function. This has informed the guilt-by-association (GBA) heuristic, widely invoked in functional genomics. Yet although the idea of GBA is accepted, the breadth of GBA applicability is uncertain. We developed methods to systematically explore the breadth of GBA across a large and varied corpus of expression data to answer the following question: To what extent is the GBA heuristic broadly applicable to the transcriptome and conversely how broadly is GBA captured by a priori knowledge represented in the Gene Ontology (GO)? Our study provides an investigation of the functional organization of five coexpression networks using data from three mammalian organisms. Our method calculates a probabilistic score between each gene and each Gene Ontology category that reflects coexpression enrichment of a GO module. For each GO category we use Receiver Operating Curves to assess whether these probabilistic scores reflect GBA. This methodology applied to five different coexpression networks demonstrates that the signature of guilt-by-association is ubiquitous and reproducible and that the GBA heuristic is broadly applicable across the population of nine hundred Gene Ontology categories. We also demonstrate the existence of highly reproducible patterns of coexpression between some pairs of GO categories. We conclude that GBA has universal value and that transcriptional control may be more modular than previously realized. Our analyses also suggest that methodologies combining coexpression measurements across multiple genes in a biologically-defined module can aid in characterizing gene function or in characterizing whether pairs of functions operate together.

    View details for DOI 10.1186/1471-2105-6-227

    View details for Web of Science ID 000232279100001

    View details for PubMedID 16162296

  • Prediction of preadipocyte differentiation by gene expression reveals role of insulin receptor substrates and necdin NATURE CELL BIOLOGY Tseng, Y. H., Butte, A. J., Kokkotou, E., Yechoor, V. K., Taniguchi, C. M., Kriauciunas, K. M., Cypess, A. M., Niinobe, M., Yoshikawa, K., Patti, M. E., Kahn, C. R. 2005; 7 (6): 601-U22


    The insulin/IGF-1 (insulin-like growth factor 1) signalling pathway promotes adipocyte differentiation via complex signalling networks. Here, using microarray analysis of brown preadipocytes that are derived from wild-type and insulin receptor substrate (Irs) knockout animals that exhibit progressively impaired differentiation, we define 374 genes/expressed-sequence tags whose expression in preadipocytes correlates with the ultimate ability of the cells to differentiate. Many of these genes, including preadipocyte factor-1 (Pref-1) and multiple members of the Wnt signalling pathway, are related to early adipogenic events. Necdin is also markedly increased in Irs knockout cells that cannot differentiate, and knockdown of necdin restores brown adipogenesis with downregulation of Pref-1 and Wnt10a expression. Insulin receptor substrate proteins regulate a necdin-E2F4 interaction that represses peroxisome-proliferator-activated receptor gamma (PPARgamma) transcription via a cyclic AMP response element binding protein (CREB)-dependent pathway. Together these define a key signalling network that is involved in brown preadipocyte determination.

    View details for DOI 10.1038/ncb1259

    View details for Web of Science ID 000229562100014

    View details for PubMedID 15895078

  • A computational model to define the molecular causes of type 2 diabetes mellitus. Diabetes technology & therapeutics Pollard, J., Butte, A. J., Hoberman, S., Joshi, M., Levy, J., Pappo, J. 2005; 7 (2): 323-336


    Metabolic abnormalities associated with type 2 diabetes mellitus (DM2) are caused in part by inadequate insulin action and resulting changes in gene expression in the skeletal muscle. Two recent, independent studies of human skeletal muscle biopsies from ethnically diverse DM2 patients have identified coordinated reductions in the expression of the oxidative phosphorylation (OXPHOS) genes. Whether these reductions are a consequence or a cause of impaired insulin sensitivity remains an open question. To address this question and to define the underlying molecular causes consistent with the expression changes reported in the muscle studies, we created a large-scale computable model to analyze the molecular actions and effects of insulin on muscle gene expression. The model enables computer-aided reasoning using over 210,000 molecular relationships assembled from the DM2 literature. We integrated the data from these muscle biopsy studies into the model and used computer-aided causal reasoning to discover mechanisms that can link alterations in OXPHOS genes to decreases in glucose transport, insulin signaling, and risk factors associated to post-transplant diabetes mellitus. The emerging hypotheses describe biologic effects in DM2 and offer important cues for molecular targeted therapy.

    View details for PubMedID 15857235

  • Genome-scale expression profiling of Hutchinson-Gilford progeria syndrome reveals widespread transcriptional misregulation leading to mesodermal/mesenchymal defects and accelerated atherosclerosis AGING CELL Csoka, A. B., English, S. B., Simkevich, C. P., Ginzinger, D. G., Butte, A. J., Schatten, G. P., Rothman, F. G., Sedivy, J. M. 2004; 3 (4): 235-243


    Hutchinson-Gilford progeria syndrome (HGPS) is a rare genetic disease with widespread phenotypic features resembling premature aging. HGPS was recently shown to be caused by dominant mutations in the LMNA gene, resulting in the in-frame deletion of 50 amino acids near the carboxyl terminus of the encoded lamin A protein. Children with this disease typically succumb to myocardial infarction or stroke caused by severe atherosclerosis at an average age of 13 years. To elucidate further the molecular pathogenesis of this disease, we compared the gene expression patterns of three HGPS fibroblast cell strains heterozygous for the LMNA mutation with three normal, age-matched cell strains. We defined a set of 361 genes (1.1% of the approximately 33,000 genes analysed) that showed at least a 2-fold, statistically significant change. The most prominent categories encode transcription factors and extracellular matrix proteins, many of which are known to function in the tissues severely affected in HGPS. The most affected gene, MEOX2/GAX, is a homeobox transcription factor implicated as a negative regulator of mesodermal tissue proliferation. Thus, at the gene expression level, HGPS shows the hallmarks of a developmental disorder affecting mesodermal and mesenchymal cell lineages. The identification of a large number of genes implicated in atherosclerosis is especially valuable, because it provides clues to pathological processes that can now be investigated in HGPS patients or animal models.

    View details for Web of Science ID 000223138300012

    View details for PubMedID 15268757

  • Conserved mechanisms across development and tumorigenesis revealed by a mouse development perspective of human cancers GENES & DEVELOPMENT Kho, A. T., Zhao, Q., Cai, Z. H., Butte, A. J., Kim, J. Y., Pomeroy, S. L., Rowitch, D. H., Kohane, I. S. 2004; 18 (6): 629-640


    Identification of common mechanisms underlying organ development and primary tumor formation should yield new insights into tumor biology and facilitate the generation of relevant cancer models. We have developed a novel method to project the gene expression profiles of medulloblastomas (MBs)--human cerebellar tumors--onto a mouse cerebellar development sequence: postnatal days 1-60 (P1-P60). Genomically, human medulloblastomas were closest to mouse P1-P10 cerebella, and normal human cerebella were closest to mouse P30-P60 cerebella. Furthermore, metastatic MBs were highly associated with mouse P5 cerebella, suggesting that a clinically distinct subset of tumors is identifiable by molecular similarity to a precise developmental stage. Genewise, down- and up-regulated MB genes segregate to late and early stages of development, respectively. Comparable results for human lung cancer vis-a-vis the developing mouse lung suggest the generalizability of this multiscalar developmental perspective on tumor biology. Our findings indicate both a recapitulation of tissue-specific developmental programs in diverse solid tumors and the utility of tumor characterization on the developmental time axis for identifying novel aspects of clinical and biological behavior.

    View details for DOI 10.1101/gad.1182504

    View details for Web of Science ID 000220794200005

    View details for PubMedID 15075291

  • Quantifying the relationship between co-expression, co-regulation and gene function BMC BIOINFORMATICS Allocco, D. J., Kohane, I. S., Butte, A. J. 2004; 5


    It is thought that genes with similar patterns of mRNA expression and genes with similar functions are likely to be regulated via the same mechanisms. It has been difficult to quantitatively test these hypotheses on a large scale because there has been no general way of determining whether genes share a common regulatory mechanism. Here we use data from a recent genome wide binding analysis in combination with mRNA expression data and existing functional annotations to quantify the likelihood that genes with varying degrees of similarity in mRNA expression profile or function will be bound by a common transcription factor. Genes with strongly correlated mRNA expression profiles are more likely to have their promoter regions bound by a common transcription factor. This effect is present only at relatively high levels of expression similarity. In order for two genes to have a greater than 50% chance of sharing a common transcription factor binder, the correlation between their expression profiles (across the 611 microarrays used in our study) must be greater than 0.84. Genes with similar functional annotations are also more likely to be bound by a common transcription factor. Combining mRNA expression data with functional annotation results in a better predictive model than using either data source alone. We demonstrate how mRNA expression data and functional annotations can be used together to estimate the probability that genes share a common regulatory mechanism. Existing microarray data and known functional annotations are sufficient to identify only a relatively small percentage of co-regulated genes.

    View details for Web of Science ID 000220984200001

    View details for PubMedID 15053845

  • Coordinated reduction of genes of oxidative metabolism in humans with insulin resistance and diabetes: Potential role of PGC1 and NRF1 PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Patti, M. E., Butte, A. J., Crunkhorn, S., Cusi, K., Berria, R., Kashyap, S., Miyazaki, Y., Kohane, I., Costello, M., Saccone, R., Landaker, E. J., Goldfine, A. B., Mun, E., DeFronzo, R., Finlayson, J., Kahn, C. R., Mandarino, L. J. 2003; 100 (14): 8466-8471


    Type 2 diabetes mellitus (DM) is characterized by insulin resistance and pancreatic beta cell dysfunction. In high-risk subjects, the earliest detectable abnormality is insulin resistance in skeletal muscle. Impaired insulin-mediated signaling, gene expression, glycogen synthesis, and accumulation of intramyocellular triglycerides have all been linked with insulin resistance, but no specific defect responsible for insulin resistance and DM has been identified in humans. To identify genes potentially important in the pathogenesis of DM, we analyzed gene expression in skeletal muscle from healthy metabolically characterized nondiabetic (family history negative and positive for DM) and diabetic Mexican-American subjects. We demonstrate that insulin resistance and DM associate with reduced expression of multiple nuclear respiratory factor-1 (NRF-1)-dependent genes encoding key enzymes in oxidative metabolism and mitochondrial function. Although NRF-1 expression is decreased only in diabetic subjects, expression of both PPAR gamma coactivator 1-alpha and-beta (PGC1-alpha/PPARGC1 and PGC1-beta/PERC), coactivators of NRF-1 and PPAR gamma-dependent transcription, is decreased in both diabetic subjects and family history-positive nondiabetic subjects. Decreased PGC1 expression may be responsible for decreased expression of NRF-dependent genes, leading to the metabolic disturbances characteristic of insulin resistance and DM.

    View details for DOI 10.1073/pnas.1032913100

    View details for Web of Science ID 000184222500077

    View details for PubMedID 12832613

  • Reproducibility of gene expression across generations of Affymetrix microarrays BMC BIOINFORMATICS Nimgaonkar, A., Sanoudou, D., Butte, A. J., Haslett, J. N., Kunkel, L. M., Beggs, A. H., Kohane, I. S. 2003; 4


    The development of large-scale gene expression profiling technologies is rapidly changing the norms of biological investigation. But the rapid pace of change itself presents challenges. Commercial microarrays are regularly modified to incorporate new genes and improved target sequences. Although the ability to compare datasets across generations is crucial for any long-term research project, to date no means to allow such comparisons have been developed. In this study the reproducibility of gene expression levels across two generations of Affymetrix GeneChips (HuGeneFL and HG-U95A) was measured. Correlation coefficients were computed for gene expression values across chip generations based on different measures of similarity. Comparing the absolute calls assigned to the individual probe sets across the generations found them to be largely unchanged. We show that experimental replicates are highly reproducible, but that reproducibility across generations depends on the degree of similarity of the probe sets and the expression level of the corresponding transcript.

    View details for Web of Science ID 000184262200001

    View details for PubMedID 12823866

  • PGAGENE: integrating quantitative gene-specific results from the NHLBI Programs for Genomic Applications BIOINFORMATICS Lee, K., Kohane, I. S., Butte, A. J. 2003; 19 (6): 778-779


    Summary: PGAGENE is a web-based gene-specific genomic data search engine, which allows users to search over 5.9 million pieces of collective genetic and genomic data from the NHLBI supported Programs for Genomic Applications. This data includes microarray measurements, SNPs, and mutations, and data may be found using symbols, parts of gene names or products, Affymetrix probe IDs, GenBank accession numbers, UniGene IDs, dbSNP IDs, and others. The PGAGENE indexing agent periodically maps all publicly available gene-specific PGA data onto LocusLink using dynamically generated cross-referencing tables.

    View details for DOI 10.1093/bioinformatics/btg066

    View details for Web of Science ID 000182328400016

    View details for PubMedID 12691993

  • Computerized recruiting for clinical trials in real time ANNALS OF EMERGENCY MEDICINE Weiner, D. L., Butte, A. J., Hibberd, P. L., Fleisher, G. R. 2003; 41 (2): 242-246


    Success of prospective studies, particularly in the emergency department, often depends on immediate identification of eligible patients to ensure timely sample collection and initiation of study interventions. We report use of a real-time automated notification system to identify potential patients for a clinical trial at the time of ED registration on the basis of information routinely collected. We hypothesize that the automated notification system improves the rate of investigator notification. We performed a prospective comparison of the notification rate by the automated notification system compared with that by ED clinicians. In the 11 months before use of the automated notification system, the investigator was notified by ED staff for 56% of 61 potentially eligible patients. During 10 months of using the automated notification system, the investigator was paged by the automated notification system for 84% of 49 potentially eligible patients. The automated notification system improves study investigator notification. Use requires online linked registration, a database, and paging systems. The automated notification system is a potentially valuable tool in the recruitment of patients for clinical trials.

    View details for DOI 10.1067/mem.2003.52

    View details for Web of Science ID 000180698800010

    View details for PubMedID 12548275

  • Comparing expression profiles of genes with similar promoter regions BIOINFORMATICS Park, P. J., Butte, A. J., Kohane, I. S. 2002; 18 (12): 1576-1584


    Gene regulatory elements are often predicted by seeking common sequences in the promoter regions of genes that are clustered together based on their expression profiles. We consider the problem in the opposite direction: we seek to find the genes that have similar promoter regions and determine the extent to which these genes have similar expression profiles. We use the data sets from experiments on Saccharomyces cerevisiae. Our similarity measure for the promoter regions is based on the set of common mapped or putative transcription factor binding sites and other regulatory elements in the upstream region of the genes, as contained in the Saccharomyces cerevisiae Promoter Database. We pair up the genes with high similarity scores and compare their expression levels in time-course experiment data. We find that genes with similar promoter regions on the average have significantly higher correlation, but it can vary widely depending on the genes. This confirms that the presence of similar regulatory elements often does not correspond to similarity in expression profiles and indicates that finding transcription factor binding sites or other regulatory elements starting with the expression patterns may be limited in many cases. Regardless of the correlation, the degree to which the profiles agree under different experimental conditions can be examined to derive hypotheses concerning the role of common regulatory elements. Overall, we find that considering the relationship between the promoter regions and the expression profiles starting with the regulatory elements is a difficult but useful process that can provide valuable insights.

    View details for Web of Science ID 000179951800005

    View details for PubMedID 12490441

  • Analysis of matched mRNA measurements from two different microarray technologies BIOINFORMATICS Kuo, W. P., Jenssen, T. K., Butte, A. J., Ohno-Machado, L., Kohane, I. S. 2002; 18 (3): 405-412


    The existence of several technologies for measuring gene expression makes the question of cross-technology agreement of measurements an important issue. Cross-platform utilization of data from different technologies has the potential to reduce the need to duplicate experiments but requires corresponding measurements to be comparable. A comparison of mRNA measurements of 2895 sequence-matched genes in 56 cell lines from the standard panel of 60 cancer cell lines from the National Cancer Institute (NCI 60) was carried out by calculating correlation between matched measurements and calculating concordance between cluster from two high-throughput DNA microarray technologies, Stanford type cDNA microarrays and Affymetrix oligonucleotide microarrays. In general, corresponding measurements from the two platforms showed poor correlation. Clusters of genes and cell lines were discordant between the two technologies, suggesting that relative intra-technology relationships were not preserved. GC-content, sequence length, average signal intensity, and an estimator of cross-hybridization were found to be associated with the degree of correlation. This suggests gene-specific, or more correctly probe-specific, factors influencing measurements differently in the two platforms, implying a poor prognosis for a broad utilization of gene expression measurements across platforms.

    View details for Web of Science ID 000174708500005

    View details for PubMedID 11934739

  • Further defining housekeeping, or "maintenance," genes Focus on "A compendium of gene expression in normal human tissues" PHYSIOLOGICAL GENOMICS Butte, A. J., Dzau, V. J., Glueck, S. B. 2001; 7 (2): 95-96

    View details for Web of Science ID 000172906700004

    View details for PubMedID 11773595

  • Comparing the similarity of time-series gene expression using signal processing metrics JOURNAL OF BIOMEDICAL INFORMATICS Butte, A. J., Bao, L., Reis, B. Y., Watkins, T. W., Kohane, I. S. 2001; 34 (6): 396-405


    Many algorithms have been used to cluster genes measured by microarray across a time series. Instead of clustering, our goal was to compare all pairs of genes to determine whether there was evidence of a phase shift between them. We describe a technique where gene expression is treated as a discrete time-invariant signal, allowing the use of digital signal-processing tools, including power spectral density, coherence, and transfer gain and phase shift. We used these on a public RNA expression set of 2467 genes measured every 7 min for 119 min and found 18 putative associations. Two of these were known in the biomedical literature and may have been missed using correlation coefficients. Digital signal processing tools can be embedded and enhance existing clustering algorithms.

    View details for DOI 10.1006/jbin.2002.1037

    View details for Web of Science ID 000177556700003

    View details for PubMedID 12198759

  • The Personal Internetworked Notary and Guardian INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS Riva, A., Mandl, K. D., Oh, D. H., Nigrin, D. J., Butte, A., Szolovits, P., Kohane, I. S. 2001; 62 (1): 27-40


    In this paper, we propose a secure, distributed and scalable infrastructure for a lifelong personal medical record system. We leverage existing and widely available technologies, like the Web and public-key cryptography, to define an architecture that allows patients to exercise full control over their medical data. This is done without compromising patients' privacy and the ability of other interested parties (e.g. physicians, health-care institutions, public-health researchers) to access the data when appropriately authorized. The system organizes the information as a tree of encrypted plain-text XML files, in order to ensure platform independence and durability, and uses a role-based authorization scheme to assign access privileges. In addition to the basic architecture, we describe tools to populate the patient's record with data from hospital databases and the first testbed applications we are deploying.

    View details for Web of Science ID 000169550400003

    View details for PubMedID 11340004

  • Strict interpretation of vaccination guidelines with computerized algorithms and improper timing of administered doses PEDIATRIC INFECTIOUS DISEASE JOURNAL Butte, A. J., Shaw, J. S., Bernstein, H. 2001; 20 (6): 561-565


    Frequently changing immunization recommendations may lead to incorrectly administered doses. To determine the incidence and characteristics of inappropriately timed vaccinations. Prospectively collected immunization histories of patients <5 years old from well-child care encounters with pediatric residents in a large urban clinic during a 3-month study period. New patients or those with no immunization history in the medical record were excluded. Paper records were verified before each visit and served as the immunization history. Immunization records were entered into and analyzed by the Massachusetts Immunization Information System with strict interpretation of minimum spacing and age guidelines to identify invalid vaccine doses. Reasons for invalidity were determined by manual review. Invalid doses were cross-referenced with clinic schedule to determine who delivered doses. Inclusion criteria were met by 690 encounters. Charts were available for review before the encounter for 580, containing 6983 total immunizations. Of these 289 (4.1%) administered doses were invalid; 206 of 580 (35.5%) patients had at least one invalid dose. Common invalid doses given were unnecessary poliovirus vaccine around 18 months (n = 66) and second hepatitis B vaccine given too soon after the first (n = 53). All types of providers gave invalid doses; pediatric residents and fellows delivered significantly more (P < 0.01). By strict interpretation of immunization guidelines, many patients were immunized incorrectly. Clinicians should be aware of common errors in vaccine dosing and national guidelines should be simplified.

    View details for Web of Science ID 000169344100002

    View details for PubMedID 11419495

  • Extracting knowledge from dynamics in gene expression JOURNAL OF BIOMEDICAL INFORMATICS Reis, B. Y., Butte, A. S., Kohane, I. S. 2001; 34 (1): 15-27


    Most investigations of coordinated gene expression have focused on identifying correlated expression patterns between genes by examining their normalized static expression levels. In this study, we focus on the dynamics of gene expression by seeking to identify correlated patterns of changes in genetic expression level. In doing so, we build upon methods developed in clinical informatics to detect temporal trends of laboratory and other clinical data. We construct relevance networks from Saccharomyces cerevisiae gene-expression dynamics data and find genes with related functional annotations grouped together. While some of these associations are also found using a standard expression level analysis, many are identified exclusively through the dynamic analysis. These results strongly suggest that the analysis of gene expression dynamics is a necessary and important tool for studying regulatory and other functional relationships among genes. The source code developed for this investigation is freely available to all non-commercial investigators by contacting the authors.

    View details for Web of Science ID 000169901600003

    View details for PubMedID 11376539

  • Determining significant fold differences in gene expression analysis. Pacific Symposium on Biocomputing Butte, A. J., Ye, J., Häring, H. U., Stumvoll, M., White, M. F., Kohane, I. S. 2001: 6-17


    A typical use for RNA expression microarrays is comparing the measurement of gene expression of two groups. There has not been a study reproducing an entire experiment and modeling the distribution of reproducibility of fold differences. Our goal was to create a model of significance for fold differences, then maximize the number of ESTs above that threshold. Multiple strategies were tested to filter out those ESTs contributing to noise, thus decreasing the requirements of what was needed for significance. We found that even though RNA expression levels appear consistent in duplicate measurements, when entire experiments are duplicated, the calculated fold differences are not as consistent. Thus, it is critically important to repeat as many data points as possible, to ensure that genes and ESTs labeled as significant are truly so. We were successfully able to use duplicated expression measurements to model the duplicated fold differences, and to calculate the levels of fold difference needed to reach significance. This approach can be applied to many other experiments to ascertain significance without a priori assumptions.

    View details for PubMedID 11262977

  • Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Butte, A. J., Tamayo, P., Slonim, D., Golub, T. R., Kohane, I. S. 2000; 97 (22): 12182-12186


    In an effort to find gene regulatory networks and clusters of genes that affect cancer susceptibility to anticancer agents, we joined a database with baseline expression levels of 7,245 genes measured by using microarrays in 60 cancer cell lines, to a database with the amounts of 5,084 anticancer agents needed to inhibit growth of those same cell lines. Comprehensive pair-wise correlations were calculated between gene expression and measures of agent susceptibility. Associations weaker than a threshold strength were removed, leaving networks of highly correlated genes and agents called relevance networks. Hypotheses for potential single-gene determinants of anticancer agent susceptibility were constructed. The effect of random chance in the large number of calculations performed was empirically determined by repeated random permutation testing; only associations stronger than those seen in multiply permuted data were used in clustering. We discuss the advantages of this methodology over alternative approaches, such as phylogenetic-type tree clustering and self-organizing maps.

    View details for Web of Science ID 000090071000086

    View details for PubMedID 11027309
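
    The thresholding and permutation steps described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pearson`, `permutation_threshold`, `relevance_links`), the use of Pearson correlation, and the choice to shuffle only the gene profiles are all assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def permutation_threshold(genes, agents, n_permutations=100, seed=0):
    """Strongest |r| seen after shuffling each gene profile across cell lines.

    Assumption for this sketch: shuffling the gene side alone is enough to
    break any real gene-agent association.
    """
    rng = random.Random(seed)
    strongest = 0.0
    for _ in range(n_permutations):
        for gene_values in genes.values():
            shuffled = rng.sample(gene_values, len(gene_values))
            for agent_values in agents.values():
                strongest = max(strongest, abs(pearson(shuffled, agent_values)))
    return strongest


def relevance_links(genes, agents, threshold):
    """Keep only gene-agent pairs whose |r| exceeds the threshold."""
    links = []
    for gene, gene_values in genes.items():
        for agent, agent_values in agents.items():
            r = pearson(gene_values, agent_values)
            if abs(r) > threshold:
                links.append((gene, agent, r))
    return links
```

    In use, one would pass the value returned by `permutation_threshold` as the `threshold` argument of `relevance_links`, so that only associations stronger than any seen in the permuted data survive.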

  • Brief report: Severe hypothyroidism caused by type 3 iodothyronine deiodinase in infantile hemangiomas. NEW ENGLAND JOURNAL OF MEDICINE Huang, S. A., Tu, H. M., Harney, J. W., Venihaki, M., Butte, A. J., Kozakewich, H. P., Fishman, S. J., Larsen, P. R. 2000; 343 (3): 185-189

  • Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pacific Symposium on Biocomputing Butte, A. J., Kohane, I. S. 2000: 418-429


    Increasing numbers of methodologies are available to find functional genomic clusters in RNA expression data. We describe a technique that computes comprehensive pair-wise mutual information for all genes in such a data set. An association with a high mutual information means that one gene is non-randomly associated with another; we hypothesize this means the two are related biologically. By picking a threshold mutual information and using only associations at or above the threshold, we show how this technique was used on a public data set of 79 RNA expression measurements of 2,467 genes to construct 22 clusters, or Relevance Networks. The biological significance of each Relevance Network is explained.

    View details for PubMedID 10902190
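
    The pairwise mutual information and thresholding described in this abstract might look something like the sketch below. It is an illustration only: the equal-width binning, the default of three bins, and the names `discretize`, `mutual_information`, and `relevance_networks` are assumptions for the example and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def discretize(values, n_bins=3):
    """Map each value to an equal-width bin index in [0, n_bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x_bins, y_bins):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    n = len(x_bins)
    px = Counter(x_bins)
    py = Counter(y_bins)
    pxy = Counter(zip(x_bins, y_bins))
    return sum(
        (count / n) * math.log2(count * n / (px[a] * py[b]))
        for (a, b), count in pxy.items()
    )


def relevance_networks(expression, threshold, n_bins=3):
    """Group genes into connected sets linked by mutual information >= threshold."""
    binned = {gene: discretize(values, n_bins) for gene, values in expression.items()}
    neighbors = {gene: set() for gene in expression}
    for g1, g2 in combinations(expression, 2):
        if mutual_information(binned[g1], binned[g2]) >= threshold:
            neighbors[g1].add(g2)
            neighbors[g2].add(g1)

    # Collect connected components; genes with no surviving link are dropped.
    seen, networks = set(), []
    for gene in expression:
        if gene in seen or not neighbors[gene]:
            continue
        stack, component = [gene], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        networks.append(sorted(component))
    return networks
```

    Because mutual information is computed on binned values, the result depends on the number of bins chosen; the abstract does not state which discretization was used.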



    Insulin signaling is initiated at least in part by activation of the insulin receptor tyrosine kinase and subsequent phosphorylation of cellular substrates such as insulin receptor substrate 1 (IRS-1). Previous studies have focused on the role of IRS-1 in the mitogenic actions of insulin. We have now investigated the possible role of IRS-1 in mediating the effect of insulin to stimulate glucose transport in a physiologically relevant insulin target tissue. In this study, we transfected rat adipose cells in primary culture with an antisense ribozyme directed against rat IRS-1. Expression of the ribozyme in these cells caused a 4.4-fold increase in the concentration of insulin required to achieve half-maximal stimulation of the translocation of cotransfected epitope-tagged GLUT4 without changing the maximal insulin response. Overexpression of human IRS-1 increased the basal cell surface GLUT4 to nearly the maximal level in the absence of insulin. When the ribozyme (specific to rat IRS-1) was cotransfected along with human IRS-1, the insulin dose-response curve was shifted to the left when compared with cells transfected with the ribozyme alone. These data provide strong support for the hypothesis that IRS-1 plays a role in insulin-stimulated glucose transport in insulin-responsive cells.

    View details for Web of Science ID A1994PV77200026

    View details for PubMedID 7525563



    Insulin initiates its pleiotropic effects by activating the insulin receptor tyrosine kinase to phosphorylate several intracellular proteins. Recent studies have demonstrated that phosphotyrosine residues bind specifically to proteins that contain src homology 2 (SH2) domains, and that this interaction mediates the regulation of multiple intracellular signaling pathways. This article reviews recent progress in elucidating the detailed pathways that lead from the insulin receptor to the ultimate biologic actions of insulin.

    View details for Web of Science ID A1994PX53400003

    View details for PubMedID 18407232



    Insulin regulates essential pathways for growth, differentiation, and metabolism in vivo. We report a physiologically relevant system for dissecting the molecular mechanisms of insulin signal transduction related to glucose transport. This is an extension of our recently reported method for transfection of DNA into rat adipose cells in primary culture. In the present work, cDNA coding for GLUT4 with an epitope tag (HA1) in the first exofacial loop is used as a reporter gene so that GLUT4 translocation can be studied exclusively in transfected cells. Insulin stimulates a 4.3-fold recruitment of transfected epitope-tagged GLUT4 to the cell surface. Cells cotransfected with the reporter gene and the human insulin receptor gene show an increase in cell surface GLUT4 in the basal state (no insulin) to levels comparable to those seen with maximal insulin stimulation of cells transfected with the reporter gene alone. In contrast, cells overexpressing a naturally occurring tyrosine kinase-deficient mutant insulin receptor (Met1153-->Ile) show no increase in the basal cell surface GLUT4 and no shift in the insulin dose-response curve relative to cells transfected with the reporter gene alone. These results demonstrate that insulin receptor tyrosine kinase activity is essential in insulin-stimulated glucose transport in adipose cells.

    View details for Web of Science ID A1994NR27500077

    View details for PubMedID 8202531



    A new method of quantifying the similarity between genetic sequences is presented. The method makes use of the finding that sequence comparisons expressed in binary vector form have an associated scale-independent parameter, D. This parameter is represented in the function M(S,n) = (N) (en/enD), where S is the vector, n represents the window size which is allowed to vary, N is a constant, and D is the scale-independent measure of homology. By comparing two sequences using this method, a unimodal, symmetric distribution of D values associated with the frameshifted vectors is obtained. The degree of sequence similarity is determined by the distribution of these parameters. A set of sequences of evolutionary interest coding for glyceraldehyde-3-phosphate dehydrogenases and mammalian insulins is compared using this methodology. The results confirm evolutionary tree distances calculated using different procedures. Since a z score can be calculated for each comparison, the method allows for the rapid identification of sequence homologies ranked according to the probability of occurrence. This unique scale-independent measure of similarity allows contrasts and comparisons between any two sequence fragments using all available order information.

    View details for Web of Science ID A1993MM97500003

    View details for PubMedID 8112054


    View details for Web of Science ID A1979GP99900023

    View details for PubMedID 222055

Conference Proceedings

  • Altering physiological networks using drugs: steps towards personalized physiology Grossman, A. D., Cohen, M. J., Manley, G. T., Butte, A. J. BIOMED CENTRAL LTD. 2013

  • Transplantomics and biomarkers in organ transplantation: a report from the first international conference. Sarwal, M. M., Benjamin, J., Butte, A. J., Davis, M. M., Wood, K., Chapman, J. 2011: 379-382

    View details for DOI 10.1097/TP.0b013e3182105fb8

    View details for PubMedID 21278631

  • An integrative method for scoring candidate genes from association studies: application to warfarin dosing Tatonetti, N. P., Dudley, J. T., Sagreiya, H., Butte, A. J., Altman, R. B. BIOMED CENTRAL LTD. 2010


    A key challenge in pharmacogenomics is the identification of genes whose variants contribute to drug response phenotypes, which can include severe adverse effects. Pharmacogenomics GWAS attempt to elucidate genotypes predictive of drug response. However, the size of these studies has severely limited their power and potential application. We propose a novel knowledge integration and SNP aggregation approach for identifying genes impacting drug response. Our SNP aggregation method characterizes the degree to which uncommon alleles of a gene are associated with drug response. We first use pre-existing knowledge sources to rank pharmacogenes by their likelihood to affect drug response. We then define a summary score for each gene based on allele frequencies and train linear and logistic regression classifiers to predict drug response phenotypes. We applied our method to a published warfarin GWAS data set comprising 181 individuals. We find that our method can increase the power of the GWAS to identify both VKORC1 and CYP2C9 as warfarin pharmacogenes, where the original analysis had only identified VKORC1. Additionally, we find that our method can be used to discriminate between low-dose (AUROC=0.886) and high-dose (AUROC=0.764) responders. Our method offers a new route for candidate pharmacogene discovery from pharmacogenomics GWAS, and serves as a foundation for future work in methods for predictive pharmacogenomics.

    View details for DOI 10.1186/1471-2105-11-S9-S9

    View details for Web of Science ID 000290218700009

    View details for PubMedID 21044367

  • Latent physiological factors of complex human diseases revealed by independent component analysis of clinarrays Chen, D. P., Dudley, J. T., Butte, A. J. BIOMED CENTRAL LTD. 2010


    Diagnosis and treatment of patients in the clinical setting is often driven by known symptomatic factors that distinguish one particular condition from another. Treatment based on noticeable symptoms, however, is limited to the types of clinical biomarkers collected, and is prone to overlooking dysfunctions in physiological factors not easily evident to medical practitioners. We used a vector-based representation of patient clinical biomarkers, or clinarrays, to search for latent physiological factors that underlie human diseases directly from clinical laboratory data. Knowledge of these factors could be used to improve assessment of disease severity and help to refine strategies for diagnosis and monitoring disease progression. Applying Independent Component Analysis on clinarrays built from patient laboratory measurements revealed both known and novel concomitant physiological factors for asthma, types 1 and 2 diabetes, cystic fibrosis, and Duchenne muscular dystrophy. Serum sodium was found to be the most significant factor for both type 1 and type 2 diabetes, and was also significant in asthma. TSH3, a measure of thyroid function, and blood urea nitrogen, indicative of kidney function, were factors unique to type 1 diabetes respective to type 2 diabetes. Platelet count was significant across all the diseases analyzed. The results demonstrate that large-scale analyses of clinical biomarkers using unsupervised methods can offer novel insights into the pathophysiological basis of human disease, and suggest novel clinical utility of established laboratory measurements.

    View details for DOI 10.1186/1471-2105-11-S9-S4

    View details for Web of Science ID 000290218700004

    View details for PubMedID 21044362

  • Comparison of multiplex meta analysis techniques for understanding the acute rejection of solid organ transplants Morgan, A. A., Khatri, P., Jones, R. H., Sarwal, M. M., Butte, A. J. BIOMED CENTRAL LTD. 2010


    Combining the results of studies using highly parallelized measurements of gene expression such as microarrays and RNAseq offers unique challenges in meta analysis. Motivated by a need for a deeper understanding of organ transplant rejection, we combine the data from five separate studies to compare acute rejection versus stability after solid organ transplantation, and use this data to examine approaches to multiplex meta analysis. We demonstrate that a commonly used parametric effect size estimate approach and a commonly used non-parametric method give very different results in prioritizing genes. The parametric method providing a meta effect estimate was superior at ranking genes based on our gold-standard of identifying immune response genes in the transplant rejection datasets. Different methods of multiplex analysis can give substantially different results. The method which is best for any given application will likely depend on the particular domain, and it remains for future work to see if any one method is consistently better at identifying important biological signal across gene expression experiments.

    View details for DOI 10.1186/1471-2105-11-S9-S6

    View details for Web of Science ID 000290218700006

    View details for PubMedID 21044364

  • Allogeneic antibodies identify GVL targets CHAF1b and NuSAP1 in AML patients Wadia, P. P., Coram, M. A., Butte, A. J., Miklos, L. B. AMER SOC HEMATOLOGY. 2007: 57A-57A

  • Challenges in bioinformatics: infrastructure, models and analytics. Butte, A. J. 2001: 159-160

    View details for PubMedID 11403022

  • Enrolling patients into clinical trials faster using RealTime Recuiting (TM) Butte, A. J., Weinstein, D. A., Kohane, I. S. HANLEY & BELFUS INC. 2000: 111-115


    Previous work has been done on both optimizing the clinical trials process, and on sending critical laboratory results and decision support through paging systems. We report the first integration of both these solutions, focusing on improving the clinical trial recruitment process. We describe a clinical trial needing a real-time method of recruiting patients in an unbiased manner, quickly enough that study tests can be obtained before patients leave or samples are discarded. The report describes how the ten currently recruited patients were found and how diagnoses of potentially life-threatening disorders are being made.

    View details for Web of Science ID 000170207500024

    View details for PubMedID 11079855

  • Unsupervised knowledge discovery in medical databases using relevance networks Butte, A. J., Kohane, I. S. BMJ PUBLISHING GROUP. 1999: 711-715


    Increasing amounts of data exist in medical databases. When multiple variables are measured for each case in a data set, there exists an underlying relationship between all pairs of variables, some highly correlated and some not. This report describes a technique that creates networks of related variables, or relevance networks, by dropping links with either too weak correlation or too few data points to defend the relationship. The paper describes how applying this methodology to the domain of laboratory results allows the generation of meaningful relations between types of laboratory tests. These relations could be used as the basis of further exploratory research.

    View details for Web of Science ID 000170207300146

    View details for PubMedID 10566452
