Disease Variant Landscape of a Large Multiethnic Population of Moyamoya Patients by Exome Sequencing
G3-GENES GENOMES GENETICS
2016; 6 (1): 41-49
Mouse models rarely mimic the transcriptome of human neurodegenerative diseases: A systematic bioinformatics-based critique of preclinical models
EUROPEAN JOURNAL OF PHARMACOLOGY
2015; 759: 101-117
Aging-Like Changes in the Transcriptome of Irradiated Microglia
2015; 63 (5): 754-767
Translational research for neurodegenerative disease depends intimately upon animal models. Unfortunately, promising therapies developed using mouse models mostly fail in clinical trials, highlighting uncertainty about how well mouse models mimic human neurodegenerative disease at the molecular level. We compared the transcriptional signature of neurodegeneration in mouse models of Alzheimer's disease (AD), Parkinson's disease (PD), Huntington's disease (HD) and amyotrophic lateral sclerosis (ALS) to human disease. In contrast to aging, which demonstrated a conserved transcriptome between humans and mice, only 3 of 19 animal models showed significant enrichment for gene sets comprising the most dysregulated up- and down-regulated human genes. Spearman's correlation analysis revealed even healthy human aging to be more closely related to human neurodegeneration than any mouse model of AD, PD, ALS or HD. Remarkably, mouse models frequently upregulated stress response genes that were consistently downregulated in human diseases. Among potential alternate models of neurodegeneration, mouse prion disease outperformed all other disease-specific models. Even among the best available animal models, conserved differences between mouse and human transcriptomes were found across multiple animal model versus human disease comparisons, surprisingly, even including aging. Relative to mouse models, mouse disease signatures demonstrated consistent trends toward preserved mitochondrial function, protein catabolism, DNA repair responses, and chromatin maintenance. These findings suggest a more complex and multifactorial pathophysiology in human neurodegeneration than is captured through standard animal models, and suggest that even among conserved physiological processes such as aging, mice are less prone to exhibit neurodegeneration-like changes.
This work may help explain the poor track record of mouse-based translational therapies for neurodegeneration and provides a path forward to critically evaluate and improve animal models of human disease.
View details for DOI 10.1016/j.ejphar.2015.03.021
View details for Web of Science ID 000355362500012
View details for PubMedID 25814260
Prediction of Multiple Infections After Severe Burn Trauma A Prospective Cohort Study
ANNALS OF SURGERY
2015; 261 (4): 781-792
Whole brain irradiation remains important in the management of brain tumors. Although necessary for improving survival outcomes, cranial irradiation also results in cognitive decline in long-term survivors. A chronic inflammatory state characterized by microglial activation has been implicated in radiation-induced brain injury. We here provide the first comprehensive transcriptional profile of irradiated microglia. Fluorescence-activated cell sorting was used to isolate CD11b+ microglia from the hippocampi of C57BL/6 and Balb/c mice 1 month after 10 Gy cranial irradiation. Affymetrix gene expression profiles were evaluated using linear modeling and rank product analyses. One month after irradiation, a conserved irradiation signature across strains was identified, comprising 448 and 85 differentially up- and downregulated genes, respectively. Gene set enrichment analysis demonstrated enrichment for inflammation, including M1 macrophage-associated genes, but also an unexpected enrichment for extracellular matrix and blood coagulation-related gene sets, in contrast to previously described microglial states. Weighted gene coexpression network analysis confirmed these findings and further revealed alterations in mitochondrial function. The RNA-seq transcriptome of microglia 24-h postradiation proved similar to the 1-month transcriptome, but additionally featured alterations in apoptotic and lysosomal gene expression. Reanalysis of published aging mouse microglia transcriptome data demonstrated striking similarity to the 1-month irradiated microglia transcriptome, suggesting that shared mechanisms may underlie aging and chronic irradiation-induced cognitive decline. GLIA 2015;63:754-767.
View details for DOI 10.1002/glia.22782
View details for Web of Science ID 000351622600003
View details for PubMedID 25690519
To develop predictive models for early triage of burn patients based on hypersusceptibility to repeated infections. Infection remains a major cause of mortality and morbidity after severe trauma, demanding new strategies to combat infections. Models for infection prediction are lacking. Secondary analysis of 459 burn patients (≥16 years old) with 20% or more total body surface area burns recruited from 6 US burn centers. We compared blood transcriptomes with a 180-hour cutoff on the injury-to-transcriptome interval of 47 patients (≤1 infection episode) to those of 66 hypersusceptible patients [multiple (≥2) infection episodes (MIE)]. We used LASSO regression to select biomarkers and multivariate logistic regression to build models, the accuracy of which was assessed by area under receiver operating characteristic curve (AUROC) and cross-validation. Three predictive models were developed using covariates of (1) clinical characteristics; (2) expression profiles of 14 genomic probes; (3) combining (1) and (2). The genomic and clinical models were highly predictive of MIE status [AUROCGenomic = 0.946 (95% CI: 0.906-0.986); AUROCClinical = 0.864 (CI: 0.794-0.933); AUROCGenomic/AUROCClinical P = 0.044]. The combined model has an increased AUROCCombined of 0.967 (CI: 0.940-0.993) compared with the individual models (AUROCCombined/AUROCClinical P = 0.0069). Hypersusceptible patients show early alterations in immune-related signaling pathways, epigenetic modulation, and chromatin remodeling. Early triage of burn patients more susceptible to infections can be made using clinical characteristics and/or genomic signatures. The genomic signature suggests that new insights into the pathophysiology of hypersusceptibility to infection may lead to novel potential therapeutic or prophylactic targets.
View details for DOI 10.1097/SLA.0000000000000759
View details for Web of Science ID 000351679500049
View details for PubMedID 24950278
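The AUROC used in the burn-infection abstract to score each model can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The sketch below computes it by direct pairwise comparison in Python. The function name, scores, and labels are illustrative assumptions, not data or code from the study, and the study's LASSO and logistic regression steps are not reproduced here.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: predicted risk scores (higher means more likely a case).
    labels: 1 for cases, 0 for controls. Tied pairs count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical model outputs for six patients (made-up values).
example_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
example_labels = [1, 1, 0, 1, 0, 0]
print(auroc(example_scores, example_labels))
```

The pairwise form is quadratic in the number of patients, which is acceptable for cohorts of this size; a rank-based implementation would be preferable for much larger data.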
Multiplex meta-analysis of medulloblastoma expression studies with external controls.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
Moyamoya disease (MMD) is a rare disorder characterized by cerebrovascular occlusion and development of hemorrhage-prone collateral vessels. Approximately 10-12% of cases are familial, with a presumed low penetrance autosomal dominant pattern of inheritance. Diagnosis commonly occurs only after clinical presentation. The recent identification of the RNF213 founder mutation (p.R4810K) in the Asian population has made a significant contribution, but the etiology of this disease remains unclear. To further develop the variant landscape of MMD, we performed high-depth whole exome sequencing of 125 unrelated, predominantly nonfamilial, ethnically diverse MMD patients in parallel with 125 internally sequenced, matched controls using the same exome and analysis platform. Three subpopulations were established: Asian, Caucasian, and non-RNF213 founder mutation cases. We provided additional support for the previously observed RNF213 founder mutation (p.R4810K) in Asian cases (P = 6.01×10(-5)) that was enriched among East Asians compared to Southeast Asian and Pacific Islander cases (P = 9.52×10(-4)) and was absent in all Caucasian cases. The most enriched variant in Caucasian (P = 7.93×10(-4)) and non-RNF213 founder mutation (P = 1.51×10(-3)) cases was ZXDC (p.P562L), a gene involved in MHC Class II activation. Collapsing variant methodology ranked OBSCN, a gene involved in myofibrillogenesis, as most enriched in Caucasian (P = 1.07×10(-4)) and non-RNF213 founder mutation cases (P = 5.31×10(-5)). These findings further support the East Asian origins of the RNF213 (p.R4810K) variant and more fully describe the genetic landscape of multiethnic MMD, revealing novel, alternative candidate variants and genes that may be important in MMD etiology and diagnosis.
View details for DOI 10.1534/g3.115.020321
View details for PubMedID 26530418
Variant priorization and analysis incorporating problematic regions of the genome.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
We propose and discuss a method for doing gene expression meta-analysis (multiple datasets) across multiplex measurement modalities measuring the expression of many genes simultaneously (e.g. microarrays and RNAseq) using external control samples and a method of heterogeneity detection to identify and filter on comparable gene expression measurements. We demonstrate this approach on publicly available gene expression datasets from samples of medulloblastoma and normal cerebellar tissue and identify some potential new targets in the treatment of medulloblastoma.
View details for PubMedID 24297537
Integrated multi-cohort transcriptional meta-analysis of neurodegenerative diseases.
Acta neuropathologica communications
2014; 2: 93-?
In case-control studies of rare Mendelian disorders and complex diseases, the power to detect variant and gene-level associations of a given effect size is limited by the size of the study sample. Paradoxically, low statistical power may increase the likelihood that a statistically significant finding is also a false positive. The prioritization of variants based on call quality, putative effects on protein function, the predicted degree of deleteriousness, and allele frequency is often used as a mechanism for reducing the occurrence of false positives, while preserving the set of variants most likely to contain true disease associations. We propose that specificity can be further improved by considering errors that are specific to the regions of the genome being sequenced. These problematic regions (PRs) are identified a priori and are used to down-weight constitutive variants in a case-control analysis. Using samples drawn from 1000-Genomes, we illustrate the utility of PRs in identifying true variant and gene associations using a case-control study on a known Mendelian disease, cystic fibrosis (CF).
View details for PubMedID 24297554
Integrating multiple 'omics' analyses identifies serological protein biomarkers for preeclampsia
A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation
JOURNAL OF EXPERIMENTAL MEDICINE
2013; 210 (11): 2205-2221
Introduction: Neurodegenerative diseases share common pathologic features including neuroinflammation, mitochondrial dysfunction and protein aggregation, suggesting common underlying mechanisms of neurodegeneration. We undertook a meta-analysis of public gene expression data for neurodegenerative diseases to identify a common transcriptional signature of neurodegeneration. Results: Using 1,270 post-mortem central nervous system tissue samples from 13 patient cohorts covering four neurodegenerative diseases, we identified 243 differentially expressed genes, which were similarly dysregulated in 15 additional patient cohorts of 205 samples including seven neurodegenerative diseases. This gene signature correlated with histologic disease severity. Metallothioneins featured prominently among differentially expressed genes, and functional pathway analysis identified specific convergent themes of dysregulation. MetaCore network analyses revealed various novel candidate hub genes (e.g. STAU2). Genes associated with M1-polarized macrophages and reactive astrocytes were strongly enriched in the meta-analysis data. Evaluation of genes enriched in neurons revealed 70 down-regulated genes, over half not previously associated with neurodegeneration. Comparison with aging brain data (3 patient cohorts, 221 samples) revealed 53 of these to be unique to neurodegenerative disease, many of which are strong candidates to be important in neuropathogenesis (e.g. NDN, NAP1L2). ENCODE ChIP-seq analysis predicted common upstream transcriptional regulators not associated with normal aging (REST, RBBP5, SIN3A, SP2, YY1, ZNF143, IKZF1).
Finally, we removed genes common to neurodegeneration from disease-specific gene signatures, revealing uniquely robust immune response and JAK-STAT signaling in amyotrophic lateral sclerosis. Conclusions: Our results implicate pervasive bioenergetic deficits, M1-type microglial activation and gliosis as unifying themes of neurodegeneration, and identify numerous novel genes associated with neurodegenerative processes.
View details for DOI 10.1186/s40478-014-0093-y
View details for PubMedID 25187168
Differentiating the roles of STAT5B and STAT5A in human CD4(+) T cells
2013; 148 (2): 227-236
Using meta-analysis of eight independent transplant datasets (236 graft biopsy samples) from four organs, we identified a common rejection module (CRM) consisting of 11 genes that were significantly overexpressed in acute rejection (AR) across all transplanted organs. The CRM genes could diagnose AR with high specificity and sensitivity in three additional independent cohorts (794 samples). In another two independent cohorts (151 renal transplant biopsies), the CRM genes correlated with the extent of graft injury and predicted future injury to a graft using protocol biopsies. Inferred drug mechanisms from the literature suggested that two FDA-approved drugs (atorvastatin and dasatinib), approved for nontransplant indications, could regulate specific CRM genes and reduce the number of graft-infiltrating cells during AR. We treated mice with HLA-mismatched mouse cardiac transplant with atorvastatin and dasatinib and showed reduction of the CRM genes, significant reduction of graft-infiltrating cells, and extended graft survival. We further validated the beneficial effect of atorvastatin on graft survival by retrospective analysis of electronic medical records of a single-center cohort of 2,515 renal transplant patients followed for up to 22 yr. In conclusion, we identified a CRM in transplantation that provides new opportunities for diagnosis, drug repositioning, and rational drug design.
View details for DOI 10.1084/jem.20122709
View details for Web of Science ID 000325997600007
View details for PubMedID 24127489
Whole genome sequencing in support of wellness and health maintenance
Analysis of the Genetic Basis of Disease in the Context of Worldwide Human Relationships and Migration
2013; 9 (5)
STAT5A and STAT5B are highly homologous proteins whose distinctive roles in human immunity remain unclear. However, STAT5A sufficiency cannot compensate for STAT5B defects, and human STAT5B deficiency, a rare autosomal recessive primary immunodeficiency, is characterized by chronic lung disease, growth failure and autoimmunity associated with regulatory T cell (Treg) reduction. We therefore hypothesized that STAT5A and STAT5B play unique roles in CD4(+) T cells. Upon knocking down STAT5A or STAT5B in human primary T cells, we found differentially regulated expression of FOXP3 and IL-2R in STAT5B knockdown T cells and down-regulated Bcl-X only in STAT5A knockdown T cells. Functional ex vivo studies in homozygous STAT5B-deficient patients showed reduced FOXP3 expression with impaired regulatory function of STAT5B-null Treg cells, which also showed an increased memory phenotype. These results indicate that STAT5B and STAT5A act partly as non-redundant transcription factors and that STAT5B is more critical for Treg maintenance and function in humans.
View details for DOI 10.1016/j.clim.2013.04.014
View details for Web of Science ID 000322101300009
Proline: The Distribution, Frequency, Positioning, and Common Functional Roles of Proline and Polyproline Sequences in the Human Proteome
2013; 8 (1)
2013; 11: 236-?
Genetic diversity across different human populations can enhance understanding of the genetic basis of disease. We calculated the genetic risk of 102 diseases in 1,043 unrelated individuals across 51 populations of the Human Genome Diversity Panel. We found that genetic risk for type 2 diabetes and pancreatic cancer decreased as humans migrated toward East Asia. In addition, biliary liver cirrhosis, alopecia areata, bladder cancer, inflammatory bowel disease, membranous nephropathy, systemic lupus erythematosus, systemic sclerosis, ulcerative colitis, and vitiligo have undergone genetic risk differentiation. This analysis represents a large-scale attempt to characterize genetic risk differentiation in the context of migration. We anticipate that our findings will enable detailed analysis pertaining to the driving forces behind genetic risk differentiation.
View details for DOI 10.1371/journal.pgen.1003447
View details for Web of Science ID 000320030000003
View details for PubMedID 23717210
Preeclampsia (PE) is a pregnancy-related vascular disorder which is the leading cause of maternal morbidity and mortality. We sought to identify novel serological protein markers to diagnose PE with a multi-'omics' based discovery approach. Seven previous placental expression studies were combined for a multiplex analysis, and in parallel, two-dimensional gel electrophoresis was performed to compare serum proteomes in PE and control subjects. The combined biomarker candidates were validated with available ELISA assays using gestational age-matched PE (n=32) and control (n=32) samples. With the validated biomarkers, a genetic algorithm was then used to construct and optimize biomarker panels in PE assessment. In addition to the previously identified biomarkers, the angiogenic and antiangiogenic factors (soluble fms-like tyrosine kinase (sFlt-1) and placental growth factor (PIGF)), we found 3 up-regulated and 6 down-regulated biomarkers in PE sera. Two optimal biomarker panels were developed for early and late onset PE assessment, respectively. Both early and late onset PE diagnostic panels, constructed with our PE biomarkers, were superior to the sFlt-1/PIGF ratio in PE discrimination. The functional significance of these PE biomarkers and their associated pathways was analyzed, which may provide new insights into the pathogenesis of PE.
View details for DOI 10.1186/1741-7015-11-236
View details for PubMedID 24195779
FoxO6 regulates memory consolidation and synaptic function
GENES & DEVELOPMENT
2012; 26 (24): 2780-2801
Proline is an anomalous amino acid. Its nitrogen atom is covalently locked within a ring, thus it is the only proteinogenic amino acid with a constrained phi angle. Sequences of three consecutive prolines can fold into polyproline helices, structures that join alpha helices and beta pleats as architectural motifs in protein configuration. Triproline helices are participants in protein-protein signaling interactions. Longer spans of repeat prolines also occur, containing as many as 27 consecutive proline residues. Little is known about the frequency, positioning, and functional significance of these proline sequences. Therefore we have undertaken a systematic bioinformatics study of proline residues in proteins. We analyzed the distribution and frequency of 687,434 proline residues among 18,666 human proteins, identifying single residues, dimers, trimers, and longer repeats. Proline accounts for 6.3% of the 10,882,808 protein amino acids. Of all proline residues, 4.4% are in trimers or longer spans. We detected patterns that influence function based on proline location, spacing, and concentration. We propose a classification based on proline-rich, polyproline-rich, and proline-poor status. Whereas singlet proline residues are often found in proteins that display recurring architectural patterns, trimers or longer proline sequences tend to be associated with the absence of repetitive structural motifs. Spans of 6 or more are associated with DNA/RNA processing, actin, and developmental processes. We also suggest a role for proline in Kruppel-type zinc finger protein control of DNA expression, and in the nucleation and translocation of actin by the formin complex.
View details for DOI 10.1371/journal.pone.0053785
View details for PubMedID 23372670
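The proline abstract classifies residues into singlets, dimers, trimers, and longer runs, and reports the share of prolines that fall in trimers or longer. A minimal Python sketch of that kind of tally follows, assuming one-letter amino acid sequences in which proline is written as "P". The function names and the toy sequence are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter


def proline_run_lengths(sequence):
    """Count maximal runs of consecutive prolines ('P'), keyed by run length."""
    return Counter(len(m.group()) for m in re.finditer(r"P+", sequence.upper()))


def fraction_in_trimers_or_longer(sequence):
    """Fraction of all proline residues that sit in runs of three or more."""
    runs = proline_run_lengths(sequence)
    total = sum(length * count for length, count in runs.items())
    if total == 0:
        return 0.0
    in_long_runs = sum(
        length * count for length, count in runs.items() if length >= 3
    )
    return in_long_runs / total


# Toy sequence (not a real protein): contains runs of 1, 2, 3 and 5 prolines.
toy_sequence = "MPAPPGPPPLKPPPPPA"
print(dict(proline_run_lengths(toy_sequence)))
print(fraction_in_trimers_or_longer(toy_sequence))
```

Applied across a proteome, the same counters could be summed per protein to reproduce summary figures of the sort quoted in the abstract.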
Clinical utility of sequence-based genotype compared with that derivable from genotyping arrays
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2012; 19 (E1): E21-E27
The FoxO family of transcription factors is known to slow aging downstream from the insulin/IGF (insulin-like growth factor) signaling pathway. The most recently discovered FoxO isoform in mammals, FoxO6, is highly enriched in the adult hippocampus. However, the importance of FoxO factors in cognition is largely unknown. Here we generated mice lacking FoxO6 and found that these mice display normal learning but impaired memory consolidation in contextual fear conditioning and novel object recognition. Using stereotactic injection of viruses into the hippocampus of adult wild-type mice, we found that FoxO6 activity in the adult hippocampus is required for memory consolidation. Genome-wide approaches revealed that FoxO6 regulates a program of genes involved in synaptic function upon learning in the hippocampus. Consistently, FoxO6 deficiency results in decreased dendritic spine density in hippocampal neurons in vitro and in vivo. Thus, FoxO6 may promote memory consolidation by regulating a program coordinating neuronal connectivity in the hippocampus, which could have important implications for physiological and pathological age-dependent decline in memory.
View details for DOI 10.1101/gad.208926.112
View details for Web of Science ID 000312775700011
View details for PubMedID 23222102
Expression-based genome-wide association study links the receptor CD44 in adipose tissue with type 2 diabetes
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2012; 109 (18): 7049-7054
We investigated the common-disease relevant information obtained from sequencing compared with that reported from genotyping arrays. Using 187 publicly available individual human genomes, we constructed genomic disease risk summaries based on 55 common diseases with reported gene-disease associations in the research literature using two different risk models, one based on the product of likelihood ratios and the other on the allelic variant with the maximum associated disease risk. We also constructed risk profiles based on the single nucleotide polymorphisms (SNPs) of these individuals that could be measured or imputed from two common genotyping array platforms. We show that the model risk predictions derived from sequencing differ substantially from those obtained from the SNPs measured on commercially available genotyping arrays for several different non-monogenic diseases, although high density genotyping arrays give identical results for many diseases. Conclusions: Our approach may be used to compare the ability of different platforms to probe known genetic risks disease by disease.
View details for DOI 10.1136/amiajnl-2011-000737
View details for Web of Science ID 000314151400005
View details for PubMedID 22718036
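The two risk models named in the sequencing-versus-array abstract, one multiplying likelihood ratios and one keeping only the variant with the largest associated risk, can be illustrated with a short Python sketch. It assumes that per-variant likelihood ratios are independent and are applied to pre-test odds in the usual Bayesian way; that framing, the function names, and all numbers are assumptions for illustration and may differ from the authors' implementation.

```python
from math import prod


def posttest_probability(pretest_probability, likelihood_ratios):
    """Product model: multiply pre-test odds by every likelihood ratio.

    Assumes the variants contribute independently (an illustrative assumption).
    """
    pre_odds = pretest_probability / (1.0 - pretest_probability)
    post_odds = pre_odds * prod(likelihood_ratios)
    return post_odds / (1.0 + post_odds)


def max_risk_ratio(likelihood_ratios):
    """Maximum model: keep only the single largest ratio (1.0 if none given)."""
    return max(likelihood_ratios, default=1.0)


# Made-up likelihood ratios for three variants and a made-up 20% pre-test risk.
example_ratios = [2.0, 0.5, 1.5]
print(posttest_probability(0.2, example_ratios))
print(max_risk_ratio(example_ratios))
```

The product model lets protective variants (ratios below 1) offset risk variants, whereas the maximum model ignores them, which is one way the two summaries can diverge for the same genome.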
Type 2 Diabetes Risk Alleles Demonstrate Extreme Directional Differentiation among Human Populations, Compared to Other Diseases
2012; 8 (4): 100-115
Type 2 diabetes (T2D) is a complex, polygenic disease affecting nearly 300 million people worldwide. T2D is primarily characterized by insulin resistance, and growing evidence has indicated the causative link between adipose tissue inflammation and the development of insulin resistance. Genetic association studies have successfully revealed a number of important genes consistently associated with T2D to date. However, these robust T2D-associated genes do not fully elucidate the mechanisms underlying the development and progression of the disease. Here, we report an alternative approach, gene expression-based genome-wide association study (eGWAS): searching for genes repeatedly implicated in functional microarray experiments (often publicly available). We performed an eGWAS across 130 independent experiments (1,175 T2D case-control microarrays in total) to find additional genes implicated in the molecular pathogenesis of T2D and identified the immune-cell receptor CD44 as our top candidate (P = 8.5 × 10(-20)). We found CD44 deficiency in a diabetic mouse model ameliorates insulin resistance and adipose tissue inflammation and also found that anti-CD44 antibody treatment decreases blood glucose levels and adipose tissue macrophage accumulation in a high-fat diet-fed mouse model. Further, in humans, we observed CD44 is expressed in inflammatory cells in obese adipose tissue and discovered serum CD44 levels were positively correlated with insulin resistance and glycemic control. CD44 likely plays a causative role in the development of adipose tissue inflammation and insulin resistance in rodents and humans. Genes repeatedly implicated in publicly available experimental data may have unique functionally important roles in T2D and other complex diseases.
View details for DOI 10.1073/pnas.1114513109
View details for Web of Science ID 000303602100060
View details for PubMedID 22499789
Multiplex meta-analysis of RNA expression to identify genes with variants associated with immune dysfunction
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2012; 19 (2): 284-288
Many disease-susceptible SNPs exhibit significant disparity in ancestral and derived allele frequencies across worldwide populations. While previous studies have examined population differentiation of alleles at specific SNPs, global ethnic patterns of ensembles of disease risk alleles across human diseases are unexamined. To examine these patterns, we manually curated ethnic disease association data from 5,065 papers on human genetic studies representing 1,495 diseases, recording the precise risk alleles and their measured population frequencies and estimated effect sizes. We systematically compared the population frequencies of cross-ethnic risk alleles for each disease across 1,397 individuals from 11 HapMap populations, 1,064 individuals from 53 HGDP populations, and 49 individuals with whole-genome sequences from 10 populations. Type 2 diabetes (T2D) demonstrated extreme directional differentiation of risk allele frequencies across human populations, compared with null distributions of European-frequency matched control genomic alleles and risk alleles for other diseases. Most T2D risk alleles share a consistent pattern of decreasing frequencies along human migration into East Asia. Furthermore, we show that these patterns contribute to disparities in predicted genetic risk across 1,397 HapMap individuals, T2D genetic risk being consistently higher for individuals in the African populations and lower in the Asian populations, irrespective of the ethnicity considered in the initial discovery of risk alleles. We observed a similar pattern in the distribution of T2D Genetic Risk Scores, which are associated with an increased risk of developing diabetes in the Diabetes Prevention Program cohort, for the same individuals. This disparity may be attributable to the promotion of energy storage and usage appropriate to environments and inconsistent energy intake. 
Our results indicate that the differential frequencies of T2D risk alleles may contribute to the observed disparity in T2D incidence rates across ethnic populations.
View details for DOI 10.1371/journal.pgen.1002621
View details for Web of Science ID 000303441800007
View details for PubMedID 22511877
Coanalysis of GWAS with eQTLs reveals disease-tissue associations.
AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science
2012; 2012: 35-41
We demonstrate a genome-wide method for the integration of many studies of gene expression of phenotypically similar disease processes, a method of multiplex meta-analysis. We use immune dysfunction as an example disease process. We use a heterogeneous collection of datasets across human and mice samples from a range of tissues and different forms of immunodeficiency. We developed a method in which Tibshirani's modified t-test (SAM) is used to interrogate differential expression within a study and Fisher's method for omnibus meta-analysis is used to identify differentially expressed genes across studies. The ability of this overall gene expression profile to prioritize disease associated genes is evaluated by comparing against the results of a recent genome wide association study for common variable immunodeficiency (CVID). Our approach is able to prioritize genes associated with immunodeficiency in general (area under the ROC curve = 0.713) and CVID in particular (area under the ROC curve = 0.643). This approach may be used to investigate a larger range of failures of the immune system. Our method may be extended to other disease processes, using RNA levels to prioritize genes likely to contain disease associated DNA variants.
View details for DOI 10.1136/amiajnl-2011-000657
View details for Web of Science ID 000300768100023
View details for PubMedID 22319178
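Fisher's method, which the multiplex meta-analysis abstract uses to combine evidence for a gene across studies, sums the logs of the per-study p-values: the statistic -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom under the null when the k studies are independent. The Python sketch below uses the closed-form chi-square tail for even degrees of freedom so that it needs only the standard library. The function name and the p-values are illustrative, and the SAM step that would produce per-study p-values is not shown.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p) and
    combined_p is the upper tail of a chi-square with 2k degrees of freedom.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(log(p) for p in p_values)
    # For even degrees of freedom 2k the tail probability has a closed form:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, exp(-half) * total


# Made-up p-values for one gene in two independent studies.
example_statistic, example_p = fisher_combined_p([0.05, 0.05])
print(example_statistic, example_p)
```

Two studies that are each only borderline significant combine to a smaller p-value, which is the behaviour that makes the method useful for pooling heterogeneous expression datasets.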
ProfileChaser: searching microarray repositories based on genome-wide patterns of differential expression
2011; 27 (23): 3317-3318
Expression quantitative trait loci (eQTL), or genetic variants associated with changes in gene expression, have the potential to assist in interpreting results of genome-wide association studies (GWAS). eQTLs also have varying degrees of tissue specificity. By correlating the statistical significance of eQTLs mapped in various tissue types to their odds ratios reported in a large GWAS by the Wellcome Trust Case Control Consortium (WTCCC), we discovered that there is a significant association between diseases studied genetically and their relevant tissues. This suggests that eQTL data sets can be used to determine tissues that play a role in the pathogenesis of a disease, thereby highlighting these tissue types for further post-GWAS functional studies.
View details for PubMedID 22779046
Computational Repositioning of the Anticonvulsant Topiramate for Inflammatory Bowel Disease
SCIENCE TRANSLATIONAL MEDICINE
2011; 3 (96)
We introduce ProfileChaser, a web server that allows for querying the Gene Expression Omnibus based on genome-wide patterns of differential expression. Using a novel, content-based approach, ProfileChaser retrieves expression profiles that match the differentially regulated transcriptional programs in a user-supplied experiment. This analysis identifies statistical links to similar expression experiments from the vast array of publicly available data on diseases, drugs, phenotypes and other experimental conditions. http://firstname.lastname@example.org Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btr548
View details for Web of Science ID 000297352100015
View details for PubMedID 21967760
Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data
SCIENCE TRANSLATIONAL MEDICINE
2011; 3 (96)
Inflammatory bowel disease (IBD) is a chronic inflammatory disorder of the gastrointestinal tract for which there are few safe and effective therapeutic options for long-term treatment and disease maintenance. Here, we applied a computational approach to discover new drug therapies for IBD in silico, using publicly available molecular data reporting gene expression in IBD samples and 164 small-molecule drug compounds. Among the top compounds predicted to be therapeutic for IBD by our approach were prednisolone, a corticosteroid used to treat IBD, and topiramate, an anticonvulsant drug not previously described to have efficacy for IBD or any related disorders of inflammation or the gastrointestinal tract. Using a trinitrobenzenesulfonic acid (TNBS)-induced rodent model of IBD, we experimentally validated our topiramate prediction in vivo. Oral administration of topiramate significantly reduced gross pathological signs and microscopic damage in primary affected colon tissue in the TNBS-induced rodent model of IBD. These findings suggest that topiramate might serve as a therapeutic option for IBD in humans and support the use of public molecular data and computational approaches to discover new therapeutic options for disease.
View details for DOI 10.1126/scitranslmed.3002648
View details for Web of Science ID 000293953100004
View details for PubMedID 21849664
Identification of an IFN-gamma/mast cell axis in a mouse model of chronic asthma
JOURNAL OF CLINICAL INVESTIGATION
2011; 121 (8): 3133-3143
The application of established drug compounds to new therapeutic indications, known as drug repositioning, offers several advantages over traditional drug development, including reduced development costs and shorter paths to approval. Recent approaches to drug repositioning use high-throughput experimental approaches to assess a compound's potential therapeutic qualities. Here, we present a systematic computational approach to predict novel therapeutic indications on the basis of comprehensive testing of molecular signatures in drug-disease pairs. We integrated gene expression measurements from 100 diseases and gene expression measurements on 164 drug compounds, yielding predicted therapeutic potentials for these drugs. We recovered many known drug and disease relationships using computationally derived therapeutic potentials and also predict many new indications for these 164 drugs. We experimentally validated a prediction for the antiulcer drug cimetidine as a candidate therapeutic in the treatment of lung adenocarcinoma, and demonstrate its efficacy both in vitro and in vivo using mouse xenograft models. This computational method provides a systematic approach for repositioning established drugs to treat a wide range of human diseases.
View details for DOI 10.1126/scitranslmed.3001318
View details for Web of Science ID 000293953100005
View details for PubMedID 21849665
Content-based microarray search using differential expression profiles
Asthma is considered a Th2 cell–associated disorder. Despite this, both the Th1 cell–associated cytokine IFN-γ and airway neutrophilia have been implicated in severe asthma. To investigate the relative contributions of different immune system components to the pathogenesis of asthma, we previously developed a model that exhibits several features of severe asthma in humans, including airway neutrophilia and increased lung IFN-γ. In the present studies, we tested the hypothesis that IFN-γ regulates mast cell function in our model of chronic asthma. Engraftment of mast cell–deficient KitW(-sh/W-sh) mice, which develop markedly attenuated features of disease, with wild-type mast cells restored disease pathology in this model of chronic asthma. However, disease pathology was not fully restored by engraftment with either IFN-γ receptor 1–null (Ifngr1–/–) or Fcε receptor 1γ–null (Fcer1g–/–) mast cells. Additional analysis, including gene array studies, showed that mast cell expression of IFN-γR contributed to the development of many FcεRIγ-dependent and some FcεRIγ-independent features of disease in our model, including airway hyperresponsiveness, neutrophilic and eosinophilic inflammation, airway remodeling, and lung expression of several cytokines, chemokines, and markers of an alternatively activated macrophage response. These findings identify a previously unsuspected IFN-γ/mast cell axis in the pathology of chronic allergic inflammation of the airways in mice.
View details for DOI 10.1172/JCI43598
View details for Web of Science ID 000293495500024
View details for PubMedID 21737883
Clinical assessment incorporating a personal genome Reply
2010; 376 (9744): 869-870
Differentially Expressed RNA from Public Microarray Data Identifies Serum Protein Biomarkers for Cross-Organ Transplant Rejection and Other Conditions
PLOS COMPUTATIONAL BIOLOGY
2010; 6 (9)
With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations. We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3. Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.
View details for DOI 10.1186/1471-2105-11-603
View details for Web of Science ID 000286192100001
View details for PubMedID 21172034
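The abstract above names a p-weighted Pearson correlation for comparing differential expression profiles but does not give the weighting. As a minimal sketch, assuming each gene is weighted by `(1 - p)` in both experiments (the `p_weights` helper and this weighting are illustrative assumptions, not the published formula), a weighted Pearson correlation could look like this:

```python
import math

def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)

def p_weights(p_query, p_candidate):
    """Hypothetical weighting: genes with small p-values in both profiles count more."""
    return [(1 - a) * (1 - b) for a, b in zip(p_query, p_candidate)]

# Per-gene differential expression scores for two experiments (toy values).
query_scores = [2.1, -0.4, 1.3, -1.8]
candidate_scores = [1.7, 0.2, 0.9, -2.2]
weights = p_weights([0.01, 0.60, 0.05, 0.02], [0.03, 0.70, 0.10, 0.01])
similarity = weighted_pearson(query_scores, candidate_scores, weights)
```

With equal weights this reduces to the ordinary Pearson correlation, so the weighting only changes how much each gene contributes.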
Validating pathophysiological models of aging using clinical electronic medical records
JOURNAL OF BIOMEDICAL INFORMATICS
2010; 43 (3): 358-364
Serum proteins are routinely used to diagnose diseases, but are hard to find due to low sensitivity in screening the serum proteome. Public repositories of microarray data, such as the Gene Expression Omnibus (GEO), contain RNA expression profiles for more than 16,000 biological conditions, covering more than 30% of United States mortality. We hypothesized that genes coding for serum- and urine-detectable proteins, and showing differential expression of RNA in disease-damaged tissues would make ideal diagnostic protein biomarkers for those diseases. We showed that predicted protein biomarkers are significantly enriched for known diagnostic protein biomarkers in 22 diseases, with enrichment significantly higher in diseases for which at least three datasets are available. We then used this strategy to search for new biomarkers indicating acute rejection (AR) across different types of transplanted solid organs. We integrated three biopsy-based microarray studies of AR from pediatric renal, adult renal and adult cardiac transplantation and identified 45 genes upregulated in all three. From this set, we chose 10 proteins for serum ELISA assays in 39 renal transplant patients, and discovered three that were significantly higher in AR. Interestingly, all three proteins were also significantly higher during AR in the 63 cardiac transplant recipients studied. Our best marker, serum PECAM1, identified renal AR with 89% sensitivity and 75% specificity, and also showed increased expression in AR by immunohistochemistry in renal, hepatic and cardiac transplant biopsies. Our results demonstrate that integrating gene expression microarray measurements from disease samples and even publicly-available data sets can be a powerful, fast, and cost-effective strategy for the discovery of new diagnostic serum protein biomarkers.
View details for DOI 10.1371/journal.pcbi.1000940
View details for Web of Science ID 000282372600010
View details for PubMedID 20885780
Clinical assessment incorporating a personal genome
2010; 375 (9725): 1525-1535
Bioinformatics methods that leverage the vast amounts of clinical data promise to provide insights into underlying molecular mechanisms that help explain human physiological processes. One of these processes is adolescent development. The utility of predictive aging models generated from cross-sectional cohorts and their applicability to separate populations, including the clinical population, has yet to be completely explored. In order to address this, we built regression models predictive of adolescent chronological age from 2001 to 2002 National Health and Nutrition Examination Survey (NHANES) data and validated them against independent 2003-2004 NHANES data and clinical data from an academic tertiary-care pediatric hospital. The results indicate distinct differences between male and female models with both alkaline phosphatase and creatinine as predictive biomarkers for both genders, hematocrit and mean cell volume for males, and total serum globulin for females. We also suggest that the models are generalizable, are clinically relevant, and imply underlying molecular and clinical differences between males and females that may affect prediction accuracy. The integration of both epidemiological and clinical data promises to create more robust models that shed new light on physiological processes.
View details for DOI 10.1016/j.jbi.2009.11.007
View details for Web of Science ID 000278780800002
View details for PubMedID 19958842
Dynamism in gene expression across multiple studies
2010; 40 (3): 128-140
The cost of genomic information has fallen steeply, but the clinical translation of genetic risk estimates remains unclear. We aimed to undertake an integrated analysis of a complete human genome in a clinical context. We assessed a patient with a family history of vascular disease and early sudden death. Clinical assessment included analysis of this patient's full genome sequence, risk prediction for coronary artery disease, screening for causes of sudden cardiac death, and genetic counselling. Genetic analysis included the development of novel methods for the integration of whole genome and clinical risk. Disease and risk analysis focused on prediction of genetic risk of variants associated with mendelian disease, recognised drug responses, and pathogenicity for novel variants. We queried disease-specific mutation databases and pharmacogenomics databases to identify genes and mutations with known associations with disease and drug response. We estimated post-test probabilities of disease by applying likelihood ratios derived from integration of multiple common variants to age-appropriate and sex-appropriate pre-test probabilities. We also accounted for gene-environment interactions and conditionally dependent risks. Analysis of 2.6 million single nucleotide polymorphisms and 752 copy number variations showed increased genetic risk for myocardial infarction, type 2 diabetes, and some cancers. We discovered rare variants in three genes that are clinically associated with sudden cardiac death: TMEM43, DSP, and MYBPC3. A variant in LPA was consistent with a family history of coronary artery disease. The patient had a heterozygous null mutation in CYP2C19 suggesting probable clopidogrel resistance, several variants associated with a positive response to lipid-lowering therapy, and variants in CYP4F2 and VKORC1 that suggest he might have a low initial dosing requirement for warfarin.
Many variants of uncertain importance were reported. Although challenges remain, our results suggest that whole-genome sequencing can yield useful and clinically relevant information for individual patients. Funding: National Institute of General Medical Sciences; National Heart, Lung And Blood Institute; National Human Genome Research Institute; Howard Hughes Medical Institute; National Library of Medicine, Lucile Packard Foundation for Children's Health; Hewlett Packard Foundation; Breetwor Family Foundation.
View details for Web of Science ID 000277655100025
View details for PubMedID 20435227
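The abstract above describes estimating post-test probabilities by applying likelihood ratios to pre-test probabilities. A minimal sketch of that standard calculation follows, assuming the likelihood ratios are independent so they can simply be multiplied. The study itself reports accounting for conditionally dependent risks, which this sketch does not attempt, and the numbers below are illustrative only.

```python
def post_test_probability(pre_test_prob, likelihood_ratios):
    """Convert a pre-test probability to a post-test probability.

    Works on the odds scale: post-test odds = pre-test odds * product of LRs.
    Assumes the likelihood ratios are independent of one another.
    """
    odds = pre_test_prob / (1 - pre_test_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# Illustrative values: a 20% pre-test probability and two variants,
# one raising risk (LR 1.5) and one lowering it (LR 0.8).
probability = post_test_probability(0.20, [1.5, 0.8])
```

Working in odds rather than probabilities is what lets several variants be combined by multiplication.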
Translational bioinformatics in the cloud: an affordable alternative
Likelihood ratios for genome medicine
FoxO3 Regulates Neural Stem Cell Homeostasis
CELL STEM CELL
2009; 5 (5): 527-539
In this study we develop methods of examining gene expression dynamics, how and when genes change expression, and demonstrate their application in a meta-analysis involving over 29,000 microarrays. By defining measures across many experimental conditions, we have a new way of characterizing dynamics, complementary to measures looking at changes in absolute variation or breadth of tissues showing expression. We show conservation in overall patterns of dynamism across three species (human, mouse, and rat) and show associations with known disease-related genes. We discuss the enriched functional properties of the sets of genes showing different patterns of dynamics and show that the differences in expression dynamics are associated with the variety of different transcription factor regulatory sites. These results can influence thinking about the selection of genes for microarray design and the analysis of measurements of mRNA expression variation in a global context of expression dynamics across many conditions, as genes that are rarely differentially expressed between experimental conditions may be the subject of increased scrutiny when they significantly vary in expression between experimental subsets.
View details for DOI 10.1152/physiolgenomics.90403.2008
View details for Web of Science ID 000274287000002
View details for PubMedID 19920211
Unsupervised method for automatic construction of a disease dictionary from a large free text collection.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
In the nervous system, neural stem cells (NSCs) are necessary for the generation of new neurons and for cognitive function. Here we show that FoxO3, a member of a transcription factor family known to extend lifespan in invertebrates, regulates the NSC pool. We find that adult FoxO3(-/-) mice have fewer NSCs in vivo than wild-type counterparts. NSCs isolated from adult FoxO3(-/-) mice have decreased self-renewal and an impaired ability to generate different neural lineages. Identification of the FoxO3-dependent gene expression profile in NSCs suggests that FoxO3 regulates the NSC pool by inducing a program of genes that preserves quiescence, prevents premature differentiation, and controls oxygen metabolism. The ability of FoxO3 to prevent the premature depletion of NSCs might have important implications for counteracting brain aging in long-lived species.
View details for DOI 10.1016/j.stem.2009.09.014
View details for Web of Science ID 000272019500014
View details for PubMedID 19896443
Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge
Concept-specific lexicons (e.g. diseases, drugs, anatomy) are a critical source of background knowledge for many medical language-processing systems. However, the rapid pace of biomedical research and the lack of constraints on usage ensure that such dictionaries are incomplete. Focusing on disease terminology, we have developed an automated, unsupervised, iterative pattern learning approach for constructing a comprehensive medical dictionary of disease terms from randomized clinical trial (RCT) abstracts, and we compared different ranking methods for automatically extracting contextual patterns and concept terms. When used to identify disease concepts from 100 randomly chosen, manually annotated clinical abstracts, our disease dictionary shows significant performance improvement (F1 increased by 35-88%) over available, manually created disease terminologies.
View details for PubMedID 18999169
FitSNPs: highly differentially expressed genes are more likely to have variants associated with disease
2008; 9 (12)
Genome sciences have experienced an increasing demand for efficient text-processing tools that can extract biologically relevant information from the growing amount of published literature. In response, a range of text-mining and information-extraction tools have recently been developed specifically for the biological domain. Such tools are only useful if they are designed to meet real-life tasks and if their performance can be estimated and compared. The BioCreative challenge (Critical Assessment of Information Extraction in Biology) consists of a collaborative initiative to provide a common evaluation framework for monitoring and assessing the state-of-the-art of text-mining systems applied to biologically relevant problems. The Second BioCreative assessment (2006 to 2007) attracted 44 teams from 13 countries worldwide, with the aim of evaluating current information-extraction/text-mining technologies developed for one or more of the three tasks defined for this challenge evaluation. These tasks included the recognition of gene mentions in abstracts (gene mention task); the extraction of a list of unique identifiers for human genes mentioned in abstracts (gene normalization task); and finally the extraction of physical protein-protein interaction annotation-relevant information (protein-protein interaction task). The 'gold standard' data used for evaluating submissions for the third task was provided by the interaction databases MINT (Molecular Interaction Database) and IntAct. The Second BioCreative assessment almost doubled the number of participants for each individual task when compared with the first BioCreative assessment. An overall improvement in terms of balanced precision and recall was observed for the best submissions for the gene mention task (F score 0.87); for the gene normalization task, the best results (F score 0.81) were comparable with results obtained for similar tasks posed at the first BioCreative challenge.
In the case of the protein-protein interaction task, the importance and difficulties of experimentally confirmed annotation extraction from full-text articles were explored, yielding different results depending on the step of the annotation extraction workflow. A common characteristic observed in all three tasks was that the combination of system outputs could yield better results than any single system. Finally, the development of the first text-mining meta-server was promoted within the context of this community challenge.
View details for DOI 10.1186/gb-2008-9-S2-S1
View details for Web of Science ID 000278173900001
View details for PubMedID 18834487
Overview of BioCreative II gene normalization
Candidate single nucleotide polymorphisms (SNPs) from genome-wide association studies (GWASs) were often selected for validation based on their functional annotation, which was inadequate and biased. We propose to use the more than 200,000 microarray studies in the Gene Expression Omnibus to systematically prioritize candidate SNPs from GWASs. We analyzed all human microarray studies from the Gene Expression Omnibus, and calculated the observed frequency of differential expression, which we called differential expression ratio, for every human gene. Analysis conducted in a comprehensive list of curated disease genes revealed a positive association between differential expression ratio values and the likelihood of harboring disease-associated variants. By considering highly differentially expressed genes, we were able to rediscover disease genes with 79% specificity and 37% sensitivity. We successfully distinguished true disease genes from false positives in multiple GWASs for multiple diseases. We then derived a list of functionally interpolating SNPs (fitSNPs) to analyze the top seven loci of Wellcome Trust Case Control Consortium type 1 diabetes mellitus GWASs, rediscovered all type 1 diabetes mellitus genes, and predicted a novel gene (KIAA1109) for an unexplained locus 4q27. We suggest that fitSNPs would work equally well for both Mendelian and complex diseases (being more effective for cancer) and proposed candidate genes to sequence for their association with 597 syndromes with unknown molecular basis. Our study demonstrates that highly differentially expressed genes are more likely to harbor disease-associated DNA variants. FitSNPs can serve as an effective tool to systematically prioritize candidate SNPs from GWASs.
View details for DOI 10.1186/gb-2008-9-12-r170
View details for Web of Science ID 000263074100009
View details for PubMedID 19061490
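The abstract above defines the differential expression ratio as the observed frequency of differential expression for each gene. One plausible reading, sketched here as an assumption rather than the published procedure, is the number of studies in which a gene was differentially expressed divided by the number of studies in which it was measured. The gene names and the input layout are hypothetical.

```python
from collections import Counter

def differential_expression_ratio(studies):
    """Per-gene fraction of studies in which the gene was differentially expressed.

    `studies` is a list of (measured_genes, differentially_expressed_genes)
    pairs, each element a set of gene symbols.
    """
    measured = Counter()
    hits = Counter()
    for measured_genes, de_genes in studies:
        measured.update(measured_genes)
        # Only count a hit when the gene was actually measured in that study.
        hits.update(de_genes & measured_genes)
    return {gene: hits[gene] / measured[gene] for gene in measured}

# Toy input: three studies over hypothetical genes.
toy_studies = [
    ({"GENE_A", "GENE_B", "GENE_C"}, {"GENE_A"}),
    ({"GENE_A", "GENE_B"}, {"GENE_A", "GENE_B"}),
    ({"GENE_A", "GENE_C"}, set()),
]
ratios = differential_expression_ratio(toy_studies)
```

Genes can then be ranked by this ratio to prioritize candidates, which is the use the abstract describes.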
Rapidly retargetable approaches to de-identification in medical records
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2007; 14 (5): 564-573
The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%. Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.
View details for DOI 10.1186/gb-2008-9-S2-S3
View details for Web of Science ID 000278173900003
View details for PubMedID 18834494
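The abstract above reports that combining system outputs with simple voting schemes improved results, and that pooling all responses gave a high-recall, low-precision system. A minimal sketch of both ideas follows, assuming each system returns a set of gene identifiers for a document. The identifiers and the vote threshold are hypothetical, and the challenge's actual voting schemes are not specified in the abstract.

```python
from collections import Counter

def vote(system_outputs, min_votes):
    """Keep identifiers proposed by at least `min_votes` systems."""
    counts = Counter(gene_id for output in system_outputs for gene_id in set(output))
    return {gene_id for gene_id, n in counts.items() if n >= min_votes}

def precision_recall(predicted, gold):
    """Precision and recall of a predicted identifier set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Three hypothetical systems annotating one abstract.
outputs = [{"1017", "7157"}, {"7157", "672"}, {"7157", "1017"}]
majority = vote(outputs, min_votes=2)   # identifiers backed by two or more systems
pooled = vote(outputs, min_votes=1)     # union of all responses ("maximum recall")

# The reported pooled recall follows from the counts in the abstract.
pooled_recall = 763 / 785
```

Raising `min_votes` trades recall for precision, while `min_votes=1` corresponds to the pooled system described in the abstract.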
Automating document classification for the Immune Epitope Database
This paper describes a successful approach to de-identification that was developed to participate in a recent AMIA-sponsored challenge evaluation. Our approach focused on rapid adaptation of two existing toolkits for named entity recognition, Carafe and LingPipe. The "out of the box" Carafe system achieved a very good score (phrase F-measure of 0.9664) with only four hours of work to adapt it to the de-identification task. With further tuning, we were able to reduce the token-level error term by over 36% through task-specific feature engineering and the introduction of a lexicon, achieving a phrase F-measure of 0.9736. We were able to achieve good performance on the de-identification task by the rapid retargeting of existing toolkits. For the Carafe system, we developed a method for tuning the balance of recall vs. precision, as well as a confidence score that correlated well with the measured F-score.
View details for DOI 10.1197/jamia.M2435
View details for Web of Science ID 000249769700004
View details for PubMedID 17600096
Evaluating the automatic mapping of human gene and protein mentions to unique identifiers
PACIFIC SYMPOSIUM ON BIOCOMPUTING 2007
The Immune Epitope Database contains information on immune epitopes curated manually from the scientific literature. Like similar projects in other knowledge domains, significant effort is spent on identifying which articles are relevant for this purpose. We here report our experience in automating this process using Naïve Bayes classifiers trained on 20,910 abstracts classified by domain experts. Improvements on the basic classifier performance were made by a) utilizing information stored in PubMed beyond the abstract itself, b) applying standard feature selection criteria, and c) extracting domain-specific feature patterns that, e.g., identify peptide sequences. We have implemented the classifier into the curation process, determining if abstracts are clearly relevant, clearly irrelevant, or if no certain classification can be made, in which case the abstracts are manually classified. Testing this classification scheme on an independent dataset, we achieve 95% sensitivity and specificity in the 51.1% of abstracts that were automatically classified. By implementing text classification, we have sped up the reference selection process without sacrificing sensitivity or specificity of the human expert classification. This study provides both practical recommendations for users of text classification tools, as well as a large dataset which can serve as a benchmark for tool developers.
View details for DOI 10.1186/1471-2105-8-269
View details for Web of Science ID 000249274000001
View details for PubMedID 17655769
Overview of BioCreAtIvE task IB: normalized gene lists
We have developed a challenge task for the second BioCreAtIvE (Critical Assessment of Information Extraction in Biology) that requires participating systems to provide lists of the EntrezGene (formerly LocusLink) identifiers for all human genes and proteins mentioned in a MEDLINE abstract. We are distributing 281 annotated abstracts and another 5,000 noisily annotated abstracts along with a gene name lexicon to participants. We have performed a series of baseline experiments to better characterize this dataset and form a foundation for participant exploration.
View details for Web of Science ID 000245296300027
View details for PubMedID 17990499
BioCreAtIvE task IA: gene mention finding evaluation
Our goal in BioCreAtIvE has been to assess the state of the art in text mining, with emphasis on applications that reflect real biological applications, e.g., the curation process for model organism databases. This paper summarizes the BioCreAtIvE task 1B, the "Normalized Gene List" task, which was inspired by the gene list supplied for each curated paper in a model organism database. The task was to produce the correct list of unique gene identifiers for the genes and gene products mentioned in sets of abstracts from three model organisms (Yeast, Fly, and Mouse). Eight groups fielded systems for three data sets (Yeast, Fly, and Mouse). For Yeast, the top scoring system (out of 15) achieved 0.92 F-measure (harmonic mean of precision and recall); for Mouse and Fly, the task was more difficult, due to larger numbers of genes, more ambiguity in the gene naming conventions (particularly for Fly), and complex gene names (for Mouse). For Fly, the top F-measure was 0.82 out of 11 systems and for Mouse, it was 0.79 out of 16 systems. This assessment demonstrates that multiple groups were able to perform a real biological task across a range of organisms. The performance was dependent on the organism, and specifically on the naming conventions associated with each organism. These results hold out promise that the technology can provide partial automation of the curation process in the near future.
View details for DOI 10.1186/1471-2105-6-S1-S11
View details for Web of Science ID 000236061400011
View details for PubMedID 15960823
Data preparation and interannotator agreement: BioCreAtIvE task IB
The biological research literature is a major repository of knowledge. As the amount of literature increases, it will get harder to find the information of interest on a particular topic. There has been an increasing amount of work on text mining this literature, but comparing this work is hard because of a lack of standards for making comparisons. To address this, we worked with colleagues at the Protein Design Group, CNB-CSIC, Madrid to develop BioCreAtIvE (Critical Assessment for Information Extraction in Biology), an open common evaluation of systems on a number of biological text mining tasks. We report here on task 1A, which deals with finding mentions of genes and related entities in text. "Finding mentions" is a basic task, which can be used as a building block for other text mining tasks. The task makes use of data and evaluation software provided by the (US) National Center for Biotechnology Information (NCBI). Fifteen teams took part in task 1A. A number of teams achieved scores over 80% F-measure (balanced precision and recall). The teams that tried to use their task 1A systems to help on other BioCreAtIvE tasks reported mixed results. The 80% plus F-measure results are good, but still somewhat lag the best scores achieved in some other domains such as newswire, due in part to the complexity and length of gene names, compared to person or organization names in newswire.
View details for DOI 10.1186/1471-2105-6-S1-S2
View details for Web of Science ID 000236061400002
View details for PubMedID 15960832
Gene name identification and normalization using a model organism database
JOURNAL OF BIOMEDICAL INFORMATICS
2004; 37 (6): 396-410
We prepared and evaluated training and test materials for an assessment of text mining methods in molecular biology. The goal of the assessment was to evaluate the ability of automated systems to generate a list of unique gene identifiers from PubMed abstracts for the three model organisms Fly, Mouse, and Yeast. This paper describes the preparation and evaluation of answer keys for training and testing. These consisted of lists of normalized gene names found in the abstracts, generated by adapting the gene list for the full journal articles found in the model organism databases. For the training dataset, the gene list was pruned automatically to remove gene names not found in the abstract; for the testing dataset, it was further refined by manual annotation by annotators provided with guidelines. A critical step in interpreting the results of an assessment is to evaluate the quality of the data preparation. We did this by careful assessment of interannotator agreement and the use of answer pooling of participant results to improve the quality of the final testing dataset. Interannotator analysis on a small dataset showed that our gene lists for Fly and Yeast were good (87% and 91% three-way agreement) but the Mouse gene list had many conflicts (mostly omissions), which resulted in errors (69% interannotator agreement). By comparing and pooling answers from the participant systems, we were able to add an additional check on the test data; this allowed us to find additional errors, especially in Mouse. This led to a 1% change in the Yeast and Fly "gold standard" answer keys, but to an 8% change in the Mouse answer key. We found that clear annotation guidelines are important, along with careful interannotator experiments, to validate the generated gene lists. Also, abstracts alone are a poor resource for identifying genes in a paper, containing only a fraction of the genes mentioned in the full text (25% for Fly, 36% for Mouse).
We found that there are intrinsic differences between the model organism databases related to the number of synonymous terms and also to curation criteria. Finally, we found that answer pooling was much faster and allowed us to identify more conflicting genes than interannotator analysis.
View details for DOI 10.1186/1471-2105-6-S1-S12
View details for Web of Science ID 000236061400012
View details for PubMedID 15960824
Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup
2003; 19: i331-i339
Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to the application of natural language processing to aid in the curation process for FlyBase. We focused on listing the normalized form of genes and gene products discussed in an article. We broke this into two steps: gene mention tagging in text, followed by normalization of gene names. For gene mention tagging, we adopted a statistical approach. To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions. We then evaluated the quality of the noisy training data (precision of 78%, recall 88%) and the quality of the HMM tagger output trained on this noisy data (precision 78%, recall 71%). In order to generate normalized gene lists, we explored two approaches. First, we explored simple pattern matching based on synonym lists to obtain a high recall/low precision system (recall 95%, precision 2%). Using a series of filters, we were able to improve precision to 50% with a recall of 72% (balanced F-measure of 0.59). Our second approach combined the HMM gene mention tagger with various filters to remove ambiguous mentions; this approach achieved an F-measure of 0.72 (precision 88%, recall 61%). These experiments indicate that the lexical resources provided by FlyBase are complete enough to achieve high recall on the gene list task, and that normalization requires accurate disambiguation; different strategies for tagging and normalization trade off recall for precision.
View details for DOI 10.1016/j.jbi.2004.08.010
View details for Web of Science ID 000225334800002
View details for PubMedID 15542014
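The balanced F-measures quoted in the abstract above can be checked from the reported precision and recall figures. The sketch below assumes the standard definition of the F-measure as the harmonic mean of precision and recall; the function name `f_measure` and the variable names are illustrative and do not come from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 gives the balanced F-measure)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures as reported in the abstract (assumed to be rounded percentages).
filtered_matching = f_measure(0.50, 0.72)    # synonym-list matching plus filters
tagger_with_filters = f_measure(0.88, 0.61)  # HMM tagger plus filters

print(round(filtered_matching, 2), round(tagger_with_filters, 2))
```

Under these assumptions the two values round to 0.59 and 0.72, matching the abstract. The second approach scores higher despite lower recall because its large precision gain outweighs the recall loss in the harmonic mean.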
Rutabaga by any other name: extracting biological names
JOURNAL OF BIOMEDICAL INFORMATICS
2002; 35 (4): 247-259
The biological literature is a major repository of knowledge. Many biological databases draw much of their content from a careful curation of this literature. However, as the volume of literature increases, the burden of curation increases. Text mining may provide useful tools to assist in the curation process. To date, the lack of standards has made it impossible to determine whether text mining techniques are sufficiently mature to be useful. We report on a Challenge Evaluation task that we created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup. We provided a training corpus of 862 articles consisting of journal articles curated in FlyBase, along with the associated lists of genes and gene products, as well as the relevant data fields from FlyBase. For the test, we provided a corpus of 213 new ('blind') articles; the 18 participating groups provided systems that flagged articles for curation, based on whether the article contained experimental evidence for gene expression products. We report on the evaluation results and describe the techniques used by the top performing groups.
View details for DOI 10.1093/bioinformatics/btg1046
View details for Web of Science ID 000207434200048
View details for PubMedID 12855478
As the pace of biological research accelerates, biologists are becoming increasingly reliant on computers to manage the information explosion. Biologists communicate their research findings by relying on precise biological terms; these terms then provide indices into the literature and across the growing number of biological databases. This article examines emerging techniques to access biological resources through extraction of entity names and relations among them. Information extraction has been an active area of research in natural language processing and there are promising results for information extraction applied to news stories, e.g., balanced precision and recall in the 93-95% range for identifying person, organization and location names. But these results do not seem to transfer directly to biological names, where results remain in the 75-80% range. Multiple factors may be involved, including absence of shared training and test sets for rigorous measures of progress, lack of annotated training data specific to biological tasks, pervasive ambiguity of terms, frequent introduction of new terms, and a mismatch between evaluation tasks as defined for news and real biological problems. We present evidence from a simple lexical matching exercise that illustrates some specific problems encountered when identifying biological names. We conclude by outlining a research agenda to raise performance of named entity tagging to a level where it can be used to perform tasks of biological importance.
View details for DOI 10.1016/S1532-0464(03)00014-5
View details for Web of Science ID 000182607600005
View details for PubMedID 12755519