10th International Biocuration conference on March 26-29, 2017
Scientific Program
Slides and posters can be shared on the Biocuration channel on F1000Research.
Session 1: Data Integration, Data Visualization, and Community-based Biocuration
Sunday, March 26, 9:30 AM - 12 noon
Chair: Edith Wong
Abstract: Brief summaries describing the function of each gene’s product are of great value to the research community, especially when interpreting genome-wide studies that reveal changes to hundreds of genes. However, writing such summaries is a daunting task, given the number of genes in each organism (e.g. 13,929 protein-coding genes in Drosophila melanogaster). Automated methods often fail to capture the key functions or express them eloquently. In FlyBase (the Drosophila genetics database) we have therefore developed a pipeline to obtain such summaries from researchers who have worked extensively on each gene. An in-house algorithm predicts and ranks expert authors for each gene based on the data within FlyBase and extracts their email addresses from papers that we have curated. For genes that we classify as sufficiently characterized, emails are sent to the relevant author asking them to provide a draft summary that curators then revise for consistency, creating what we call a “Gene Snapshot”. This approach yielded 1,800 gene snapshots within a three-month period. We discuss the general utility of this approach for other databases that capture data from the research literature.
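The author-ranking step lends itself to a simple illustration. The sketch below ranks candidate experts per gene by how many curated papers list them, using invented (gene, author, email) records; it is only a toy model of the idea, not the in-house FlyBase algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical curated records: (gene, author, email) extracted from papers.
curated_papers = [
    ("dpp", "A. Researcher", "a.researcher@example.edu"),
    ("dpp", "A. Researcher", "a.researcher@example.edu"),
    ("dpp", "B. Scientist", "b.scientist@example.org"),
    ("wg",  "C. Expert",    "c.expert@example.edu"),
]

def rank_experts(records):
    """Rank candidate expert authors per gene by number of curated papers."""
    per_gene = defaultdict(Counter)
    emails = {}
    for gene, author, email in records:
        per_gene[gene][author] += 1
        emails[author] = email
    return {
        gene: [(author, count, emails[author]) for author, count in counts.most_common()]
        for gene, counts in per_gene.items()
    }

for gene, experts in rank_experts(curated_papers).items():
    print(gene, experts[0])  # top-ranked author to contact for a Gene Snapshot draft
```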
Abstract: Web Application Programming Interfaces (APIs) are interfaces that data providers build to empower the outside world to interact with their business logic. The number of biomedical web APIs has grown significantly over the past years; however, the interoperability of APIs remains a big challenge. The proper use of metadata for documenting Web APIs is critical for discovering and using the APIs of interest, as well as for interconnecting different APIs. However, there is no standard metadata for documenting Web APIs. Moreover, most APIs are documented in isolation because it is not straightforward to access the metadata used in existing relevant APIs. Therefore, reusing metadata and automatically discovering and connecting suitable APIs for a given application is difficult.
In our previous work, we identified the metadata elements that are crucial to the description of Web APIs and subsequently developed the smartAPI metadata specification following the FAIR (Findable, Accessible, Interoperable, Reusable) principles. To facilitate the creation of such metadata, and to provide instant access to the metadata elements and values used by other API providers, we developed the smartAPI editor by extending the existing Swagger editor. The smartAPI editor facilitates the creation, sharing and reuse of API metadata. We extended the auto-completion functionality of the Swagger editor to suggest the list of metadata elements and values used by other API providers, along with their usage frequency. The APIs can then be saved into our searchable API repository, which enables API discovery and indexes API documents to automatically update the auto-suggestion list. The smartAPI editor, available at http://smart-api.info/editor, is currently in use by several API providers, which will increase the automated interoperability of Web APIs.
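The frequency-based auto-suggestion idea can be illustrated with a toy sketch. The metadata fields, values and repository below are invented for illustration and do not reproduce the smartAPI specification or the editor's code.

```python
from collections import Counter

# Hypothetical repository of API metadata documents (field name -> value).
api_repository = [
    {"license": "CC-BY-4.0", "format": "JSON", "category": "gene annotation"},
    {"license": "CC-BY-4.0", "format": "XML",  "category": "protein interaction"},
    {"license": "Apache-2.0", "format": "JSON", "category": "gene annotation"},
]

def suggest_values(field, repository):
    """Suggest values for a metadata field, ranked by how often other providers use them."""
    counts = Counter(doc[field] for doc in repository if field in doc)
    total = sum(counts.values())
    return [(value, count, f"{100 * count / total:.0f}%") for value, count in counts.most_common()]

print(suggest_values("license", api_repository))
# e.g. [('CC-BY-4.0', 2, '67%'), ('Apache-2.0', 1, '33%')]
```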
Abstract: The EMBL-EBI Complex Portal (www.ebi.ac.uk/intact/complex) is a central service that provides manually curated information on stable macromolecular complexes from model organisms. The database currently holds approximately 2000 complexes, with the majority from Saccharomyces cerevisiae, human and mouse. It provides unique identifiers, names and synonyms, lists of complex members with their unique identifiers (UniProt, ChEBI, RNAcentral), function, binding and stoichiometry annotations, descriptions of their topology, assembly structure, ligands and associated diseases, as well as cross-references to the same complex in other databases (e.g. ChEMBL, GO, PDB, Reactome). Our stable identifiers are used as annotation objects in IntAct and Protein2GO and as cross-references in ChEMBL, Intermine, MatrixDB and QuickGO. PDBe and Reactome are working towards integrating complex identifiers.
Having established the basic data structure and content, we are now focusing on providing a better user experience. We have completely redeveloped our website, incorporating many more visualization tools, such as the ComplexViewer, PDBe’s LiteMol Viewer, Reactome’s DiagramJS, the Atlas widget of expression data and the MI-Circle viewer, a bespoke chord diagram developed to give an alternative representation of complex topology, binding regions, mutations and links to InterPro domains. Future plans include building a tool that can a) explore evolutionary relationships between complexes across the database and b) infer the quaternary structure of complexes for which no structure exists, using the Periodic Table of Complexes developed by the Teichmann group.
This is a collaborative project, which has already received contributions from groups such as UniProtKB, the Saccharomyces Genome Database, the UCL Gene Annotation Team and the MINT database. We welcome groups who are willing to contribute their expertise and will make editorial access and training available to them. Individual complexes will also be added to the dataset on request. Contact us at intact-help@ebi.ac.uk for further information.
Abstract: Recent advances have enabled development of screens to characterize the action of small-molecules on cells using multidimensional assays such as gene expression and cell morphological imaging. The dramatic increase in the scale of these datasets has made it important to improve methodology. The use of a priori genetic pathway information has become an integral part of the analysis of genomic datasets. We reasoned that conceptually similar approaches could be informative in the analysis of small-molecule perturbational datasets.
While drug mechanism of action (MoA) has been curated by several databases, utilizing this information to analyze large-scale datasets is challenging due to 1) inconsistent terminology, and 2) lack of APIs to access the information from computational tools. Furthermore, information transfer is mainly in one direction: while the annotations are used for large-scale data analysis, results from that analysis are not easily translated into new annotations. To address these challenges, we leveraged the Connectivity Map (CMap), a project of the LINCS Center for Transcriptomics at the Broad Institute. CMap generates and catalogs differential gene expression signatures from human cells treated with perturbagens to enable discovery of connections between genes, diseases, and drugs. Enhancing the utility of CMap, we annotated 6,115 compounds with MoA, protein target, disease indication, and clinical status and developed a controlled vocabulary of MoA terms to facilitate grouping of compounds with similar activities. To expand understanding of any compound's MoA, we developed the concept of pharmacological class (PCL), a group of compounds related by curated MoA as well as by their observed connectivity in the CMap dataset. PCLs can be used with gene expression data to annotate new compounds with respect to known MoA. For example, while we observed that enzastaurin, a PKC inhibitor, connects to the PKC inhibitor PCL, we also observed strong connectivity to the GSK3 inhibitor PCL, a result supported by recent publications.
We will present data, curation processes, and new webapps at clue.io that enable users to see curated annotations alongside unexpected off-target activities indicated in the LINCS dataset. These results will inform drug repositioning, target identification and MoA deconvolution. Importantly, access via RESTful APIs will help computationalists incorporate curated information in the next generation of chemical genomics tools.
Abstract: Drug repurposing is a strategy to find new indications for approved pharmaceutical drugs in order to speed up the drug development process and decrease both time to market and development costs.
In order to enable drug repurposing, detailed knowledge of candidate compounds is required. This includes the mechanism of action, in vivo targets, approved indications and also details about side effects, contraindications and unexpected outcomes in the original clinical trials. The latter information seems especially valuable in order to gain insight on new indications for existing drugs.
Several databases provide data on chemical compounds and/or drugs, but this wealth of information suffers from weak integration, is outdated or incomplete, and frequently lacks the open, machine-readable, semantic interoperability required. There is a pressing need for mechanisms to quickly and easily share, integrate, curate and expand existing knowledge for hypothesis generation in the context of drug repurposing while avoiding redundant efforts.
Here, we present a community portal with pre-integrated data which collects chemical compounds, their mechanisms of action, protein targets and standard chemical identifiers. Wikidata serves as a completely open data backend, enabling data contribution, integration and consumption by the scientific community, primarily focused on drug repurposing but also suitable for any other drug development project. In total, we have imported data on 140,000 chemical compounds from FDA UNII, PubChem, Guide to Pharmacology, DrugBank and the PDB ligand database with up to 67 accession numbers and chemical properties. This collection encompasses compounds which are either approved drugs, drug candidates, ligands or metabolites, offering a comprehensive starting set of relevant compounds for a repurposing effort. We also imported high-confidence protein target information (~26,000 interactions) and added the indications for approved drugs (~4,000 drug-disease associations).
For user-friendly interaction with the data, we drafted a dedicated web interface for drug repurposing projects, intended as a domain-specific alternative to the standard Wikidata web interface. It will allow repurposing-specific queries, batch additions of data, as well as curation of existing data and enable third party data integration.
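As an illustration of the kind of repurposing-oriented query such a portal could issue, the sketch below queries the public Wikidata SPARQL endpoint with the SPARQLWrapper library; the property used (P2175, assumed to be "medical condition treated") and the query shape are assumptions to verify against the live data model.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed property: P2175 ("medical condition treated") links a drug item to a disease item.
QUERY = """
SELECT ?drug ?drugLabel ?diseaseLabel WHERE {
  ?drug wdt:P2175 ?disease .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="drug-repurposing-sketch/0.1")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["drugLabel"]["value"], "->", row["diseaseLabel"]["value"])
```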
In summary, we describe our vision for an open data, fully expandable drug repurposing portal for the scientific community to facilitate discovery.
Session 2: Large Scale and Predictive Annotation/Big Data
Sunday, March 26, 3:30-5:30 PM
Chair: Zhang Zhang
Abstract: We previously developed a tool, Transcriptomine, for exploring expression profiling datasets involving small molecule or genetic perturbations of nuclear receptor (NR) signaling pathways. Here we describe advances in biocuration, query interface design and data visualization that enhance the discovery of uncharacterized biology in these pathways through this tool. Transcriptomine currently contains some 40 million data points, encompassing ~2000 experiments in a reference library of over 500 datasets retrieved from public archives and committed to a systematic biocuration pipeline. To simultaneously reduce the complexity and enhance the aggregate biological resolution of the underlying data points, we mapped regulatory small molecule and gene perturbations to NR signaling pathways, and biosamples (tissues and cell lines) to organs and physiological systems. Incorporation of these mappings into Transcriptomine empowers high-signal, low-noise visualization of tissue-specific regulation of gene expression by NR signaling pathways. Data points from experiments representing physiological animal and cell models, as well as clinical datasets, facilitate the discovery of intersections between NR pathways and transcriptomic events accompanying a variety of normal and pathological cellular processes. In addition, a growing number of datasets mapping to non-NR signaling pathways highlight transcriptional cross-talk between distinct cell signaling modalities. We demonstrate in a series of Use Cases how data points that are circumstantial or unobtrusive in individual datasets acquire validity and significance as a collective biological narrative. Transcriptomine represents a unique environment in which bench biologists can efficiently and routinely germinate research hypotheses, validate experimental data or develop in silico models to mechanize the transcriptional pharmacology of NR signaling pathways.
Abstract: The Open Targets Project (http://www.opentargets.org/) aims to provide evidence about the associations between therapeutic drug targets and diseases and phenotypes from high-quality public data sources to inform drug research and development. Data from resources including the GWAS Catalog, European Variation Archive (EVA), UniProt, Expression Atlas, ChEMBL, Reactome, Cancer Gene Census, Phenodigm and Europe PMC were harmonised by mapping their content to the Experimental Factor Ontology (EFO). EFO provides the semantic backbone for data integration at Open Targets to normalise and link the mapped data to build the knowledgebase of disease-target evidence statements. EFO is a data-driven application ontology that reuses concepts from over 30 ontologies, and fills in the gaps where necessary by creating EFO-owned concepts and classifying them in EFO’s OBO-conformant hierarchy. In total, EFO-mapped curated content from Open Targets’ data providers yields 2,559,080 disease-target associations connecting 31,071 molecular targets to 8,659 diseases and clinical observations (as of December 2016). These associations are supported by 4,973,211 evidence statements annotated to EFO for the Open Targets downstream computation of the associations. Associations and evidence can be browsed at http://www.targetvalidation.org/. The success of Open Targets has showcased biocuration at its best, where high-quality curated data build the foundation for comprehensive data integration for translational medicine research.
Abstract: Each year, a significant number of children around the world suffer the consequences of misdiagnosis and ineffective treatment of various diseases. Without clear clinical descriptions of the affected children, the value of molecular data and its relevance for understanding, diagnosing and treating pediatric disease is diminished. To make precision medicine a reality in clinical applications, it is necessary to combine clinical-pathological indexes with state-of-the-art molecular profiling and to use these resources for precise diagnostic, prognostic and therapeutic strategies for individual children.
To this end, we built a database called Pediatrics Annotation & Medicine (PedAM) to standardize the naming and classification of pediatric diseases based on the International Nomenclature of Diseases (IND), the Standard Nomenclature of Diseases and Operations (SNDO), the Disease Ontology (DO) and ICD-10. PedAM integrates both biomedical resources and clinical data from electronic medical records and supports the development of computational tools that enable robust data analysis and integration. In addition, we used disease-manifestation (D-M) pairs from existing biomedical ontologies as prior knowledge to automatically recognize D-M–specific syntactic patterns from 774,514 full-text articles and 8,848,796 MEDLINE abstracts, and we also extracted drugs and their dose information for pediatric diseases from records reported in Clinical Trials. In total, 747,895 D-M pairs and 297,186 disease-phenotype pairs have been identified. We further built a disease similarity network based on phenotype similarity and implemented its visualization on the PedAM web pages. We also extracted gene information for 981 gene-related diseases from existing databases such as OMIM, Orphanet and DISEASES. Currently, PedAM contains 5,699 standardized pediatric disease records with 7 annotation fields for each disease, including definition, synonyms, genes, phenotypes, manifestations, references and cross-linkages. In addition, we provide 78,370 phenotypes from abstracts describing newborn patients and 267,262 phenotypes from abstracts describing pediatric patients.
We will continually update PedAM by text mining newly published articles to enrich the disease phenotypes and manifestations, and by adding genotype and drug information as well as patient cases, which will make PedAM a more comprehensive database and a more valuable platform.
Abstract: The Genome Properties (GPs) annotation system was developed ~10 years ago as a way to enhance the functional annotation of genomes and proteins, and as a resource for comparative genomics. GPs allow the inference of complete functional attributes (e.g. pathways and complexes) based on the presence of a set of underlying molecular and/or protein family attributes found within the genome. For example, an organism may be proposed to synthesize proline from glutamate if its genome can be shown to encode the complete set of proteins required to perform the relevant biochemical steps in this pathway.
The current set of GPs represents over 1,000 individual properties, mostly determined by matches to hidden Markov models (HMMs) drawn from the TIGRFAM and Pfam databases. Recently, we have begun work to integrate the GPs annotation system into InterPro - a resource which is widely used for functional characterisation of individual protein sequences - to allow any of InterPro’s 14 member databases to be used in the inference of a GP. The addition of the GPs system complements InterPro’s existing functionality, providing a coarser grained annotation of the functional repertoire encoded by an organism’s genome.
By representing the presence or absence of GPs as a set of binaries, it is possible to generate a ‘fingerprint’ for each proteome. Comparing these fingerprints, rather than the underlying protein matches, vastly reduces the complexity of proteome comparison and allows rapid phylogenetic profiling of a GP. To enable this, we have developed a visualization tool that allows users to view the presence or absence of GPs for a large number of proteomes, either by taxonomic classification or by GP. Such approaches can enable both the assessment of the evolution of properties and the identification of InterPro entries that may lack sensitivity. Here we describe this work and our forthcoming plans to extend the GPs system to make it more comprehensive, and to provide coverage of eukaryotic genomes.
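The fingerprint comparison can be sketched in a few lines of code. The proteomes and properties below are invented; in practice the presence/absence calls would come from InterPro matches evaluated against each property's rules.

```python
import numpy as np

# Columns: Genome Properties (1 = complete property present, 0 = absent).
properties = ["GP:proline_biosynthesis", "GP:flagellum", "GP:type_III_secretion"]
proteomes = {
    "Organism_A": np.array([1, 1, 0]),
    "Organism_B": np.array([1, 0, 0]),
    "Organism_C": np.array([0, 1, 1]),
}

def jaccard(a, b):
    """Similarity between two presence/absence fingerprints."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# Compare whole-proteome fingerprints instead of the underlying protein matches.
names = list(proteomes)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        print(f"{x} vs {y}: {jaccard(proteomes[x], proteomes[y]):.2f}")

# Phylogenetic profile of a single property across proteomes.
idx = properties.index("GP:flagellum")
print({name: bool(vec[idx]) for name, vec in proteomes.items()})
```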
Abstract: Advances in biomedical sciences are increasingly dependent on knowledge encoded in curated biomedical databases. In particular, the Universal Protein Resource (UniProt) provides the scientific community with a comprehensive and accurately annotated protein sequence knowledgebase. The organization of UniProt protein entries and their associated publications into different topics, such as expression, function and interaction, helps users to find the information of interest in the knowledgebase. The current topic classification approach for computationally mapped bibliography, based solely on underlying sources, is limited. We investigate the use of (semi-)automated classifiers for helping UniProt to classify the scientific biomedical literature according to 11 topics. Publications annotated by UniProt curators are labeled using one or more of these classes. As such, this is a multi-class, multi-label classification problem. Our algorithm works as follows: given a text passage, such as an article abstract, it provides a ranked list of classes and the probabilities that the passage belongs to the available classes. We investigate a text embedding model, Doc2Vec, to compute document similarity and compare several machine learning methods, such as Naïve Bayes, kNN, logistic regression and MLP, for assigning class similarities. We use a collection of 100,000 documents classified by the UniProt team to train (99,000) and test (1,000) our algorithms. The validation of the classifier parameters was performed using 5% of the training collection. We compare the text embedding models with a baseline model based on bag of words, where divergence from randomness is used to compute document similarity and kNN to assign classes. In general, our algorithms achieved a high classification precision. The baseline model achieved a mean average precision (MAP) of 0.8270. Apart from the Naïve Bayes model (MAP 0.7163), all the models based on the text embedding approach outperformed the baseline method. Logistic regression achieved a MAP of 0.8376 (p=.32), MLP achieved a MAP of 0.8413 (p=.18), and kNN achieved a MAP of 0.8485 (p=.04). We believe that such classifiers could help improve productivity and reduce the cost of certain biocuration tasks. The next steps will be to apply the methodology to unclassified documents and to assess its effectiveness in assisting curators in judging the relevance of articles for further curation.
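A minimal sketch of the embed-then-classify approach, using gensim's Doc2Vec and a one-vs-rest logistic regression for the multi-label step, is shown below; the toy abstracts and topic labels are invented, and the real system is trained on the ~100,000 UniProt-classified documents described above.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

abstracts = [
    "the kinase phosphorylates its substrate and binds a regulatory partner",
    "expression of the gene is elevated in liver tissue",
    "the protein interacts with the receptor complex at the membrane",
    "transcript levels increase during development in muscle",
]
labels = [{"function", "interaction"}, {"expression"}, {"interaction"}, {"expression"}]

# Learn document embeddings.
tagged = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(abstracts)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40, seed=1)
X = [d2v.infer_vector(text.split()) for text in abstracts]

# Multi-class, multi-label classification with per-class probabilities.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

new_doc = "the enzyme binds and phosphorylates a membrane receptor"
probs = clf.predict_proba([d2v.infer_vector(new_doc.split())])[0]
ranked = sorted(zip(mlb.classes_, probs), key=lambda p: -p[1])
print(ranked)  # ranked list of topics with probabilities
```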
Abstract: The Chinese Human Proteome Project (CNHPP) is a major proteomics project funded by the Chinese government to delineate protein profiles of major human organs/tissues in normal and disease states with proteomics-centric pan-omics approaches. A big data infrastructure has initially been built to provide a wide range of data services through the following modules: 1) the Data Analysis Platform for Life Science (DAP-LS) for data acquisition and bulk data analysis; 2) the Big Data Management System (BDMS), developed with SQL, NoSQL and Lucene indexing technologies; 3) the Bioso! Framework for data integration and annotation; 4) the Life-Omics Data Portal (LODP) for data search, browsing and integrated presentation; and 5) the Omics Data Analysis Studio (ODAS) for ad-hoc online data analysis. By working closely with public initiatives, the infrastructure includes components interfacing with the public community for data sharing and data standardization. While CNHPP-BDI as a whole infrastructure is still under active development and optimization, all its functional modules are already in operation in support of the CNHPP project. The current status of CNHPP-BDI will be discussed, and its unique features in addressing big data challenges will be highlighted.
Abstract: The Methylation Bank (MethBank; http://bigd.big.ac.cn/methbank) is the first database covering both DNA and RNA methylation data. It features integration of six sub-databases, including three genome-wide methylation databases for human, animals and plants, denoted as MethBank-Human, MethBank-Animal and MethBank-Plant, respectively, and three knowledgebases for different methylation types, denoted as MethBank-6mdA, MethBank-5mrC and MethBank-5hmrC, respectively. MethBank-Human is a database for human epigenetic aging, integrating 4,769 blood samples from healthy people of different ages and providing age-specific differentially methylated regions (aDMRs) and related genes. MethBank-Animal is a database for embryonic development studies, integrating 18 genome-wide single-base resolution methylomes of gametes and early embryos in different model organisms (Danio rerio and Mus musculus) and focusing on regulatory mechanisms of DNA methylation in embryonic development. MethBank-Plant incorporates 72 high-quality whole-genome bisulfite sequencing methylome maps for five economically important crops (Oryza sativa, Glycine max, Manihot esculenta, Phaseolus vulgaris and Solanum lycopersicum) and features genome-wide profiling of methylation distributions across chromosomes, identification of differentially methylated promoters (DMPs) between a range of conditions, and visualization of methylation levels for genes, regions and CpG islands across multiple different samples. The three knowledgebases, MethBank-6mdA, MethBank-5mrC and MethBank-5hmrC, collect all published information for 6mdA, 5mrC and 5hmrC (up to Oct. 2016), respectively, such as species, tissues, types, enzymes, related genes and functions. Moreover, MethBank provides three analysis tools (viz., WBSA, IDMP, and BS-RNA) for whole-genome bisulfite sequencing data of DNA and RNA. As one of the core database resources of the BIG Data Center, MethBank will be continuously upgraded with the aim of serving as an important resource for epigenetic studies throughout the world.
Session 3: DATABASE Virtual Issue Session
Monday, March 27, 9:30AM-12 noon
Chair: J. Michael Cherry
Abstract: Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness, and inconsistencies with the published literature. In this work, we seek to investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records.
Specifically, we emphasize the detection of records that are inconsistent with respect to the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature and then applying query quality predictors. We then carry out an analysis showing that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using Principal Component Analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that 1 record out of 4 is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records.
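The projection step can be sketched as follows; the 24 indicator values per record are simulated here rather than computed from GenBank records and query quality predictors.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulated data: each record is a vector of 24 literature-consistency indicators.
consistent = rng.normal(loc=0.7, scale=0.1, size=(200, 24))
inconsistent = rng.normal(loc=0.4, scale=0.1, size=(20, 24))
records = np.vstack([consistent, inconsistent])
labels = np.array([0] * 200 + [1] * 20)  # 1 = reported as inconsistent with the literature

# Reduce the 24-dimensional representation to 2 components for visualization.
coords = PCA(n_components=2).fit_transform(records)

# Records flagged as inconsistent should fall roughly in the same area of the plot.
print("Centroid of inconsistent records:", coords[labels == 1].mean(axis=0))
print("Centroid of remaining records:   ", coords[labels == 0].mean(axis=0))
```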
Abstract: The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics (MGI) resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow.
We present an effective – yet relatively simple – biomedical document classification scheme to assist curators in identifying publications that are relevant to GXD. We use a large manually curated dataset, consisting of more than 25,000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method.
The classification scheme we propose uses readily available tools while employing several strategies for statistical feature selection to reduce the document representation size and to focus on terms that support the GXD classification task. As image captions in biomedical documents typically carry significant and useful information for determining the relevance of a publication to a topic, we select text features obtained from image captions as well as from the title and the abstract. We employ and compare widely used classifiers such as Random Forest and Naïve Bayes, and assess them via widely accepted measures for document classification evaluation, including precision and recall, as well as the utility measure, which favors high recall.
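A simplified sketch of such a pipeline, combining text from title, abstract and image captions with statistical feature selection and an off-the-shelf classifier, is shown below; the documents, labels and parameter choices are illustrative only.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

def document_text(doc):
    """Concatenate title, abstract and image captions into a single feature source."""
    return " ".join([doc["title"], doc["abstract"], " ".join(doc["captions"])])

docs = [
    {"title": "Expression of gene X in the embryo", "abstract": "in situ hybridization ...",
     "captions": ["X expression at E10.5 in the neural tube"]},
    {"title": "Kinetics of enzyme Y", "abstract": "biochemical assays of purified protein ...",
     "captions": ["Michaelis-Menten curve for Y"]},
]
labels = [1, 0]  # 1 = relevant to GXD, 0 = irrelevant

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("select", SelectKBest(chi2, k="all")),   # keep the terms that best support the GXD task
    ("clf", MultinomialNB()),
])
pipeline.fit([document_text(d) for d in docs], labels)
print(pipeline.predict([document_text(docs[0])]))
```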
Our experiments over the large GXD dataset show that our classifiers retain a high level of performance across a variety of cross-validation settings and performance measures. This indicates that our classifier is effective and can indeed be useful in practice. We also ran multiple experiments using different sets of features, showing that classifiers relying on features obtained from the title, abstract and image captions perform better according to every measure, which demonstrates that image captions indeed provide valuable information supporting the GXD document classification task. Our classifier retains a similar level of performance on documents outside the training/test set, demonstrating that it is stable and practically applicable.
The biomedical document classification method we present is effective and robust over a realistically large biomedical document dataset derived from the GXD. Moreover, the performance of our classifier affirms the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area.
Abstract: Experimentally generated biological information needs to be organized and structured in order to become meaningful knowledge. However, the rate at which new information is being published makes manual curation increasingly unable to cope. Devising new curation strategies that leverage upon data mining and text analysis is therefore a promising avenue to help life science databases to cope with the deluge of novel information. In this paper, we describe the integration of text mining technologies in the curation pipeline of the RegulonDB database, and discuss how the process can enhance the productivity of the curators.
Specifically, a named entity recognition approach is used to pre-annotate terms referring to a set of domain entities which are potentially relevant for the curation process. The annotated documents are presented to the curator, who, thanks to a custom-designed interface, can select sentences containing specific types of entities, thus restricting the amount of text that needs to be inspected. Additionally, a module capable of computing semantic similarity between sentences across the entire collection of articles to be curated was integrated in the system. We tested the module using three sets of scientific articles and six domain experts.
All these improvements are gradually enabling us to obtain a high throughput curation process with the same quality as manual curation. The work presented in this paper is part of an NIH-sponsored collaborative project aimed at improving the curation process of the RegulonDB database. RegulonDB is the primary database on transcriptional regulation in Escherichia coli K-12, containing knowledge manually curated from original scientific publications, complemented with high throughput datasets and comprehensive computational predictions.
Abstract: The Immune Epitope Database (IEDB) project incorporates independently developed ontologies and controlled vocabularies into its curation and search interface. This simplifies curation practices, improves the user query experience, and facilitates interoperability between the IEDB and other resources. While the use of independently developed ontologies has long been recommended as a best practice, there continues to be a significant number of projects that develop their own vocabularies instead, or that do not fully utilize the power of ontologies that they are using. We describe how we use ontologies in the IEDB, providing a concrete example of the benefits of ontologies in practice.
Abstract: With the advancement of genome sequencing technologies, new genomes are being sequenced daily. While these sequences are deposited in publicly available data warehouses, their functional and genomic annotations mostly reside in the text of primary publications. Biocurators are hard at work extracting those annotations from the literature for the most studied organisms and depositing them in structured databases. However, the resources don’t exist to fund the comprehensive curation of the thousands of newly sequenced organisms in this manner. Curating data at this scope requires a new approach. With this in mind we have seeded Wikidata with the genetic data of over 120 reference genomes of various organisms, chemical data and disease data, as well as annotations that link these concepts together. Wikidata is an openly editable knowledge graph with the goal of aggregating published knowledge into a free and open database; it is a centralized and stable data warehouse that has the potential to serve as the community owned hub of scientific knowledge. WikiGenomes (wikigenomes.org) is a web application that facilitates the consumption and curation of genomic data in Wikidata by the entire scientific community. WikiGenomes empowers community curation of genomic and biomedical knowledge through providing a domain-specific application built on top of Wikidata, bringing that curated knowledge to the public domain. It is a technology layer that allows anyone to contribute to a central knowledge base through structured annotation forms, and the concept of WikiGenomes will hopefully demonstrate the potential that Wikidata has for being a central backend to all different types of domain specific applications.
Abstract: The Maize Genetics and Genomics Database (MaizeGDB) team prepared a survey to identify breeders’ needs for visualizing pedigrees, diversity data, and haplotypes in order to prioritize tool development and curation efforts at MaizeGDB. The survey was distributed to the maize research community on behalf of the Maize Genetics Executive Committee in Summer 2015. The survey garnered 48 responses from maize researchers, of which more than half were self-identified as breeders. The survey showed that the maize researchers considered their top priorities for visualization as: 1) displaying single nucleotide polymorphisms (SNPs) in a given region for a given list of lines, 2) showing haplotypes for a given list of lines, and 3) presenting pedigree relationships visually. The survey also asked which populations would be most useful to display. The following two populations were at the top of the list: 1) 3000 publicly available maize inbred lines used in Romay et al. (Genome Biol, 14:R55, 2013), and 2) maize lines with expired Plant Variety Protection Act (ex-PVP) certificates. Driven by this strong stakeholder input, MaizeGDB staff are currently working in four areas to improve its interface and web-based tools: 1) presenting immediate progenies of currently available stocks at the MaizeGDB Stock pages, 2) displaying the most recent ex-PVP lines described in the Germplasm Resources Information Network (GRIN) on the MaizeGDB Stock pages, 3) developing network views of pedigree relationships, and 4) visualizing genotypes from SNP-based diversity datasets. These survey results can help other biological databases to direct their efforts according to user preferences as they serve similar types of data sets for their communities.
Abstract: Objectives: The report proposes a new method to prioritize articles containing information related to protein-protein interactions (PPIs) and post-translational modifications (PTMs), which are typical relationship curation tasks. Prioritizing papers to extract relationships requires the development of specific methods. Unlike the prioritization of papers to annotate the normal or pathologic functions of proteins, relationships (e.g., “binding”) cannot be described with simple onto-terminological descriptors, but require the recognition of both named entities and relational entities.
Materials: We tested the model on a set of 300 protein kinases: 100 were used to tune the new services, while the effectiveness of the triage service was assessed with the remaining dataset.
Methods: We defined sets of keywords useful for the data of interest: one list for PPIs and one for PTMs. All occurrences of these descriptors in MEDLINE were marked-up and indexed. Next, the index was associated with a local vector-space search engine by linear combination in order to define an optimal ranking of PMIDs relevant to annotate protein interactions and PTMs. We also evaluated a query refinement strategy by adding specific keywords (such as “binds” or “interacts”) to the original query. We also designed specific use cases to illustrate the interactional architecture (mockups and interaction diagrams) between the text mining services and curation platforms.
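The ranking step can be illustrated with a toy sketch: a linear combination of a retrieval score with a keyword-density score, plus simple query refinement. The keyword lists, weights and candidate scores below are placeholders, not the nextA5 implementation.

```python
PPI_KEYWORDS = {"binds", "interacts", "complex", "association"}   # placeholder list
REFINEMENT_TERMS = ["binds", "interacts"]                          # added to the original query

def keyword_density(abstract, keywords):
    tokens = abstract.lower().split()
    return sum(token in keywords for token in tokens) / max(len(tokens), 1)

def combined_score(retrieval_score, abstract, keywords, alpha=0.7):
    """Linear combination of a vector-space retrieval score and keyword density."""
    return alpha * retrieval_score + (1 - alpha) * keyword_density(abstract, keywords)

def refine_query(query, terms=REFINEMENT_TERMS):
    return query + " " + " ".join(terms)

# Toy candidate PMIDs with retrieval scores from a local search engine (placeholder values).
candidates = [
    ("PMID:1", 0.62, "Kinase A binds and phosphorylates substrate B forming a stable complex"),
    ("PMID:2", 0.71, "Expression of kinase A is elevated in tumour tissue"),
]
ranked = sorted(candidates,
                key=lambda c: combined_score(c[1], c[2], PPI_KEYWORDS),
                reverse=True)
print(refine_query("kinase A interaction"))
print([pmid for pmid, _, _ in ranked])  # prioritized papers for PPI curation
```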
Results: Compared to PubMed, the search effectiveness of the new triage service is improved by +191% for the prioritization of papers useful to curate PPIs and +261% for papers relevant to curate PTMs.
Conclusion: Combining simple retrieval and query refinement strategies with a ranking function based on the density of a priori descriptors is effective to improve triage in complex curation tasks.
URL: http://casimir.hesge.ch/nextA5
Abstract: Background
Neurodegenerative disorders such as Parkinson’s and Alzheimer’s disease cause large numbers of people to suffer (e.g., a projected 9 million Parkinson’s patients by 2030) and healthcare costs to increase constantly. In order to provide successful interventions and reduce costs, both the causes and the pathological processes need to be understood. The ApiNATOMY project aims to contribute to our understanding of the origins and pathology of neurodegenerative disorders by manually curating and abstracting data from the literature. As curation is labour-intensive, automated methods are sought to speed up manual curation. Here we present our tool for marking up PDFs, which streamlines the ApiNATOMY curation task.
Method
PDFs are first converted into XML files and all recognised sentences are processed individually. Sentences potentially relevant to the curator are identified through an algorithm that calculates a score based on linguistic features (cardinal numbers preceding nouns, characteristic subject-predicate pairs), semantic features (named entities) and spatial features (splitting of papers into regions and section assignment). A sentence is considered relevant if this score exceeds a threshold.
To develop and evaluate the tool, we used PDF files that had been manually assessed and highlighted as part of the ApiNATOMY project. The data was divided into two sets: development (183 papers) and test (58 papers). As the evaluation relies on the automated recognition of curator highlights through externally provided software, we further manually investigated 22 papers from the test set to assess whether the automated recognition of highlights is reliable and to identify problems with the developed algorithm.
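A toy sketch of the sentence-scoring idea follows; the feature names, weights and threshold are invented for illustration and do not reflect the tool's actual parameters.

```python
# Illustrative weights; the real feature set and weights in the tool may differ.
WEIGHTS = {
    "cardinal_before_noun": 1.0,   # linguistic feature
    "subject_predicate_pair": 1.5, # linguistic feature
    "named_entity": 2.0,           # semantic feature
    "in_results_section": 1.0,     # spatial/section feature
}
THRESHOLD = 3.0

def sentence_score(features):
    """Weighted sum of the binary/count features extracted for a sentence."""
    return sum(WEIGHTS[name] * value for name, value in features.items() if name in WEIGHTS)

def is_relevant(features, threshold=THRESHOLD):
    return sentence_score(features) >= threshold

example = {"cardinal_before_noun": 1, "named_entity": 2, "in_results_section": 1}
print(sentence_score(example), is_relevant(example))  # 6.0 True -> highlight for the curator
```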
Result
Employing this algorithm on the test set manually corrected for the imprecision of PDF conversion, we achieved a macro-averaged F1-measure of 0.51, which is an increase of 132% compared to the best bag-of-words baseline model. A user based evaluation was also conducted to assess the usefulness of the methodology on 40 unseen publications, which revealed that in 85% of the publications, all highlighted sentences were relevant and in about 65% publications, the highlights were sufficient to support the curation task without the need to consult the full text.
Conclusion
Our initial results are encouraging and we believe that the results presented are a promising first step to automatically preparing PDF documents to speed up manual curation. The tool is open source and web accessible (http://napeasy.org/).
Abstract: Due to recent advancements in the production of experimental proteomic data, the Saccharomyces Genome Database (SGD; www.yeastgenome.org) has been expanding our protein curation activities to make new data types available to our users. Because of broad interest in post-translational modifications (PTM) and their importance to protein function and regulation, we have recently started incorporating expertly curated PTM information on individual protein pages. Here we also present the inclusion of new abundance and protein half-life data obtained from high-throughput proteome studies. These new data types have been included with the aim to facilitate cellular biology research.
Abstract: The Saccharomyces Genome Database (SGD; www.yeastgenome.org), the primary genetics and genomics resource for the budding yeast S. cerevisiae, provides expertly curated information about the yeast genome and its gene products freely to the public. As the central hub for the research community, SGD engages in a variety of social outreach efforts to inform our users about new developments, promote collaboration, increase public awareness of the importance of yeast to biomedical research, and facilitate scientific discovery. Here we describe these various outreach methods, from networking at scientific conferences to the use of online media such as blog posts and webinars, and include our perspectives on the benefits provided by outreach activities for model organism databases.
Session 4: Functional Annotation
Monday, March 27, 1:30-3:00 PM
Chair: Sylvain Poux
Abstract: The Enzyme Commission classifies enzymes based on the reactions they catalyze. In addition to a description of the enzymatic activity, each classified enzyme receives a descriptive and accurate name and a unique number, known as an EC number. The use of EC numbers over the past 55 years has made it possible for scientists to refer to enzymes in a consistent and unambiguous way, but EC numbers became even more useful with the advent of bioinformatics. By annotating genes with EC numbers, resources such as UniProt and SRI International’s BioCyc are able to computationally link these genes to precise enzymatic activities.
Despite the fact that the Enzyme Commission is a very small group of volunteers, in the last few years EC classification activity has picked up significantly. Since 2010 the commission has created over 1800 new entries and modified 650 existing ones. As a member of the Enzyme Commission and a curator of the MetaCyc database I will provide a short historical background of the commission’s activity, describe the current procedures of enzyme classification, discuss the linkage between MetaCyc curation and EC classification, and provide some data regarding the dissemination of updated EC information to relevant databases. Bioinformatics resources that make use of EC numbers should update their EC datasets frequently to take advantage of the rapid rate of changes to EC numbers.
Abstract: Protein kinases form one of the largest protein families and are found in all taxonomic groups from viruses to humans. They catalyse the reversible phosphorylation of proteins, often modifying their activity and localisation. They are implicated in virtually all cellular processes and are one of the most intensively studied protein families. Aberrant protein kinase function has been linked to the development of a wide range of diseases, and kinases have become key therapeutic targets in recent years. The vast amount of kinase-related data contained in the scientific literature and across data collections highlights the need for a central repository where this information is stored in a concise and easily accessible manner. The UniProt Knowledgebase (UniProtKB) meets this need by providing the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. Following on from previous work in UniProtKB to curate the human and mouse kinomes, comprehensive curation of all characterised kinases of the nematode worm Caenorhabditis elegans was undertaken. The C. elegans kinome consists of 438 kinases of which almost half have been functionally characterised, highlighting that C. elegans is a key model organism in contributing to our understanding of kinase function and regulation. An overview of the expert curation process for kinases and of the C. elegans kinome will be presented. All data are freely available from www.uniprot.org.
Abstract: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. Assigning functions to biological macromolecules, especially proteins, turns out to be one of the major challenges in understanding life at the molecular level. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, properly assessing methods for protein function prediction and tracking progress in the field remain challenging as well.
The Critical Assessment of Functional Annotation (CAFA) is a timed challenge to assess computational methods that automatically assign protein function. Here we report the results of the second CAFA challenge, and outline the new changes that will take place in the third CAFA this year.
One hundred and twenty six methods from 56 research groups were evaluated for their ability to predict biological functions for 3,681 proteins from 11 species in the second CAFA challenge. These functions are described by the Gene Ontology (GO) and the Human Phenotype Ontology (HPO). CAFA2 featured increased data size as well as improved assessment metrics, especially the additional assessment metrics based on semantic similarity. Comparisons between top-performing methods in CAFA1 and CAFA2 showed significant improvement in prediction accuracy, demonstrating the general improvement of automatic protein function prediction algorithms. These comparisons also showed that the performance of different metrics is ontology-specific, revealing that the different evaluation metrics can be used to probe the nature of protein functions in different biological processes and human phenotypes.
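For context, a compact sketch of the protein-centric Fmax measure commonly used in CAFA-style assessments is shown below (the semantic-similarity metrics mentioned above are not reproduced); it is simplified, e.g. predicted and true term sets are assumed to be already propagated through the ontology.

```python
import numpy as np

def fmax(predictions, truths, thresholds=np.linspace(0.01, 1.0, 100)):
    """predictions: {protein: {GO term: score}}; truths: {protein: set of GO terms}."""
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, truth in truths.items():
            predicted = {term for term, score in predictions.get(protein, {}).items() if score >= t}
            if predicted:
                precisions.append(len(predicted & truth) / len(predicted))
            recalls.append(len(predicted & truth) / len(truth))
        if not precisions:
            continue
        p, r = np.mean(precisions), np.mean(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best

preds = {"P1": {"GO:0001": 0.9, "GO:0002": 0.4}, "P2": {"GO:0003": 0.8}}
truth = {"P1": {"GO:0001"}, "P2": {"GO:0003", "GO:0004"}}
print(round(fmax(preds, truth), 3))
```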
CAFA3 was launched in September 2016, featuring expanded protein sets for predictions. We are using whole genome screens to generate term-centric tracks for Drosophila melanogaster, Pseudomonas aeruginosa and Candida albicans. Additionally, we will use sets of moonlighting proteins, and prediction of binding sites to further challenge function prediction methods.
Abstract: The Reference Sequence (RefSeq) project at NCBI provides annotated genomic, transcript and protein sequences for genomes across a broad taxonomic spectrum. The known RefSeq transcripts are derived from sequence data submitted to INSDC and may be subjected to additional curation to provide the most complete and accurate sequence and annotation for a gene. The curated RefSeq dataset is a critical reagent for NCBI’s eukaryotic genome annotation pipeline and is considered a gold standard by many in the scientific community. The RefSeq project has also recently focused on targeted curation of genes, such as those with exceptional biology.
The term recoding is used to describe non-standard decoding of the genetic code, events that are stimulated by signals embedded within the mRNAs of the recoded genes. Several highly conserved recoding events in vertebrates have been described in the literature, such as: ribosomal frameshift (RF), where the ribosome slips either in a +1 or -1 direction at a specific site during translation to yield a protein product from 2 overlapping open reading frames; use of UGA (which normally functions as a stop codon) to encode the non-universal amino acid (aa) selenocysteine (Sec); and stop codon readthrough (SCR), where a stop codon is recoded as a standard aa, which results in translation extension beyond the annotated stop codon to an in-frame downstream stop codon, generating a C-terminally extended protein isoform.
The recoded gene products have important roles in human health and disease; hence their correct annotation is vital to preserve functional information. Conventional computational tools cannot distinguish between the dual functionality of the UGA codon or predict RF or SCR, resulting in misannotation of the coding sequence and protein on primary sequence records. Manual curation is thus essential, so our goal is to provide an accurately curated and annotated RefSeq data set of the recoded gene products to serve as standards for genome annotation and for biomedical research.
The curation and annotation of antizyme genes, which require +1 RF for antizyme synthesis, was the subject of our recent publication (PMID:26170238). To date, the paternally expressed PEG10 gene is the best characterized gene in mammals that utilizes -1 RF for protein expression. Currently, the RefSeq database includes 242 curated RefSeq records for antizymes, 64 for PEG10, 472 for Sec-containing selenoproteins, and 65 for genes reported to utilize SCR.
Abstract: Short text paragraphs that describe gene function, often referred to as gene summaries, are regarded as high-value data by users of biological databases for the ease with which they convey the key aspects of a gene’s role in biology. Fully manual curation of gene summaries, while desirable, is difficult for databases to sustain. Therefore we developed an algorithm that automatically generates gene summaries simulating natural language. The automated gene summaries consist of data that belongs to different biological aspects of a gene. The method uses curated, structured data in WormBase resulting from several curation projects such as the curation of homologs/orthologs, annotation of Gene Ontology (GO) terms to genes, curation of gene expression patterns in tissues and at the sub-cellular level, and curation of large scale data from microarray and tiling array experiments. The method also makes use of pre-built templates for the generation of natural language sentences. To improve readability of the summaries, specific rules for each semantic category were created. Initially developed for genes of the nematode Caenorhabditis elegans, the algorithm that generates gene summaries has been extended to a total of eight additional nematode species including the parasitic species, Brugia malayi, Onchocerca volvulus and Strongyloides ratti, suggesting that this approach will be broadly applicable. The automated gene summaries are regenerated with each WormBase release ensuring that they reflect new and/or updated primary data. The software and data are available at http://textpresso.org/automatedgenesummary under the software and release directories respectively. Though our method of generating gene summaries employs a relatively simple algorithm that may not represent a major technical advance in text summarization, it is a tool of immense practical value that effectively leverages data type-specific curation in a biological database project to generate thousands of gene summaries, saving valuable curator time and effort.
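A toy sketch of template-based summary generation from structured annotations follows; the templates, categories and gene data are invented and greatly simplified compared with the per-category rules described above.

```python
# Illustrative templates keyed by data category; the real pipeline uses curated
# WormBase data and per-category rules, which are simplified away here.
TEMPLATES = {
    "ortholog":   "{gene} is an ortholog of {value}.",
    "go_process": "{gene} is involved in {value}.",
    "expression": "{gene} is expressed in {value}.",
}

def summarize(gene, annotations):
    """Assemble a short gene summary from structured annotations using templates."""
    sentences = []
    for category, values in annotations.items():
        if category in TEMPLATES and values:
            joined = ", ".join(values[:-1]) + " and " + values[-1] if len(values) > 1 else values[0]
            sentences.append(TEMPLATES[category].format(gene=gene, value=joined))
    return " ".join(sentences)

annotations = {
    "ortholog":   ["human ABC1"],
    "go_process": ["cell migration", "axon guidance"],
    "expression": ["neurons", "body wall muscle"],
}
print(summarize("abc-1", annotations))  # "abc-1" and its annotations are hypothetical
```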
Abstract: BACKGROUND Biological processes are accomplished by the coordinated action of sets of gene products. However, some processes are rarely connected to each other because they are functionally, temporally or spatially distant. We speculated that we could identify pairs of biological processes which were unlikely to be co-annotated to the same gene products (e.g. amino acid metabolism and cytokinesis), and use the mutually exclusive processes identified to create rules which alert curators to possible annotation errors.
RESULTS/CONCLUSIONS Three proof-of-principle case studies were performed across multiple species: i) a longitudinal study tracking fission yeast GO slim terms for over 8 years; ii) a species-wide study for gene products annotated to the cohesin complex; iii) a focussed project to inspect annotations for a set of five GO slim terms across fission yeast, budding yeast, worm and mouse to identify incorrect manual annotations, automated annotations and ontology errors. To date, this project has corrected errors in the ontology and annotation affecting over one million individual annotations across all species and generated a preliminary rule set to prevent future co-annotation anomalies.
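A minimal sketch of how such mutual-exclusion rules could be applied as a co-annotation check is shown below; the process names and rule pairs are illustrative, whereas real rules would use GO identifiers and curated evidence.

```python
# Pairs of biological processes taken to be mutually exclusive for the same gene
# product (illustrative only).
MUTUALLY_EXCLUSIVE = {
    frozenset({"amino acid metabolism", "cytokinesis"}),
    frozenset({"ribosome biogenesis", "signal transduction"}),
}

def co_annotation_violations(gene, annotated_terms):
    """Return mutually exclusive term pairs that are co-annotated to the same gene product."""
    terms = list(annotated_terms)
    violations = []
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if frozenset({a, b}) in MUTUALLY_EXCLUSIVE:
                violations.append((gene, a, b))
    return violations

annotations = {"geneX": {"amino acid metabolism", "cytokinesis", "translation"}}
for gene, terms in annotations.items():
    for violation in co_annotation_violations(gene, terms):
        print("Possible annotation error:", violation)  # flag for curator review
```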
Session 5: Text Mining
Monday, March 27, 3:30-5:00PM
Co-chairs: Johanna McEntyre and Senay Kafkas
Abstract: Biological knowledgebases, such as UniProtKB/Swiss-Prot, constitute an essential component of daily scientific research by offering distilled, summarized, and computable knowledge extracted from the literature by expert curators.
While knowledgebases play an increasingly important role in the scientific community, the question of their sustainability is raised due to the growth of biomedical literature. By using UniProtKB/Swiss-Prot as a case study, we address this question by using different literature triage approaches. With the assistance of the PubTator text-mining tool, we tagged more than 10,000 articles to assess the ratio of papers relevant for curation.
We first show that curators read and evaluate many more papers than they curate, and that measuring the number of curated publications is insufficient to provide a complete picture. We show that a large fraction of published papers found in PubMed is not relevant for curation in UniProtKB/Swiss-Prot and demonstrate that, despite appearances, expert curation is sustainable.
Abstract: The DARPA Big Mechanism program is building computer systems that assemble large mechanistic models about cancer signaling. A key part of this is automated reading of scientific papers for molecular mechanisms that are relevant to the models. To accomplish this, reading systems must read full text articles, identify salient mechanistic findings in the article, accurately capture mechanisms, and assemble mechanisms from reading into models.
In the first year of the program, machine systems processed 1000 full text papers in one week, and returned candidate mechanistic interactions about proteins, drugs, and genes with supporting text evidence and the relationships of these interactions to a mechanistic model. In the second year of the program, the goal was to develop automated systems that could assemble interactions from a full text article into a ranked list of (non-redundant) major mechanistic findings, including binding, phosphorylation, gene expression, and translocation events. To be considered a major finding, a mechanism had to be mentioned in at least three text passages and/or figure legends in a paper. To measure the ability of the systems to extract major findings, three biologists independently curated papers for these mechanistic findings. Of the mechanistic findings independently found by all three biologists, 73% of the findings were correctly identified by the best performing automated machine system. The machine systems were particularly good at retrieving phosphorylation and binding events.
The second year reading task also included a measure of precision: the percent of top 10 findings returned by each system judged correct by human curators. For the best performing system, 67% of the relations returned were correct (ignoring linkage to database identifiers). Of the correct interactions extracted, reading systems correctly attached UniProt or PubChem identifiers for ~75% of the entities, with errors common for protein families and complexes involved in cancer signaling pathways.
In Year 3, the role of the reading systems is to extend an existing model to explain experimental findings for a set of cell lines, tumor types and drugs.
---
This work was supported under the DARPA Big Mechanism program, contract W56KGU-15-C-0010. This technical data was produced for the U. S. Government under Basic Contract No. W56KGU-16-C-0010, and is subject to the Rights in Technical Data Noncommercial Items clause at DFARS 252.227-7013 (FEB 2012)
Abstract: Curation is a high-precision task that is essential both to construct biomolecular databases and to keep them updated. Manual curation is labor-intensive, considering the tremendous growth in biomedical literature, and there is an urgent need to support the curation process computationally to make it more scalable. To this end, text mining offers a means to identify and extract statements from scientific texts that describe various biological entities and the associations between them. Furthermore, when represented in a machine-readable format, these text-mined annotations can be used to automatically link the annotations in publications to the related biomolecular databases, providing clear provenance for curators.
Here, we present SciLite [1], an annotation platform developed as part of the Europe PMC literature database [2] that exposes text-mined outputs to anyone reading content on the website. In the context of curation, this platform provides a mechanism to make deep links between the literature and data for clear provenance of curatorial statements. We will demonstrate how SciLite displays molecular interactions identified via both text mining and manual curation. Molecular interaction annotations in SciLite currently cover 100 manually curated interactions from IntAct [3] and 1,364,445 text-mined annotations from open access full-text articles. We will show how linking between curated databases and the source research articles can be achieved. We also plan to provide links from the IntAct database interface to the molecular interaction annotations in SciLite.
References:
1. Venkatesan A, Kim JH, Talo F, et al. SciLite: a platform for displaying text-mined annotations as a means to link research articles with biological data [version 1; referees: 1 approved with reservations]. Wellcome Open Res. 2016;1:25.
2. Europe PMC Consortium. Europe PMC: a full-text literature database for the life sciences and platform for innovation. Nucleic Acids Res. 2015;43(Database issue).
3. Orchard S, Ammari M, Aranda B, et al. The MIntAct project -- IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014;42(Database issue).
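As an illustration of how text-mined annotations like those surfaced by SciLite can also be retrieved programmatically, the sketch below queries the Europe PMC Annotations API for a single article. The endpoint path, parameter names and response fields reflect our reading of the public API documentation and may have changed; the article identifier and annotation type are placeholders.

```python
# A minimal sketch of retrieving text-mined annotations for one article from the
# Europe PMC Annotations API (the service behind SciLite). Verify the endpoint and
# parameters against https://europepmc.org/AnnotationsApi before relying on them.
import requests

BASE = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"

def fetch_annotations(source, article_id, annotation_type):
    params = {
        "articleIds": f"{source}:{article_id}",   # e.g. "MED:26673591" or "PMC:PMC4995997"
        "type": annotation_type,                  # assumed type label, e.g. "Gene Ontology"
        "format": "JSON",
    }
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()
    for article in resp.json():
        for ann in article.get("annotations", []):
            # Each annotation carries the matched text plus linked database identifiers.
            yield ann.get("exact"), [tag.get("uri") for tag in ann.get("tags", [])]

if __name__ == "__main__":
    for text, uris in fetch_annotations("MED", "26673591", "Gene Ontology"):
        print(text, uris)
```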
Abstract: Metabolic reconstructions provide a deep understanding of the mechanisms underpinning the metabolic processes of an organism of interest, and have become central to studies in application areas such as drug target discovery, metabolic engineering and synthetic biology. The production and revision of these reconstructions, however, are very costly in terms of time and effort, requiring manual analysis of vast amounts of scientific literature. Text mining methods lend themselves well to the task of automatically analysing huge amounts of scientific literature in order to extract fine-grained information. As part of the Enriching Metabolic Pathway Models with Evidence from the Literature (EMPATHY) project, we have developed text mining methods that aim to support the creation and revision of metabolic reconstructions by automatically extracting and consolidating information on metabolic reactions from full-text scientific articles. We cast the identification of metabolic reactions as an event extraction task. This was addressed through the development of an approach underpinned by: (1) the machine learning-based EventMine event extraction tool (http://www.nactem.ac.uk/EventMine), trained on the training subset of the Metrecon corpus of metabolic reaction-annotated PubMed abstracts (https://peerj.com/articles/1811); and (2) rules defined based on lexico-syntactic patterns and subcategorisation frames (SCFs) of metabolism-relevant verbs, which were derived using the Enju parser (http://www.nactem.ac.uk/enju). Our results on the evaluation subset of the Metrecon corpus show that the incorporation of SCF-based rules into the event extraction pipeline leads to improved performance, especially in terms of recall. Using the web-based Argo text mining platform (http://argo.nactem.ac.uk), the proposed method was implemented as an automatic workflow that was applied to the PubMed Central Open Access collection. To facilitate straightforward querying and visualisation of results, the automatically extracted metabolic reactions were written to comma-separated values (CSV) files that are then imported into a Neo4j graph database. The resulting database serves as a supporting resource for metabolic reconstruction efforts by enabling the discovery of previously hidden relationships between metabolites and enzymes of interest, grounded in evidence from the literature.
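The CSV-to-Neo4j step described above could look roughly like the following sketch; the column names, node labels and relationship types are invented here and are not the schema used by the EMPATHY project.

```python
# A sketch of exporting extracted reactions to CSV and importing them into Neo4j.
# All records and graph labels below are hypothetical, for illustration only.
import csv

reactions = [
    # (enzyme, substrate, product, evidence) -- toy records
    ("hexokinase", "glucose", "glucose 6-phosphate", "PMC0000001"),
    ("phosphofructokinase", "fructose 6-phosphate", "fructose 1,6-bisphosphate", "PMC0000002"),
]

with open("reactions.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["enzyme", "substrate", "product", "evidence"])
    writer.writerows(reactions)

# Cypher that a Neo4j instance could run (e.g. via cypher-shell or the neo4j
# Python driver) once the CSV file is placed in Neo4j's import directory.
LOAD_CSV = """
LOAD CSV WITH HEADERS FROM 'file:///reactions.csv' AS row
MERGE (e:Enzyme {name: row.enzyme})
MERGE (s:Metabolite {name: row.substrate})
MERGE (p:Metabolite {name: row.product})
MERGE (s)-[:SUBSTRATE_OF {evidence: row.evidence}]->(e)
MERGE (e)-[:PRODUCES {evidence: row.evidence}]->(p)
"""
print(LOAD_CSV)
```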
Abstract: Modern precision medicine efforts rely heavily on up-to-date knowledge bases in order to clinically interpret genomic events. CIViC is a community-curated database of diagnostic, prognostic, predisposing and drug response markers in cancer. In order to guide curation, we present the CIViCmine system, which is an automated text-mining approach that identifies significant markers discussed in published literature. We initially manually annotate sentences from PubMed abstracts and PubMed Central full-text articles and then train the VERSE relation extraction tool. The resulting genomic events are then filtered to identify markers not yet in CIViC and prioritized by the number of unique papers in which they are discussed. Through these efforts, we have created a robust resource that directs CIViC curators towards important and relevant precision medicine publications.
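A minimal sketch of the prioritization step, assuming each extracted relation has already been reduced to a (gene, variant, cancer type, evidence type) tuple plus a PubMed identifier; the field names and example records are hypothetical.

```python
# Illustrative sketch: rank candidate markers by the number of unique papers
# mentioning them, after removing markers already present in CIViC.
from collections import defaultdict

extracted = [
    # (gene, variant, cancer_type, evidence_type, pmid) -- toy records
    ("EGFR", "L858R", "lung adenocarcinoma", "Predictive", "100001"),
    ("EGFR", "L858R", "lung adenocarcinoma", "Predictive", "100002"),
    ("BRAF", "V600E", "melanoma", "Predictive", "100003"),
]
already_in_civic = {("BRAF", "V600E", "melanoma", "Predictive")}

papers_per_marker = defaultdict(set)
for gene, variant, cancer, ev_type, pmid in extracted:
    marker = (gene, variant, cancer, ev_type)
    if marker not in already_in_civic:
        papers_per_marker[marker].add(pmid)

# Markers discussed in the most unique papers come first.
ranked = sorted(papers_per_marker.items(), key=lambda kv: len(kv[1]), reverse=True)
for marker, pmids in ranked:
    print(len(pmids), marker)
```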
Abstract: Understanding and assessing the associations of genomic variants with diseases or conditions, and their clinical significance, is a key step towards precision medicine. Despite recent efforts in expert curation, the function of most of the 154 million dbSNP reference variants (RS) remains “hidden” (only 200,000 SNVs for 5,300 genes can be found in curated databases such as ClinVar), but a wealth of information about the biological function and disease impact of variants exists in unstructured literature data. In the past, a few computational techniques have attempted to harvest such information, but their results are of limited use because the text-mined variant mentions are not standardized or integrated with existing curated data.
Despite the HGVS standards for variant nomenclature, a large number of genetic variants are mentioned under different names in the literature, making it impossible to assemble all known disease associations for a given variant. Hence, we developed an automatic method that extracts and normalizes variant mentions to standard dbSNP RS numbers; an RS number is a stable variant accession, unique across all organisms, that aggregates information such as the associated gene and clinical significance. Furthermore, using machine learning, we pair each genetic variant with disease phenotypes. Both computational steps were evaluated against a human gold standard with state-of-the-art performance: nearly 90% and 80% F-measure for mutation normalization and relation extraction, respectively.
Next, we applied our approach over the entire PubMed and validated our results against dbSNP and ClinVar for data consistency by 1) checking that text-mined SNV-gene pairs match dbSNP gene annotations based on genomic position, 2) analyzing variants curated in ClinVar, and 3) discovering novel connections between variants, genes, and diseases. Our analysis reveals 425,000 novel RS and 6,000 novel genes not found in ClinVar. Moreover, our results also include approximately 10,000 novel rare variants (MAF <= 0.001) in 3,000 genes which are presumed to be deleterious and are not frequently found in the general population. To the best of our knowledge, we are the first to develop such an automatic method for normalizing genomic variant names. Our genome-scale analysis shows that automatically computed information combined with existing database annotations can significantly aid human efforts to curate and prioritize variants for interpretation in personal genomes and precision medicine.
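The normalization step described above might be approximated, in a highly simplified form, by mapping heterogeneous mentions of the same protein change onto one key and looking that key up in a pre-built table of dbSNP RS numbers; the regular expressions, lookup table and three-letter amino-acid map below are illustrative only and are far simpler than the published method.

```python
# Toy sketch: collapse one-letter and three-letter protein-change mentions onto a
# single key and look it up in a hypothetical, pre-built (gene, change) -> RS table.
import re

RS_LOOKUP = {("PAH", "R408W"): "rs5030858"}            # illustrative entry
AA3_TO_1 = {"Arg": "R", "Trp": "W", "Gly": "G", "Ser": "S"}  # truncated toy table

def normalize(gene, mention):
    # Three-letter form, e.g. "p.Arg408Trp"
    m = re.search(r"(?:p\.)?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", mention)
    if m:
        ref, pos, alt = AA3_TO_1.get(m.group(1)), m.group(2), AA3_TO_1.get(m.group(3))
    else:
        # One-letter form, e.g. "R408W"
        m = re.search(r"([A-Z])(\d+)([A-Z])", mention)
        if not m:
            return None
        ref, pos, alt = m.group(1), m.group(2), m.group(3)
    return RS_LOOKUP.get((gene, f"{ref}{pos}{alt}"))

print(normalize("PAH", "p.Arg408Trp"))  # rs5030858
print(normalize("PAH", "R408W"))        # rs5030858 (same variant, different spelling)
```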
Session 6: Data Standards and Ontologies
Tuesday, March 28, 9:30AM-12 noon
Chair: Lynn Schriml
Abstract: Among other activities, the Global Alliance for Genomics and Health (GA4GH) develops data schemas to enable implementation of consistent data APIs for federated access to health-related genome data and associated metadata. In parallel to the schema development activities, demonstration projects explore strategies for, and limitations of, federated data access and aim to engage a growing number of active participants.
An important benchmark for a given data schema is the successful projection of reference datasets using implementations built around the schema definitions. However, the mapping of rich metadata attributes and structures into the GA4GH schema has not yet been explored for large, real-world datasets or reference repositories.
As part of two GA4GH implementation studies supported through ELIXIR, we are working on the implementation of GA4GH data schemas and reference projects, utilising cancer genome profiling data and associated metadata represented in our arrayMap resource (arraymap.org). The first project is aimed at testing the GA4GH object model, with a focus on metadata, and uses the experience to feed back into the schema development efforts of the GA4GH Metadata Task Team. In the second project, a GA4GH "transitional" schema-conformant representation of the arrayMap resource's data is used for the forward-looking development of a genomic "Beacon", with a focus on the representation of structural genome variants and the progressive inclusion of metadata parameters for enriched data queries.
This presentation will report the current status of these projects, as well as the challenges in designing and implementing data concepts centred on modern, object-based biodata standards with a heavy dependency on ontologies.
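For readers unfamiliar with Beacons, the following hedged sketch shows the kind of allele-presence query such a service answers. The base URL is a placeholder and the parameter and response field names follow our understanding of the GA4GH Beacon v1 convention; check them against the target Beacon's documentation.

```python
# Hypothetical sketch of a Beacon allele query; not a real service.
import requests

BEACON_URL = "https://beacon.example.org/query"   # placeholder URL

params = {
    "assemblyId": "GRCh37",
    "referenceName": "9",
    "start": 21974826,                 # illustrative position
    "referenceBases": "G",
    "alternateBases": "A",
    "includeDatasetResponses": "HIT",
}

resp = requests.get(BEACON_URL, params=params, timeout=30)
resp.raise_for_status()
body = resp.json()
print("Allele observed:", body.get("exists"))
for ds in body.get("datasetAlleleResponses", []) or []:
    print(ds.get("datasetId"), ds.get("exists"), ds.get("info"))
```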
Abstract: There is a need for standardized and industry-accepted metadata schemas for reporting the computational objects that record the software and parameters used for computations together with the results of those computations. The absence of such standards often makes it impossible to reproduce the results of a previously performed computation due to missing information on parameters, versions, arguments, conditions, and procedures of application launch. In this talk I will describe the concept of biocompute objects, developed specifically to satisfy regulatory research needs for evaluation, validation, and verification of bioinformatics pipelines. We envision generalized versions of biocompute objects, called biocompute templates, that support a single class of analyses but can be adapted to meet unique needs. To make these templates widely usable, I will discuss the reasoning and potential usability of such a concept within the larger scientific community through the creation of a biocompute object database. A biocompute object database record will be similar to a GenBank or UniProtKB record in form and will be machine and human readable; the difference being that, instead of describing a sequence, the biocompute record will include parameters, dependencies, usage, and other information related to a specific computation. This mechanism will extend similar efforts and also serve as a collaborative ground to ensure interoperability between different platforms, industries, scientists, regulators, and other stakeholders interested in biocomputing. Funded by the FDA, biocompute objects will initially target the development of objects relevant to FDA needs. Currently, we are requesting community input on the biocompute object specification document and the biocompute database that we plan to build. It is envisioned that not only regulatory data analysis workflows developed and used by the FDA, but also other HTS (NGS) workflows important in biomedical research, will be shared with the community of stakeholders through a biocompute portal.
URL: https://hive.biochemistry.gwu.edu/htscsrs
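Purely as an illustration of the idea, the sketch below assembles a minimal, hypothetical record of the kind a biocompute object might contain (software, versions, parameters, inputs/outputs and provenance) and attaches a checksum for verification; the field names are invented here and are not taken from the BioCompute specification.

```python
# Illustrative only: a made-up, minimal biocompute-object-like record.
import json, hashlib

record = {
    "object_id": "BCO_EXAMPLE_0001",
    "provenance": {
        "created_by": "curator@example.org",
        "created": "2017-03-26T00:00:00Z",
        "review_status": "draft",
    },
    "execution": {
        "pipeline": [
            {"tool": "bwa", "version": "0.7.15", "command": "bwa mem ref.fa reads.fq"},
            {"tool": "samtools", "version": "1.3.1", "command": "samtools sort -o out.bam -"},
        ],
        "environment": {"os": "Ubuntu 16.04", "cpu": 8, "memory_gb": 32},
    },
    "parameters": {"min_mapq": 30, "threads": 8},
    "io": {"inputs": ["reads.fq", "ref.fa"], "outputs": ["out.bam"]},
}

# A checksum over the canonical serialization gives a simple integrity handle
# for evaluation, validation, and verification of the recorded pipeline.
serialized = json.dumps(record, sort_keys=True).encode()
record["checksum"] = hashlib.sha256(serialized).hexdigest()
print(json.dumps(record, indent=2))
```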
Abstract: Standardizing data is the first step for the integration of multiple distributed and autonomous data sources. A key requirement for data sharing is standardization: agreement on the types and definitions of structures and processes, and the formats used to access and share data.
Mass spectrometry (MS)-based approaches have accelerated the development of proteomics and the discovery of new biological mechanisms. To maximize the utility of the accumulating large-scale proteomics data produced by different kinds of experiments, and to standardize the data at each stage of an experiment, it is essential that the metadata are collected and well organized. To fulfill this need, we developed a standardized ontology of metadata for MS-based proteomics. Our work includes mapping metadata fields from existing domain-specific ‘MIAPE-compliant’ reports to controlled vocabularies (CVs) that provide a standard representation of each concept, expanding the minimum metadata fields by using the hierarchical structure of the ontology, adding object properties from OBI, and modeling semantic relationships for all metadata fields. The standardized metadata contain 1,860 fields with 84 object properties. 128 top terms (terms mapped from MIAPE-like documents that have no parent node or have more than one child node) are treated as “individuals”. As CVs contain almost no semantic relationships, semantic relationships were added between these terms, and corresponding relationship diagrams are provided as visualizations.
The metadata fields in the standard are composed of 7 modules – general features, mass spectrometry, mass spectrometry informatics, mass spectrometry quantification, molecular interaction experiment, bioactive entity and protein affinity reagent – based on both content and the corresponding stage in the entire workflow of an MS-based proteomics project. All the information and the .owl document of our metadata standard are available at http://www.unimd.org/ms/.
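Once the .owl document is downloaded, its classes and object properties can be inspected with standard RDF tooling; the sketch below uses rdflib with a placeholder file name and assumes the usual RDF/XML serialization of OWL files.

```python
# A small sketch of inspecting the published .owl file with rdflib; the file name
# is a placeholder for whatever is downloaded from http://www.unimd.org/ms/.
from rdflib import Graph
from rdflib.namespace import OWL, RDF, RDFS

g = Graph()
g.parse("ms_metadata_standard.owl", format="xml")   # assuming RDF/XML serialization

classes = set(g.subjects(RDF.type, OWL.Class))
object_properties = set(g.subjects(RDF.type, OWL.ObjectProperty))
print(f"{len(classes)} classes, {len(object_properties)} object properties")

# List a few labelled terms to get a feel for the metadata fields.
for cls in list(classes)[:10]:
    for label in g.objects(cls, RDFS.label):
        print(cls, label)
```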
Abstract: Genetic interactions illuminate the functional organization and underlying mechanisms of biological systems, and often complement the biochemical analysis of signaling and metabolic networks. With the recent advent of efficient targeted gene disruption and control via CRISPR-Cas9 technologies, and high throughput genetic interaction screens in model systems and human cells, comes the need to precisely and consistently classify information about large numbers of genetic interactions in a variety of organisms. Due to historical reasons, existing nomenclature for genetic interactions is diverse, unsystematic, and at times ambiguous. An interoperable framework that can be implemented across biological databases is needed in order to enable reliable cross-species and cross-database comparisons of genetic interaction data. Here we propose the Genetic Interactions Structured Terminology (GIST), a systematic, flexible and modular method for naming and cataloging all forms of genetic interaction data. The GIST is designed for facile adoption by authors and biological curators alike to establish a common standard for annotation, interpretation and dissemination of genetic interaction data. Critically, the GIST separates structured descriptors of genetic interactions from phenotype, and when possible the GIST has been aligned with previous nomenclature conventions. The GIST system was developed by members of the WormBase and BioGRID databases and is supported by a number of biological databases including the Human Proteome Organization Proteomics Standards Initiative - Molecular Interactions working group (HUPO PSI-MI), Saccharomyces Genetics Database (SGD), Candida Genome Database (CGD), PomBase, FlyBase, the Zebrafish Model Organism Database (ZFIN), and The Arabidopsis Information Resource (TAIR). The widespread implementation of the GIST will enable researchers and biocurators to record and unambiguously describe genetic interactions from diverse sources, and thereby unify the investigations of the genetic basis for all phenotypes of living organisms.
Abstract: As databases geared toward organizing domain-specific biological knowledge have evolved, various ontologies have been developed and employed to standardize and provide structure to the representation of this knowledge. To cite a well-known example, attributes of gene products such as the processes they participate in, the molecular functions they carry out, or where they are found in a cell can be systematically described using the Gene Ontology. Representation of scientific concepts using ontologies allows us to connect seemingly disparate fields using a common descriptive framework, design quality control (QC) metrics for databases, and achieve other benefits.
As experienced biocurators know, documenting the evidence that supports a scientific assertion such as a protein annotation is essential. Indeed, capturing evidence is fundamental to the curatorial process: it allows us to say why we believe what we believe to be true. It also affords us a practical means by which to employ QC measures when importing data into databases. Furthermore, we can even draw inferences about our faith in an assertion/conclusion by looking at the associated types of evidence.
The Evidence & Conclusion Ontology (ECO) describes evidence arising from laboratory experiments, computational methods, curator inferences, and other means. ECO is currently used by dozens of databases, software tools, and other applications to provide structure and context for documenting evidence in scientific research. Using ECO allows users to query, manage, and interpret data in ways heretofore not possible.
This talk presents an overview of ECO, including its history, applications, and general structure, and describes more recent attempts to normalize the ontology. Current efforts to provide a framework for representing one’s “confidence” in an experimental technique or the “quality” of an annotation will be discussed. The possibility of expanding ECO to enable representing evidence in disciplines as diverse as anthropology, biodiversity, and psychology will also be addressed. The goal of ECO is to enable summary descriptions of relatively granular evidence types for a broad range of scientific disciplines, starting with biology. (ECO development is facilitated by National Science Foundation DBI award number 1458400).
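As an example of programmatic access to ECO, the sketch below pages through ECO terms via the EBI Ontology Lookup Service (OLS). The endpoint layout and response fields reflect our understanding of the OLS REST API and may differ in the current version of the service.

```python
# Hedged sketch: paging through ECO terms from the EBI Ontology Lookup Service.
import requests

OLS_TERMS = "https://www.ebi.ac.uk/ols/api/ontologies/eco/terms"

def iter_eco_terms(page_size=100, max_pages=2):
    url, params = OLS_TERMS, {"size": page_size}
    for _ in range(max_pages):
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        body = resp.json()
        for term in body.get("_embedded", {}).get("terms", []):
            yield term.get("obo_id"), term.get("label")
        next_link = body.get("_links", {}).get("next", {}).get("href")
        if not next_link:
            break
        url, params = next_link, None   # follow the server-provided next page link

for obo_id, label in iter_eco_terms():
    print(obo_id, label)
```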
Abstract: Comprehensive standardization of human disease data is essential for exploring the genetic variation represented in disease-associated animal models, clinically relevant genetic variants and immune diseases, and for the comparative assessment and integration of genes, drugs and pathways through the common lens of human disease. Classification of genetic variation in the context of disease necessitates a shift in the DO’s approach to studying the broader context of associated cells of origin, molecular variants, contributing environmental factors and tissue of origin (anatomical site). In 2016, the DO initiated a stepwise approach to augment our etiology-based disease classification by establishing standardized ontological relationships between DO diseases, cell of origin, molecular variants and anatomical location (e.g. the ‘contributes to condition’ relation, synonymous with OMIM’s ‘susceptibility_to’, for linking DO diseases to risk factors) and by establishing a standard operating procedure (SOP) and definition template for creating DO molecular-based subtypes. This work involves researching the support for each subtype, documenting provenance, defining the associated linkages to other clinical vocabularies, composing DO textual definitions and creating DO logical definitions. The DO project has moved to a more open and semi-automated approach to integrate collaborator-defined DO subtypes (from expert curators at MGI, RGD and IEDB) using the ROBOT tool (https://github.com/ontodev/robot), created by James Overton and previously implemented for the OBI project, to facilitate batch creation of terms in a specific disease grouping. This approach enables more efficient review of terms (~100 from IEDB and ~700 from MGI and RGD), has greatly reduced time-intensive term input activities and has expedited inclusion of disease-associated content to define DO logical axioms. This collaborative approach has the advantage of working with content experts, thus reducing the review time frame, and of identifying targeted, priority datasets associated with clinical annotations. This activity will expand the DO’s mechanistic classifications for genetic diseases defined within OMIM phenotypic series and associated with MGI and RGD animal models, and for immune diseases defined by the IEDB. These novel models will provide a robust backbone for complex disease queries, exposing alternative classification systems in an intuitive format with data managed within a rigorous semantic structure.
Abstract: PhenoMiner at the Rat Genome Database (RGD, http://rgd.mcw.edu) is a data warehouse and data mining tool for quantitative phenotypes in the rat. In order to allow comparisons of results across multiple studies and multiple rat strains, a suite of ontologies is being developed to standardize the specification of what was measured (Clinical Measurement Ontology, CMO), how it was measured (Measurement Method Ontology, MMO) and under what conditions it was measured (Experimental Condition Ontology, XCO). Development of these ontologies is carried out on an as-needed basis concurrent with the curation of the relevant literature and community submission of high-throughput phenotyping data. Given the complexity of the subject matter, it is not surprising that challenges have arisen in the process of developing these ontologies. Examples of these challenges include the natural tendency to use the name of an instrument as a method in the MMO. Since an instrument is not a method, these had to be converted to methodology terms. In the XCO, initial development separated chemicals and chemicals incorporated into the diet or drinking water in separate branches of the ontology. A new relationship "has_component" was added to connect the diet term to the term for the chemical itself. A recent case was discovered where the same measurement term was used with different meanings depending on the method used to make the measurement. The question then became how to distinguish the two sets of terms in the CMO without resorting to including the method in the term. This has also been a major problem when working with data for behavior and movement research. In these cases, the distinctions between the trait being assessed, the actual measurement being made, the specific method and the external condition(s) can be extremely difficult to pin down. We are currently working with researchers who do these experiments to improve these sections of the ontologies and to create curation standards for capturing this data.
Session 7: Curation Standards and Best Practice, Challenges in Biocuration, Biocuration Tutorial
Tuesday, March 28, 4:00-5:30PM
Chair: Stacia Engel
Abstract: Some have claimed that curation is too expensive, but how much does curation really cost? We show that the cost of curation for the EcoCyc database was $219 per article over a 5-year period, which is quite low when one considers that this figure is 6-15% of typical open-access publication fees for biomedical articles, and 0.088% of the average cost of the research projects that generated the experimental results that were curated. We also discuss the accuracy of biocuration in model organism databases, showing that the average error rate of curation in EcoCyc and in the Candida Genome Database is 1.4%. Some have suggested that we should replace curation with information extraction software. We show that the error rates of current information extraction programs are too high to replace professional curation. Furthermore, current information extraction programs extract single narrow slivers of information; they cannot extract the large breadth of information extracted by professional curators for databases such as EcoCyc. We also discuss the significant experience accumulated by the bioinformatics community in crowd-sourcing curation, and in offloading curation to the authors of publications. We show that attempts at crowd-curation have had very low participation rates, whereas the author-curation model shows more promise.
Abstract: Curation of metadata is occurring across a wide range of industries, with potentially significant differences in the reasons for curation and how it is accomplished. Even acknowledging this variation, there may be commonalities to be defined and addressed across all industries. With a focus on clinical research data from the pharmaceutical industry, this presentation will identify some shared advantages, contemplation points, and challenges of the metadata curation processes in use today.
Use cases and real world experience related to metadata and its benefits have been well documented and demonstrate a clear signal. The pharmaceutical industry, while deeply rooted in science, is not recognized for fast adoption of technology. This industry, however, can no longer ignore the significant assets potentially concealed in their metadata. Unlike many industries where the standardization of metadata and its curation was top-down, the pharmaceutical industry is embracing metadata as a grassroots movement. Biopharmaceutical curators have a unique opportunity to grow their skillset and develop creative solutions for clinical research metadata for better data quality and faster development of treatments for patients.
This grassroots effort is not without its challenges. Even with clear value propositions for metadata-related initiatives, innovations in this area have been slow in an industry where regulatory compliance is embedded in every decision made. This phenomenon substantially impedes adoption of novel technology such as metadata management.
Adoption of metadata-based approaches requires organizational commitment, process changes, and coaching. It adds new burdens to the already challenging job that Metadata Curators perform, from constantly championing metadata initiatives with stakeholders to recommending metadata principles. Metadata Curators need to act as agents of change, crusaders for standardization, and masters of negotiation. There are no scripted solutions to ease the adoption journey and no obvious path forward. The plethora of standards, such as the ‘number soup’ of ISO publications, adds extra stress to deciding on information architecture, system infrastructure, and other factors that have significant long-term impact.
Further intensifying the Metadata Curator's responsibility, the role often requires them to be constantly watchful for misrepresentation or misuse of organizational metadata. Although governance is a good discipline to implement, Metadata Curators may struggle to find the delicate balance between consensus and pragmatism without sacrificing the guiding principles.
To summarize, this presentation will demonstrate that implementing metadata management is not a trivial task. In fact, it is a journey, requiring commitment, resources, and likely trial and error. Nevertheless, companies that are either contemplating or even resisting the benefits that metadata management brings are losing a huge competitive edge to companies that have solutions in place. It is important to set clear objectives, realistic roadmaps, and most importantly, a step-wise maturity model to grow with the organization. Lastly, this is a transformative process that advances traditional clinical trial data management to encompass data science disciplines such as concept modeling, information architecture, and big data technologies.
Abstract: Mouse Genome Informatics (MGI, www.informatics.jax.org) has been using OMIM terms for annotations of mouse models of human disease for many years. While the OMIM disease records contain user-friendly text descriptions of diseases and the OMIM group curates relationships between diseases and human genes, the lack of disease groupings and relationships among diseases creates issues for the display, retrieval and computational analysis of disease data. MGI has been working with the Rat Genome Database (RGD) and the Disease Ontology (DO) groups to enhance the DO. This work has focused on increasing and refining the incorporation of OMIM terms in the DO and enriching the relationships between disease terms in the DO. As a result of this work, the DO has reached a level of development where MGI has begun to convert from OMIM to DO for displaying and searching disease-related data. The first phase of this conversion focused on the public web interface of MGI. Data annotations made to OMIM terms are translated into DO annotations using the OMIM xrefs in the DO file. The translated annotations are then used to populate disease-related pages and searches in MGI. For example, the disease tables shown on the marker detail pages now show DO terms and IDs. The disease browser and detail pages have been completely reworked to take advantage of the DAG structure of the DO. A user may now see all mouse models for Parkinson disease, including all of its subtypes, on a single page rather than having to visit each subtype page individually. The Human - Mouse: Disease Connection (HMDC) has also been updated to use the DO. Users may search by DO term names, synonyms, or IDs and may also search by many of the alternate IDs mapped to DO terms, including OMIM, MeSH, NCI, KEGG, ICD10, ICD9 and UMLS. In addition, the HMDC gene homologs x phenotype/disease page now groups diseases by a custom MGI DO slim set to improve visualization. By using the OMIM xrefs in the DO file to translate annotations, we are able to continue to work on the coverage and representation of OMIM terms in the DO without losing data due to incomplete coverage of OMIM or having to re-curate data following changes to OMIM representation in the DO. Future changes include annotating directly to DO terms and providing computational access to these data through APIs.
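A much-simplified sketch of the xref-based translation described above: parse OMIM cross-references out of the DO OBO file and use them to re-key OMIM-based annotations to DO terms. The OBO snippet, annotation record and helper function are illustrative, not MGI's production code.

```python
# Build an OMIM id -> DOID map from [Term] stanzas with OMIM xrefs, then re-key
# existing OMIM-based annotations. The snippet and annotation are toy examples.
from collections import defaultdict

def omim_to_doid(obo_text):
    mapping, current_doid = defaultdict(set), None
    for line in obo_text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current_doid = None
        elif line.startswith("id: DOID:"):
            current_doid = line[len("id: "):]
        elif line.startswith("xref: OMIM:") and current_doid:
            mapping[line[len("xref: "):]].add(current_doid)
    return mapping

OBO_SNIPPET = """
[Term]
id: DOID:14330
name: Parkinson's disease
xref: OMIM:168600
"""

annotations = [("MGI:1234567", "OMIM:168600")]   # hypothetical mouse-model annotation
xrefs = omim_to_doid(OBO_SNIPPET)
for marker, omim_id in annotations:
    for doid in sorted(xrefs.get(omim_id, [])):
        print(marker, "->", doid)
```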
Abstract: The paradigm of precision medicine envisions a future in which a cancer patient’s molecular information can be interpreted in the context of accumulated knowledge to inform diagnosis, prognosis and the treatment options most likely to benefit that individual patient. Accordingly, many groups have created knowledgebases to annotate cancer genomic mutations associated with evidence of pathogenicity or relevant treatment options. Integration of the available knowledge is currently infeasible because each group (often redundantly) curates its own knowledgebase without adherence to any interoperability standards. There is a clear need to standardize and coordinate clinical-genomics curation efforts, and to create a public community resource able to query the aggregated information. To this end we have formed the Variant Interpretation for Cancer Consortium (VICC) as part of the Global Alliance for Genomics and Health (GA4GH) to bring together the leading institutions that are independently developing comprehensive cancer variant interpretation databases.
VICC participants share a desire to coordinate efforts and thus enhance the value of each independent effort. Each participant has agreed to: (1) Sharing at least a minimal set of required data elements for cancer variant interpretations; (2) Protecting patient privacy by focusing on only clinical interpretations of variants derived from published findings, not variant-level observations from individual patients; (3) Sharing all or a significant proportion of interpretations accumulated by their ongoing curation efforts; (4) Releasing content under a permissive license (free and non-exclusive for research use); (5) Releasing software in public repositories with open source licenses; (6) Making data available through publicly accessible and documented APIs and cross-knowledgebase downloads; and (7) Using the existing schemas, APIs and demonstration implementations developed by GA4GH. Academic and commercial groups who have agreed to participate include those at Washington University (CIViC), MSKCC (OncoKB), Weill Cornell (PMKB), OHSU, DFCI, the Institute for Research in Biomedicine, Illumina Inc, and others. Additional participants are welcome (http://ga4gh.org/#/vicc). We will present progress by the VICC to create a federated query service able to interrogate associations between cancer variants and clinical actions, for each cancer subtype, based on evidence amassed from all participating knowledgebases worldwide.
Abstract: The ClinGen Inborn Errors in Metabolism Working Group was tasked with creating a comprehensive, standardized knowledge base of genes and variants for metabolic diseases. Phenylalanine hydroxylase (PAH) deficiency was chosen for development of the Working Group’s variant curation protocol. Before curation began, the work group members modified the ACMG guidelines for variant interpretation to be specific for PAH deficiency. Notable modifications to the ACMG criteria include: diagnostic laboratory assay of PAH activity as evidence for in vivo studies; addition of 3 criteria (PS5, PM7, PP6) that use abnormal analytes (blood PHE) and metabolic and/or molecular assessment of disorders with similar analytes. PAH variant curation began using 895 PAH variants listed in the professional version of Human Gene Mutation Database (HGMD). Initially, an Excel spreadsheet was used to track variants for curation and related information. Two of the Working Group Chairs and 3 Working Group Members manually curated the first 20 variants in teams of 2, and presented each variant to the Working Group for final approval. After this first round of curation, a team of 3 biocurators manually curated 25 PAH variants. To aid in this process, a password protected web database was created which allows for editing and links to pdf files of the references used for each variant. Embedded in the web site are links to web sites used for curation (e.g. population databases, computational tools, PubMed), our Modified ACMG criteria, and ACMG Classification Guidelines. As the biocuration workforce grew, a variant curation protocol and workflow was created for use in training and standardization. This protocol has 5 main steps: (1) transcribe gene specific databases (BioPKU and PAHdb); (2) determine frequency of variant in large population studies, parsed by race; (3) determine if computational tools and conservation predict an effect on protein structure or splicing; (4) check clinical significance and phenotype in variant databases; (5) search for primary literature. This protocol includes PAH-specific search terms and nomenclature to use for each web site/database. Five curators subsequently tested the variant curation protocol and workflow. Creation of the modified ACMG guidelines, protected PAH curation web database, and Metabolic Workgroup Variant Curation protocol enabled the standardization of PAH variant curation. This curation is in progress (~433/956 PAH variants).
Abstract: Data relevant to any given scientific investigation are highly decentralized across thousands of specialized databases. Within the Biocuration community, we recognize that the value of open scientific knowledge bases is that they make scientific knowledge easier to find and compute on, thereby maximizing impact and minimizing waste. The ever-increasing number of databases makes us necessarily question our priorities with respect to maintaining them, developing new ones, or senescing/subsuming ones that have completed their mission. Therefore, open biomedical data repositories should be carefully evaluated according to the quality, accessibility, and value of the database resources over time and across the translational divide.
Traditional citation counts and publication impact factors are known to be inadequate measures of the success or value of a resource. This is especially true for integrative resources. For example, almost everyone in biomedicine relies on PubMed, but almost no one ever cites or mentions it in their publications. While the Nucleic Acids Research database issues have increased citation of some databases, many still go unpublished or uncited; even novel derivations of methodology, applications, and workflows from biomedical knowledge bases are often “adapted” but never cited. There is a lack of citation best practices for widely used biomedical database resources (e.g. should a paper be cited? A URL? Is mention of the name and access date sufficient?).
We have developed a draft evaluation rubric for evaluating open science databases according to the commonly cited FAIR principles -- Findable, Accessible, Interoperable, and Reusable, but with three additional principles: Traceable, Licensed, and Connected. These additions are largely overlooked and underappreciated, yet are critical to reuse of the knowledge contained within any given database. It is worth noting that FAIR principles apply not only to the resource as a whole, but also to their key components; this “fractal FAIRness” means that even the license, identifiers, vocabularies, APIs themselves must be Findable, Accessible, Interoperable, Reusable, etc. Here we report on initial testing of our evaluation rubric on the recent NIH/Wellcome Trust Open Science projects and seek community input for how to further advance this rubric as a Biocuration community resource.
Session 8: Curation for Precision Medicine
Wednesday, March 29, 3:30-5:30PM
Chair: Jean Davidson
Abstract: Correlating phenotypes with genetic variation and environmental factors is a core pursuit in biology and biomedicine. However, numerous challenges impede our progress: patient phenotypes may not match known diseases, candidate variants may be in uncharacterized genes, model organisms may not recapitulate human or veterinary diseases, and filling evolutionary gaps is difficult. Also, many resources must be queried to find potentially significant genotype–phenotype associations.
Advanced informatics tools can identify phenotypically relevant disease models in research and diagnostic contexts. Large-scale integration of model organism and clinical research data together can provide a breadth of knowledge not available from individual sources.
The Monarch Initiative (monarchinitiative.org) is a collaborative, international open science effort that aims to semantically integrate and curate genotype–phenotype knowledge from many species and sources in order to support precision medicine, disease modeling, and mechanistic exploration. To this aim, Monarch has created an integrated knowledge graph, analytic tools, and web services that enable diverse users to explore relationships between phenotypes and genotypes across species. Recent developments include the release of a new BioLink API, and the creation of a suite of phenotype annotation tools (phenotyper.monarchinitiative.org). The BioLink API allows researchers and clinicians to search for relationships between various types of data from patients and animal models, including genes, variants, diseases, phenotypes, clinical notes, environmental conditions, and metadata. Researchers can also use the Phenotyper tool to obtain Human Phenotype Ontology (HPO) terms from research articles and export those terms as a PhenoPacket (phenopackets.org) for data analytics.
Here we will debut new analyses over our integrated data corpus, highlighting the breadth of knowledge that is garnered from numerous species and sources. We will also show the application of these tools to aid diagnosis of patients with very rare diseases. Finally, we will provide an overview of new visualization and biocuration tools that leverage the aforementioned resources.
In conclusion, Monarch strives to provide researchers and clinicians with ontologies, algorithms, and software tools to perform quantitative comparisons of multi-organism data to advance our understanding of how genes, phenotypes, and the environment can lead to disease.
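As an illustration of the kind of programmatic access the BioLink API offers, the sketch below requests phenotypes associated with a gene. The path and response fields reflect our reading of the API at api.monarchinitiative.org and should be verified against its documentation; the example gene is arbitrary.

```python
# Hedged example of querying the BioLink API for gene-phenotype associations.
import requests

def gene_phenotypes(gene_curie, rows=20):
    url = f"https://api.monarchinitiative.org/api/bioentity/gene/{gene_curie}/phenotypes"
    resp = requests.get(url, params={"rows": rows}, timeout=30)
    resp.raise_for_status()
    for assoc in resp.json().get("associations", []):
        obj = assoc.get("object", {})
        yield obj.get("id"), obj.get("label")

# Example: phenotypes linked to human SHH (NCBIGene:6469), chosen arbitrarily.
for hp_id, label in gene_phenotypes("NCBIGene:6469"):
    print(hp_id, label)
```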
Abstract: Precision medicine, in which disease prevention and treatment strategies take individual variability into account, is currently a major focus of clinical practice. However, major limitations to the development of precision medicine are the difficulty of designing personalized medicine trials and of precisely tracking the symptoms of each patient. It is well known that phenotype brings the practice of medicine close to being a science: clinicians make observations about the phenotype, derive a hypothesis (called a “diagnosis”), and test their hypothesis by prescribing a certain regimen of treatment, which may or may not be the optimal one for the patient. Among all diseases, the diagnosis of rare diseases is especially difficult. The Human Phenotype Ontology (HPO) provides rich disease-phenotype information; however, it contains only limited information for rare diseases.
To facilitate rare disease diagnosis and treatment, we built a standards-based database called the encyclopedia of Rare Disease Annotation for Precision Medicine (eRAM). eRAM integrates disease-specific concepts from ICD-10, SNOMED-CT and MeSH extracted from the Unified Medical Language System (UMLS), as well as other current open-source databases. To date, 8,488,796 abstracts from PubMed and 774,514 full-text articles from PubMed Central have been text-mined. The extracted sentences describing disease-manifestation (D-M) and disease-phenotype (D-P) relationships are all stored in our database and can be traced back to the original published papers through their PubMed IDs. In total, 9,258,371 D-M pairs and 6,024,671 D-P pairs have been identified. To ensure reliable results, a pattern-based approach containing 10 disease-manifestation relationship patterns, summarized in a previous paper, was applied during text mining. As a result, 549,619 D-M pairs and 321,792 D-P pairs are considered high-confidence. Currently, eRAM contains 16,922 diseases, with 7,151 unified human phenotype terms, 31,661 unified phenotypes matched from the corresponding mouse phenotype ontology, 14,335 unified manifestations, 5,554 genes and 47,366 genotypes. Each record-based disease term in eRAM now has 8 annotation fields containing definition, synonyms, manifestations, genes, genotypes, cross-linkage (Xref), human phenotypes and the corresponding phenotypes in mouse. All the information can be accessed at eRAM (http://www.unimd.org/eram/).
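To give a flavour of pattern-based extraction, the sketch below applies two invented lexical patterns to example sentences; the 10 patterns used for eRAM are more sophisticated and are described in the cited previous work.

```python
# Illustrative (hypothetical) lexical patterns for disease-manifestation sentences.
import re

PATTERNS = [
    re.compile(r"(?P<disease>[A-Z][\w\- ]+?) (?:is characterized by|presents with) (?P<manifestation>[\w\- ]+)"),
    re.compile(r"(?P<manifestation>[\w\- ]+) is a (?:common )?manifestation of (?P<disease>[A-Z][\w\- ]+)"),
]

def extract_dm_pairs(sentence):
    for pattern in PATTERNS:
        m = pattern.search(sentence)
        if m:
            yield m.group("disease").strip(), m.group("manifestation").strip()

sentences = [
    "Phenylketonuria is characterized by intellectual disability",
    "Chorea is a common manifestation of Huntington disease",
]
for s in sentences:
    for disease, manifestation in extract_dm_pairs(s):
        print(disease, "->", manifestation)
```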
Abstract: Genetic risk for developing diseases such as type 2 diabetes, cardiovascular disease, or schizophrenia may be determined by variation at more than 100 loci. While genetic association data can potentially provide valuable insights into mechanisms and therapeutic targets for these diseases, there are large challenges inherent in translating them into practical knowledge that can improve risk prediction, diagnosis, or treatment.
Raw data and results from genetic association studies tend to be warehoused in disparate silos under the purview of multiple consortia or institutions, making it difficult to draw conclusions across studies. The size and complexity of these data sets make it challenging to design interfaces and tools that present the results intuitively and allow users to interact with the data. An additional roadblock in providing access to the data is that individual-level genetic data are protected health information, with restrictions on their accessibility and use.
The Type 2 Diabetes Knowledge Portal (type2diabetesgenetics.org), an open-access resource, addresses all of these challenges. It aggregates genetic and phenotypic data from more than 260,000 individuals in a framework that facilitates dynamic analysis across disparate data sets while protecting patient privacy. The Portal is produced by the Accelerating Medicines Partnership in Type 2 Diabetes (AMP T2D), a large collaborative effort that includes the National Institutes of Health (NIH), the Foundation for the NIH (FNIH), and academic, industrial, and nonprofit stakeholders.
The Portal provides a user-friendly interface that enables scientists to identify variants associated with T2D or related traits, or to investigate the effects of perturbing a specific gene or sequence. Versatile tools allow sophisticated queries, such as custom association analysis on individuals within specific phenotypic ranges. We are continuously improving the T2D Knowledge Portal to maximize global utilization of the wealth of genetic information for T2D research.
The analysis methods, user interface, and tools developed for the T2D Knowledge Portal are publicly available and are generally applicable to variant associations with any disease or trait. Two new disease-focused Portals are currently being built on the T2D Portal framework: the Cerebrovascular Disease Knowledge Portal, and a new Portal for the genetics of cardiovascular disease.
Abstract: The genomic landscape of tumorigenesis has been systematically surveyed in recent years, identifying thousands of potential cancer-driving alterations. Interpreting the events identified in data from even a single sequenced sample requires both extensive bioinformatics expertise and an understanding of cancer biology and clinical paradigms. To best integrate genomics into the practice of medicine, results must be placed in the context of therapeutic response and diagnostic, predisposing or prognostic associations. The evidence for these associations must be curated so that we can achieve a principled consensus among genomic experts, pathologists, and oncologists on how best to interpret a genomic alteration in a clinical context. This curation and interpretation now represents a significant bottleneck, preventing the realization of personalized medicine. To this end, we present CIViC (www.civicdb.org) as a forum for the clinical interpretation of variants in cancer. We believe that to succeed, such a resource must be comprehensive, current, community-based, and above all, open-access. CIViC allows massively scalable curation of structured evidence coupled with free-form discussion for user-friendly interpretation of the clinical actionability of genomic alterations without any barriers to access. CIViC supports multiple lines of evidence, stratified based on the type of study, from in vitro studies to large clinical trials. Standardization of this crowd-sourced data requires clear and concise guidelines. To minimize ambiguity, the CIViC data model establishes a formal structure for categorizing the information in evidence items, while ontologies such as the Disease Ontology and Sequence Ontology allow for standardization of language. This structure will allow computational techniques to be applied to the data for novel hypothesis generation. CIViC currently contains clinical interpretations for over 733 genomic alterations in 286 genes spanning 205 cancer types, summarizing the evidence from over 1090 publications. The CIViC interface facilitates both discovery and collaboration by allowing a user not only to search and browse the current state-of-the-art interpretations but also to join the community discussion by adding, editing, or commenting on genomic events, the evidence for their clinical actionability and the resulting community consensus interpretation.
Abstract: Given higher throughput and lower costs, large-scale sequencing projects based on population genomics and precision medicine are ongoing or in the planning stages around the world. Currently, the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS) is leading the CAS Precision Medicine Cohort Research Program and participating in the National Key Research and Development Projects on Precision Medicine, which will undoubtedly result in massive whole-genome deep sequencing data. To meet this trend, database resources in the BIG Data Center are being developed for precision medicine, with the purpose of managing population sequencing data, performing genetic variation analysis on massive sequencing data, building a high-precision genetic variation map, and developing the Chinese Reference Genome. Specifically, the Genome Sequence Archive (GSA) accommodates raw sequence data for precision medicine projects, the Genome Warehouse (GWH) provides a Chinese Reference Genome as a critical basis for precision health research, the Genome Variation Map (GVM) contains a high-precision genetic variation map, and the Methylation Bank (MethBank) integrates a large number of methylation profiles from healthy people of different ages. Taken together, the BIG Data Center incorporates data from precision medicine projects and consequently provides a suite of featured database resources to support precision medicine research all over the world. All of these resources are publicly available and can be found at http://bigd.big.ac.cn.
Abstract: The Genomics England 100,000 Genomes Project aims to sequence, analyse and interpret the genomes of around 50,000 rare disease patients and their relatives. The hope is to provide a diagnosis for the underlying cause of the disorder and identify treatment options. Simultaneously, the infrastructure for integrating genomic medicine into routine clinical practice within the National Health Service (NHS) in England is being established. Curation is a fundamental element in the interpretation of genomes within the project, as is engaging members of the clinical, research and industry community.
Analysis of rare disease genomes within the 100,000 Genomes Project includes the use of virtual gene panels, composed of genes with a diagnostic-grade level of evidence for disease association. This helps filter the millions of variants in each genome to identify those that are potentially causative. It is well known that panel-based tests for the same disease can differ across diagnostic labs; the challenge for the Curation Team at Genomics England is to establish a consensus list of clinically relevant genes with a high confidence level in order to allow genome analysis and reporting. The Genomics England “PanelApp” database (https://panelapp.extge.co.uk/crowdsourcing/PanelApp) was established to allow the curation of initial virtual gene panels, crowdsourcing of reviews by experts from the clinical and scientific community, and revision to establish a diagnostic-grade virtual gene panel. PanelApp has more than 490 registered reviewers from over 20 countries worldwide, contains >3600 genes and has >120 revised gene panels, enabling genome analysis for more than 150 rare disease categories.
We present a preliminary analysis of diagnostic candidates using gene panels before and after review, clinical input and further curation, to show how a coordinated, combined curation effort can assist genome interpretation and help gain patient diagnoses. The existence of publicly available curated resources such as OMIM, Orphanet and Gene2Phenotype is vital in supporting this process. Rules and procedures to evaluate evidence for a gene-disease association were established to gain concordance across different reviewers and curators, and PanelApp tools were iteratively developed to improve usability and aid curation. The challenges of crowdsourcing knowledge and the key lessons learned from the process will be presented to the Biocuration community.
Abstract: My Cancer Genome is a knowledgebase, web application (www.mycancergenome.org), and set of tools that provides information about molecular alterations in cancer and allows physicians to match patient tumor genotypes to therapies and clinical trials.
Until now, My Cancer Genome content has been generated on an ad hoc basis by a network of expert contributors: 75 contributors from 29 institutions in 10 countries have provided information on 23 cancer types, 621 drugs, 825 genes, and 444 cancer type–alteration relationships. Currently underway is a major transition of the My Cancer Genome knowledgebase to an assertion-based content model that is composed of structured data elements, integrates publicly available data resources, and is augmented by targeted manual curation. This transition will allow for scalability of the volume of content housed within My Cancer Genome, more rapid content updates, improved maintenance of existing content, and better annotation and management of primary source references.
My Cancer Genome has also included a database of cancer clinical trials searchable by structured diagnosis and by gene keywords. As molecular biomarkers and targeted therapies have become more common in cancer clinical trials, use of gene keywords has increasingly led to false positive search results. As a result, we embarked upon and are nearing completion of the first phase of re-creating the trials database. Using manual curation, we are annotating trials by diagnosis and detailed molecular inclusion and exclusion criteria. Currently we have reviewed 5,201 active cancer clinical trials from ClinicalTrials.gov, and, of those, curated biomarker eligibility criteria for 1,884. The next phase of clinical trial curation will result in curation of clinical trial outcomes as published in abstracts, journal articles, and submissions to ClinicalTrials.gov as well as curation of prior therapy eligibility criteria.
My Cancer Genome content is housed in a custom-built content management system. This system makes use of publicly available disease and drug ontologies as well as data sources to validate genes and alterations entered. Clinical trial documents are downloaded nightly and curators are alerted when new or updated trial documents are available for curation. Also underway is development of an integrated reference manager with automated levels-of-evidence analysis to enhance annotation and management of source references supporting assertion-based content.
Note: Poster boards measuring 30 inches x 40 inches (76 cm x 101 cm) and pins will be provided. In order to fit, posters should be printed in portrait (not landscape). Posters should be put up the morning of your session and taken down in the evening. All posters must be removed by 3pm Monday, March 28.
Session I: Data Integration, Data Visualization, and Community-based Biocuration
Sunday, March 26, 2017, 12:00-1:30 PM
Berg Hall, Room A
Abstract: One of the main purposes of biocuration is to bring carefully assembled collections of biological information to biologists. But how can we best ensure that biologists receive the information of interest to them?
The new BioCyc Update Notification system enables users to designate pathways, genes, or GO terms of interest to them; subsequently, BioCyc will send a personalized notification email to a user when a new BioCyc release includes updated curation for any of the designated objects. The notification email includes details about which attributes of which objects have been updated, with links to the corresponding BioCyc pages. Notification emails are sent a maximum of four times per year; users may unsubscribe from these emails at any time. Update Notifications are database-specific, e.g. if a user subscribes to a pathway in EcoCyc, they will not be notified if that pathway receives additional curation in BsubCyc.
Two of the available types of update notification are as follows.
- Individual Pathways: Users are notified when there is new curation to the pathway itself, or to any of its reactions, enzymes or genes.
- GO Terms: Two options are available:
-- Receive notification when new genes are annotated to the specified GO term or any of its child terms.
-- Receive notification when either new genes are annotated to the GO term or any of its child terms, or any existing genes (or associated objects) annotated to the GO term or any of its child terms have new curation.
Update notifications are computed by comparing two consecutive versions of a BioCyc database. The addition of a new publication to a metabolic pathway, or to any of its enzymes, is one of the signals that the software uses to detect that new curation has been added to that pathway.
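A minimal sketch of the release-comparison idea behind Update Notifications: given per-object citation sets from two consecutive database versions, report the objects whose citations have grown. The identifiers and data structures are invented for illustration.

```python
# Toy release comparison: find objects with newly added citations between versions.
old_release = {
    "PWY-TOY-1 (example pathway id)": {"PMID:111", "PMID:222"},
    "G-TOY-1 (example gene id)": {"PMID:333"},
}
new_release = {
    "PWY-TOY-1 (example pathway id)": {"PMID:111", "PMID:222", "PMID:444"},
    "G-TOY-1 (example gene id)": {"PMID:333"},
}

def updated_objects(old, new):
    """Yield (object_id, newly_added_citations) as a signal of new curation."""
    for obj_id, new_cites in new.items():
        added = new_cites - old.get(obj_id, set())
        if added:
            yield obj_id, sorted(added)

for obj_id, added in updated_objects(old_release, new_release):
    print(f"Notify subscribers of {obj_id}: new citation(s) {', '.join(added)}")
```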
Abstract: Semantic Web technologies allow life science databases to be stored and queried directly over the web, thus allowing such data to be traversed computationally. Since semantics are defined by ontologies, data on the Semantic Web can be used for inference, such that unexpected relationships between data can be found by using inference algorithms. In terms of glycoscience data, related information, such as protein and lipid information that pertains to a particular glycan, can potentially be retrieved online semi-automatically.
We have been developing an international glycan structure repository called GlyTouCan (http://glytoucan.org) using Semantic Web technologies which would enable the integration of various informatics resources. GlyTouCan is a freely available, uncurated registry for glycan structures that assigns globally unique accession numbers to any glycan independent of the level of information provided by the experimental method used to identify the structure(s). That is, any glycan structure, ranging in resolution from monosaccharide composition to fully defined structures, including glycosidic linkage configuration, can be registered as long as there are no inconsistencies in the structure.
GlyTouCan is fully based on Semantic Web technologies and currently provides links to other major glycan databases such as GlycoEpitope, the bacterial, plant and fungal carbohydrate structure database CSDB, GlycomeDB, UniCarbDB, UniCarb-KB and JCGGDB. Links are now also available to other databases such as PubChem and PDB. Users can register their own glycan structures in GlyTouCan and thus retrieve links to other databases containing the given structure. PubMed IDs can now also be attached to glycans in GlyTouCan, and the publication information is automatically retrieved from NCBI. Thus, although GlyTouCan is fundamentally a glycan structure repository, increasing functionality is making it possible for it to be used as a portal for glycan information.
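Because GlyTouCan is built on Semantic Web technologies, its content can in principle be queried over SPARQL; the sketch below uses SPARQLWrapper, but the endpoint URL, prefixes and predicate names are assumptions made for illustration and should be checked against the GlyTouCan RDF documentation.

```python
# Hedged sketch: listing glycan accessions over an assumed GlyTouCan SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://ts.glytoucan.org/sparql")  # assumed endpoint URL
endpoint.setQuery("""
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT ?saccharide ?accession
WHERE {
  ?saccharide a glycan:saccharide ;
              glytoucan:has_primary_id ?accession .
}
LIMIT 10
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["accession"]["value"], row["saccharide"]["value"])
```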
Abstract: The pathogen-host interactions database PHI-base (www.phi-base.org) is a knowledge database. It contains expertly curated molecular and biological information on genes proven to affect the outcome of pathogen-host interactions, as reported in peer-reviewed research articles. Genes not affecting the disease interaction phenotype are also curated. Viruses are not included. Here we describe a revised PHI-base version 4 data platform with improved search, filtering and extended data display functions; a BLAST search function is also now provided. The database links to PHI-Canto, a new multi-species author self-curation tool adapted from PomBase-Canto. The recent release of PHI-base version 4 has increased data content, with information from >2000 manually curated references. The data provide information on 4460 genes from 264 pathogens tested on 176 hosts in 8046 interactions. Pro- and eukaryotic pathogens are represented in almost equal numbers. Approximately 70% of the host species are plants, and 30% are other species of medical and/or environmental importance. Additional data types included in PHI-base 4 are the direct targets of pathogen effector proteins in experimental and natural host organisms. The different types of use and the future directions of PHI-base as a community database are discussed.
Urban et al. (2017) PHI-base: a new interface and further additions for the multi-species pathogen-host interactions database. Nucleic Acids Research (04 January 2017) 45(D1): D604-D610. doi: 10.1093/nar/gkw1089
This work is supported by the UK Biotechnology and Biological Sciences Research Council (BBSRC) (BB/I/001077/1, BB/K020056/1). PHI-base receives additional support from the BBSRC as a National Capability (BB/J/004383/1).
Abstract: The Arabidopsis Information Resource (TAIR, www.arabidopsis.org) is a highly used global community resource for plant scientists that strives for maximum usability and accessibility by taking a user-driven approach to interface development and design. Our goal is to present the high-quality, relevant data that are collected and curated at TAIR in an intuitive fashion. To achieve this goal we have adopted User Driven Development (UDD) and lean principles of software design. We engage our user community in the development cycle from the outset and iterate between design and feedback cycles to inform both content and design. Here, we describe our process for user-driven data integration and visualization with a specific example of gene family data. We show how a combination of user surveys, simple prototyping, design ‘sprints’, and user interviews was used to guide the selection, organization and display of gene family data in TAIR. We describe the effective use of readily available, low-cost software (i.e. Microsoft Powerpoint, Survey Monkey, Constant Contact and Apple Quicktime) to gather and analyze user feedback. TAIR is one of several genome databases that have implemented user-driven design methodologies; our goal in presenting this work is to engender discussion and sharing of best practices for user-driven biological database and software development that may be of particular benefit to newer databases.
Abstract: Many data/findings produced during research are never deposited into authoritative databases and are therefore excluded from downstream data parsing. The costs of curation and the large volumes of digital data generated by research laboratories present a significant hurdle for databases trying to keep up with published research. Further, many data remain invisible to curation because they are never published: results often do not fit within the scope of a standard publication; trainee-generated data are often abandoned when the student or postdoc leaves the lab; results are left out of science narratives due to publication bias, e.g., negative results or results that are not groundbreaking are omitted from the reported research. We propose a means of reclaiming the ‘lost’ data by capturing them into archival repositories such as the Model Organism Databases and publishing them as Micropublications, according to the findable, accessible, interoperable and reusable (FAIR) data principles. Our Micropublication platform will allow researchers to directly submit data from single experiments through carefully designed forms that enable the use of standard vocabularies via autocomplete fields. These forms will ensure that the data and metadata captured meet all MIBBI/Biosharing standards as well as adhere to community-established nomenclature. Each Micropublication, and thus the data within, will be peer reviewed, assigned a unique digital object identifier (DOI), integrated into established data repositories, and published in our journal, Micropublication: biology, ultimately providing a legitimate citable reference. The micropublication pipeline will (1) empower authors to curate their own data; (2) capture and publish single experiment results, including negative results; (3) deposit data into the databases through automated pipelines or data alerts to curators that are triggered by the submission forms; and (4) give credit to researchers for their contributions as an incentive. This project will have long-reaching benefits for the science community: authors will learn the value of curation, data will no longer be lost to the public, and curation efforts at the databases will shift from data mining and extraction to data management. We will share our initial experience with gene expression and mutant drug-influenced phenotype submissions.
Abstract: In common with other model organism databases (MODs), WormBase has been trying to address the challenge of managing an increasing number of curatable papers, which require manual curation, while also working through existing backlogs.
While it is desirable that authors and the wider community submit data directly to us, we have lacked the infrastructure to accommodate bulk data submissions of many data types in an efficient and user-friendly manner.
One such data type is phenotype data, which has a 6000-paper backlog and thus was chosen as the focus of our community curation efforts. We have explored a number of different ways to encourage community participation and from these efforts we have created a pipeline for community curation of phenotype data. We have updated four of our many online submission forms to make them more user-friendly.
These forms have online help and autocomplete functions and are linked to existing WormBase data to assist the submitter in making a quick and accurate submission. For example, author names autocomplete and automatically populate an email field, and submitters are told right away whether a gene name already exists; if not, they are directed to the appropriate submission form. Pull-down menus make the submission process easier, and mandatory fields are clearly marked. Other optional fields allow the submission of richer annotations.
Our automated data-type flagging pipelines identify papers that have phenotype data, allowing curators to then send a pre-composed email to authors of these papers, requesting phenotype data submissions. We have found this more personal approach results in higher participation than sending mass automated emails. We list the top 20 community curators on our website as a way to acknowledge and encourage participation.
Overall we have seen significant community participation: since October 2015 (over the last 15 months), we have received 3,324 phenotype annotations for 437 papers from 322 unique community curators. We have sent 2,111 email requests for data and received community curation for 339 of them (16% response rate), with an additional ~100 papers for which authors took their own initiative to send data. This phenotype data request and submission pipeline can act as a model that may be applied to other data types.
Abstract: Gramene’s pathway portal Plant Reactome (http://plantreactome.gramene.org/) features Oryza sativa (rice) as a reference species for the curation of cellular-level pathway networks by employing the Reactome data model that organizes metabolites, proteins, enzymes, small molecules, and various macromolecular interactions into reactions and pathways in the context of their subcellular location within a plant cell. This reference set of curated rice pathways is being used to generate gene-orthology based pathway projections for 66 additional species, including plants and photosynthetic microbes, with sequenced genomes and/or transcriptomes. Currently, the Plant Reactome hosts various types of pathways, including metabolic, transport, genetic/regulatory, developmental and signaling pathways. Users can compare reference rice pathways with the projected pathways from any plant species from the available list. The pathway clustering across the broad phylogenetic spectrum of photosynthetic organisms shows distinct gene-pathway association patterns reflecting evolutionary history and ploidy levels. Plant Reactome also provides analysis tools for the visualization and analysis of omics data, links to external sources that provide detailed information on various pathway entities (e.g., Gramene’s species-specific Ensembl Genome browser and gene pages, UniProt, ChEBI, PubChem, PubMed, GO etc.). Users can access data programmatically via APIs or download it in various standard formats. The Gramene project is supported by an NSF award (IOS-1127112). The Plant Reactome database is produced with intellectual and infrastructure support provided by the Human Reactome award (NIH: P41 HG003751, ENFIN LSHG-CT-2005-518254, Ontario Research Fund, and European Bioinformatics Institute (EBI) Industry Programme).
Abstract: Precision medicine has shifted our understanding of the etiology and treatment of cancer from a focus on anatomical to molecular features. The genetic fingerprint of a patient can be deterministic in both the onset and treatment of the disease. However, the etiological network of a specific disease consists of very diverse factors, from genetic to environmental. With such diverse knowledge comes a diverse data infrastructure: data are scattered across silos and stored in different formats and structures. This poses a serious bottleneck when interpreting data in a clinical and/or research setting.
CIViC (http://www.civicdb.org) is an open-access, community-based, highly-curated cancer variant database. It is a platform where data on cancer genomic alterations from different data sources are curated and interpreted for clinical application. These interpretations, with their evidence, are captured and stored as structured data in the public domain. In order to reach an even broader audience, an effort was made to include CIViC's data in Wikidata. Wikidata contains and feeds structured data into Wikipedia and other Wikimedia projects. It has all the traits of Wikipedia (open-access, editable, community-driven) and is accessible to both humans and machines. Although Wikidata originated in the Wikipedia context, its application is not limited to it; the open APIs allow broader use.
Adding public domain datasets to Wikidata benefits both parties. In this case, Wikidata gains additional content from a highly-curated resource, while CIViC gains exposure to a wider audience, the ability to link to other data types and domains (e.g. drugs) and the benefits of Wikidata's role as a hub on the Semantic Web, allowing complex queries to be performed. We report the process involved in linking CIViC to Wikidata. This led to eight new Wikidata relations and a model to capture provenance. The resulting statements are built upon common standards coming from ontologies and other resources. The success of the data integration is proof that different data models can work together without any loss of information, and we invite other resources to follow (see the following for an example CIViC record in Wikidata: http://tools.wmflabs.org/sqid/#/view?id=Q21851559). The Python-based platform used (WDIntegrator: https://github.com/sulab/wikidataintegrator) is available as open source.
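One way to inspect such a record programmatically is through Wikidata's public entity-data endpoint. The sketch below, which uses only the standard requests library and the well-known Wikidata URL pattern, fetches the JSON for the example item mentioned above and lists the properties it carries; it is an illustration of machine access to the integrated data, not part of the CIViC-Wikidata pipeline itself.

    import requests

    # Fetch the public JSON representation of a Wikidata item (here the example
    # CIViC-linked record mentioned above) and list the property IDs it uses.
    ITEM = "Q21851559"
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{ITEM}.json"

    entity = requests.get(url, timeout=30).json()["entities"][ITEM]
    print(entity["labels"].get("en", {}).get("value", "(no English label)"))
    for prop, claims in entity["claims"].items():
        print(prop, len(claims), "statement(s)")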
Abstract: With the rapid development of sequencing technologies towards higher throughput and lower cost, sequence data are generated at an unprecedentedly explosive rate. To provide an efficient and easy-to-use platform for managing huge sequence data, here we present Genome Sequence Archive (GSA; http://bigd.big.ac.cn/gsa or http://gsa.big.ac.cn), a data repository specialized for archiving raw sequence data. In compliance with data standards and structures of the International Nucleotide Sequence Database Collaboration (INSDC), GSA adopts four data objects (BioProject, BioSample, Experiment, and Run) for data organization, accepts raw sequence reads produced by a variety of sequencing platforms, stores both sequence reads and metadata submitted from all over the world, and makes all these data publicly available to worldwide scientific communities. In the era of big data, GSA not only is an important complement to existing INSDC members by alleviating the increasing burdens of handling sequence data deluge, but also takes the significant responsibility for global big data archive and provides free unrestricted access to all publicly available data in support of research activities throughout the world.
Abstract: The Genome Warehouse (GWH; http://bigd.big.ac.cn/gwh) is a centralized resource for housing genome-scale data produced by different genome sequencing initiatives. It integrates high-quality genome sequences and related data and provides users with open access to a collection of genomes from featured important species. For each species, GWH contains detailed genome-related information including species metadata, genome assembly, sequence data and the corresponding annotations. GWH provides free data archival services, accepts data submissions from all over the world and offers unrestricted open access to all available data. To archive high-quality genome sequences and genome annotation information, GWH adopts a uniform standardized procedure for quality control. In addition, GWH integrates popular online analysis tools to make genome analysis simpler and more convenient.
Abstract: Recent discoveries in the area of genome engineering have revolutionized biomedical research. The available technologies have significantly changed our approach to studying cell function, with an impact on biomedical sciences similar to that of PCR and high-throughput sequencing. Specifically, the techniques derived from the discovery of the CRISPR/Cas9 system are ushering in a new wave of laboratory protocols that are undergoing rapid experimentation and modification. These include a broad range of techniques from gene editing through non-homologous end joining and homology directed repair, to transcriptional activation and repression, as well as imaging experiments. Protein activity can be controlled in both time and space, further expanding the potential uses of this approach.
A rapid increase in published experiments using CRISPR-related proteins has been driven by past successes and the increasing availability of commercial reagents. Combined with the ability to carry out genome-wide screens, this has resulted in a high volume of generated data. Curation is required to capture this wealth of information in a systematic fashion, enabling its visibility, discoverability and re-usability. To this end we are developing an EMBL-EBI CRISPR archive to collect, curate and convey the large amounts of data arising from these techniques.
We are currently assessing the depth of experimental data to capture, the set of suitable visualisations for experimental output and the systems that would facilitate submissions to the archive from the authors themselves. The standard operating procedures and best practices are being developed in line with other databases at the EMBL-EBI, such as the GWAS catalog, and through discussions with different groups of CRISPR users. Programmatic accession and manual curation will be combined to reduce man-hours while maintaining a high level of data quality. All submitted datasets will be required to provide a common set of parameters, allowing for cross-comparison irrespective of their origin. By including data from genome-wide pooled screens and single gene experiments we hope to provide an integrated resource that will aid research and facilitate new discoveries.
Abstract: Model organisms are widely used to augment investigation into the etiopathogenesis of human disease. Coupling human sequence, phenotypes, pathogenicity variant calls and other biological evidence with experimental data from tractable model organisms can accelerate the identification of causative disease variants and the testing of therapeutic candidates. At present, disease relevant data reside in different formats within multiple independent databases, making data extraction and synthesis difficult and time consuming. The Alliance of Genome Resources (AGR) seeks to standardize and integrate these data in order to facilitate the search and retrieval of disease information from a single interface.
Six model organism databases (MODs), including the Saccharomyces Genome Database (SGD), WormBase (WB), FlyBase (FB), the Zebrafish Information Network (ZFIN), the Rat Genome Database (RGD) and Mouse Genome Informatics (MGI), have collaborated to create annotation standards for model organism disease model data. We have converged upon using the Disease Ontology (DO) as a common source of disease terms and definitions, and we are actively working with the DO group to enhance the content and structure of the ontology so that it may be fully adopted by all MODs. We have created specifications for a common data exchange file, the Disease Association File (DAF), to standardize disease model annotation data for integration into the AGR unified resource. This specification includes: the ID of the object being annotated, such as a genotype, strain, allele or gene; experimental conditions required for the model, such as chemical exposure or dietary modification; the associated disease identifier (DO ID); the identity and effects of modifiers of the disease model, such as treatments; evidence codes for these assertions; and an attributed reference.
DAF data sets from the MODs have been created and are currently being pulled into a common resource. We will show examples of the first beta version of gene pages in the unified AGR interface showing model organism disease data, together with our plans for disease reports that will integrate MOD data with human gene-to-disease associations.
Abstract: RegulonDB contains manually curated information on the different components of transcriptional regulation in the E. coli K-12 genome. Currently, with the development of high-throughput (HT) technologies, huge amounts of data related to the regulation of genetic information are being generated. The management and integration of this kind of information are difficult due to the lack of well-defined data management processes. This has motivated us to implement tools that allow the extraction, handling, storing, and curation of these data to improve the overall understanding of transcriptional gene regulation.
We have defined three principal steps to integrate the different types of information available via HT technologies. First, we will perform a search and classification of articles for studies in which HT technologies were used; publicly available datasets will be obtained as well as relevant information for integrating these data. Second, we will format the information so that it can be used in an integrative bioinformatic pipeline. Third, we will work on the testing and standardization of the integration process.
For now, we are working on defining the requirements for the curation and integration of this kind of data. To this end, we propose the development of a bioinformatic pipeline that will allow us to process and integrate HT data. The pipeline will include raw data preprocessing, data processing, data analysis, estimation of evidence consensus, and determination of evidence quality. This general workflow will enable us to select the evidence to be reported and to summarize the evidence with an integrative reliability score. Comparisons of bona fide regulons (i.e., those in RegulonDB) versus regulons extracted from HT data should allow us to assess the quality of the information gained from HT technologies and to optimize the choice of the tools to be utilized in the workflows, as well as their parameterization.
These results will be organized in a resource which will allow users to compare curated experimental data with data obtained via HT technologies.
Abstract: Urine is an important source of biomarkers. Urinary proteins are mainly composed of plasma proteins passing through the glomerular barrier as well as secreted or intracellular proteins from cells of renal and urogenital organs. Research has shown that changes in urinary proteins can not only indicate diseases of the urological system but also reflect diseases occurring in remote organs, such as brain and lung cancers. Current proteomic technologies can identify hundreds of differentially expressed proteins between disease and control samples; however, selection of promising biomarker candidates for further validation studies remains difficult. A database that allows accurate and convenient comparison of all of the published related studies can markedly aid the assessment of biomarker candidates.
Our UPBD (Urinary Protein Biomarker Database) provides information about urinary biomarkers or biomarker candidates compiled from the published literature. Proteomics and non-proteomics studies on urine specimens from patients or experimental animals were collected in the database. Once a protein or a peptide was reported to be differentially expressed between disease and control groups or between different disease stages, it was regarded as a biomarker candidate and included in the database. To ensure the quality of the database, all research articles were manually reviewed. To serve the community, we offer free browse, search and download services to non-profit users. This database can be visited at either http://upbd.bmicc.cn/ or http://122.70.220.102/biomarker/.
The UPBD database was recently updated to version 2.0 in January 2017. A new field describing the potential usage of biomarkers (e.g. diagnosis, prognosis) was added. Standardization of database content was conducted by using terms from several commonly used ontologies and controlled vocabularies, including DO (Disease Ontology), NCIT (National Cancer Institute Thesaurus) and ICD10CM (International Classification of Diseases, Version 10 - Clinical Modification). Finally, improved website browse and search functions were developed based on the standardized terms, offering a more user-friendly experience.
Abstract: The number of papers where authors use small molecules or nutritional challenges, rather than purely genetic approaches, to study biological functions and responses of model systems is significant and growing fast. However, unlike other model organism databases (MODs), FlyBase currently does not have a strategy in place to incorporate this data class into our portfolio. We are therefore launching a project designed to make this important data resource available to FlyBase users - in a way that is most useful and comprehensible to our primary audience but that also allows the data to be easily integrated with similar resources from other MODs.
Our current strategy is to focus on phenotypic data in the first instance and to use the ChEBI ontology as a reference vocabulary of biologically relevant chemical compounds. ChEBI is employed by most of the major MODs that already curate chemical data from primary papers and is therefore the most suitable resource with regard to the planned integration of data from across MODs under the Alliance of Genome Resources (AGR) initiative. We plan to mirror ChEBI entries for those compounds used by the community in FlyBase and create report pages for each of them. Besides basic information about the chemical compound itself (including structure, synonyms and biological roles – all available on the ChEBI website), these would also hold information on the phenotypic output of the compound in a wild-type Drosophila background as well as phenotypes observed in mutant animals. The latter would also be added to our existing gene/allele report pages.
Once a pipeline for the incorporation of phenotypic data has been established we intend to extend the project scope to include other data types such as changes in gene expression induced by chemical treatments or physical interaction data.
We hope and plan to coordinate our efforts with the other MODs to ensure our data can be incorporated into and easily searched from a common portal under the AGR website.
Abstract: The i5k Workspace@NAL provides tools and resources for data producers and consumers in the arthropod genomics community. Any arthropod genome project without a parent database is welcome to submit their data to us for hosting and to make use of our curation, visualization, and dissemination resources. In addition to a central organism page for each project, the i5k Workspace provides web applications for BLAST, HMMER, ClustalW and Clustal Omega. We also provide the JBrowse genome browser and the Apollo manual curation tool. Here, we outline the resources that the i5k Workspace provides, and describe the community curation process. Finally, we list developments that are underway to improve our services for the arthropod genomics community.
Abstract: Getting data out of papers is one of the major tasks of a model organism database. To do this in a systematic manner requires an infrastructure that enables tracking of publications from their addition to the database through the data extraction process. Without a robust publication tracking system, papers can be overlooked or lost, or redundant work can be done because a paper is entered into the system multiple times. ZFIN has just completed a project that enhanced our infrastructure for publication tracking by converting all of our tracking of physical copies of publications to electronic tracking of all publications. Here we describe our paper triage and acquisition process, as well as provide views of the new publication tracking interfaces, literature indexing interfaces, and author communication processes used at ZFIN.
Abstract: Diagnoses of cancer are increasing in frequency in exotic animal medicine and zoological institutions due to improvements in husbandry and reductions in infectious diseases. There are increasingly well-documented parallels between human and animal cancers and their therapeutics, and human cancer drug development suffers from a lack of appropriate pre-clinical models. Improved preventive medicine and earlier diagnoses of cancer in zoological species have expanded the therapeutic options in zoo animals.
Cancers have been diagnosed and treated in fish, reptiles, birds, and all forms of mammals [1-4, 7]. Reported cancers include lymphoma, papillomas, carcinomas, sarcomas, and melanomas. There are likely to be many more cases that have gone unreported, whether as case reports or case series. Treatments in zoo animals are increasingly possible due to technological advances, such as vascular access ports for delivery of chemotherapy, and improved anesthetic techniques for radiation treatments and advanced imaging studies. Additionally, through operant conditioning and the development of newer, safer medications, treatment of zoological species is increasingly feasible.
The Exotic Species Cancer Research Alliance, in collaboration with Stanford University, North Carolina State University, and U.C. Davis, is developing a database that allows clinicians in private practice, academia and zoological institutions to contribute cases and to query for parameters related to tumor type, treatment types, adverse effects of treatment and outcomes. Institutions are encouraged to participate through case contributions to this database, which will include diagnoses, plus annotation of any therapeutics employed and outcomes observed. Through these collaborative efforts, we hope to discover trends in cancer incidence, better understand treatment and therapeutic options for exotic and zoological species, and provide a long-term archive of study material for veterinarians as well as for researchers studying human cancers who are looking for biological insight in other species or improved spontaneous cancer models for testing therapeutics.
Abstract: Phylogenetic trees are increasingly used to visualize comparative information in an evolutionary context, and the development of large-scale phylogenies is allowing biologists to address questions at the macroevolutionary scale. However, the current tools and methods available to address questions pertaining to trait evolution at larger-scales are limited. Our goal was to develop a bioinformatics pipeline to automate the integration of large-scale synthetic trait data with large phylogenies, using classic questions regarding the frequency of paired fin loss in fishes as a case study: how often are the pectoral and pelvic fins lost, and are they ever regained? Here we demonstrate the use of new knowledge resources to address basic questions involving the pattern of evolution of phenotypic features. We developed the computational infrastructure necessary to integrate large trait data sets from the Phenoscape Knowledgebase (KB; kb.phenoscape.org) with a large-scale phylogeny for 38,419 teleost species from the Open Tree of Life (http://opentreeoflife.org). We curated all necessary publications pertaining to paired fin morphology into the KB, which uses anatomical, quality, and taxonomic terms from multiple sources including the Uberon Anatomy Ontology, Phenotype and Trait Ontology, and Vertebrate Taxonomy Ontology.
Integration of the trait dataset with the teleost phylogeny resulted in a high proportion of missing data (98.1%). Our novel approach to reducing missing data was to combine data based on computational logical inference with curator-asserted data, an approach we call data propagation. Using inference enabled by ontology-based annotations, missing data was reduced to 86.25%; phylogenetic data propagation further reduced it to 36.19%. This case study demonstrates the gain in knowledge that is possible from the large-scale integration of ontology-annotated phenotypic and phylogenetic data.
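The idea behind phylogenetic data propagation can be pictured with a toy sketch: a trait state asserted for an internal taxon is pushed down to descendant tips that lack their own data, while curator assertions on more specific taxa take precedence. The tree, taxon names and trait states below are hypothetical, and the code is an illustration of the concept rather than the Phenoscape implementation.

    # Toy illustration of data propagation on a phylogeny: a trait value
    # asserted for an internal taxon is inherited by descendant tips that
    # have no assertion of their own. Tree and annotations are hypothetical.

    tree = {"Teleostei": ["CladeA", "CladeB"],
            "CladeA": ["species1", "species2"],
            "CladeB": ["species3"]}

    asserted = {"CladeA": "pectoral fin present", "species3": "pectoral fin absent"}

    def propagate(node, inherited, tree, asserted, out):
        state = asserted.get(node, inherited)      # curator assertion wins over inheritance
        if node not in tree:                       # leaf: record the final state
            out[node] = state
        for child in tree.get(node, []):
            propagate(child, state, tree, asserted, out)
        return out

    print(propagate("Teleostei", None, tree, asserted, {}))
    # {'species1': 'pectoral fin present', 'species2': 'pectoral fin present',
    #  'species3': 'pectoral fin absent'}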
Abstract: We’re back! For the second year running we are encouraging BioCuration experts to put their skills to use on a variety of research publications. This year’s "GigaCuration Challenge" will be run over the full four days of the Biocuration 2017 conference, March 26th to 29th, at Stanford, Palo Alto, USA. The motivation remains the same (see below), but the format of the challenge will be slightly different. Instead of a single prize for the overall highest number of annotations, we intend to provide incentives for a series of small challenges.
In recent years it has become clear that the amount of data being generated worldwide cannot be curated and annotated by any individual or small group. Currently, there is recognition that one of the best ways to provide ongoing curation is to employ the power of the community. To do this, the first difficulty to overcome was the development of user-friendly tools and apps that non-expert curators would be comfortable and capable of using. Such tools are now in place, including iCLiKVAL and Hypothes.is.
The second problem, which we are now facing, is bringing in and engaging the large number of people needed to perform the curation required to empower these tools. Eventually it is hoped that users will realize their utility and begin to both habitually use and add information to these apps. This will make these tools ever more useful and become a standard part of the process of carrying out research.
There are 4 parts to this year’s challenges, which means there are more chances for prizes!
1 – Tutorial Challenge (during the workshop on Sunday)
2 – Day 2 (all day Monday)
3 – Day 3 (all day Tuesday)
4 – The overall winner (covering from start of workshop through to midday on Wednesday)
Large Scale and Predictive Annotation/Big Data
Berg Hall, Rm A; Sunday, March 26, 12-1:30 PM
Abstract: The Comparative Toxicogenomics Database (CTD; http://ctdbase.org) is an extensive public resource that helps illuminate the molecular mechanisms by which environmental exposures affect human health. CTD’s literature-based content is derived from four manually curated modules: toxicogenomic core (describing chemical-gene interactions), disease core (providing chemical-disease and gene-disease associations), exposure (relating environmental stressor-receptor-event-outcome statements), and our new phenotype module (detailing chemical-induced modulations of molecular, cellular, and physiological traits). At CTD, we distinguish between phenotypes and diseases, wherein a phenotype refers to a non-disease-term biological event: e.g., abnormal cell cycle arrest (phenotype) vs. lung cancer (disease), increased fat cell differentiation (phenotype) vs. obesity (disease), decreased spermatogenesis (phenotype) vs. male infertility (disease), etc. All CTD chemical-phenotype interactions are notated in a structured format using controlled terms for chemicals (CTD Chemical), phenotypic traits (Gene Ontology), interaction qualifiers, species (NCBI Taxonomy), and anatomical descriptors (MeSH). These controlled terms are from well-established, community-accepted resources that include synonyms and accession identifiers, allowing terms to be computationally processed and mapped to additional vocabularies for database interoperability. To date, we have manually curated more than 15,800 scientific articles for this module, generating over 80,500 interactions that link 5,200 chemicals to 2,900 phenotypes for 630 anatomical terms in over 180 diverse species. Integrating this information with CTD’s extensive chemical-disease associations allows novel connections to be inferred between phenotypes and diseases (based upon shared chemicals), yielding potential insight into the biological processes of a pre-disease state before clinical manifestation of environmental-induced pathologies. Furthermore, integration of all four CTD modules furnishes unique opportunities to generate computationally predictive adverse outcome pathways, making connections between chemical-gene molecular initiating events, phenotypes, diseases, and population-level health outcomes. To our knowledge, this is the first comprehensive set of manually curated, literature-based contextualized chemical-phenotype data provided to the public.
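The inference of phenotype-disease connections through shared chemicals described above can be sketched as a simple join between the phenotype and disease modules. The chemical, phenotype and disease names in the toy code below are illustrative examples only, and the sketch is not CTD's production inference pipeline.

    from collections import defaultdict

    # Toy sketch: infer phenotype-disease links through shared chemicals.
    # The example associations are illustrative, not actual CTD content.
    chemical_phenotype = {"chemical X": {"abnormal cell cycle arrest"},
                          "chemical Y": {"increased fat cell differentiation"}}
    chemical_disease = {"chemical X": {"lung cancer"},
                        "chemical Y": {"obesity"}}

    inferred = defaultdict(set)
    for chemical, phenotypes in chemical_phenotype.items():
        for disease in chemical_disease.get(chemical, set()):
            for phenotype in phenotypes:
                inferred[(phenotype, disease)].add(chemical)   # shared chemical is the evidence

    for (phenotype, disease), chemicals in inferred.items():
        print(f"{phenotype} -> {disease} (via {', '.join(sorted(chemicals))})")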
Abstract: The UniRule system serves as an automatic annotation pipeline to supplement expert curation of the UniProt Knowledgebase (UniProtKB). Rules are created and tested by experienced curators based upon experimental data from annotated entries in UniProtKB/Swiss-Prot. Each UniRule uses the presence of specific protein family signature(s) coupled with taxonomy and/or calculated sequence features to predict the name, biochemical features and biological role of a protein. All predictions are refreshed with each UniProtKB release to reflect the latest knowledge. Rules created at PIR are mainly based on PIRSF homeomorphic families, and annotate both name/functional properties (PIRNRs) and sequence/site features (PIRSRs). PIRSF curation is at the core of rule curation. Provisional PIRSFs are created using UniRef50 clusters, then prioritized for curation based on experimental evidence (publications or 3D structure with ligand) for members and non-redundancy with existing UniRules. Curated PIRSFs are then integrated into InterPro for signature match calculation. The creation of seed rules and curation of PIRNRs are assisted by a semi-automatic method that checks for annotation consistency among reviewed members and taxonomic distribution of the family. Once curated, PIRNRs are uploaded into the UniRule system for final testing and production. Positional annotations in PIRSRs make use of structural templates and alignment-based site matches. To ensure scientific integrity and richness of annotation, rule conditions can be defined to propagate specific annotations to a subset of family members. PIRNRs and PIRSRs can be coupled to provide complementary annotation to family members. For example, PIRSF006230 is a family signature for UR000001476 and UR000177785. UR000001476 propagates functional annotation from characterized GTPase members to members with different subcellular location based on taxon scope, Eukaryota (mitochondria) and Bacteria (cytoplasm). UR000177785 further adds positional features (e.g., GTP-binding site) to the annotation. The PIRSR framework is being broadened for site rule curation using sequence features from the RESID database. For example, UR000248071 adds cysteine farnesylation lipid modification to members of the Phosphorylase kinase alpha/beta subunit (IPR008734) family. As member rules of the UniRule system, these PIRNRs and PIRSRs can be queried and viewed in the UniProt website, along with annotations in UniProtKB entries with evidence tags.
Abstract: An increasing amount of biomedical data are being stored in publicly-accessible online repositories. The ability to find and to access these data depends on the quality of the associated metadata. Despite the growing number of community-developed standards for describing biomedical experiments, the practical difficulties of creating accurate, complete, and consistent metadata are still considerable. The National Institutes of Health’s Big Data to Knowledge (BD2K) initiative is funding an array of projects to tackle different dimensions of the metadata challenge in the biomedical domain. The BD2K-funded CEDAR project focuses primarily on the metadata authoring process, and is building a suite of Web-based tools to make the authoring of high-quality metadata a manageable task. As a step towards decreasing authoring time and effort while increasing metadata quality, we have developed an array of predictive data entry capabilities. Our system analyzes previously generated metadata and uses this analysis to generate real-time suggestions for filling out metadata acquisition forms. These suggestions are context-sensitive, meaning that the values suggested for a particular field are updated continuously as a form is being populated. We present the techniques we have developed to provide this facility, and explain how our predictive entry technology enables easy and fast creation of high-quality metadata.
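As a toy illustration of such context-sensitive suggestions (not the actual CEDAR implementation; the record fields and values below are hypothetical), candidate values for a target field can be ranked by how often they co-occur with the values already entered on the form:

    from collections import Counter

    # Hypothetical corpus of previously authored metadata records.
    records = [
        {"organism": "Homo sapiens", "tissue": "liver", "platform": "Illumina HiSeq 2500"},
        {"organism": "Homo sapiens", "tissue": "liver", "platform": "Illumina HiSeq 4000"},
        {"organism": "Mus musculus", "tissue": "brain", "platform": "Illumina HiSeq 2500"},
    ]

    def suggest(target_field, filled, records, top_n=3):
        """Rank values of target_field by co-occurrence with the already-filled fields."""
        counts = Counter()
        for rec in records:
            if all(rec.get(f) == v for f, v in filled.items()):
                counts[rec[target_field]] += 1
        return counts.most_common(top_n)

    # Suggestions change as more of the form is populated (context-sensitive).
    print(suggest("platform", {"organism": "Homo sapiens", "tissue": "liver"}, records))

Because the suggestion is recomputed against the current form state, filling in one field immediately narrows the ranked candidates offered for the remaining fields.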
Abstract: In addition to the continuous growth of transcriptomics datasets, a few projects are generating very large amounts of data individually. This presents two challenges: analyzing these very large datasets in a meaningful way, and integrating them with the diversity of smaller datasets. Notably, GTEx provides very interesting information on human gene expression, but the full data is under restricted access and represents ≈80Tb of RNA-seq data from ≈12k samples. We present the integration of GTEx data with other human and animal transcriptome data in Bgee (http://bgee.org/), making it available from our website and our new R package BgeeDB [1]. We notably applied a stringent re-annotation process to the GTEx data to retain only healthy tissues. For instance, we rejected all samples for 31% of subjects, deemed globally unhealthy from the pathology report (e.g., drug abuse, diabetes, BMI>35), as well as specific samples from another 28% of subjects who had local pathologies (e.g., brain from Alzheimer patients). We also rejected samples with contamination from other tissues, using pathologist notes available under restricted access. In total, only 50% of samples were kept; these represent a high-quality subset of GTEx. All these samples were re-annotated manually to specific Uberon anatomy and aging terms. All corresponding RNA-seq data were then reanalyzed in the Bgee pipeline, consistently with all other healthy RNA-seq data from human and other species. These processed data are being made available both through the Bgee web-application and through the R package (with sensitive information hidden).
The R package allows users, first, to directly retrieve specific datasets (e.g., all RNA-seq data from human bone) and integrate them seamlessly into new or existing analyses. Moreover, both GTEx and other RNA-seq data can be retrieved in the same manner and in the same format, and have been processed homogeneously. Second, the R package and the Bgee website allow users to perform TopAnat analyses of enrichment of expression patterns from gene lists, which leverage the power of the abundant GTEx data integrated with many smaller datasets to provide biological insight into gene lists (e.g. [2]). TopAnat represents a new type of gene list enrichment analysis tool, unique to Bgee, which provides biological knowledge from such large transcriptomics datasets integrated with many smaller ones.
[1] Komljenovic et al F1000Research 2016, 5:2748
[2] Roux et al bioRxiv doi:10.1101/072959
Abstract: A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO).
We applied rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table.
All algorithms perform significantly better in predicting class values than the majority vote classifier. We find that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases from platform to molecule due to the decreasing number of unique values of these elements (2697 platforms, 537 organisms, 454 labels, 5 types, and 9 molecules). Our work suggests that experimental metadata such as those present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
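The kind of rule the study mines can be pictured as "if field A has value x, then field B likely has value y". The toy sketch below derives such rules from co-occurrence counts and applies the best-supported one to predict a missing value; the records and values are hypothetical, and the actual evaluation used established algorithms (Apriori, Predictive Apriori, PART, Decision Table) rather than this simplified code.

    from collections import Counter

    # Hypothetical GEO-like records.
    records = [
        {"molecule": "total RNA", "label": "biotin", "platform": "GPL570"},
        {"molecule": "total RNA", "label": "biotin", "platform": "GPL570"},
        {"molecule": "genomic DNA", "label": "Cy3", "platform": "GPL1234"},
    ]

    def mine_rules(records, antecedent, consequent):
        """For each antecedent value, keep the best-supported consequent value."""
        counts = Counter((r[antecedent], r[consequent]) for r in records)
        rules = {}
        for (a_val, c_val), support in counts.items():
            if support > rules.get(a_val, ("", 0))[1]:
                rules[a_val] = (c_val, support)
        return rules

    rules = mine_rules(records, "molecule", "label")
    print(rules)                  # {'total RNA': ('biotin', 2), 'genomic DNA': ('Cy3', 1)}
    print(rules["total RNA"][0])  # predicted label for a record missing this field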
Abstract: Rice (species in the genus Oryza) is not only an important model organism for monocots but also the most widely consumed staple food for a large part of the world's human population (about 2.5 billion people). Recently, thousands of rice accessions have been re-sequenced, and single nucleotide polymorphisms (SNPs) are important molecular markers for rice breeding. We therefore collected almost all publicly available re-sequenced rice accessions (5,152) and identified about 18M SNPs against the Os-Nipponbare-Reference-IRGSP-1.0 pseudomolecule using a unified, standard SNP-calling pipeline. Here we provide detailed annotation of the variations, including SNP consequence, gene function, etc., hyper-linked to other databases (e.g. dbSNP). We curated the SNP effects by integrating available genotype-to-phenotype association results. We developed a powerful ‘Search Engine’ and a multi-track ‘Browser’ for users to obtain the desired information. We also developed three practical online tools, RiceClustwl, Population Genetic Analysis and Gene Haplotype Analysis, for advanced data mining and analysis, providing a useful molecular module breeding platform.
Abstract: The Gene Expression Nebulas (GEN; http://bigd.big.ac.cn/gen) is a data portal of gene expression profiles based entirely on RNA-Seq data. High-throughput sequencing technologies provide a revolutionary way for transcriptome profiling, enable facile generation of large-scale sequencing data and accordingly facilitate high-resolution quantification of gene expression levels across a variety of tissues and treatments. GEN currently hosts two featured resources, namely, Mammalian Transcriptomic Database (MTD) and Rice Expression Database (RED). MTD (http://bigd.big.ac.cn/mtd) is a mammalian transcriptomic database that is based on large quantities of RNA-Seq data across various tissues/cell lines and provides a valuable resource for mammalian transcriptomic and evolutionary studies. RED (http://expression.ic4r.org), a committed project of Information Commons for Rice (IC4R), is an integrated database hosting rice gene expression profiles derived entirely from high-quality RNA-Seq data. RED contains a comprehensive collection of 284 high-quality RNA-Seq experiments and thus houses a large number of gene expression profiles that span a broad range of rice growth stages and cover a wide variety of biotic and abiotic treatments. Thus, GEN, integrating RNA-Seq-derived gene expression profiles, is of fundamental significance for deciphering functional elements under diverse conditions and characterizing the dynamics of transcriptomic regulation.
Abstract: Biological data repositories that gather data through expert curation of published literature add new data types to their systems as research methods and priorities change. Additionally, resource limitations can result in incomplete curation of published papers. Therefore, expertly curated data sets can be incomplete when compared to what has actually been published. Knowing that a data set may be incomplete can justify further exploration of published literature for additional data of interest. In this work, the tested hypothesis was that machine learning methods could be used to identify genes in the Zebrafish Model Organism Database that had incomplete curated gene expression data sets. A strong linear correlation was observed between the number of experiment records in ZFIN gene expression annotations for a gene and the total number of journal publications associated with the gene. Starting with 36655 gene records from ZFIN, a data aggregation, cleansing, and filtering process reduced the set to 9870 gene records suitable for building and testing a predictive model. Feature selection and engineering reduced relevant features to the total number of journal publications, the number of journal publications already used for gene expression annotation, the ratio of those values, and the number of transgenic constructs associated with each gene. These features were used to train a linear regression model with 25% of the available gene records (2483 genes selected using a randomized stratified split) to predict how many gene expression experiments each gene should have. The remaining 7387 genes were used to test the model. Of the genes tested, 122 had a residual (actual - predicted) expression experiment count outside the 95% confidence interval (2x RMSE = 6.74) of the regression model, suggesting they were missing expression annotations. Publications not already annotated for expression data from 100 of those 122 genes were examined manually for missing expression annotations. Genes were scored as missing expression data as soon as one paper was found that contained expression data for that gene. The model was able to identify genes with unannotated published expression data with a precision of 0.97 and recall of 0.71. This method can be used to reliably identify specific genes that are likely to be missing curation of published expression data and gauge whether to look further for published data to augment the existing expertly curated information.
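A minimal sketch of the modeling step described above is shown below, assuming scikit-learn and a per-gene feature table (total publications, publications already used for expression annotation, their ratio, and transgenic construct counts). The data here are synthetic stand-ins, and flagging residuals beyond twice the RMSE only mirrors the 95% confidence criterion of the abstract; this is illustrative code, not the actual ZFIN pipeline.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Synthetic per-gene features: [total publications, publications used for
    # expression annotation, ratio, transgenic construct count]; the target is
    # the observed number of expression experiments per gene.
    rng = np.random.default_rng(0)
    total_pubs = rng.integers(1, 200, size=500)
    used_pubs = (total_pubs * rng.uniform(0.2, 0.9, size=500)).astype(int)
    constructs = rng.integers(0, 20, size=500)
    X = np.column_stack([total_pubs, used_pubs, used_pubs / total_pubs, constructs])
    y = used_pubs * 2 + rng.normal(0, 3, size=500)   # stand-in target

    # Train on 25% of genes, test on the remaining 75%, as in the abstract.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75, random_state=0)
    model = LinearRegression().fit(X_train, y_train)

    residuals = y_test - model.predict(X_test)       # actual minus predicted
    rmse = np.sqrt(np.mean(residuals ** 2))
    # Genes far outside the model's prediction (beyond 2x RMSE); those with fewer
    # experiments than predicted are the candidates for missing annotations.
    flagged = np.where(np.abs(residuals) > 2 * rmse)[0]
    print(f"RMSE = {rmse:.2f}; {len(flagged)} genes flagged for manual review")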
Abstract: Protein-protein interactions between a microbial pathogen and its host play a vital role in initiating and maintaining infection, and host-pathogen interactions (HPI) are required for understanding disease. The Host-Pathogen Interaction Database (HPIDB, http://www.agbase.msstate.edu/hpi/main.html) is a public resource that facilitates annotation, modeling, analysis and computational prediction of HPIs. HPIDB provides manually curated, experimentally verified HPIs and integrates existing HPIs compiled from multiple databases. The HPIDB curation effort complies with IMEx consortium standards and focuses on livestock pathogens. To date (February 2017), HPIDB contains a total of 52,953 HPI entries (58 hosts and 524 pathogens). Despite this progress in experimental HPI data, many diseases have few or no available HPIs. To aid researchers in targeting and experimentally validating other important HPIs, we are developing a prediction model by leveraging the existing experimentally verified HPIs. We use machine learning to compare multiple protein-related features and computationally predict novel HPIs for a broad range of pathogens. We will report on the effort to develop this prediction tool, including its accuracy and ability to recover known HPIs. Our goal is to develop this prediction workflow as an online tool so that researchers can rapidly predict and model interactions for their own pathogens of interest on a genome-wide scale; we expect that this modeling will complement wet lab work to identify potential therapeutic targets.
Abstract: The wide availability of the internet has greatly accelerated the pace of online media generation and utilization. Researchers are producing and consuming more online media than ever, in various forms: documents, images, audio, video, software code, and more. A vast amount of information is concealed in multimedia content, though it is often not discoverable due to a lack of structured and curated annotations. In addition, we keep developing applications with the mindset of permanent and fast internet connectivity, while in reality we have to contend with temporarily broken or low-bandwidth connections. To circumvent this issue, we have developed an offline-first browser extension for Google Chrome.
The iCLiKVAL browser extension is an easy-to-use application, which uses the iCLiKVAL API to save free-form but structured annotations as “key-value” pairs with optional “relationships” between them. iCLiKVAL is a web-based application (http://iclikval.riken.jp) that uses the power of crowdsourcing to collect annotation information for all scientific literature and media found online. The philosophy behind iCLiKVAL is to identify the online media by a unique URI (Uniform Resource Identifier) and assign semantic value to make information easier to find and allow for much more sophisticated data searches and integration with other data sources.
The browser extension enables users to add, update, and delete annotations in offline mode; whenever an internet connection is available, the data are synchronized automatically with the server via the iCLiKVAL API. When online, users can locate media using a media identifier (DOI, PubMed ID, YouTube ID, etc.) to create media lists to be reviewed; while offline, they can add media to the list using temporary media identifiers, which are validated once the application is back online.
Whether online or offline, we hope this extension simplifies the process for users to safely and conveniently submit their valuable annotations to iCLiKVAL.
Abstract: FlyBase (http://flybase.org) is an essential resource of genetic and molecular data for the Drosophila research community. In the past decade, the exponential growth in the number of large datasets has presented FlyBase (and similar databases) with the challenge of incorporating big data alongside more traditional data types. With the goal of increasing the research community's accessibility to big data, we propose a system of dataset tracking intended to provide the Drosophila researcher with a unified, comprehensive and well-indexed catalog of large datasets that provides links to data repository submissions and their related published results. Datasets are curated according to a four-level classification system (project, biosample, assay and result) to permit more sophisticated tracking of data provenance and facilitate metadata presentation to the public. Metadata for these datasets are described using a variety of controlled vocabularies to allow for improved searchability. The key genes related to datasets, both those that are experimentally manipulated and those that come up as hits in analyses, are captured to provide a listing of the most relevant datasets for a given gene. We anticipate that this dataset tracking system will not only increase accessibility to researchers, but also serve as a useful foundation for future incorporation of genomic data into FlyBase.
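The four-level classification could be represented as a simple hierarchy of linked records; the sketch below uses hypothetical Python dataclasses (with invented identifiers and example gene symbols) purely to illustrate how project, biosample, assay and result levels might chain together with controlled-vocabulary metadata. It is not the FlyBase schema.

    from dataclasses import dataclass, field
    from typing import List, Optional

    # Hypothetical sketch of a four-level dataset hierarchy
    # (project -> biosample -> assay -> result); not the FlyBase schema.

    @dataclass
    class DatasetEntity:
        identifier: str
        level: str                       # "project", "biosample", "assay" or "result"
        parent: Optional[str] = None     # identifier of the parent entity
        cv_terms: List[str] = field(default_factory=list)   # controlled-vocabulary metadata
        key_genes: List[str] = field(default_factory=list)  # manipulated genes or analysis hits
        repository_link: Optional[str] = None                # link to the repository submission

    project = DatasetEntity("DS0001", "project", cv_terms=["RNA-seq"])
    sample = DatasetEntity("DS0001.s1", "biosample", parent=project.identifier,
                           cv_terms=["wing disc", "third instar larva"])
    assay = DatasetEntity("DS0001.s1.a1", "assay", parent=sample.identifier,
                          repository_link="(example repository accession)")
    result = DatasetEntity("DS0001.s1.a1.r1", "result", parent=assay.identifier,
                           key_genes=["vg", "sd"])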
Abstract: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical science. Assigning functions to biological macromolecules, especially proteins, is a major challenge to understanding life on a molecular level. While experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing computational methods and tracking progress in the field remains a challenge.
The Critical Assessment of Functional Annotation (CAFA) is a framework to assess computational methods that automatically assign protein function. Here we report the results of the second CAFA challenge, and outline the changes that will take place in the third CAFA this year. One hundred and twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions for 3,681 proteins from 11 species. These functions are described by the Gene Ontology (GO) and the Human Phenotype Ontology (HPO). CAFA2 featured increased data size as well as improved assessment metrics, especially additional assessment metrics based on semantic similarity. Comparisons between top-performing methods in CAFA1 and CAFA2 showed significant improvement in prediction accuracy, demonstrating the general improvement of automatic protein function prediction algorithms. It also showed that the performance of different methods is ontology-specific, and that the biochemical function of a protein is better predicted than the biological process it participates in, or its cellular location. The upcoming CAFA3 will feature a special track for term-centric predictions by utilizing whole-genome screening of several species for specific functions. We are also looking at predictions of moonlighting proteins: proteins with two or more unrelated functions. CAFA3 has recently closed its submission deadline, and it features 345 submitted methods from 63 groups.
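For intuition, protein-centric evaluation of the kind used in CAFA can be reduced to comparing a predicted set of ontology terms against the benchmark set for each protein. The minimal sketch below computes precision and recall for one protein at a fixed score threshold; the GO identifiers are illustrative examples, and the sketch deliberately ignores the hierarchical term propagation, threshold sweeps and semantic-similarity metrics used in the actual assessment.

    # Minimal precision/recall for one protein at a fixed score threshold.
    # Real CAFA evaluation propagates terms up the ontology and sweeps thresholds;
    # this sketch ignores both for brevity. Term IDs are illustrative.

    predicted_scores = {"GO:0003700": 0.9, "GO:0006355": 0.7, "GO:0005634": 0.3}
    benchmark = {"GO:0003700", "GO:0006355", "GO:0045944"}

    threshold = 0.5
    predicted = {t for t, s in predicted_scores.items() if s >= threshold}

    tp = len(predicted & benchmark)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(benchmark)
    print(f"precision = {precision:.2f}, recall = {recall:.2f}")   # 1.00 and 0.67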
Functional Annotation
Berg Hall, Rm A; Sunday, March 26, 12-1:30 PM
Abstract: Pathway Tools is a comprehensive software package that supports building, editing and publishing Pathway/Genome Databases (PGDBs). Using genome sequence/annotation files (e.g. from GenBank or RefSeq), Pathway Tools can automatically build PGDBs and predict the metabolic pathways present in an organism.
A typical genome annotation file contains only a selected number of data types and a small fraction of the experimental information available for an organism. Pathway Tools therefore provides a suite of editing tools that allow curators to add and edit a wide variety of data types in a PGDB. Here we highlight a small number of these tools.
Metabolic pathways have been at the core of the Pathway Tools software since its inception. Although cellular metabolism may appear old-fashioned, metabolic engineering, metabolic modeling, and large sequencing projects such as the human microbiome project have recently sparked renewed interest in and study of basic metabolism. Longstanding metabolic mysteries are being solved with new experimental tools, and metabolic pathways for the biosynthesis and breakdown of newly discovered metabolites are being elucidated. It is important to include these pathways in curated databases, and Pathway Tools contains a complete suite of editing tools to create new compounds, new reactions, and new pathways and associate them with the proper enzymes.
The functions of protein and RNA gene products can be captured in multiple ways. Curators can author summaries and add information such as protein features and modification sites, and can create homo- and heteromultimeric complexes and associate them with the proper functions. For structured annotations, a Gene Ontology editor enables association of GO terms, including evidence codes and citations, with gene products.
An essential facet of biology is the regulation of biological processes. Pathway Tools enables representation of multiple types of regulation at the level of transcription, translation, and activity within a PGDB. Regulation of transcription by sigma factors, transcription factors, RNA polymerase binding factors, and others can be added using a dedicated editor. Regulation of translation by mechanisms including attenuation, riboswitches and small RNAs can be represented. Cofactors, kinetic parameters, and regulation of the catalytic activity of enzymes, including the type of regulation (activation or inhibition; competitive, allosteric, or other types), can be added to enzymes and enzymatic reactions.
The sequencing and annotation of a genome can rarely be considered “finished”; both the sequence and the structural and functional annotations can be updated within GenBank or RefSeq files. However, once a PGDB contains manually curated information, it cannot simply be re-built without losing the curation work. Pathway Tools therefore offers tools to edit the genome sequence, which will automatically change the sequence and positional information of all database objects that are affected by the change. More frequently, the functional annotations in a genome are changed to incorporate the output of improved annotation tools or new experimental evidence. Pathway Tools also enables updating an existing PGDB with a new annotation file.
Abstract: A significant amount of experimental data is available only in the published literature. While suitable for manual reading, these experimental data are otherwise presented in a machine-unreadable format. Semantic digitization of experimental data offers a solution by rendering the data amenable to computer-based analysis. This is in contrast to the usual text-mining-based literature curation, which draws information from the text of the research article. MCDRP (Manually Curated Database of Rice Proteins; www.genomeindia.org/biocuration) is a comprehensive resource for browsing and retrieving knowledge on rice proteins, manually curated from the rice literature and stored in a protein-centric manner. The curation process utilizes in-house developed models to convert pictorial or graphical experimental data into digital format. Since the data have been curated with the help of universal ontologies and notations, information from different experiments is naturally correlated. Here we attempt to demonstrate the flexibility and depth of information gathered as a result of digitization of experimental data. Some of the most interesting correlations can be drawn by analyzing proteins that share a common ‘Trait’. Digitization and integration of data from about 20,000 different experiments contained in over 500 research articles identified several traits that are regulated by one or more rice proteins, resulting in a complex network of associations. The current release has around 840 different traits mapped onto ~394 rice proteins, of which around 286 traits are associated with more than one rice protein. Of these 394 trait-regulatory proteins, physical interaction data have been digitized for 76 proteins in MCDRP. Integration of the digitized protein-trait association data and protein-interaction data into a single model provides probabilistic functional gene networks. Analysis of these networks indicates several putative and as yet unknown functional associations between rice proteins. Moreover, these networks can also be overlaid with information such as the associated tissue or molecular functions. Thus, it was possible to demonstrate that such a biocuration/digitization endeavour can create knowledge nests that are precise and comprehensive. The digitized experimental data have high granularity and ease of access while being amenable to semantic integration.
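As an illustration of how digitized protein-trait associations and protein-protein interactions could be combined into a single network, the sketch below builds a small graph with networkx; the protein and trait names are hypothetical placeholders, and this is not MCDRP's actual pipeline.

```python
# Illustrative sketch (not MCDRP's pipeline): combining digitized protein-trait
# associations with physical protein-protein interactions in one graph.
# Protein and trait names are hypothetical placeholders.
import networkx as nx

protein_trait = [("OsPROT1", "plant height"), ("OsPROT2", "plant height"),
                 ("OsPROT2", "salt tolerance")]
protein_protein = [("OsPROT1", "OsPROT2")]

G = nx.Graph()
for protein, trait in protein_trait:
    G.add_node(protein, kind="protein")
    G.add_node(trait, kind="trait")
    G.add_edge(protein, trait, source="trait annotation")
for a, b in protein_protein:
    G.add_edge(a, b, source="physical interaction")

# Traits regulated by more than one protein hint at shared pathways.
shared = [n for n, d in G.nodes(data=True)
          if d["kind"] == "trait" and G.degree(n) > 1]
print("traits with multiple regulators:", shared)
```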
Abstract: InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. The resource is based around diagnostic models (profile hidden Markov models, profiles, position-specific scoring matrices and regular expressions) provided by 14 different member databases, against which protein sequences are searched to determine their potential functions. Two recently added member databases, CDD and SFLD, also provide specific residue-level annotation. Their addition has enabled identification of important amino acids, such as those responsible for ligand binding or protein-protein interactions, or those comprising active sites, providing an additional tier to the annotations already provided by InterPro.
In a further expansion, InterPro has also added prediction of intrinsically disordered regions (IDRs) – polypeptide segments that have little or no three-dimensional structure. IDRs have a wide range of potential functions, from acting as flexible linkers between domains to interacting with other proteins and modifying their activity. Due to their biased composition and differing patterns of amino acid conservation, IDRs are hard to model using profile approaches, hence annotation of these regions has been limited to date. To address this, InterPro has integrated the MobiDB Lite tool, which combines eight different prediction methods to generate consensus IDR annotations, focusing on long disordered regions (>20 amino acids). This approach has yielded IDR annotations for ~24% of sequences in UniProtKB, covering ~5.5% of amino acid residues.
Here we describe how the new per-residue and IDR developments in InterPro enable more richly-detailed and informative annotation of protein sequences, and how they can be used to support more specific functional inferences.
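To make the consensus idea concrete, here is a minimal majority-vote sketch over per-residue disorder calls that keeps only regions longer than 20 residues; MobiDB Lite's actual rules (eight specific predictors and merging heuristics) are more elaborate, so treat this only as an illustration.

```python
# Illustrative majority-vote consensus over per-residue disorder predictions,
# keeping only long regions (>20 aa). Not MobiDB Lite's exact algorithm.

def consensus_idrs(per_predictor_calls, min_votes, min_length=20):
    """per_predictor_calls: list of equal-length lists of 0/1 disorder calls."""
    length = len(per_predictor_calls[0])
    votes = [sum(calls[i] for calls in per_predictor_calls) for i in range(length)]
    consensus = [v >= min_votes for v in votes]

    regions, start = [], None
    for i, flag in enumerate(consensus + [False]):  # sentinel closes a trailing region
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > min_length:
                regions.append((start + 1, i))  # 1-based, inclusive coordinates
            start = None
    return regions

calls = [[0] * 10 + [1] * 30 + [0] * 10,
         [0] * 12 + [1] * 28 + [0] * 10,
         [1] * 5 + [0] * 45]
print(consensus_idrs(calls, min_votes=2))
```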
Abstract: In a world of digital revolution, data output in life sciences has undergone massive growth. Scientific data curators are more important than ever, and play a central role in deriving meaningful insights through careful analysis and manual annotation of biological data.
Pfam (pfam.xfam.org) is a widely used database for the identification of protein families and domains. Each Pfam entry relies on expert curation, enabling entries to be systematically and routinely compared to the ever-growing amount of life science sequence data. Pfam’s strategy uses profile hidden Markov models (HMMs), which group protein sequences into entries based on homology. Each profile HMM is built from a manually curated multiple sequence alignment (termed the seed alignment). Curators also annotate each entry with functional information where possible, thereby allowing the reliable transfer of annotation to all matching sequences.
Recently, Pfam seed alignments have increasingly been built on the UniProtKB Representative Proteomes (RP) sequence set, which is a reduced sequence set that covers the most scientifically important organisms. Migrating Pfam seed alignments to this data set means that each HMM is constructed on a more stable set of sequences with the highest level of experimental data, while still ensuring sensitive detection of homologs.
In an effort to incorporate as much experimental information as possible, we routinely integrate information from key databases. One such resource is the Protein Data Bank in Europe, which we use to ensure that as many known structures as possible are covered by one or more Pfam entries. Furthermore, we also improve family information by grouping entries into related sets, known as Clans. This adds another layer of curation that requires a deeper understanding of the relationships between the families, which is aided by careful analysis of the aforementioned structural data, sequence information, scientific literature, and tools that allow detection of similarity between profile HMMs.
By combining all of these curation efforts, we are constantly increasing our coverage of protein datasets. This is reflected in our sequence and residue coverage (i.e. the proportion of sequences and residues covered by the database). The latest Pfam release contains 16,712 entries, covering 76% of sequences and 55% of residues in UniProtKB.
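The profile-HMM workflow described above can be sketched with the HMMER suite, on which Pfam is built; in the snippet below the file names are placeholders and only the most basic options are shown, so consult the HMMER documentation for real use.

```python
# Sketch of the profile-HMM workflow underlying a Pfam-style entry, using the
# HMMER suite. File names are placeholders.
import subprocess

SEED_ALIGNMENT = "family_seed.sto"   # manually curated seed alignment (Stockholm format)
HMM_FILE = "family.hmm"
TARGET_SEQS = "uniprot_subset.fasta"

# 1. Build a profile HMM from the curated seed alignment.
subprocess.run(["hmmbuild", HMM_FILE, SEED_ALIGNMENT], check=True)

# 2. Search the profile against a sequence database to find homologues.
subprocess.run(
    ["hmmsearch", "--tblout", "hits.tbl", HMM_FILE, TARGET_SEQS],
    check=True,
)

# 3. Parse the tabular output: the target name is the first whitespace-separated field.
with open("hits.tbl") as handle:
    hits = [line.split()[0] for line in handle if not line.startswith("#")]
print(f"{len(hits)} sequences matched the family model")
```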
Abstract: Mouse Genome Informatics (MGI, http://www.informatics.jax.org) is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. MGI uses the Gene Ontology (GO, http://www.geneontology.org) for functional annotation of mouse genes. The GO defines concepts used to describe gene product function, location, and participation in biological processes, as well as relationships between these concepts. However, single eukaryotic genes can encode multiple protein isoforms due to the usage of alternate promoters or polyadenylation sites, alternative splicing of the primary transcript to generate different mRNAs, and/or selection of alternative start sites during translation of an mRNA. Proteins can be further subjected to single or multiple post-translational processing events, including proteolytic cleavage as well as protein amino acid modifications. Therefore, the function or cellular location of these different protein entities (proteoforms) can often be quite different. To provide the most accurate level of annotation, MGI curators make literature-based manual GO annotations in which specific proteoforms of the annotated genes are indicated using proteoform-specific IDs provided by the Protein Ontology (PRO).
PRO (http://proconsortium.org) is a resource that supplies unique identifiers to specific proteoforms resulting from expression of a gene. These forms are organized in an ontological framework that explicitly describes how these entities relate. The ontology currently has over 68,600 isoforms and 6440 modified proteoforms, which are either imported from high-quality sources or added via literature-based annotation by PRO curators.
The GO annotations to proteoforms are grouped according to the encoding gene and can be displayed at MGI, as well as in the AmiGO browser of the Gene Ontology Consortium (http://amigo.geneontology.org/amigo). The annotations are also provided to the PRO website, where they can be viewed in the context of other proteoforms.
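Proteoform-specific GO annotations of this kind can be exchanged in the GO Annotation File (GAF) format, where the final column ("Gene Product Form ID") carries the PRO or isoform identifier. The snippet below assembles one such tab-separated line; all accessions and the reference are placeholders, not real MGI annotations.

```python
# Sketch of a proteoform-specific GO annotation in GAF format, where column 17
# ("Gene Product Form ID") carries the PRO identifier. Identifiers are placeholders.

columns = [
    "MGI",                 # 1  DB
    "MGI:EXAMPLE",         # 2  DB object ID (placeholder)
    "ExampleGene",         # 3  DB object symbol (placeholder)
    "",                    # 4  qualifier
    "GO:0005634",          # 5  GO ID (nucleus)
    "PMID:0000000",        # 6  reference (placeholder)
    "IDA",                 # 7  evidence code
    "",                    # 8  with/from
    "C",                   # 9  aspect (cellular component)
    "", "",                # 10-11 object name, synonyms
    "protein",             # 12 DB object type
    "taxon:10090",         # 13 taxon (mouse)
    "20170326",            # 14 date
    "MGI",                 # 15 assigned by
    "",                    # 16 annotation extension
    "PR:000000001",        # 17 gene product form ID (placeholder PRO ID)
]
print("\t".join(columns))
```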
Supported by NIH Grants HG000330, HG002273, and GM080646.
Abstract: The Conserved Domain Database (CDD) is a resource for the functional annotation of proteins and protein-coding genes. CDD provides locations of evolutionarily conserved domain footprints and site-specific functional features associated with such footprints. Aside from providing multiple sequence alignments and BLAST-style position-specific scoring matrices (PSSMs) for a number of imported domain- and protein-family models, CDD runs a data curation program that generates fine-grained hierarchical classifications for large and widely distributed protein domain families, with the aim of providing more precise characterizations of molecular and cellular functions for corresponding protein families.
CDD can be searched with protein or nucleotide query sequences via NCBI’s CD-Search interface, which utilizes RPS-BLAST, a variant of PSI-BLAST. Pre-computed search results, available for the majority of proteins tracked by NCBI's Entrez database system, can be retrieved quickly and are used, for example, to group proteins by similarities of their domain architectures. Recently, domain model hierarchies have been submitted to the InterPro resource, a freely accessible protein annotation service, for integration with the other model collections utilized by InterPro. Integration was made possible by providing an RPS-BLAST utility, 'rpsbproc', implemented as an addition to the standalone RPS-BLAST program, which allows users to locally reproduce detailed Conserved Domain (CD)-Search results, including domain superfamily assignments and the predicted locations of conserved functional sites.
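A local reproduction of CD-Search results along these lines typically chains an RPS-BLAST search with rpsbproc post-processing. The sketch below shows the general shape of that workflow; the option names are reproduced from memory of the documented usage and should be verified against the rpsbproc README and the BLAST+ manual before use.

```python
# Sketch of reproducing CD-Search results locally with RPS-BLAST and rpsbproc.
# Option names are best-effort recollections of the documented workflow; verify
# them against the rpsbproc README and the BLAST+ manual.
import subprocess

# 1. Search query proteins against the pre-formatted CDD PSSM database and
#    write a BLAST archive (ASN.1) that rpsbproc can post-process.
subprocess.run(
    ["rpsblast", "-query", "queries.fasta", "-db", "Cdd",
     "-evalue", "0.01", "-outfmt", "11", "-out", "results.asn"],
    check=True,
)

# 2. Post-process the archive to recover domain superfamily assignments and
#    conserved functional-site annotations, as in the web CD-Search output.
subprocess.run(["rpsbproc", "-i", "results.asn", "-o", "results.txt"], check=True)
```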
Abstract: SPARCLE (Subfamily Protein Architecture Labeling Engine) is a curation interface and corresponding database developed at the National Center for Biotechnology Information (NCBI) that allows Conserved Domain Database (CDD) curators to functionally characterize and label protein sequences that have been grouped by their sub-family domain architecture. Sub-family domain architectures result from annotating protein sequences with domain footprints provided by CDD, a resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. This resource includes NCBI-curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as protein and domain models imported from external source databases such as Pfam, SMART, and TIGRFAMs. The protein names and labels that are given reflect the underlying protein and domain model collections and therefore range from generic to very specific. In many cases they do provide concise functional annotations that are easier to interpret than raw representations of domain architecture. To date SPARCLE has been used to name over 10 million prokaryotic proteins in RefSeq.
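Conceptually, this approach amounts to grouping proteins by an ordered domain-architecture signature and attaching one curated label per group. The sketch below illustrates that idea with made-up accession-like strings and labels; it is not NCBI's internal SPARCLE implementation.

```python
# Illustrative sketch (not the actual SPARCLE implementation): group proteins
# by their domain-architecture signature, then apply one curated label per group.
from collections import defaultdict

# Made-up proteins with ordered tuples of conserved-domain accessions.
architectures = {
    "WP_000001": ("cd00001", "pfam00002"),
    "WP_000002": ("cd00001", "pfam00002"),
    "WP_000003": ("cd00003",),
}
curated_labels = {("cd00001", "pfam00002"): "putative two-domain hydrolase"}

groups = defaultdict(list)
for protein, arch in architectures.items():
    groups[arch].append(protein)

for arch, members in groups.items():
    label = curated_labels.get(arch, "uncharacterized protein")
    for protein in members:
        print(protein, "->", label)
```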
Abstract: NCBI's Conserved Domain Database (CDD) is a collection of annotated multiple sequence alignment models that characterize protein segments conserved in molecular evolution. Evolutionarily related alignments are arranged hierarchically according to conserved and divergent sequence and structural features. The construction of each hierarchy requires the identification of protein subgroups and the optimization of sequence alignments both globally and within each subgroup, and it facilitates the annotation of site-specific functional features. CDD also supports comparative analyses of protein families via conserved domain architectures, and a recent curation effort focuses on providing functional characterizations of distinct subfamily architectures using SPARCLE: Subfamily Protein Architecture Labeling Engine.
Because these processes currently require substantial manual curation, we are developing automated strategies to generate well-defined protein domain hierarchies computationally. In an initial step, protein databases are searched for matches to an existing hierarchy of multiple sequence alignments. Next, Markov chain Monte Carlo (MCMC) sampling refines the resulting, typically much larger, alignment. Another MCMC sampler then partitions the sequences into subgroups based on distinguishing patterns of conserved and divergent residues within these alignments. At the same time, it arranges these subgroups hierarchically and, for each subgroup, identifies signature residue patterns that are conserved over long periods of evolutionary time and therefore presumably play critical functional roles within that subgroup. This output is further refined to consolidate and extend aligned blocks, remove partial sequences, and name subgroups through text mining, and is finally imported into CDD refinement and visualization tools. Subsequently, this entire process can be reiterated to further refine each of the alignment hierarchies.
Completed models will be made available for searches through BLAST and other NCBI resources. By automating significant portions of the curation pipeline, we hope to be able to greatly expand the set of well-defined models in CDD, identify potentially important residues in the absence of experimental evidence, provide additional statistical support for hierarchical classifications, and support domain architecture-based family characterization via SPARCLE.
Abstract: Interest in primary cilia has increased dramatically over the last ten years as it has become clear that ciliopathies are an underlying cause of numerous human diseases including some types of retinitis pigmentosa and polycystic kidney disease. Once thought to be restricted to a few cell types, it is now clear that primary cilia are found on almost all vertebrate cells and are critical to developmental pathways such as those mediated by Sonic hedgehog (Shh) signaling. Mouse models play a key role in developing our understanding of the role of primary cilia in Shh signaling during development throughout the embryo and in ongoing maintenance of structures such as photoreceptors.
To maximize the utility of the wealth of experimental data generated by these mouse ciliopathy models, we have comprehensively annotated experimentally characterized ciliary genes of the laboratory mouse using Gene Ontology (GO) terms to describe their molecular functions, biological roles, and cellular locations, using the mouse orthologs of the SYSCILIA gold standard of known human ciliary components as a starting point. However, experimental annotations of mouse genes may allow additional genes to be added to the SYSCILIA gold standard list. We have also updated the Gene Ontology to add new terms and clarify existing terms to represent recent advances in our understanding of ciliary biology. These improvements to the representation of cilia within GO greatly improve our ability to make informative annotations for ciliary genes. Comprehensive GO annotation of ciliary genes in the mouse will be a great resource for those doing high-throughput studies or comparative genomic analyses across species.
KRC, PR, and JAB were funded by NIH NHGRI grant U41 HG 002273 to the Gene Ontology Consortium. TJPvD and TJG were funded by EU FP7/2009 (SYSCILIA grant agreement no: 241955). TJPvD was also funded by the Virgo consortium (FES0908) and by the Utrecht Bioinformatics Center. TJG and JL were supported by EMBL core funding.
Abstract: Xenbase (http://www.xenbase.org) is a web-accessible, NIH-funded bioinformatics database that integrates diverse genomic, expression and functional data for Xenopus, an important tetrapod model in numerous, diverse fields of biomedical and basic science research. The Xenbase website plays an indispensable role in making Xenopus data accessible to the entire research community. Xenbase is a data portal for researchers, accelerating scientific discovery by enabling novel connections between Xenopus, humans and other model systems. Xenbase provides the latest genome assemblies, functional data, expression profiles, relevant literature, specific experimental reagents, and links to numerous resources including the Xenopus stock centers. As the Xenopus community hub, Xenbase provides a forum for up-to-date information on the community, events, funding, jobs, and the latest developments in Xenopus research.
Abstract: The ground-level annotations ENCODE applies to the epigenome and the transcriptome are produced by defined cloud-based computational pipelines. These pipelines are run and shared for the primary analysis of ChIP-seq, RNA-seq, DNase-seq, and whole-genome bisulfite experiments. By standardizing the computational methodologies for analysis and quality control, results from multiple labs can be directly compared and integrated into higher-level annotations, such as ENCODE Candidate Regulatory Elements (CREs). Furthermore, because the computational pipeline is integrated with the ENCODE Portal, analysis metadata and data outputs are automatically accessioned and made available immediately. Standardizing and integrating the data analysis, quality assurance, and accessioning pipelines are important steps toward our goal of integrating submissions of complementary experiment results from anyone in the scientific community. In this way, standards-based analyses and the standardized metadata that describe them are not only important products of ENCODE but also enable complementary submissions at a crowd-sourced scale.
ENCODE analyses are distributed through the ENCODE Portal at https://www.encodeproject.org/. The pipelines are available as “ENCODE Uniform Processing Pipelines” at https://platform.dnanexus.com/projects/featured. The ENCODE DCC codebase is at https://github.com/ENCODE-DCC.
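Because the portal exposes its metadata through a REST interface that returns JSON when format=json is requested, pipeline outputs and their metadata can also be retrieved programmatically. The sketch below runs a small experiment search; the parameter and response field names follow the portal's search URL conventions as best recalled here and should be double-checked interactively.

```python
# Sketch of querying the ENCODE Portal's REST interface for experiment metadata
# in JSON. Parameter and field names should be verified against the live portal.
import requests

url = "https://www.encodeproject.org/search/"
params = {"type": "Experiment", "assay_term_name": "ChIP-seq",
          "format": "json", "limit": 5}
response = requests.get(url, params=params)
response.raise_for_status()

for experiment in response.json().get("@graph", []):
    print(experiment.get("accession"), experiment.get("assay_term_name"))
```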
Abstract: The Reference Sequence (RefSeq) project at NCBI provides annotated genomic, transcript and protein sequences for genomes across a broad taxonomic spectrum. The known RefSeq transcripts are derived from sequence data submitted to INSDC and may be subjected to additional curation to provide the most complete and accurate sequence and annotation for a gene. The curated RefSeq dataset is a critical reagent for NCBI’s eukaryotic genome annotation pipeline and is considered a gold standard by many in the scientific community. The RefSeq project has recently also focused on targeted curation of genes, such as those with exceptional biology.
The term recoding is used to describe non-standard decoding of the genetic code, events that are stimulated by signals embedded within the mRNAs of the recoded genes. Several highly conserved recoding events in vertebrates have been described in the literature, such as: ribosomal frameshifting (RF), where the ribosome slips either in a +1 or -1 direction at a specific site during translation to yield a protein product from two overlapping open reading frames; use of UGA (which normally functions as a stop codon) to encode the non-universal amino acid (aa) selenocysteine (Sec); and stop codon readthrough (SCR), where a stop codon is recoded as a standard aa, which results in translation extension beyond the annotated stop codon to an in-frame downstream stop codon, generating a C-terminally extended protein isoform.
The recoded gene products have important roles in human health and disease; hence their correct annotation is vital to preserve functional information. Conventional computational tools cannot distinguish between the dual functionality of the UGA codon or predict RF or SCR, resulting in misannotation of the coding sequence and protein on primary sequence records. Manual curation is thus essential, so our goal is to provide an accurately curated and annotated RefSeq data set of the recoded gene products to serve as standards for genome annotation and for biomedical research.
The curation and annotation of antizyme genes, which require +1 RF for antizyme synthesis, was the subject of our recent publication (PMID:26170238). To date, the paternally expressed PEG10 gene is the best characterized gene in mammals that utilizes -1 RF for protein expression. Currently, the RefSeq database includes 242 curated RefSeq records for antizymes, 64 for PEG10, 472 for Sec-containing selenoproteins, and 65 for genes reported to utilize SCR.
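To illustrate why a programmed frameshift defeats conventional ORF-based annotation, the toy Biopython snippet below translates the same coding sequence with and without a +1 slip; the sequence and slippage position are invented and do not correspond to any real recoded gene.

```python
# Toy illustration of how a programmed +1 ribosomal frameshift changes the
# protein product, using Biopython. The sequence and slippage site are invented.
from Bio.Seq import Seq

cds = Seq("ATGGCCAAATTTGGGCCCTAACGTGACTGA")  # toy sequence, not a real gene
shift_site = 12  # 0-based position at which the ribosome slips +1 (skips one base)

standard = cds.translate(to_stop=True)

shifted = cds[:shift_site] + cds[shift_site + 1:]
shifted = shifted[: len(shifted) - len(shifted) % 3]  # trim any partial codon
frameshifted = shifted.translate(to_stop=True)

print("in-frame product:     ", standard)       # stops at the annotated stop codon
print("+1 frameshift product:", frameshifted)   # reads through in the shifted frame
```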
Session II: Monday March 27, 2017, 12:00-1:30 PM
Berg Hall, Room A
Text Mining
Berg Hall, Rm A; Monday, March 27, 12-1:30 PM
Abstract: Mechanistic models of biomedical processes are a useful tool for understanding biological phenomena; however, much of the knowledge about these mechanisms can only be found in the literature. In DARPA’s Big Mechanism program, performers are building (semi-)automated systems to read and extract these mechanisms from the literature at scale, with an initial focus on Ras-driven cancers. To understand the efficacy of such systems, we created a small expert-curated “reference set” against which to compare machine reading results. This required defining what kinds of information to curate, what level of completeness to curate to, and whether to curate only new experimental results or include background information. We hypothesized that constraining the curation task to identify key findings rather than all mechanistic findings in a paper was likely to minimize disagreement among curators, as well as to provide a useful baseline for the evaluation of system recall. We defined key mechanistic findings as interactions between proteins/genes/chemicals that were supported by experimental evidence in the paper; each “key” finding had to be mentioned in at least three text or figure legend passages from different sections of the paper.
Three biologists independently curated 10 papers for key mechanistic findings. Half the findings were found independently by all three biologists, while 88% were found by at least two of the three. After discussion, the final reference set consisted of 51 findings.
For evaluation of automated reading, 3 machine systems processed the papers, submitting up to 10 findings per paper. For a test set of 5 papers, the best performing machine system found 73% of the interactions that had been curated independently by all three biologists, and 44% of the interactions identified independently by at least two biologists.
This small experiment shows that even with fairly narrowly defined criteria, human curators annotate different things. This suggests that the standard for measuring recall for automated systems should not be 100% of a reference set created by a single curator, but may be better compared to the overlap achieved by pairs of trained curators.
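Once findings are normalized to comparable identifiers, the overlap comparison described above reduces to simple set arithmetic; a minimal sketch with invented finding IDs:

```python
# Minimal sketch of the overlap comparison described above, using invented
# finding identifiers. In practice each finding is a normalized interaction.
reference_all_three = {"F1", "F2", "F3", "F4"}            # found by all three curators
reference_two_plus = reference_all_three | {"F5", "F6"}   # found by at least two
machine_findings = {"F1", "F2", "F5", "F9"}

def recall(found, reference):
    return len(found & reference) / len(reference)

print("recall vs. unanimous findings:", recall(machine_findings, reference_all_three))
print("recall vs. 2-of-3 findings:   ", recall(machine_findings, reference_two_plus))
```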
---
This work was supported by DARPA Big Mechanism contract W56KGU-15-C-0010. This technical data was produced for the U. S. Government under Basic Contract No. W56KGU-16-C-0010, and is subject to the Rights in Technical Data Noncommercial Items clause at DFARS 252.227-7013 (FEB 2012)
Abstract: Peer-reviewed scientific literature continues to be the prime resource for accessing worldwide scientific knowledge. Machine learning techniques such as clustering, classification, association rules and predictive modeling uncover meaning and relationships in the underlying content. In this presentation, we will discuss the application of machine learning in three areas. 1. Named entity recognition (NER): biological entities are named according to each researcher’s choice, increasing the number of synonyms and textual variants and making information retrieval incomplete. This necessitates a systematic approach to performing efficient searches in the information ecosystem to maximize data retrieval. We describe in detail the various techniques and resources that were exploited in order to address the NER and normalization tasks. 2. Bioactivity prediction: manually curated databases such as ChEMBL are available to researchers as training data to uncover underlying patterns, build models and make predictions. In this talk, we provide an overview of this emerging field of molecular informatics, present the basic concepts of prominent machine learning methods and offer motivation to explore these techniques for their usefulness in computer-assisted drug discovery and design. 3. Disambiguation: last but not least, for any work of literature, a fundamental issue is to identify the individual(s) who wrote it, and conversely, to identify all of the works that belong to a given individual. Seemingly a simple problem, it nevertheless represents a major, unsolved problem for information science. Researchers have proposed numerous models, and we survey the current research approaches.
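For the bioactivity-prediction theme, a common baseline is to featurize compounds as Morgan fingerprints with RDKit and fit a scikit-learn classifier. The sketch below uses toy SMILES strings and labels rather than real ChEMBL records, so it only shows the shape of such a workflow.

```python
# Sketch of a fingerprint-based bioactivity classifier of the kind trained on
# curated resources such as ChEMBL. Compounds and activity labels are toy values.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

smiles = ["CCO", "CCN", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
labels = [0, 0, 1, 1]  # 1 = "active", 0 = "inactive" (toy labels)

def featurize(smi, n_bits=2048):
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array([fp[i] for i in range(n_bits)])

X = np.array([featurize(s) for s in smiles])
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# Predict the activity probability of a new (toy) compound.
print(model.predict_proba(featurize("c1ccccc1N").reshape(1, -1)))
```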
Abstract: In consultation with other model organism databases, FlyBase has formulated a prototype "author reagent table" (ART). Our goal is to facilitate handling of reagent source and identifier information at multiple steps, benefiting researchers, journals, and biological databases. The proposed ART is in the format of a spreadsheet with standardized columns and invariant row labels. It is designed to be used regularly during the course of a research project, recording reagents as they are received and/or used. Lab-wide use of such a common reagent form would facilitate tracking of reagents within the lab. At the point of submission of a manuscript, with a completed ART in hand, provision of reagent data would be very straightforward, particularly to journals using formatted submission systems such as STAR Methods (Marcus, E. et al., 2016; PMID:27565332). Use of reagent identifiers is one of the key requirements of the system, encouraging the use of database and stock center identifiers, RRIDs (Bandrowski, A. et al., 2016; PMID:26589523), and catalog numbers for commercial providers. Wider use of identifiers and recognized symbols would increase the transparency and reproducibility of biological research, while facilitating curation into research databases. For genetic experiments, unambiguous identification of the genes studied could be an additional component of the ART. A secondary goal of this proposed system is to encourage journals to make such data available as downloadable TSVs, spreadsheets or similar formats. The author reagent table could also be incorporated into the evolving use of preprint repositories: an ART could simply be appended to the preprint manuscript. Feedback on this proposal from the larger biocuration community would be most welcome. Addendum: Genetics has now adopted the author reagent table; see their “Preparing Manuscripts for Submission” page.
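To make the idea tangible, the snippet below writes a hypothetical author reagent table as a TSV with pandas; the column names and identifiers are illustrative stand-ins, not the column definitions actually proposed by FlyBase.

```python
# Hypothetical illustration of an author reagent table written as a TSV; the
# actual ART column definitions are FlyBase's and may differ from these.
import pandas as pd

art = pd.DataFrame(
    [
        {"reagent_type": "antibody", "name": "anti-ExampleProtein",
         "source": "Example Vendor", "identifier": "RRID:AB_0000000",
         "catalog_number": "EV-123"},
        {"reagent_type": "fly stock", "name": "example-stock",
         "source": "Bloomington Drosophila Stock Center",
         "identifier": "BDSC:00000", "catalog_number": ""},
    ]
)
art.to_csv("author_reagent_table.tsv", sep="\t", index=False)
print(art.to_string(index=False))
```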
Abstract: Construction of structured knowledge requires technology that links text mining and curation to a knowledge repository. We recently presented the BEL Information Extraction workFlow (BELIEF) as a tool that facilitates the transformation of unstructured information described in the literature into structured knowledge networks. BELIEF automatically captures causal molecular relationships from scientific text and encodes them in BEL statements. BEL (Biological Expression Language) is a computable and human-readable language for representing, integrating, storing, and exchanging biological knowledge in causal and non-causal triples. Recently, we have improved the curation process by extending the biomedical vocabulary and by making the curation dashboard more flexible. Moreover, BELIEF was enhanced with the integration of the OpenBEL API, which allows direct linkage to the OpenBEL platform and enables upload of curated documents into the BEL knowledge base. These technological developments of BELIEF greatly improve the curation process and make the BEL knowledge more manageable. We continually use BELIEF to develop an extensively annotated knowledge base of BEL triples that serve as building blocks for causal biological network models.
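For readers unfamiliar with BEL, the sketch below parses two simplified, illustrative BEL-style causal statements into subject/relation/object triples; the statements are not BELIEF output and omit much of what full BEL supports (annotations, evidence lines, nested statements).

```python
# Sketch parsing illustrative BEL-style causal triples into (subject, relation,
# object). Statements are simplified examples, not BELIEF output.
import re

RELATIONS = ["directlyIncreases", "directlyDecreases", "increases", "decreases"]
PATTERN = re.compile(r"\s(" + "|".join(RELATIONS) + r")\s")

statements = [
    "p(HGNC:CCND1) increases act(p(HGNC:CDK4))",
    "a(CHEBI:aspirin) decreases act(p(HGNC:PTGS2))",
]

for statement in statements:
    subject, relation, obj = PATTERN.split(statement, maxsplit=1)
    print({"subject": subject, "relation": relation, "object": obj})
```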
Abstract: The Biological General Repository for Interaction Datasets (BioGRID) (http://www.thebiogrid.org) is an open source database for protein and genetic interactions, protein post-translational modifications and drug/chemical interactions, all manually curated from the primary biomedical literature. As of January 2017, BioGRID contains over 1,412,000 interactions captured from high throughput data sets and low throughput studies experimentally documented in more than 47,900 publications. Comprehensive curation of the literature is maintained for protein and genetic interactions in the budding yeast S. cerevisiae and the fission yeast S. pombe and protein interactions in the model plant A. thaliana. However, complete curation of human interaction data is currently not feasible due to the vast number of potentially relevant publications on H. sapiens. To address this issue in part, we have taken the approach of themed curation of interactions implicated in central cell biological processes, in particular those implicated in human disease. In order to enrich for publications that contain relevant interaction data, we use state-of-the-art text mining methods, which double the rate and coverage of curation throughput. To date, we have curated themed human interaction data in the ubiquitin-proteasome system (UPS), the autophagy system, the chromatin modification (CM) network, the Fanconi Anemia (FA) pathway and brain cancer. A curation pipeline has also been established to capture chemical/drug interaction data from the literature and other existing resources (see poster by Oughtred et al.). All BioGRID data is archived as monthly releases and is freely available to all users via the website search pages and customizable dataset downloads. These new developments in data curation, along with the intuitive query tools in BioGRID that allow data mining and visualization, should help enable fundamental and applied discovery by the biomedical research community. This work is supported by the following grants: National Institutes of Health [R01OD010929 and R24OD011194 to M.T. and K.D.]; Genome Québec International Recruitment Award and Canada Research Chair in Systems and Synthetic Biology [to M.T.]. Funding for open access charge: National Institutes of Health [R01OD010929].
Abstract: Genomic technologies are important for the study of human diseases. NCBI Gene Expression Omnibus (GEO) is the largest public repository of genomic data (holding 80,913 series and 2,095,407 samples), and it is increasingly used to reanalyze available data with new sets of questions. In 2015, over 600 papers were published that cited re-analysis of data from GEO. Although (re)analyzing GEO data for new hypotheses is proving to be a crucial tool for researchers, GEO stays relatively untapped because of its complexity, the amount of time required, and the need for computational skills. A typical query requires analysis of a large number of phenotypes with their associated datasets and complex metadata. This is challenging because datasets in GEO are submitter-annotated and vary significantly in description, labels, characteristics, and quantity. Just filtering relevant studies by annotation is not enough, as manual effort is needed to choose the right datasets for comparisons (e.g. disease vs. control, mutant vs. wild type, drug-treated vs. untreated/vehicle-treated) when analyzing relationships. It is essential to document comparable batch sets to derive the right conclusions. Being manual, this work demands several weeks of a researcher’s time, in turn compromising accuracy.
At Athena, we have leveraged our rich experience in GEO biocuration together with machine learning to build GEOmAtik, a platform designed to automate the entire process of GEO search, from mining to classification of the relevant GSM datasets. It lets users search, view and curate GEO datasets, which are then classified automatically using our text mining and machine learning algorithms. The user is given capabilities to select, edit and store lists of GSE IDs, GSM IDs, associated metadata and classifications of the GSM IDs to perform further analysis. It empowers the user to perform the entire process in a single pass without manual intervention, saving several weeks of manual tagging work. GEOmAtik is not just a query tool; it also performs functional annotation of the datasets, which is essential for further studies. It offers a tremendous time advantage while extracting large numbers of datasets with high accuracy for meta-analysis. GEOmAtik thus enables speedy gene expression analysis, which in turn may lead to the discovery of new relationships among diseases, drugs and pathways. At Athena, our primary effort is using biocuration and web development approaches to improve the re-use of available public ‘omics’-scale expression studies.
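As a generic starting point for the kind of programmatic GEO search that precedes such classification, the snippet below queries the GEO DataSets index through NCBI E-utilities with Biopython; the query term is illustrative, and GEOmAtik itself is not shown here.

```python
# Generic sketch of programmatic GEO searching via NCBI E-utilities with
# Biopython (not GEOmAtik itself). The query term is illustrative; entry-type
# and other field filters can be appended to the term as needed.
from Bio import Entrez

Entrez.email = "curator@example.org"  # NCBI asks for a contact address

handle = Entrez.esearch(db="gds",  # the GEO DataSets index (series, samples, platforms)
                        term='"breast cancer"',
                        retmax=20)
record = Entrez.read(handle)
handle.close()

print(record["Count"], "matching records; first IDs:", record["IdList"][:5])
```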
Abstract: There exist at least 35 databases listing metabolic pathways or reactions [1]. These databases differ in the organisms and processes represented. It has been shown that, even for the same organism, the overlap between these databases is smaller than 16% [2]. Furthermore, some processes, like the plant response to stress, are still not well integrated in databases, even though a large number of publications are available.
We propose to automatically extract the metabolic and signaling reactions involved in a list of processes given by the user. Our method combines an abstract selection method (MedlineRanker [3]) with text-mining methods (BANNER [4], tmChem [5], TEES [6]), a database of metabolic reactions (BRENDA [7]) and an ortholog detection method (MARIO [8]) to provide a complete and accurate set of compounds, genes and reactions involved in the processes of interest.
We tested the methodology on two well-described pathways in plants and bacteria. We showed that this methodology allows recovery of more than 90% of the genes and compounds described in MetaCyc [9]. Moreover, even for these well-known pathways, it proposes new entities and reactions that are possibly involved in the process and are actually described in the literature.
[1] Metabolomics Society: Databases. Available at: http://metabolomicssociety.org/resources/metabolomics-databases.
[2] Stobbe, M. et al. Critical assessment of human metabolic pathway databases: a stepping stone for future integration. BMC Systems Biology 5, 165 (2011).
[3] Fontaine, J.-F. et al. MedlineRanker: flexible ranking of biomedical literature. Nucleic Acids Research 37, Web Server issue (2009).
[4] Leaman, R. et al. BANNER: an executable survey of advances in biomedical named entity recognition. Biocomputing 2008 (2007). doi:10.1142/9789812776136_0062
[5] Leaman, R. et al. tmChem: a high performance approach for chemical named entity recognition and normalization. Journal of Cheminformatics 7 (2015).
[6] Björne, J. et al. Extracting contextualized complex biological events with rich graph-based feature sets. Computational Intelligence 27, 541–557 (2011).
[7] Schomburg, I. BRENDA, enzyme data and metabolic information. Nucleic Acids Research 30, 47–49 (2002).
[8] Pereira, C. et al. A meta-approach for improving the prediction and the functional annotation of ortholog groups. BMC Genomics 15 (2014).
[9] Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research 38 (2009).
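The overall flow of the pipeline described above can be pictured as a simple chain of steps. In the sketch below, every function is a trivial stand-in for one of the cited tools (MedlineRanker, BANNER/tmChem, TEES, BRENDA, MARIO) rather than an actual API, and the toy return values only show how each step feeds the next.

```python
# Schematic of the pipeline's flow. Every function is a stand-in for a cited
# tool, not an actual API; return values are toy data illustrating the chaining.

def rank_abstracts(keywords):            # abstract selection (MedlineRanker-like)
    return ["abstract about proline accumulation under salt stress"]

def tag_entities(abstracts):             # gene/chemical NER (BANNER/tmChem-like)
    return ["P5CS"], ["proline"]

def extract_events(abstracts, genes, chemicals):   # event extraction (TEES-like)
    return [("P5CS", "produces", "proline")]

def map_to_reactions(events):            # reaction lookup (BRENDA-like)
    return [("glutamate -> proline", "P5CS")]

def transfer_by_orthology(genes):        # ortholog-based transfer (MARIO-like)
    return []

abstracts = rank_abstracts(["salt stress response"])
genes, chemicals = tag_entities(abstracts)
events = extract_events(abstracts, genes, chemicals)
reactions = map_to_reactions(events) + transfer_by_orthology(genes)
print(genes, chemicals, reactions)
```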
Abstract: RegulonDB has for 20 years been a database focused on transcriptional regulation in Escherichia coli K-12, with a high quality of manual curation. This curation has been performed by reading papers on the one hand and by recording the data related to transcription regulation in capture forms on the other. Now we want to combine these two processes in a new integrated system for digital curation, in which the capture forms are filled in by selecting, directly in the papers, the terms we want to capture during the curation of complete articles. The new system will also contain tools to carry out assisted curation, a process that uses specific natural-language-processing filters to extract phrases that contain the information to be curated, possibly replacing the need to read complete papers. Using this type of curation, we have annotated the growth conditions in which the regulatory interactions (RIs) of 31 transcription factors work in E. coli. We have also initiated the curation of RIs in Salmonella by using the assisted annotation process.
The new curation system will also include a user window to allow visualization and navigation of the ontology of growth conditions that affect gene expression; we are building this ontology now. The new system will also contain a platform to execute automatic curation, along with tools to read simplified phrases of the articles contained in RegulonDB, as well as tools to relate phrases with other phrases across several papers. We will evaluate the effects of these different computational capabilities on the quality and efficiency of our biocuration, and we expect that this system will facilitate the curation of transcriptional regulation information for other bacteria.
Abstract: The exponential growth in the volume of biomedical data held in public data repositories has created tremendous opportunity to evaluate novel research hypotheses in silico. But such search and analysis of disparate data presupposes a consistent semantic representation of the metadata that annotate the research data. Semantic grouping of data is the cornerstone of efficient searches and meta-analyses. Existing metadata are either granularly defined as tag-value pairs (e.g., sample organism="homo sapiens") or implicitly found in long textual descriptions (e.g., in a study design overview). Current practice is to manually map metadata strings to ontological terms before any data analysis can begin. But manual semantic annotation is time-consuming and requires domain and ontology expertise, and therefore may not scale with metadata growth.
Under the umbrella of the Center for Enhanced Data Annotation and Retrieval (CEDAR) metadata enrichment effort, we are building the Semantic Annotation Pipeline (SAP), which automates semantic annotation of biomedical data stored in public data repositories. The pipeline has two major segments: 1) Reformat the metadata stored in the data repository to CEDAR JSON-LD format, using templates created in the CEDAR repository, and 2) Add semantic annotations to the CEDAR formatted metadata. We are employing NCBO BioPortal's Annotator to efficiently map metadata text segments to ontology terms. We are also evaluating the use of Apache’s UIMA ConceptMapper for this purpose.
We are using the GEO microarray data repository to build and evaluate SAP. Our initial focus is on annotating a specific set of experiment metadata including experiment design, and sample characteristics such as organism, disease, and treatment. We plan to evaluate the SAP annotations against manually curated data, including GEO datasets found in the GEO repository. We intend to show that SAP can ease the process of semantic annotation of metadata, and the enriched metadata can support efficient search and meta-analyses of biological and biomedical data.
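The NCBO Annotator used in the second segment is available as a REST service; the sketch below sends one short metadata string and prints the matched ontology class IRIs. A personal BioPortal API key is required (placeholder below), and the response field names should be confirmed against the BioPortal documentation.

```python
# Sketch of calling the NCBO BioPortal Annotator REST service on a short
# metadata string. Uses a placeholder API key; verify response field names
# against the BioPortal documentation.
import requests

API_KEY = "YOUR_BIOPORTAL_API_KEY"  # placeholder
text = "whole blood samples from homo sapiens patients with type 2 diabetes"

response = requests.get(
    "https://data.bioontology.org/annotator",
    params={"text": text, "apikey": API_KEY},
)
response.raise_for_status()

for annotation in response.json():
    print(annotation["annotatedClass"]["@id"])
```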
Abstract: The increased availability of biological literature in a digitized form has enabled the potential for large scale and integrative computational access to previously unavailable natural language descriptions. Natural language processing (NLP) tools have been developed to automate the otherwise costly and time-intensive manual annotation effort. The ability to meaningfully and objectively evaluate the performance of such tools depends on the availability of expert-curated Gold Standard (GS) datasets. Here, we describe the first manually created GS dataset of annotated phenotypes for evolutionary biology, and its application in comparing curator and machine generated annotations. We created the GS according to curation guidelines previously published by the Phenoscape project for the formal Entity-Quality annotation of complex evolutionary character data. The GS was developed as a consensus dataset of annotations created jointly by expert curators for 203 characters, and we used it to evaluate the results from a curation experiment comparing annotations made by individual expert curators and a machine (CharaParser+EQ) using NLP algorithms. In the experiment, three curators independently annotated the set of 203 characters randomly chosen from seven publications on extant and extinct vertebrates with an emphasis on skeletal anatomy. In a first, or “Naïve”, round of annotation, curators were not allowed access to sources of knowledge beyond the character description, as a measure of the impact of the curators’ implicit knowledge. In the second, or “Knowledge”, round, curators were allowed to access external sources of knowledge. We used semantic similarity-based metrics to compute inter-curator consistency. We found the consistency of CharaParser+EQ with curators was significantly lower than curator to curator consistency. Unexpectedly, curators’ access to knowledge did not improve inter-curator consistency. Relative to the GS, curator consistency with the GS differed between Knowledge and Naive rounds for some but not all curators, perhaps reflecting the differences in subject matter expertise of the curators. Finally, increasing completeness of the requisite ontologies significantly improved machine consistency relative to human experts and also relative to the GS, suggesting ways to design NLP tools that can best complement and augment human curation.
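As a minimal illustration of the ancestor-based style of semantic-similarity metric used for such inter-curator comparisons (though not the study's exact measure), the sketch below computes a Jaccard similarity over ancestor sets in a tiny, made-up ontology fragment.

```python
# Minimal illustration of ancestor-based (Jaccard) semantic similarity between
# two annotation terms. The tiny ontology fragment below is made up, and this
# is not the specific metric used in the study.

ontology_parents = {          # child -> parents
    "fin": ["appendage"],
    "pectoral fin": ["fin"],
    "pelvic fin": ["fin"],
    "appendage": [],
}

def ancestors(term):
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ontology_parents.get(current, []))
    return seen

def jaccard(term_a, term_b):
    a, b = ancestors(term_a), ancestors(term_b)
    return len(a & b) / len(a | b)

print(jaccard("pectoral fin", "pelvic fin"))  # shared ancestors: fin, appendage
```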
Data Standards and Ontologies
Berg Hall, Rm A; Monday, March 27, 12-1:30 PM
Abstract: The Comparative Toxicogenomics Database (CTD; http://ctdbase.org) is a freely available resource that provides manually curated information on chemical, gene, phenotype, and disease relationships to further our understanding of environmental exposures on human health. Four primary modules are independently curated into CTD, including chemical-gene interactions, chemical-disease and gene-disease associations, chemically-induced phenotype relationships, and environmental exposure data which describe the effects of chemical stressors on human receptors during exposure events and the resulting outcomes. Exposure details, including stressor source, receptor age, sex, smoking status, measured biomarker levels, influencing factors, correlations with disease or phenotypes, and geographic location are curated using exposure ontology (ExO) terms and additional controlled vocabularies. The latter provides a centralized, searchable repository of exposure data that facilitates meta-analyses and informs study design by allowing comparisons among experimental parameters. Our use of controlled vocabularies during manual curation allows seamless integration of data among CTD’s four modules and with external data sets, such as Gene Ontology annotations and pathway information. To date, over 800 unique chemical stressors and 500 disease/phenotype outcomes have been described (from over 1500 articles) in our exposure module, and these data are now linked to over 1.7 million chemical-gene-disease interactions and 80,500 chemical-phenotype interactions in CTD. Analysis tools in CTD reveal direct and inferred relationships among the data, and help generate, interpret and refine hypotheses relating to chemically-influenced diseases. CTD’s centralization of exposure science data, integration with chemical-gene, disease and phenotype modules, and additional analysis tools provide a unique resource to advance our understanding of the molecular mechanisms of action of environmental exposures and their effects on human health.
Abstract: As we annotate more and more gene functions using the vocabularies of complex ontologies, we tend to exacerbate the challenge of knowledge comprehension. How do we see the pattern in a hairball? For example, the C. elegans daf-2 gene, an insulin-like growth factor receptor, has been annotated with 114 phenotype terms selected from the Worm Phenotype controlled vocabulary of 2,400 terms. Furthermore, daf-2 is annotated with 54 Gene Ontology terms, from over 40,000 total. When these annotations are presented as lists and tables, it can be difficult for users to get at the big picture of what a gene does. We have been trying to remedy this problem by representing annotations graphically with inference compilation and data-driven trimming.
We call our graph SObA (Summary of Ontology-based Annotations). We have implemented a SObA graph for WormBase gene phenotype annotations. It collects all phenotypes of the subject gene, directly annotated or inferred; maps them onto the ontology’s graph; trims the graph so that only the most relevant nodes are kept; and sizes nodes by the number of annotations carried by each node. The SObA graph was implemented using the Cytoscape.js JavaScript library, supporting local zooming, highlighting, and other controls that allow a user to peruse the graph with ease, going from an overview to the inspection of a single node. The daf-2 graph can be found here <http://www.wormbase.org/species/c_elegans/gene/WBGene00000898#b--10>.
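A minimal sketch of the propagate-count-trim idea on a made-up ontology fragment is shown below with networkx; it is only meant to convey the approach and is not the WormBase implementation.

```python
# Minimal sketch of the propagate/count/trim idea behind a SObA-style summary
# graph, on a made-up ontology fragment; not the WormBase implementation.
import networkx as nx

# Directed edges point from child term to parent term.
ontology = nx.DiGraph([("short body", "body size variant"),
                       ("long body", "body size variant"),
                       ("body size variant", "phenotype"),
                       ("dauer constitutive", "dauer variant"),
                       ("dauer variant", "phenotype")])

direct_annotations = ["short body", "dauer constitutive", "dauer variant"]

# Propagate each annotation to all of its ancestors and count per node.
counts = {}
for term in direct_annotations:
    # with child->parent edges, nodes reachable from a term are its ancestors
    for node in {term} | nx.descendants(ontology, term):
        counts[node] = counts.get(node, 0) + 1

# Trim: keep only nodes carrying at least one annotation, sized by count.
summary = ontology.subgraph([n for n in ontology if counts.get(n, 0) > 0])
for node in summary.nodes:
    print(node, "annotations:", counts[node])
```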
We believe the SObA graph helps users to better comprehend the biological meaning provided by ontology-based annotations. In addition to worm phenotype, we are expanding SObA to cover other ontologies.
Abstract: Extracting and normalizing phenotype data from Chinese-language literature and electronic medical records is a great barrier for the Precision Medicine Project of China. In order to standardize and integrate Chinese phenotype data, we began the Chinese HPO project, which includes translating HPO into Chinese, constructing an OWL version of the Chinese HPO, and building a Chinese HPO website. A working group, which established the Standard Operating Procedure and guideline at the initial stage, coordinated the translation. The translated content included the names, definitions and comments of terms in the phenotypic abnormality subclass of HPO. First, all the content was translated by professionals in the medical sciences, who were trained before they undertook the tasks. Then, experts in clinical medicine and genetics reviewed and revised the translation results. The translation and review work were supported by a curation system named BioMedCurator, which facilitated the distributed curation work and the collaboration of members of the working group. To date, the translation has been completed and 50% of the terms in HPO have been reviewed. To further the usage of the Chinese phenotype terms, we began to construct the OWL version of the Chinese HPO, in which each Chinese term has the same identifier as the corresponding term in the English HPO. Users can visit http://www.hpochina.org/ for information on the Chinese HPO. The website provides a powerful search engine, which is powered by Apache Lucene. A target phenotype can be retrieved by its identifier, its Chinese or English name, or a related gene. The Chinese HPO can also be downloaded online.
Abstract: Antibodies are common reagents used in biological sciences to selectively target and isolate specific molecules of interest for various downstream applications. While antibodies are undoubtedly powerful tools, they can add complications when they have not been carefully vetted prior to use. Common problems include: lot-to-lot variation in efficacy and quality, cross-reactivity resulting in off-target binding, improper use in unintended applications, and poor implementation due to poor training. Indeed, the idea of establishing a standard validation framework to guard against antibody-related data reproducibility issues is gaining momentum in the community (Uhlen et al., 2016; Baker, 2015), as variations in antibody quality and lack of proper vetting have been implicated as major drivers behind the so-called “reproducibility crisis” facing biological research.
ENCODE (ENCyclopedia Of DNA Elements) is an ongoing NHGRI-funded project aimed at cataloguing functional sequence elements in the genome that may act to regulate activity under different cell type and condition contexts. Over half of the ~8,500 experiments done in the project involve the use of antibodies for identifying protein-DNA (ChIP-seq) and protein-RNA (eCLIP, RIP-seq, and Bind-n-Seq) interactions to delineate candidate regulatory elements. Each of the ~3,200 antibody lots considered for use to date in ENCODE is catalogued by a number of key attributes, including vendor, product number and lot ID, and is also searchable by gene name via the ENCODE Portal (https://www.encodeproject.org/search/?type=AntibodyLot) to retrieve the experimental data produced by the aforementioned antibody-based assays. Moreover, the ENCODE consortium has developed a set of standards (https://www.encodeproject.org/about/experiment-guidelines/#antibody) for the characterization of antibodies to evaluate their sensitivity, specificity and reproducibility. These standards require antibody lots approved for use in ENCODE to be backed by at least two supporting characterizations by different methods (e.g. immunoprecipitation, mass spectrometry, and knockdowns). Each characterization is then assessed by a panel of reviewers in a transparent manner against those standards to determine their eligibility for use in binding assays. Outcomes for each tested antibody lot as well as all data, metadata, documentation and standards are freely available at the ENCODE portal (https://www.encodeproject.org).
Abstract: Biomedical terminologies, i.e. the names of diseases, phenotypes, chemicals and drugs, genes, proteins, cells, human body organs and tissues, etc., are key common denominators for all biomedical resources and make these resources sharable, linkable and interoperable. The Institute of Medical Information (IMI), Chinese Academy of Medical Sciences (CAMS), a national research center for medical information, has always been committed to promoting the standardization of medical terminologies in China.
At the beginning of the 1980s, with the development of Medical Subject Headings (MeSH) and its localization all over the world, we translated and published all MeSH descriptors and printed entries. Moreover, we integrated Chinese Traditional Medicine and Materia Medica Subject Headings, expanded the Chinese entry terms, and mapped each heading to the Chinese Library Classification. This work is called Chinese Medical Subject Headings (CMeSH). Currently, the CMeSH vocabulary and its web-based CMeSH browser are an important look-up aid for quickly locating descriptors of interest, and they are broadly used for indexing and retrieving Chinese biomedical literature.
More recently, in order to organize the exponentially growing volume of biomedical big data, we have sought to build a Chinese Medical Language System (CMLS) through extensive terminology integration, covering not only headings for biomedical literature but also terminologies spanning the entire biomedical domain. In total, CMLS integrates over 1 million Chinese names for some 270,000 concepts from more than 30 families of biomedical vocabularies. In addition, CMLS concepts may also be linked to external resources such as SinoMed and UMLS. Recently, an elaborately designed biocuration platform based on advanced computer technologies has been under development to support parallel editing, peer review, interactive and batch auditing, as well as timely extension and maintenance. In the near future, an open-access system supporting Chinese terminology services will be released, offering users many options for searching and browsing CMLS knowledge sources and providing both web interfaces and Web Services (APIs) to query and retrieve CMLS data.
We do believe that our efforts on the standardization of Chinese biomedical terminologies will not only enable interoperability between Chinese biomedical resources, but also promote innovative development of biomedical information in China.
Abstract: Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It is based exclusively on curated healthy wild-type expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression. Data are then integrated and made comparable between species thanks to the use of dedicated curation and ontology tools.
First, expression data are annotated to the anatomy of each animal species thanks to the use of the Uberon ontology. They are also annotated to life stage ontologies, developed by our team in collaboration with the Uberon developers. This includes a generic structure for life stage ontologies in any animals, and specific ontologies for each species integrated into Bgee, capturing useful information about, e.g., sexual maturity onset.
To capture relations of organ homology between species and make expression patterns comparable, a dedicated annotation format has been developed, similar to the Gene Ontology annotation format. Bgee provides thousands of organ homology annotations, reported to the putative taxon in which an organ likely arose, using the NCBI taxonomy ontology and an ontology of similarity and homology-related concepts (the HOM ontology) that was developed by our lab.
To further address the problem of assessing the confidence in our homology annotations, and more widely the problem of assessing confidence in biomedical annotations in general, the Confidence Information Ontology has been developed following a workshop held at the Biocuration 2012 conference, and is used to label our homology annotations. We also plan to use this ontology to provide more detailed information about the confidence in our expression calls in the Bgee database.
In conclusion, the Bgee team has developed a curation framework that allows information about gene expression to be captured and made comparable between species, including dedicated ontologies and annotation formats.
Bgee is available at http://bgee.org/
Uberon is available at http://uberon.org/
CIO is available at https://github.com/BgeeDB/confidence-information-ontology
HOM is available at https://github.com/BgeeDB/anatomical-similarity-annotations/
Abstract: With the proliferation of molecular tools designed for the targeted disruption of genes, characterization of the phenotypes produced by such disruptions is a key step in identifying suitable models for human diseases. At Xenbase, the Xenopus model organism database, we have begun to curate phenotypes from the published literature to aid in the discovery of suitable Xenopus models of human diseases. Our initial approach focused on phenotypes associated with human diseases that were generated either by morpholino knockdown, targeted mutation or overexpression. Here we describe a new configuration of Phenote, a Berkeley Bioinformatics Open Source Projects' software package, for the curation of Xenopus phenotypes, including the capture of multiple experimental manipulations and the use of several ontologies and controlled vocabularies to standardize curations. This approach allows the capture of phenotypes covering anatomy, gene expression and biological processes from the Gene Ontology (GO). We describe the specific post-composed Entity-Quality (EQ) approach and ontologies used to capture the curated phenotypes. We also outline modifications to specific ontologies, such as the Phenotypic Quality Ontology (PATO), to tailor them to our needs. The power of our approach is that the captured phenotypes can be used to: 1) generate multiple inferred phenotype statements through logical assumptions from the curated phenotypes, 2) relate gene expression phenotypes to wild type expression curations, and 3) extract information for the construction of Gene Regulatory Network models. Finally, we discuss the benefit of using graph paths to enhance the efficacy of phenotype queries, as well as automatically linking individual phenotype statements to human and mammalian phenotypes.
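For illustration, a post-composed EQ phenotype statement of the kind captured in Phenote can be thought of as a small structured record; in the sketch below, term labels stand in for ontology identifiers and the layout is a simplification rather than Phenote's actual record format.

```python
# Illustrative post-composed Entity-Quality (EQ) phenotype statement as a plain
# data structure; term labels stand in for ontology identifiers, and this is a
# simplification, not Phenote's actual record format.
eq_statement = {
    "experimental_manipulation": "morpholino knockdown of example gene",  # placeholder
    "entity": {"ontology": "Xenopus anatomy ontology", "label": "heart"},
    "quality": {"ontology": "PATO", "label": "decreased size"},
    "stage": {"ontology": "Xenopus developmental stage ontology", "label": "tailbud stage"},
    "evidence": "figure panel cited in the source paper",
}

def render(eq):
    return f"{eq['entity']['label']} ({eq['quality']['label']}) at {eq['stage']['label']}"

print(render(eq_statement))
```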
Abstract: Development of Avian Anatomy Ontology Annotation
Jinhui Zhang and Fiona McCarthy
School of Animal and Comparative Biomedical Sciences, University of Arizona, USA
One of the fastest growing types of data is gene expression data, and to support the use and re-use of this data in avian model species, we are developing an avian anatomy ontology. An avian subset of the Uberon ontology provided 15,300 terms and gave us the foundational structure of this expanded ontology. We have been working to add additional avian specific terms, adding/amending 100 terms to date. These anatomical terms mostly describe adult anatomical structures. We are now in the process of adding developmental terms, starting by mapping 150 anatomical terms used by the GEISHA database, a resource that curates chicken in situ expression data. By linking the GEISHA anatomical vocabulary with avian anatomy ontology terms, we will be able to annotate chicken expression data in a way that promotes searching and comparison between organisms commonly used for developmental biology. Since we expect to curate chicken and zebra finch expression data from a variety of sources, we are also developing standards for annotating expression data. This includes capturing information about gene products expressed, the source of this information, type of evidence (using the Evidence & Conclusion Ontology) and use of annotation extensions to capture additional information such as cellular or subcellular location, molecular function, and process involvement. We expect to combine curation of avian expression data with our ongoing biocuration efforts to provide Gene Ontology and molecular interaction data to better support functional modeling in avian species.
Abstract: With the steady increase in the use of large-scale genomic datasets, a challenge for the scientific community has been to process and store data in a widely accessible, user-friendly manner. Data Coordination Centers (DCCs) are valuable interfaces between data producers and the larger research public. DCCs collaborate with data producers to ensure that all experiments are represented accurately and have sufficient information so that outside users can integrate, repeat, or reuse the individual data elements in novel ways. To meet these goals, extensive metadata regarding the biological methods, sequencing methods, computational tools, and analysis inputs need to be documented such that data results and methods are transparent and consistent. With the vast amount of detail being collected, errors are an expected part of the process. The ENCyclopedia of DNA Elements DCC employs five main methods to check for such errors: schema validation, auditing scripts, unique file identification, file format validation, and file content verification. Schemas define the structure and organization of the metadata to be collected, and property values can be constrained, for example, by using dependencies, enums and regular expressions. Custom cross-object “audit” checks are automatically run once new metadata are submitted or existing metadata change. The auditing system is designed to flag inconsistencies that are difficult to enforce by schema validation, such as missing experimental variables (e.g. no antibody specified in a ChIP-seq experiment). The first two methods are used primarily in metadata validation. For validating the data files themselves, we employ three main methods. We use multiple methods of unique file content identification to ensure that there is no duplication of files or of file content. We both restrict and verify the format of each file after it is submitted, but before it is fully uploaded to our system. Finally, we work to evaluate file content against the metadata provided whenever possible, including confirmation of read length and reasonable mapping rates against the appropriate organism’s reference file. The effort in verifying all of these details is repaid with ease of use for integrative analysis.
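As a rough illustration of the first two checks described above (schema constraints plus a cross-object audit), the following sketch uses the Python jsonschema package with invented example fields; it is not the ENCODE DCC's actual schema or audit code.

```python
# Minimal sketch (not the ENCODE DCC schemas): constrain metadata with enums
# and regular expressions, then add a cross-field "audit" check that plain
# schema validation cannot easily express.
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "properties": {
        "assay": {"enum": ["ChIP-seq", "RNA-seq", "ATAC-seq"]},   # enum constraint
        "accession": {"pattern": "^ENCSR[0-9A-Z]{6}$"},           # regex constraint
        "antibody": {"type": "string"},
    },
    "required": ["assay", "accession"],
}

def audit(metadata: dict) -> list:
    """Cross-object check: a ChIP-seq experiment must name an antibody."""
    errors = []
    if metadata.get("assay") == "ChIP-seq" and not metadata.get("antibody"):
        errors.append("missing antibody for ChIP-seq experiment")
    return errors

metadata = {"assay": "ChIP-seq", "accession": "ENCSR000AAA"}   # toy record

validator = Draft7Validator(schema)
for err in validator.iter_errors(metadata):
    print("schema:", err.message)
for msg in audit(metadata):
    print("audit:", msg)
```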
Abstract: The number of biological ontologies has grown as they have become an increasingly popular tool for supporting biocuration methodologies. To achieve semantic interoperability and better presentation of data, methods for reusing ontology terms must be applied in the development of new and existing ontologies to create a web of linked data. The Evidence & Conclusion Ontology (ECO) is a structured, controlled vocabulary for representing evidence in biological research, including author statements found in the literature, biocurator inferences, and evidence generated by experimental techniques or computational analyses. ECO is broadly used by model organism databases, ‘omics resources, and ontology projects (e.g Gene Ontology) for summarizing evidence. A related but orthogonal ontology, the Ontology for Biomedical Investigations (OBI), provides terms for describing investigations in detail. While both ontologies address biological investigations, OBI classes are in general more specific than ECO classes and represent specific instruments, reagents, processes, and so on, whereas ECO classes describe evidence at a summary level. Mapping classes between two ontologies when their semantic meanings are equivalent is one proven way to reuse ontologies. Thus, ECO terms are being logically defined following specific design patterns that leverage OBI terms to describe, for example, technique used, variable type, intervention, evaluant, and experiment design. Logically defining ECO terms using OBI classes (and creating one-to-one or one-to-many mappings between the ontologies) is helping ECO to disambiguate unclear definitions, increase expressivity, and achieve better logical organization. Here we describe recent advances in harmonizing the ontologies and integrating logical definitions into ECO. This work is supported by the National Science Foundation under Award Number 1458400.
Curation Standards and Best Practice, Challenges in Biocuration, Biocuration Tutorial
Berg Hall, Rm A; Monday, March 27, 12-1:30 PM
Abstract: The Protein Data Bank (PDB) is the single global repository for three-dimensional structures of biological macromolecules and their complexes. Over the past decade, the size and complexity of macromolecules and their complexes with small molecules deposited to the PDB have increased significantly. The PDB archive now holds more than 125,000 experimentally determined structures of biological macromolecules, which are all publicly accessible without restriction. These structures provide essential information to a large, diverse user community worldwide. Annual data file downloads from the PDB archive exceed 500 million, and more than 1 million unique IP addresses access the archive every year.
Expert curation of data coming into the PDB is critical for ensuring findability, accessibility, interoperability, and reusability (FAIR). Biocurators enforce data standardization, help to maximize data quality, provide value-added annotation, and maintain uniform data representation.
An overview of PDB archive biocuration processes for polymer sequence, taxonomy, ligand chemistry, and value-added annotation will be presented.
Abstract: With the exponentially increasing volume and complexity of biomedical data, and our growing ability to use, or misuse, those data, the need to train researchers in biocuration is key. As part of the National Institutes of Health Big Data to Knowledge (BD2K) initiative, a research team at Oregon Health & Science University is developing a set of Skills Courses and Open Educational Resources (OERs) that includes training researchers to make data structured, discoverable and reusable. At OHSU, basic science and medical students are not formally trained in data management or biocuration, so these optional trainings are valuable for students to learn these important skills. To date, we have offered five in-person Skills Courses; the initial course functioned as a testing ground for creating the OERs. One Skills Course, titled ‘Data After Dark’, was offered over two evenings (for a total of 8 hours) and was sponsored by a micro-grant from the Biocuration Society. The Skills Courses are offered to learners at different levels: some courses target beginner/novice students and focus on basic research data management and biocuration, while others target more advanced students and focus on interactive visualization, scripting and analysis skills. In addition to offering the in-person Skills Courses, we created 20 OER modules that are available online for use by both learners and educators. The materials include slide decks, video tutorials, exercises, and recommended readings. Some of the modules especially relevant to biocuration are Data Annotation and Curation (BDK12), Ontologies 101 (BDK14), Team Science (BDK07), Basic Research Data Standards (BDK05), and Guidelines For Reporting, Publications, And Data Sharing (BDK22). The OERs and Skills Course materials are intended to be flexible and customizable, and we encourage others to use or repurpose these materials for training, workshops, and professional development, or for dissemination to instructors in various fields. The OERs and other materials are available at www.dmice.ohsu.edu/bd2k; we welcome comments, requests, and contributions.
Abstract: There is agreement in the biomedical research community that data sharing is key to making science more transparent, reproducible, and reusable. Publishers could play an important role in facilitating data sharing; however, many journals have not yet implemented data sharing policies and the requirements vary across journals. To assess the pervasiveness and quality of data sharing policies, we curated the author instructions of 318 biomedical journals. The policies were reviewed and coded according to a rubric. We determined whether data sharing was required, recommended, or not addressed at all. The data sharing method and licensing recommendations were examined, as well as any mention of reproducibility. The data were analyzed for patterns relating to publishing volume, Journal Impact Factor, and publishing model (Open Access or subscription). 11.9% of the journals stated that data sharing was required as a condition of publication; 9.1% required data sharing but did not make clear whether it would affect publication decisions; 23.3% only encouraged authors to share data; 9.1% mentioned data sharing indirectly; and 14.8% addressed only protein, proteomic and/or genomic data sharing. There was no mention of data sharing in 31.8% of the journals. Impact Factors were significantly higher for journals with the strongest data sharing policies than for all other policy categories. Open Access journals were not more likely to require data sharing. Our study showed that only a minority of biomedical journals require data sharing, and that there is a significant association between higher Impact Factors and journals with a data sharing requirement. We found that most data sharing policies did not provide specific guidance on the practices that ensure data are maximally available and reusable. As a continuation of this work, we plan to build a curated public database of journal data sharing policies, and to convene a community of stakeholders to further develop recommendations for strengthening and communicating journal data sharing policies.
Abstract: Information captured about biomedical experiments, or data about the data, is generally referred to as metadata. This information may include details such as the sample name, the location within the body the sample is from, and how it was prepared for sequencing. This information is often captured in plain-text tab-separated files to relate metadata to data files. Tab-separated columns are very easy for a computer to read, but are not particularly human readable. An ideal format would be readable for computers, computer scientists and biomedical scientists. We present the Tag Storm format, the method we use for representing metadata at the UCSC Genome Browser. A Tag Storm contains tag names followed by whitespace, with the rest of the line assigned as the tag value. Tag names can contain alphanumeric characters or underscores, but cannot begin with a number. These tag names and tag values are organized into stanzas, delimited by blank lines. A stanza can be indented to indicate a hierarchy, where child stanzas can inherit or override the parent values. The ability to inherit data greatly reduces the redundancy found in tab-separated metadata. Tag Storms are flexible, allowing one to extend metadata by simply adding new stanzas at the bottom of a hierarchy, for example when performing replicate experiments. We have developed several open source tools for working with Tag Storm files, such as the ability to convert tab-separated files to Tag Storm format, or to perform SQL-like queries on a Tag Storm file. There are also C APIs for working with Tag Storm files, allowing scientists to write their own software. Tag Storms are simple to write, easily read by a human or a computer, and allow a visual representation of data.
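To make the stanza, indentation, and inheritance conventions concrete, the following minimal sketch (not the UCSC C implementation) parses a toy two-level Tag Storm text and resolves inheritance.

```python
# Minimal sketch, not the UCSC tools: parse a toy Tag Storm text into one
# dict per stanza, with child stanzas inheriting (and overriding) the tag
# values of their less-indented parents.

EXAMPLE = """\
lab example_lab
assay RNA-seq

    sample liver
    replicate 1

    sample brain
    assay ATAC-seq
"""

def parse_tag_storm(text):
    """Return a list of stanzas (dicts) with inherited tag values resolved."""
    # Split into stanzas on blank lines.
    stanzas, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            stanzas.append(current)
            current = []
    if current:
        stanzas.append(current)

    # Remember the most recent resolved stanza at each indentation depth so
    # that deeper stanzas can inherit from it.
    resolved, parents = [], {}
    for stanza in stanzas:
        indent = len(stanza[0]) - len(stanza[0].lstrip())
        # Tag name is the first word; the rest of the line is the value.
        tags = dict(line.strip().split(None, 1) for line in stanza)
        merged = {}
        for depth in sorted(parents):
            if depth < indent:
                merged.update(parents[depth])
        merged.update(tags)            # child values override parent values
        parents[indent] = merged
        resolved.append(merged)
    return resolved

for stanza in parse_tag_storm(EXAMPLE):
    print(stanza)
```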
Abstract: At Biocuration 2016 in Geneva, Switzerland, we organised a workshop entitled Training needs for biocuration. This was prompted by the fact that there is currently no recognised qualification in biocuration, either to provide new curators with a route for gaining the requisite skills to work within this field, or as a way for curators already working in the field to gain recognition for the varied skill-set such a career provides. The workshop was intended to stimulate discussion on the training biocurators should have, to define training needs and gaps, and to outline a set of core competencies. We will present the highly interactive and productive workshop discussions, the potential role of the ISB in driving this forward, and the outcomes that participants recommended for this critical aspect of a curator’s career.
Abstract: Wikidata is a data repository and access protocol in the public domain, open to the web and ready for use by both humans and machines. While the scientific community increasingly uses this valuable infrastructure to distribute findings to a larger audience, Wikidata is really a jack of all trades.
However, as the figure of speech continues, a “jack of all trades” is also a “master of none.” As a truly open data infrastructure, community issues such as disagreement, bias, human error, and vandalism manifest themselves on Wikidata.
From a curator's perspective, it can be challenging at times to filter through the different Wikidata views while maintaining one's own definitions and standards. Whether these challenges stem from benign differences in opinion or from more malicious forms of vandalism and the introduction of low-quality evidence, public databases face extra challenges in providing data quality in the public domain.
Here we propose the use of W3C Shape Expressions (ShEx; https://shexspec.github.io/spec/) as a toolkit to model, validate and filter the interactions between designated public resources and Wikidata. ShEx is a language for expressing constraints on RDF graphs, and Wikidata is available as an RDF graph. ShEx can be used to validate documents, communicate expected graph patterns, and generate user interfaces and interface code.
It will also allow us to: (1) exchange and understand each other’s models, (2) express a shared model of our footprint in Wikidata, (3) agilely develop and test that model against sample data and evolve it, and (4) efficiently catch disagreements, inconsistencies or errors at input time or in batch inspections.
Shape Expressions has already performed this function in the development of FHIR/RDF (https://www.hl7.org/fhir/rdf.html). This expressive language was sufficient to capture the constraints in FHIR, and its intuitive syntax helped people quickly grasp the range of conformant documents. The FHIR publication workflow tests all of its examples against the ShEx schemas, catching errors and inconsistencies before they reach the public.
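ShEx shapes are written in ShEx's own schema language; purely to illustrate the kind of graph-pattern constraint involved, the sketch below uses rdflib and a SPARQL query on a toy RDF graph. The namespace and properties are invented, and this is a stand-in for, not an example of, ShEx validation.

```python
# Not ShEx itself: a minimal stand-in that checks a graph-pattern constraint
# with rdflib and SPARQL, to illustrate the kind of rule a ShEx shape would
# express declaratively (the ex: namespace and properties are invented).
from rdflib import Graph

TTL = """
@prefix ex: <http://example.org/> .
ex:gene1 a ex:Gene ; ex:symbol "BRCA1" .
ex:gene2 a ex:Gene .   # violates the constraint: no symbol
"""

# Constraint: every ex:Gene must have an ex:symbol.
QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?gene WHERE {
  ?gene a ex:Gene .
  FILTER NOT EXISTS { ?gene ex:symbol ?s }
}
"""

g = Graph()
g.parse(data=TTL, format="turtle")
for (gene,) in g.query(QUERY):
    print("constraint violation: no symbol for", gene)
```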
Abstract: The Asian citrus psyllid (Diaphorina citri Kuwayama) is the insect vector of the bacterium Candidatus Liberibacter asiaticus (CLas), the pathogen associated with citrus Huanglongbing (HLB, citrus greening). HLB threatens citrus production worldwide. Suppression or reduction of the insect vector using chemical insecticides has been the primary method to inhibit the spread of citrus greening disease. Accurate structural and functional annotation of the Asian citrus psyllid genome, as well as a clear understanding of the interactions between the insect and CLas, are required for development of new molecular-based HLB control methods. A draft assembly of the D. citri genome has been generated and annotated with automated pipelines. However, knowledge transfer from well-curated reference genomes such as that of Drosophila melanogaster to newly sequenced ones is challenging due to the complexity and diversity of insect genomes. To identify and improve gene models as potential targets for pest control, we manually curated several gene families with a focus on genes that have key functional roles in D. citri biology and CLas interactions. This community effort produced 530 manually curated gene models across developmental, physiological, RNAi regulatory, and immunity-related pathways. As previously shown in the pea aphid, RNAi machinery genes putatively involved in the microRNA pathway have been specifically duplicated. A comprehensive transcriptome enabled us to identify a number of gene families that are either missing or misassembled in the draft genome. In order to develop biocuration as a training experience, we included undergraduate and graduate students from multiple institutions, as well as experienced annotators from the insect genomics research community. The resulting gene set (OGS v1.0) combines both automatically predicted and manually curated gene models. All data are available on https://citrusgreening.org/.
Abstract: Data-driven science means generating biological data at unprecedented rates. Yet much effort is still needed to make big data comprehensively organized and publicly accessible to the scientific community worldwide. Knowledge derives from the interpretation of biological information, which increasingly (and necessarily) relies on the ability to integrate datasets and to store, access and explore databases, including the actual integration of larger datasets and of information from the scientific literature. This is only possible with proper annotation/curation of data and datasets, so that the biological information is properly represented and can be translated into actual biological knowledge of the systems and processes studied.
Data-driven science is highly dependent on curated databases and bioinformatics tools. Since this is an emerging field, training practices are not yet well established. Biocuration and bioinformatics could benefit from improved standards all around: best practice for tools and for the handling of data, as well as for training in these topics, including training materials. Easily adoptable and updatable tools for creating and sharing training materials would be an advantage for the community of trainers and users.
VISION
To become a community and platform for the development of standards and best practices in bioinformatics and biocuration, with emphasis both on these as topics within learning, education and training, and on standards in these fields per se.
MISSION
This committee aims to:
(1) raise general awareness of biocuration and bioinformatics and of best practice in metadata/annotation;
(2) promote the sharing of training elements (courses, materials, processes, tools) among GOBLET members and countries;
(3) enhance the significance of biocuration and open science by using training as a vehicle for awareness and best practice across the data life-cycle.
Current committee members (from November 2016):
- Vicky Schneider, EMBL-ABR Deputy Director, Australia, and University of Melbourne (chair)
- Manuel Corpas, Repositive, Scientific Lead, England
- Patricia Palagi, SIB, Head of Training, Switzerland
- Judit Kumuthini, CPGR, Human Capacity Development Manager, South Africa
- Sarah Morgan, EMBL-EBI, Training Programme Manager, England
- Michael Charleston, UTAS, Associate Professor in Bioinformatics, Australia
- Pascale Gaudet, SIB, Scientific Manager neXtProt, Switzerland
Abstract: Wikidata, a project of the Wikimedia Foundation, is an openly editable, semantic web-compatible framework for knowledge management. Wikidata has a large and active community that contributes to, maintains, and improves the quality of the data in Wikidata as well as deciding how the data itself should be represented. Our team has been populating Wikidata with a foundational semantic network linking genes, proteins, drugs, and diseases. Upon this foundation, we hope to stimulate the growth of this knowledge graph that can be used to build new knowledge-based applications that drive new discoveries.
A cornerstone of the Wikidata knowledge graph is its built-in model for tracking the evidence underlying claims. For any claim in the graph (e.g. gene A regulates gene B), it is possible to provide evidence supporting or refuting that claim. The manner in which these evidence statements can be constructed is left open for the community to decide. Therefore, the patterns for representing the semantics of the associated evidence and the provenance trails linking back to the original sources of information must be defined and consistently used, such that this information is easily accessible by the end user or by software that exposes this information to end users.
We defined specific guidelines for referencing the source and determination method of scientific claims in Wikidata. As an example, we represented the evidence underlying Gene Ontology-based assertions of protein function. Our Wikidata model captures the original source of the claim, the original curator, journal article(s) supporting the claim, and the determination method used to establish the claim. This model is used in a domain-specific web application built on top of Wikidata for exploring and contributing to microbial genome annotations, WikiGenomes, by enforcing this model on new claims created by users. These guidelines serve to standardize our team's efforts and to serve as a model for other groups importing data into Wikidata.
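As an illustration of the kind of provenance such a referenced claim carries, the sketch below uses descriptive placeholder field names; it is not the Wikidata JSON data model or the authors' exact guidelines.

```python
# Illustrative only: the kind of provenance a referenced claim carries under
# guidelines like those described above. Field names are descriptive
# placeholders, not the Wikidata JSON data model or the authors' schema.

claim = {
    "subject": "protein X",                       # hypothetical item
    "property": "molecular function",
    "value": "ATP binding",                       # Gene Ontology term label
    "references": [
        {
            "stated_in": "UniProtKB",             # original source of the claim
            "curator": "UniProt curation team",   # original curator
            "supporting_article": "PMID:0000000", # placeholder publication
            "determination_method": "direct assay evidence",  # e.g. an ECO term
            "retrieved": "2017-03-26",
        }
    ],
}

print(claim["subject"], "->", claim["value"],
      "supported by", claim["references"][0]["stated_in"])
```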
Abstract: The amount of data being produced by small grains research has grown rapidly in recent years, resulting in an expansion of our understanding of small grains. For example, the releases of high-quality genome assemblies and mapping studies for small grains germplasm have opened up new possibilities for research and discovery. The abundance of information has also created new challenges for curation. The role of biocurators is, in part, to filter research outcomes as they are generated, not only so that information is formatted and consolidated into locations that can provide long-term data sustainability, but also to ensure that the data is reliable, reusable, and accessible. At GrainGenes, a hard-funded central location for small grains data, curators have implemented a workflow for locating, parsing, and uploading new data so that the most important, peer-reviewed research is available to users as quickly as possible, with optimal data quality and rich connections to past research outcomes. This poster will describe the workflow used by GrainGenes curators, and also highlight new datasets and features that have been added to GrainGenes in recent months.
Abstract: Researchers are increasingly using model organisms to investigate potential treatments for human disease. Disease models can be created via genetic lesions and/or chemical treatments and the resulting phenotypes can be modified by additional lesions or treatments. These synergistic interactions can ameliorate or exacerbate phenotypes, illuminating genetic pathways and suggesting possible treatments for human disorders.
To facilitate the distribution of this information, ZFIN has added support for more detailed phenotype annotations that include ameliorated and exacerbated tags. In addition, we have added the ability to utilize ChEBI terms, like cholesterol (CHEBI:16113), in phenotype annotations to effectively report the effects of treatments or genetic lesions on biomarkers.
Here we report the use of these new phenotype tags and molecular entities in the curation interface and how this information is displayed on ZFIN web interfaces.
Abstract: Drug repurposing, the repositioning of existing therapeutics for new applications, is compelling as it offers a less expensive and more rapid clinical impact than the development of new untested compounds. However, large-scale opportunities have been limited partially by the lack of a comprehensive library of existing clinical compounds for use in repurposing experiments. To address this problem, we have assembled and annotated a collection of ~5,000 compounds, containing over 3,000 drugs, that are approved or have reached clinical trials.
To enhance the utility of this library, we annotated the compounds for several properties. One major challenge when assembling compound information was inconsistencies and contradictions across multiple resources. Therefore, we manually curated drug mechanisms of action, protein target(s), clinical development status, drug indication, and disease area. To ensure our own use of consistent terminology, we developed standardized vocabularies to describe mechanisms of action, drug indications, and disease areas; protein targets were mapped to the official HUGO gene symbol identifiers.
This information is currently available through an on-line resource, the Drug Repurposing Hub (www.broadinstitute.org/repurposing), and is searchable by the above-mentioned annotations. Additionally, a JSON API allows for programmatic access to the dataset.
A portion of this library has been assayed as a part of a project for the LINCS Center for Transcriptomics. Based on the results of these analyses in combination with our annotations, we have been able to assign ~1,500 compounds to pharmacological classes (PCLs). We have applied an algorithmic approach to identify a subset of compounds that show unexpected connectivity to more than one PCL, suggesting novel mechanisms of action. These additional PCL assignments may lead us to repurposing hypotheses for existing clinical compounds. We have created apps on clue.io to visualize these connections and classes.
Curation for Precision Medicine
Berg Hall, Rm A; Monday, March 27, 12-1:30 PM
Abstract: We are at the dawn of a new era of personalized genomic medicine where advances in human healthcare will be powered by the integration of data from many sources, including structured electronic patient records and data linking genomic variants to computable descriptions of functional and clinical impact. Here we describe work performed in UniProt/Swiss-Prot that aims to standardize the curation and provision of variant data using a range of ontologies including VariO, GO, and ChEBI. Our focus on variants with functional impact demonstrated using biochemical assays makes UniProtKB/Swiss-Prot variant data highly complementary to that from resources which use genetic data (such as pedigree analyses or GWAS studies) to link variants to specific diseases, phenotypes, or traits. UniProtKB/Swiss-Prot currently provides more than 8,000 variants with curated functional impact.
Abstract: Personalized medicine in oncology relies on the use of treatments targeting specific genetic variants. However, identifying the variants of interest (e.g. those for which a treatment exists or could be tested) is a laborious task: a patient usually presents several thousand genetic variants. We propose here a system to automatically rank the genetic variants of a given patient based on their occurrences in the biomedical literature (i.e. PubMed Central). Our system receives as input a patient with a diagnosis and a set of text files containing all identified mutations. While copy number variants (CNV) are also available, only single nucleotide variants (SNV) are currently taken into account. First, a set of triplets is generated, each consisting of a diagnosis (e.g. squamous cell carcinoma), a gene (ERBB2) and a mutation (p.Ala867Thr). Each triplet is submitted to our search engine and assigned a score. The score is based on a set of three queries, each weighted differently: 1) diagnosis, gene and mutation; 2) gene and mutation; and 3) diagnosis and gene. Due to data scarcity, our system is currently tuned on a set of two patients, corresponding to 108 triplets. The mean average precision (MAP) of our system reaches 89.24%, with a precision at rank 5 (P5) of 50%. This means that half of the mutations ranked in the top 5 positions by our system were reported as relevant in our benchmark. However, it should be noted that the other half of the top-ranked mutations might also be of interest: the benchmark we are using is based on manually generated reports and has limited recall. We can thus assume that the real precision of the system is higher. To conclude, we propose here a system to facilitate the analysis of genetic variants in order to speed up the process of proposing targeted treatments in oncology. While we are currently focusing our efforts on this first step (i.e. the ranking of genetic variants), future steps include automated processing of the literature to extract potential chemotherapies.
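A minimal sketch of how the three weighted queries might be combined into a per-triplet score is shown below; the weights, the use of raw hit counts, and the toy inputs are assumptions, since the abstract does not specify them.

```python
# Sketch only: combine the three literature queries described above into a
# single score per (diagnosis, gene, mutation) triplet. The weights and the
# use of raw hit counts are assumptions; the abstract does not specify them.

def triplet_score(hits, weights=(3.0, 2.0, 1.0)):
    """hits = (diagnosis+gene+mutation, gene+mutation, diagnosis+gene) hit counts."""
    return sum(w * n for w, n in zip(weights, hits))

# Toy hit counts for two of a patient's mutations.
triplets = {
    ("squamous cell carcinoma", "ERBB2", "p.Ala867Thr"): (4, 12, 57),
    ("squamous cell carcinoma", "TP53", "p.Arg175His"): (2, 30, 80),
}

# Rank the mutations by descending score.
ranked = sorted(triplets, key=lambda t: triplet_score(triplets[t]), reverse=True)
for rank, t in enumerate(ranked, start=1):
    print(rank, t, triplet_score(triplets[t]))
```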
Abstract: Human Genome Variation Society (HGVS) nomenclature is a set of recommendations for describing biological sequence variants. Evolution of the nomenclature may lead to discrepancies in its application across different systems. Tools that manipulate HGVS-formatted variants may differ in the completeness or interpretation of the recommendations, potentially resulting in variants that are incorrectly formatted. This has the potential to lead to misinterpretation of clinically significant variants. Currently, there is no way to easily assess the accuracy of tools that either manipulate the nomenclature of, or predict the consequences of, genomic variants. The goal of this project is to develop an evaluation framework which will use test cases to assess the capabilities and measure the accuracy of HGVS tools. This framework, hgvs-eval, will be made up of three components: 1) a command-line script that runs a set of verified test cases through each tool; 2) a REST API which allows standardized and language-agnostic access to the framework; and 3) a website which will display up-to-date test results. Test cases will assess features of HGVS-formatting tools such as parsing a variant, validating that the variant is correctly represented in HGVS nomenclature, normalizing a variant by 3’ shifting, and projecting a variant from one sequence type to another. An evaluation of select tools will be performed and results will be displayed via a website to allow for community-wide utilization and continual assessments and updates of results. This project was part of the GA4GH 2016 hackathon. Source code for this project is available at https://github.com/biocommons/hgvs-eval.
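As a hypothetical illustration of the framework's first component, the sketch below shows one verified test case and a tool-agnostic runner; the field names and the dummy adapter are invented and are not taken from the hgvs-eval code.

```python
# Hypothetical sketch, not the hgvs-eval code: a verified test case plus a
# tool-agnostic runner. Each tool under evaluation is wrapped in an adapter
# function that takes the operation and input variant and returns a string.

TEST_CASES = [
    {
        "id": "parse-001",
        "operation": "parse",
        "input": "NM_000000.0:c.76A>T",     # placeholder transcript accession
        "expected": "NM_000000.0:c.76A>T",  # a round trip should be lossless
    },
]

def run(tool_name, adapter, cases):
    """Run every test case through one tool adapter and report pass/fail."""
    for case in cases:
        try:
            got = adapter(case["operation"], case["input"])
            status = "PASS" if got == case["expected"] else "FAIL"
        except Exception as exc:            # a tool may reject a variant outright
            got, status = repr(exc), "ERROR"
        print(f"{tool_name}\t{case['id']}\t{status}\t{got}")

# Example adapter: an identity "parser" standing in for a real HGVS tool.
def dummy_adapter(operation, variant):
    if operation == "parse":
        return variant
    raise NotImplementedError(operation)

run("dummy-tool", dummy_adapter, TEST_CASES)
```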
Abstract: Each year, many publications appear in the scientific literature that include computational models in biology, and there is a growing need for databases of these models. One such effort is the BioModels Database, launched by the European Bioinformatics Institute, a repository of computational models of biological processes drawn from the “peer-reviewed scientific literature and generated automatically from pathway resources”. In this database, “models described from literature are manually curated and enriched with cross-references from external data sources”. [1]
According to the National Institutes of Health (NIH), precision medicine is "an emerging approach for disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle for each person." [2] In contrast to a “one-size-fits-all” approach, patients receive a personalized therapy based on their biomarkers, and this innovative approach necessitates the mathematical optimization of their treatment planning.
In this work, we discuss the potential of databases of biocomputational models for precision medicine, using the BioModels Database as an example. We believe there are two main aspects to emphasize. The first is the technical and user-friendly characteristics these databases need so that computational models can be integrated with clinical or experimental data and serve as decision-making tools. The second is their potential in precision medicine for optimizing the planning of individual treatments for different chronic diseases, including cancer, diabetes, and Alzheimer’s disease, by considering the different dynamics of the diseases and the personal characteristics of the individuals. Based on these two aspects, we develop a basic conceptual framework for strategies that could increase the potential of these databases for precision medicine.
References
[1] BioModels Database. Frequently Asked Questions. Available at https://www.ebi.ac.uk/biomodels-main/faq#DIFFER_MOD.
[2] NIH. U.S. National Library of Medicine. What is precision medicine? Available at https://ghr.nlm.nih.gov/primer/precisionmedicine/definition
Abstract: Despite the availability of structured variant nomenclature and ontologies, representing the wide variety of variants observed in cancer known to have clinical relevance remains a challenge. These variants range from the most simple single nucleotide variants to complex chromosomal rearrangements at the DNA level and include additional layers of complexity when accounting for transcript-level or epigenetic alterations. As numerous groups undertake the daunting task of interpreting variants of clinical relevance, a need for representing the genomic coordinates of these variants in ways that provide resource interoperability and integration with downstream applications has become notable. Development of one of these resources, the crowd-sourced open-access knowledgebase CIViC (Clinical Interpretation of Variants in Cancer; civicdb.org), has led us to develop standard practices for representing variants at the genomic level. These standards, however, continue to be debated and developed in the CIViC interface as new variants and downstream applications are developed. Variant nomenclature such as HGVS representations provide a strong foundation for inter-operable genomic representations. HGVS nomenclature can be captured in CIViC but does not universally handle the variety of variants already represented in the CIViC knowledgebase. CIViC seeks to have a representative set of genomic coordinates for all variants in the knowledgebase and, to guide curators, maintains help documentation which can be accessed in the help section of the website at any time (https://civic.genome.wustl.edu/#/help/variants/variants-coordinates). All evidence supporting clinical interpretations of variants in CIViC are derived from the published medical literature and aggregated under the most detailed variant possible (i.e., “ALK fusion” when fusion partners are not defined by methods used by the literature source versus the more specific “EML4-ALK” whenever possible). Of particular interest is the best approach for representing challenging use-cases such as variants presented in the literature without explicit genomic and transcriptomic designations, literature results which aggregate multiple variants in the same gene, or complex variants such as fusions where different genomic breakpoints can have the same transcriptomic result. We will present our current approach to identifying genomic coordinate representations for variants in CIViC and the logic that drove these decisions.
Abstract: Genetic and protein interaction networks underpin normal development and physiology and are perturbed in diseased states. To help chart these networks, the Biological General Repository for Interaction Datasets (BioGRID) (http://www.thebiogrid.org) curates genetic and protein interactions for human and major model organisms, including yeast, worm, fly, and mouse. As of February 2017, BioGRID houses over 1,415,000 interactions manually curated from low- and high-throughput studies reported in more than 48,000 publications. This dataset includes over 369,700 human interactions implicated in central biological processes, many of which contain validated as well as potentially novel drug targets. To facilitate network-based drug discovery, we have recently begun to incorporate chemical-protein and chemical-genetic interaction data into BioGRID. As a pilot project, BioGRID has integrated over 27,700 manually curated chemical-protein interactions from DrugBank (http://www.drugbank.ca/), which contains more than 4,450 bioactive molecules and 2,130 unique human proteins. Targets within this dataset span many biological processes, such as the ubiquitin proteasome system, as well as disease-associated processes, such as the EGFR and VEGF signaling pathways implicated in glioblastoma and other cancer types. These small molecule-target associations can be visualized via an interactive network viewer embedded in the BioGRID search page results. The combination of well-annotated genetic, protein and chemical interaction data in a single unified resource will facilitate network-based approaches to drug discovery, including drug repurposing towards new disease indications. The complete BioGRID interaction dataset is freely available without restriction and can be downloaded in standardized formats suitable for computational studies. BioGRID is supported by grants from the National Institutes of Health [R01OD010929 and R24OD011194 to M.T. and K.D.], Genome Canada Largescale Applied Proteomics (OGI-069) and Genome Québec International Recruitment Award and a Canada Research Chair in Systems and Synthetic Biology [to M.T.]. Funding for open access charge: National Institutes of Health [R01OD010929].
Abstract: The Tohoku University Tohoku Medical Megabank Organization (ToMMo) has sequenced the whole genomes of 2,049 cohort participants and constructed a whole-genome reference panel (2KJPN) as a catalogue of genomic variants (28 million autosomal SNVs) and a foundation for genomic medicine in the Japanese population. We launched a website, the integrative Japanese Genome Variation Database (iJGVD; http://ijgvd.megabank.tohoku.ac.jp/), and publicly released allele frequency data for SNVs obtained from the 2,049 individuals. We are annotating the variants of 2KJPN with biological and medical information to identify variants with possible biological or pathological effects and to estimate their frequency in the population. Using databases of known pathological variants (HGMD and ClinVar), we identified about 7,000 pathologically annotated variants in 2KJPN. In addition, based on gene-based annotation, we identified loss-of-function variants, including more than 5,000 stop-gained SNVs. Among these stop-gained SNVs, only a small proportion (4.5%) were annotated as reported pathological variants, and the biological effects of most stop-gained variants are unknown. However, by comparing them with a set of disease-causing genes, we detected about 1,000 stop-gained SNVs as candidate pathogenic variants. Focusing on newborn screening (NBS) genes, we identified reported pathogenic variants in 2KJPN and estimated their frequencies. Most of the NBS genes showed lower carrier frequencies in 2KJPN than in populations of European ancestry. This difference in the allele frequencies of responsible variants may explain the lower incidence rates of congenital metabolic disorders in Japan compared with Europe. We are also inspecting variant frequencies for the actionable genes recommended by the ACMG. Our results show that variant review through literature survey is indispensable for constructing an information infrastructure for genomic medicine in the Japanese population.
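The following sketch illustrates, with toy records, the kind of filtering described above (selecting stop-gained SNVs that fall in a disease-gene set and estimating carrier frequencies); the fields, gene names and numbers are invented.

```python
# Illustrative only: toy records standing in for annotated 2KJPN variants.
# Real annotation uses HGMD/ClinVar and gene-based consequence calls; the
# fields, gene names and frequencies below are invented.

variants = [
    {"gene": "GENE_A", "consequence": "stop_gained", "allele_freq": 0.0002},
    {"gene": "GENE_B", "consequence": "missense",    "allele_freq": 0.0100},
    {"gene": "GENE_C", "consequence": "stop_gained", "allele_freq": 0.0040},
]
disease_genes = {"GENE_A"}    # e.g. a newborn-screening gene panel

# Keep stop-gained SNVs that fall within the disease-gene set.
candidates = [v for v in variants
              if v["consequence"] == "stop_gained" and v["gene"] in disease_genes]

for v in candidates:
    # For a rare variant, carrier frequency is roughly twice the allele frequency.
    print(v["gene"], "carrier frequency ~", round(2 * v["allele_freq"], 6))
```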
Abstract: COSMIC, the Catalogue Of Somatic Mutations In Cancer (http://cancer.sanger.ac.uk) is a comprehensive resource for exploring the full range of curated genetic mutations across all human cancers. The database catalogues multiple types of genetic alterations in tumors along with associated clinical data, enabling researchers and clinicians to easily determine the distribution of specific mutations across different cancer types, with the aim of informing targeted drug development and patient diagnosis and treatment.
With the increasing impact of personalized medicine, scientific literature is emerging which describes genetic responses to targeted therapies. Resistance to particular drugs may occur in some patients following an initial response, and this is often due to new acquired mutations. Resistance can develop within the tumor where sub-populations of cells either already possess or gain mutations enabling them to emerge under selective drug pressure. Patients who initially respond to treatment relapse as a result of the emergence of the dominant resistant clone, for example the acquisition of the EGFR T790M mutation overcomes gefitinib and erlotinib treatment in non-small cell lung cancer, and this tumour evolution can induce relapse in cancer patients rapidly, sometimes only within weeks of an apparently successful response.
In 2016, genetic curation in COSMIC was extended to include these acquired resistance mutations and their effects on drug responses, as well as inherent (pre-treatment) resistance mutations (http://cancer.sanger.ac.uk/cosmic/drug_resistance). Focusing initially on well-characterized drugs and genes, COSMIC v79 contains resistance mutation profiles across 21 drugs, detailing the occurrence of 261 unique resistance alleles across 2017 tumor samples.
Within the COSMIC website, pie charts and histograms show the frequency of mutations in the evolution of therapeutic resistance, and additional mutational and clinical information associated with resistant samples can be explored. The full dataset can be downloaded from the COSMIC SFTP site. In combination with improved molecular diagnostics, it is hoped that this new curation will further aid drug development and allow clinicians to better target therapies to individual patients.
Abstract: Non-steroidal anti-inflammatory drugs (NSAIDs), frequently taken to treat pain and inflammation, are among the most commonly used drugs in the world. NSAIDs act by inhibiting the activity of cyclooxygenase (COX)-1 and/or COX-2, thereby preventing the production of inflammatory prostaglandins from arachidonic acid. COX-2 selective NSAIDs, which were first introduced to mitigate the gastrointestinal side effects of the more general NSAIDs, have been linked to increased cardiovascular (CV) risk. The Personalized NSAID Therapeutics Consortium (PENTACON: www.pentaconhq.org) aims to develop a paradigm for the personalization of drug treatment by focusing on this class of frequently used drugs. Key to such a paradigm is identifying genetic sources and underlying mechanisms of individual variability in drug response and CV risk. Toward this end, a Curated Data Resource (CDR: cdr.pentaconhq.org) has been created to consolidate and standardize the vast amount of arachidonic acid pathway (AAP) and blood pressure regulation data available in the literature and various databases. This resource contains a comprehensive list of genes and major small molecules that play a role in the AAP and related networks, including genes involved in blood pressure regulation. An interactive web tool has been developed to help researchers more readily visualize the relationships between central pathway genes and small molecules. Additionally, detailed molecular information for these genes, including isoforms, variants and relevant disease phenotypes, along with cell line, tissue and cell type specificity, subcellular localization and enzyme kinetic data, was curated using standardized ontology terms. These data are available for browsing or searching within the online Curated Data Resource (CDR) web tool and may also be downloaded in their entirety from the PENTACON data page. This collection also includes relevant data for identified orthologs of human arachidonic acid pathway genes found in mouse, rat, fish, and yeast. Approximately 9,000 human AAP genetic and protein interactions were also curated and may be accessed through the BioGRID database (www.biogrid.org). These resources lend themselves to in-depth analyses that can guide hypothesis-driven experimentation and help to elucidate underlying mechanisms of variable drug response.
Abstract: IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org, is the global reference in immunogenetics and immunoinformatics. IMGT® manages the extreme diversity and complexity of the antigen receptors of the adaptive immune response, the immunoglobulins (IG) or antibodies and the T cell receptors (TR) (2 × 10^12 different specificities per individual), and is at the origin of immunoinformatics, a science at the interface between immunogenetics and bioinformatics. IMGT® is based on the concepts of IMGT-ONTOLOGY, and these concepts are used for expert annotation in IMGT/LIGM-DB, the IMGT® database of IG and TR nucleotide sequences from human and other vertebrate species, and in IMGT/GENE-DB, the IMGT® gene and allele database. The IMGT/LIGM-DB biocuration pipeline includes IMGT/LIGMotif, for the analysis of large genomic DNA sequences, and IMGT/Automat, for the automatic annotation of rearranged cDNA sequences. Analysis results are checked for consistency, both manually and by using IMGT® tools (IMGT/NtiToVald, IMGT/V-QUEST, IMGT/BLAST, etc.). The annotation includes the sequence identification (IMGT keywords), the gene and allele classification (IMGT nomenclature), the description (IMGT labels in capital letters), and the translation of the coding regions (IMGT unique numbering). In parallel, the IMGT Repertoire is updated (Locus representations, Gene tables and Protein displays for new genes, Alignments of alleles for new and/or confirmatory alleles) and the IMGT reference directory is completed (sequences used for gene and allele comparison and assignment in IMGT® tools such as IMGT/V-QUEST, IMGT/HighV-QUEST for next generation sequencing (NGS), and IMGT/DomainGapAlign, and in databases such as IMGT/2Dstructure-DB and IMGT/3Dstructure-DB). IMGT® gene names are approved by the Human Genome Nomenclature Committee (HGNC) and endorsed by the World Health Organization (WHO)-International Union of Immunological Societies (IUIS) Nomenclature Subcommittee for IG and TR. Reciprocal links exist between IMGT/GENE-DB and HGNC and NCBI. The definition of antibodies published by the WHO International Nonproprietary Name (INN) Programme is based on the IMGT® concepts, and allows easy retrieval via IMGT/mAb-DB queries. The IMGT® standardized annotation has made it possible to bridge the gaps, for IG or antibodies and TR, between fundamental and medical research, veterinary research, repertoire analysis, biotechnology related to antibody engineering, diagnostics and therapeutic approaches.
Abstract: The cBioPortal for Cancer Genomics is an open source web resource, which provides a common platform for visualization of clinical and research data of multidimensional cancer cohorts. Originally developed at Memorial Sloan Kettering (MSK), it is now maintained by a multi-institutional team including MSK and the Dana-Farber Cancer Institute, among others. The public instance of cBioPortal (cbioportal.org) currently hosts data from over 20,000 tumor samples in 149 studies. Additionally, there are dozens of private installations of cBioPortal around the world, hosted in academia and industry for the analysis of private, pre-publication data. The instance at MSK alone contains over 600 sequencing studies. As genomic sequencing is becoming more commonplace and affordable in research and clinical settings, it becomes critical that heterogeneous data from various sources in the cBioPortal is standardized during data curation, for proper comparison and interpretation.
The import of genomic data into cBioPortal can be automated for the most part, particularly data that is sourced from pipelines like the TCGA Firehose, ICGC, TARGET, etc. Genomic data curation is manual in the case of published papers and unpublished data from individual laboratories. Annotation of genomic data with clinical attributes is becoming an essential part of translational research. For homogeneous standardization of these data, we have recorded approximately 3000 attributes utilizing REDCap, which allows curators to vet and validate datasets and streamline the process of data import/export into the portal. In parallel, we have also developed a normalized classification of cancer types on the basis of tissue/histology (http://oncotree.mskcc.org/oncotree/), which is now used for all cBioPortal studies. Lastly, as genomic sequencing has begun to guide patient care, we are also leveraging OncoKB, a precision oncology knowledgebase, for the treatment implications of specific cancer gene alterations, curated in detail from diverse resources such as the FDA, NCCN, ASCO, ClinicalTrials.gov and the literature.
cBioPortal data curation is an evolving process with the intention of incorporating different tools and pipelines to maintain the right balance between manual and automated curation, reduce the import of duplicate data, optimize storage and turnaround time, and incorporate clinical data to facilitate translational research.
Abstract: Critical to the field of clinical genomics is an understanding of the strength of correlation between genotype and phenotype. The central mission of ClinGen (The Clinical Genome Resource, clinicalgenome.org) is to define the “clinical relevance of genes and variants for use in precision medicine and research.” Inherent to this mission is the ability to access and evaluate evidence for a gene’s and/or variant’s role in disease in a consistent and efficient manner. Towards this goal, ClinGen has built interconnected curation interfaces, one for gene curation and one for variant curation. The rich evidence captured by both interfaces is available in a structured format that allows it to be easily accessed via ClinGen’s public portal (clinicalgenome.org).
The ClinGen gene and variant curation interfaces have been developed according to the following important specifications: 1) variant curation follows the ACMG-AMP Standards and Guidelines (Richards et al 2015), while gene curation follows the Clinical Validity Classifications framework established by ClinGen’s Gene Curation Work Group, 2) both interfaces centralize evidence from relevant resources in order to enable efficient and consistent curation, 3) curated literature evidence and evidence retrieved from external resources are shared between the gene and variant curation tools, 4) the interfaces are designed to guide biocurators through the curation process, 5) controlled vocabularies and ontologies are an important part of the design in order to promote the capture of discrete evidence, facilitate connections, and promote consistency, 6) external and curated evidence is viewable by all biocurators, while curated evidence can only be edited by its creator, 7) the interfaces support expert review of provisional classifications and interpretations, 8) contextual help and documentation are included to assist the biocurator, and 9) data is stored in standard JSON-LD format to define rich relationships and facilitate data exchange.
Both tools are now in production (curation.clinicalgenome.org) and currently accessible to ClinGen biocurators and approved groups from the broader community. A demo version of the interfaces (curation-test.clinicalgenome.org), which does not permanently save data, is available to allow exploration. We expect active use of the ClinGen curation interfaces will facilitate the implementation of the ACMG variant classification guidelines across diverse disease genes.
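For readers unfamiliar with JSON-LD, the record below is a minimal illustration of the format referred to in point (9) above; the context, types, and identifiers are invented placeholders, not the ClinGen interfaces' actual data model.

```python
# Illustrative only: a minimal JSON-LD-style record of the kind point (9)
# refers to. The context, types, and field values are invented placeholders,
# not the ClinGen curation interfaces' actual data model.
import json

evidence_record = {
    "@context": {
        "gene": "http://example.org/vocab/gene",
        "disease": "http://example.org/vocab/disease",
        "classification": "http://example.org/vocab/classification",
        "evidence": "http://example.org/vocab/evidence",
    },
    "@id": "http://example.org/curations/0001",
    "@type": "GeneDiseaseValidityAssertion",
    "gene": "HGNC:0000",                     # placeholder gene identifier
    "disease": "MONDO:0000000",              # placeholder disease identifier
    "classification": "Moderate",
    "evidence": [{"@id": "PMID:0000000", "@type": "Publication"}],
}

print(json.dumps(evidence_record, indent=2))
```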
Abstract: Having consistent, comprehensive and normalized dictionaries of biological concepts such as genes, rare diseases and phenotypes is crucial for network-based phenotype-gene association mining and phenotype-based rare disease gene prioritization. We created dictionaries of terms for different concept types, including a phenotype dictionary derived from HPO [Köhler et al, 2017], a rare disease dictionary from Orphanet [http://www.orpha.net/] and a gene dictionary from HGNC [Gray et al, 2015]. The concepts in these dictionaries were used both as nodes in the network analysis and for named entity recognition (NER) in the text-mining of concept-pair associations from MEDLINE. Terms, along with their associated synonyms, were extracted from specific fields in the individual raw data files downloaded from the respective data sources. Wherever possible, the dictionaries were augmented with terms from MeSH [https://www.ncbi.nlm.nih.gov/mesh] in order to increase coverage, using cross-references provided by Orphanet and HPO. For example, MeSH introduced more names of subtypes of Charcot-Marie-Tooth disease than were present in Orphanet.
However, these dictionaries had to be curated due to issues we found around overlaps, ambiguity, generic terms and incompleteness. We found term and synonym overlaps, both within individual sources and across sources. These include 561 overlaps between HGNC and Orphanet (e.g. maple syrup urine disease [ORPHA:511, HGNC:987]), 507 overlaps between Orphanet and HPO (e.g. methylcobalamin deficiency [HP:0003223, ORPHA:622]) and 18 between HGNC and HPO (e.g. neurofibromatosis [HP:0001067, HGNC:7765]). Further, a large number of generic terms such as “severity” [HP:0012824] were removed from the dictionaries. We also found the phenotype and rare disease dictionaries to be incomplete, which we addressed by adding synonyms (e.g. "duplex collecting system" as a synonym for "duplicated collecting system" [HP:0000081]). Finally, we found ambiguities with respect to abbreviations and English words (e.g. ‘fame’ is a synonym for ORPHA:86814).
Addressing the issues of overlaps, ambiguity and incompleteness in these sources will help in improving the output in rare disease gene prioritization. Future challenges include word-order agnostic NER and global vs local abbreviation handling, particularly for complex phenotype and disease terms.
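A minimal sketch of the cross-source overlap check described above is shown below; the toy dictionaries reuse the maple syrup urine disease example from the abstract, and real input would come from the full HGNC, Orphanet and HPO files.

```python
# Minimal sketch of the overlap check described above: find terms (or
# synonyms) shared between two source dictionaries. The toy dictionaries
# below are illustrative; real input would come from HGNC, Orphanet and HPO.

def build_index(dictionary):
    """Map each lower-cased term/synonym to the concept IDs that use it."""
    index = {}
    for concept_id, names in dictionary.items():
        for name in names:
            index.setdefault(name.lower(), set()).add(concept_id)
    return index

hgnc = {"HGNC:987": ["maple syrup urine disease"]}          # example from the abstract
orphanet = {"ORPHA:511": ["maple syrup urine disease", "MSUD"]}

hgnc_index = build_index(hgnc)
orphanet_index = build_index(orphanet)

# Any term present in both indexes is a cross-source overlap to review.
for term in set(hgnc_index) & set(orphanet_index):
    print(f"'{term}' is ambiguous across sources:",
          hgnc_index[term] | orphanet_index[term])
```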
GigaScience Curation Challenge
Sunday, March 26, 2017, 1:30-3:30, LK 120
Organizers: Chris Hunter, Todd Taylor, Maryann Martone
Summary: This workshop will introduce the community annotation tools iCLiKVAL and Hypothes.is and challenge curators to use them. There will be three short presentations:
1. Introduction to iCLiKVAL and its use - by an iCLiKVAL team member
2. Introduction to Hypothes.is and its use - by a Hypothes.is team member
3. Competition outline, rules and registration details - by Chris Hunter (GigaScience)
If time allows, there will be a short hands-on trial/mini competition session.
Duration: 2 hours
Reading, Assembling and Reasoning for Biocuration
Sunday, March 26, 2017, 1:30-3:30, Berg Hall B/C – LK 240/250
Organizers: Sophia Ananiadou, Riza Batista-Navarro, Paul Cohen, Diana Chung, Emek Demir, Lynette Hirschman, Parag Mallik
Summary: We will focus on recent advances in the development of integrated systems to capture "Big Mechanisms" for biological systems, including machine reading of journal articles, (semi-)automated assembly of signaling pathway models, and machine-aided analysis of these models for tasks such as drug repurposing and explaining drugs' effects. This workshop will consist of invited speakers and contributed talks and/or panel discussions from experts in biocuration, machine reading, and biological modeling.
Duration: 2 hours
Addressing the High Throughput, Low Information Data Crisis in Biology
Sunday, March 26, 2017, 1:30-3:30, LK 130
Organizers: Sean Mooney, Predrag Radivojac, Claire O’Donovan, Iddo Friedberg
Summary: This workshop aims to improve the understanding of protein function prediction methods, database biases, and the Critical Assessment of Functional Annotation (CAFA) challenge. We will also discuss how to improve automatic annotation, reduce database bias, and increase annotation accuracy. There will be four talks, followed by a group discussion:
• Sean Mooney: Introduction to the world of community challenges.
• Predrag Radivojac: Introduction to function prediction, and CAFA
• Iddo Friedberg: Understanding annotation bias in biological databases
• Claire O’Donovan: The ECO ontology as a solution to annotation biases
Duration: 2 hours
Biocuration and the Research Life Cycle: Advances and Challenges
Tuesday, March 28, 2017, 1:30-3:30, Berg Hall B/C – LK 240/250
Organizers: Cecilia Arighi, Pascale Gaudet, Lynette Hirschman, Rezarta Islamaj-Dogan, Fabio Rinaldi
Summary: This workshop will revisit and identify the major advances and new challenges in the biocuration workflow in connection with the research life cycle, from publication to data acquisition to a database entry and subsequent updates. The format will be a brief introduction to the different topics (15 min), followed by breakout sessions to discuss those topics (1 h), and a report from each group on the outcomes and future steps (30 min). The last 15 min will be used for general discussion and workshop closing.
Duration: 2 hours
Google Summer of Code
Tuesday, March 28, 2017, 1:30-3:30, Berg Hall A – LK 230
Organizers: Marc Gillespie, Robin Haw
Summary: The Open Genome Informatics group will be discussing Google Summer of Code, a fantastic platform for student training, project development, and collaboration. All of these are key aspects of a good biocuration project, and in our experience the student projects result in valuable deliverables. There will be an introduction followed by a panel discussion.
Duration: 2 hours
Consensus Building for Cancer Molecular Subtyping
Wednesday, March 29, 2017, 8:30-10:30 AM, Alway M106
Organizers: Lynn Schriml, Sherri De Coronado, Warren Kibbe, Pascale Gaudet, Raja Mazumder
Summary: This workshop's goal is to bring together community members to identify common and alternative methods of molecular modeling. We will be exploring the status, mechanisms, and uses for molecular characterizations of cancer, ways of defining cancer subtypes, and the relations between subtypes and associated data (e.g., anatomy, OMIM phenotype ‘susceptibility_to’, animal models, drug modeling).
Duration: 2 hours
Scientific Evidence for Biocuration
Wednesday, March 29, 2017, 8:30-10:30 AM, Berg Hall A – LK 230
Organizers: Marcus Chibucos
Summary: This workshop, hosted by the Evidence & Conclusion Ontology (ECO), will introduce fundamental concepts in the representation of scientific evidence; provide new users with an overview of recent ECO developments, applications, and collaborations; serve as an open forum for discussion of community evidence needs, including confidence/quality metrics; and invite new collaborations. Background and applied talks will be followed by an open discussion.
Marcus Chibucos: Evidence in biocuration; Evidence & Conclusion Ontology (ECO) development & applications
Elvira Mitraka: Applications for ECO in WikiData and Clinical Interpretations of Variants in Cancer (CIViC)
Gully Burns: Fine-grained models and methods for evidence modeling and extraction
Rebecca Tauber: Live demonstration - harmonizing ECO and the Ontology for Biomedical Investigations (OBI)
Group discussion with audience participation: Community issues related to evidence
Duration: 2 hours