Dr. Michel Dumontier is an Associate Professor of Medicine (Biomedical Informatics) at Stanford University. His research focuses on the development of computational methods to increase our understanding of how living systems respond to chemical agents. At the core of the research program is the development and use of Semantic Web technologies to formally represent and reason about data and services so as (1) to facilitate the publishing, sharing and discovery of scientific knowledge, (2) to enable the formulation and evaluation of scientific hypotheses and (3) to create and make available computational methods to investigate the structure, function and behaviour of living systems. Dr. Dumontier serves as a co-chair for the World Wide Web Consortium Semantic Web for Health Care and Life Sciences Interest Group (W3C HCLSIG) and is the Scientific Director for Bio2RDF, a widely recognized open-source project to create and provide linked data for life sciences.

Boards, Advisory Committees, Professional Organizations

  • Chair, World Wide Web Consortium (W3C) Semantic Web for Health Care and Life Sciences Interest Group (2011 - Present)

Professional Education

  • Postdoc, Samuel Lunenfeld Research Institute, Bioinformatics (2005)
  • BSc, University of Manitoba, Biochemistry (1998)
  • PhD, University of Toronto, Biochemistry (Bioinformatics) (2004)

Research & Scholarship

Current Research and Scholarly Interests

The Dumontier laboratory for biomedical knowledge discovery develops computational methods to better understand how living systems respond to chemical agents. We use semantic technologies to integrate and analyze large biomedical data and enable knowledge-based discoveries in biology, biochemistry and medicine. Our major research interests include i) drug repositioning using large-scale animal model data, ii) elucidating the mechanism by which complex phenotypes (e.g. side effects) arise from consumption of pharmaceutical products, iii) determining the extent to which drug metabolic products contribute to toxicity, iv) optimizing novel drug therapeutic regimes so as to minimize undesirable side effects and v) understanding the systemic basis of an altered response due to genetic variation. We develop novel methods to accurately capture, publish, discover and re-use biomedical data, ontologies and services using formal knowledge representation and automated reasoning.

Journal Articles

  • Evaluation of the OQuaRE framework for ontology quality EXPERT SYSTEMS WITH APPLICATIONS Duque-Ramos, A., Tomas Fernandez-Breis, J., Iniesta, M., Dumontier, M., Egana Aranguren, M., Schulz, S., Aussenac-Gilles, N., Stevens, R. 2013; 40 (7): 2696-2703
  • State of the art and open challenges in community-driven knowledge curation JOURNAL OF BIOMEDICAL INFORMATICS Groza, T., Tudorache, T., Dumontier, M. 2013; 46 (1): 1-4

    View details for DOI 10.1016/j.jbi.2012.11.007

    View details for Web of Science ID 000315362300001

    View details for PubMedID 23219718

  • Linked Data in Drug Discovery IEEE INTERNET COMPUTING Dumontier, M., Wild, D. J. 2012; 16 (6): 68-71
  • Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics BIOINFORMATICS Hoehndorf, R., Dumontier, M., Gkoutos, G. V. 2012; 28 (16): 2169-2175


    Many complex diseases are the result of abnormal pathway functions instead of single abnormalities. Disease diagnosis and intervention strategies must target these pathways while minimizing the interference with normal physiological processes. Large-scale identification of disease pathways and chemicals that may be used to perturb them requires the integration of information about drugs, genes, diseases and pathways. This information is currently distributed over several pharmacogenomics databases. An integrated analysis of the information in these databases can reveal disease pathways and facilitate novel biomedical analyses. We demonstrate how to integrate pharmacogenomics databases through integration of the biomedical ontologies that are used as meta-data in these databases. The additional background knowledge in these ontologies can then be used to enable novel analyses. We identify disease pathways using a novel multi-ontology enrichment analysis over the Human Disease Ontology, and we identify significant associations between chemicals and pathways using an enrichment analysis over a chemical ontology. The drug-pathway and disease-pathway associations are a valuable resource for research in disease and drug mechanisms and can be used to improve computational drug repurposing.

    View details for DOI 10.1093/bioinformatics/bts350

    View details for Web of Science ID 000307501100011

    View details for PubMedID 22711793

  • Aptamer base: a collaborative knowledge base to describe aptamers and SELEX experiments DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION Cruz-Toledo, J., McKeague, M., Zhang, X., Giamberardino, A., McConnell, E., Francis, T., DeRosa, M. C., Dumontier, M. 2012


    Over the past several decades, rapid developments in both molecular and information technology have collectively increased our ability to understand molecular recognition. One emerging area of interest in molecular recognition research includes the isolation of aptamers. Aptamers are single-stranded nucleic acid or amino acid polymers that recognize and bind to targets with high affinity and selectivity. While research has focused on collecting aptamers and their interactions, most of the information regarding experimental methods remains in the unstructured and textual format of peer reviewed publications. To address this, we present the Aptamer Base, a database that provides detailed, structured information about the experimental conditions under which aptamers were selected and their binding affinity quantified. The open collaborative nature of the Aptamer Base provides the community with a unique resource that can be updated and curated in a decentralized manner, thereby accommodating the ever evolving field of aptamer research. DATABASE URL:

    View details for DOI 10.1093/database/bas006

    View details for Web of Science ID 000304920200004

    View details for PubMedID 22434840

  • Self-organizing ontology of biochemically relevant small molecules BMC BIOINFORMATICS Chepelev, L. L., Hastings, J., Ennis, M., Steinbeck, C., Dumontier, M. 2012; 13


    The advent of high-throughput experimentation in biochemistry has led to the generation of vast amounts of chemical data, necessitating the development of novel analysis, characterization, and cataloguing techniques and tools. Recently, a movement to publicly release such data has advanced biochemical structure-activity relationship research, while providing new challenges, the biggest being the curation, annotation, and classification of this information to facilitate useful biochemical pattern analysis. Unfortunately, the human resources currently employed by the organizations supporting these efforts (e.g. ChEBI) are expanding linearly, while new useful scientific information is being released in a seemingly exponential fashion. Compounding this, currently existing chemical classification and annotation systems are not amenable to automated classification, formal and transparent chemical class definition axiomatization, facile class redefinition, or novel class integration, thus further limiting chemical ontology growth by necessitating human involvement in curation. Clearly, there is a need for the automation of this process, especially for novel chemical entities of biological interest. To address this, we present a formal framework based on Semantic Web technologies for the automatic design of chemical ontology which can be used for automated classification of novel entities. We demonstrate the automatic self-assembly of a structure-based chemical ontology based on 60 MeSH and 40 ChEBI chemical classes. This ontology is then used to classify 200 compounds with an accuracy of 92.7%. We extend these structure-based classes with molecular feature information and demonstrate the utility of our framework for classification of functionally relevant chemicals. Finally, we discuss an iterative approach that we envision for future biochemical ontology development. We conclude that the proposed methodology can ease the burden of chemical data annotators and dramatically increase their productivity. We anticipate that the use of formal logic in our proposed framework will make chemical classification criteria more transparent to humans and machines alike and will thus facilitate predictive and integrative bioactivity model development.

    View details for DOI 10.1186/1471-2105-13-3

    View details for Web of Science ID 000299825400001

    View details for PubMedID 22221313

  • Building an HIV data mashup using Bio2RDF BRIEFINGS IN BIOINFORMATICS Nolin, M., Dumontier, M., Belleau, F., Corbeil, J. 2012; 13 (1): 98-106


    We present an update to the Bio2RDF Linked Data Network, which now comprises ∼30 billion statements across 30 data sets. Significant changes to the framework include the accommodation of global mirrors, offline data processing and new search and integration services. The utility of this new network of knowledge is illustrated through a Bio2RDF-based mashup with microarray gene expression results and interaction data obtained from the HIV-1, Human Protein Interaction Database (HHPID) with respect to the infection of human macrophages with the human immunodeficiency virus type 1 (HIV-1).

    View details for DOI 10.1093/bib/bbr003

    View details for Web of Science ID 000298888200007

    View details for PubMedID 22223742

  • The Chemical Information Ontology: Provenance and Disambiguation for Chemical Data on the Biological Semantic Web PLOS ONE Hastings, J., Chepelev, L., Willighagen, E., Adams, N., Steinbeck, C., Dumontier, M. 2011; 6 (10)


    Cheminformatics is the application of informatics techniques to solve chemical problems in silico. There are many areas in biology where cheminformatics plays an important role in computational research, including metabolism, proteomics, and systems biology. One critical aspect in the application of cheminformatics in these fields is the accurate exchange of data, which is increasingly accomplished through the use of ontologies. Ontologies are formal representations of objects and their properties using a logic-based ontology language. Many such ontologies are currently being developed to represent objects across all the domains of science. Ontologies enable the definition, classification, and support for querying objects in a particular domain, enabling intelligent computer applications to be built which support the work of scientists both within the domain of interest and across interrelated neighbouring domains. Modern chemical research relies on computational techniques to filter and organise data to maximise research productivity. The objects which are manipulated in these algorithms and procedures, as well as the algorithms and procedures themselves, enjoy a kind of virtual life within computers. We will call these information entities. Here, we describe our work in developing an ontology of chemical information entities, with a primary focus on data-driven research and the integration of calculated properties (descriptors) of chemical entities within a semantic web context. Our ontology distinguishes algorithmic, or procedural information from declarative, or factual information, and renders of particular importance the annotation of provenance to calculated data. The Chemical Information Ontology is being developed as an open collaborative project. More details, together with a downloadable OWL file, are available at (license: CC-BY-SA).

    View details for DOI 10.1371/journal.pone.0025513

    View details for Web of Science ID 000295943000034

    View details for PubMedID 21991315

  • Controlled vocabularies and semantics in systems biology MOLECULAR SYSTEMS BIOLOGY Courtot, M., Juty, N., Knuepfer, C., Waltemath, D., Zhukova, A., Draeger, A., Dumontier, M., Finney, A., Golebiewski, M., Hastings, J., Hoops, S., Keating, S., Kell, D. B., Kerrien, S., Lawson, J., Lister, A., Lu, J., Machne, R., Mendes, P., Pocock, M., Rodriguez, N., Villeger, A., Wilkinson, D. J., Wimalaratne, S., Laibe, C., Hucka, M., Le Novere, N. 2011; 7


    The use of computational modeling to describe and analyze biological systems is at the heart of systems biology. Model structures, simulation descriptions and numerical results can be encoded in structured formats, but there is an increasing need to provide an additional semantic layer. Semantic information adds meaning to components of structured descriptions to help identify and interpret them unambiguously. Ontologies are one of the tools frequently used for this purpose. We describe here three ontologies created specifically to address the needs of the systems biology community. The Systems Biology Ontology (SBO) provides semantic information about the model components. The Kinetic Simulation Algorithm Ontology (KiSAO) supplies information about existing algorithms available for the simulation of systems biology models, their characterization and interrelationships. The Terminology for the Description of Dynamics (TEDDY) categorizes dynamical features of the simulation results and general systems behavior. The provision of semantic information extends a model's longevity and facilitates its reuse. It provides useful insight into the biology of modeled processes, and may be used to make informed decisions on subsequent simulation experiments.

    View details for DOI 10.1038/msb.2011.77

    View details for Web of Science ID 000296652600009

    View details for PubMedID 22027554

  • Integrating systems biology models and biomedical ontologies BMC SYSTEMS BIOLOGY Hoehndorf, R., Dumontier, M., Gennari, J. H., Wimalaratne, S., de Bono, B., Cook, D. L., Gkoutos, G. V. 2011; 5


    Systems biology is an approach to biology that emphasizes the structure and dynamic behavior of biological systems and the interactions that occur within them. To succeed, systems biology crucially depends on the accessibility and integration of data across domains and levels of granularity. Biomedical ontologies were developed to facilitate such an integration of data and are often used to annotate biosimulation models in systems biology. We provide a framework to integrate representations of in silico systems biology with those of in vivo biology as described by biomedical ontologies and demonstrate this framework using the Systems Biology Markup Language. We developed the SBML Harvester software that automatically converts annotated SBML models into OWL and we apply our software to those biosimulation models that are contained in the BioModels Database. We utilize the resulting knowledge base for complex biological queries that can bridge levels of granularity, verify models based on the biological phenomenon they represent and provide a means to establish a basic qualitative layer on which to express the semantics of biosimulation models. We establish an information flow between biomedical ontologies and biosimulation models and we demonstrate that the integration of annotated biosimulation models and biomedical ontologies enables the verification of models as well as expressive queries. Establishing a bi-directional information flow between systems biology and biomedical ontologies has the potential to enable large-scale analyses of biological systems that span levels of granularity from molecules to organisms.

    View details for DOI 10.1186/1752-0509-5-124

    View details for Web of Science ID 000294781500001

    View details for PubMedID 21835028

  • MoSuMo: A Semantic Web service to generate electrostatic potentials across solvent excluded protein surfaces and binding pockets COMPUTERS & GRAPHICS-UK Gawronski, A., Dumontier, M. 2011; 35 (4): 823-830
  • Prototype semantic infrastructure for automated small molecule classification and annotation in lipidomics BMC BIOINFORMATICS Chepelev, L. L., Riazanov, A., Kouznetsov, A., Low, H. S., Dumontier, M., Baker, C. J. 2011; 12


    The development of high-throughput experimentation has led to astronomical growth in biologically relevant lipids and lipid derivatives identified, screened, and deposited in numerous online databases. Unfortunately, efforts to annotate, classify, and analyze these chemical entities have largely remained in the hands of human curators using manual or semi-automated protocols, leaving many novel entities unclassified. Since chemical function is often closely linked to structure, accurate structure-based classification and annotation of chemical entities is imperative to understanding their functionality. As part of an exploratory study, we have investigated the utility of semantic web technologies in automated chemical classification and annotation of lipids. Our prototype framework consists of two components: an ontology and a set of federated web services that operate upon it. The formal lipid ontology we use here extends a part of the LiPrO ontology and draws on the lipid hierarchy in the LIPID MAPS database, as well as literature-derived knowledge. The federated semantic web services that operate upon this ontology are deployed within the Semantic Annotation, Discovery, and Integration (SADI) framework. Structure-based lipid classification is enacted by two core services. Firstly, a structural annotation service detects and enumerates relevant functional groups for a specified chemical structure. A second service reasons over lipid ontology class descriptions using the attributes obtained from the annotation service and identifies the appropriate lipid classification. We extend the utility of these core services by combining them with additional SADI services that retrieve associations between lipids and proteins and identify publications related to specified lipid types. We analyze the performance of SADI-enabled eicosanoid classification relative to the LIPID MAPS classification and reflect on the contribution of our integrative methodology in the context of high-throughput lipidomics. Our prototype framework is capable of accurate automated classification of lipids and facile integration of lipid class information with additional data obtained with SADI web services. The potential of programming-free integration of external web services through the SADI framework offers an opportunity for development of powerful novel applications in lipidomics. We conclude that semantic web technologies can provide an accurate and versatile means of classification and annotation of lipids.

    View details for DOI 10.1186/1471-2105-12-303

    View details for Web of Science ID 000294361500001

    View details for PubMedID 21791100

  • Interoperability between Biomedical Ontologies through Relation Expansion, Upper-Level Ontologies and Automatic Reasoning PLOS ONE Hoehndorf, R., Dumontier, M., Oellrich, A., Rebholz-Schuhmann, D., Schofield, P. N., Gkoutos, G. V. 2011; 6 (7)


    Researchers design ontologies as a means to accurately annotate and integrate experimental data across heterogeneous and disparate data- and knowledge bases. Formal ontologies make the semantics of terms and relations explicit such that automated reasoning can be used to verify the consistency of knowledge. However, many biomedical ontologies do not sufficiently formalize the semantics of their relations and are therefore limited with respect to automated reasoning for large scale data integration and knowledge discovery. We describe a method to improve automated reasoning over biomedical ontologies and identify several thousand contradictory class definitions. Our approach aligns terms in biomedical ontologies with foundational classes in a top-level ontology and formalizes composite relations as class expressions. We describe the semi-automated repair of contradictions and demonstrate expressive queries over interoperable ontologies. Our work forms an important cornerstone for data integration, automatic inference and knowledge discovery based on formal representations of knowledge. Our results and analysis software are available at

    View details for DOI 10.1371/journal.pone.0022006

    View details for Web of Science ID 000292812400024

    View details for PubMedID 21789201

  • Chemical Entity Semantic Specification: Knowledge representation for efficient semantic cheminformatics and facile data integration JOURNAL OF CHEMINFORMATICS Chepelev, L. L., Dumontier, M. 2011; 3


    Over the past several centuries, chemistry has permeated virtually every facet of human lifestyle, enriching fields as diverse as medicine, agriculture, manufacturing, warfare, and electronics, among numerous others. Unfortunately, application-specific, incompatible chemical information formats and representation strategies have emerged as a result of such diverse adoption of chemistry. Although a number of efforts have been dedicated to unifying the computational representation of chemical information, disparities between the various chemical databases still persist and stand in the way of cross-domain, interdisciplinary investigations. Through a common syntax and formal semantics, Semantic Web technology offers the ability to accurately represent, integrate, reason about and query across diverse chemical information. Here we specify and implement the Chemical Entity Semantic Specification (CHESS) for the representation of polyatomic chemical entities, their substructures, bonds, atoms, and reactions using Semantic Web technologies. CHESS provides means to capture aspects of their corresponding chemical descriptors, connectivity, functional composition, and geometric structure while specifying mechanisms for data provenance. We demonstrate that using our readily extensible specification, it is possible to efficiently integrate multiple disparate chemical data sources, while retaining appropriate correspondence of chemical descriptors, with very little additional effort. We demonstrate the impact of some of our representational decisions on the performance of chemically-aware knowledgebase searching and rudimentary reaction candidate selection. Finally, we provide access to the tools necessary to carry out chemical entity encoding in CHESS, along with a sample knowledgebase. By harnessing the power of Semantic Web technologies with CHESS, it is possible to provide a means of facile cross-domain chemical knowledge integration with full preservation of data correspondence and provenance. Our representation builds on existing cheminformatics technologies and, by the virtue of RDF specification, remains flexible and amenable to application- and domain-specific annotations without compromising chemical data integration. We conclude that the adoption of a consistent and semantically-enabled chemical specification is imperative for surviving the coming chemical data deluge and supporting systems science research.

    View details for DOI 10.1186/1758-2946-3-20

    View details for Web of Science ID 000300226500001

    View details for PubMedID 21595881

  • Semantic Web integration of Cheminformatics resources with the SADI framework JOURNAL OF CHEMINFORMATICS Chepelev, L. L., Dumontier, M. 2011; 3


    The diversity and the largely independent nature of chemical research efforts over the past half century are, most likely, the major contributors to the current poor state of chemical computational resource and database interoperability. While open software for chemical format interconversion and database entry cross-linking have partially addressed database interoperability, computational resource integration is hindered by the great diversity of software interfaces, languages, access methods, and platforms, among others. This has, in turn, translated into limited reproducibility of computational experiments and the need for application-specific computational workflow construction and semi-automated enactment by human experts, especially where emerging interdisciplinary fields, such as systems chemistry, are pursued. Fortunately, the advent of the Semantic Web, and the very recent introduction of RESTful Semantic Web Services (SWS) may present an opportunity to integrate all of the existing computational and database resources in chemistry into a machine-understandable, unified system that draws on the entirety of the Semantic Web. We have created a prototype framework of Semantic Automated Discovery and Integration (SADI) framework SWS that exposes the QSAR descriptor functionality of the Chemistry Development Kit. Since each of these services has formal ontology-defined input and output classes, and each service consumes and produces RDF graphs, clients can automatically reason about the services and available reference information necessary to complete a given overall computational task specified through a simple SPARQL query. We demonstrate this capability by carrying out QSAR analysis backed by a simple formal ontology to determine whether a given molecule is drug-like. Further, we discuss parameter-based control over the execution of SADI SWS. Finally, we demonstrate the value of computational resource envelopment as SADI services through service reuse and ease of integration of computational functionality into formal ontologies. The work we present here may trigger a major paradigm shift in the distribution of computational resources in chemistry. We conclude that envelopment of chemical computational resources as SADI SWS facilitates interdisciplinary research by enabling the definition of computational problems in terms of ontologies and formal logical statements instead of cumbersome and application-specific tasks and workflows.

    View details for DOI 10.1186/1758-2946-3-16

    View details for Web of Science ID 000300226200001

    View details for PubMedID 21575200

  • A common layer of interoperability for biomedical ontologies based on OWL EL BIOINFORMATICS Hoehndorf, R., Dumontier, M., Oellrich, A., Wimalaratne, S., Rebholz-Schuhmann, D., Schofield, P., Gkoutos, G. V. 2011; 27 (7): 1001-1008


    Ontologies are essential in biomedical research due to their ability to semantically integrate content from different scientific databases and resources. Their application improves capabilities for querying and mining biological knowledge. An increasing number of ontologies is being developed for this purpose, and considerable effort is invested into formally defining them in order to represent their semantics explicitly. However, current biomedical ontologies do not facilitate data integration and interoperability yet, since reasoning over these ontologies is very complex and cannot be performed efficiently or is even impossible. We propose the use of less expressive subsets of ontology representation languages to enable efficient reasoning and achieve the goal of genuine interoperability between ontologies. We present and evaluate EL Vira, a framework that transforms OWL ontologies into the OWL EL subset, thereby enabling the use of tractable reasoning. We illustrate which OWL constructs and inferences are kept and lost following the conversion and demonstrate the performance gain of reasoning indicated by the significant reduction of processing time. We applied EL Vira to the open biomedical ontologies and provide a repository of ontologies resulting from this conversion. EL Vira creates a common layer of ontological interoperability that, for the first time, enables the creation of software solutions that can employ biomedical ontologies to perform inferences and answer complex queries to support scientific analyses. Availability and implementation: The EL Vira software is available from and converted OBO ontologies and their mappings are available from

    View details for DOI 10.1093/bioinformatics/btr058

    View details for Web of Science ID 000289162000017

    View details for PubMedID 21343142

  • The RNA Ontology (RNAO): An ontology for integrating RNA sequence and structure data APPLIED ONTOLOGY Hoehndorf, R., Batchelor, C., Bittner, T., Dumontier, M., Eilbeck, K., Knight, R., Mungall, C. J., Richardson, J. S., Stombaugh, J., Westhof, E., Zirbel, C. L., Leontis, N. B. 2011; 6 (1): 53-89
  • Computational approaches toward the design of pools for the in vitro selection of complex aptamers RNA-A PUBLICATION OF THE RNA SOCIETY Luo, X., McKeague, M., Pitre, S., Dumontier, M., Green, J., Golshani, A., DeRosa, M. C., Dehne, F. 2010; 16 (11): 2252-2262


    It is well known that using random RNA/DNA sequences for SELEX experiments will generally yield low-complexity structures. Early experimental results suggest that having a structurally diverse library, which, for instance, includes high-order junctions, may prove useful in finding new functional motifs. Here, we develop two computational methods to generate sequences that exhibit higher structural complexity and can be used to increase the overall structural diversity of initial pools for in vitro selection experiments. Random Filtering selectively increases the number of five-way junctions in RNA/DNA pools, and Genetic Filtering designs RNA/DNA pools to a specified structure distribution, whether uniform or otherwise. We show that using our computationally designed DNA pool greatly improves access to highly complex sequence structures for SELEX experiments (without losing our ability to select for common one-way and two-way junction sequences).

    View details for DOI 10.1261/rna.2102210

    View details for Web of Science ID 000283047900020

    View details for PubMedID 20870801

  • Relations as patterns: bridging the gap between OBO and OWL BMC BIOINFORMATICS Hoehndorf, R., Oellrich, A., Dumontier, M., Kelso, J., Rebholz-Schuhmann, D., Herre, H. 2010; 11


    Most biomedical ontologies are represented in the OBO Flatfile Format, which is an easy-to-use graph-based ontology language. The semantics of the OBO Flatfile Format 1.2 enforces a strict predetermined interpretation of relationship statements between classes. It does not allow flexible specifications that provide better approximations of the intuitive understanding of the considered relations. If relations cannot be accurately expressed then ontologies built upon them may contain false assertions and hence lead to false inferences. Ontologies in the OBO Foundry must formalize the semantics of relations according to the OBO Relationship Ontology (RO). Therefore, being able to accurately express the intended meaning of relations is of crucial importance. Since the Web Ontology Language (OWL) is an expressive language with a formal semantics, it is suitable to define the meaning of relations accurately. We developed a method to provide definition patterns for relations between classes using OWL and describe a novel implementation of the RO based on this method. We implemented our extension in software that converts ontologies in the OBO Flatfile Format to OWL, and also provide a prototype to extract relational patterns from OWL ontologies using automated reasoning. The conversion software is freely available at, and can be accessed via a web interface. Explicitly defining relations permits their use in reasoning software and leads to a more flexible and powerful way of representing biomedical ontologies. Using the extended language and semantics avoids several mistakes commonly made in formalizing biomedical ontologies, and can be used to automatically detect inconsistencies. The use of our method enables the use of graph-based ontologies in OWL, and makes complex OWL ontologies accessible in a graph-based form. Thereby, our method provides the means to gradually move the representation of biomedical ontologies into formal knowledge representation languages that incorporate an explicit semantics. Our method facilitates the use of OWL-based software in the back-end while ontology curators may continue to develop ontologies with an OBO-style front-end.

    View details for DOI 10.1186/1471-2105-11-441

    View details for Web of Science ID 000282631900001

    View details for PubMedID 20807438

  • Chemical entity semantic specification: Knowledge representation for efficient semantic cheminformatics and facile data integration ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY Chepelev, L. L., Dumontier, M. 2010; 240
  • Semantic envelopment of cheminformatics resources with SADI ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY Chepelev, L. L., Willighagen, E., Dumontier, M. 2010; 240
  • CHEMINF: Community-developed ontology of chemical information and algorithms ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY Chepelev, L. L., Hastings, J., Willighagen, E., Adams, N., Steinbeck, C., Murray-Rust, P., Dumontier, M. 2010; 240
  • Modeling and querying graphical representations of statistical data JOURNAL OF WEB SEMANTICS Dumontier, M., Ferres, L., Villanueva-Rosales, N. 2010; 8 (2-3): 241-254
  • Towards pharmacogenomics knowledge discovery with the semantic web BRIEFINGS IN BIOINFORMATICS Dumontier, M., Villanueva-Rosales, N. 2009; 10 (2): 153-163


    Pharmacogenomics aims to understand pharmacological response with respect to genetic variation. Essential to the delivery of better health care is the use of pharmacogenomics knowledge to answer questions about therapeutic, pharmacological or genetic aspects. Several XML markup languages have been developed to capture pharmacogenomic and related information so as to facilitate data sharing. However, recent advances in semantic web technologies have presented exciting new opportunities for pharmacogenomics knowledge discovery by representing the information with machine understandable semantics. Progress in this area is illustrated with reference to the personalized medicine project that aims to facilitate pharmacogenomics knowledge discovery through intuitive knowledge capture and sophisticated question answering using automated reasoning over expressive ontologies.

    View details for DOI 10.1093/bib/bbn056

    View details for Web of Science ID 000264388500005

    View details for PubMedID 19240125

  • yOWL: An ontology-driven knowledge base for yeast biologists JOURNAL OF BIOMEDICAL INFORMATICS Villanueva-Rosales, N., Dumontier, M. 2008; 41 (5): 779-789


    Knowledge management is an ongoing challenge for the biological community: large, diverse and continuously growing information requires more sophisticated methods to store, integrate and query knowledge. The semantic web initiative provides a new knowledge engineering framework to represent, share and discover information. In this paper, we describe our efforts towards the development of an ontology-based knowledge base, including aspects from ontology design and population using "semantic" data mashup, to automated reasoning and semantic query answering. Based on yeast data obtained from the Saccharomyces Genome Database and UniProt, we discuss the challenges encountered during the building of the knowledge base and how they were overcome.

    View details for DOI 10.1016/j.jbi.2008.05.001

    View details for Web of Science ID 000260137300010

    View details for PubMedID 18562252

  • Global investigation of protein-protein interactions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide sequences NUCLEIC ACIDS RESEARCH Pitre, S., North, C., Alamgir, M., Jessulat, M., Chan, A., Luo, X., Green, J. R., Dumontier, M., Dehne, F., Golshani, A. 2008; 36 (13): 4286-4294


    Protein-protein interaction (PPI) maps provide insight into cellular biology and have received considerable attention in the post-genomic era. While large-scale experimental approaches have generated large collections of experimentally determined PPIs, technical limitations preclude certain PPIs from detection. Recently, we demonstrated that yeast PPIs can be computationally predicted using re-occurring short polypeptide sequences between known interacting protein pairs. However, the computational requirements and low specificity made this method unsuitable for large-scale investigations. Here, we report an improved approach, which exhibits a specificity of approximately 99.95% and executes 16,000 times faster. Importantly, we report the first all-to-all sequence-based computational screen of PPIs in the yeast Saccharomyces cerevisiae, in which we identify 29,589 high confidence interactions of approximately 2 × 10^7 possible pairs. Of these, 14,438 PPIs have not been previously reported and may represent novel interactions. In particular, these results reveal a richer set of membrane protein interactions, not readily amenable to experimental investigations. From the novel PPIs, a novel putative protein complex comprised largely of membrane proteins was revealed. In addition, two novel gene functions were predicted and experimentally confirmed to affect the efficiency of non-homologous end-joining, providing further support for the usefulness of the identified PPIs in biological investigations.

    View details for DOI 10.1093/nar/gkn390

    View details for Web of Science ID 000257964500014

    View details for PubMedID 18586826

  • GridCell: a stochastic particle-based biological system simulator BMC SYSTEMS BIOLOGY Boulianne, L., Al Assaad, S., Dumontier, M., Gross, W. J. 2008; 2


    Realistic biochemical simulators aim to improve our understanding of many biological processes that would be otherwise very difficult to monitor in experimental studies. Increasingly accurate simulators may provide insights into the regulation of biological processes due to stochastic or spatial effects. We have developed GridCell as a three-dimensional simulation environment for investigating the behaviour of biochemical networks under a variety of spatial influences including crowding, recruitment and localization. GridCell enables the tracking and characterization of individual particles, leading to insights on the behaviour of low copy number molecules participating in signaling networks. The simulation space is divided into a discrete 3D grid that provides ideal support for particle collisions without distance calculation and particle search. SBML support enables existing networks to be simulated and visualized. The user interface provides intuitive navigation that facilitates insights into species behaviour across spatial and temporal dimensions. We demonstrate the effect of crowding on a Michaelis-Menten system. GridCell is an effective stochastic particle simulator designed to track the progress of individual particles in a three-dimensional space in which spatial influences such as crowding, co-localization and recruitment may be investigated.

    View details for DOI 10.1186/1752-0509-2-66

    View details for Web of Science ID 000258870400001

    View details for PubMedID 18651956
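
    The grid-based collision idea in the GridCell abstract above can be illustrated with a minimal sketch. It is not GridCell's implementation: the `Grid` class, its method names and the periodic (wrap-around) boundaries are all assumptions made for illustration. The point it shows is that, on a discrete lattice, a collision is detected by a single occupancy lookup at the target cell rather than by computing distances or searching for neighbouring particles.

```python
class Grid:
    """Toy cubic lattice holding at most one particle per cell (hypothetical helper)."""

    def __init__(self, size):
        self.size = size
        self.cells = {}  # maps (x, y, z) -> particle id

    def place(self, particle_id, pos):
        """Put a particle on an empty cell."""
        if pos in self.cells:
            raise ValueError(f"cell {pos} is already occupied")
        self.cells[pos] = particle_id

    def try_move(self, pos, step):
        """Move the particle at `pos` by a non-zero `step`.

        Returns the id of the particle it collided with, or None if the
        target cell was free and the move happened. Boundaries wrap around,
        which is an assumption of this sketch.
        """
        target = tuple((p + d) % self.size for p, d in zip(pos, step))
        occupant = self.cells.get(target)  # one lookup, no distance calculation
        if occupant is not None:
            return occupant
        self.cells[target] = self.cells.pop(pos)
        return None


grid = Grid(4)
grid.place("enzyme", (0, 0, 0))
grid.place("substrate", (1, 0, 0))
hit = grid.try_move((0, 0, 0), (1, 0, 0))    # target occupied: collision
moved = grid.try_move((0, 0, 0), (0, 1, 0))  # target free: particle moves
```

    A simulator built on this idea would decide, on each collision, whether a reaction fires; that reaction logic is deliberately left out here.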

  • Computational Methods For Predicting Protein-Protein Interactions PROTEIN - PROTEIN INTERACTION Pitre, S., Alamgir, M., Green, J. R., Dumontier, M., Dehne, F., Golshani, A. 2008; 110: 247-267


    Protein-protein interactions (PPIs) play a critical role in many cellular functions. A number of experimental techniques have been applied to discover PPIs; however, these techniques are expensive in terms of time, money, and expertise. There are also large discrepancies between the PPI data collected by the same or different techniques in the same organism. We therefore turn to computational techniques for the prediction of PPIs. Computational techniques have been applied to the collection, indexing, validation, analysis, and extrapolation of PPI data. This chapter will focus on computational prediction of PPI, reviewing a number of techniques including PIPE, developed in our own laboratory. For comparison, the conventional large-scale approaches to predict PPIs are also briefly discussed. The chapter concludes with a discussion of the limitations of both experimental and computational methods of determining PPIs.

    View details for DOI 10.1007/10_2007_089

    View details for Web of Science ID 000260375400011

    View details for PubMedID 18202838

  • Semantic Annotation and Question Answering of Statistical Graphs MICAI 2008: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS Dumontier, M., Ferres, L., Villanueva-Rosales, N. 2008; 5317: 100-110
  • Chemical knowledge for the Semantic Web DATA INTEGRATION IN THE LIFE SCIENCES, PROCEEDINGS Konyk, M., De Leon, A., Dumontier, M. 2008; 5109: 169-176
  • Semantic Query Answering with Time-Series Graphs 2007 11TH IEEE INTERNATIONAL ENTERPRISE DISTRIBUTED OBJECT COMPUTING CONFERENCE WORKSHOPS Ferres, L., Dumontier, M., Villanueva-Rosales, N. 2007: 117-124
  • Domain-based small molecule binding site annotation BMC BIOINFORMATICS Snyder, K. A., Feldman, H. J., Dumontier, M., Salama, J. J., Hogue, C. W. 2006; 7


    Accurate small molecule binding site information for a protein can facilitate studies in drug docking, drug discovery and function prediction, but small molecule binding site protein sequence annotation is sparse. The Small Molecule Interaction Database (SMID), a database of protein domain-small molecule interactions, was created using structural data from the Protein Data Bank (PDB). More importantly, it provides a means to predict small molecule binding sites on proteins with a known or unknown structure and, unlike prior approaches, removes large numbers of false positive hits arising from transitive alignment errors, non-biologically significant small molecules and crystallographic conditions that overpredict ion binding sites. Using a set of co-crystallized protein-small molecule structures as a starting point, SMID interactions were generated by identifying protein domains that bind to small molecules, using NCBI's Reverse Position Specific BLAST (RPS-BLAST) algorithm. SMID records are available for viewing online. The SMID-BLAST tool provides accurate transitive annotation of small-molecule binding sites for proteins not found in the PDB. Given a protein sequence, SMID-BLAST identifies domains using RPS-BLAST and then lists potential small molecule ligands based on SMID records, as well as their aligned binding sites. A heuristic ligand score is calculated based on E-value, ligand residue identity and domain entropy to assign a level of confidence to hits found. SMID-BLAST predictions were validated against a set of 793 experimental small molecule interactions from the PDB, of which 472 (60%) of predicted interactions identically matched the experimental small molecule and of these, 344 had greater than 80% of the binding site residues correctly identified. Further, we estimate that 45% of predictions which were not observed in the PDB validation set may be true positives. By focusing on protein domain-small molecule interactions, SMID is able to cluster similar interactions and detect subtle binding patterns that would not otherwise be obvious. Using SMID-BLAST, small molecule targets can be predicted for any protein sequence, with the only limitation being that the small molecule must exist in the PDB. Validation results and specific examples within illustrate that SMID-BLAST has a high degree of accuracy in terms of predicting both the small molecule ligand and binding site residue positions for a query protein.

    View details for DOI 10.1186/1471-2105-7-152

    View details for Web of Science ID 000236762000001

    View details for PubMedID 16545112

  • CO: A chemical ontology for identification of functional groups and semantic comparison of small molecules FEBS LETTERS Feldman, H. J., Dumontier, M., Ling, S., Haider, N., Hogue, C. W. 2005; 579 (21): 4685-4691


    A novel chemical ontology based on chemical functional groups automatically, objectively assigned by a computer program, was developed to categorize small molecules. It has been applied to PubChem and the small molecule interaction database to demonstrate its utility as a basic pharmacophore search system. Molecules can be compared using a semantic similarity score based on functional group assignments rather than 3D shape, which succeeds in identifying small molecules known to bind a common binding site. This ontology will serve as a powerful tool for searching chemical databases and identifying key functional groups responsible for biological activities.

    View details for DOI 10.1016/j.febslet.2005.07.039

    View details for Web of Science ID 000231625800022

    View details for PubMedID 16098521

  • Armadillo: Domain boundary prediction by amino acid composition JOURNAL OF MOLECULAR BIOLOGY Dumontier, M., Yao, R., Feldman, H. J., Hogue, C. W. 2005; 350 (5): 1061-1073


    The identification and annotation of protein domains provides a critical step in the accurate determination of molecular function. Both computational and experimental methods of protein structure determination may be deterred by large multi-domain proteins or flexible linker regions. Knowledge of domains and their boundaries may reduce the experimental cost of protein structure determination by allowing researchers to work on a set of smaller and possibly more successful alternatives. Current domain prediction methods often rely on sequence similarity to conserved domains and as such are poorly suited to detect domain structure in poorly conserved or orphan proteins. We present here a simple computational method to identify protein domain linkers and their boundaries from sequence information alone. Our domain predictor, Armadillo, uses any amino acid index to convert a protein sequence to a smoothed numeric profile from which domains and domain boundaries may be predicted. We derived an amino acid index called the domain linker propensity index (DLI) from the amino acid composition of domain linkers using a non-redundant structure dataset. The index indicates that Pro and Gly show a propensity for linker residues while small hydrophobic residues do not. Armadillo predicts domain linker boundaries from Z-score distributions and obtains 35% sensitivity with DLI in a two-domain, single-linker dataset (within ±20 residues from linker). The combination of DLI and an entropy-based amino acid index increases the overall Armadillo sensitivity to 56% for two-domain proteins. Moreover, Armadillo achieves 37% sensitivity for multi-domain proteins, surpassing most other prediction methods. Armadillo provides a simple but effective method by which prediction of domain boundaries can be obtained with reasonable sensitivity. Armadillo should prove to be a valuable tool for rapidly delineating protein domains in poorly conserved proteins or those with no sequence neighbors. As a first-line predictor, domain meta-predictors could yield improved results with Armadillo predictions.

    View details for DOI 10.1016/j.jmb.2005.05.037

    View details for Web of Science ID 000230701300019

    View details for PubMedID 15978619
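
    The profile-and-Z-score idea described in the Armadillo abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the published method: the index values in `TOY_INDEX` are invented (they are not the DLI), and the window size, the threshold and every function name are hypothetical. The sketch maps each residue to an index value, smooths the values with a sliding-window mean, converts the profile to Z-scores and reports positions that stand out.

```python
from statistics import mean, pstdev

# Invented values for illustration only; NOT the published DLI.
TOY_INDEX = {"P": 1.0, "G": 0.8, "A": -0.5, "V": -0.6, "L": -0.4}


def smoothed_profile(seq, index, window=5, default=0.0):
    """Map residues to index values, then take a centred sliding-window mean."""
    raw = [index.get(aa, default) for aa in seq]
    half = window // 2
    profile = []
    for i in range(len(raw)):
        lo, hi = max(0, i - half), min(len(raw), i + half + 1)
        profile.append(mean(raw[lo:hi]))  # window is truncated at the ends
    return profile


def z_scores(profile):
    """Standardise a profile to zero mean and unit (population) deviation."""
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        return [0.0] * len(profile)
    return [(x - mu) / sigma for x in profile]


def candidate_linker_positions(seq, index, window=5, threshold=1.5):
    """Return positions whose smoothed Z-score reaches the threshold."""
    z = z_scores(smoothed_profile(seq, index, window))
    return [i for i, value in enumerate(z) if value >= threshold]


# Two hydrophobic "domains" joined by a Pro/Gly-rich stretch at positions 10-15.
toy_sequence = "AVLAVLAVLA" + "PGPGPG" + "AVLAVLAVLA"
hits = candidate_linker_positions(toy_sequence, TOY_INDEX)
```

    On this toy sequence the reported positions fall inside the Pro/Gly-rich stretch; real sequences would need the published index and a calibrated threshold.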

  • The Biomolecular Interaction Network Database and related tools 2005 update NUCLEIC ACIDS RESEARCH Alfarano, C., Andrade, C. E., Anthony, K., Bahroos, N., Bajec, M., Bantoft, K., Betel, D., Bobechko, B., Boutilier, K., Burgess, E., Buzadzija, K., Cavero, R., D'Abreo, C., Donaldson, I., Dorairajoo, D., Dumontier, M. J., Dumontier, M. R., Earles, V., Farrall, R., Feldman, H., Garderman, E., Gong, Y., Gonzaga, R., Grytsan, V., Gryz, E., Gu, V., Haldorsen, E., Halupa, A., Haw, R., Hrvojic, A., Hurrell, L., Isserlin, R., Jack, F., Juma, F., Khan, A., Kon, T., Konopinsky, S., Le, V., Lee, E., Ling, S., Magidin, M., Moniakis, J., Montojo, J., Moore, S., Muskat, B., Ng, I., Paraiso, J. P., Parker, B., Pintilie, G., Pirone, R., Salama, J. J., Sgro, S., Shan, T., Shu, Y., Siew, J., Skinner, D., Snyder, K., Stasiuk, R., Strumpf, D., Tuekam, B., Tao, S., Wang, Z., White, M., Willis, R., Wolting, C., Wong, S., Wrong, A., Xin, C., Yao, R., Yates, B., Zhang, S., Zheng, K., Pawson, T., Ouellette, B. F., Hogue, C. W. 2005; 33: D418-D424


    The Biomolecular Interaction Network Database (BIND) archives biomolecular interaction, reaction, complex and pathway information. Our aim is to curate the details about molecular interactions that arise from published experimental research and to provide this information, as well as tools to enable data analysis, freely to researchers worldwide. BIND data are curated into a comprehensive machine-readable archive of computable information and provide users with methods to discover interactions and molecular mechanisms. BIND has worked to develop new methods for visualization that amplify the underlying annotation of genes and proteins to facilitate the study of molecular interaction networks. BIND has maintained an open database policy since its inception in 1999. Data growth has proceeded at a tremendous rate, approaching over 100 000 records. New services provided include a new BIND Query and Submission interface, a Simple Object Access Protocol service and the Small Molecule Interaction Database, which allows users to determine probable small molecule binding sites of new sequences and examine conserved binding residues.

    View details for DOI 10.1093/nar/gki051

    View details for Web of Science ID 000226524300086

    View details for PubMedID 15608229

  • Hardware-accelerated protein identification for mass spectrometry RAPID COMMUNICATIONS IN MASS SPECTROMETRY Alex, A. T., Dumontier, M., Rose, J. S., Hogue, C. W. 2005; 19 (6): 833-837


    An ongoing issue in mass spectrometry is the time it takes to search DNA sequences with MS/MS peptide fragments (see, e.g., Choudary et al., Proteomics 2001; 1: 651-667). Search times are far longer than spectra acquisition time, and parallelization of search software on clusters requires doubling the size of a conventional computing cluster to cut the search time in half. Field programmable gate arrays (FPGAs) are used to create hardware-accelerated algorithms that reduce operating costs and improve search speed compared to large clusters. We present a novel hardware design that takes full spectra and computes 6-frame translation word searches on DNA databases at a rate of approximately 3 billion base pairs per second, with queries of up to 10 amino acids in length and arbitrary wildcard positions. Hardware post-processing identifies in silico tryptic peptides and scores them using a variety of techniques including mass frequency expected values. With faster FPGAs, protein identifications from the human genome can be achieved in less than a second, which makes this an ideal solution for a number of proteome-scale applications.

    View details for DOI 10.1002/rcm.1853

    View details for Web of Science ID 000227519500013

    View details for PubMedID 15723443
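
    For readers unfamiliar with the operation that the hardware in the abstract above accelerates, here is a plain software sketch of a 6-frame translation word search with wildcard positions. It says nothing about the FPGA design itself. The function names and the `?` wildcard character are assumptions of this sketch; the codon table is the standard genetic code, and incomplete or ambiguous codons are rendered as `X`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations keyed by frame: +1..+3 forward, -1..-3 reverse."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def find_word(dna, query, wildcard="?"):
    """List (frame, protein_position) pairs where `query` matches.

    A wildcard character in the query matches any residue.
    """
    hits = []
    for frame, protein in six_frames(dna).items():
        for start in range(len(protein) - len(query) + 1):
            window = protein[start:start + len(query)]
            if all(q == wildcard or q == p for q, p in zip(query, window)):
                hits.append((frame, start))
    return hits


frames = six_frames("ATGGCCAAA")
matches = find_word("ATGGCCAAA", "M?K")
```

    The software version scans every frame and offset in turn, which is the work a hardware design can perform in parallel.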

  • Species-specific protein sequence and fold optimizations BMC BIOINFORMATICS Dumontier, M., Michalickova, K., Hogue, C. W. 2002; 3


    An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes. Environmental niche was determined to be a significant factor in variability from correspondence analysis using the amino acid composition of over 360,000 predicted open reading frames (ORFs) from 17 archaea, 76 bacteria and 7 eukaryote complete genomes. Additionally, we found clusters of phylogenetically unrelated archaea and bacteria that share similar environments by amino acid composition clustering. Composition analyses of conservative, domain-based homology modeling suggested an enrichment of small hydrophobic residues Ala, Gly, Val and charged residues Asp, Glu, His and Arg across all genomes. However, larger aromatic residues Phe, Trp and Tyr are reduced in folds, and these results were not affected by low complexity biases. We derived two simple log-odds scoring functions from ORFs (CG) and folds (CF) for each of the complete genomes. CF achieved an average cross-validation success rate of 85 ± 8% whereas the CG detected 73 ± 9% species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available online. Analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events.

    View details for Web of Science ID 000181476800039

    View details for PubMedID 12487631
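
    The abstract above mentions log-odds scoring functions derived from amino acid composition but does not give their form. One common way to build such a score, offered purely as an assumption, is the log of the ratio between a residue's frequency in one genome and its frequency in a background set, summed over a sequence. The function names and the pseudocount below are hypothetical.

```python
from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences, pseudocount=1.0):
    """Residue frequencies over a set of sequences, smoothed by a pseudocount."""
    counts = Counter(aa for seq in sequences for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log_odds_table(genome_sequences, background_sequences):
    """Per-residue log-odds of the genome composition against the background."""
    foreground = composition(genome_sequences)
    background = composition(background_sequences)
    return {aa: log(foreground[aa] / background[aa]) for aa in AMINO_ACIDS}


def score(sequence, table):
    """Sum per-residue log-odds; residues missing from the table add zero."""
    return sum(table.get(aa, 0.0) for aa in sequence)


# Two toy "genomes" with very different residue usage.
genome_a = ["KKKKEEEE"]
genome_b = ["AAAAGGGG"]
table_a = log_odds_table(genome_a, genome_a + genome_b)
score_like_a = score("KEKE", table_a)
score_like_b = score("AGAG", table_a)
```

    A sequence would then be assigned to whichever genome's table gives it the highest score, which mirrors the "competing against all other" comparison described in the abstract.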

  • NBLAST: a cluster variant of BLAST for NxN comparisons BMC BIOINFORMATICS Dumontier, M., Hogue, C. W. 2002; 3


    The BLAST algorithm compares biological sequences to one another in order to determine shared motifs and common ancestry. However, the comparison of all non-redundant (NR) sequences against all other NR sequences is a computationally intensive task. We developed NBLAST as a cluster computer implementation of the BLAST family of sequence comparison programs for the purpose of generating pre-computed BLAST alignments and neighbour lists of NR sequences. NBLAST performs the heuristic BLAST algorithm and generates an exhaustive database of alignments, but it only computes the upper triangle of a possible N^2 alignments, where N is the number of sequences to be compared. A task-partitioning algorithm allows for cluster computing across all cluster nodes and the NBLAST master process produces a BLAST sequence alignment database and a list of sequence neighbours for each sequence record. The resulting sequence alignment and neighbour databases are used to serve the SeqHound query system through a C/C++ and Perl Application Programming Interface (API). NBLAST offers a local alternative to the NCBI's remote Entrez system for pre-computed BLAST alignments and neighbour queries. On our 216-processor 450 MHz PIII cluster, NBLAST requires ~24 hrs to compute neighbours for 850,000 proteins currently in the non-redundant protein database.

    View details for Web of Science ID 000181476800013

    View details for PubMedID 12019022
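
    The "upper triangle" idea in the NBLAST abstract above can be made concrete with a small sketch. It is not NBLAST's task-partitioning algorithm, which the abstract does not describe: the round-robin scheme, the exclusion of self-comparisons and both function names are assumptions. It shows how comparing each unordered pair once roughly halves the work of an all-against-all comparison and how the pairs might be shared between workers.

```python
from itertools import combinations


def upper_triangle_pairs(n):
    """Yield index pairs (i, j) with i < j: the strict upper triangle
    of an n x n comparison matrix (self-comparisons excluded)."""
    return combinations(range(n), 2)


def partition_pairs(n, workers):
    """Deal the pairs out round-robin to `workers` task lists.

    Round-robin is an illustrative choice only.
    """
    tasks = [[] for _ in range(workers)]
    for k, pair in enumerate(upper_triangle_pairs(n)):
        tasks[k % workers].append(pair)
    return tasks


pairs = list(upper_triangle_pairs(5))
tasks = partition_pairs(5, 3)
```

    For 5 sequences this yields 10 pairs instead of 25 comparisons, and the three task lists differ in size by at most one pair.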

  • SeqHound: biological sequence and structure database as a platform for bioinformatics research BMC BIOINFORMATICS Michalickova, K., Bader, G. D., Dumontier, M., Lieu, H., Betel, D., Isserlin, R., Hogue, C. W. 2002; 3


    SeqHound has been developed as an integrated biological sequence, taxonomy, annotation and 3-D structure database system. It provides a high-performance server platform for bioinformatics research in a locally-hosted environment. SeqHound is based on the National Center for Biotechnology Information data model and programming tools. It offers daily updated contents of all Entrez sequence databases in addition to 3-D structural data and information about sequence redundancies, sequence neighbours, taxonomy, complete genomes, functional annotation including Gene Ontology terms and literature links to PubMed. SeqHound is accessible via a web server through a Perl, C or C++ remote API or an optimized local API. It provides functionality necessary to retrieve specialized subsets of sequences, structures and structural domains. Sequences may be retrieved in FASTA, GenBank, ASN.1 and XML formats. Structures are available in ASN.1, XML and PDB formats. Emphasis has been placed on complete genomes, taxonomy, domain and functional annotation as well as 3-D structural functionality in the API, while fielded text indexing functionality remains under development. SeqHound also offers a streamlined WWW interface for simple web-user queries. The system has proven useful in several published bioinformatics projects such as the BIND database and offers a cost-effective infrastructure for research. SeqHound will continue to develop and be provided as a service of the Blueprint Initiative at the Samuel Lunenfeld Research Institute. The source code and examples are available under the terms of the GNU public license at the Sourceforge site in the SLRI Toolkit.

    View details for Web of Science ID 000181476800032

    View details for PubMedID 12401134
