Dennis Wall is an Associate Professor of Pediatrics at the Stanford University School of Medicine, where his lab is developing novel approaches in systems biology to decipher the molecular pathology of autism spectrum disorder and related neurological disorders.

Dr. Wall received his doctorate in Integrative Biology from the University of California, Berkeley, where he pioneered the use of fast-evolving gene sequences to trace population-scale diversification across islands. Then, with a postdoctoral fellowship award from the National Science Foundation, he went on to Stanford University to address broader questions in systems biology and computational genomics, work that resulted in comprehensive functional models for both protein mutation and protein interaction.

Dr. Wall has served as a science advisor to several biotechnology and pharmaceutical companies, has developed cutting-edge approaches to cloud computing, and has received numerous awards, including an NSF postdoctoral fellowship, the Fred R. Cagle Award for Outstanding Achievement in Biology, the Vice Chancellor's Award for Research, three awards for excellence in teaching, and the Harvard Medical School Leadership Award.

Professional Education

  • Fellow, Stanford University, Biological Informatics (2003)
  • Ph.D., University of California, Berkeley, Integrative Biology (2001)

Research & Scholarship

Current Research and Scholarly Interests

Systems biology for design of clinical solutions that detect and treat disease



All Publications

  • COSMOS: Python library for massively parallel workflows BIOINFORMATICS Gafni, E., Luquette, L. J., Lancaster, A. K., Hawkins, J. B., Jung, J., Souilmi, Y., Wall, D. P., Tonellato, P. J. 2014; 30 (20): 2956-2958


    Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services. Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at [...]. Data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btu385

    View details for Web of Science ID 000343083600015

    View details for PubMedID 24982428

  • A framework for the interpretation of de novo mutation in human disease NATURE GENETICS Samocha, K. E., Robinson, E. B., Sanders, S. J., Stevens, C., Sabo, A., McGrath, L. M., Kosmicki, J. A., Rehnstrom, K., Mallick, S., Kirby, A., Wall, D. P., MacArthur, D. G., Gabriel, S. B., DePristo, M., Purcell, S. M., Palotie, A., Boerwinkle, E., Buxbaum, J. D., Cook, E. H., Gibbs, R. A., Schellenberg, G. D., Sutcliffe, J. S., Devlin, B., Roeder, K., Neale, B. M., Daly, M. J. 2014; 46 (9): 944-?


    Spontaneously arising (de novo) mutations have an important role in medical genetics. For diseases with extensive locus heterogeneity, such as autism spectrum disorders (ASDs), the signal from de novo mutations is distributed across many genes, making it difficult to distinguish disease-relevant mutations from background variation. Here we provide a statistical framework for the analysis of excesses in de novo mutation per gene and gene set by calibrating a model of de novo mutation. We applied this framework to de novo mutations collected from 1,078 ASD family trios, and, whereas we affirmed a significant role for loss-of-function mutations, we found no excess of de novo loss-of-function mutations in cases with IQ above 100, suggesting that the role of de novo mutations in ASDs might reside in fundamental neurodevelopmental processes. We also used our model to identify ∼1,000 genes that are significantly lacking in functional coding variation in non-ASD samples and are enriched for de novo loss-of-function mutations identified in ASD cases.

    View details for DOI 10.1038/ng.3050

    View details for Web of Science ID 000341579400007

    View details for PubMedID 25086666

  • Evaluating the critical source area concept of phosphorus loss from soils to water-bodies in agricultural catchments. The Science of the total environment Shore, M., Jordan, P., Mellander, P., Kelly-Quinn, M., Wall, D. P., Murphy, P. N., Melland, A. R. 2014; 490: 405-415


    Using data collected from six basins located across two hydrologically contrasting agricultural catchments, this study investigated whether transport metrics alone provide better estimates of storm phosphorus (P) loss from basins than critical source area (CSA) metrics which combine source factors as well. Concentrations and loads of P in quickflow (QF) were measured at basin outlets during four storm events and were compared with dynamic (QF magnitude) and static (extent of highly-connected, poorly-drained soils) transport metrics and a CSA metric (extent of highly-connected, poorly-drained soils with excess plant-available P). Pairwise comparisons between basins with similar CSA risks but contrasting QF magnitudes showed that QF flow-weighted mean TRP (total molybdate-reactive P) concentrations and loads were frequently (at least 11 of 14 comparisons) more than 40% higher in basins with the highest QF magnitudes. Furthermore, static transport metrics reliably discerned relative QF magnitudes between these basins. However, particulate P (PP) concentrations were often (6 of 14 comparisons) higher in basins with the lowest QF magnitudes, most likely due to soil-management activities (e.g. ploughing), in these predominantly arable basins at these times. Pairwise comparisons between basins with contrasting CSA risks and similar QF magnitudes showed that TRP and PP concentrations and loads did not reflect trends in CSA risk or QF magnitude. Static transport metrics did not discern relative QF magnitudes between these basins. In basins with contrasting transport risks, storm TRP concentrations and loads were well differentiated by dynamic or static transport metrics alone, regardless of differences in soil P. In basins with similar transport risks, dynamic transport metrics and P source information additional to soil P may be required to predict relative storm TRP concentrations and loads. 
    Regardless of differences in transport risk, information on land use and management may be required to predict relative differences in storm PP concentrations between these agricultural basins.

    View details for DOI 10.1016/j.scitotenv.2014.04.122

    View details for PubMedID 24863139

  • A literature search tool for intelligent extraction of disease-associated genes JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Jung, J., DeLuca, T. F., Nelson, T. H., Wall, D. P. 2014; 21 (3): 399-405


    To extract disorder-associated genes from the scientific literature in PubMed with greater sensitivity for literature-based support than existing methods. We developed a PubMed query to retrieve disorder-related, original research articles. Then we applied a rule-based text-mining algorithm with keyword matching to extract target disorders, genes with significant results, and the type of study described by the article. We compared our resulting candidate disorder genes and supporting references with existing databases. We demonstrated that our candidate gene set covers nearly all genes in manually curated databases, and that the references supporting the disorder-gene link are more extensive and accurate than other general purpose gene-to-disorder association databases. We implemented a novel publication search tool to find target articles, specifically focused on links between disorders and genotypes. Through comparison against gold-standard manually updated gene-disorder databases and comparison with automated databases of similar functionality we show that our tool can search through the entirety of PubMed to extract the main gene findings for human diseases rapidly and accurately.

    View details for DOI 10.1136/amiajnl-2012-001563

    View details for Web of Science ID 000334611600003

    View details for PubMedID 23999671

  • The Potential of Accelerating Early Detection of Autism through Content Analysis of YouTube Videos. PloS one Fusaro, V. A., Daniels, J., Duda, M., DeLuca, T. F., D'Angelo, O., Tamburello, J., Maniscalco, J., Wall, D. P. 2014; 9 (4)


    Autism is on the rise, with 1 in 88 children receiving a diagnosis in the United States, yet the process for diagnosis remains cumbersome and time-consuming. Research has shown that home videos of children can help increase the accuracy of diagnosis. However, the use of videos in the diagnostic process is uncommon. In the present study, we assessed the feasibility of applying a gold-standard diagnostic instrument to brief and unstructured home videos and tested whether video analysis can enable more rapid detection of the core features of autism outside of clinical environments. We collected 100 public videos from YouTube of children ages 1-15 with either a self-reported diagnosis of an ASD (N = 45) or not (N = 55). Four non-clinical raters independently scored all videos using one of the most widely adopted tools for behavioral diagnosis of autism, the Autism Diagnostic Observation Schedule-Generic (ADOS). The classification accuracy was 96.8%, with 94.1% sensitivity and 100% specificity, the inter-rater correlation for the behavioral domains on the ADOS was 0.88, and the diagnoses matched a trained clinician in all but 3 of 22 randomly selected video cases. Despite the diversity of videos and non-clinical raters, our results indicate that it is possible to achieve high classification accuracy, sensitivity, and specificity as well as clinically acceptable inter-rater reliability with nonclinical personnel. Our results also demonstrate the potential for video-based detection of autism in short, unstructured home videos and further suggest that at least a percentage of the effort associated with detection and monitoring of autism may be mobilized and moved outside of traditional clinical environments.

    View details for DOI 10.1371/journal.pone.0093533

    View details for PubMedID 24740236

  • Testing the accuracy of an observation-based classifier for rapid detection of autism risk. Translational psychiatry Duda, M., Kosmicki, J. A., Wall, D. P. 2014; 4


    Current approaches for diagnosing autism have high diagnostic validity but are time consuming and can contribute to delays in arriving at an official diagnosis. In a pilot study, we used machine learning to derive a classifier that represented a 72% reduction in length from the gold-standard Autism Diagnostic Observation Schedule-Generic (ADOS-G), while retaining >97% statistical accuracy. The pilot study focused on a relatively small sample of children with and without autism. The present study sought to further test the accuracy of the classifier (termed the observation-based classifier (OBC)) on an independent sample of 2616 children scored using ADOS from five data repositories and including both spectrum (n=2333) and non-spectrum (n=283) individuals. We tested OBC outcomes against the outcomes provided by the original and current ADOS algorithms, the best estimate clinical diagnosis, and the comparison score severity metric associated with ADOS-2. The OBC was significantly correlated with the ADOS-G (r=-0.814) and ADOS-2 (r=-0.779) and exhibited >97% sensitivity and >77% specificity in comparison to both ADOS algorithm scores. The correspondence to the best estimate clinical diagnosis was also high (accuracy=96.8%), with sensitivity of 97.1% and specificity of 83.3%. The correlation between the OBC score and the comparison score was significant (r=-0.628), suggesting that the OBC provides both a classification as well as a measure of severity of the phenotype. These results further demonstrate the accuracy of the OBC and suggest that reductions in the process of detecting and monitoring autism are possible.

    View details for DOI 10.1038/tp.2014.65

    View details for PubMedID 25116834

  • Responding to a Diagnosis of Localized Prostate Cancer: Men's Experiences of Normal Distress During the First 3 Postdiagnostic Months CANCER NURSING Wall, D. P., Kristjanson, L. J., Fisher, C., Boldy, D., Kendall, G. E. 2013; 36 (6): E44-E50


    Men experience localized prostate cancer (PCa) as aversive and distressing. Little research has studied the distress men experience as a normal response to PCa, or how they manage this distress during the early stages of the illness. The objective of this study was to explore the experience of men diagnosed with localized PCa during their first postdiagnostic year. This constructivist qualitative study interviewed 8 men between the ages of 44 and 77 years, in their homes, on 2 occasions during the first 3 postdiagnostic months. Individual, in-depth semistructured interviews were used to collect the data. After an initial feeling of shock, the men in this study worked diligently to camouflage their experience of distress through hiding and attenuating their feelings and minimizing the severity of PCa. Men silenced distress because they believed it was expected of them. Maintaining silence allowed men to protect their strong and stoic self-image. This stereotype, of the strong and stoic man, prevented men from expressing their feelings of distress and from seeking support from family and friends and health professionals. It is important for nurses to acknowledge and recognize the normal distress experienced by men as a result of a PCa diagnosis. Hence, nurses must learn to identify the ways in which men avoid expressing their distress and develop early supportive relationships that encourage them to express and subsequently manage it.

    View details for DOI 10.1097/NCC.0b013e3182747bef

    View details for Web of Science ID 000326532000006

    View details for PubMedID 23154517

  • Quantification of Phosphorus Transport from a Karstic Agricultural Watershed to Emerging Spring Water ENVIRONMENTAL SCIENCE & TECHNOLOGY Mellander, P., Jordan, P., Melland, A. R., Murphy, P. N., Wall, D. P., Mechan, S., Meehan, R., Kelly, C., Shine, O., Shortle, G. 2013; 47 (12): 6111-6119


    The degree to which waters in a given watershed will be affected by nutrient export can be defined as that watershed's nutrient vulnerability. This study applied concepts of specific phosphorus (P) vulnerability to develop intrinsic groundwater vulnerability risk assessments in a 32 km² karst watershed (spring zone of contribution) in a relatively intensive agricultural landscape. To explain why emergent spring water was below an ecological impairment threshold, concepts of P attenuation potential were investigated along the nutrient transfer continuum based on soil P buffering, depth to bedrock, and retention within the aquifer. Surface karst features, such as enclosed depressions, were reclassified based on P attenuation potential in soil at the base. New techniques of high temporal resolution monitoring of P loads in the emergent spring made it possible to estimate P transfer pathways and retention within the aquifer and indicated small-medium fissure flows to be the dominant pathway, delivering 52-90% of P loads during storm events. Annual total P delivery to the main emerging spring was 92.7 and 138.4 kg total P (and 52.4 and 91.3 kg as total reactive P) for two monitored years, respectively. A revised groundwater vulnerability assessment was used to produce a specific P vulnerability map that used the soil and hydrogeological P buffering potential of the watershed as key assumptions in moderating P export to the emergent spring. Using this map and soil P data, the definition of critical source areas in karst landscapes was demonstrated.

    View details for DOI 10.1021/es304909y

    View details for Web of Science ID 000320749000007

    View details for PubMedID 23672730

  • Haplotype structure enables prioritization of common markers and candidate genes in autism spectrum disorder TRANSLATIONAL PSYCHIATRY Vardarajan, B. N., Eran, A., Jung, J., Kunkel, L. M., Wall, D. P. 2013; 3


    Autism spectrum disorder (ASD) is a neurodevelopmental condition that results in behavioral, social and communication impairments. ASD has a substantial genetic component, with 88-95% trait concordance among monozygotic twins. Efforts to elucidate the causes of ASD have uncovered hundreds of susceptibility loci and candidate genes. However, owing to its polygenic nature and clinical heterogeneity, only a few of these markers represent clear targets for further analyses. In the present study, we used the linkage structure associated with published genetic markers of ASD to simultaneously improve candidate gene detection while providing a means of prioritizing markers of common genetic variation in ASD. We first mined the literature for linkage and association studies of single-nucleotide polymorphisms, copy-number variations and multi-allelic markers in Autism Genetic Resource Exchange (AGRE) families. From markers that reached genome-wide significance, we calculated male-specific genetic distances, in light of the observed strong male bias in ASD. Four of 67 autism-implicated regions, 3p26.1, 3p26.3, 3q25-27 and 5p15, were enriched with differentially expressed genes in blood and brain from individuals with ASD. Of 30 genes differentially expressed across multiple expression data sets, 21 were within 10 cM of an autism-implicated locus. Among them, CNTN4, CADPS2, SUMF1, SLC9A9, NTRK3 have been previously implicated in autism, whereas others have been implicated in neurological disorders comorbid with ASD. This work leverages the rich multimodal genomic information collected on AGRE families to present an efficient integrative strategy for prioritizing autism candidates and improving our understanding of the relationships among the vast collection of past genetic studies.

    View details for DOI 10.1038/tp.2013.38

    View details for Web of Science ID 000321184400008

    View details for PubMedID 23715297

  • Genomics-Informed Pathology SCIENTIST Wall, D. P., Tonellato, P. J. 2013; 27 (1): 22-23

  • Genetic Networks of Complex Disorders: from a Novel Search Engine for PubMed Article Database. AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Science Jung, J., Wall, D. P. 2013; 2013: 99-?


    Finding genetic risk factors of complex disorders may involve reviewing hundreds of genes or thousands of research articles iteratively, but few tools have been available to facilitate this procedure. In this work, we built a novel publication search engine that can identify target-disorder specific, genetics-oriented research articles and extract the genes with significant results. Preliminary test results showed that the output of this engine has better coverage, in terms of genes or publications, than other existing applications. We consider it an essential tool for understanding genetic networks of complex disorders.

    View details for PubMedID 24303309

  • Streaming Support for Data Intensive Cloud-Based Sequence Analysis BIOMED RESEARCH INTERNATIONAL Issa, S. A., Kienzler, R., El-Kalioby, M., Tonellato, P. J., Wall, D., Bruggmann, R., Abouelhoda, M. 2013


    Cloud computing provides a promising solution to the genomics data deluge problem resulting from the advent of next-generation sequencing (NGS) technology. Based on the concepts of "resources-on-demand" and "pay-as-you-go", scientists with no or limited infrastructure can have access to scalable and cost-effective computational resources. However, the large size of NGS data causes a significant data transfer latency from the client's site to the cloud, which presents a bottleneck for using cloud computing services. In this paper, we provide a streaming-based scheme to overcome this problem, where the NGS data is processed while being transferred to the cloud. Our scheme targets the wide class of NGS data analysis tasks, where the NGS sequences can be processed independently from one another. We also provide the elastream package that supports the use of this scheme with individual analysis programs or with workflow systems. Experiments presented in this paper show that our solution mitigates the effect of data transfer latency and saves both time and cost of computation.

    View details for DOI 10.1155/2013/791051

    View details for Web of Science ID 000318725500001

    View details for PubMedID 23710461

  • Personalized cloud-based bioinformatics services for research and education: use cases and the elasticHPC package BMC BIOINFORMATICS El-Kalioby, M., Abouelhoda, M., Krueger, J., Giegerich, R., Sczyrba, A., Wall, D. P., Tonellato, P. 2012; 13


    Bioinformatics services have been traditionally provided in the form of a web-server that is hosted at institutional infrastructure and serves multiple users. This model, however, is not flexible enough to cope with the increasing number of users, increasing data size, and new requirements in terms of speed and availability of service. The advent of cloud computing suggests a new service model that provides an efficient solution to these problems, based on the concepts of "resources-on-demand" and "pay-as-you-go". However, cloud computing has not yet been introduced within bioinformatics servers due to the lack of usage scenarios and software layers that address the requirements of the bioinformatics domain. In this paper, we provide different use case scenarios for providing cloud computing based services, considering both the technical and financial aspects of the cloud computing service model. These scenarios are for individual users seeking computational power as well as bioinformatics service providers aiming at provision of personalized bioinformatics services to their users. We also present elasticHPC, a software package and a library that facilitates the use of high performance cloud computing resources in general and the implementation of the suggested bioinformatics scenarios in particular. Concrete examples that demonstrate the suggested use case scenarios with whole bioinformatics servers and major sequence analysis tools like BLAST are presented. Experimental results with large datasets are also included to show the advantages of the cloud model. Our use case scenarios and the elasticHPC package are steps towards the provision of cloud based bioinformatics services, which would help in overcoming the data challenge of recent biological research. All resources related to elasticHPC and its web-interface are available at

    View details for DOI 10.1186/1471-2105-13-S17-S22

    View details for Web of Science ID 000317183600002

    View details for PubMedID 23281941

  • Cross-pollination of research findings, although uncommon, may accelerate discovery of human disease genes BMC MEDICAL GENETICS Duda, M., Nelson, T., Wall, D. P. 2012; 13


    Technological leaps in genome sequencing have resulted in a surge in discovery of human disease genes. These discoveries have led to increased clarity on the molecular pathology of disease and have also demonstrated considerable overlap in the genetic roots of human diseases. In light of this large genetic overlap, we tested whether cross-disease research approaches lead to faster, more impactful discoveries. We leveraged several gene-disease association databases to calculate a Mutual Citation Score (MCS) for 10,853 pairs of genetically related diseases to measure the frequency of cross-citation between research fields. To assess the importance of cooperative research, we computed an Individual Disease Cooperation Score (ICS) and the average publication rate for each disease. For all disease pairs with one gene in common, we found that the degree of genetic overlap was a poor predictor of cooperation (r² = 0.3198) and that the vast majority of disease pairs (89.56%) never cited previous discoveries of the same gene in a different disease, irrespective of the level of genetic similarity between the diseases. A fraction (0.25%) of the pairs demonstrated cross-citation in greater than 5% of their published genetic discoveries and 0.037% cross-referenced discoveries more than 10% of the time. We found strong positive correlations between ICS and publication rate (r² = 0.7931), and an even stronger correlation between the publication rate and the number of cross-referenced diseases (r² = 0.8585). These results suggested that cross-disease research may have the potential to yield novel discoveries at a faster pace than singular disease research. Our findings suggest that the frequency of cross-disease study is low despite the high level of genetic similarity among many human diseases, and that collaborative methods may accelerate and increase the impact of new genetic discoveries. Until we have a better understanding of the taxonomy of human diseases, cross-disease research approaches should become the rule rather than the exception.

    View details for DOI 10.1186/1471-2350-13-114

    View details for Web of Science ID 000312866300001

    View details for PubMedID 23190421

  • Autworks: a cross-disease network biology application for Autism and related disorders BMC MEDICAL GENOMICS Nelson, T. H., Jung, J., DeLuca, T. F., Hinebaugh, B. K., St Gabriel, K. C., Wall, D. P. 2012; 5


    The genetic etiology of autism is heterogeneous. Multiple disorders share genotypic and phenotypic traits with autism. Network based cross-disorder analysis can aid in the understanding and characterization of the molecular pathology of autism, but there are few tools that enable us to conduct cross-disorder analysis and to visualize the results. We have designed Autworks as a web portal to bring together gene interaction and gene-disease association data on autism to enable network construction, visualization, and network comparisons with numerous other related neurological conditions and disorders. Users may examine the structure of gene interactions within a set of disorder-associated genes, compare networks of disorder/disease genes with those of other disorders/diseases, and upload their own sets for comparative analysis. Autworks is a web application that provides an easy-to-use resource for researchers of varied backgrounds to analyze the autism gene network structure within and between disorders.

    View details for DOI 10.1186/1755-8794-5-56

    View details for Web of Science ID 000313043800001

    View details for PubMedID 23190929

  • Use of Artificial Intelligence to Shorten the Behavioral Diagnosis of Autism PLOS ONE Wall, D. P., Dally, R., Luyster, R., Jung, J., DeLuca, T. F. 2012; 7 (8)


    The Autism Diagnostic Interview-Revised (ADI-R) is one of the most commonly used instruments for assisting in the behavioral diagnosis of autism. The exam consists of 93 questions that must be answered by a care provider within a focused session that often spans 2.5 hours. We used machine learning techniques to study the complete sets of answers to the ADI-R available at the Autism Genetic Resource Exchange (AGRE) for 891 individuals diagnosed with autism and 75 individuals who did not meet the criteria for an autism diagnosis. Our analysis showed that 7 of the 93 items contained in the ADI-R were sufficient to classify autism with 99.9% statistical accuracy. We further tested the accuracy of this 7-question classifier against complete sets of answers from two independent sources, a collection of 1654 individuals with autism from the Simons Foundation and a collection of 322 individuals with autism from the Boston Autism Consortium. In both cases, our classifier performed with nearly 100% statistical accuracy, properly categorizing all but one of the individuals from these two resources who previously had been diagnosed with autism through the standard ADI-R. Our ability to measure specificity was limited by the small numbers of non-spectrum cases in the research data used; however, both real and simulated data demonstrated a range in specificity from 99% to 93.8%. With incidence rates rising, the capacity to diagnose autism quickly and effectively requires careful design of behavioral assessment methods. Ours is an initial attempt to retrospectively analyze large data repositories to derive an accurate, but significantly abbreviated approach that may be used for rapid detection and clinical prioritization of individuals likely to have an autism spectrum disorder. Such a tool could assist in streamlining the clinical diagnostic process overall, leading to faster screening and earlier treatment of individuals with autism.

    View details for DOI 10.1371/journal.pone.0043855

    View details for Web of Science ID 000308044800067

    View details for PubMedID 22952789

  • Delivery and impact bypass in a karst aquifer with high phosphorus source and pathway potential WATER RESEARCH Mellander, P., Jordan, P., Wall, D. P., Melland, A. R., Meehan, R., Kelly, C., Shortle, G. 2012; 46 (7): 2225-2236


    Conduit and other karstic flows to aquifers, connecting agricultural soils and farming activities, are considered to be the main hydrological mechanisms that transfer phosphorus from the land surface to the groundwater body of a karstified aquifer. In this study, soil source and pathway components of the phosphorus (P) transfer continuum were defined at a high spatial resolution; field-by-field soil P status and mapping of all surface karst features was undertaken in a >30 km² spring contributing zone. Additionally, P delivery and water discharge were monitored in the emergent spring on a sub-hourly basis for over 12 months. Despite moderate to intensive agriculture, varying soil P status with a high proportion of elevated soil P concentrations and a high karstic connectivity potential, background P concentrations in the emergent groundwater were low and indicative of being insufficient to increase the surface water P status of receiving surface waters. However, episodic P transfers via the conduit system increased the P concentrations in the spring during storm events (but not >0.035 mg total reactive P L⁻¹) and this process is similar to other catchments where the predominant transfer is via episodic, surface flow pathways; but with high buffering potential over karst due to delayed and attenuated runoff. These data suggest that the current definitions of risk and vulnerability for P delivery to receiving surface waters should be re-evaluated as high source risk need not necessarily result in a water quality impact. Also, inclusion of conduit flows from sparse water quality data in these systems may over-emphasise their influence on the overall status of the groundwater body.

    View details for DOI 10.1016/j.watres.2012.01.048

    View details for Web of Science ID 000302645300020

    View details for PubMedID 22377147

  • Systems analysis of inflammatory bowel disease based on comprehensive gene information BMC MEDICAL GENETICS Suzuki, S., Takai-Igarashi, T., Fukuoka, Y., Wall, D. P., Tanaka, H., Tonellato, P. J. 2012; 13


    The rise of systems biology and availability of highly curated gene and molecular information resources has promoted a comprehensive approach to study disease as the cumulative deleterious function of a collection of individual genes and networks of molecules acting in concert. These "human disease networks" (HDN) have revealed novel candidate genes and pharmaceutical targets for many diseases and identified fundamental HDN features conserved across diseases. A network-based analysis is particularly vital for a study on polygenic diseases where many interactions between molecules should be simultaneously examined and elucidated. We employ a new knowledge-driven HDN gene and molecular database systems approach to analyze Inflammatory Bowel Disease (IBD), whose pathogenesis remains largely unknown. Based on drug indications for IBD, we determined sibling diseases of mild and severe states of IBD. Approximately 1,000 genes associated with the sibling diseases were retrieved from four databases. After ranking the genes by the frequency of records in the databases, we obtained 250 and 253 genes highly associated with the mild and severe IBD states, respectively. We then calculated functional similarities of these genes with known drug targets and examined and presented their interactions as PPI networks. The results demonstrate that this knowledge-based systems approach, predicated on functionally similar genes important to sibling diseases, is an effective method to identify important components of the IBD human disease network. Our approach elucidates a previously unknown biological distinction between mild and severe IBD states.

    View details for DOI 10.1186/1471-2350-13-25

    View details for Web of Science ID 000305184200001

    View details for PubMedID 22480395

  • Roundup 2.0: enabling comparative genomics for over 1800 genomes BIOINFORMATICS DeLuca, T. F., Cui, J., Jung, J., Gabriel, K. C., Wall, D. P. 2012; 28 (5): 715-716


    Roundup is an online database of gene orthologs for over 1800 genomes, including 226 Eukaryota, 1447 Bacteria, 113 Archaea and 21 Viruses. Orthologs are inferred using the Reciprocal Smallest Distance algorithm. Users may query Roundup for single-linkage clusters of orthologous genes based on any group of genomes. Annotated query results may be viewed in a variety of ways including as clusters of orthologs and as phylogenetic profiles. Genomic results may be downloaded in formats suitable for functional as well as phylogenetic analysis, including the recent OrthoXML standard. In addition, gene IDs can be retrieved using FASTA sequence search. All source code and orthologs are freely available.

    View details for DOI 10.1093/bioinformatics/bts006

    View details for Web of Science ID 000300986600017

    View details for PubMedID 22247275
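
The single-linkage clustering of orthologs mentioned in the Roundup 2.0 abstract can be pictured as finding connected components in a graph whose edges are pairwise ortholog calls. The sketch below is an illustration under that assumption, not Roundup's own code; the function name and the gene identifiers are hypothetical.

```python
from collections import defaultdict


def single_linkage_clusters(ortholog_pairs):
    """Group genes into clusters: two genes share a cluster if a chain of
    pairwise ortholog calls connects them (connected components)."""
    parent = {}

    def find(gene):
        # Follow parent links to the cluster representative, compressing the path.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in ortholog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for gene in parent:
        clusters[find(gene)].add(gene)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical gene IDs from three genomes.
pairs = [("hs_A", "mm_A"), ("mm_A", "sc_A"), ("hs_B", "mm_B")]
clusters = single_linkage_clusters(pairs)
print(clusters)
```

Note that single linkage places hs_A and sc_A in the same cluster even though no direct pair links them; the chain through mm_A is enough.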

  • Cloud Computing for Comparative Genomics with Windows Azure Platform EVOLUTIONARY BIOINFORMATICS Kim, I., Jung, J., DeLuca, T. F., Nelson, T. H., Wall, D. P. 2012; 8: 527-534


    Cloud computing services have emerged as a cost-effective alternative to cluster systems as the number of genomes and the computation power required to analyze them have increased in recent years. Here we introduce the Microsoft Azure platform with detailed execution steps and a cost comparison with Amazon Web Services.

    View details for DOI 10.4137/EBO.S9946

    View details for Web of Science ID 000308500500001

    View details for PubMedID 23032609

  • The future of genomics in pathology. F1000 medicine reports Wall, D. P., Tonellato, P. J. 2012; 4: 14-?


    The recent advances in technology and the promise of cheap and fast whole genomic data offer the possibility to revolutionise the discipline of pathology. This should allow pathologists in the near future to diagnose disease rapidly and early to change its course, and to tailor treatment programs to the individual. This review outlines some of these technical advances and the changes needed to make this revolution a reality.

    View details for DOI 10.3410/M4-14

    View details for PubMedID 22802873

  • Phylogenetically informed logic relationships improve detection of biological network organization BMC BIOINFORMATICS Cui, J., DeLuca, T. F., Jung, J., Wall, D. P. 2011; 12


    A "phylogenetic profile" refers to the presence or absence of a gene across a set of organisms, and it has proven valuable for understanding gene functional relationships and network organization. Despite this success, few studies have attempted to search beyond just pairwise relationships among genes. Here we search for logic relationships involving three genes, and explore their potential application in gene network analyses. Taking advantage of a phylogenetic matrix constructed from the large ortholog database Roundup, we invented a method to create balanced profiles for individual triplets of genes that guarantee equal weight on the different phylogenetic scenarios of coevolution between genes. When we applied this idea to LAPP, the method to search for logic triplets of genes, the balanced profiles resulted in significant performance improvement and the discovery of hundreds of thousands more putative triplets than unadjusted profiles. We found that logic triplets detected biological network organization and identified key proteins and their functions, ranging from neighbouring proteins in local pathways, to well separated proteins in the whole pathway, and to the interactions among different pathways at the system level. Finally, our case study suggested that the directionality in a logic relationship and the profile of a triplet could disclose the connectivity between the triplet and surrounding networks. Balanced profiles are superior to the raw profiles employed by traditional methods of phylogenetic profiling in searching for high order gene sets. Gene triplets can provide valuable information in detection of biological network organization and identification of key genes at different levels of cellular interaction.

    View details for DOI 10.1186/1471-2105-12-476

    View details for Web of Science ID 000299824500001

    View details for PubMedID 22172058
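
To make the idea of a logic relationship among three phylogenetic profiles concrete, the sketch below scores how often the presence of one gene matches a logic function of two others. It is a deliberately simplified illustration: the profiles are made up, the agreement fraction is an assumed stand-in for the scoring used by LAPP, and the balanced-profile construction described in the abstract is not reproduced.

```python
# Presence (1) / absence (0) of each gene across the same ordered set of genomes.
# The profiles below are made-up values for illustration.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0],
    "geneB": [1, 0, 1, 0, 1, 0],
    "geneC": [1, 0, 0, 0, 1, 0],
}

LOGIC_FUNCTIONS = {
    "A AND B": lambda a, b: a & b,
    "A OR B": lambda a, b: a | b,
    "A XOR B": lambda a, b: a ^ b,
}


def logic_agreement(a, b, c, func):
    """Fraction of genomes in which c equals func(a, b)."""
    matches = sum(1 for x, y, z in zip(a, b, c) if func(x, y) == z)
    return matches / len(c)


scores = {
    name: logic_agreement(profiles["geneA"], profiles["geneB"], profiles["geneC"], f)
    for name, f in LOGIC_FUNCTIONS.items()
}
best = max(scores, key=scores.get)
print(best, scores[best])
```

With these toy profiles, geneC is present exactly where geneA and geneB are both present, so the AND relationship scores highest.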

  • Identification of autoimmune gene signatures in autism TRANSLATIONAL PSYCHIATRY Jung, J., Kohane, I. S., Wall, D. P. 2011; 1


    The role of the immune system in neuropsychiatric diseases, including autism spectrum disorder (ASD), has long been hypothesized. This hypothesis has mainly been supported by family cohort studies and the immunological abnormalities found in ASD patients, but had limited findings in genetic association testing. Two cross-disorder genetic association tests were performed on the genome-wide data sets of ASD and six autoimmune disorders. In the polygenic score test, we examined whether ASD risk alleles with low effect sizes work collectively in specific autoimmune disorders and show significant association statistics. In the genetic variation score test, we tested whether allele-specific associations between ASD and autoimmune disorders can be found using nominally significant single-nucleotide polymorphisms. In both tests, we found that ASD is probabilistically linked to ankylosing spondylitis (AS) and multiple sclerosis (MS). Association coefficients showed that ASD and AS were positively associated, meaning that autism susceptibility alleles may have a similar collective effect in AS. The association coefficients were negative between ASD and MS. Significant associations between ASD and two autoimmune disorders were identified. This genetic association supports the idea that specific immunological abnormalities may underlie the etiology of autism, at least in a number of cases.

    View details for DOI 10.1038/tp.2011.62

    View details for Web of Science ID 000306217100007

    View details for PubMedID 22832355

  • Detecting biological network organization and functional gene orthologs BIOINFORMATICS Cui, J., DeLuca, T. F., Jung, J., Wall, D. P. 2011; 27 (20): 2919-2920


    We developed a package, TripletSearch, to compute relationships within triplets of genes based on Roundup, an orthologous gene database containing >1500 genomes. These relationships, derived from the coevolution of genes, provide valuable information in the detection of biological network organization from the local to the system level, in the inference of protein functions and in the identification of functional orthologs. To run the computation, users need to provide the GI IDs of the genes of interest. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btr485

    View details for Web of Science ID 000295680600025

    View details for PubMedID 21856738

  • Biomedical Cloud Computing With Amazon Web Services PLOS COMPUTATIONAL BIOLOGY Fusaro, V. A., Patil, P., Gafni, E., Wall, D. P., Tonellato, P. J. 2011; 7 (8)


    In this overview of biomedical computing in the cloud, we discussed two primary ways to use the cloud (a single instance or cluster), provided a detailed example using NGS mapping, and highlighted the associated costs. While many users new to the cloud may assume that entry is as straightforward as uploading an application and selecting an instance type and storage options, we illustrated that there is substantial up-front effort required before an application can make full use of the cloud's vast resources. Our intention was to provide a set of best practices and to illustrate how those apply to a typical application pipeline for biomedical informatics, while remaining general enough for extrapolation to other types of computational problems. Our mapping example was intended to illustrate how to develop a scalable project and not to compare and contrast alignment algorithms for read mapping and genome assembly. Indeed, with a newer aligner such as Bowtie, it is possible to map the entire African genome using one m2.2xlarge instance in 48 hours for a total cost of approximately $48 in computation time. In our example, we were not concerned with data transfer rates, which are heavily influenced by the amount of available bandwidth, connection latency, and network availability. When transferring large amounts of data to the cloud, bandwidth limitations can be a major bottleneck, and in some cases it is more efficient to simply mail a storage device containing the data to AWS. More information about cloud computing, detailed cost analysis, and security can be found in references.

    View details for DOI 10.1371/journal.pcbi.1002147

    View details for Web of Science ID 000294299700022

    View details for PubMedID 21901085

  • The semantic organization of the animal category: evidence from semantic verbal fluency and network theory COGNITIVE PROCESSING Goni, J., Arrondo, G., Sepulcre, J., Martincorena, I., Velez de Mendizabal, N., Corominas-Murtra, B., Bejarano, B., Ardanza-Trevijano, S., Peraita, H., Wall, D. P., Villoslada, P. 2011; 12 (2): 183-196


    Semantic memory is the subsystem of human memory that stores knowledge of concepts or meanings, as opposed to life-specific experiences. How humans organize semantic information remains poorly understood. In an effort to better understand this issue, we conducted a verbal fluency experiment on 200 participants with the aim of inferring and representing the conceptual storage structure of the natural category of animals as a network. This was done by formulating a statistical framework for co-occurring concepts that aims to infer significant concept-concept associations and represent them as a graph. The resulting network was analyzed and enriched by means of a missing links recovery criterion based on modularity. Both network models were compared to a thresholded co-occurrence approach. They were evaluated using a random subset of verbal fluency tests and comparing the network outcomes (linked pairs are clustering transitions and disconnected pairs are switching transitions) to the outcomes of two expert human raters. Results show that the network models proposed in this study outperform a thresholded co-occurrence approach, and their outcomes are in high agreement with human evaluations. Finally, the interplay between conceptual structure and retrieval mechanisms is discussed.

    View details for DOI 10.1007/s10339-010-0372-x

    View details for Web of Science ID 000289685000005

    View details for PubMedID 20938799

  • Genotator: A disease-agnostic tool for genetic annotation of disease BMC MEDICAL GENOMICS Wall, D. P., Pivovarov, R., Tong, M., Jung, J., Fusaro, V. A., DeLuca, T. F., Tonellato, P. J. 2010; 3


    Disease-specific genetic information has been increasing at rapid rates as a consequence of recent improvements and massive cost reductions in sequencing technologies. Numerous systems designed to capture and organize this mounting sea of genetic data have emerged, but these resources differ dramatically in their disease coverage and genetic depth. With few exceptions, researchers must manually search a variety of sites to assemble a complete set of genetic evidence for a particular disease of interest, a process that is both time-consuming and error-prone. We designed a real-time aggregation tool that provides both comprehensive coverage and reliable gene-to-disease rankings for any disease. Our tool, called Genotator, automatically integrates data from 11 externally accessible clinical genetics resources and uses these data in a straightforward formula to rank genes in order of disease relevance. We tested the accuracy of coverage of Genotator in three separate diseases for which there exist specialty curated databases, Autism Spectrum Disorder, Parkinson's Disease, and Alzheimer Disease. Genotator is freely available online. Results demonstrated that most of the 11 selected databases contain unique information about the genetic composition of disease, with 2514 genes found in only one of the 11 databases. These findings confirm that the integration of these databases provides a more complete picture than would be possible from any one database alone. Genotator successfully identified at least 75% of the top ranked genes for all three of our use cases, including a 90% concordance with the top 40 ranked candidates for Alzheimer Disease. As a meta-query engine, Genotator provides high coverage of both historical genetic research as well as recent advances in the genetic understanding of specific diseases. As such, Genotator provides a real-time aggregation of ranked data that remains current with the pace of research in the disease fields. Genotator's algorithm appropriately transforms query terms to match the input requirements of each targeted database and accurately resolves named synonyms to ensure full coverage of the genetic results with official nomenclature. Genotator generates an Excel-style output that is consistent across disease queries and readily importable to other applications.

    View details for DOI 10.1186/1755-8794-3-50

    View details for Web of Science ID 000284541000001

    View details for PubMedID 21034472
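
The abstract does not state the ranking formula Genotator uses, so the following is only a sketch of the general idea of aggregating gene lists from several sources. It assumes, purely for illustration, that a gene's score is the number of sources reporting it; the source names and gene symbols are hypothetical.

```python
from collections import Counter

# Hypothetical query results: source name -> genes reported for one disease.
# Neither the source names nor the gene lists come from Genotator.
results_by_source = {
    "source_1": {"GENE1", "GENE2", "GENE3"},
    "source_2": {"GENE1", "GENE3"},
    "source_3": {"GENE1", "GENE4"},
}

# Score each gene by the number of sources that report it (assumed formula).
support = Counter(gene for genes in results_by_source.values() for gene in genes)

# Rank by descending support, breaking ties alphabetically for a stable order.
ranking = sorted(support.items(), key=lambda item: (-item[1], item[0]))

# Genes reported by exactly one source.
unique_genes = sorted(g for g, n in support.items() if n == 1)

print(ranking)
print(unique_genes)
```

The list of genes seen in only one source mirrors the kind of count the abstract reports when it notes that many genes appear in a single database.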

  • Cloud computing for comparative genomics BMC BIOINFORMATICS Wall, D. P., Kudtarkar, P., Fusaro, V. A., Pivovarov, R., Patil, P., Tonellato, P. J. 2010; 11


    Large comparative genomics studies and tools are becoming increasingly compute-expensive as the number of available genome sequences continues to rise. The capacity and cost of local computing infrastructures are likely to become prohibitive with the increase, especially as the breadth of questions continues to rise. Alternative computing architectures, in particular cloud computing environments, may help alleviate this increasing pressure and enable fast, large-scale, and cost-effective comparative genomics strategies going forward. To test this, we redesigned a typical comparative genomics algorithm, the reciprocal smallest distance algorithm (RSD), to run within Amazon's Elastic Computing Cloud (EC2). We then employed the RSD-cloud for ortholog calculations across a wide selection of fully sequenced genomes. We ran more than 300,000 RSD-cloud processes within the EC2. These jobs were farmed simultaneously to 100 high capacity compute nodes using the Amazon Web Service Elastic Map Reduce and included a wide mix of large and small genomes. The total computation time took just under 70 hours and cost a total of $6,302 USD. The effort to transform existing comparative genomics algorithms from local compute infrastructures is not trivial. However, the speed and flexibility of cloud computing environments provides a substantial boost with manageable cost. The procedure designed to transform the RSD algorithm into a cloud-ready application is readily adaptable to similar comparative genomics problems.

    View details for DOI 10.1186/1471-2105-11-259

    View details for Web of Science ID 000279730300001

    View details for PubMedID 20482786
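
The figures reported in this abstract allow a rough back-of-the-envelope cost breakdown. The calculation below assumes that all 100 nodes were billed for the full run, which the abstract does not state, and treats the rounded figures as exact, so the results are approximations only.

```python
# Figures reported in the abstract.
total_cost_usd = 6302
nodes = 100
wall_clock_hours = 70      # reported as "just under 70 hours", so an upper bound
processes = 300_000        # reported as "more than 300,000", so a lower bound

# Assumption: all 100 nodes were billed for the full run.
node_hours = nodes * wall_clock_hours
cost_per_node_hour = total_cost_usd / node_hours
cost_per_process = total_cost_usd / processes

print(f"{node_hours} node-hours")
print(f"~${cost_per_node_hour:.2f} per node-hour")
print(f"~${cost_per_process:.3f} per RSD process")
```

Because the run took slightly less than 70 hours and involved more than 300,000 processes, the true per-node-hour figure would be slightly higher and the true per-process figure slightly lower than these estimates.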

  • Collaborative text-annotation resource for disease-centered relation extraction from biomedical text JOURNAL OF BIOMEDICAL INFORMATICS Cano, C., Monaghan, T., Blanco, A., Wall, D. P., Peshkin, L. 2009; 42 (5): 967-977


    Agglomerating results from studies of individual biological components has shown the potential to produce biomedical discovery and the promise of therapeutic development. Such knowledge integration could be tremendously facilitated by automated text mining for relation extraction in the biomedical literature. Relation extraction systems cannot be developed without substantial datasets annotated with ground truth for benchmarking and training. The creation of such datasets is hampered by the absence of a resource for launching a distributed annotation effort, as well as by the lack of a standardized annotation schema. We have developed an annotation schema and an annotation tool which can be widely adopted so that the resulting annotated corpora from a multitude of disease studies could be assembled into a unified benchmark dataset. The contribution of this paper is threefold. First, we provide an overview of available benchmark corpora and derive a simple annotation schema for specific binary relation extraction problems such as protein-protein and gene-disease relation extraction. Second, we present BioNotate: an open source annotation resource for the distributed creation of a large corpus. Third, we present and make available the results of a pilot annotation effort of the autism disease network.

    View details for DOI 10.1016/j.jbi.2009.02.001

    View details for Web of Science ID 000270870500021

    View details for PubMedID 19232400

  • Comparative analysis of neurological disorders focuses genome-wide search for autism genes GENOMICS Wall, D. P., Esteban, F. J., DeLuca, T. F., Huyck, M., Monaghan, T., de Mendizabal, N. V., Goni, J., Kohane, I. S. 2009; 93 (2): 120-129


    The behaviors of autism overlap with a diverse array of other neurological disorders, suggesting common molecular mechanisms. We conducted a large comparative analysis of the network of genes linked to autism with those of 432 other neurological diseases to circumscribe a multi-disorder subcomponent of autism. We leveraged the biological process and interaction properties of these multi-disorder autism genes to overcome the across-the-board multiple hypothesis corrections that a purely data-driven approach requires. Using prior knowledge of biological process, we identified 154 genes not previously linked to autism of which 42% were significantly differentially expressed in autistic individuals. Then, using prior knowledge from interaction networks of disorders related to autism, we uncovered 334 new genes that interact with published autism genes, of which 87% were significantly differentially regulated in autistic individuals. Our analysis provided a novel picture of autism from the perspective of related neurological disorders and suggested a model by which prior knowledge of interaction networks can inform and focus genome-scale studies of complex neurological disorders.

    View details for DOI 10.1016/j.ygeno.2008.09.015

    View details for Web of Science ID 000263227600003

    View details for PubMedID 18950700

  • Heterogeneous dysregulation of microRNAs across the autism spectrum NEUROGENETICS Abu-Elneel, K., Liu, T., Gazzaniga, F. S., Nishimura, Y., Wall, D. P., Geschwind, D. H., Lao, K., Kosik, K. S. 2008; 9 (3): 153-161


    microRNAs (miRNAs) are approximately 21 nt transcripts capable of regulating the expression of many mRNAs and are abundant in the brain. miRNAs have a role in several complex diseases including cancer as well as some neurological diseases such as Tourette's syndrome and Fragile X syndrome. As a genetically complex disease, dysregulation of miRNA expression might be a feature of autism spectrum disorders (ASDs). Using multiplex quantitative polymerase chain reaction (PCR), we compared the expression of 466 human miRNAs from postmortem cerebellar cortex tissue of individuals with ASD (n = 13) and a control set of non-autistic cerebellar samples (n = 13). While most miRNA levels showed little variation across all samples, suggesting that autism does not induce global dysfunction of miRNA expression, some miRNAs among the autistic samples were expressed at significantly different levels compared to the mean control value. Twenty-eight miRNAs were expressed at significantly different levels compared to the non-autism control set in at least one of the autism samples. To validate the finding, we reversed the analysis and compared each non-autism control to a single mean value for each miRNA across all autism cases. In this analysis, the number of dysregulated miRNAs fell from 28 to 9 miRNAs. Among the predicted targets of dysregulated miRNAs are genes that are known genetic causes of autism such as Neurexin and SHANK3. This study finds that altered miRNA expression levels are observed in postmortem cerebellar cortex from autism patients, a finding which suggests that dysregulation of miRNAs may contribute to autism spectrum phenotype.

    View details for DOI 10.1007/s10048-008-0133-5

    View details for Web of Science ID 000257216200001

    View details for PubMedID 18563458

  • Testing the Accuracy of Eukaryotic Phylogenetic Profiles for Prediction of Biological Function EVOLUTIONARY BIOINFORMATICS Singh, S., Wall, D. P. 2008; 4: 217-223


    A phylogenetic profile captures the pattern of gene gain and loss throughout evolutionary time. Proteins that interact directly or indirectly within the cell to perform a biological function will often co-evolve, and this co-evolution should be well reflected within their phylogenetic profiles. Thus similar phylogenetic profiles are commonly used for grouping proteins into functional groups. However, it remains unclear how the size and content of the phylogenetic profile impacts the ability to predict function, particularly in Eukaryotes. Here we developed a straightforward approach to address this question by constructing a complete set of phylogenetic profiles for 31 fully sequenced Eukaryotes. Using Gene Ontology as our gold standard, we compared the accuracy of functional predictions made by a comprehensive array of permutations on the complete set of genomes. Our permutations showed that phylogenetic profiles containing between 25 and 31 Eukaryotic genomes performed equally well and significantly better than all other permuted genome sets, with one exception: we uncovered a core group of 18 genomes that achieved statistically identical accuracy. This core group contained genomes from each branch of the eukaryotic phylogeny, but also contained several groups of closely related organisms, suggesting that a balance between phylogenetic breadth and depth may improve our ability to use Eukaryotic specific phylogenetic profiles for functional annotations.

    View details for Web of Science ID 000264677700019

    View details for PubMedID 19204819

  • Ortholog detection using the reciprocal smallest distance algorithm. Methods in molecular biology (Clifton, N.J.) Wall, D. P., Deluca, T. 2007; 396: 95-110


    All protein coding genes have a phylogenetic history that when understood can lead to deep insights into the diversification or conservation of function, the evolution of developmental complexity, and the molecular basis of disease. One important part of reconstructing the relationships among genes in different organisms is an accurate method to find orthologs as well as an accurate measure of evolutionary diversification. The present chapter details such a method, called the reciprocal smallest distance algorithm (RSD). This approach improves upon the common procedure of taking reciprocal best Basic Local Alignment Search Tool hits (RBH) in the identification of orthologs by using global sequence alignment and maximum likelihood estimation of evolutionary distances to detect orthologs between two genomes. RSD finds many putative orthologs missed by RBH because it is less likely to be misled by the presence of close paralogs in genomes. The package offers a tremendous amount of flexibility in investigating parameter settings, allowing the user to search for increasingly distant orthologs between highly divergent species, among other advantages. The flexibility of this tool makes it a unique and powerful addition to other available approaches for ortholog detection.

    View details for PubMedID 18025688
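
The reciprocal step at the heart of RSD can be sketched in a few lines. The example below is a simplified illustration, not the published package: it assumes the evolutionary distances have already been computed (the real method derives them from alignments and maximum likelihood estimation), and the function name, gene names and distance values are all hypothetical.

```python
def reciprocal_smallest_distance(distances):
    """Return ortholog pairs (a, b) where b is the closest gene to a in
    genome B and a is, in turn, the closest gene to b in genome A.

    `distances` maps (gene_in_A, gene_in_B) -> evolutionary distance and
    stands in for the alignment and maximum likelihood steps.
    """
    genes_a = {a for a, _ in distances}
    genes_b = {b for _, b in distances}
    orthologs = []
    for a in sorted(genes_a):
        # Forward step: closest gene in genome B among those with a distance.
        candidates_b = [b for b in genes_b if (a, b) in distances]
        best_b = min(candidates_b, key=lambda b: (distances[(a, b)], b))
        # Reverse step: closest gene in genome A to that hit.
        candidates_a = [x for x in genes_a if (x, best_b) in distances]
        best_a = min(candidates_a, key=lambda x: (distances[(x, best_b)], x))
        if best_a == a:
            orthologs.append((a, best_b))
    return orthologs


# Made-up distances: a2 is a close paralog of a1 in genome A.
distances = {
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.90,
    ("a2", "b1"): 0.25, ("a2", "b2"): 0.80,
    ("a3", "b2"): 0.30,
}
print(reciprocal_smallest_distance(distances))
```

In this toy input, a2 points to b1 in the forward step, but b1 points back to a1, so the paralog a2 is not reported as an ortholog of b1.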

  • Roundup: a multi-genome repository of orthologs and evolutionary distances BIOINFORMATICS DeLuca, T. F., Wu, I., Pu, J., Monaghan, T., Peshkin, L., Singh, S., Wall, D. P. 2006; 22 (16): 2044-2046


    We have created a tool for ortholog and phylogenetic profile retrieval called Roundup. Roundup is backed by a massive repository of orthologs and associated evolutionary distances that was built using the reciprocal smallest distance algorithm, an approach that has been shown to improve upon alternative approaches of ortholog detection, such as reciprocal BLAST. Presently, the Roundup repository contains all possible pair-wise comparisons for over 250 genomes, including 32 Eukaryotes, more than doubling the coverage of any similar resource. The orthologs are accessible through an intuitive web interface that allows searches by genome or gene identifier, presenting results as phylogenetic profiles together with gene and molecular function annotations. Results may be downloaded as phylogenetic matrices for subsequent analysis, including the construction of whole-genome phylogenies based on gene-content data.

    View details for DOI 10.1093/bioinformatics/btl286

    View details for Web of Science ID 000239900200016

    View details for PubMedID 16777906

  • Heparan sulfate proteoglycans and the emergence of neuronal connectivity CURRENT OPINION IN NEUROBIOLOGY Van Vactor, D., Wall, D. P., Johnson, K. G. 2006; 16 (1): 40-51


    With the identification of the molecular determinants of neuronal connectivity, our understanding of the extracellular information that controls axon guidance and synapse formation has evolved from single factors towards the complexity that neurons face in a living organism. As we move in this direction - ready to see the forest for the trees - attention is returning to one of the most ancient regulators of cell-cell interaction: the extracellular matrix. Among many matrix components that influence neuronal connectivity, recent studies of the heparan sulfate proteoglycans suggest that these ancient molecules function as versatile extracellular scaffolds that both sculpt the landscape of extracellular cues and modulate the way that neurons perceive the world around them.

    View details for DOI 10.1016/j.conb.2006.01.011

    View details for Web of Science ID 000236136200007

    View details for PubMedID 16417999

  • The role of selection in the evolution of human mitochondrial genomes GENETICS Kivisild, T., Shen, P. D., Wall, D. P., Do, B., Sung, R., Davis, K., Passarino, G., Underhill, P. A., Scharfe, C., Torroni, A., Scozzari, R., Modiano, D., Coppa, A., de Knijff, P., Feldman, M., Cavalli-Sforza, L. L., Oefner, P. J. 2006; 172 (1): 373-387


    High mutation rate in mammalian mitochondrial DNA generates a highly divergent pool of alleles even within species that have dispersed and expanded in size recently. Phylogenetic analysis of 277 human mitochondrial genomes revealed a significant (P < 0.01) excess of rRNA and nonsynonymous base substitutions among hotspots of recurrent mutation. Most hotspots involved transitions from guanine to adenine that, with thymine-to-cytosine transitions, illustrate the asymmetric bias in codon usage at synonymous sites on the heavy-strand DNA. The mitochondrion-encoded tRNAThr varied significantly more than any other tRNA gene. Threonine and valine codons were involved in 259 of the 414 amino acid replacements observed. The ratio of nonsynonymous changes from and to threonine and valine differed significantly (P = 0.003) between populations with neutral (22/58) and populations with significantly negative Tajima's D values (70/76), independent of their geographic location. In contrast to a recent suggestion that the excess of nonsilent mutations is characteristic of Arctic populations, implying their role in cold adaptation, we demonstrate that the surplus of nonsynonymous mutations is a general feature of the young branches of the phylogenetic tree, affecting also those that are found only in Africa. We introduce a new calibration method of the mutation rate of synonymous transitions to estimate the coalescent times of mtDNA haplogroups.

    View details for DOI 10.1534/genetics.105.043901

    View details for Web of Science ID 000235197700033

    View details for PubMedID 16172508

  • Converging on a general model of protein evolution TRENDS IN BIOTECHNOLOGY Herbeck, J. T., Wall, D. P. 2005; 23 (10): 485-487


    The availability of high-throughput genomic databases that establish protein dispensability, expression and interaction networks enables rigorous tests of competing models of protein evolution. Recent research utilizing these new data sets shows that protein evolution is more complex than was previously thought. Several variables, including protein dispensability, expression, functional density, and genetic modularity, appear to have independent effects on the evolutionary rate of proteins, suggesting that proteomes have evolved via an assembly of selectional regimes. These results indicate that a general model of protein evolution will emerge as more functional genomic data from a diversity of organisms accumulate.

    View details for DOI 10.1016/j.tibtech.2005.07.009

    View details for Web of Science ID 000232605900001

    View details for PubMedID 16054255

  • Origin and rapid diversification of a tropical moss EVOLUTION Wall, D. P. 2005; 59 (7): 1413-1424


    Molecular sequences rarely evolve at a constant rate. Yet, even in instances where a clock can be assumed or approximated for a particular set of sequences, fossils or clear patterns of vicariance are rarely available to calibrate the clock. Thus, obtaining absolute timing for diversification of natural lineages can prove difficult. Unfortunately, without absolute time we cannot develop a complete understanding of important evolutionary processes, including adaptive radiations and key innovations. In the present study, the coding sequence of the nuclear gene, glyceraldehyde 3-phosphate dehydrogenase (gpd), extracted from the paleotropical moss, Mitthyridium, was found to exhibit clocklike behavior and used to reconstruct the history of 80 distinct molecular lineages that cover the full geographic range of Mitthyridium. Two separate clades endemic to two geographically distinct oceanic archipelagos were revealed by this phylogenetic analysis. This allowed the use of island age (as derived from potassium-argon dating) as a maximum age of origin of each monophyletic group, providing two independent time anchors for the clock found in gpd, the final piece needed to study absolute time. Based on results from both maximum age calibrations, which separately yielded highly consistent estimates, the ancestor of this moss group arose approximately 8 million years ago, and then diversified at the rapid rate of 0.56 ± 0.004 new lineages per million years. Such a rate is on par with the highest diversification rates reported in the literature including rapidly radiating insular groups like the Hawaiian silversword alliance, a classic example of an adaptive radiation. Using independent sources of data, it was found that neither the age nor diversification estimates were affected by the use of molecular lineages rather than species as the operational taxonomic units. Identifying the cause for this rapid diversification requires further testing, but it appears to be related to a general shift in reproductive strategy from sexual to asexual, which may be a key innovation for this young group.

    View details for Web of Science ID 000230975600004

    View details for PubMedID 16153028

  • Functional genomic analysis of the rates of protein evolution PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Wall, D. P., Hirsh, A. E., Fraser, H. B., Kumm, J., Giaever, G., Eisen, M. B., Feldman, M. W. 2005; 102 (15): 5483-5488


    The evolutionary rates of proteins vary over several orders of magnitude. Recent work suggests that analysis of large data sets of evolutionary rates in conjunction with the results from high-throughput functional genomic experiments can identify the factors that cause proteins to evolve at such dramatically different rates. To this end, we estimated the evolutionary rates of >3,000 proteins in four species of the yeast genus Saccharomyces and investigated their relationship with levels of expression and protein dispensability. Each protein's dispensability was estimated by the growth rate of mutants deficient for the protein. Our analyses of these improved evolutionary and functional genomic data sets yield three main results. First, dispensability and expression have independent, significant effects on the rate of protein evolution. Second, measurements of expression levels in the laboratory can be used to filter data sets of dispensability estimates, removing variates that are unlikely to reflect real biological effects. Third, structural equation models show that although we may reasonably infer that dispensability and expression have significant effects on protein evolutionary rate, we cannot yet accurately estimate the relative strengths of these effects.

    View details for DOI 10.1073/pnas.0501761102

    View details for Web of Science ID 000228376600036

    View details for PubMedID 15800036

  • Conservation of the RB1 gene in human and primates HUMAN MUTATION Sivakumaran, T. A., Shen, P. D., Wall, D. P., Do, B. H., Kucheria, K., Oefner, P. J. 2005; 25 (4): 396-409


    Mutations in the RB1 gene are associated with retinoblastoma, which has served as an important model for understanding hereditary predisposition to cancer. Despite the great scrutiny that RB1 has enjoyed as the prototypical tumor suppressor gene, it has never been the object of a comprehensive survey of sequence variation in diverse human populations and primates. Therefore, we analyzed the coding (2,787 bp) and adjacent intronic and untranslated (7,313 bp) sequences of RB1 in 137 individuals from a wide range of ethnicities, including 19 Asian Indian hereditary retinoblastoma cases, and five primate species. Aside from nine apparently disease-associated mutations, 52 variants were identified. They included six singleton, coding variants that comprised five amino acid replacements and one silent site. Nucleotide diversity of the coding region (pi = 0.0763 ± 1.35 x 10^-4) was 52 times lower than that of the noncoding regions (pi = 3.93 ± 5.26 x 10^-4), indicative of significant sequence conservation. The occurrence of purifying selection was corroborated by phylogeny-based maximum likelihood analysis of the RB1 sequences of human and five primates, which yielded an estimated ratio of replacement to silent substitutions (omega) of 0.095 across all lineages. RB1 displayed extensive linkage disequilibrium over 174 kb, and only four unique recombination events, two in Africa and one each in Europe and Southwest Asia, were observed. Using a parsimony approach, 15 haplotypes could be inferred. Ten were found in Africa, though only 12.4% of the 274 chromosomes screened were of African origin. In non-Africans, a single haplotype accounted for from 63 to 84% of all chromosomes, most likely the consequence of natural selection and a significant bottleneck in effective population size during the colonization of the non-African continents.

    View details for DOI 10.1002/humu.20154

    View details for Web of Science ID 000228099600009

    View details for PubMedID 15776430

  • Adjusting for selection on synonymous sites in estimates of evolutionary distance MOLECULAR BIOLOGY AND EVOLUTION Hirsh, A. E., Fraser, H. B., Wall, D. P. 2005; 22 (1): 174-177


    Evolution at silent sites is often used to estimate the pace of selectively neutral processes or to infer differences in divergence times of genes. However, silent sites are subject to selection in favor of preferred codons, and the strength of such selection varies dramatically across genes. Here, we use the relationship between codon bias and synonymous divergence observed in four species of the genus Saccharomyces to provide a simple correction for selection on silent sites.

    View details for DOI 10.1093/molbev/msh265

    View details for Web of Science ID 000225730100018

    View details for PubMedID 15371530

  • Improved haematopoietic recovery following transplantation with ex vivo-expanded mobilized blood cells BRITISH JOURNAL OF HAEMATOLOGY Prince, H. M., Simmons, P. J., Whitty, G., Wall, D. P., Barber, L., Toner, G. C., Seymour, J. F., Richardson, G., Mrongovius, R., Haylock, D. N. 2004; 126 (4): 536-545


    Infusions of ex vivo-expanded (EXE) mobilized blood cells have been explored to enhance haematopoietic recovery following high dose chemotherapy (HDT). However, prior studies have not consistently demonstrated improvements in trilineage haematopoietic recovery. Three cohorts of three patients with breast cancer received three cycles of repetitive HDT supported by either unmanipulated (UM) and/or EXE cells. Efficacy was assessed by an internal comparison of each patient's consecutive HDT cycles, and to 106 historical UM infusions. Twenty-one cycles were supported by EXE cells and six by UM cells alone. Infusions of EXE cells resulted in fewer days with an absolute neutrophil count (ANC) <0.1 x 10^9/l (median 2 vs. 4 d, P = 0.002) and 3 d faster ANC recovery to >0.1 x 10^9/l (median 5 vs. 8 d, P = 0.0002). This resulted in a major reduction in the incidence of febrile neutropenia compared with UM cycles (0% vs. 83%; P = 0.008) and in 66% of historical UM cycles (P = 0.01) and a marked reduction in hospital re-admission. There were also fewer platelet transfusions required (43% vs. 100%; P = 0.009). We conclude that EXE cells enhance both neutrophil and platelet recovery and reduce febrile neutropenia, platelet transfusion and hospital re-admission.

    View details for DOI 10.1111/j.1365-2141.2004.05081.x

    View details for Web of Science ID 000223036300011

    View details for PubMedID 15287947

  • Coevolution of gene expression among interacting proteins PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Fraser, H. B., Hirsh, A. E., Wall, D. P., Eisen, M. B. 2004; 101 (24): 9033-9038


    Physically interacting proteins or parts of proteins are expected to evolve in a coordinated manner that preserves proper interactions. Such coevolution at the amino acid-sequence level is well documented and has been used to predict interacting proteins, domains, and amino acids. Interacting proteins are also often precisely coexpressed with one another, presumably to maintain proper stoichiometry among interacting components. Here, we show that the expression levels of physically interacting proteins coevolve. We estimate average expression levels of genes from four closely related fungi of the genus Saccharomyces using the codon adaptation index and show that expression levels of interacting proteins exhibit coordinated changes in these different species. We find that this coevolution of expression is a more powerful predictor of physical interaction than is coevolution of amino acid sequence. These results demonstrate that gene expression levels can coevolve, adding another dimension to the study of the coevolution of interacting proteins and underscoring the importance of maintaining coexpression of interacting proteins over evolutionary time. Our results also suggest that expression coevolution can be used for computational prediction of protein-protein interactions.

    View details for DOI 10.1073/pnas.0402591101

    View details for Web of Science ID 000222104900038

    View details for PubMedID 15175431

  • Gene expression level influences amino acid usage, but not codon usage, in the tsetse fly endosymbiont Wigglesworthia MICROBIOLOGY-SGM Herbeck, J. T., Wall, D. P., Wernegreen, J. J. 2003; 149: 2585-2596


    Wigglesworthia glossinidia brevipalpis, the obligate bacterial endosymbiont of the tsetse fly Glossina brevipalpis, is characterized by extreme genome reduction and AT nucleotide composition bias. Here, multivariate statistical analyses are used to test the hypothesis that mutational bias and genetic drift shape synonymous codon usage and amino acid usage of Wigglesworthia. The results show that synonymous codon usage patterns vary little across the genome and do not distinguish genes of putative high and low expression levels, thus indicating a lack of translational selection. Extreme AT composition bias across the genome also drives relative amino acid usage, but predicted high-expression genes (ribosomal proteins and chaperonins) use GC-rich amino acids more frequently than do low-expression genes. The levels and configuration of amino acid differences between Wigglesworthia and Escherichia coli were compared to test the hypothesis that the relatively GC-rich amino acid profiles of high-expression genes reflect greater amino acid conservation at these loci. This hypothesis is supported by reduced levels of protein divergence at predicted high-expression Wigglesworthia genes and similar configurations of amino acid changes across expression categories. Combined, the results suggest that codon and amino acid usage in the Wigglesworthia genome reflect a strong AT mutational bias and elevated levels of genetic drift, consistent with expected effects of an endosymbiotic lifestyle and repeated population bottlenecks. However, these impacts of mutation and drift are apparently attenuated by selection on amino acid composition at high-expression genes.

    View details for DOI 10.1099/mic.0.26381-0

    View details for Web of Science ID 000185342900027

    View details for PubMedID 12949182

  • Detecting putative orthologs BIOINFORMATICS Wall, D. P., Fraser, H. B., Hirsh, A. E. 2003; 19 (13): 1710-1711


    We developed an algorithm that improves upon the common procedure of taking reciprocal best BLAST hits (rbh) in the identification of orthologs. The method, the reciprocal smallest distance algorithm (rsd), relies on global sequence alignment and maximum likelihood estimation of evolutionary distances to detect orthologs between two genomes. rsd finds many putative orthologs missed by rbh because it is less likely than rbh to be misled by the presence of a close paralog.

    View details for DOI 10.1093/bioinformatics/btg213

    View details for Web of Science ID 000185310600016

    View details for PubMedID 15593400

  • Evolutionary patterns of codon usage in the chloroplast gene rbcL JOURNAL OF MOLECULAR EVOLUTION Wall, D. P., Herbeck, J. T. 2003; 56 (6): 673-688


    In this study we reconstruct the evolution of codon usage bias in the chloroplast gene rbcL using a phylogeny of 92 green-plant taxa. We employ a measure of codon usage bias that accounts for chloroplast genomic nucleotide content, as an attempt to limit plausible explanations for patterns of codon bias evolution to selection- or drift-based processes. This measure uses maximum likelihood-ratio tests to compare the performance of two models, one in which a single codon is overrepresented and one in which two codons are overrepresented. The measure allowed us to analyze both the extent of bias in each lineage and the evolution of codon choice across the phylogeny. Despite predictions based primarily on the low G + C content of the chloroplast and the high functional importance of rbcL, we found large differences in the extent of bias, suggesting differential molecular selection that is clade specific. The seed plants and simple leafy liverworts each independently derived a low level of bias in rbcL, perhaps indicating relaxed selectional constraint on molecular changes in the gene. Overrepresentation of a single codon was typically plesiomorphic, and transitions to overrepresentation of two codons occurred commonly across the phylogeny, possibly indicating biochemical selection. The total codon bias in each taxon, when regressed against the total bias of each amino acid, suggested that twofold amino acids play a strong role in inflating the level of codon usage bias in rbcL, despite the fact that twofolds compose a minority of residues in this gene. Those amino acids that contributed most to the total codon usage bias of each taxon are known through amino acid knockout and replacement to be of high functional importance. This suggests that codon usage bias may be constrained by particular amino acids and, thus, may serve as a good predictor of what residues are most important for protein fitness.

    View details for DOI 10.1007/s00239-002-2436-8

    View details for Web of Science ID 000183129100004

    View details for PubMedID 12911031

  • A simple dependence between protein evolution rate and the number of protein-protein interactions BMC EVOLUTIONARY BIOLOGY Fraser, H. B., Wall, D. P., Hirsh, A. E. 2003; 3


    It has been shown for an evolutionarily distant genomic comparison that the number of protein-protein interactions a protein has correlates negatively with their rates of evolution. However, the generality of this observation has recently been challenged. Here we examine the problem using protein-protein interaction data from the yeast Saccharomyces cerevisiae and genome sequences from two other yeast species. In contrast to a previous study that used an incomplete set of protein-protein interactions, we observed a highly significant correlation between number of interactions and evolutionary distance to either Candida albicans or Schizosaccharomyces pombe. This study differs from the previous one in that it includes all known protein interactions from S. cerevisiae, and a larger set of protein evolutionary rates. In both evolutionary comparisons, a simple monotonic relationship was found across the entire range of the number of protein-protein interactions. In agreement with our earlier findings, this relationship cannot be explained by the fact that proteins with many interactions tend to be important to yeast. The generality of these correlations in other kingdoms of life unfortunately cannot be addressed at this time, due to the incompleteness of protein-protein interaction data from organisms other than S. cerevisiae. Protein-protein interactions tend to slow the rate at which proteins evolve. This may be due to structural constraints that must be met to maintain interactions, but more work is needed to definitively establish the mechanism(s) behind the correlations we have observed.

    View details for Web of Science ID 000188122100011

    View details for PubMedID 12769820

  • Use of the nuclear gene glyceraldehyde 3-phosphate dehydrogenase for phylogeny reconstruction of recently diverged lineages in Mitthyridium (Musci : Calymperaceae) MOLECULAR PHYLOGENETICS AND EVOLUTION Wall, D. P. 2002; 25 (1): 10-26


    A portion of the nuclear gene glyceraldehyde 3-phosphate dehydrogenase (gpd) was sequenced in 26 representatives of the paleotropical moss, Mitthyridium, and a group of 20 outgroup taxa to assess its utility for phylogenetic reconstruction compared with the better understood chloroplast markers, rps4 and trnL. Primers based on plant and fungal sequences were designed to amplify gpd in plants universally with the exclusion of fungal contaminants. The piece amplified spanned 4 introns and 3 of 9 exons, based on comparisons with complete sequence from Arabidopsis. Size variation in gpd ranged from 891 to 1007 bp, in part attributable to 6 indels of variable length found within the introns. Intron 6 contributed most of the length variation and contained a variable purine-repeat motif of possible use as a microsatellite. Phylogenetic analyses of the full gpd amplicon yielded well-resolved trees that were in nearly full accord with the trees derived from the cpDNA partitions for analyses of both the ingroup and ingroup + outgroup taxon sets. Pairwise nucleotide substitution rates of gpd were as much as 2.2 times higher than those in rps4 and 2.8 times higher than in trnL. Excision of the introns left suitable numbers of parsimony informative characters and demonstrated that the full gpd amplicon could be compartmentalized to provide resolution for both shallow and deep phylogenetic branches. Exons of gpd were found to behave in a clock-like fashion for the 26 ingroup taxa and select outgroups. In general, gpd was found to hold great promise not only for improving resolution of chloroplast-derived phylogenies, but also for phylogenetic reconstruction of recent, diversifying lineages.

    View details for Web of Science ID 000179028400002

    View details for PubMedID 12383747


