Introducing the Supercomputer
An easy way to think about a supercomputer is as a single compute server with massive processing resources, i.e. many CPU cores, very large memory, GPGPUs and local scratch storage.
Stanford has acquired its first supercomputer, an SGI (now part of Hewlett Packard Enterprise) UV300 unit, via an NIH S10 Shared Instrumentation Grant. It has 360 cores (720 threads), 10 terabytes of random-access memory (RAM), 20 terabytes of flash memory (essentially SSD disks), 4 NVIDIA Pascal GPUs (P100s are especially suited to deep learning), and 150+ terabytes of local scratch storage. The supercomputer is made available to the Stanford community via the Genetics Bioinformatics Service Center.
Biomedical workflows require many different applications. Some applications are hungry for CPU cores, some for RAM, others require fast SSDs, and some do best with GPGPUs. The supercomputer fits in nicely when data analysis requires a diversity of applications. NIH investigators at Stanford are increasingly analyzing terabyte-to-petabyte scale datasets generated using state-of-the-art biomedical technologies. It is no longer unusual to find studies that analyze hundreds of samples, correlate with other available large-scale cohorts (e.g. UK10K, TCGA), or involve longitudinal multi-modal data. Analysis and interpretation of these large-scale complex data require a computational environment that is fast and affordable.
GBSC provides a secure High Performance Computing (HPC) cluster and integration with Clouds (Google in production, Azure in alpha). Distributed computing ("scale-out") paradigms are different from supercomputing ("scale-up") paradigms. By making this supercomputer available to a large cross-disciplinary biomedical research community at Stanford, we expect to invigorate the development of novel algorithms and mathematical and statistical approaches, unhindered by the limitations of current capabilities found in typical HPC clusters and public Clouds.
Example biomedical use cases
Clinical data mining
Sources of data that can be used to build inference in translational and clinical research are expanding rapidly. Stanford researchers have access to a number of observational datasets: Stanford Hospital EHR, Medicare, IPUMS Census Data, Optum Claims, Clinical and consumer health care data, and Truven Health MarketScan Research Databases with 230 million patient records, to name a few. Just one of these sources, the Stanford clinical data warehouse, generates 6 billion normalized mentions of drugs, diseases, procedures and devices from roughly 2 million patients spanning about 35 million clinical documents. PubMed alone has published over 2 million research articles in the last two decades. Imagine the scale when we start to combine multiple sources of EHR data, claims data, knowledge bases and literature. To process hundreds of millions to billions of records and return results for any causal inference framework, it is essential to be able to run these analyses on a large number of processing units.
In complex multi-modal biology (e.g. omics, wearable, imaging, ...), the relationships between datasets are hard to characterize using relational databases. The appropriate paradigm for storing and mining these datasets is a graph database. Graph databases store data in nodes (vertices) and edges, rather than in tables as relational databases do. Graph analytics offers the capability to search and identify different characteristics of a graph dataset: nodes connected to each other, communities containing nodes, the most influential nodes, chokepoints in a dataset, and nodes similar to each other.
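As a rough illustration of the graph analytics described above, the following Python sketch stores a toy dataset as nodes and edges and then asks three of the questions listed: which nodes are connected to each other, which node is the most connected, and which nodes are chokepoints. The patient, drug, and diagnosis labels and all function names are hypothetical, and the brute-force chokepoint search is only suitable for tiny graphs; a production graph database would use its own query language and far more scalable algorithms.

```python
from collections import defaultdict, deque

# Hypothetical toy edges linking patients, drugs, and diagnoses (illustrative only).
EDGES = [
    ("patient:1", "drug:metformin"),
    ("patient:1", "dx:diabetes"),
    ("patient:2", "drug:metformin"),
    ("patient:2", "dx:hypertension"),
    ("patient:3", "dx:hypertension"),
    ("patient:3", "drug:metformin"),
    ("patient:4", "drug:warfarin"),
]


def build_graph(edges):
    """Store an undirected graph as node -> set of neighboring nodes."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def connected_components(graph, skip=None):
    """Group nodes that can reach each other; optionally pretend `skip` is removed."""
    seen = set() if skip is None else {skip}
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:  # breadth-first traversal from `start`
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def most_connected(graph):
    """Return the node(s) with the most edges, a simple proxy for influence."""
    top = max(len(neighbors) for neighbors in graph.values())
    return sorted(node for node, neighbors in graph.items() if len(neighbors) == top)


def chokepoints(graph):
    """Return nodes whose removal splits the graph into more components."""
    baseline = len(connected_components(graph))
    return sorted(
        node for node in graph
        if len(connected_components(graph, skip=node)) > baseline
    )


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print("components:", connected_components(graph))
    print("most connected:", most_connected(graph))
    print("chokepoints:", chokepoints(graph))
```

In this toy graph, the metformin node links three patients, so it is both the most connected node and a chokepoint. Expressing the same relationship queries in a relational schema would require repeated self-joins.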
New implementations in industry have shown that graph algorithms can solve real-world problems such as detecting cyberattacks, creating value from internet of things sensor data, analyzing the spread of epidemics (Ebola), and precisely identifying drug interactions faster than ever before. An open source tool, Bio4j, is a graph database framework for querying and managing protein-related information that integrates most data available in UniProt KB (SwissProt + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), NCBI Taxonomy, and ExPASy Enzyme DB. NeuroArch is a graph database framework for querying and executing fruit fly brain circuits. Researchers are increasingly looking towards graph databases when current data models and schemas cannot support their research queries and a study has many new and disparate data sources that are inherently unstructured.
Over the last decade, machine learning methods based on deep neural networks (DNNs) have come to dominate learning problems in computer vision, speech recognition, and natural language processing. Deep learning approaches approximate complex input-output mappings by automatically learning hierarchical, non-linear representations of input data, thereby avoiding the need for feature engineering that plagues learning problems in biology, where the input data types are often poorly understood. DNNs are ideally suited for biological discovery because they (i) are most effective when applied to massive, diverse training data; (ii) are designed to capture complex, non-linear and hierarchical relationships; and (iii) elegantly handle joint learning across multiple related prediction tasks that share related feature spaces. Genomic data and the associated biological questions epitomize these properties. The Pascal GPUs (P100) show deep learning acceleration in recent benchmarks.
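To make the idea of a shared, non-linear representation feeding several related prediction tasks more concrete, here is a minimal pure-Python sketch of a forward pass through one shared hidden layer and two task-specific output heads. All weights, layer sizes, and names are hypothetical and fixed by hand. A real DNN would learn its weights from data, stack many layers, and run in a GPU-accelerated framework.

```python
def relu(values):
    """Element-wise non-linearity: negative activations become zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; each row of `weights` produces one output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical hand-picked parameters (a trained network would learn these).
SHARED_W = [[1.0, -1.0, 0.5], [-0.5, 2.0, 0.0]]  # 3 input features -> 2 hidden units
SHARED_B = [0.0, -1.0]
HEAD_A_W = [[1.0, 1.0]]   # task A reads the shared hidden units
HEAD_A_B = [0.0]
HEAD_B_W = [[2.0, -1.0]]  # task B reads the same hidden units
HEAD_B_B = [0.5]


def predict(features):
    """Compute the shared representation once, then score both tasks from it."""
    hidden = relu(dense(features, SHARED_W, SHARED_B))
    task_a = dense(hidden, HEAD_A_W, HEAD_A_B)[0]
    task_b = dense(hidden, HEAD_B_W, HEAD_B_B)[0]
    return hidden, task_a, task_b


if __name__ == "__main__":
    hidden, task_a, task_b = predict([1.0, 0.0, 2.0])
    print("shared representation:", hidden)
    print("task A score:", task_a, "| task B score:", task_b)
```

Because both heads read the same hidden units, anything the shared layer learns for one task is available to the other. This is the sense in which joint learning across related tasks shares a feature space.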
Genomic Data Exploration
Making sense of genomes at individual or cohort scale is achieved by comparing individual or cohort genome data with existing knowledge bases or other cohorts. Existing knowledge bases that capture structural, functional, biomarker and pharmacological information are expanding rapidly, e.g. UCSC Genome Annotations, ENCODE, GWAS catalog, PharmGKB, and GO. Population databases now contain information across 1000s-10,000s of genomes, e.g. the 1000 Genomes project, the Exome Sequencing Project, the UK10K consortium, and AmbryShare. Analysis takes many forms, including database queries, statistical approaches such as GWAS, or machine learning techniques. Co-analyzing user data with the exploding volume of public/private datasets requires a large number of cores, GPUs or memory.
A typical metagenomic analysis requires a) assemblies (genome or transcriptome); b) post-assembly analysis and validation, such as searching against known bacterial and viral databases, classification of new assemblies, and quantitation; and finally, c) correlation studies, e.g. with metabolites. The memory requirements for de novo assembly increase dramatically with genome size. In a recent survey of memory-efficient algorithms [Kleftogiannis D], investigators attempted to run a variety of assemblers on Cloud or commodity servers. For large genomes or data sets, these runs fail or result in more contigs, i.e. unfinished assemblies. Researchers from Oklahoma State University completed the largest metagenomics assembly to date, using sequencing data from a soil metagenome that required 4TB of memory [Couger BM].
Sequence search algorithms, e.g. BLAST, can especially benefit from the large amount of RAM available on this appliance. In these search algorithms, newly assembled sequences are compared against large, GB- to TB-sized databases of known DNA and protein sequences. Once the database files are loaded into RAM, they can be accessed in under a microsecond with shared memory, allowing random reads anywhere in the database. Empirically, in-memory data, on systems similar to this supercomputer, is 1,000x faster than data on hard drives.
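The benefit of holding the whole database in RAM can be sketched with a simplified seed lookup. The Python example below builds an in-memory index of every short subsequence (k-mer) in a toy database and then counts how many k-mers of a query hit each database sequence. This is not BLAST's actual algorithm, and the sequences, the seed length `K`, and the function names are assumptions chosen for illustration. It only shows why a search step becomes a series of fast random memory lookups once the data is resident in RAM.

```python
from collections import defaultdict

K = 4  # hypothetical seed length; real tools typically use longer seeds

# Hypothetical toy database of known sequences (illustrative only).
DATABASE = {
    "seqA": "ACGTACGTGA",
    "seqB": "TTTTGGGGCC",
}


def build_kmer_index(database, k=K):
    """Map every k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, sequence in database.items():
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((name, pos))
    return index


def seed_hits(query, index, k=K):
    """Count, per database sequence, how many query k-mers match via index lookups."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):
        # Each lookup is a random access into the in-memory index.
        for name, _ in index.get(query[pos:pos + k], ()):
            hits[name] += 1
    return dict(hits)


if __name__ == "__main__":
    index = build_kmer_index(DATABASE)  # built once, then kept in RAM
    print(seed_hits("CGTACG", index))
```

The index is built once and reused for every query. On a machine with enough RAM, the same pattern extends to databases far larger than would fit in the memory of a commodity server.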
Another aspect of metagenomic studies is human contamination removal from microbiome samples. Microbiome samples collected from saliva, gut, skin, nasal cavity and other sites contain a substantial amount of human DNA. Typical microbiome studies collect many samples (sites or longitudinal). For example, it has been reported that some samples in The Human Microbiome Project contain up to 95% human sequence and 4% of the samples contain >10% human reads. Studies show repeatedly that de-identified human genome data can be re-identified [Naveed M]. So while there is no significant concern in the current regulatory landscape, it may become an issue in coming years. Newer, computationally expensive algorithmic approaches can be used to pre-process all microbiome samples and do a superior job of human contamination removal.
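One simple way to picture this pre-processing step is a k-mer filter: reads that share most of their k-mers with a human reference are set aside before downstream analysis. The Python sketch below is a minimal illustration under stated assumptions and is not the method of any particular tool. The reference string, the reads, `k`, and `threshold` are all hypothetical, and real pipelines also handle reverse complements and sequencing errors and typically rely on full alignment.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def human_fraction(read, human_kmers, k):
    """Fraction of a read's distinct k-mers that also occur in the human reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:  # read shorter than k: nothing to compare
        return 0.0
    return len(read_kmers & human_kmers) / len(read_kmers)


def remove_human_reads(reads, human_reference, k=5, threshold=0.5):
    """Split reads into (kept, removed) using the human k-mer fraction."""
    human_kmers = kmers(human_reference, k)  # held in memory for all reads
    kept, removed = [], []
    for read in reads:
        if human_fraction(read, human_kmers, k) >= threshold:
            removed.append(read)
        else:
            kept.append(read)
    return kept, removed


if __name__ == "__main__":
    # Hypothetical toy data (illustrative only).
    human_reference = "ACGTTGCAAGGCTTAC"
    reads = ["GTTGCAAGG", "TTTTTTTTT", "CCCCCGGGGG"]
    kept, removed = remove_human_reads(reads, human_reference)
    print("kept:", kept)
    print("removed as human:", removed)
```

For a real human reference, the k-mer set alone occupies many gigabytes. This is one reason such filtering benefits from a large shared-memory machine.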
- Who can use the supercomputer? The supercomputer is part of the Genetics Bioinformatics Service Center (GBSC). GBSC services are available to all biomedical researchers at Stanford and affiliated organizations. Contact GBSC for further information.
- If my lab is signed up for GBSC services, what other paperwork do I need to submit? To meet NIH reporting requirements, GBSC administration will collect some information from you via a Qualtrics survey. Once this information is collected, you will receive access to the supercomputer.
- Are there resources or how-to guides available? A user guide is available on the SCG wiki. To log into the SCG wiki, you will need a SUNetID.