Introducing the Supercomputer
An easy way to think about a supercomputer is as a single compute server with a massive number of data processing resources, i.e., many CPU cores, very large memory, GPGPUs, and local scratch storage.
Stanford has acquired its first supercomputer, an SGI (now part of Hewlett Packard Enterprise) UV300 unit, via an NIH S10 Shared Instrumentation Grant. It has 360 cores (720 threads), 10 terabytes of random-access memory (RAM), 20 terabytes of flash memory (essentially SSDs), 4 NVIDIA Pascal GPUs (P100s are especially suited to deep learning), and 150+ terabytes of local scratch storage. The supercomputer is made available to the Stanford community via the Genetics Bioinformatics Service Center (GBSC).
Biomedical workflows require many different applications. Some applications are hungry for CPU cores, some for RAM, others require fast SSDs, and some do best with GPGPUs. The supercomputer fits in nicely when data analysis requires a diverse set of applications. NIH investigators at Stanford are increasingly analyzing terabyte-to-petabyte scale datasets generated using state-of-the-art biomedical technologies. It is no longer unusual to find studies that analyze hundreds of samples, correlate with other large-scale cohorts (e.g. UK10K, TCGA), or involve longitudinal multi-modal data. Analysis and interpretation of these large-scale, complex data require a computational environment that is fast and affordable. GBSC provides a secure High Performance Computing (HPC) cluster and integration with clouds (Google in production, Azure in alpha). Distributed computing ("scale-out") paradigms differ from supercomputing ("scale-up") paradigms. By making this supercomputer available to a large cross-disciplinary biomedical research community at Stanford, we expect to invigorate the development of novel algorithms and mathematical and statistical approaches, unhindered by the limitations of typical HPC clusters and public clouds.
Example biomedical use cases
Clinical data mining
Sources of data that can be used to build inference in translational and clinical research are expanding rapidly. Stanford researchers have access to a number of observational datasets: Stanford Hospital EHR, Medicare, IPUMS Census Data, Optum Claims, clinical and consumer health care data, and Truven Health MarketScan Research Databases with 230 million patient records, to name a few. Just one of these sources, the Stanford clinical data warehouse, generates 6 billion normalized mentions of drugs, diseases, procedures and devices from roughly 2 million patients spanning about 35 million clinical documents. PubMed alone has indexed over 2 million research articles in the last two decades. Imagine the scale when we start to combine multiple sources of EHR data, claims data, knowledge bases and literature. To process hundreds of millions to billions of records and return results for any causal inference framework, it is essential to be able to run these analyses on a large number of processing units.
In complex multi-modal biology (e.g. omics, wearable, imaging, ...), the relationships between datasets are hard to characterize using relational databases. A more appropriate paradigm for storing and mining these datasets is a graph database. Graph databases store data as nodes (vertices) and edges rather than as tables, as relational databases do. Graph analytics offers the capability to search for and identify different characteristics of a graph dataset: nodes connected to each other, communities containing nodes, the most influential nodes, chokepoints in a dataset, and nodes similar to each other.
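As a minimal sketch of the node-and-edge model and the kinds of queries described above, the following Python snippet stores a small graph as an adjacency map and answers two of them: which nodes are connected to each other (connected components) and which node has the most direct connections (degree, used here as a simple stand-in for influence). The node names and relationships are hypothetical, and a production graph database would use its own storage engine and query language rather than in-memory dictionaries.

```python
from collections import defaultdict, deque

# Hypothetical edges linking genes, diseases, a drug, and a patient sample.
EDGES = [
    ("gene:A", "disease:D1"),
    ("gene:B", "disease:D1"),
    ("gene:C", "disease:D1"),
    ("drug:X", "gene:A"),
    ("sample:S1", "gene:B"),
    ("gene:E", "disease:D2"),
]


def build_graph(edges):
    """Return an undirected adjacency map: node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_components(graph):
    """Group nodes that can reach each other, using breadth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def most_connected(graph):
    """Return the node with the highest degree (number of direct neighbours)."""
    return max(graph, key=lambda node: len(graph[node]))


graph = build_graph(EDGES)
components = connected_components(graph)
hub = most_connected(graph)
```

In this toy graph the two diseases fall into separate components, and the node with the most neighbours is the disease shared by three genes.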
New implementations in industry have shown that graph algorithms can solve real-world problems such as detecting cyberattacks, creating value from internet-of-things sensor data, analyzing the spread of epidemics (Ebola), and precisely identifying drug interactions faster than ever before. An open source tool, Bio4j, is a graph database framework for querying and managing protein-related information that integrates most data available in UniProt KB (SwissProt + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), NCBI Taxonomy, and ExPASy Enzyme DB. NeuroArch is a graph database framework for querying and executing fruit fly brain circuits. Researchers increasingly look to graph databases when current data models and schemas do not support their research queries and a study has many new and disparate data sources that are inherently unstructured.
Over the last decade, machine learning methods based on deep neural networks (DNNs) have come to dominate learning problems in computer vision, speech recognition, and natural language processing. Deep learning approaches approximate complex input-output mappings by automatically learning hierarchical, non-linear representations of input data, thereby avoiding the need for the feature engineering that plagues learning problems in biology, where the input data types are often poorly understood. DNNs are ideally suited for biological discovery because they (i) are most effective when applied to massive, diverse training data; (ii) are designed to capture complex, non-linear and hierarchical relationships; and (iii) elegantly handle joint learning across multiple related prediction tasks that share related feature spaces. Genomic data and the associated biological questions epitomize these properties. The Pascal (P100) GPUs show deep learning acceleration in recent benchmarks.
Genomic Data Exploration
Making sense of genomes at individual or cohort scale is achieved by comparing individual or cohort genome data with existing knowledge bases or other cohorts. Existing knowledge bases that capture structural, functional, biomarker and pharmacological information are expanding rapidly, e.g. UCSC Genome Annotations, ENCODE, GWAS catalog, PharmGKB, and GO. Population databases now contain information across thousands to tens of thousands of genomes, e.g. the 1000 Genomes Project, the Exome Sequencing Project, the UK10K consortium, and AmbryShare. Analysis takes many forms, including database queries, statistical approaches such as GWAS, and machine learning techniques. Co-analyzing user data with the exploding volume of public and private datasets requires a large number of cores, GPUs or a large amount of memory.
A typical metagenomic analysis requires a) assemblies (genome or transcriptome); b) post-assembly analysis and validation, such as searching against known bacterial and viral databases, classification of new assemblies, and quantitation; and finally c) correlation studies, e.g. with metabolites. The memory requirements for de novo assembly increase dramatically with genome size. In a recent survey of memory-efficient algorithms [Kleftogiannis D], investigators attempted to run a variety of assemblers on cloud or commodity servers. For large genomes or data sets, these runs fail or result in more contigs, i.e. unfinished assemblies. Researchers from Oklahoma State University completed the largest metagenomic assembly to date, using sequencing data from a soil metagenome, which required 4 TB of memory [Couger BM].
Sequence search algorithms such as BLAST can especially benefit from the large amount of RAM available on this appliance. In these search algorithms, newly assembled sequences are compared against large (GB- to TB-sized) databases of known DNA and protein sequences. Once the database files are loaded into RAM, any location in them can be read at random in under a microsecond via shared memory. Empirically, on systems similar to this supercomputer, access to in-memory data is 1,000x faster than access to data on hard drives.
Another aspect of metagenomic studies is the removal of human contamination from microbiome samples. Microbiome samples collected from saliva, gut, skin, nasal cavity and other sites can contain a substantial amount of human DNA. Typical microbiome studies collect many samples (across sites or over time). For example, it has been reported that some samples in the Human Microbiome Project contain up to 95% human sequence and that 4% of the samples contain >10% human reads. Studies repeatedly show that de-identified human genome data can be re-identified [Naveed M]. So while this is not a significant concern in the current regulatory landscape, it may become an issue in coming years. Newer, computationally expensive algorithmic approaches can be used to pre-process all microbiome samples and do a superior job of human contamination removal.
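To illustrate why keeping the reference database in RAM helps, here is a minimal sketch of an in-memory k-mer index: the database is loaded once into a dictionary, after which each lookup is a random memory access rather than a disk read. This is not BLAST's algorithm; the sequences, the value of `K`, and the function names are illustrative assumptions.

```python
from collections import defaultdict

K = 4  # k-mer length; real tools use larger values, kept small here for readability

# Hypothetical reference "database" of known sequences.
DATABASE = {
    "ref1": "ACGTACGTGA",
    "ref2": "TTGACCGTAA",
}


def kmers(sequence, k=K):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(database, k=K):
    """Load the whole database into RAM as k-mer -> set of sequence ids."""
    index = defaultdict(set)
    for seq_id, sequence in database.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(seq_id)
    return index


def search(index, query, k=K):
    """Count the query k-mers that also occur in each indexed sequence."""
    hits = defaultdict(int)
    for kmer in kmers(query, k):
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return dict(hits)


index = build_index(DATABASE)
hits = search(index, "GACCGTA")
```

The index is built once and then queried many times, which is the access pattern that benefits from a machine whose RAM can hold the entire database.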
Is the supercomputer available for research use? Yes! We received the grant in Q2 2017. The appliance was deployed in Q3 2017, extensively benchmarked and alpha tested, and made generally available in Q4 2017.
Who can use the supercomputer? The supercomputer is part of the Genetics Bioinformatics Service Center (GBSC). GBSC services are available to all biomedical researchers at Stanford and affiliated organizations. Contact GBSC for further information.
Are there resources or how-tos available? The supercomputer is essentially one giant server. It is available to users via the Slurm scheduler. A user guide is available on the SCG wiki. To log into the SCG wiki, you will need a SUNetID.
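As a rough sketch of what requesting resources through Slurm looks like, the Python helper below assembles a batch script from a job's CPU, memory, GPU, and time requirements. The `#SBATCH` options shown (`--job-name`, `--cpus-per-task`, `--mem`, `--time`, `--gres`) are standard Slurm directives, but the helper function, job name, resource values, and command are hypothetical, and any partition or account settings required on SCG are not shown; the SCG user guide is the place to find the actual values.

```python
def make_sbatch_script(job_name, command, cpus, mem_gb, hours, gpus=0):
    """Return the text of a Slurm batch script requesting only what the job needs."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={hours:02d}:00:00",
    ]
    if gpus:
        # Generic resource request; GPU type names are site-specific and omitted.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines += ["", command, ""]
    return "\n".join(lines)


# Hypothetical high-memory assembly job; the command is a placeholder.
script = make_sbatch_script(
    job_name="assembly",
    command="echo 'run assembler here'",
    cpus=32,
    mem_gb=1024,
    hours=48,
)
print(script)
```

The resulting text could be saved to a file and submitted with `sbatch`.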
UV300 System Details
Specs, system setup and performance benchmark
Supercomputer Configuration Summary
- High memory-to-processor ratio: Intel Xeon E7-8867 v4 with 24 CPUs/socket
- SGI NUMAlink™ 7 interconnect (NL7; 7.47 GB/s bidirectional peak): ultra-low latency with all-to-all and multi-dimensional all-to-all network topologies
- Extreme I/O: 12 PCIe Gen3 slots per chassis
- NVIDIA Pascal (P100) GPUs:
  - 5.3 TFLOPS of double-precision floating point (FP64) performance
  - 10.6 TFLOPS of single-precision floating point (FP32) performance
  - 21.2 TFLOPS of half-precision floating point (FP16) performance
  - NVLink high-speed interface provides GPU-to-GPU data transfers at up to 160 GB/s of bidirectional bandwidth
  - HBM2: offering three times (3x) the memory bandwidth of the Maxwell GM200 GPU
  - Unified Memory and Compute Preemption via CUDA 8
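For a sense of scale, the figures quoted on this page can be combined with simple arithmetic. The sketch below assumes that the TFLOPS numbers listed above are per-GPU peaks, that all four GPUs are counted, and that 1 TB is treated as 1,024 GB; the results are theoretical upper bounds, not measured performance.

```python
CORES = 360   # physical cores, from the system description
RAM_TB = 10   # total RAM in terabytes
GPUS = 4      # number of P100 GPUs

# Assumed per-GPU theoretical peaks in TFLOPS, as listed above.
PEAK_TFLOPS_PER_GPU = {"FP64": 5.3, "FP32": 10.6, "FP16": 21.2}

# Memory available per core, assuming 1 TB = 1024 GB.
ram_per_core_gb = RAM_TB * 1024 / CORES

# Aggregate theoretical peak across all GPUs for each precision.
aggregate_tflops = {p: GPUS * v for p, v in PEAK_TFLOPS_PER_GPU.items()}

print(f"RAM per core: {ram_per_core_gb:.1f} GB")
for precision, tflops in aggregate_tflops.items():
    print(f"Aggregate {precision} peak: {tflops:.1f} TFLOPS")
```

Under these assumptions the system offers roughly 28 GB of RAM per core, which is what the "high memory-to-processor ratio" refers to.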
System Configuration Summary
This supercomputer is integrated into GBSC's computing cluster SCG and meets moderate risk compliance requirements along with dbGaP compliance requirements. The supercomputer and the cluster nodes all have access to 7 petabytes of high-performance, redundant storage, divided across two storage subsystems (the older DDN and the newer Isilon). Analyses are managed on the supercomputer and the other nodes via a Slurm job scheduler, which controls access to the resources provided by all the computing nodes. By allowing access to only the resources needed for each job, the supercomputer can maximize its utilization and minimize its idle time.
The SCGPM team has extensively benchmarked the UV300 for MPI/OpenMP and CUDA use cases. These results will be made publicly available shortly. In the meantime, please reach out to the administrators for details.
The following stories have emerged from our UV300 user community:
Application of metagenomics assembly in study of microbes found on the human skin
Ami Bhatt's research student Matthew Durrant wanted to explore the use of publicly available next-generation sequencing data to analyze the genetic composition of the microbes found on human skin. The dataset included 616 samples, totaling ~1 trillion base pairs or 1.4 TB of compressed data. The research goal was to assemble all of these samples using SPAdes, a metagenomic sequence assembler, and to taxonomically classify all of the reads using Kraken, a k-mer-based sequence classification program. Matt performed an incredible 616 metagenomic assemblies in about a week on the UV300, with several of these assemblies requiring more than 1 TB of RAM.
Deep learning for Learning Health System
Nigam Shah's research student Stephen Robert Pfohl uses patient data in the electronic health record (EHR) to develop algorithms and models capable of automatically performing patient-level outcome prediction, risk stratification, and clinical decision support, with the eventual goal of developing a learning health system. EHR data poses challenges for these tasks because it is heterogeneous and sparse, with a complex temporal structure. For instance, a typical patient record contains multiple visits (potentially over several years), where each visit may contain coded data (e.g. diagnosis, procedure, and medication codes), unstructured text in clinical notes, laboratory measurements, linked genomic data, and densely sampled time series or waveforms from bedside monitors. Models developed with EHR data often fail to generalize to new health systems and domains that differ from those on which they were trained, due to covariate shift and a lack of interoperability between sites.
Stephen leverages deep neural networks to develop methods capable of adapting learned representations of human disease to new domains with minimal input from human experts. The UV300's NVIDIA Tesla P100 GPUs are well suited to deep learning applications. As this research involves training large and complex models on hundreds of millions of anonymized records from millions of patients, the high-performance capabilities of the UV300 have greatly accelerated this research.