From the Director's desk
Somalee Datta, PhD
As a long-term computational scientist, I am seeing a change that I am genuinely excited about - the availability of multiple trusted Cloud partners. Not only is the computational infrastructure becoming democratized; we are also seeing a faster evolution of the product portfolio, and we expect to see a reduction in cost as Cloud businesses scale up.
All data-centric individuals like to think that data is gold. But data is gold only if it is open to analytics and eventual interpretation. Data in an archive is a graveyard for bits! Cloud does not just support one or two best-practice analytical stacks; it brings the freedom to operate in a multi-dimensional analytical space - one data set and multiple analytics. This aspect of the Cloud paradigm resonates especially with research infrastructure architects like me, who are trying to serve an unpredictable research environment (shifts are happening at all times and at all scales) with limited resources (NIH does not have funds for infrastructure like Facebook does). There is no uniform processing at the frontier of innovation, no production pipelines to run, and no best practices to follow. What looks like chaos is really many, many experiments and innovations happening at all times. And it is imperative to increase the pace of experiments. There is a lot of data, there are not enough analysts, and there never will be. So if we enable the analysts we have, we will increase the pace of discovery.
The other impact comes from ease of collaboration. Cloud is no longer my infrastructure or yours, it is ours.
Impact of Governance
For the biomedical community, NIH has played a significant role in the acceptance of Cloud as a critical computational paradigm. Cancer Cloud, PrecisionFDA and the Precision Medicine Initiative have distinctly different goals, but they have all adopted Cloud. This adoption process makes Cloud partners aware of the unique needs of large-scale biomedical data sharing and the related privacy and ethics issues - we expect this will make IaaS and PaaS even better.
Impact of Standards
NIST and PrecisionFDA have joined forces and together have spurred a flurry of activity (link to 2016 competition) on the leaderboard for whole-genome-based precision. There are two exciting components in this endeavour. The first is the emergence of a platform that allows new products like Sentieon to announce themselves. The other is the emergence of a crowdsourcing mechanism feeding into NIST to improve the truth set (see Dr. Justin Zook's announcement here).
The Global Alliance for Genomics and Health (GA4GH) is our international community for emerging genomics standards. Take the multitude of file formats away and give us the APIs! We are happy to report that the SCGPM team was one of the early adopters of GA4GH APIs via the Google Genomics implementation. What a relief it was to worry only about the bioinformatics application and not about backend performance at the petabyte scale.
And while we agree with xkcd regarding standards, we think that GA4GH will be definitive.
Impact of Technology Landscape
Docker and Dockerhub are bringing reproducibility and simplifying collaborations. We are especially excited about the HPC-centric containerization product, Singularity, which promises to bridge on-premise bare-metal clusters with Cloud. Singularity was developed by Berkeley Research Computing (BRC), and our Stanford Research Computing Center (SRCC) is connecting Singularity with biomedical application containers on Dockerhub.
Vision at SCGPM
At SCGPM, we are in alignment with NIH on the vision of Data Commons (see NIH Data Commons vs GBSC Vision). We are making progress, slowly but surely, in cooperation with GA4GH, Cloud partners, and the community (research and technology sectors). The success of this vision depends on the success of our research community. We have essentially established a research help desk for our researchers. Between office hours, a support mailing list, a community discussion forum, and a bioinformatics service, we have a support mechanism for our community. The help desk is particularly strong for our on-premise HPC. Our vision for 2017 is to extend our help desk to incorporate Cloud.
If I have created the impression that on-premise HPC is dead, then I want to take the opportunity to reiterate that HPC continues to be the centerpiece of our research. If we can focus our on-premise efforts on powerful servers and networks for low-latency workflows, we can benefit from the best of both worlds - "scale out" on Cloud and "scale up" on-premise. We expanded our storage portfolio this year to add Isilon to our existing DDN and NetApp products. We continue to evaluate new products and partnership opportunities for our HPC.
Finally, I wish to thank the Stanford research community, the Department of Genetics, the Stanford Research Computing Center and my team for doing what they do. Wishing everyone a fantastic 2017 ahead.
2016: New and noteworthy for SCGPM Bioinformatics
Successful translation from research to clinic
Stanford Clinical Genomics Service (CGS) is ramping up for production this year. The SCGPM bioinformatics team was part of its inception, and we nurtured the pilot. We built the analytic pipeline, analyzed over 100 patients (several trios, families), and developed the Cloud conduit for translation.
A key component of the CGS informatics framework is Loom, developed by Nathan Hammond, PhD, who transitioned from SCGPM to a Sr Scientist role at Stanford Health Care (SHC), and Isaac Liao, PhD, Software Engineer at SCGPM. Loom is a workflow manager that makes data analysis portable, traceable, and reproducible - built on Docker and GA4GH principles of interoperability. (more)
SHC and SCGPM continue to work hand in hand. The CGS engineering team has come together under the supervision of Sowmi Utiramerur, Director of Genome Informatics at SHC. Recently the team presented on the Genome Information Management System (GIMS) at an SCGPM-organized seminar. Over a hundred executives and researchers attended the seminar at the popular Li Ka Shing Center venue. The video below starts with a brief overview of GBSC by organizer Keith Bettinger, MS, Head of Infrastructure at SCGPM; subsequently, the SHC team presents GIMS and Loom.
Genomics and Privacy
Research using genomics data is fraught with concerns around patient re-identification from de-identified sequence data. This has resulted in complex data sharing agreements and patient consent processes. We believe that the current genomics research bottleneck no longer lies within a data silo but in being able to share insights across silos.
Somalee Datta, Director of Bioinformatics at SCGPM, kicked off the year with a Personalized Medicine World Conference (PMWC 2016) panel session titled Genome: Silos, Hacking Privacy, Collaboration. Other panelists included Philip Tsao, PhD, Director of Epidemiological Research and Information Center for Genomics at VA Palo Alto, who spoke about The Million Veteran Program, and William Knox Carey, PhD, VP of Healthcare Technology at Intertrust, who spoke about Access vs Privacy: A False Dichotomy (more). Somalee spoke about the Shifting Bottlenecks in Bioinformatics.
Soon after, she hosted a unique conference centered on the theme of privacy that invited participation from academia and industry. The purpose of this conference was to investigate the role of privacy-preserving technologies in genomics data collaboration. Are there classes of relevant genomics algorithms or queries that can benefit from privacy-protecting analytical methods and be applied to real-world data sharing scenarios? (more)
The following video is a presentation by Kristin Lauter, PhD, Principal Researcher and Research Manager for the Cryptography group at Microsoft Research. She presents a demonstration of privacy-preserving technologies for genomic data sharing.
Kristin's group won the iDASH 2016 competition in Track 3, Testing for Genetic Diseases on Encrypted Genomes (secure outsourcing). The task is to calculate the probability of genetic diseases by matching a set of biomarkers to encrypted genomes stored in a commercial cloud service. Our heartfelt congratulations to her and the Microsoft team. We are looking forward to our ongoing collaboration with Microsoft, with a focus on making these methods available to genomics researchers.
Successful launch of Bioinformatics-as-a-Service
Late last year, with seed funding from the Office of the Dean of Research, we launched Stanford's first bioinformatics service program as part of the Genetics Bioinformatics Service Center. The service went from 0 to 60 in a matter of a few months. (more)
The following slide show is from an SCGPM training seminar held in September. It starts with an introduction by Ramesh Nair, PhD, Head of Bioinformatics-as-a-Service, followed by Yue (Wendy) Zhang, Sr. Scientist, presenting on Weighted Gene Co-expression Network Analysis (WGCNA) and Single-Cell Differential Expression (SCDE).
Sequencing informatics moves to cloud
SCGPM’s Genome Sequencing Service Center (GSSC) is a state-of-the-art genomics facility established to support cost-effective, high-throughput sequencing for Stanford research. This year, GSSC became the sequencing hub for CIRM’s Center of Excellence in Stem Cell Genomics (Stanford, UCSF, UCSC, UCLA, UCSD, UC Berkeley, Salk Institute, JCVI, and Scripps Research Institute). In terms of sheer throughput, GSSC produced 22,000 exomes' worth of data last year.
To keep up with the pace of GSSC's growing community, SCGPM moved the informatics processing to Cloud. Sequencing data are now delivered directly to a cloud informatics platform, backed by DNAnexus, that provides storage, compute, and access to popular bioinformatics tools. From that platform, data can be shared with collaborators around the world or plugged directly into downstream analyses. Many shared resources for processing sequencing data, such as ENCODE Consortium workflows for analyzing Methyl-seq, ChIP-seq, and RNA-seq data, are publicly available on DNAnexus, as are tools for generating custom workflows. All of these tools for managing, sharing, and analyzing data are accessible through an easy-to-use web interface or command-line console - thus breaking down barriers between data generators and analysts, and making bioinformatics accessible to novices and experts alike. DNAnexus now supports both the AWS and Azure platforms - a single-pane-of-glass approach - thus making the collaboration network even wider.
The following video is from our Cloud workshop, where Ramesh Nair, PhD, Associate Director of Bioinformatics at Stanford Center for Genomics and Personalized Medicine, does a deep dive on RNA-sequencing analysis on a Cloud platform. The introduction is provided by Paul Billing-Ross, workshop organizer and SCGPM developer responsible for migrating the sequencing informatics platform to Cloud.
Bioinformatics for the microbiome workshop
The three pounds of microbes that you carry around with you might be more important than every single gene you carry around in your genome - Rob Knight, Professor of Pediatrics, UC San Diego, Keynote Speaker
Ramesh Nair, PhD, Head of Bioinformatics at SCGPM, co-organized a day-long, bioinformatics-focused microbiome workshop with Ami Bhatt, MD, PhD, on the heels of the announcement of the National Microbiome Initiative. It was an exciting day packed with talks and panels, with representation from academic and industry partners. (more)
Secure Cloud Computing for Genomic Data
Last year NIH allowed Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy to be placed on Cloud, and provided best-practice recommendations. Internally, we found ourselves trying to meet guidelines from our school and private IRBs that did not fully overlap. We consolidated the requirements, keeping in mind the principle of "build once, use many times". Our approach was published in Nature Biotechnology this year as a peer-reviewed commentary.
Join our efforts in building a biomedical research infrastructure ...
... that benefits Stanford researchers, broader research community and patient care
First and foremost, I wish to thank everyone who has supported our work in 2016. Aside from various sponsored research including ENCODE, the CIRM Center of Excellence in Stem Cell Genomics, and the Human Microbiome Project, I specifically wish to thank Stanford Health Care, which enables us to support the Clinical Genomics Service. I also wish to thank Prof Phil Tsao at VA Palo Alto, who enables us to support his collaboration in the Million Veteran Program. Dean Ann Arvin and Dean Harry Greenberg enabled us to start the Bioinformatics-as-a-Service offering with seed funding. Our Genomics and Patient Privacy conference was sponsored by Stanford IRT, Lucile Packard Children's Hospital, Mayfield Ventures, Microsoft and Intertrust. Our Bioinformatics for Microbiome workshop was sponsored by Illumina, Janssen’s Human Microbiome Institute and Huawei Enterprise Business Group USA. Our Cloud workshop was sponsored by DNAnexus. We are also the recipients of two corporate gifts, one from Microsoft Research and another from Google (via Stanford Data Science Initiative), that are supporting our Cloud efforts.
Corporate entities and individuals are invited to partner with us. There are many modes of participation.
- Gift funding: Gift funding allows you to support world-class research conducted at Stanford and contribute towards building resources that have far-reaching impact. (more)
- Become an industry affiliate: We have close ties with the Stanford Data Science Initiative and are working towards the Data Commons vision.
- In-kind contributions: We seek in-kind contributions in the form of server technologies and software access. In turn, you will be able to gain a deep understanding of the science your technology fuels.
- Attend and sponsor an event: We host workshops and seminars throughout the year. These are typically attended by industry members, faculty and researchers. Where feasible, the events have live streams and videos are made available (e.g. Microbiome and Privacy workshops) but there is nothing more vibrant than live participation.