As DNA sequencing has become more affordable and a busier area of medical research, a growing number of researchers have dived into questions that take a lot of computing power to answer. The next step is obvious: startups offering cloud services customized for genomic (or even “multi-omic”) research.
In October, an ambitious, collaborative genetic research program based at Baylor College of Medicine became the largest such project to date, by its own account. Part of Cohorts for Heart and Aging Research in Genomic Epidemiology, or CHARGE, the sequencing effort set out to link particular genetic variants to the risk of particular diseases, with a specific focus on heart disease. The task combines two things that each mean big data: population research and whole genome sequencing.
Working with the platform-as-a-service company DNAnexus, on infrastructure hosted by Amazon Web Services, researchers sequenced the DNA of more than 14,000 individuals, including 3,751 whole genomes and 10,771 exomes. (The exome is the roughly 1 percent of the genome thought to house the mutations relevant to genetic diseases.) The number crunching required 2.4 million core-hours of computational time and turned out 430 terabytes of results, with nearly a petabyte of data stored for further analysis. (By comparison, the 1000 Genomes Project required 25 terabytes of data.)
“Having access to this much data was unique. Many institutions do not have the local compute resources and infrastructure to support large scale analysis projects like this one,” said Jeffrey Reid, assistant professor in the Human Genome Sequencing Center at Baylor.
Reid told the American Society of Human Genetics conference in October that the volume of data in CHARGE’s most recent work had required researchers to move to the cloud. Researchers are now working with the sequencing data processed on DNAnexus’s platform.
Big data research projects like this one seem to be a genuinely good use case for cloud computing, unlike the many products that use “cloud” as little more than a buzzword.
“It’s largely an infrastructure and agility issue for genomics. I think it will be the status quo both now and in the future,” Alan Louie, an analyst at IDC Health Insights, told Singularity Hub.
That’s the hypothesis behind DNAnexus, which launched its cloud-based service in 2010.
“Many large-scale population studies to date have been limited in scope by a lack of the necessary compute power; this is a real hindrance in realizing the full promise of genomic medicine,” said Richard Daly, CEO of DNAnexus.
By running the data in the cloud, the CHARGE researchers got the computing work done 12 times faster than they could have with their own hardware, even if they had monopolized campus servers for nearly a month. The HIPAA-compliant DNAnexus platform also made it easier for 300 researchers at five institutions to collaborate on the sequencing project.
But medical researchers are often hesitant to use cloud services, due to concerns about the privacy of the data that they are legally bound to safeguard.
“What we’ve actually found is if you look at HIPAA breaches, more than 75 percent of them are due to people losing their laptops or their own device. We really think that using the cloud is a way for organizations to take advantage of the best-in-class security you can get versus trying to build it in-house,” DNAnexus CTO Andreas Sundquist told Singularity Hub.
DNAnexus’s competitors Bina Technologies and Knome address the issue by selling machines optimized to run genome-analysis software that researchers keep on site.
But two new companies, SolveBio and GeneStack, will soon emerge from beta tests to challenge DNAnexus in the cloud. The model will be an interesting test for cloud computing, but don’t expect vicious competition. There’s plenty of DNA research to go around.
Photos: Eskimar, anyaivanova and GraphicGeoff via Shutterstock.com