To most of us, DNA stores the code of all living things.
But according to computer scientists, DNA may one day become the preferred storage medium for all things.
Earlier this month, a team of scientists at the University of Washington published a paper describing one of the first complete DNA storage systems that encodes, stores and retrieves digital images using strings of synthetic DNA molecules.
This isn’t the first attempt at using the code of life to store digital data, but it’s certainly unique. Here’s the kicker: the system supports random access — selectively reading only a desired file from DNA, rather than the entire database — a first in the field, according to the press release.
“It’s like jumping from tape to CD, a quantum leap, but for DNA-based data storage,” explains Dr. Luis Ceze, a computer scientist and engineer who led the study, in an interview with Singularity Hub.
“Imagine if Facebook had to look through every photo they have to find the one you asked to see!” agrees first author James Bornholt.
“Random access, what we achieved, is critical to a practical storage system,” he says.
Nature’s Data Code
Why would we want to store computer data on the fabric of life?
The short answer: breathtaking raw storage capacity and longevity.
Our digital universe, the set of all digital data worldwide, is forecast to grow to over 16 zettabytes in 2017. This exponential growth is quickly exceeding our ability to store it, even when accounting for development in our current storage technologies.
Compared to electronic or magnetic storage systems, the storage capacity of DNA is staggering. Made up of four nucleotide “letters” (As, Ts, Cs and Gs), DNA can theoretically store one exabyte of data in the volume of a grain of sand. That’s roughly eight orders of magnitude denser than tape, and roughly equivalent to 200 million DVDs.
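The DVD comparison is easy to sanity-check. The short sketch below assumes a decimal exabyte (10^18 bytes) and a standard 4.7 GB single-layer DVD; neither figure is stated in the article, so treat the result as a rough check rather than the authors' own calculation.

```python
# Rough check of the "200 million DVDs" comparison.
# Assumptions (not from the article): a decimal exabyte and a
# 4.7 GB single-layer DVD.
EXABYTE_BYTES = 10**18
DVD_BYTES = 4.7e9

dvd_count = EXABYTE_BYTES / DVD_BYTES
print(f"about {dvd_count / 1e6:.0f} million DVDs")
```

Under those assumptions the count lands a little above 200 million, in line with the figure quoted.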
What’s more, DNA is extremely stable. The average lifespan for rotating disks and optical storage systems is at most a few decades.
In contrast, DNA has the potential to reliably store data for centuries without significant decay. (That’s why paleontologists are able to sequence DNA from long-extinct species such as the woolly mammoth.) In a study presented last year at the American Chemical Society Conference, a team of researchers led by Dr. Robert Grass at ETH Zurich showed that DNA can be kept error-free for at least 2,000 years at about 50 degrees Fahrenheit (10 degrees Celsius). The half-life is even longer if kept dry and at colder temperatures.
DNA sounds like the perfect data storage system, but getting to large-scale commercial use isn’t easy.
One major issue is accuracy. The biochemical processes of synthesizing DNA (encoding data) and sequencing DNA (retrieving data) aren’t always accurate, so these storage systems require a certain level of redundancy.
Another critical roadblock is random access.
As Grass explains, “in DNA storage, you have a drop of liquid containing floating molecules encoded with information. Right now, we can read everything that’s in the drop. But I can’t point to a specific place within the drop and read only one file.”
Those are the challenges tackled in this new study.
DNA-Based Archival Storage
Here’s how the new system works.
The team first translated image and video data into a standard binary code of 0s and 1s. Each image or video is then broken into thousands of pieces. Using a lossless encoding technique called Huffman coding, the team next mapped each piece into short synthetic DNA strands. In this way, a single image may result in thousands of snippets of DNA.
During DNA synthesis, the researchers added a unique identifier — an “address” made up of a short nucleotide sequence — that lets them later reassemble the snippets into a complete image, much like putting together a jigsaw puzzle. To read the data, the DNA is sequenced on a machine, and the nucleotide sequence is translated back into bytes.
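The address idea can be illustrated without any chemistry. The sketch below is a plain-software analogy, not the team's encoder: it splits a byte string into pieces, tags each piece with its position, shuffles them as if they were floating in a pool, and then reassembles them by address. The function names and the chunk size are made up for the illustration.

```python
import random

def split_with_addresses(data: bytes, chunk_size: int) -> list[tuple[int, bytes]]:
    """Break data into fixed-size pieces, tagging each with its position."""
    return [
        (address, data[start:start + chunk_size])
        for address, start in enumerate(range(0, len(data), chunk_size))
    ]

def reassemble(pieces: list[tuple[int, bytes]]) -> bytes:
    """Sort the pieces by address and join their payloads back together."""
    return b"".join(payload for _, payload in sorted(pieces))

image = bytes(range(50))      # stand-in for real image data
pieces = split_with_addresses(image, chunk_size=8)
random.shuffle(pieces)        # strands sit in the pool in no particular order
restored = reassemble(pieces)
print(restored == image)
```

Because every piece carries its own address, the order in which the pieces are read back does not matter.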
Every step of this biochemical process is error-prone, so the authors built in a few fail-safe mechanisms to enhance reliability.
For one, rather than using the conventional base-4 system for DNA (for example, converting the binary string 01110001 to the base-4 string 1301 and then to the DNA sequence CTAC), the team decided to use a base-3 system instead.
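The base-4 example above can be reproduced in a few lines. The digit-to-letter table (0 to A, 1 to C, 2 to G, 3 to T) is inferred from the article's 1301 to CTAC example; the entry for 2 is an assumption. The article does not spell out how base-3 digits become nucleotides, so the second function is only a guess at one possible layout: a rotating code in which each digit picks one of the three letters that differ from the previous one, which keeps the same letter from ever appearing twice in a row.

```python
NUCLEOTIDES = "ACGT"  # digit 0 -> A, 1 -> C, 2 -> G (assumed), 3 -> T

def bits_to_base4(bits: str) -> str:
    """Read the bit string two bits at a time as base-4 digits."""
    if len(bits) % 2:
        raise ValueError("bit string must have an even length")
    return "".join(str(int(bits[i:i + 2], 2)) for i in range(0, len(bits), 2))

def base4_to_dna(digits: str) -> str:
    """Map each base-4 digit straight to one nucleotide."""
    return "".join(NUCLEOTIDES[int(d)] for d in digits)

def trits_to_dna(trits: str, previous: str = "A") -> str:
    """Hypothetical rotating base-3 code (an assumption, not the paper's table).

    Each digit 0-2 selects one of the three nucleotides that differ
    from the previous one, so no letter repeats back to back.
    """
    strand = []
    for trit in trits:
        choices = [n for n in NUCLEOTIDES if n != previous]
        previous = choices[int(trit)]
        strand.append(previous)
    return "".join(strand)

digits = bits_to_base4("01110001")
print(digits, base4_to_dna(digits))   # 1301 CTAC
print(trits_to_dna("0000"))           # CACA
```

The trade-off in this sketch is density: three choices per position carry less information than four, and that spare capacity is what buys the extra robustness.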
The base-3 strategy adds some redundancy to the data encoding, so the data can be retrieved more reliably during the DNA sequencing process, explained the authors. The team also used a strategy called XOR encoding to further amp up fidelity.
“It’s a way of recovering from missing pieces of data,” says Bornholt.
“Suppose you want to remember two numbers, 5 and 9,” he explains.
With an XOR encoding, you translate those two numbers into binary strings and then compare them bit by bit. Wherever the two bits differ, you write a 1; wherever they match, you write a 0. The result is a new string of 0s and 1s, the XOR string, which in a way stores the difference between 5 and 9.
If you forget one of the original two numbers, you can recover it by XORing the number you still have with the XOR string.
“Our system stores both the original data and these differences, and uses the differences to recover missing data,” he explains. “The key advantage is density: you only need to store 50% more information to get this level of error recovery.”
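Bornholt's 5-and-9 example is easy to try out. The first half of the sketch below works through it with plain integers; the second half applies the same trick to two byte strings to show where the 50% overhead comes from. The payloads and the helper name `xor_bytes` are made up for the illustration and are not taken from the paper.

```python
a, b = 5, 9
difference = a ^ b                     # 0101 XOR 1001
print(f"{a:04b} XOR {b:04b} = {difference:04b}")

# Lose either number, and the other one plus the XOR string brings it back.
assert difference ^ a == b
assert difference ^ b == a

def xor_bytes(x: bytes, y: bytes) -> bytes:
    """XOR two equal-length byte strings position by position."""
    if len(x) != len(y):
        raise ValueError("payloads must have the same length")
    return bytes(p ^ q for p, q in zip(x, y))

payload_a = b"DNA!"
payload_b = b"data"

# Three pieces are stored for every two payloads: 50% extra.
stored = [payload_a, payload_b, xor_bytes(payload_a, payload_b)]

# Pretend payload_b was lost and rebuild it from the other two pieces.
recovered_b = xor_bytes(stored[0], stored[2])
print(recovered_b)
```

Any one of the three stored pieces can go missing and the two originals can still be rebuilt from the remaining pair.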
And it works: when the team retrieved the data from their pool of DNA, they were able to reconstruct the images without losing a single byte of information.
Random access was a little harder to implement. In a nutshell, after the team assembled their DNA database, they used a routine biochemical technique called PCR (polymerase chain reaction) to fish out the desired sequence from it.
PCR is a way to selectively amplify a piece of DNA in a solution. The selectivity comes from a set of short nucleotide sequences called primers. Primers bind to the first and last chunk of nucleotides on the target DNA strand, essentially “tagging” that DNA strand.
Only tagged DNA strands are copied during the PCR reaction, and only these amplified strands are read during the sequencing process, rather than the entire pool.
For each image, the team assigned a key that corresponds to the primers that can tag the resulting DNA strands.
Here’s an example. The team encoded an emoji of a monkey covering its face into synthetic DNA strands. The image was given the key “monkey.” The authors then noted that “monkey” corresponds to primers “m1” and “m2.” These primers can only tag the DNA strands that represent the monkey image.
These monkey emoji-encoding DNA strands are then mixed in with a bunch of other data-containing DNA molecules.
To retrieve the emoji, the researchers look up which primers “monkey” (the key) corresponds to, add them to the entire DNA pool and run a PCR reaction. Only the emoji-encoding DNA strands get amplified. After sequencing those strands, the team translates the nucleotide sequences back into bytes and, voila, they’ve retrieved the monkey emoji.
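In software terms, the lookup behaves like a key-value store in which the primers act as the key. The sketch below is a loose analogy under simplifying assumptions: real PCR amplifies the matching strands rather than filtering the pool, and the strand payloads, the “cat” entry and the names `Strand` and `pcr_select` are invented for the illustration. Only the key “monkey” and the primer labels “m1” and “m2” come from the example above.

```python
from typing import NamedTuple

class Strand(NamedTuple):
    start_primer: str   # primer site at the beginning of the strand
    payload: str        # nucleotides carrying the data (made-up values)
    end_primer: str     # primer site at the end of the strand

# Key -> primer pair. The "cat" entry is hypothetical filler.
PRIMER_TABLE = {"monkey": ("m1", "m2"), "cat": ("c1", "c2")}

def pcr_select(pool: list[Strand], key: str) -> list[Strand]:
    """Return only the strands tagged by the primers registered for key.

    Real PCR copies the matching strands until they dominate the sample;
    filtering is the simplest stand-in for that effect.
    """
    start, end = PRIMER_TABLE[key]
    return [s for s in pool if s.start_primer == start and s.end_primer == end]

pool = [
    Strand("c1", "GATTACA", "c2"),
    Strand("m1", "CTACGT", "m2"),
    Strand("c1", "TTGACA", "c2"),
    Strand("m1", "ACGTAC", "m2"),
]

monkey_strands = pcr_select(pool, "monkey")
print([s.payload for s in monkey_strands])
```

Only the selected strands would then go on to sequencing, which is what spares the system from reading the whole pool.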
The Next Big Barriers
Having demonstrated how to randomly access DNA databases, the team is moving on to the next challenge: rewritability. “For now the system is read-only, which is why we call it a DNA archival storage system,” say the authors.
But archiving itself has immense value.
“We can save information we have today for future times: take data snapshots of the ever-evolving Internet, preserve troves of historical texts, government documents, private company archives and so on,” says Grass.
A larger barrier is the speed of DNA manipulation technologies and their hefty price tag. According to Dr. Spike Narayan, the Director of Science and Technology at IBM Research, as of now it costs more than $12,000 per MB to encode DNA data and around $200 per MB to read that data back.
That said, the team is optimistic.
“DNA sequencing technology is improving even more quickly than computer speeds,” says Bornholt.
Ceze agrees. “I believe progress in that front will be rapid,” he says.
“Although it’s hard to make predictions, I would say within a decade DNA storage should be available for commercial use,” he says.