An Ancient Alphabet for the Digital Age: Encoding digital information in DNA

The 21st-century library is a complicated beast: constantly growing, with too many books on the brink of decay. While digital storage media such as CDs, DVDs, hard drives, and servers have allowed archivists to store huge amounts of information, these formats are not reliable in the long term: discs get scratched, hard drives burn out, and old formats become obsolete. Although archivists work to move files to fresh storage devices and to update their formats in keeping with the progress of computer technology, even the most fastidious data maintenance cannot keep up with the extraordinary volume of digital information that exists today. To solve this problem, ETH Zurich biochemist Robert Grass and his team are working to transcribe the information of the digital age in science’s most ancient alphabet: DNA.

Perhaps the most cosmopolitan macromolecule, DNA, or deoxyribonucleic acid, keeps a bed in laboratories all over the map of biology. Its name is gold to forensic scientists, for whom DNA evidence can indicate paternity or presence at the scene of a crime. Its aid is key to engineers of transgenic organisms. It serves in hospitals as a predictor of heritable disease and in evolutionary research labs as a sort of rubric to the tree of life.

With so many applications in the biosciences, it seems almost bizarre that DNA would become a key player in digital informatics. But Grass’s choice to employ DNA as a digital library is actually inspired by the molecule’s biological role. According to Grass, the sturdy, compact structure that makes DNA a reliable catalogue of an organism’s genetic information also makes it well-suited for the long-term storage of large volumes of digital information. How long-term? When Grass and his colleagues performed accelerated aging experiments on 83 kilobytes of digital information encoded in silica-encased DNA, they found that the data could survive storage for at least 2,000 years, error free.

A new code to crack

To encode his digital records in DNA, Grass began with a code derived from one pioneered by a team of scientists at the European Bioinformatics Institute (EMBL-EBI) in England. As this team had done, Grass encoded his data as a series of nucleotide bases, the same molecular units in which genetic information is stored naturally. Within an organism, these bases (adenine, thymine, cytosine and guanine, or A, T, C and G for short) are organized into three-letter units called “codons,” most of which code for a specific amino acid. At EMBL-EBI, scientists repurposed these bases to code for digital information in synthetic strands of DNA.

According to Ewan Birney, Associate Director of EMBL-EBI, his team faced two main challenges in adapting the genetic code to non-biological informatics. First, the technology of the time limited artificial DNA synthesis to short fragments. Second, the sequencing technology used to read DNA was prone to error when the same nucleotide base was repeated several times in a row. To address these problems, the team wrote the code so that no base appears twice in succession, then broke it into many overlapping fragments. These fragments were short enough to be produced synthetically, and together they could store as much information as a much longer strand of DNA. Computer analysis served as the ultimate jigsaw puzzler, recognizing instances of overlap to order the fragments and make sense of the overall code.
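The two ideas in this paragraph, a code that never repeats a base and a set of short overlapping fragments, can be illustrated with a small Python sketch. It is a simplified illustration, not the EMBL-EBI team's actual pipeline: the fixed six base-3 digits per byte, the fragment length, the step size, and all function names are assumptions made for the example, and each fragment carries an explicit offset instead of being ordered from its overlaps alone.

```python
BASES = "ACGT"

def bytes_to_trits(data: bytes) -> list[int]:
    """Write each byte as 6 base-3 digits (3**6 = 729 covers all 256 values)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(6):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))
    return trits

def trits_to_dna(trits, start="A"):
    """Each digit selects one of the 3 bases that differ from the previous
    base, so the same base never appears twice in a row."""
    prev, out = start, []
    for t in trits:
        prev = [b for b in BASES if b != prev][t]
        out.append(prev)
    return "".join(out)

def dna_to_bytes(dna, start="A"):
    """Invert trits_to_dna, then pack every 6 digits back into one byte."""
    prev, trits = start, []
    for base in dna:
        trits.append([b for b in BASES if b != prev].index(base))
        prev = base
    out = bytearray()
    for i in range(0, len(trits), 6):
        value = 0
        for t in trits[i:i + 6]:
            value = value * 3 + t
        out.append(value)
    return bytes(out)

def fragment(dna, length=20, step=5):
    """Cut the strand into overlapping (offset, piece) pairs.
    length and step are illustrative values, not the published ones."""
    last = max(len(dna) - length, 0)
    starts = list(range(0, last, step)) + [last]
    return [(s, dna[s:s + length]) for s in starts]

def reassemble(fragments):
    """Rebuild the strand from its pieces, checking that overlaps agree."""
    total = max(s + len(piece) for s, piece in fragments)
    merged = [None] * total
    for s, piece in fragments:
        for i, base in enumerate(piece):
            if merged[s + i] not in (None, base):
                raise ValueError("overlapping fragments disagree")
            merged[s + i] = base
    return "".join(merged)

message = b"To be, or not to be"
dna = trits_to_dna(bytes_to_trits(message))
pieces = fragment(dna)
assert all(a != b for a, b in zip(dna, dna[1:]))  # no repeated bases
assert dna_to_bytes(reassemble(pieces)) == message
```

Because neighboring fragments cover the same stretch of the strand, most bases are written several times, which gives the decoder a cross-check when one fragment is lost or misread.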

Ewan Birney, Associate Director of EMBL-EBI, was part of the team that pioneered the use of DNA as a data storage device. When Grass was experimenting with methods to increase the shelf life of a DNA databank, he drew on Birney’s technique for encoding digital information in his silica-coated DNA test molecules. Image courtesy of Royal Society uploader, Wikimedia Commons.

The code worked. With their take on the genomic alphabet, the researchers were able to synthesize DNA fragments encoding an MP3 file of the “I Have a Dream” speech. They created DNA to encode a JPEG photo of EMBL-EBI and a PDF file of Watson and Crick’s treatise on the structure of DNA. They even recorded all of Shakespeare’s sonnets, originally a plain-text (.txt) file, as synthesized DNA. Moreover, the team was able to decode this information back into text, visuals, and sound.

For Grass, the EMBL-EBI team’s discovery was a launching pad for further developments in DNA informatics. Inspired by the evidence that data could be stored and reread in the form of DNA, Grass set out to improve the lifespan of the medium. While EMBL-EBI’s results were sufficiently error free, Grass knew that DNA stored for longer periods of time would degrade via chemical reactions with the environment. If DNA was to become a viable way of storing information, Grass needed to find some way of halting that chemical degradation.

Ancient inspiration

To answer his Information Age question, Grass looked to the ancients, or more specifically, to their bones. Realizing that genetic material preserved in fossilized bones was intact and readable even hundreds of thousands of years after an organism’s death, Grass inferred that synthetic DNA might last longer if preserved in its own fossil shell. His idea was a success. By encasing DNA in microscopic spheres of silica glass, Grass and his team were able to prevent information loss due to chemical degradation. In fact, silica-encased DNA containing Switzerland’s Federal Charter and Archimedes’ The Method of Mechanical Theorems withstood a month of storage at more than 60 degrees Celsius, the equivalent of hundreds of years of heat-weathering. DNA stored without a shell, either on impregnated filter paper or in a biopolymer, showed no such resolve when subjected to the same weathering process.

In an ideal archive situation, DNA logs would not be stored at the sweltering temperatures of Grass’s weathering test. Instead, they would be maintained at low temperatures. One possibility would be to store DNA in the Svalbard Global Seed Vault, where seeds of essential crops are preserved at negative 18 degrees Celsius. In the Seed Vault or in some similar location, info-coded DNA would be kept chilled to maximize its shelf life. With a combination of these methods — encapsulation in silica and low temperatures — Grass predicts that synthesized, information-carrying DNA could last for more than a million years.
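The article does not say how a month in the oven is converted into centuries on the shelf. A common way to extrapolate accelerated aging tests is the Arrhenius equation, in which reaction rates rise exponentially with temperature. The sketch below assumes that model; the activation energy is a hypothetical placeholder chosen for illustration, not a value reported by Grass’s team, and the function names are invented for this example.

```python
import math

GAS_CONSTANT = 8.314  # J / (mol * K)

def acceleration_factor(activation_energy, test_celsius, storage_celsius):
    """How many times faster decay runs at the test temperature than at the
    storage temperature, assuming Arrhenius behavior."""
    t_test = test_celsius + 273.15
    t_store = storage_celsius + 273.15
    return math.exp(activation_energy / GAS_CONSTANT * (1 / t_store - 1 / t_test))

def equivalent_years(test_days, activation_energy, test_celsius, storage_celsius):
    """Storage time that would cause the same decay as the heated test."""
    factor = acceleration_factor(activation_energy, test_celsius, storage_celsius)
    return test_days / 365.25 * factor

# Hypothetical activation energy in J/mol, for illustration only.
ASSUMED_ACTIVATION_ENERGY = 150_000

for storage in (20, 10, -18):
    years = equivalent_years(30, ASSUMED_ACTIVATION_ENERGY, 60, storage)
    print(f"30 days at 60 C ~ {years:,.0f} years at {storage} C (under the assumed model)")
```

Whatever the true activation energy, the shape of the result is the same: each drop in storage temperature multiplies the projected lifetime, which is why cold storage matters as much as the silica shell.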

The Svalbard Global Seed Vault in Norway is a secure location containing seeds for more than 300 plant species, many of which are crops. The vault is kept at negative 18 degrees Celsius to preserve the genetic information stored in the seeds. Grass predicts that if stored under conditions similar to those of the Svalbard vault, digitally encoded DNA could last error free for a million years. Image courtesy of Dag Endresen and Wikimedia Commons.

Eternal accuracy

Grass’s silica shell is not only incredibly protective, but is also easy to remove. When the time comes to read the DNA, all it takes is a fluoride bath to wash away the glass coating. DNA sequencing technology, which is relatively quick and affordable today, can then be used to decode the stored data.

To guard against misreading by the sequencing technology itself, Reinhard Heckel, a scientist in ETH Zurich’s Communication Technology Laboratory, has devised an algorithm to correct errors in the code. Heckel modeled his technique on Reed-Solomon codes, which are widely used to detect and correct errors in transmitted and stored data. Their basic premise is that by attaching carefully computed redundant information to the actual data, scientists can offset the loss or corruption of part of it. This means that even if the DNA was stored under harsh conditions or misread by a sequencing machine, its data could be retrieved error free.
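To make the idea of redundancy concrete, here is a minimal Python sketch of a Reed-Solomon-style code. It is not Heckel’s algorithm, whose details the article does not give. The sketch treats the data bytes as points on a polynomial over the integers modulo 257, appends extra points on the same curve, and rebuilds the data from any sufficiently large set of surviving symbols. It handles only erasures (symbols known to be missing), not silent errors, and the field size and function names are assumptions made for the example.

```python
P = 257  # prime modulus, just large enough to hold every byte value

def _interpolate(points, x):
    """Evaluate at x the unique polynomial through `points`
    (Lagrange interpolation, arithmetic modulo P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def rs_encode(data: bytes, n_parity: int) -> list[int]:
    """Return the data symbols followed by n_parity redundant symbols.
    Positions 0..k-1 hold the data; later positions are further points
    on the same polynomial. Requires len(data) + n_parity <= P."""
    points = list(enumerate(data))
    k = len(data)
    parity = [_interpolate(points, x) for x in range(k, k + n_parity)]
    return list(data) + parity

def rs_recover(received: dict[int, int], k: int) -> bytes:
    """Rebuild the k data bytes from any k surviving {position: symbol} pairs."""
    if len(received) < k:
        raise ValueError("too many symbols lost to recover the data")
    points = list(received.items())[:k]
    return bytes(_interpolate(points, x) for x in range(k))

message = b"GATTACA"
codeword = rs_encode(message, n_parity=3)

# Lose any three symbols; the seven survivors still pin down the message.
survivors = {i: s for i, s in enumerate(codeword) if i not in (1, 4, 6)}
assert rs_recover(survivors, k=len(message)) == message
```

The redundant symbols are not copies of the data. Each one depends on every data byte, which is why any three losses can be repaired with only three extra symbols.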

The prospect suggested by Grass’s research is enticing: to record the thoughts and innovations of our generation for millennia to come, to make our memories, as Shakespeare said, incorruptible by “sluttish time.” But the rise of DNA as a storage medium does not come without challenges. First, how could we decide which information merits immortality? Who should be granted the honor of curating the collection? And then, who would maintain it?

We might envision a new sort of librarian trained to navigate vast archives of encoded information. To the first such librarian would fall the peculiar responsibility of deciding how to stack microscopic books of DNA. Librarians of future generations would face an altogether different burden. They would have to keep custody of materials millions of years outside the context of their own authorship — and millions of years after the life of Robert Grass, the scientist whose work would have made it all possible.

Cover Image: Current digital storage methods, such as the servers in the Server Room at the National Archives of the United Kingdom, are not reliable for the long term. DNA molecules offer the promise of a much longer shelf life for digital information. Image courtesy of the National Archives (UK) and Wikimedia Commons.