One Project, One Terabyte: How DNA Sequencers Are Straining Storage Infrastructure

This is the fifth installment in our series, The Unstructured Data Explosion, which explores how new technologies are creating file growth challenges within today’s leading businesses and industries.

When scientists first attempted to sequence the human genome, the project lasted more than ten years. The cost? Roughly $3 billion. But researchers no longer need a decade and a major federal grant to detangle a genome. Today, pharmaceutical companies, nonprofits and independent research groups can acquire their own DNA sequencing platforms and conduct these studies in house. In this post, I’ll explain how these DNA sequencers are straining traditional storage infrastructure.

Compared to the Human Genome Project, the sequencing process today is incredibly fast and relatively inexpensive. The Illumina HiSeq X Ten System, for example, can churn out 18,000 genome studies a year at $1,000 per genome. There’s a whole range of platforms with different capabilities. One of our clients, a nonprofit research and education facility, has been using Illumina’s MiSeq sequencer to analyze the DNA of thousands of different species, including sharks, fungi and Amazonian birds. The organization employs dozens of scientists to tease out the genetic blueprint for all these life forms, then compare and contrast them to gain a deeper understanding of the common biological threads between all species.

“The genetic code for a highly evolved species doesn’t exactly fit on a floppy disk. A single DNA sequencing run can churn out 1 TB of raw data.”

This is fantastic work. Absolutely. But imagine what it’s like to be the IT group that has to deal with all this data.

Massive, Nonlinear Data Growth

The genetic code for a highly evolved species doesn’t exactly fit on a floppy disk. A single DNA sequencing run can churn out 1 TB of raw data, which then needs to be stored, protected, and made available to the scientists. All of this has to be done cost effectively, too. And not just at budget-constrained nonprofits: Even large pharmaceutical firms need their IT departments to operate with utmost efficiency.

1. Unparalleled Scale: DNA sequencers churn out a tremendous amount of unstructured data. Relying on traditional hardware to store all this information is tremendously expensive and inefficient.

2. Nonlinear Growth: It’s nearly impossible to forecast storage growth accurately when one sequencing run can generate a TB of data. An organization that ties itself to traditional storage hardware could easily run out of capacity early, and be forced into a costly purchasing cycle prematurely.

3. Expensive Protection: This data is tremendously valuable – especially at pharmaceutical companies. But given the sheer volume and unpredictable growth rates, relying on backup or tape is too risky. The cost of data protection will grow out of control. IT will have to dedicate more cycles to managing protection. And in the event of a disaster, end users may have to wait days before access is restored – if it’s restored at all.

4. Proliferating Hardware: For one of our clients, space is at a premium. They don’t want storage to grow so quickly that their server rooms have to expand and cannibalize the available square footage. They need to conserve that space for other uses.

Unstructured Data Should Not Be a Burden

Naturally, these are only a few of the challenges. The larger point here is that this data is a tremendous advantage for all these groups, from drug discovery companies to research teams and nonprofits. It shouldn’t be seen as a burden. But storing, protecting and managing it with traditional hardware will only drain an organization’s resources and budget.

An innovative technology like DNA sequencing should be paired with an innovative unstructured data solution.