Designing a File System for the Cloud

Brewer’s CAP Theorem meets Eventual Consistency
In a technological world where everything seems possible, there are still immutable laws that designers violate at their own peril. In July of 2000, Eric Brewer gave the keynote speech at the ACM Symposium on Principles of Distributed Computing (PODC). To those of us working on distributed systems, it was a watershed moment: many of our suspicions and concerns about the design of really big systems were crystallized into something equivalent to the universal law of gravitation. Even before Newton defined his law, everyone experienced gravitation but couldn’t build around it. Similarly, before Brewer’s Consistency, Availability and Partition Tolerance (CAP) Theorem was articulated, we were already feeling the weight of system design trade-offs. CAP holds regardless of the technology used and frames the way we architect our systems.

Brewer’s CAP Theorem stems from his theoretical work at UC Berkeley and empirical observations he made during his time at Inktomi. The early Internet search and e-commerce sites were a breeding ground for great ideas about how to build highly scalable computer systems. It is easy to forget that there was a time when no one was sure that the search engines or an e-commerce site like Amazon would be able to keep up with the rapid growth of the Internet. Today, sites like Amazon and Google have all but solved their scalability problems.

CAP establishes that a distributed system cannot fully guarantee all three of Consistency, Availability and Partition Tolerance at the same time. For a system to scale, it has to trade one of these properties off against the other two.

Consistency in a distributed system means that any one state, or piece of data, is the same across the entire system. Availability means that the service is always responsive. Partition Tolerance means that the service can withstand having parts of it not communicating with the rest.

The largest storage systems in the world today are examples of a very specific class of the CAP compromise. The RAIN storage architecture used by cloud storage providers compromises consistency in order to deliver maximum availability and partition tolerance at scale. The result is a system that can scale to exabytes and withstand outages even when its components are distributed all around the world.

The compromise made around consistency does not really matter when the applications are mostly Web applications. Pictures in a Web catalogue do not change very often, and if a picture shows an old model, the user may not even notice. We built one of these massive storage clusters at our previous company, Archivas, which was designed to host petabyte-scale archives (the data in an archive never changes). Werner Vogels, CTO of Amazon, articulated the compromise in his article Eventually Consistent, updated in 2008, where he argues that data inconsistency is the cost you pay for reliability and for managing partition cases.
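The stale catalogue picture described above can be illustrated with a toy model. This is a minimal sketch, not how any real RAIN system or cloud store is implemented: the `Replica` class, the `reconcile` function and the site names are all hypothetical, and versioning is reduced to a simple per-key counter.

```python
class Replica:
    """One storage node in a toy model; holds key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        # Accept the write locally without waiting for other replicas.
        # This keeps the node available even when it is partitioned.
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def reconcile(a, b):
    """Background sync: for every key, both replicas keep the higher version."""
    for key in set(a.data) | set(b.data):
        newest = max(a.data.get(key, (0, None)),
                     b.data.get(key, (0, None)),
                     key=lambda entry: entry[0])
        a.data[key] = newest
        b.data[key] = newest


boston, tokyo = Replica("boston"), Replica("tokyo")
boston.write("catalogue/photo.jpg", "old model")
reconcile(boston, tokyo)                    # both sites now agree

# The sites lose contact; Boston still accepts an update.
boston.write("catalogue/photo.jpg", "new model")
stale = tokyo.read("catalogue/photo.jpg")   # Tokyo answers with the old picture

# Contact is restored and the replicas converge.
reconcile(boston, tokyo)
fresh = tokyo.read("catalogue/photo.jpg")
```

Both sites keep answering requests throughout, which is the availability and partition tolerance side of the trade. The price is the window in which Tokyo serves the old picture. Conflicting writes made on both sides of a partition are outside the scope of this sketch.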

Most business data lives in file systems: MS Office documents, PDFs, images, movies, music, etc. Unfortunately, file systems demand a highly consistent storage model. No one wants to doubt the state of a file that has been written to the file system. At Archivas we spent several years frustrated by the inability of the RAIN architecture to be consistent. It meant that all of the data that was actually in use and still changing had to remain out of bounds for our highly scalable and reliable RAIN system. It is important to point out that this was not a result of bad engineering: all of the RAIN architectures share this compromise. The CAP theorem categorically rules out solving the problem in a single system. If you want more consistency, you had better be prepared to forgo some availability or partition tolerance. Gravity and the US satellite Vanguard TV3: despite all good intentions, inherently a failure.

But what if one were to design two loosely coupled systems, each optimized for its own CAP trade-off, instead of one system trying to defy a universally known truth? One of the key insights during the early design days at Nasuni was that a series of file system snapshots could provide the connection between a highly consistent file system and the RAIN architectures. A single Nasuni Filer is meant to be used by one business and can be optimized for consistency because it does not need to withstand the availability or partition challenges that a service like Amazon S3 faces.
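One way to picture the snapshot coupling is sketched below. This is an illustration under stated assumptions, not a description of how the Nasuni Filer actually works: `ObjectStore`, `Filer`, `restore` and the snapshot key format are hypothetical names, and the object store is a plain in-memory dictionary. The assumption made here is that each snapshot is written once under a unique key and never modified, so the back end never has to agree on the latest version of a changing object.

```python
import hashlib
import json


class ObjectStore:
    """In-memory stand-in for a cloud object store (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, blob):
        # Write-once: an existing key is never overwritten.
        self._objects.setdefault(key, blob)

    def get(self, key):
        return self._objects.get(key)


class Filer:
    """Single-node front end: every read sees the latest local write."""

    def __init__(self, store):
        self.store = store
        self.files = {}       # path -> text content (live state)
        self.snapshots = []   # snapshot keys, oldest first

    def write(self, path, content):
        self.files[path] = content

    def read(self, path):
        return self.files[path]

    def snapshot(self):
        # Freeze the current state into an immutable manifest.
        manifest = json.dumps(self.files, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(manifest).hexdigest()[:12]
        key = f"snap-{len(self.snapshots) + 1:06d}-{digest}"
        self.store.put(key, manifest)
        self.snapshots.append(key)
        return key


def restore(store, key):
    """Rebuild the file tree recorded in one snapshot."""
    return json.loads(store.get(key).decode("utf-8"))


store = ObjectStore()
filer = Filer(store)
filer.write("/docs/plan.txt", "draft 1")
first = filer.snapshot()
filer.write("/docs/plan.txt", "draft 2")
second = filer.snapshot()
```

In this sketch the live file state is owned by a single node, so users always read their latest write. The object store only ever receives finished, never-changing snapshots, so a delay in one becoming visible postpones when it can be restored but cannot return a half-updated file.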

Two loosely coupled systems deliver the tremendous reliability of RAIN together with the consistency (and simplicity) of a file system. Gravity understood: Sputnik!

Scale in today’s systems comes from optimizing the trade-offs of individual systems and then coupling those optimized systems. Loosely coupling them allows us to deliver availability, consistency and reliability together in a single elegant design.