What Happens When the Cloud Goes Down?
By now, almost everyone has heard about the widespread outage Amazon Web Services (AWS) suffered in the past few weeks. The outage specifically affected the Amazon Elastic Block Store (EBS) service, the services built on top of it, and the companies relying on those services.
EBS is an interesting animal in the scheme of cloud services - we’ve spoken before on the merits of block-level abstractions versus file-level abstractions and how we feel about them when it comes to the cloud. EBS is obviously block-level - it’s a persistent block device you can share among EC2 instances inside of AWS. We continue to believe (as do others) that blocks are a fundamentally poor approach to cloud storage in general, and especially bad in the case of gateway devices, for the reasons we’ve outlined previously.
Regardless of our opinion of blocks in the cloud, Amazon’s downtime has prompted questions from our customers about what happens if their cloud provider (such as Amazon, Rackspace, or Nirvanix) goes down: how does it affect the Nasuni Filer, and what can be done to mitigate such a situation?
From a Nasuni (and Nasuni Filer) perspective, this downtime happily did not affect our customers or us at all. When configured for Amazon, the Nasuni Filer’s use of the cloud is tied solely to the Amazon Simple Storage Service (S3). We do not use EBS, nor do we use EC2 for our operations.
Since its inception, Amazon’s S3 service has been rock solid. It has excellent performance characteristics, downtime is vanishingly rare, and it scales predictably. From our historical records, S3 appears to be nothing but a solid, reliable service on which to store your data.
Amazon’s stated SLA (service level agreement) for S3 uptime is 99.9%, but the stated durability of objects stored within S3 (using the standard redundancy) is 99.999999999% - an amazingly high “number of 9s”. This highlights the distinction between uptime and durability. Durability, and the risk of data loss, has been and continues to be Nasuni’s primary focus, and Amazon’s durability guarantees are astounding.
Uptime, when discussing the cloud and its services, is a little hairier. The Nasuni Filer is unique here: while Amazon guarantees only 99.9% uptime, meaning S3 can go down, the Filer’s intelligent caching behavior means you can continue to edit your in-cache working set, write new data, and so on, even if Amazon (or any other cloud storage provider - CSP) goes down.
The Filer is an intelligent cache - we keep active data (your “working set”) in the local cache, so even if an outage on the CSP side occurs, you can continue to add new data, edit data within the cache, etc. We buffer and protect you and your business as much as we can from any sort of network or cloud storage outage.
When the CSP finally comes back online, or the network issue is resolved, the Filer intelligently pushes the changed data via its normal snapshot mechanism to the cloud, once again ensuring your data is fully protected.
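The behavior described above is a classic write-back cache: writes land locally and are flushed to the cloud later, so an outage only delays the flush. A minimal sketch of the idea, using hypothetical class and method names (`WriteBackCache`, `snapshot`, `upload` are illustrative, not Nasuni’s actual API):

```python
class FlakyCSP:
    """Stand-in cloud storage provider that can be taken down and up."""
    def __init__(self):
        self.up = False      # start in an outage
        self.store = {}
    def upload(self, key, data):
        if not self.up:
            raise ConnectionError("CSP outage")
        self.store[key] = data

class WriteBackCache:
    """Sketch of a write-back cache: writes always succeed locally,
    and dirty data is pushed to the CSP on each snapshot attempt."""
    def __init__(self, csp):
        self.csp = csp
        self.cache = {}      # key -> data (the local working set)
        self.dirty = set()   # keys changed since the last successful push
    def write(self, key, data):
        # Writes land in the local cache first, so they succeed
        # even while the CSP is unreachable.
        self.cache[key] = data
        self.dirty.add(key)
    def snapshot(self):
        # Push changed data to the cloud; if the CSP is down, the
        # data simply stays dirty until the next snapshot attempt.
        for key in list(self.dirty):
            try:
                self.csp.upload(key, self.cache[key])
                self.dirty.discard(key)
            except ConnectionError:
                break        # CSP unreachable; retry next time
        return not self.dirty  # True once all data is protected again

csp = FlakyCSP()
filer = WriteBackCache(csp)
filer.write("report.doc", b"v1")   # succeeds despite the outage
assert filer.snapshot() is False   # push fails, data stays local
csp.up = True                      # provider comes back online
assert filer.snapshot() is True    # dirty data flushed to the cloud
assert csp.store["report.doc"] == b"v1"
```

The key design point is that the client-facing write path never depends on the network: only the background snapshot does.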
There is a wrinkle in this, though - you may be asking yourself about objects that were not in the cache at the time of the outage, or what happens if you need to restore old data from a previous snapshot.
In these cases the Nasuni Filer performs a best-effort pull from the cloud: retrying, reporting, raising clear errors to clients, and pushing back on clients requesting the data should the operation begin to fail. If the cloud storage provider is unavailable or down, we cannot pull data that is not in the cache.
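A best-effort pull of this sort is commonly built as retry-with-backoff that eventually surfaces a clear error to the requesting client. A small illustrative sketch (the function name, retry counts, and delays are assumptions, not the Filer’s actual parameters):

```python
import time

def best_effort_pull(fetch, key, retries=3, base_delay=0.01):
    """Try to fetch an uncached object from the cloud, backing off
    between attempts; raise a clear error if the provider stays down."""
    for attempt in range(retries):
        try:
            return fetch(key)
        except ConnectionError:
            if attempt == retries - 1:
                # Out of retries: push back on the client with a
                # descriptive error instead of hanging forever.
                raise IOError(f"cloud provider unavailable; cannot fetch {key!r}")
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

# Simulated provider that recovers after two failed attempts.
calls = {"n": 0}
def flaky_fetch(key):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("CSP outage")
    return b"old snapshot data"

assert best_effort_pull(flaky_fetch, "2010/q4/report") == b"old snapshot data"
```

If the provider never recovers within the retry budget, the client gets an explicit error rather than silently missing data.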
A secondary concern is what happens when, during an outage, your cache fills with new/changed data. Again, this is where our intelligent caching comes into play. We always prefer new/changed data over old, unchanged data: unchanged data is evicted from the cache as new data comes in, until all unchanged data has been removed.
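That eviction policy amounts to “evict clean data first, never evict dirty data.” A minimal sketch, with hypothetical names (`Cache`, `put`, the `dirty` flag) chosen for illustration:

```python
from collections import OrderedDict

class Cache:
    """Eviction sketch: when full, evict the oldest *clean* entry
    (already safe in the cloud); keep dirty (new/changed) data
    until it can be pushed."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> (data, is_dirty), oldest first

    def put(self, key, data, dirty):
        if key in self.entries:
            self.entries.pop(key)
        elif len(self.entries) >= self.capacity:
            self._evict_one()
        self.entries[key] = (data, dirty)

    def _evict_one(self):
        # Walk oldest-to-newest and drop the first clean entry; a
        # clean entry is safe to discard because a copy exists in
        # the cloud. Only dirty data must stay resident.
        for key, (_, dirty) in self.entries.items():
            if not dirty:
                del self.entries[key]
                return
        raise RuntimeError("cache full of dirty data; cannot evict safely")

cache = Cache(capacity=2)
cache.put("a.txt", b"old", dirty=False)  # already protected in the cloud
cache.put("b.txt", b"new", dirty=True)   # written during the outage
cache.put("c.txt", b"new2", dirty=True)  # forces an eviction
assert "a.txt" not in cache.entries      # the clean entry was evicted
assert "b.txt" in cache.entries and "c.txt" in cache.entries
```

Clean data is cheap to evict because a copy already exists in the cloud; dirty data is the only copy, so it must stay until the provider comes back.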
The Filer is resilient and intelligent about how it handles cloud storage outages, provider outages, network outages, and the like, and we have always been up front about this. It is designed to withstand these failures as much as possible, in a fashion that keeps you and your business running despite them.
The Filer’s object/file-level approach is critical to performing this well in the face of failure. Cache management is easier: we do not have to bring back or keep massive numbers of blocks from the cloud, we can determine the “hot” files (your working set) and keep them resident in the cache, and end users and applications (all of whom work at the file level) can continue working and functioning.
Amazon’s outage has raised a lot of questions - but like many other companies, Nasuni views this as less a question about the viability of the cloud, and more of a question about architecture and systems design.
We have been asked about, and have considered, letting users select a “mirroring” approach to data protection - meaning that instead of having a single provider on a given volume within the Filer, you could select two, and data would be mirrored between them. Another approach we’ve been asked about is the “redundant array of inexpensive clouds” - essentially building a RAID device across cloud storage providers.
In both cases there is a fundamental flaw: they assume equality between the providers in terms of performance, uptime, and, more importantly, cost. We’ve previously discussed the true cost of cloud storage (with a webinar here), and while we would never rule out adding mirroring of objects between clouds, we have to keep a close eye on the costs involved, both for us as a company and for our users. As it stands today, you can easily have multiple volumes on a given Filer pointed at different cloud storage providers, and if you choose to, you can write the same data to both volumes, or mirror them using the built-in data migration service.
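The two-volume workaround described above is conceptually just a fan-out write. A toy sketch of the idea (the `Volume` class and provider names are hypothetical, not the Filer’s actual interface):

```python
class Volume:
    """Stand-in for a Filer volume bound to one cloud storage provider."""
    def __init__(self, provider_name):
        self.provider = provider_name
        self.objects = {}
    def write(self, key, data):
        self.objects[key] = data

def mirrored_write(volumes, key, data):
    # Write the same object to every volume; with each volume pointed
    # at a different provider, the data survives any single provider
    # outage - at the cost of paying each provider for the same bytes.
    for vol in volumes:
        vol.write(key, data)

primary = Volume("provider-a")      # e.g. S3
secondary = Volume("provider-b")    # e.g. a second CSP
mirrored_write([primary, secondary], "contract.pdf", b"signed copy")
assert primary.objects["contract.pdf"] == secondary.objects["contract.pdf"]
```

The sketch also makes the cost objection concrete: every write happens N times, so storage and transfer charges scale with the number of mirrored providers.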
We’re confident in our architecture and in our ability to continue serving customers in the face of outages. A “cloud-based” architecture does not remove the need to make sound architectural decisions and to have proper processes in place for failure.
The following posts have more information about the Amazon outage and how individual companies dealt with it, both from a customer service and an architectural standpoint. They are good reading, and fundamentally the “cloud” and the services built on top of it are going to get better because of this outage and the lessons learned.
- Lessons Netflix Learned from the AWS Outage
- Understanding and Using Amazon EBS
- ZDNet: Seven Lessons to Learn from Amazon’s Outage
- Rightscale: Summary and Lessons Learned
- SimpleGeo: How SimpleGeo stayed up during the AWS Downtime
- Heroku: Widespread Application Outage Post-Mortem
- The Updated Big List Of Articles On The Amazon Outage