Data in the Cloud: Ebbs and Flows - Part 3 | Nasuni

Data in the Cloud: Ebbs and Flows – Part 3

Part 3: CSP Performance Measurements

Part 1 of the series looked at the importance of performance in trying to avoid vendor lock-in. Part 2 established the criteria that will be used to measure the performance of different bulk migration scenarios within and across Cloud Storage Providers (CSPs). In Part 3 we run through the different scenarios and look at the performance of Amazon S3, Microsoft Azure and Rackspace Cloud Files.

Case 1: S3 to S3 Migration
Amazon has internal functions like “Copy Bucket” that move data within Amazon S3, either to the same account or to other accounts, and they are presumably faster than anything we could do with external hosts. We still used our own tools to run this test so we could make the later comparisons. As you’d expect, moving data around within Amazon scales well, although you can still see the per-host operations per second dropping off as you scale up the workload, implying that there is a limit to how fast it would go. Since we were limited to only 40 machines for this test, we couldn’t push Amazon S3 to its limit.

S3 to S3 Migration

The great thing about the chart above is how linear the performance is as load is added. While per-host performance understandably degrades as you add hosts, it does so predictably, which is exactly what you want from a storage system.
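One way to put a number on "how linear" a scaling curve is would be to divide the aggregate rate by the host count and compare each per-host rate with the rate at the smallest host count. The sketch below does that; the function name `scaling_report` and the sample figures are assumptions made for illustration and are not the measurements behind the chart.

```python
def scaling_report(aggregate_ops_per_sec):
    """Return (hosts, per_host_rate, efficiency) rows from {hosts: aggregate ops/s}.

    Efficiency compares each per-host rate with the per-host rate at the
    smallest host count; 1.0 means perfectly linear scaling.
    """
    hosts_sorted = sorted(aggregate_ops_per_sec)
    base_hosts = hosts_sorted[0]
    base_rate = aggregate_ops_per_sec[base_hosts] / base_hosts
    rows = []
    for hosts in hosts_sorted:
        per_host = aggregate_ops_per_sec[hosts] / hosts
        rows.append((hosts, per_host, per_host / base_rate))
    return rows


# Hypothetical numbers for illustration only; NOT the data from the chart.
example = {1: 100.0, 10: 900.0, 40: 3200.0}
for hosts, per_host, efficiency in scaling_report(example):
    print(f"{hosts:>3} hosts: {per_host:6.1f} ops/s per host, efficiency {efficiency:.2f}")
```

An efficiency that falls slowly and steadily as hosts are added matches the predictable degradation described above; a sudden drop would point to a hard limit instead.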

Case 2: Microsoft Azure to Amazon S3 migration
In this next case, we look at how fast you can load data into Amazon S3 from the outside. As a general rule, writes are slower than reads on most storage systems. In addition, external bandwidth is much more limited than internal bandwidth.

Microsoft Azure to Amazon S3 migration

In the chart above you can see that the overall performance is lower than before and that we appear to be approaching the maximum sooner, but we were still unable to saturate the systems with the 40 hosts. Note that the limit we were approaching could have been either Amazon’s write limit (bandwidth or technology) or Microsoft’s read limit (again, bandwidth or technology).

At its peak in this test, Amazon S3 was receiving over 270MB/s, and we could have moved the entire 12TB test data set from Azure to S3 in less than 4 hours.

This demonstrated that Amazon S3 had tremendous write performance and inbound bandwidth, and also that Microsoft Azure could provide the data fast enough to support the movement.

Now let’s look at what happens when we go the other way.

Case 3: Amazon S3 to Microsoft Azure migration
Here we get some very different results. At the lower host counts Azure’s performance and scalability were pretty impressive. They did, however, hit a wall when we got to about 25 MB/s:

Amazon S3 to Microsoft Azure migration

So by the time we had 2 machines doing the data migration, we had hit the limit of Microsoft Azure’s ability to accept data. It’s hard to determine from the outside whether the limit is due to their incoming network or to a technology limitation. Bandwidth at larger providers is usually symmetric (the same upstream and downstream Internet connection speeds), so the limitation may be in their architecture. Note that in the previous cases with Amazon S3 we were starting to see a decline in performance at the really high host counts, so there is also a limit for Amazon S3. The difference is that we could saturate Microsoft Azure with 2 hosts writing to it, while we couldn’t saturate Amazon S3 with 40 hosts.

At its peak in this test, Microsoft Azure was receiving over 30MB/s, and we could have moved the entire 12TB test data set from S3 to Azure in about 40 hours. A week to move 12TB isn’t bad, but it’s 10 times slower than moving into Amazon S3 or within S3.

We also noticed that as we got to the higher machine counts, we started to see climbing error rates from Azure. While you’d think that 1,000 operations per second shouldn’t be anything a cloud should be concerned about, perhaps there are challenges at the account or container level in their architecture.

The other thing that was very noticeable on Microsoft Azure is that the performance could vary significantly from run to run and appeared to be very dependent on the time of day when the test was being run:

Time of Run

For the chart above, all runs used the same data set, the same number of machines (4), the same Amazon S3 bucket as the source, and the same Microsoft Azure container as the destination (with the container emptied out before each run). As you can see, there’s a high degree of variability in the results, with performance ranging from 30MB/s to 190MB/s. We generally saw the lower numbers during normal business hours. Unfortunately, the CSPs are not under our control, and they’re not very forthcoming about why their performance would vary so greatly. We didn’t experience the same behavior with Amazon S3, and this measurement probably further indicates limitations in Azure’s architecture or bandwidth, as other customers using the system appear to be affecting our results to a large degree.
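Run-to-run variability like this can be summarized with a few simple statistics over repeated runs of the same test. The sketch below is only an illustration: the function name `variability` and the list of rates are assumptions, and only the 30MB/s and 190MB/s endpoints are taken from the text.

```python
import statistics


def variability(rates_mb_s):
    """Summarize the spread of throughput across repeated runs of one test."""
    low, high = min(rates_mb_s), max(rates_mb_s)
    mean = statistics.mean(rates_mb_s)
    return {
        "min": low,
        "max": high,
        "max_over_min": high / low,  # how many times faster the best run was
        "cv": statistics.stdev(rates_mb_s) / mean,  # coefficient of variation
    }


# Hypothetical per-run rates in MB/s; only the two endpoints come from the text.
runs = [30, 60, 110, 150, 190]
summary = variability(runs)
print(summary)
```

A provider with a hard cap would show a max-over-min ratio close to 1 and a coefficient of variation near 0, while results that swing with the time of day push both numbers up.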

Case 4: Amazon S3 to Rackspace Cloud Files migration
Running the same test to Rackspace that we ran to Azure produced results similar to the S3 to Azure migration, but with lower overall performance:

Amazon S3 to Rackspace Cloud Files migration

Like Azure, Rackspace has a limit that is easily hit with our test. The surprising thing is that their limit is almost exactly 10MB/s, which is easily hit with only two machines. The near-perfect number and the lack of variability in the results (unlike Azure) seem to imply that they’ve got some kind of restriction applied at an account, container or other level.

Note that it’s in the provider’s interest to give you plenty of write performance. None of these providers charge you for data transfers into their service; they all figure they’ll get you with the storage, access, or data transfer out charges later. So Azure evidently makes whatever bandwidth is available accessible to users, while Rackspace for some inexplicable reason sets a hard cap on the number.

Regardless of where the restriction is, you can see that Rackspace Cloud Files is in a different class than Azure. This also makes you wonder about all the other clouds that are springing up based on OpenStack: it’s hard to imagine they’d perform better than Rackspace’s Cloud Files, as that’s the premier reference implementation.

Of course, some of these restrictions could be artificial, just “settings” applied at an account level, and the results could change if so, but these are default standard accounts from each of these providers. Rackspace could also scale better if the workload were spread out over multiple accounts or containers (as could Azure). But the providers do not provide guidance on where their bottlenecks are or how to avoid them.

Given this interesting result, the next thing to check is the read performance of Rackspace. Do they limit read performance too?

Case 5: Rackspace Cloud Files to Amazon S3 migration
Similar to the test from Azure to Amazon, we moved the same data set from Rackspace to Amazon and measured the results:

Rackspace Cloud Files to Amazon S3 migration

The chart above shows that Rackspace, like Microsoft, has read performance that appears to keep up with the Amazon S3 ingest performance. As in the Microsoft case, the performance appears to plateau, and it’s unclear whether we’re reaching Rackspace’s read performance limits or Amazon S3’s write performance limits. It’s interesting that the number is pretty close to the one from Azure to S3, which may imply that the limit is on Amazon’s side.

While we’re on the topic of Rackspace, another byproduct of running these tests was having to delete data from the different providers. Here’s an interesting table:

Delete Data Chart

For each provider the test was simple: fetch (query) the list of files to delete from the test set, then delete the list from the provider using 50 concurrent operations (threads) from a single EC2 host. The difference in performance in this area is astounding. Cleaning up your data on Rackspace takes an order of magnitude longer than it does on other providers and about as much time as it takes to get it there in the first place.

In the next and final part of our series we look at what conclusions can be drawn from the testing that was conducted.
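The shape of the delete test described above (list the objects, then delete them with 50 concurrent threads) could be sketched roughly as follows. This is not the tooling used for the measurements: `delete_all` and `delete_one` are hypothetical names, and an in-memory dictionary stands in for a real provider container so the example runs without any cloud account.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def delete_all(keys, delete_one, workers=50):
    """Delete every key using a pool of worker threads.

    `delete_one` is a hypothetical callable that removes a single object
    and raises an exception on failure. Returns (deleted_count, failures).
    """
    deleted = 0
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its key so failures can be reported.
        futures = {pool.submit(delete_one, key): key for key in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                future.result()
                deleted += 1
            except Exception as exc:
                failures.append((key, exc))
    return deleted, failures


# Stand-in for a provider: an in-memory dict instead of a real container.
store = {f"file-{i}": b"" for i in range(1000)}
keys = list(store)                               # step 1: fetch the list
deleted, failures = delete_all(keys, store.pop)  # step 2: delete the list
print(deleted, len(failures), len(store))
```

Against a real provider, `delete_one` would wrap that provider's delete call, and timing the whole `delete_all` invocation would give the kind of per-provider figure compared in the table.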
