By Rob Mason on March 27, 2012
In part 1 of the series we looked at why bulk data migration in the cloud matters and at some of the criteria that need to be considered when executing one. A customer could have a number of reasons to migrate: perhaps they’re merging accounts after being acquired, perhaps they’re giving their data to another company, or perhaps the company has decided (or been informed) that one CSP is better than another. The cases basically come down to two situations: moving data within a provider, or moving data between providers.
We set up a series of tests to look at these different cases:
1. Move data from one S3 account to another S3 account.
2. Move data from Microsoft Azure to Amazon S3.
3. Move data from Amazon S3 to Microsoft Azure.
4. Move data from Amazon S3 to Rackspace Cloud Files.
5. Move data from Rackspace Cloud Files to Amazon S3.
To test these migrations, we needed a data set, a set of tools, and compute resources to run the tests.
We used an existing sample data set of roughly 12TB, consisting of about 22 million files of mixed sizes with an average file size of about 550KB. The files are all encrypted and compressed, so moving them around poses no security threat. The data set lives in a bucket in an Amazon S3 account in the “US Standard” region. Ideally we would have used the entire data set for all these measurements, but since both time and money are limited, we chose to use the first 5% of the data set. That accounts for about one million files and about 200GB of data (the data set is not homogeneous).
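As a quick sanity check on those figures, the averages can be worked out directly. This is only a back-of-the-envelope sketch: it assumes decimal units (1TB = 10^12 bytes, 1KB = 1,000 bytes) and takes the rounded totals quoted above at face value; the helper name is made up for illustration.

```python
# Back-of-the-envelope check of the quoted averages.
# Assumption: decimal units (1 KB = 1,000 bytes); totals are the rounded
# figures from the text, not exact measurements.
def average_file_size_kb(total_bytes: float, file_count: int) -> float:
    """Return the mean file size in KB for a data set."""
    return total_bytes / file_count / 1_000

TB = 10**12
GB = 10**9

full = average_file_size_kb(12 * TB, 22_000_000)    # whole data set
sample = average_file_size_kb(200 * GB, 1_000_000)  # first 5% used in the tests

print(f"full data set: ~{full:.0f} KB per file")
print(f"5% sample:     ~{sample:.0f} KB per file")
```

The sample's average file size comes out well below that of the full set, which is what "not homogeneous" means here: the first 5% of the data set skews toward smaller files.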
We used the technology that we had created to evaluate the CSPs and had later adapted to migrate customer data. As input, the tool takes a source CSP, a target CSP, and the number of hosts (machines) to use for the data movement. When run, the tool first determines the list of files to copy from the source CSP and the list of files already present on the target CSP. It then finds the difference (the files not yet on the target CSP) and uses that list as the set of files to copy. It then splits that list across the machines being used to do the data movement. So if it were using 10 machines and had 22 million files to move, each machine would be assigned 2.2 million files. Items from the list are divided amongst the machines in a round-robin fashion to ensure fairness, good resume-ability, etc.
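The planning step described above can be sketched roughly as follows. This is not the actual tool: `plan_migration` and its parameters are hypothetical names, and the sketch assumes the file listings fit in memory as plain lists of object keys.

```python
from typing import Iterable, List

def plan_migration(source_keys: Iterable[str],
                   target_keys: Iterable[str],
                   num_hosts: int) -> List[List[str]]:
    """Return one work list per host (hypothetical helper, not the real tool)."""
    if num_hosts < 1:
        raise ValueError("num_hosts must be at least 1")

    # Files already on the target do not need to be copied again.
    already_present = set(target_keys)
    # Sort so that a re-run produces the same ordering of pending files.
    pending = sorted(k for k in set(source_keys) if k not in already_present)

    # Deal the pending files out round-robin, one at a time per host.
    assignments: List[List[str]] = [[] for _ in range(num_hosts)]
    for index, key in enumerate(pending):
        assignments[index % num_hosts].append(key)
    return assignments

plan = plan_migration(["a", "b", "c", "d", "e"], ["b"], num_hosts=2)
print(plan)
```

Because the plan is always computed from what is still missing on the target, re-running it after an interruption only assigns the files that have not yet arrived, which is one way to get the resume-ability mentioned above.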
On each machine the tool uses 50 concurrent operations (threads) to move the set for that machine. So with 10 machines, we’d have 500 concurrent operations moving data from the source CSP to the target CSP. During the tests the machines generally had a load average in the 15-20 range (when the targets could keep up), meaning that they were quite busy.
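A per-machine worker pool along these lines might look like the sketch below. It is an illustration only: `move_assigned` is a hypothetical name, and `copy_one` stands in for whatever function actually streams one object from the source CSP to the target CSP.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable

def move_assigned(keys: Iterable[str],
                  copy_one: Callable[[str], None],
                  max_workers: int = 50) -> Dict[str, BaseException]:
    """Copy every key using up to max_workers threads; return failures by key."""
    failures: Dict[str, BaseException] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one copy task per file assigned to this machine.
        futures = {pool.submit(copy_one, key): key for key in keys}
        for future in as_completed(futures):
            error = future.exception()
            if error is not None:
                failures[futures[future]] = error
    return failures

# Usage with a stand-in copy function that fails for one key.
copied = []

def fake_copy(key: str) -> None:
    if key == "bad":
        raise IOError("simulated failure")
    copied.append(key)

failed = move_assigned(["a", "b", "bad", "c"], fake_copy)
print(sorted(copied), list(failed))
```

Threads are a reasonable fit here because each operation spends most of its time waiting on the network rather than on the CPU.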
Data is moved between the CSPs using an encrypted HTTPS connection and the data is never stored on disk by the machines doing the movement. Since we were moving data stored by Nasuni Filers, the data was encrypted at the source and the tool had no visibility into the data.
To measure how far the systems could go, we scaled up the number of hosts doing the migration from 1 machine to 40 machines. As the load increased, we got increasing error rates from all the providers. Our tool has the ability to retry on errors, so it could run to completion despite them, but the errors and the necessary retries have an impact on performance. Errors like “Server too busy” from Azure, “Internal Error” from Amazon, or the more frightening “The resource could not be found” error from Rackspace are not uncommon as you scale up the load on the CSPs. All eventually did the right thing with appropriate retries.
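One common way to implement this kind of retry is shown below. The text only says that the tool retries on errors; the exponential backoff with jitter, the attempt limit, and the `TransientError` class are all assumptions made for this sketch, not a description of the tool itself.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class TransientError(Exception):
    """Hypothetical marker for retryable errors such as 'Server too busy'."""

def with_retries(operation: Callable[[], T],
                 max_attempts: int = 5,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait 0.5s, 1s, 2s, ... plus jitter so workers do not retry in lockstep.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
    raise ValueError("max_attempts must be at least 1")

# Usage: an operation that fails twice before succeeding.
calls = {"count": 0}

def flaky_copy() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("Server too busy")
    return "copied"

result = with_retries(flaky_copy, sleep=lambda seconds: None)  # skip real sleeps
print(result, calls["count"])
```

Backing off between attempts matters at this scale: with hundreds of concurrent operations, immediate retries would add load to a provider that is already signalling it is overloaded.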
To run the jobs, we selected Amazon EC2 as our cloud compute provider. We chose the EC2 “m1.large” machine type. We selected EC2 for a few reasons:
While we picked EC2 for the reasons above, we were a bit surprised to find out that EC2 limits how many machines you may run by default, and that it’s a pretty small number. Accounts are limited to 20 machines unless you contact AWS and request more. To get around this silly artificial limit when testing with more than 20 machines, we just mixed machines from multiple accounts. Since these are generic machines in the cloud, machines can be mixed from multiple accounts or even multiple providers.
In an ideal world we would have repeated a number of combinations of these tests using compute resources from different providers including Rackspace and Microsoft, but both time and cost were limited for this study.
Rob Mason has more than 20 years of operational, management and software development experience, all of it in storage. A meticulous builder and obsessive tester, with an eye for talented engineers, Rob produces rock-solid software, and, through his own example of hard work and ingenuity, inspires his teams to outdo themselves. His determination for thoroughness extends to financial and operational matters, and at Nasuni, he is a powerhouse behind the scenes, managing the company’s operations, in addition to its engineering team. As the VP of Engineering at Archivas from 2004 to acquisition, Rob oversaw all development and quality assurance. After the Hitachi acquisition, he continued in his role, as VP of HCAP Engineering, managing the integration of his team with Hitachi’s and supporting the rollout of HCAP. Before joining Archivas, he was a senior manager at storage giant EMC, where he was responsible for the API, support applications and partner development for EMC’s content-addressed storage product, Centera. In a previous stint at EMC, he was Manager and Principal Design Engineer for the elite Symmetrix Group, where he improved the speed and reliability of EMC’s flagship enterprise storage disk array. Between Centera and Symmetrix, Rob was the co-founder and VP of engineering at I/O Integrity, a storage-based startup developing a high-performance caching appliance. He has a bachelor of science from Rensselaer Polytechnic Institute and a master’s in business administration with honors from Rutgers University. Rob holds upwards of 30 patents.