Breaking Point: 7 Reasons Traditional Disaster Recovery Falls Short
Nasuni’s Ryan Miller explores the top reasons traditional disaster recovery falls short in large enterprises.
November 6, 2024 | Ryan Miller
During my 20+ years in Professional Services, Disaster Recovery has always been near the top of the list of topics that IT staff fear the most. We often referred to disasters as potential RGEs, or Resume Generating Events, and joked that the Disaster Recovery plan was to auto-email out resumes and cover letters upon primary site failure.
Disaster recovery efficacy is typically thought of as a sliding scale. Low RPO (Recovery Point Objective) and low RTO (Recovery Time Objective) are at odds with cost: you can spend more money to get more favorable metrics, but at some point the juice isn’t worth the squeeze. As a result, IT staff often allocate more resources to shoring up the resiliency of the primary site in the hope of never needing to enact a disaster recovery plan. It’s not necessarily a bad strategy, but it has a glaring weakness: it relies a little too heavily on hope.
So, is there a way to improve the situation? I believe so, and I’ll explore those possibilities in the next post in this series. But first, let’s take a step back and look at the 7 reasons traditional disaster recovery falls short.
1. Required bandwidth
During the day, production workloads consume a substantial portion of available bandwidth. At night, when many organizations schedule their third-party backups, the same bandwidth is again heavily utilized, creating potential bottlenecks. This competition for a limited resource severely impacts the ability to complete timely data replications.
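To make that concrete, here is a back-of-the-envelope sketch in Python. The link size, backup share, change rate, and window length below are entirely hypothetical; the point is how quickly a shared link can turn a nightly replication into an overrun.

```python
# Back-of-the-envelope: can tonight's replication finish before the window closes?
# All figures below are hypothetical and will vary widely by environment.

changed_data_tb = 2.0   # data changed since the last replication (TB)
link_gbps = 1.0         # total WAN link to the DR site (Gbit/s)
backup_share = 0.6      # fraction of the link consumed by nightly backups
window_hours = 8        # off-peak replication window (hours)

available_gbps = link_gbps * (1 - backup_share)
# TB -> Gbit: 1 TB = 8,000 Gbit (decimal units, ignoring protocol overhead)
transfer_hours = (changed_data_tb * 8000) / (available_gbps * 3600)

print(f"Estimated transfer time: {transfer_hours:.1f} h (window: {window_hours} h)")
# With these numbers: ~11.1 h needed vs. an 8 h window -- the cycle slips.
```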
2. Replication timing
Explosive data growth, coupled with the fact that there are still only 24 hours in the day, creates a replication challenge. Bursty or new workloads can exacerbate the problem during critical replication periods. When a replication fails to complete on time, not only is the current replication cycle disrupted, but it can also impact subsequent schedules. This can lead to data inconsistencies and increased vulnerability.
3. Replication scheduling
Replications are often scheduled once a day during off-peak hours, so organizations are typically looking at a 24-hour Recovery Point Objective (RPO). That may not suffice for mission-critical workloads. As a result, IT staff find themselves splitting up workloads and managing multiple replication schedules. This complicates the disaster recovery process and adds to the burden on IT, increasing the risk of errors and inefficiencies.
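In practice, that splitting often ends up looking something like the sketch below. The workload names, intervals, and windows are hypothetical; the point is that every additional schedule is one more thing to monitor and troubleshoot, and each tier’s worst-case RPO is still bounded by how often it replicates.

```python
# Hypothetical sketch of tiered replication schedules and the RPO each one implies.

replication_schedules = {
    "finance-db":  {"interval_hours": 4,   "window": "hourly off-peak slots"},
    "erp":         {"interval_hours": 8,   "window": "22:00-02:00"},
    "file-shares": {"interval_hours": 24,  "window": "00:00-06:00"},
    "archive":     {"interval_hours": 168, "window": "weekends"},
}

for workload, cfg in replication_schedules.items():
    # Worst-case RPO is roughly the replication interval plus however long
    # the transfer itself takes to complete.
    print(f"{workload}: worst-case RPO >= {cfg['interval_hours']} h "
          f"(replicates during {cfg['window']})")
```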
4. Failover testing
Many organizations conduct failover testing only annually, if at all. Additionally, many of the variables that come with a real-world disaster, including the element of surprise, are avoided. A test deemed successful breeds complacency, which translates into prolonged recovery times and operational disruptions when a real disaster strikes.
5. Failback complications
Failback procedures, essential for restoring normal operations after a disaster, present their own set of issues. Integrating changes made during the failover period back into the primary site, for example, is rarely straightforward. Some replication solutions address this by designating the secondary site as the new primary, but that only works if both sites are configured equivalently, which is rare. Other solutions require a full replication from the secondary site back to the primary, which is not always feasible or practical.
6. Expense of maintaining multiple clusters
To cope with data growth, many organizations purchase a second, identical hardware cluster at a remote site and use storage-based replication. This can be effective in addressing many of the bandwidth-centric issues above, but it is costly. This approach also relies on replicating snapshots, and storage arrays typically cap the number of snapshots they can retain, so you end up compromising between data retention and a suitable RPO.
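To illustrate the trade-off, assume (purely hypothetically) an array that retains at most 256 replicated snapshots; real caps vary by vendor and model. Tighter snapshot intervals improve RPO but shrink retention, and vice versa.

```python
# Illustrative only: assume a hypothetical 256-snapshot retention cap.
MAX_SNAPSHOTS = 256

for interval_hours in (1, 4, 24):
    retention_days = MAX_SNAPSHOTS * interval_hours / 24
    print(f"Snapshot every {interval_hours:>2} h -> RPO ~{interval_hours} h, "
          f"retention capped at ~{retention_days:.0f} days")

# Snapshot every  1 h -> RPO ~1 h,  retention capped at ~11 days
# Snapshot every  4 h -> RPO ~4 h,  retention capped at ~43 days
# Snapshot every 24 h -> RPO ~24 h, retention capped at ~256 days
```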
7. Reliance on backups
Many organizations rely on backups as their disaster recovery plan. Over my career, I’ve generally been impressed with the results of this approach, but it is a significant burden on everyone involved and takes an inordinate amount of time. Furthermore, day-to-day recoveries are performed by IT staff who are familiar with the organization’s processes and procedures; in a true disaster recovery scenario, there is no guarantee that the people with the most intimate systems knowledge will be available. You can start to see where this goes vis-à-vis factors like disaster recovery playbooks and overall documentation, managing rotating shifts of personnel, identifying the last full backup and its subsequent incremental backups, and so on.
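For a sense of what “identifying the last full backup and its subsequent incrementals” involves, here is a minimal sketch. The catalog entries are invented for illustration; a real catalog would come from the backup product itself, and someone unfamiliar with the environment would also need to know where to find it and how far to trust it.

```python
# Hypothetical sketch of assembling a restore chain: find the most recent full
# backup, then every incremental taken after it, in order.
from datetime import datetime

catalog = [
    {"taken": datetime(2024, 10, 27, 1, 0), "type": "full"},
    {"taken": datetime(2024, 10, 28, 1, 0), "type": "incremental"},
    {"taken": datetime(2024, 10, 29, 1, 0), "type": "incremental"},
    {"taken": datetime(2024, 11, 3, 1, 0),  "type": "full"},
    {"taken": datetime(2024, 11, 4, 1, 0),  "type": "incremental"},
]

last_full = max((b for b in catalog if b["type"] == "full"),
                key=lambda b: b["taken"])
chain = [last_full] + sorted(
    (b for b in catalog
     if b["type"] == "incremental" and b["taken"] > last_full["taken"]),
    key=lambda b: b["taken"])

for step in chain:
    print(step["type"], step["taken"].date())
```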
Why does this have to be so difficult?
The common underlying theme is the difficulty of getting data from point A to point B effectively with traditional storage. Traditional storage, with its inherent limitations around capacity and its challenges around durability, lends itself to siloed architectures. These flaws underscore the need for organizations to reassess their disaster recovery solutions and consider a fundamentally different approach, one that is more adaptive, scalable, cost-effective, and efficient. In Part 2, I’ll discuss what such an approach looks like and some of its implications.