If disaster strikes, how quickly can you recover?

Dec. 16, 2013
The longer an organization takes to recover, the more costly it becomes

If a hurricane, fire, earthquake or even a high-impact human error were to render your business facilities unusable, how long could your organization operate without mission-critical IT systems? How long would it take you to restore operations — and to what extent could you repair the damage short- and long-term?

In the face of a natural or man-made disaster, companies can be crippled for days, weeks or even months, and many risk permanent losses of customers, revenue and reputation. Given the extent to which most companies today are dependent on computerized business processes, a disaster-recovery plan is a necessity. The longer it takes to restore systems and data, the more difficult it will be to recover from the disruption.

Creating a disaster-recovery plan involves prioritizing current systems, pinpointing mission-critical applications and data, and establishing the most cost-effective backup and recovery strategies. Since implementation of the plan may involve significant capital investment in IT infrastructure, fully realizing a disaster-recovery plan may require several years of phased implementation.

Your disaster-recovery plan should answer the following questions:

  • What are your business needs related to disaster recovery?
  • Where are the gaps?
  • How can you close the gaps?
  • How long will it take to close the gaps?

What are your disaster-recovery business needs?

Disaster-recovery planning should begin with a review of possible threats and impacts to your organization’s processes and systems. Health care and higher education organizations, for example, may use hundreds of applications in many different departments — and near-constant uptime is more critical for some than for others. Prioritization is essential, because establishing immediate recovery for every single system will require more investment than would be feasible for most organizations.

The best way to separate mission-critical from “nice-to-have” applications is to interview end users, application owners and other stakeholders, and to quantify the business impact of potential system disruptions. What will impact human health and life safety? What scenarios might arise if an application or data set becomes unavailable? How long can a service be unavailable without causing irreparable harm? What is the true cost of system downtime?

Quantifying the business impact will enable the planning team to objectively separate the mission-critical from secondary systems. This “business impact analysis” (BIA) can be used to establish the “recovery point objective” (RPO) for data and a “recovery time objective” (RTO) for each critical system.
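As a rough illustration of how a business impact analysis can turn interview results into a ranking, the sketch below sorts systems so that life-safety systems come first, followed by the rest in order of estimated hourly downtime cost. The system names other than the EMR, all dollar figures, and the field names are hypothetical placeholders, not data from any real organization.

```python
from dataclasses import dataclass


@dataclass
class SystemImpact:
    name: str
    hourly_downtime_cost: float  # hypothetical estimate gathered in stakeholder interviews
    life_safety: bool            # True if an outage affects health or life safety
    max_tolerable_hours: float   # longest outage before irreparable harm (candidate RTO)


def rank_by_impact(systems):
    """Life-safety systems first, then highest estimated hourly cost first."""
    return sorted(systems, key=lambda s: (not s.life_safety, -s.hourly_downtime_cost))


# All figures below are invented for illustration only.
systems = [
    SystemImpact("cafeteria menu site", 50.0, False, 72.0),
    SystemImpact("EMR", 20000.0, True, 2.0),
    SystemImpact("payroll", 1500.0, False, 48.0),
]

ranked = rank_by_impact(systems)
for s in ranked:
    print(f"{s.name}: candidate RTO {s.max_tolerable_hours} h, "
          f"est. cost ${s.hourly_downtime_cost:,.0f}/h")
```

A real analysis would weigh more factors, such as reputational harm and regulatory exposure, but even a simple ranking like this gives stakeholders a shared, quantified starting point.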

For example, one Midwest hospital had long used an electronic medical record (EMR) system to dramatically increase its capacity for emergency-room admissions. The hospital determined that EMR downtime of more than two hours would lead to significant delays in patient care because staff would need to rely on inefficient manual, paper-based processes. Those delays would first create health risks for patients, which would be the primary concern if the system were not restored quickly.

If the system remained down long enough, the hospital would need to redirect ambulances to competing facilities in order to protect the well-being of patients, and as a result revenues would decline significantly. Even after the EMR system was restored, the hospital would face an uphill battle to restore its reputation, and could see reduced patient visits over a period much longer than the initial system failure.

Clearly, the EMR system was mission-critical. Therefore, its recovery point objective was to restore 100 percent of EMR data for the past three months, with a recovery-time objective of two hours, the maximum length of time for which the emergency department could function without the EMR system before incurring a cascade of high-impact negative events.

Many organizations are challenged by this discovery phase. First of all, assessing the sheer volume of applications used in various departments can be daunting. In addition, business staff may not agree on which applications are mission-critical until they see the risks actually quantified in the business-impact analysis. Furthermore, it can be difficult to quantify the intangible results of negative media coverage and loss of reputation, but these results can have a long-lasting impact on profitability.

Where are the gaps?

Once mission-critical systems are defined, the organization should next analyze the gaps between the business needs and current recovery mechanisms. The in-house IT staff and outside technology vendors should be able to provide insight about current capabilities and recovery options specific to particular applications.

The Midwest hospital, for instance, learned that restoring the EMR would take 48 to 50 hours under current conditions, whereas the recovery-time objective was no more than two hours, and ideally near zero. The gap was unacceptable, as the hospital could not afford to risk prolonged EMR downtime.
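The arithmetic behind a gap analysis is simple enough to sketch. The example below uses the hospital's figures (48 to 50 hours to restore, against a two-hour recovery-time objective); the function name is a hypothetical helper chosen for illustration.

```python
def recovery_gap_hours(current_recovery_hours, rto_hours):
    """Return how far the current recovery time exceeds the RTO (0 if the RTO is met)."""
    return max(0.0, current_recovery_hours - rto_hours)


# Figures from the hospital example: 48 to 50 hours to restore, RTO of two hours.
best_case_gap = recovery_gap_hours(48, 2)
worst_case_gap = recovery_gap_hours(50, 2)
print(f"EMR recovery gap: {best_case_gap:.0f} to {worst_case_gap:.0f} hours")
```

Running the same comparison for every mission-critical system produces a list of gaps that can be sorted and presented alongside the business-impact figures.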

The organization inevitably must address the difference between business staff's perceptions and current realities. Business staff may not realize that assuring timely recovery requires IT investments that may not have been planned or adequately funded, and they may be unhappy to learn that their operations are not as well protected as they had assumed.

Since senior business staffers typically control the budgeting of technology projects, it is important to engage them in reviewing the business impact and gap analyses. To remediate the most critical gaps, the IT team will need to make a clear business case for investing in recovery infrastructure, such as storage replication, disk-based backups, and off-site servers. The business-impact analysis is a helpful tool for juxtaposing business risk against the cost of IT infrastructure investments.

How can you close the gaps?

After the gaps are identified, the organization can create a technical and procedural plan to close them, along with the associated costs of each recovery strategy. Clearly, the mission-critical systems will receive the highest level of attention and investment, while the less-critical applications will be recoverable over longer periods of time.

Often, the best solution is to create a secure “hot site” for highest-priority applications, with disk or tape backup for others. This off-site facility hosts a replica of the mission-critical applications and data, and can be quickly activated during a disaster. Since a hot site that completely mirrors all IT systems could cost even a small organization seven figures, this strategy must be deployed very selectively.

Less-costly backup options include disk, tape or, increasingly, the cloud. As cloud-based services evolve, cloud backup may become another standard recovery strategy. In many cases today, however, cloud options are best suited to disaster recovery for less-critical systems. Over time, the cloud will likely become more compelling even for mission-critical systems.
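One way to picture the selective approach described above is a simple rule that assigns each system a recovery strategy based on its recovery-time objective. This is a minimal sketch only: the four-hour threshold and the function name are assumptions made for illustration, and a real plan would also weigh cost, data volume and vendor capabilities.

```python
def suggest_strategy(rto_hours, hot_site_threshold_hours=4.0):
    """Hypothetical rule: tight RTOs get the hot site, everything else gets backup.

    The default threshold is an illustrative assumption, not an industry standard.
    """
    if rto_hours <= hot_site_threshold_hours:
        return "hot site"
    return "disk, tape or cloud backup"


# The EMR's two-hour RTO comes from the hospital example; the 48-hour case is invented.
print(suggest_strategy(2))
print(suggest_strategy(48))
```

Keeping the threshold as an explicit parameter makes the trade-off visible: lowering it shrinks the hot site and its cost, while raising it protects more systems at a higher price.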

How long will it take to close the gaps?

Closing the gaps between business needs and capabilities may be a multiyear process, depending on the time and resources available. A thorough disaster-recovery plan will include the details of IT infrastructure improvements, timelines for project phases and resources required so the organization can budget for the multiyear implementation costs.

For most organizations, disaster-recovery planning is a valuable education process. It is comparable to buying an insurance policy, and you should not invest in the insurance until the risks and mitigation strategies are fully understood. Most important, disaster-recovery planning compels the organization to anticipate the worst — and feel more confident about tackling the aftermath.

About the Author:

Nick Chandler is a Senior Consultant for Data Center & Application Delivery at Burwood Group, a consulting firm specializing in IT management and infrastructure solutions. Chandler specializes in the design and deployment of Data Center infrastructure technologies, including core networking, server virtualization, and unified storage. Founded in 1997 and headquartered in Chicago, Ill., Burwood Group serves local, national and international clients, helping them bridge the gap between business strategy and technology solutions.