Summer Storms Shouldn’t Take Down Your Servers

Summer storms bring power outages. A power outage can mean a data center outage, but it shouldn’t. Every business should have a disaster recovery plan that keeps it functioning through power outages and other incidents that take down systems.

A disaster recovery plan includes the steps needed to bring information systems back online, but it isn’t just a copy of the daily runbook. The plan needs to document:

Inventory of systems affected. Both hardware and software resources should be identified.
Risk assessment and prioritization. Some systems can be down for a while without major impact on the business; others serve critical business functions and can tolerate only minimal downtime. The analysis should rate each system’s level of risk and its importance to the business.
Recovery objectives. “As soon as possible” is not specific enough guidance for the IT team. To design an appropriate recovery procedure, the business should define a recovery time objective (RTO) and a recovery point objective (RPO) for each application. The RTO tells the IT team how long an application can be down; the RPO tells them how much data, measured as a window of time before the failure, the business can afford to lose. With those numbers in mind, the technology team can implement high availability and backup solutions appropriate to the business needs. Without them, IT has no choice but to either overspend by providing high availability to every application or underspend and fail to give critical applications the support they need.
Recovery procedures. Teams shouldn’t need to scramble to figure out what to do in the middle of a crisis, so the plan should include specific details of the recovery process. It’s particularly vital to document dependencies so that systems are brought up in the appropriate sequence. It’s equally critical to document how to validate the restored servers and verify that they’re up and operational with the correct data.
Recovery personnel. Include a list of key contacts and their backups. Also document responsibilities, including who has the all-important authority to invoke the recovery plan.
Failback process. Recovery may include bringing systems up at another location; eventually, they need to be restored to the normal production servers. In many ways, this process is the same as the recovery process, just targeting a different set of machines, but any special considerations should be noted.
Impacts on business processes. It’s possible that some recovery procedures will change the way the business needs to perform certain operations. For instance, you may opt not to have secondary servers for a low-priority process and to switch to a manual process in case of failure.
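
To make the recovery objectives and dependency items above concrete, here is a minimal Python sketch of how a plan might record each system’s RTO, RPO, and dependencies, and then derive a startup sequence from them. The system names and numbers are hypothetical, chosen only for illustration, and the sketch assumes Python 3.9 or later for the standard library’s graphlib module. It is one possible way to represent a plan, not a prescribed format.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class RecoveryTarget:
    """One inventoried system and the objectives agreed with the business."""
    name: str
    rto_minutes: int  # how long the system may be down
    rpo_minutes: int  # how much recent data (as a time window) may be lost
    depends_on: tuple[str, ...] = ()  # systems that must be up first


def recovery_order(targets):
    """Return system names in an order that starts dependencies first."""
    known = {t.name for t in targets}
    for t in targets:
        missing = set(t.depends_on) - known
        if missing:
            # A dependency that is not in the inventory is a gap in the plan.
            raise ValueError(f"{t.name} depends on uninventoried systems: {sorted(missing)}")
    graph = {t.name: set(t.depends_on) for t in targets}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical inventory, for illustration only.
PLAN = [
    RecoveryTarget("web-frontend", rto_minutes=60, rpo_minutes=60, depends_on=("app-server",)),
    RecoveryTarget("app-server", rto_minutes=60, rpo_minutes=60, depends_on=("database",)),
    RecoveryTarget("database", rto_minutes=30, rpo_minutes=15, depends_on=("storage",)),
    RecoveryTarget("storage", rto_minutes=30, rpo_minutes=15),
]

if __name__ == "__main__":
    for step, name in enumerate(recovery_order(PLAN), start=1):
        print(step, name)
```

Keeping dependencies in the same record as the objectives means a missing or circular dependency surfaces as an error while the plan is being written rather than during an outage.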

Once the recovery plan is developed, it needs to be tested to ensure that it works. It’s surprising how easy it is to leave important systems and important steps out of the plan! Only testing can provide the reassurance that the plan will be effective. Tests can be as simple as a tabletop read-through, but full-scale disaster simulations that execute the documented processes are the most robust way to test a disaster recovery plan.
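
A full-scale test is most useful when its results are compared against the objectives the business set. The following Python sketch shows one way to do that comparison, assuming the drill team records the measured downtime and the data loss window for each system. The class, field names, and message wording are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillResult:
    """Objectives and measurements for one system in a recovery drill."""
    system: str
    rto_minutes: int        # target downtime agreed with the business
    rpo_minutes: int        # target data loss window agreed with the business
    downtime_minutes: int   # downtime measured during the drill
    data_loss_minutes: int  # data loss window measured during the drill


def missed_objectives(results):
    """Return a human-readable line for every objective the drill missed."""
    misses = []
    for r in results:
        if r.downtime_minutes > r.rto_minutes:
            misses.append(
                f"{r.system}: downtime {r.downtime_minutes} min exceeds RTO {r.rto_minutes} min"
            )
        if r.data_loss_minutes > r.rpo_minutes:
            misses.append(
                f"{r.system}: data loss window {r.data_loss_minutes} min exceeds RPO {r.rpo_minutes} min"
            )
    return misses


if __name__ == "__main__":
    # Hypothetical drill measurements, for illustration only.
    drill = [
        DrillResult("database", rto_minutes=60, rpo_minutes=15,
                    downtime_minutes=45, data_loss_minutes=30),
        DrillResult("app-server", rto_minutes=120, rpo_minutes=60,
                    downtime_minutes=150, data_loss_minutes=0),
    ]
    for line in missed_objectives(drill):
        print(line)
```

An empty result means the drill met every stated objective; any line returned points to a procedure or a backup schedule that needs revisiting before the next test.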

Finally, the plan needs to be kept up to date to reflect changes in IT resources and business processes. It’s a good idea to update the plan as part of your change management process whenever a new system or device is deployed in production. Annual reviews, coordinated with an annual test, are also effective.

For more guidance on developing an effective disaster recovery strategy, contact CCS Technology Group.