Disaster Recovery 101: Update & Review

For years in healthcare IT, we have associated Disaster Recovery (DR) with natural disasters, weather events, power issues, hardware failures, and the like. While those certainly need to be accounted for in your planning, the odds of a catastrophic weather event impacting your data center environment are low. Fast forward to 2017, when WannaCry impacted thousands of organizations and left numerous healthcare systems offline for days or weeks (although the first ransomware attack dates back to 1989).

Now, this conversation is focused on DR, and we will save the security topic for another day, but I think it’s important to know what you are now protecting against. The days of only having to worry about a hurricane, tornado, or array failure are all but gone. In the last few years, I have seen more DR events due to a security incident than for any other reason.

Just as the reasons DR matters have changed in the health IT industry over the past few years, so has the way we plan for these events. In the IT world, we have always used terms like RPO (Recovery Point Objective) and RTO (Recovery Time Objective or Return to Operations), and yes, they are still representative of the overall goal. However, there are a few other things I think are equally, if not more, important when discussing DR.

Before we get into the details, below is a graphic that shows the “new” version of a DR timeline.

As you look at the chart above, you will see that RPO and RTO are still included in the overall timeline, but now we are also accounting for items such as Maximum Tolerable Downtime (MTD), Working Recovery Time (WRT), and Decision Time. Let’s look at each component individually first and then see how they fit together.

Decision Time

Let’s start with what I consider one of the more important items in the overall timeline: the “Decision Time”. This is the time between when the failure or event occurs and when the organization declares a disaster and initiates the DR plan. Besides being the main factor in whether or not the target timeframes are hit, it’s the single item that we can’t account for upfront. I personally have been in the position where team members are troubleshooting an issue like a virus outbreak on the core EMR system and you are being told things like “we are close to having this resolved”, “I’m waiting for a call back”, or my favorite, “this SHOULD work”. Those are the toughest calls to make as a leader in the organization, since you want to allow the team to attempt to eradicate the virus and clean up the system, but with every minute that passes you are delaying the point at which you can invoke your DR plan, start system recovery, and get the system back in service for improved patient safety and outcomes.

Decision Time is one factor that I wish I had a formula for: after X hours, invoke DR procedures, recover from the last backup, rekey data, and get the system online. That threshold will be different for every organization out there, but it’s a very important aspect of your DR procedures. Waiting until you are in the middle of an event to work out that threshold can be the difference between 8 hours and 3 days of downtime.

Recovery Point Objective (RPO)

Recovery Point Objective (RPO) is the time between your last backup and when you invoke your DR procedures. In other words, it is the window of data you will have to rekey. This backup may be your daily backup to physical tape, virtual tape, or a backup device, or maybe a storage array-based snapshot. Either way, there is one very important thing to remember in our industry: application consistency is king! Without application consistency, the backup or snapshot is as useless as that medium-size tee-shirt in my closet (although I’m convinced if I leave it there, one day it will fit!). In the MEDITECH space, there is a process called quiescing and unquiescing used to obtain application consistency. The process quiesces (pauses) the MEDITECH databases to stop all writes, allows the backup process to start, and then unquiesces the databases to allow writes again. In the Epic space, this process is called freeze and thaw. These processes will give you true application consistency and, therefore, a true RPO.
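To make the ordering of those steps concrete, here is a minimal Python sketch of the quiesce, start backup, unquiesce pattern. The `quiesce()` and `unquiesce()` methods and the `FakeDatabase` class are hypothetical stand-ins made up for illustration. They are not MEDITECH or Epic APIs, and your backup vendor’s integration will look different. The point it shows is that writes must resume even if the backup step fails.

```python
from contextlib import contextmanager


@contextmanager
def quiesced(database):
    """Pause writes for the duration of the block, then always resume them."""
    database.quiesce()        # hypothetical call: stop all writes
    try:
        yield database
    finally:
        database.unquiesce()  # hypothetical call: allow writes again, even on error


class FakeDatabase:
    """Stand-in that only records the order of the steps (not a real product API)."""

    def __init__(self):
        self.events = []

    def quiesce(self):
        self.events.append("quiesce")

    def unquiesce(self):
        self.events.append("unquiesce")


def take_consistent_snapshot(database, start_snapshot):
    """Start the snapshot only while the database is quiesced."""
    with quiesced(database):
        start_snapshot()


db = FakeDatabase()
take_consistent_snapshot(db, lambda: db.events.append("snapshot started"))
print(db.events)  # ['quiesce', 'snapshot started', 'unquiesce']
```

Wrapping the pause in a context manager is just one way to guarantee the unquiesce step runs. The important design choice is that the database is never left paused because a backup job errored out.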

The average RPO usually ranges between 1 hour and 4 hours.
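As a rough illustration of how you might check a backup schedule against an RPO target, the sketch below measures the gap between the newest application-consistent backup and the moment DR is invoked. The timestamps, the 4-hour target, and the function name `data_loss_window` are all made-up examples, not values from any real environment.

```python
from datetime import datetime, timedelta


def data_loss_window(backups, dr_invoked_at):
    """Time between the newest application-consistent backup and DR invocation.

    `backups` is a list of (timestamp, is_application_consistent) pairs.
    Returns None when no usable backup exists before `dr_invoked_at`.
    """
    usable = [ts for ts, consistent in backups if consistent and ts <= dr_invoked_at]
    if not usable:
        return None
    return dr_invoked_at - max(usable)


# Illustrative data only: the 08:00 snapshot was taken without quiescing,
# so it does not count toward the recovery point.
backups = [
    (datetime(2024, 1, 10, 0, 0), True),
    (datetime(2024, 1, 10, 4, 0), True),
    (datetime(2024, 1, 10, 8, 0), False),
]
dr_invoked_at = datetime(2024, 1, 10, 9, 30)
rpo_target = timedelta(hours=4)

window = data_loss_window(backups, dr_invoked_at)
print(window, window <= rpo_target)  # 5:30:00 False
```

In this example the most recent snapshot is ignored because it is not application consistent, which pushes the real recovery point back to 04:00 and misses the 4-hour target.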

Recovery Time Objective (RTO)

The RTO, as traditionally defined, is the time it takes the organization to bring the systems back online and functional for users after an event is declared. In the new model, the RTO is one portion of the total Recovery Time Frame, which is the combination of RTO and Working Recovery Time.

Working Recovery Time (WRT)

Working Recovery Time is the other portion of the total Recovery Time Frame, which is calculated by adding together the RTO and WRT. WRT is the time that elapses between bringing the system back online and finishing the entry of data captured on paper during the downtime. This time is important to account for because it has a significant impact on hospital operations, especially in high-volume facilities. It can include manually entering new patients from registration all the way through admissions, order entry, billing, etc.

Recovery Time Frame (RTF)

The RTF is the sum of the Recovery Time Objective and the Working Recovery Time. RTF = RTO + WRT

Maximum Tolerable Downtime (MTD)

The MTD is the total downtime an organization can sustain, measured from when the outage occurs until normal business operations are resumed.
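To see how the pieces fit together, here is a small back-of-the-envelope sketch. It assumes the timeline components run one after another, so Decision Time plus the Recovery Time Frame (taken as RTO plus WRT, as described in the RTO and WRT sections) has to fit inside the MTD. The hour values are invented for illustration, and `max_decision_time` is a hypothetical helper, not an industry-standard formula.

```python
from datetime import timedelta


def max_decision_time(mtd, rto, wrt):
    """Longest Decision Time that still leaves room for RTO + WRT inside the MTD.

    Assumes Decision Time, RTO, and WRT happen sequentially. A negative
    result means the recovery alone already exceeds the MTD.
    """
    rtf = rto + wrt  # Recovery Time Frame
    return mtd - rtf


# Illustrative numbers only.
mtd = timedelta(hours=24)
rto = timedelta(hours=8)
wrt = timedelta(hours=10)

budget = max_decision_time(mtd, rto, wrt)
print(budget)  # 6:00:00
```

With these example numbers, the team would have at most 6 hours of troubleshooting before a disaster must be declared. That is about as close to an “after X hours, invoke DR” rule as this kind of arithmetic can get, and the inputs will differ for every organization.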

Now that we know what we need to account for, it’s time for the million-dollar question: what is the Maximum Tolerable Downtime you can withstand as an organization?

I hope as you read this it provokes some great conversations internally and puts DR into a new light.

In part 2 of this blog, we will review the different options and planning activities to achieve a DR solution that aligns with the business and budget needs.

Mike Donahue leads CloudWave’s Technical Services team. When he’s not managing engineers and consultants, he’s enjoying Florida life with his family in his new home.