While creating a backup is straightforward, two concepts can complicate the issue–RTO and RPO. Here we describe the concepts and give some ideas on how to get started.
Recovery Point Objective
Recovery Point Objective (RPO) is defined by Wikipedia as “the maximum targeted period in which data might be lost.”
RPO is essentially telling you how much data loss is acceptable for an organization. Inevitably, when I look at a lot of organizations and the way they’ve organized their backup structure up, they’re basically allowing for data loss during the day. If an organization can’t lose any data, ever, things get really expensive.
An organization is always initially going to say, “well, none–we can’t have any data loss”, and then you show them “this is how many additional zeros you add to the current infrastructure to ensure that there is never any data loss”. Hopefully they get more reasonable and say, “well, you know what? We only update the data once a day, so, all right, maybe we can lose eight hours” or, “we can lose this amount of time and still consider it to be acceptable”–if not ideal.
If you’re using your database in simple mode instead of full mode, you’re saying, “okay, lose any data up to the last full backup. I don’t care.” And there are scenarios where that makes perfect sense. A data warehouse is usually a good scenario for this because you can get the data from somewhere else.
Your environment may dictate different RPOs depending on what you have in mind for each system. A single plan to apply to all the systems may not be appropriate.
Simple vs Full Recovery
Once you make the decision to put your database in full recovery mode, you have to take regular backups of the log. This is a separate process from your full backups.
Without the log backup, the log would just continue to grow because it doesn’t know when to allow for the loss. A full backup won’t help you in this scenario. At what point are you okay to allow the data in the log to be overwritten? Your log backups help you know that answer.
You can compress the log backups and hopefully they take up a lot less space than the raw data itself, but you’re still going to have to factor that in. A recovery doesn’t necessarily just mean, “oh, I’ve still got one backup”. That’s good, because sometimes the incident is not ‘your data center flooded’; sometimes the incident involves ‘somebody accidentally deleted all of the records from this really important table’. You need to go back a week ago, before this person deleted it. Coincidentally, they went on vacation six hours later, and at that point, if you’re having to pull a backup from a week ago…well, you better have that backup.
Disc space is generally affordable, so there’s no excuse to be missing a backup; cloud storage makes this very approachable now. It sure beats the downside, I can tell you that.
Recovery Time Objective
Recovery Time Objective (RTO) is the amount of time of downtime you’re going to have from the point of the disaster to the point you can get restore application availability. This can be extremely hard to test for, and we are generally overconfident in what we can do.
A common problem is usually you have correlated failures. What if you’re in a situation where you have a severe outage? It’s not like somebody fat-fingered something, but you have a severe outage. Maybe you got hit with ransomware and it’s not just, “okay, we need to restore this database”, but, “we need to restore our restores”. You may have to revert to a previous version of the storage backup or something like where you have to rebuild the Windows Server and re-install SQL. It may not be as easy as, “well, I did a test restore and for this terabyte of data, it took six hours”. It can be more complicated.
How to calculate RTO?
The business folks are the ones who should come up with these answers. You may have to play nice and ask for their help—this is important for us on the IT side because we have to implement their RTO and ensure we can meet them; however, it’s their job to come up with the requirements. In some organizations, you have people who take that very seriously. You have people who’ve done the risk analysis.
Maximum Tolerable Downtime
A third measure is maximum tolerable downtime (MTD). MTD is ‘at what point do people start suing you’ or are you thinking about just never bringing it back up again? This can be super complex and it does get fairly pricey pretty quickly.
This is where we have some features available to help us: availability groups, log shipping, failover to the cloud—even some of the VM features. These are features which make it easier to try and reduce RTO time.
Always some downtime
Even with some of these highly available features, there’s always a little bit of downtime. It can be seconds, but don’t think your application won’t notice. The setup will help you limit data loss, but the users may notice a rough transition. There should be some expectation with the business in terms of an outage. Are users going to get kicked out? “If you’re on the system and you’re doing something while we experience a failure…” you should communicate whatever the process is for the users.
*Edited excerpts from SQL Data Partners Podcast Episode 230