Are you playing Russian Roulette with your data? Many people are and may not even realize it. I actually think Russian Roulette is an acceptable policy, as long as you understand the consequences, which most people don’t.
In Russian Roulette, you take a single bullet, load it into a revolver, spin the cylinder, put the gun to your head and pull the trigger, giving a 1 in n chance of losing, where n is the number of chambers in the revolver (typically 6). It shouldn’t be necessary to explain the consequences.
However, it often is necessary to explain the consequences of playing Russian Roulette with data. In my view, you are playing Russian Roulette with your data when you have a weak, untested, or non-existent backup plan. It’s not unusual to run across a company where backups were configured several years ago and no one has touched them since. No one tests restores to verify that the backups are working, no data is sent to an offsite location, and often there is no visibility into how the backups are running at all. No one knows how the backups are doing until there is a problem and they are needed, by which time it is too late.
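Even a crude automated check beats silence. As a sketch (the directory layout, the `*.bak` naming, and the thresholds are all hypothetical), something like this could run from cron and complain when backups go stale; note that a real test should also restore the file somewhere and verify the data inside it:

```python
import time
from pathlib import Path


def check_backup_freshness(backup_dir, max_age_hours):
    """Return a list of problems; an empty list means backups look healthy.

    This only proves a recent backup *file* exists. It does not prove the
    file is restorable -- a real check should restore and verify the data.
    """
    problems = []
    backups = sorted(Path(backup_dir).glob("*.bak"),
                     key=lambda p: p.stat().st_mtime)
    if not backups:
        return [f"no backup files found in {backup_dir}"]
    newest = backups[-1]
    age_hours = (time.time() - newest.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"newest backup {newest.name} is {age_hours:.1f}h old")
    if newest.stat().st_size == 0:
        problems.append(f"newest backup {newest.name} is empty")
    return problems
```

Wire the output into whatever alerting you already have; the point is that a human hears about a stale or empty backup before a restore is needed, not after.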
The phrase Russian Roulette came to my mind over a dozen years ago when, as a sysadmin, I took a look at our backups and was a bit freaked out by what I saw. We had dozens of tape drives physically attached to the larger servers and no barcodes on any of the tapes. A handful of operators would swap tapes out every morning, manually label the previous night’s tapes and put in new ones. For the larger servers, they had to stand by during the backup to swap out tapes. The setup had evolved through simple, unplanned growth from a system that had worked fine many years before. As soon as I saw how these backups were running, I couldn’t stop seeing the image of Christopher Walken in The Deer Hunter, holding a gun to his head while his captors yelled “Mao!” I told my boss we were playing Russian Roulette and that it was just a matter of time before we landed on the loaded chamber and were unable to recover data. We didn’t actually lose data, but it did take us more than a day to recover our primary database server after an incident, so we lost more money than we would have spent fixing our backups.
The importance of RTO and RPO
Let’s cover a couple of key terms in backup and recovery: RTO and RPO.
RTO stands for Recovery Time Objective and is a measure of how quickly data must be restored. It is the maximum acceptable time to get the application back up to the state it was in before the outage or other problem occurred.
RPO stands for Recovery Point Objective and is a measure of how much data you can afford to lose, or how far back you have to go for a restore. If you do nightly backups, you have an RPO of roughly 24 hours: a failure just before tonight’s backup can lose a full day of changes.
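To make that arithmetic concrete: your real exposure is not the scheduled interval but the time since the last backup that actually succeeded. A minimal sketch (the dates are made up for illustration):

```python
from datetime import datetime, timedelta


def worst_case_rpo(last_successful_backup, now):
    """Everything written since the last *successful* backup is at risk.

    If backups have been silently failing, the real exposure keeps
    growing past the nominal 24-hour interval.
    """
    return now - last_successful_backup


now = datetime(2024, 1, 10, 23, 59)

# Nightly backups that are succeeding: roughly a day of exposure at worst.
healthy = worst_case_rpo(datetime(2024, 1, 10, 0, 0), now)

# Nightly backups that have silently failed since Jan 7: nearly four days.
failing = worst_case_rpo(datetime(2024, 1, 7, 0, 0), now)
```

This is why monitoring backup success matters as much as the schedule itself: an unmonitored nightly backup does not actually give you a 24-hour RPO.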
So, designing a backup solution for your environment means determining what the appropriate RTO and RPO would be for your environment.
Most environments have a mix of data: some very important data that has to be restored quickly (small RTO) and can tolerate no data loss (small RPO), and some data that can take hours or days to restore and can go back a day or two to the last good copy. To further complicate things, many end users and business units can’t easily distinguish which data is more important than the rest, so they are not sure what RTO/RPO to assign to it.
To help sort this out, you need to ask the same questions multiple times in different ways. Basically: how much would it cost us if we lost this data? Is it OK to get the data back from an hour ago, a day ago, a week ago? How would we recreate the data if we couldn’t recover it at all? It may take some research to find these answers. Don’t settle for the easy answer of “we can’t lose any data ever and have to have it instantly restored” (RTO/RPO of 0). That may be the case, but it usually isn’t. Most environments have a small percentage of data with an RTO/RPO of zero, say 5-20%, and some environments have none at all. It is more common to have data that can’t go back more than an hour, or four hours, than data that can’t go back seconds or minutes. If you really need aggressive RTOs or RPOs, that’s fine; it just costs more to implement. And don’t go too far the other way: don’t be complacent with your standard once-a-day or once-a-week backup if you have data that needs more protection.
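One way to keep the answers organized is to map each class of data to a protection tier. The thresholds and strategy names below are purely illustrative, not a standard; the point is that the answer to “what RTO/RPO do we need?” translates directly into how much machinery you have to pay for:

```python
def pick_protection_tier(rpo_hours, rto_hours):
    """Crude, illustrative mapping from objectives to a strategy.

    RPO mostly drives how often you capture data; RTO mostly drives
    how much standby infrastructure you keep ready. Tighter objectives
    mean more expensive machinery.
    """
    if rpo_hours == 0 and rto_hours == 0:
        return "synchronous replication with automatic failover"
    if rpo_hours <= 1:
        return "continuous log shipping or frequent snapshots"
    if rpo_hours <= 24:
        return "nightly backups with offsite copies"
    return "weekly backups, if the business truly accepts the loss window"
```

Running your data classes through a table like this makes the cost conversation with the business concrete: each step up a tier has a price tag attached.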
Whatever solution you have in place, document what you have and what its limitations are. Sure, you were told that you can’t offsite your backups because it would cost too much, but make sure it is documented and understood that if there is a problem with the data center, the backups will be lost and the data will be unrecoverable. It’s best that everyone understands the risks and accepts them in exchange for the cost savings, rather than finding out the hard way when the company goes out of business. I saw a data center taken out by a flooded toilet on the floor above; it happens.
So, why would Russian Roulette backups ever be ok?
I did start out saying that I think Russian Roulette can be an acceptable policy. It boils down to weighing the odds of losing data against the cost of protecting that data. If your data is easy to replace, can afford to be offline for a while, or won’t cause a serious loss if it disappears altogether, then Russian Roulette may be acceptable for you. Ask yourself what happens if the server turns into a molten pile of slag, or if the data center becomes a smoldering hole in the ground. If that destroys your backups as well and you still stay in business, then you are fine. That isn’t the case for very many companies, but maybe it is for yours.
So, go ahead: spin the cylinder and pull the trigger. What’s the worst that could happen?