Category Archives: Backup Stuff

Russian Roulette Backup Policies

Are you playing Russian Roulette with your data? Many people are and may not even realize it. I actually think Russian Roulette is an acceptable policy, as long as you understand the consequences, which most people don’t.

In Russian Roulette, one takes a single bullet, puts it in a revolver, spins the cylinder, puts the gun to his head and pulls the trigger, giving a 1 in n chance of losing, where n is the number of chambers in the revolver (typically 6). It shouldn’t be necessary to explain the consequences.
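To put a number on how quickly the odds stack up, here is a small sketch of the arithmetic. It assumes the cylinder is re-spun before every pull, so each pull is an independent 1 in n chance; the function name is just for illustration.

```python
def chance_of_loss(rounds: int, chambers: int = 6) -> float:
    """Chance of hitting the loaded chamber at least once in `rounds` pulls,
    assuming the cylinder is re-spun before every pull."""
    survive_one_pull = 1 - 1 / chambers
    return 1 - survive_one_pull ** rounds


for rounds in (1, 3, 6, 12):
    print(f"{rounds:>2} pulls: {chance_of_loss(rounds):.0%} chance of losing")
```

With six chambers, six pulls already put the chance of losing at roughly two in three, which is the point: the longer you keep playing, the less the "only 1 in 6" feels like protection.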

However, it often is necessary to explain the consequences when playing Russian Roulette with data. In my view, you are playing Russian Roulette with your data when you have a weak, or untested, or non-existent backup plan. It’s not unusual to run across a company where backups were configured several years ago and no one has touched them since. There is no testing of restores to verify that backups are working, there is no sending of data to an offsite location or maybe just no visibility into how backups are running. No one will know how they are doing until there is a problem and the backups are needed, by which time it is too late.

The phrase Russian Roulette came to my mind over a dozen years ago, when, as a Sys Admin, I took a look at our backups and was a bit freaked out by what I saw. We had dozens of tape drives physically attached to larger servers and no barcodes on any of the tapes. A handful of operators would swap tapes out every morning, manually label the previous night’s tapes and put in new ones. For larger servers, they had to stand by during the backup to swap out tapes. It had evolved from simple, unplanned growth of a system that used to work fine many years before. As soon as I saw how these backups were running, I couldn’t stop seeing the image of Christopher Walken in The Deer Hunter, holding a gun to his head while his captors yelled “Mau”! I told my boss we were playing Russian Roulette and it was just a matter of time before we landed on the loaded chamber and would be unable to recover data. We didn’t actually lose data, but it did take us more than a day to recover our primary database server after an incident, so we lost more money than we would have spent fixing our backups.

The importance of RTO and RPO

Let’s cover a couple of key terms in backup and recovery: RTO and RPO.

RTO stands for Recovery Time Objective and is a measure of how quickly data must be restored. It refers to the maximum acceptable amount of time to get the application back to the state it was in before whatever outage or problem it suffered.

RPO is Recovery Point Objective and is a measure of how much data one can afford to lose, or how far back to go for a restore. If you do nightly backups, you have an RPO of approximately 24 hours.
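As a rough illustration of why nightly backups mean an RPO of about 24 hours, this sketch measures how much work is lost when a failure hits between backups. The dates and the 1 a.m. schedule are made up for the example.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Return the span of work lost if we must restore from the last
    backup taken at or before failure_time (None if there is no such backup)."""
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        return None  # nothing to restore from
    return failure_time - max(earlier)


# Hypothetical schedule: one backup at 1:00 a.m. each night.
nightly = [datetime(2011, 8, day, 1, 0) for day in (1, 2, 3)]
failure = datetime(2011, 8, 3, 23, 30)

print(data_loss_window(nightly, failure))  # 22:30:00
```

A failure just before the next backup loses almost a full day of changes, which is why the interval between backups is effectively the RPO you are living with.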

So, designing a backup solution for your environment means determining what the appropriate RTO and RPO would be for your environment.

Most environments will have a mix of data: some very important data that has to be restored quickly (small RTO) and can tolerate no data loss (small RPO), and some data that can take hours or days to restore and can go back a day or two to the last good copy. To further complicate things, many end-users and business units can’t easily distinguish which data is more important than the rest, hence they are not sure what RTO/RPO to assign to their data.

To help sort this out, you need to ask the same questions multiple times in different ways. Basically, how much would it cost if we had to restore our data? Is it ok to get the data from an hour ago, a day ago, a week ago? How would we recreate the data if we couldn’t recover it at all? It may take some research to find these answers. Don’t settle for the easy answer of ‘we can’t lose any data ever and have to have it instantly restored’ (RTO/RPO of 0). That may be the case, but not usually. Most environments have a small percentage of data with RTO/RPO of zero, say 5-20%; and some environments have no data that fits that. It is more common to have data that can’t go back more than an hour, or 4 hours, than data that can’t go back seconds or minutes. If you really need aggressive RTOs or RPOs, then that’s fine, it just costs more to implement. Don’t go too far the other way and be complacent with your standard, once a day or once a week backup if you feel you have data that needs more protection.
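Once you have answers, it helps to write them down next to what the current backups actually deliver. The sketch below is one way to do that; the dataset names, objectives, backup interval and measured restore time are all hypothetical placeholders for your own numbers.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Dataset:
    name: str
    rpo: timedelta  # how far back the business says a restore may go
    rto: timedelta  # how long the business says a restore may take


def find_gaps(datasets, backup_interval, measured_restore_time):
    """List (dataset, objective) pairs the current backups cannot meet."""
    gaps = []
    for d in datasets:
        if backup_interval > d.rpo:
            gaps.append((d.name, "RPO"))
        if measured_restore_time > d.rto:
            gaps.append((d.name, "RTO"))
    return gaps


# Hypothetical inventory and a once-a-day backup with a 26-hour restore.
inventory = [
    Dataset("orders database", rpo=timedelta(hours=1), rto=timedelta(hours=4)),
    Dataset("file shares", rpo=timedelta(days=1), rto=timedelta(days=2)),
]

for name, objective in find_gaps(inventory, timedelta(days=1), timedelta(hours=26)):
    print(f"{name}: current backups miss the stated {objective}")
```

The restore time has to come from an actual test restore; a guess there defeats the purpose.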

Whatever solution you have in place, make sure you document what you have and what the limitations are of the current solution. Sure, you were told that you can’t offsite your backups because it would cost too much, but make sure it is documented and understood that if there is a problem with the data center, backups will be lost and data will be unrecoverable. Best that everyone understands the risks and accepts them in exchange for the cost savings, rather than finding out that the company is now out of business. I saw a data center taken out by a flooded toilet on a floor above; it happens.

So, why would Russian Roulette backups ever be ok?

I did start out saying that I think Russian Roulette can be an acceptable policy. It boils down to understanding the odds of losing data versus the cost of protecting that data. If you have data that is easy to replace or can afford to be offline for a while, or you won’t suffer a serious loss if it disappears altogether, then Russian Roulette may be acceptable for you. Ask yourself what happens if the server turns into a molten pile of slag or if the data center becomes a smoldering hole in the ground. If that destroys your backups as well and you still stay in business, then you are fine. That isn’t the case for very many companies, but maybe it is for yours.

So, go ahead, spin the cylinder and pull the trigger. What’s the worst that could happen?
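The odds-versus-cost comparison can be sketched as simple arithmetic. Every figure below is invented for illustration, and the rule itself (never gamble on a loss you could not survive, otherwise compare expected loss with the cost of protection) is just one reasonable way to frame it.

```python
def expected_annual_loss(chance_per_year, cost_if_lost):
    """Average yearly cost of doing nothing: odds of loss times its price."""
    return chance_per_year * cost_if_lost


def roulette_is_acceptable(chance_per_year, cost_if_lost,
                           yearly_protection_cost, largest_survivable_loss):
    # A loss that puts you out of business is never worth the gamble,
    # however unlikely it is.
    if cost_if_lost > largest_survivable_loss:
        return False
    # Otherwise, skipping protection only makes sense if it is cheaper on average.
    return expected_annual_loss(chance_per_year, cost_if_lost) < yearly_protection_cost


# Hypothetical numbers: a 5% yearly chance of losing the data.
print(roulette_is_acceptable(0.05, 2_000, 1_500, 50_000))    # True
print(roulette_is_acceptable(0.05, 500_000, 1_500, 50_000))  # False
```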

Do CIOs Get Backups?

Sitting in a press room listening to a panel at an industry event reaffirmed my skeptical view of management. I dunno, I guess I don’t really expect or want a CIO to know the nitty gritty about how their stuff is backed up, but I didn’t appreciate the responses that any of them made to the question ‘how are you backing up your virtual infrastructure?’.

Each of the CIOs from the keynote session gave weak answers. They would have been better off saying that they have not had a loss of data due to any failed restores rather than just making up stuff.

The first response was given with an accompanying look of confusion as to why the question would even be asked, ‘why, we merely replicate all our VMs, so there is no problem with restoring’. Ouch, where to start with that one. So, you can successfully restore your corrupt or deleted files that successfully replicated to your other site? Maybe there is more involved, maybe he meant to say that replication, on its own, is not a backup, so we use a CDP (continuous data protection) solution to capture all the changes and replicate those offsite. Or, we perform some form of snapshot and replicate that offsite. Without some form of rollback capability, replication may fail to restore you to the state you require. Replication, on its own, is not a backup solution, since any corruption or accidental deletion would simply get replicated.  As well, snapshots, on their own, are not a backup solution, since a loss of the appliance will mean a loss of the snapshots, so replicating those snapshots to another appliance is necessary.
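A toy model makes the difference concrete. This is not how any particular replication product works; the `Site` class, the file name and the contents are all invented to show the behavior described above.

```python
import copy


class Site:
    """A storage location holding live files plus point-in-time snapshots."""

    def __init__(self):
        self.files = {}
        self.snapshots = []

    def take_snapshot(self):
        self.snapshots.append(copy.deepcopy(self.files))


def replicate(source, target, include_snapshots=False):
    """Make target look like source, good data and bad data alike."""
    target.files = copy.deepcopy(source.files)
    if include_snapshots:
        target.snapshots = copy.deepcopy(source.snapshots)


primary, remote = Site(), Site()
primary.files["budget.xls"] = "good data"
primary.take_snapshot()
replicate(primary, remote, include_snapshots=True)

# Corruption strikes, and the next replication cycle dutifully copies it.
primary.files["budget.xls"] = "corrupted"
replicate(primary, remote, include_snapshots=True)

print(remote.files["budget.xls"])         # corrupted
print(remote.snapshots[0]["budget.xls"])  # good data
```

The live replica is only as good as the primary was a moment ago. It is the replicated snapshot that gives you something to roll back to, and it survives the loss of the primary appliance.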

The other response that got my hackles up was, ‘the beautiful thing about VMs is that they are ultimately just a single file, so it is really easy to just back that file up’. So, all of you who are experiencing challenges backing up your virtual environments must be missing out on that fact. They’re just files! They back up super easy! Right? Who says CIOs don’t have a sense of humor? If it were that easy, there would be no third party solutions needed and no agents or APIs for backup products, we would just perform regular file level backups with whatever backup product we have on hand.

I don’t recall the other responses, but they were equally useless. Again, silly to expect a CIO to know how things are done, but I thought that was why they were chosen for the panel, so they could explain how they are doing things with their virtual environments. You could look at it this way, they don’t know because there hasn’t been a problem restoring, so they haven’t needed to dwell on the how. Sure, let’s go with that.

Easy Hard Drive Upgrade with Mac

I’ve been limping along with my little 128 GB SSD drive on my MacBook Pro.  (That statement still seems odd to me considering that my first hard drive upgrade, ages ago, was to a 30 MB hard drive that my sister told me I’d never be able to fill.)  My kind boss felt sorry for me and sprang for a new, roomy drive.  I received my new 500 GB SSD drive and went straight to swapping hard drives, which was incredibly easy.  The steps I took were:  1. made sure my backups were current, 2. swapped hard drives, 3. restored my OS using Time Machine, 4. reinstalled my Windows Boot Camp using Winclone and then 5. checked to make sure everything was working.

Step 1: Ensure Backups are Current

This could be as simple on a Mac as clicking on the Time Machine symbol on your menu bar:

I have an external drive plugged in, so backups are updated every hour.  I verified that mine was only a few minutes old, then checked on my external drive for a current Winclone backup of Windows running on Boot Camp and then verified that I had an up-to-date backup with CrashPlan, ’cause I’m just that paranoid.
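If you would rather script that check than eyeball the menu bar, here is a minimal sketch. It assumes Time Machine's backup folders are named with a YYYY-MM-DD-HHMMSS timestamp, and that you get the path of the newest one yourself, for example from `tmutil latestbackup` on macOS versions that ship it. The example path and the one-hour threshold are placeholders.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath


def backup_age(backup_path, now):
    """Age of a backup, read from the timestamp in its folder name.
    Assumes the last path component looks like 2011-08-15-093000."""
    stamp = PurePosixPath(backup_path).name
    taken = datetime.strptime(stamp, "%Y-%m-%d-%H%M%S")
    return now - taken


def is_current(backup_path, now, max_age=timedelta(hours=1)):
    return backup_age(backup_path, now) <= max_age


# Hypothetical path of the kind `tmutil latestbackup` prints.
example = "/Volumes/Backups/Backups.backupdb/MyMac/2011-08-15-093000"
print(is_current(example, now=datetime(2011, 8, 15, 9, 42)))  # True
```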

Step 2: Swap Hard Drives

I use ifixit.com for things like this, they have nice step-by-step instructions, complete with pictures.  For my laptop, I used this guide, http://www.fixit.com.

The oddest thing I ran across was the bizarre screws (Tri-wing Y1 screws) used to attach the battery; I’ve never seen them anywhere before.  Even my cool little screwdriver with combo Phillips #00 and T6 Torx bits doesn’t have that.

Moving slowly and being careful not to touch the motherboard (since I couldn’t remove the battery), it still only took me about 10 minutes to swap the drives.

Step 3: Restore OS with Time Machine

This part is super easy, only made difficult by the fact that we have 4 different MacBook Pros in our office and, apparently, they each have their own distinct install disc.  Yes, this could have been made a non-issue by taking 10 seconds to use a Sharpie and write some comments on each disc.  Since we didn’t do that, I grabbed what I thought was the most logical disc to use.  Silly me, thinking I could just use a Snow Leopard disc.

I did learn a new way to eject discs; when Command-E and the eject button don’t work, you can push the track-pad button while rebooting.  Turns out that rather than saying “this is the wrong disc, idiot”, Mac simply decides to keep the disc and not allow normal eject methods to work.  Kind of like what a London bank ATM did with my bank card after one failed PIN attempt.  Nice.

So, after finding my specific Mac OS X install disc, I did a boot to the disc, chose ‘Restore System from Backup’ from the Utilities menu, selected the backup volume and the Time Machine backup I wished to restore to (most recent).

It says 2 and a half hours in the picture above, but it turned out to only take about an hour and a half.  While this churned away, I got started on an all day patch extravaganza of our new HP Windows 7 laptop (for the new sales guy).

Step 4: Restore Boot Camp Partition with Winclone

I would have been done at this point, except that my Windows install uses Boot Camp.  I’ve been a fan of Winclone for a couple of years now, though it turns out they stopped development and decommissioned their website!  I understand someone has released an unofficial copy of Winclone 2.3, which is supposed to support Lion (I believe you can just edit the appropriate plist file to ignore the version check if you want to continue using Winclone 2.2).  See my side rant below.

To restore Windows, I opened the Boot Camp Assistant, told it I have the Mac OS X install disc, then chose Create or remove a Windows partition, and then set the partition size to 40 GB.  I did not format the drive, as Winclone will do this for me.

I’ve used Winclone before when increasing the size of my Boot Camp partition.  I simply blew away the partition with Boot Camp Assistant, increased the size, and then restored using Winclone.  I did the same thing here with my hard drive swap.  To perform the restore, you choose the restore tab, pick the backup file to restore from and choose the Boot Camp partition as the destination, unless you really want to repeat step 3 above.


Side Rant

I must have been snoozing or something, because I had no idea the developers of Winclone over at twocanoes.com were in trouble or were even considering abandoning Winclone, which they did sometime in 2010.  However sad that may be, why in the world did they just dump it?  Seems like it would have been just as easy for them to move it to SourceForge and turn it over to someone else.  I’m not aware of any program that replaces it, so it’s not like they gave up due to the competition.  So, my rant?  If you’re going to stop developing a free product that lots of people use, at least put it up on SourceForge and ask someone to take it over.  For me, moving forward, I figure it’s just as easy to simply dd the partition to an external drive.  That could even be scripted and run from cron.

Hmm, if only there were some site that would discuss how to do things like that, such as  backupcentral.com (this link points to an online version of the Linux and Windows Bare Metal Restore chapter I wrote in W. Curtis Preston’s book, Backup & Recovery).
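For the curious, a dd-from-cron setup might look something like the sketch below. Treat everything specific in it as an assumption: the device name `/dev/disk0s3`, the destination folder, the image file name and the script path in the cron line are all placeholders, and `bs=1m` is the block-size spelling used by the BSD dd on a Mac (GNU dd wants `1M`). The partition should not be in use while it is being imaged, and dd needs root to read the raw device.

```python
import shlex
from datetime import date


def build_dd_command(source_device, dest_dir, today):
    """Build (but do not run) a dd command that images a partition
    to a dated file on an external drive."""
    image = f"{dest_dir}/bootcamp-{today.isoformat()}.img"
    return ["dd", f"if={source_device}", f"of={image}", "bs=1m"]


# Hypothetical device and destination; check yours with `diskutil list` first.
cmd = build_dd_command("/dev/disk0s3", "/Volumes/External", date.today())
print(shlex.join(cmd))

# To actually run it (as root), something like:
#   import subprocess
#   subprocess.run(cmd, check=True)
#
# A crontab entry to run the script at 2 a.m. every night might look like:
#   0 2 * * * /usr/bin/python3 /path/to/image_bootcamp.py
```

Printing the command first is deliberate; with dd, swapping `if` and `of` is an excellent way to need your other backups.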

Step 5: Verify Everything

It should be pretty obvious if the restore succeeded or not.  If things are all wonky and restoring to a different backup in Time Machine doesn’t work, then you will be stuck with a re-install of the OS, followed by restoring data from a backup (maybe the file restores will work from Time Machine or maybe you have another product grabbing files, like I do with CrashPlan).
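If "pretty obvious" isn't reassuring enough, and the old drive is still readable (say, in an external enclosure), you can compare a folder on it against the restored copy. This is a minimal sketch that hashes every regular file under two folders and reports what is missing or different. It does not handle broken symlinks or files you lack permission to read, and the paths in the usage comment are placeholders.

```python
import hashlib
import os


def file_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MB chunks so large files don't fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            digests[os.path.relpath(path, root)] = digest.hexdigest()
    return digests


def compare_trees(original, restored):
    """Return (missing, changed): files absent from the restored tree,
    and files present in both whose contents differ."""
    before, after = file_digests(original), file_digests(restored)
    missing = sorted(set(before) - set(after))
    changed = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return missing, changed


# Example use, with placeholder paths:
#   missing, changed = compare_trees("/Volumes/OldDrive/Users/me/Documents",
#                                    "/Users/me/Documents")
```

Files that changed after the last backup will show up as differences, so expect a little noise on anything you touched that day.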

For me, everything was working fine, except for a couple of things, one of which is minor, the other turned out to be a non-issue.  One was Microsoft Office and the other was my Parallels version of Windows.

Microsoft Office had two issues: a database index that it said it needed to rebuild, and a prompt requiring me to re-enter my product key.  Both of these were pretty minor, though still unexpected.

The other issue was Parallels.  I use Parallels to run my Boot Camp version of Windows as a virtual machine while running Mac OS.  However, after my recovery, there was no Parallels applications folder in my dock and the pvm file needed to start up the virtual machine was missing.  There is an option within Parallels that says “Don’t backup with Time Machine” and I’m pretty sure I checked that way back when I set this up, since I’m using Winclone.  However, I’m not sure of that, and it’s not the default.  So, I simply went to CrashPlan to restore the pvm file, but discovered that it was 30+GB, which didn’t make sense, since it should just be a small file around 25 MB pointing to the Boot Camp location.  All I did was start up Parallels, choose “New” and point it to the Boot Camp partition, then I was back in business.

The nice thing about testing your restores with a hard drive swap is the built-in fallback should it have failed.  I could have gone back to my puny little 128 GB drive.  Good thing I didn’t have to do that!  Who lives with such small drives these days? ;)

Microsoft System Restore Failure Leaves Me Hanging

So, I’m minding my own business, just getting ready for another seminar, setting up our one and only Windows laptop to our projector when PowerPoint decides to crap out.  Reboots didn’t help and time was running short.  No problem, right?  Just roll back to a previous restore point and I should be fine.  Well, that didn’t turn out to be true.  I’m actually not that torqued that rolling back didn’t fix anything, I’m torqued that the rollback failed and then chose to wipe out all my other restore points!

I chose a restore point that was just before the last Microsoft update, thinking that may have been the culprit and knowing that everything worked fine the last time I used the laptop.  However, the system restore failed with an “unspecified error” and told me I should disable my antivirus and try again.  It then informed me it would have to undo the attempted restore.  Ok, sure, why not?  Here’s the fun part; the system restore got to the end of the rollback, failed and spit out an error that said something to the effect of “unable to rollback from system restore since the system restore did not complete correctly”.  That’s genius!  So, I have to roll back due to a failed restore, and simultaneously can’t roll back due to the same failed restore.  Oh, and just for good measure, we’ll destroy all the other restore points that were there so that you can’t use any of those or attempt to do this again.  Nice.

If you were wondering, the antivirus I run on this laptop is Microsoft Security Essentials, which apparently Microsoft System Restore doesn’t know how to interact with.

End of the world?  No, we proceeded with our broken PowerPoint and showed our video clips outside of PowerPoint and lived without answers to all of our audience response slides.  Once back home, I was able to uninstall and reinstall Office 2007 and get things working again (no, repair didn’t work).  I also had an image I could have used to perform a bare metal restore, though I failed to bring it along on my external drive.  It wouldn’t have mattered as I didn’t have an hour or two to wait for a bare metal restore.  I have, however, managed to get the audience response system (the only reason we need to run PowerPoint on a Windows computer) working on Windows running on Parallels on two of our Macs, so I have multiple redundant systems moving forward.

Backing Up With CrashPlan

We moved into our new office in February of 2011.  One of the first things we did was to get our lab back up and running (well, after installing the new fridge to cool my soda).  Having heard good things about CrashPlan, we decided to give it a test run.  The short version of the story is: it’s easy to install, administer, restore from and even easier to migrate to new hardware than I originally anticipated.

Environment

Just a quick review of our environment. We are backing up 2 MacBook Pro laptops and 1 iMac using the free version of CrashPlan.  We are backing up to a Windows 7 box in our lab with a 16TB Drobo attached to it.  Yes, we could have backed up each laptop to the CrashPlan site and not had to point it to our own server and storage, but where’s the fun in that?  We may add that functionality in the future.

This is not intended as an in-depth review or detailed walk-through of the product.  I was actually impressed with the ease of migrating to a new server and decided to share my experience with the product.

Installation and Use

Installation on all 4 machines (the Macs and the backup server) took only minutes and was a simple process.  The initial backup to our local server took the better part of a few days to complete, mostly because one of us actually has some 200+ GB of stuff on his 500 GB drive (I can “name fingers and point names”, but we’ll just say his name rhymes with Wurtis).

Restoring a file was as simple as walking through my filesystem in the CrashPlan GUI and selecting the file to restore, which restores the most recent copy to the Desktop by default, but it is easy to click and change which version and where to restore.  I even did a restore while the backup was still running.

I’ll note here that administration with this version is rather minimal, as it was designed for personal backups; CrashPlan Pro is really where you would go to manage backups for multiple computers.  With the version we have, I can see the status of all 3 client backups only if I log into CrashPlan on the backup server, and only get notifications for backups of my own laptop.

Migrating to New Hardware

Ok, so this would not have been a problem I would have to worry about if I had simply enabled the offsite feature and backed up to the CrashPlan site.  However, there would not have been as much to test or play with had we gone that route.

The problem was our backup server which started to flake out on us.  Since it is running on hardware that wasn’t exactly the newest and fastest when we bought it some 3 or 4 years ago, I didn’t even bother trying to spend much time troubleshooting and fixing. I just bought a new box from Fry’s.

Once I had gone through the 5,000 or so Windows updates/patches, I installed CrashPlan on the new server, shut down CrashPlan on the old server and moved the Drobo to the new server.  I was anticipating re-configuring each of the 3 clients to point to the new server and having them re-sync themselves.  I knew they wouldn’t need to start over on the backups since we essentially have the backups seeded on the Drobo.  I was pleasantly surprised to learn that it was even easier than that.  Once I started CrashPlan on the new server, and signed in with the same account as the old server, it showed me both computers the account was assigned to and asked me if I wanted it to take over (adopt) backups for the other computer.  Why yes I do, thank you very much!  That was all it took.  Backups resumed for all 3 clients with no additional configuration needed.

Summary

As I said, easy to install, easy to restore from and even easy to move to new hardware if you’re using the onsite option.