How do you test your backups?

CraigN · 22 Mar 2010 at 17:16

I'm looking at an offsite DR setup at the moment, and I was asked to provide info on our current restore times.

At present I have a mixed setup of VMware and some physical pizzabox servers. The VMware runs on HP BL460c blades using an EVA 4400. The pizzaboxes are a mix of DL360s and DL380s, both ranging from G3 to G6.

I have given a rough estimate of 3-4 weeks to restore from LTO3 (physical server backups) and LTO4 (VM server backups). At present I back up all servers, even VMs, using the agent method.

The FD's face went white when I mentioned a month to restore just the critical servers, so I'm looking at some form of replication.

However, I have now been asked to test this by restoring servers from my tapes.

I've been thinking I'm going to need some spare servers and another VMware environment, along with another tape drive. Has anyone else done this, and if so, how?

Of course, the lowest-priced solution will be best received by the man with the purse strings.

bigredshark · 22 Mar 2010 at 18:16

3 to 4 weeks? What are you planning on doing? I could build a new infrastructure from scratch for an SME in that kind of timescale - are you including sourcing new servers in that or something?

iaind · 22 Mar 2010 at 18:35

3 to 4 weeks is ludicrous - when I was on tape with traditional DR mechanisms (a SunGard contract), we were looking at 2 days MAX. Now that we're using replication, we're at about 4 hours.

CraigN · 22 Mar 2010 at 18:38

That is just the thing: I've never tested my backups, so I don't know how long a restore takes.

We have a massive amount of data to restore, about 52 TB on our SAN at the moment.

And yes, that estimate includes sourcing all new hardware for our spare site, recalling tapes, rebuilding or restoring our infrastructure, and then restoring individual servers and data.

I'm looking at 72 servers in total, including 3 large SQL boxes and 2 massive file store servers that our document management system is on.

So my question is: do you guys test your backups, and if so, how? I really want to get it done so I can give an accurate timescale.

iaind · 22 Mar 2010 at 18:45

You shouldn't be relying on buying new kit for quick DR - you either need standby kit in your DR location or a DR contract with the necessary gear on it.

Both will cut a week of procurement out of your time and mean you can be doing regular DR tests, which you should be doing anyway.

I do 2 a year: one data recovery test and one full scenario-based DR test.

CraigN · 22 Mar 2010 at 18:52

This is the other part of the issue.

I was very conservative with my estimates, as I want them to shell out for a replicated solution.

We were going for another C7000 with another EVA4400, Cisco IP distance gateways and SAN-to-SAN block-level replication.

They put a stop to that after we did our main site, due to cost. So I'm looking at an iSCSI SAN and some large pizza boxes at the moment.

But I still need an accurate restore time. As things stand, I would have to buy new hardware and restore everything from tape, and I'm not sure those tapes will work because I've never restored from them.

We have had quotes for a hosted DR solution, but I used to work for a large outsourcing firm and I would prefer to keep it in-house.

Could you give a bit more detail on how you do your DR testing?

Little_Crow · 22 Mar 2010 at 20:24

You're walking a thin line. You are correct to be conservative with your estimates (the Scotty approach), but push it too far and your bosses may question your competence.

We are doing a DR project at the moment: replicated EVA6400s and VMware SRM.

You could look at doing full VM image backups for a start; something like vRanger is good and pretty cheap. This could reduce your VM restore times dramatically.
There is also a component for replication of VMs, which I have used and which seems very effective.

You need to have a regular plan to check your backups. You don't need to restore your whole estate, but picking a random file/folder from a server is a reasonable check to confirm you don't have a tape full of nothing.
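A minimal sketch of that kind of spot check in Python, assuming the sampled files have been restored from tape into a scratch directory that mirrors the original folder layout. The function names and the scratch-directory approach are illustrative assumptions, not part of any backup product. Note that a file which has legitimately changed since the backup ran will also show up as a mismatch, so it is best pointed at data that rarely changes.

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pick_sample(source_root: Path, count: int, seed=None) -> list[Path]:
    """Pick up to `count` random files, as paths relative to source_root."""
    files = sorted(
        p.relative_to(source_root) for p in source_root.rglob("*") if p.is_file()
    )
    rng = random.Random(seed)
    return rng.sample(files, min(count, len(files)))


def verify_restore(source_root: Path, restore_root: Path, sample: list[Path]) -> dict[str, str]:
    """Compare each sampled file with its restored copy.

    Returns a status per relative path: "ok", "missing" or "mismatch".
    """
    results = {}
    for rel in sample:
        restored = restore_root / rel
        if not restored.is_file():
            results[str(rel)] = "missing"
        elif sha256_of(restored) != sha256_of(source_root / rel):
            results[str(rel)] = "mismatch"
        else:
            results[str(rel)] = "ok"
    return results


# Hypothetical usage (both paths are placeholders):
#   sample = pick_sample(Path("/srv/fileserver"), count=20)
#   ...restore exactly those files from tape into /restore-test...
#   print(verify_restore(Path("/srv/fileserver"), Path("/restore-test"), sample))
```

Picking the sample first and then asking the backup software for exactly those files keeps the test small while still proving the tape can be read.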

To get an estimate for each server you just need to try it. Restoring lots of small files will take a darn sight longer than restoring large files, so it's impossible for anyone to guess, even if they know all about your environment.

Just kick off a restore of everything on a server, wait an hour, and extrapolate from there.
Add in the time taken to rebuild a server from scratch (presuming you have no bare-metal restore) and add in a margin for safety.
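As a rough sketch of that extrapolation in Python: the function name, the 25% default margin and the example figures are all made up for illustration, and the calculation assumes the restore rate seen in the first hour holds for the rest of the job, which it may not if the mix of small and large files changes.

```python
def estimate_restore_hours(total_gb, restored_gb, elapsed_hours,
                           rebuild_hours=0.0, safety_margin=0.25):
    """Extrapolate a full restore time from a partial restore.

    total_gb       size of everything on the server
    restored_gb    how much came back during the sample period
    elapsed_hours  length of the sample period
    rebuild_hours  time to rebuild the server before restoring data
    safety_margin  fraction added on top (0.25 means +25%)
    """
    if restored_gb <= 0 or elapsed_hours <= 0:
        raise ValueError("need a non-zero sample to extrapolate from")
    rate_gb_per_hour = restored_gb / elapsed_hours
    restore_hours = total_gb / rate_gb_per_hour
    return (restore_hours + rebuild_hours) * (1 + safety_margin)


# Hypothetical example: 800 GB server, 100 GB restored in the first hour,
# 4 hours to rebuild the OS, 25% margin.
print(estimate_restore_hours(800, 100, 1, rebuild_hours=4))
```

With those example figures the restore itself extrapolates to 8 hours, and the rebuild plus margin brings the estimate to 15 hours.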

Bear in mind that Windows server builds are pretty easy and could be carried out by someone with a little technical knowledge (provided you have server names, IPs, etc. to give them), so you can get others to assist, freeing you for tasks only you can do.

bigredshark · 22 Mar 2010 at 20:48

In terms of testing, we don't really do DR, everything essential has multiple sites in an active/active design (there are exceptions, I'm no fan of active/active MS SQL so that runs active/passive for instance).

For frontline business systems we frequently fail over between our sites for maintenance, software deployments and the like anyway, which handily verifies it works. The active/passive systems get tested less frequently but still every 6 to 8 weeks or so.

For backups we use snapshots on the SANs as our principal rollback for most things (again, with exceptions where something like shadow copies is a better option). We don't keep anything but essential data on tape (LTO4 with Iron Mountain), as to lose it would mean the simultaneous loss of 4 datacenters spread across 3 countries at least. What we do have on tape we verify before handing it over, and every six months we request a random set back and verify it can be restored to the SAN.

But what you need to do is...

- Liaise with the business and draw up a priority list of which systems need to be restored first (accounts matter, contact systems matter, customer details matter... marketing and HR can wait - that sort of thing)

- Separately, liaise with the business to get a list of the restore times which are considered acceptable for each system (how soon do you need email back online?)

- Work out how quickly you can restore things currently. Now you've got a picture of your current situation and where you want to be, work out a path between the two (that list of priorities comes in here) and present costed proposals...
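The three steps above boil down to a table of systems with a priority, an acceptable restore time and a measured (or estimated) restore time, from which the gaps fall out. A minimal Python sketch, where the class, the field names and every figure in the example are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    priority: int           # 1 = restore first
    target_hours: float     # restore time the business says is acceptable
    estimated_hours: float  # what a test restore suggests today


def restore_gaps(systems):
    """Return (name, hours over target) for systems missing their target,
    ordered by business priority."""
    ordered = sorted(systems, key=lambda s: s.priority)
    return [
        (s.name, s.estimated_hours - s.target_hours)
        for s in ordered
        if s.estimated_hours > s.target_hours
    ]


# Made-up example figures, not taken from anyone's environment.
estate = [
    System("marketing share", priority=3, target_hours=120, estimated_hours=96),
    System("email", priority=1, target_hours=8, estimated_hours=30),
    System("accounts", priority=1, target_hours=24, estimated_hours=40),
    System("document store", priority=2, target_hours=48, estimated_hours=48),
]

for name, gap in restore_gaps(estate):
    print(f"{name}: {gap:.0f} hours over target")
```

The systems that appear in the output, highest priority first, are the ones the costed proposals need to address.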

Given 50TB of data (not a huge amount by enterprise standards) and (you didn't say, but I'll assume) 25-30 servers, I'd be aiming for essential services restored within 48 hours, with the rest coming back online over a week or so (excluding hardware procurement). That's a bare minimum though; serious DR would be a bit faster, but two days is a good target, as an SME should survive that long without essential systems but not much longer.
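For a sense of the floor on tape restore times, here is a back-of-the-envelope Python sketch based on the published native (uncompressed) LTO transfer rates of 80 MB/s for LTO-3 and 120 MB/s for LTO-4. It assumes the drive streams at full speed the whole time with no tape changes, seeks or small-file overhead, so real restores will be slower. The function name and the decimal TB/MB units are choices made for this example.

```python
# Native (uncompressed) streaming rates in MB/s.
LTO3_MB_PER_S = 80
LTO4_MB_PER_S = 120


def streaming_hours(data_tb, mb_per_s, drives=1):
    """Best-case hours to stream data_tb terabytes off tape.

    Uses decimal units (1 TB = 1,000,000 MB) and assumes the load
    splits evenly across `drives` drives running in parallel.
    """
    if drives < 1:
        raise ValueError("need at least one drive")
    seconds = (data_tb * 1_000_000) / (mb_per_s * drives)
    return seconds / 3600


for drives in (1, 2, 4):
    hours = streaming_hours(50, LTO4_MB_PER_S, drives)
    print(f"50 TB over {drives} LTO-4 drive(s): about {hours:.0f} hours")
```

On those assumptions a single LTO-4 drive needs roughly 116 hours, close to five days, just to read 50 TB, which is why the number of drives and the restore order matter so much for a 48-hour target.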

Ace T · 22 Mar 2010 at 21:18

For us, we are now using Veeam, and since they've announced SureBackup, all VMs will be able to be tested internally with the application without any user intervention... very clever technology. Our physicals are converted to virtuals, so it will be very easy to test the overall datacenter with everything being virtual or having a virtual copy.

knowlesy · 22 Mar 2010 at 21:22

Sorry to thread hijack here, but:

This is the type of thing I'm looking to learn in a lot more depth. Can any of you guys point me to any sites or books to get me onto this stepping stone of VMware, as well as backups, SANs, etc.?

CraigN · 23 Mar 2010 at 09:25

Bigredshark and Little_Crow, very helpful, thanks.

We already have vRanger; we purchased it when we went through our virtualization project. However, what we didn't scope for was the massive amount of storage we would need for vRanger to back up to before taking it to tape. Being new to VMware at the time, I did not have the experience needed for troubleshooting VCB-based issues, and Backup Exec is something I know. I have since got vRanger working, but I simply don't have the available storage to use it. I am actually in Reading tomorrow at Symantec looking at BE 2010, because now that vSphere has got rid of VCB I want to see how they will do it going forward.

A regular check of my backups is what I want. My problem is that I cannot restore my backups to the live environment, so I'm currently looking at buying a pizzabox with a couple of quads and a load of RAM to have a single test environment to test my restores on.

I also use LTO4s. I do full backups on Friday/the weekend and diffs every day, and I send tapes offsite to Iron Mountain every day.

Bigredshark - I have done this. We already have what we call a zero to four group (0-4), and then other groups after that, with business-critical boxes first.

Our environment is strange: we actually have over double that number of boxes, 72 at last count (including VMs and physicals).
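With a weekly full plus daily differentials, a restore to any given day needs the last full on or before that day plus only the most recent differential taken after that full, since each differential holds everything changed since the full. A small Python sketch for working out which backups to recall from offsite; the function name and the example schedule are illustrative assumptions:

```python
from datetime import date


def backups_needed(jobs, restore_point):
    """Return the backups needed to restore to `restore_point`.

    jobs is a list of (date, kind) tuples, kind being "full" or "diff".
    The result is the latest full on or before the restore point,
    followed by the latest differential taken after that full, if any.
    """
    eligible = sorted(j for j in jobs if j[0] <= restore_point)
    fulls = [j for j in eligible if j[1] == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the restore point")
    last_full = fulls[-1]
    diffs = [j for j in eligible if j[1] == "diff" and j[0] > last_full[0]]
    return [last_full] + diffs[-1:]


# Hypothetical schedule: fulls on Fridays, diffs on the days in between.
schedule = [
    (date(2010, 3, 12), "full"),
    (date(2010, 3, 15), "diff"),
    (date(2010, 3, 16), "diff"),
    (date(2010, 3, 19), "full"),
    (date(2010, 3, 22), "diff"),
    (date(2010, 3, 23), "diff"),
]

for day, kind in backups_needed(schedule, date(2010, 3, 23)):
    print(day.isoformat(), kind)
```

For a restore point of 23 March in that example schedule, only the 19 March full and the 23 March differential need recalling.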

I'm currently looking at 3 options:

Rolls Royce Option: Replication of the current blade + SAN environment in a second site with block-level SAN replication. This is my preferred solution, with very quick recovery should we lose a site. The downside is massive cost: more storage at our end, new blades and SAN, plus expensive Cisco Fibre to IP distance gateways.

Second Option: 2 or 3 large pizzabox servers with an iSCSI SAN. I have been looking at LeftHand Networks HP and also Dell kit. It would, however, require software replication like Double-Take or something similar, which I have tested to a degree, and would also require me to upgrade my WAN lines to cope. Roughly half the cost of option 1.

Third Option: Hosted solution. We are a small company and I'm hands-on; I don't like the idea of these things being taken out of my control, so I'm already against this. It also comes in at £10k over Option 2, would also need a WAN line upgrade, and if you invoke they charge you £10k per month of use.

All of these options would meet the criterion of having the business up and running within 4 hours of a total loss, which is the aim.

But this has gone slightly off track. I merely wondered how people test their backups, be it in a test environment, on the live environment, etc.

Thanks for all input.

Thanks for all input

Nikumba · 23 Mar 2010 at 11:51

We have approx 50 Windows servers and 6 IBM AS/400 servers.

The IBMs are backed up to LTO3 tapes and the Windows servers are backed up to our new LTO4 system.

We use Backup Exec 12.5 for the software, and all Windows servers are backed up with the agent process. We also have a number of ESXi servers, and we back those up as VM images.

We are in the process of getting a second LeftHand so we can snapshot out the current LeftHand that hosts the VM servers.

In the event of a full DR we have a contract with ICM, and we go to their site to do the DR. We test twice a year, both times as a full DR. We have a router on their site on our MPLS cloud, so we can connect to the DR site from our training room.

We have not done a DR test yet with our new VM environment, but our current plan is to have all major servers restored in approx 2 days; the IBMs are done within a day. It would take about a week to bring the other servers online, but for us the DC, email, file server, print server and IBMs are the major systems we need to keep trading as a business.

Kimbie

CraigN · 23 Mar 2010 at 13:20

This is uncanny.

I currently use ICM/Servo/Phoenix for our hardware support (whatever they're called now; they keep changing names and I'm not sure who owns what).

The contracting arm of Servo was who we went to for our VM supply and setup.

However, their offsite stuff seems very expensive. Hmmm.

khrall · 26 Mar 2010 at 16:56

CraigN said:
This is uncanny.

I currently use ICM/Servo/Phoenix for our hardware support (whatever they're called now; they keep changing names and I'm not sure who owns what).

The contracting arm of Servo was who we went to for our VM supply and setup.

However, their offsite stuff seems very expensive. Hmmm.

Slightly OT but:

Phoenix acquired ICM, as well as Servo, NDR, Trend and others.

The DR/Business Continuity side of the business goes under the name of ICM Continuity Services (ICM Business Continuity Services + NDR).

The Maintenance side is Servo (Servo + ICM Managed Availability Services).

We also have a hell of a time when turning up to different sites, introducing ourselves differently depending on the contract/subcontract.