British Airways - Massive IT Disruption Worldwide

Soldato
Joined
25 Jun 2011
Posts
5,468
Location
Yorkshire and proud of it!
How can one power supply issue ground the entire fleet of BA? This is gross ineptitude.

I can offer an example. Please note two things: first, this is only an illustration of how one power supply issue could cause this, not a claim that it is what actually happened. Secondly, this is an explanation, not an excuse.

So, big companies with critical IT systems run multiple datacentres in case of disaster - including power supply failure. However, companies are usually pretty terrified of actually testing complete data centre failures. I know - I once had to have a big argument with a director on the necessity of doing this. They are risk averse. I had to show them that the small, controlled risk of causing an outage during a test was preferable to never testing it and having it fail under circumstances out of our control. In any case, whether regularly tested or not, failover of very large systems such as BA's flight management - handling bookings, boarding passes, cancellations, et al. - is a complex task.

There are reports of things going wrong before the complete collapse - people seeing wrong destinations come up, missing flights and similar. What this sounds like to me is a partial failover. For example, the other centre / centres were not properly synced: the system failed over and either data was missing because it hadn't propagated yet, or it failed over and then the original centre came back online, systems transferred back to that one, and THAT one was no longer current. Or possibly it was running from both intermittently / simultaneously.

One way of transferring between datacentres is to update firewalls / load-balancers to direct traffic to the other centre, and this can take a few moments to propagate. I'm not sure what database systems BA use (probably Oracle), but if you end up in a scenario where you have two MASTER databases (in practice, BA will have something more complicated than just two) and they get conflicting data in them, it can be a nightmare to disentangle. I know - one job I had was to repair a situation where two databases that were supposed to be in sync had diverged. Note, diverged is different from one merely being behind, where you just replay the transactions to catch it up.
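Purely as an illustration of that "bouncing between centres" failure mode - everything below is hypothetical, with made-up endpoint names and nothing to do with BA's actual kit - this is roughly what a naive health-check-based failover looks like in Python, and why it flaps the moment the original centre flickers back online:

import time
import urllib.request

# Hypothetical health endpoints - invented for illustration only.
PRIMARY_DC = "https://dc1.example.internal/health"
SECONDARY_DC = "https://dc2.example.internal/health"

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the data centre's health endpoint answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_active_dc() -> str:
    # The flaw being illustrated: prefer the primary the instant it looks up.
    # If dc1 flickers on and off during recovery, traffic (and writes) bounce
    # between dc1 and dc2, and each side accepts writes the other never sees.
    return PRIMARY_DC if is_healthy(PRIMARY_DC) else SECONDARY_DC

if __name__ == "__main__":
    while True:
        print("routing traffic to:", choose_active_dc())
        time.sleep(5)  # re-evaluated every few seconds, with no stickiness

A saner setup adds stickiness or a manual switch-back step, so the recovered primary rejoins as a replica instead of immediately taking writes again.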

So basically, in this hypothetical (possibly real, who knows?), they did fail over to a different data centre, but either the data wasn't there or, more likely if the other reported errors are true, systems were bouncing between the two centres and causing data corruption.

Anyway, that's just an example of how a power supply failure at one place could collapse the whole system. If anyone is inclined to write a post saying how this shouldn't happen or listing ways to prevent it - feel free. Just don't write it as if I'm the person arguing against you! ;)
 
Soldato
Joined
15 Sep 2008
Posts
2,501
.. TCS have a lot to answer for and knowing the infrastructure they widely standardise on for clients, I look forward to hearing stories at work on Tuesday ;)

I'm sure I'll hear many theories too, from a hack attack to a government cover-up! Be sure to share :).

I can offer an example. Please note two things: first, this is only an illustration of how one power supply issue could cause this, not a claim that it is what actually happened. Secondly, this is an explanation, not an excuse.

snip .. a very plausible and interesting explanation .. snip

Anyway, that's just an example of how a power supply failure at one place could collapse the whole system. If anyone is inclined to write a post saying how this shouldn't happen or listing ways to prevent it - feel free. Just don't write it as if I'm the person arguing against you! ;)

I'm no expert in the field of disaster recovery but I know the importance of testing, and now BA do too!
 
Soldato
Joined
20 Jul 2004
Posts
3,614
Location
Dublin, Ireland
I can offer an example. Please note two things: first, this is only an illustration of how one power supply issue could cause this, not a claim that it is what actually happened. Secondly, this is an explanation, not an excuse.

snip .. the partial failover / database divergence explanation .. snip

Anyway, that's just an example of how a power supply failure at one place could collapse the whole system. If anyone is inclined to write a post saying how this shouldn't happen or listing ways to prevent it - feel free. Just don't write it as if I'm the person arguing against you! ;)

I've been tasked with setting up a Secondary DR site for my company, and this post scares me to death :)

Nate
 
Soldato
Joined
25 Jun 2011
Posts
5,468
Location
Yorkshire and proud of it!
I've been tasked with setting up a Secondary DR site for my company, and this post scares me to death :)

Nate

Heh. Good luck. I had the benefit of a good team but it's not easy. One challenge was that the company I was contracted to had had a couple of quite visible failures in the past due to data centre switchovers and configuration discrepancies, specifically in outward-facing security solutions. So the non-technical people were very reluctant when the IT teams said they wanted to switch sites. It was compounded by the fact that the IT team they had didn't actually do routine switching for testing purposes; they did it only when they needed to upgrade or reconfigure something at one of the sites, which was bad practice.

I had to convince the higher-level management that routine, scheduled switching between centres was the smallest possible risk: any issues would be found under controlled circumstances, with the entire team present and actively monitoring for problems. The alternative was being in a continuous state of unknown risk, waiting for it to inevitably happen at a time nobody was prepared for. Full failover tests are essential.

The database divergence took me two weeks to sort out, though of course I had things up and running for 98% of their customers far sooner. But when all your foreign keys are utterly out of whack between master and slave because non-propagated INSERTs are all over the place, you can imagine what that's like: rebuilding a new database with new identifiers, figuring out how to merge the lost data back into the live database, and resolving all the conflicts on unique data. Next project they did, moans of "let's just use MySQL" were met with my meanest possible expression. "SQL Server or Postgres - take it or leave it" :D
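For anyone curious what "diverged" actually looks like, here's a minimal sketch of the sort of check you end up writing - plain Python over two DB-API connections, with made-up table and column names rather than anything from a real system - comparing per-row checksums on each side to separate rows that are merely behind from rows that genuinely conflict:

import hashlib

def row_fingerprints(conn, table, key, columns):
    """Map primary-key value -> hash of the row's contents on one database.
    conn is any DB-API connection; table/key/column names are made up for
    the example (and built by string formatting, so illustration only)."""
    cur = conn.cursor()
    cur.execute(f"SELECT {key}, {', '.join(columns)} FROM {table}")
    return {row[0]: hashlib.sha256(repr(row[1:]).encode()).hexdigest()
            for row in cur.fetchall()}

def diff_tables(conn_a, conn_b, table, key, columns):
    """Split rows into 'only on one side' (usually just lag - replayable)
    and 'present on both sides but different' (genuinely diverged)."""
    a = row_fingerprints(conn_a, table, key, columns)
    b = row_fingerprints(conn_b, table, key, columns)
    only_in_a = set(a) - set(b)
    only_in_b = set(b) - set(a)
    conflicting = {pk for pk in set(a) & set(b) if a[pk] != b[pk]}
    return only_in_a, only_in_b, conflicting

# e.g. diff_tables(master_conn, slave_conn, "bookings", "booking_id",
#                  ["passenger", "flight_no", "status"])

Rows missing from one side are usually just replication lag and can be replayed; rows present on both sides with different contents are the ones that turn a restore into a two-week job.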
 
Soldato
Joined
9 Dec 2006
Posts
9,246
Location
@ManCave
Flight's just been cancelled... Going to be a very unhappy 5 year old in the morning!

Only allowed to rebook up until the 10th of June, which is useful considering the summer holidays don't start till the middle of July! >_<

Am guessing our travel insurance isn't going to cover the cost of the hotel we're currently staying in or the parking for the week since we've used it so that's £150 down the drain, grr!
They're giving out full refunds. More to the point, just fly with another airline. #problemsolved
 
Soldato
Joined
9 Dec 2006
Posts
9,246
Location
@ManCave
I can offer an example. Please note two things: first, this is only an illustration of how one power supply issue could cause this, not a claim that it is what actually happened. Secondly, this is an explanation, not an excuse.

snip .. the partial failover / database divergence explanation .. snip

Anyway, that's just an example of how a power supply failure at one place could collapse the whole system. If anyone is inclined to write a post saying how this shouldn't happen or listing ways to prevent it - feel free. Just don't write it as if I'm the person arguing against you! ;)

Sounds familiar - data not syncing correctly could cause this. In that case a simple fix would be: if the server(s) in data centre #1 cannot be contacted, fail over to data centre #2, then propagate the changes back once data centre #1 is online (rough sketch below).

If it's a worldwide company they should have multiple data centres with the same information at all times.

Just shows how some companies do not test...
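Something like this, very roughly (the helper objects are made up, and it skips the genuinely hard part - the same record being changed in both places while they're apart):

import json
import time

# Very rough sketch of "use data centre #2 while #1 is down, then propagate
# once it's back". dc1/dc2 are hypothetical client objects with an insert()
# method - invented purely for illustration.

RECONCILE_LOG = "pending_for_dc1.jsonl"

def write_booking(record, dc1, dc2):
    """Try the primary; on failure, write to the secondary and queue the
    change so it can be replayed to the primary later."""
    try:
        dc1.insert("bookings", record)
    except ConnectionError:
        dc2.insert("bookings", record)
        with open(RECONCILE_LOG, "a") as log:
            log.write(json.dumps({"ts": time.time(), "record": record}) + "\n")

def replay_to_dc1(dc1):
    """Once data centre #1 is reachable again, replay the queued writes.
    The hard part this sketch ignores: what if #1 also took writes for the
    same booking while it looked 'down' from here? That's the divergence
    problem from the earlier posts."""
    with open(RECONCILE_LOG) as log:
        for line in log:
            dc1.insert("bookings", json.loads(line)["record"])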
 
Caporegime
Joined
18 Oct 2002
Posts
26,083
I've been tasked with setting up a Secondary DR site for my company, and this post scares me to death :)

Nate

It's a lot more straightforward if it's a genuine DR site and you have an RPO measured in hours. If you're trying to run it in a way that gets you availability then things become more complicated and you need to be looking at adding a third site to act as a witness/quorum for database services so that you never have an even number of systems voting for whoever has the most up-to-date data.
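As a toy illustration of the witness point - nothing vendor-specific, just the counting argument - here's why two voters can stalemate at 1 vs 1 while adding a third always lets one side reach a majority:

def elect_primary(cluster_size, votes):
    """votes maps each reachable voter to the node it believes holds the
    freshest data. A primary needs a strict majority of the WHOLE cluster,
    not just of whoever answered - otherwise both halves of a split could
    each promote their own primary."""
    needed = cluster_size // 2 + 1
    tally = {}
    for choice in votes.values():
        tally[choice] = tally.get(choice, 0) + 1
    winners = [c for c, n in tally.items() if n >= needed]
    return winners[0] if winners else None

# Two-site cluster, network split: each side only sees its own vote.
# 1 < 2, so neither side can safely promote itself.
print(elect_primary(2, {"dc1": "dc1"}))                    # None
print(elect_primary(2, {"dc2": "dc2"}))                    # None

# Three voters (two sites plus a small witness). The side that can still
# reach the witness gets 2 of 3 and fails over cleanly; the isolated site
# gets 1 of 3 and stands down.
print(elect_primary(3, {"dc2": "dc2", "witness": "dc2"}))  # dc2
print(elect_primary(3, {"dc1": "dc1"}))                    # None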
 
Soldato
Joined
20 Jul 2004
Posts
3,614
Location
Dublin, Ireland
It's a lot more straightforward if it's a genuine DR site and you have an RPO measured in hours. If you're trying to run it in a way that gets you availability then things become more complicated and you need to be looking at adding a third site to act as a witness/quorum for database services so that you never have an even number of systems voting for whoever has the most up-to-date data.

Yeah - currently I've pushed it back to management to solidly define the SLAs - DR is the easiest to implement, though Active/Active sites are desired from above, supposedly for simplicity, when it is anything but.

Nate
 
Caporegime
Joined
18 Oct 2002
Posts
26,083
Yeah - currently I've pushed it back to management to solidly define the SLAs - DR is the easiest to implement, though Active/Active sites are desired from above, supposedly for simplicity, when it is anything but.

Nate

Yeah, that's quite a different beast. If you have services that can cluster at a software level then you're in a much better position. As soon as you have a legacy piece of software where all the availability is being bolted on with VMware features, everything gets a fair bit more complicated.
 
Soldato
Joined
6 Oct 2004
Posts
18,325
Location
Birmingham
I'm at T5. There are still a lot of cancellations and delays.
My flight is 2.5 hours late but is going, so not too bad.

Is it still rammed & chaotic there?

Have to say I'm not looking forward to getting through that in the morning (assuming I don't get another notification at 4am telling me it's cancelled)
 
Man of Honour
Joined
28 Nov 2007
Posts
12,736
Is it still rammed & chaotic there?

Have to say I'm not looking forward to getting through that in the morning (assuming I don't get another notification at 4am telling me it's cancelled)
Not too bad to be fair. A lot of flights cancelled so volume of people not that bad.
 