British Airways - Massive IT Disruption Worldwide

Soldato
Joined
25 Jun 2011
Posts
5,468
Location
Yorkshire and proud of it!
How can one power supply issue ground the entire fleet of BA? This is gross ineptitude.

I can offer an example. Please note two things: first, this is only an illustration of how one power supply issue could cause this, not a claim that it is what actually happened. Secondly, this is an explanation, not an excuse.

So, big companies with critical IT systems run multiple datacentres in case of disaster - including power supply failure. However, companies are usually pretty terrified of actually testing complete data centre failures. I know - I once had to have a big argument with a director on the necessity of doing this. They are risk averse. I had to show them that the small, controlled risk of causing an outage during a test was preferable to never testing it and having it fail under circumstances out of our control. In any case, whether regularly tested or not, failover of very large systems such as BA's flight management - handling bookings, boarding passes, cancellations, et al. - is a complex task.

There are reports of things going wrong before the complete collapse - people seeing wrong destinations come up, missing flights and similar. What this sounds like to me is a partial failover. For example, the other centre / centres were not properly synced: the system failed over and either data was missing because it hadn't propagated yet, or it failed over and then the original centre came back online, systems transferred back to that one, and THAT one was no longer current. Or possibly it was running from both intermittently / simultaneously.

One way of transferring between datacentres is to update firewalls / load-balancers to direct traffic to the other centre, and this can take a few moments to propagate. I'm not sure what database systems BA use (probably Oracle), but if you end up in a scenario where you have two MASTER databases (in practice, BA will have something more complicated than just two) and they get conflicting data in them, it can be a nightmare to disentangle. I know - one job I had was to repair a situation where two databases that were supposed to be in sync had diverged. Note, diverged is different from one merely being behind, where you just replay the transactions to catch it up.
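Purely as an illustration of that "bouncing between centres" failure mode - everything below is hypothetical, with made-up endpoint names and nothing to do with BA's actual kit - this is roughly what a naive health-check-based failover looks like in Python, and why it flaps the moment the original centre flickers back online:

import time
import urllib.request

# Hypothetical health endpoints - invented for illustration only.
PRIMARY_DC = "https://dc1.example.internal/health"
SECONDARY_DC = "https://dc2.example.internal/health"

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the data centre's health endpoint answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_active_dc() -> str:
    # The flaw being illustrated: prefer the primary the instant it looks up.
    # If dc1 flickers on and off during recovery, traffic (and writes) bounce
    # between dc1 and dc2, and each side accepts writes the other never sees.
    return PRIMARY_DC if is_healthy(PRIMARY_DC) else SECONDARY_DC

if __name__ == "__main__":
    while True:
        print("routing traffic to:", choose_active_dc())
        time.sleep(5)  # re-evaluated every few seconds, with no stickiness

A saner setup adds stickiness or a manual switch-back step, so the recovered primary rejoins as a replica instead of immediately taking writes again.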

So basically, in this hypothetical (possibly real, who knows?), they did fail over to a different data centre, but either the data wasn't there or, more likely if the other reported errors are true, systems were bouncing between the two centres and causing data corruption.

Anyway, that's just an example of how a power supply failure at one place could collapse the whole system. If anyone is inclined to write a post saying how this shouldn't happen or listing ways to prevent it - feel free. Just don't write it as if I'm the person arguing against you! ;)
 
Soldato
Joined
15 Sep 2008
Posts
2,501
.. TCS have a lot to answer for and knowing the infrastructure they widely standardise on for clients, I look forward to hearing stories at work on Tuesday ;)

I'm sure I'll hear many theories too, from a hack attack to a government cover-up! Be sure to share :).

I can offer an example. Please note two things: first, this is only an illustration of how one power supply issue could cause this, not a claim that it is what actually happened. Secondly, this is an explanation, not an excuse.

snip .. a very plausible and interesting explanation .. snip

Anyway, that's just an example of how a power supply failure at one place could collapse the whole system. If anyone is inclined to write a post saying how this shouldn't happen or listing ways to prevent it - feel free. Just don't write it as if I'm the person arguing against you! ;)

I'm no expert in the field of disaster recovery but I know the importance of testing, and now BA do too!
 
Soldato
Joined
20 Jul 2004
Posts
3,614
Location
Dublin, Ireland
I can offer an example. Please note two things: first, this is only an illustration of how one power supply issue could cause this, not a claim that it is what actually happened. Secondly, this is an explanation, not an excuse.

snip .. the partial failover / database divergence explanation .. snip

Anyway, that's just an example of how a power supply failure at one place could collapse the whole system. If anyone is inclined to write a post saying how this shouldn't happen or listing ways to prevent it - feel free. Just don't write it as if I'm the person arguing against you! ;)

I've been tasked with setting up a Secondary DR site for my company, and this post scares me to death :)

Nate
 
Soldato
Joined
25 Jun 2011
Posts
5,468
Location
Yorkshire and proud of it!
I've been tasked with setting up a Secondary DR site for my company, and this post scares me to death :)

Nate

Heh. Good luck. I had the benefit of a good team but it's not easy. One challenge was that the company I was contracted to had had a couple of quite visible failures in the past due to data centre switchovers and configuration discrepancies, specifically in outward-facing security solutions. So the non-technical people were very reluctant when the IT teams said they wanted to switch sites. It was compounded by the fact that the IT team they had didn't actually do routine switching for testing purposes; they did it only when they needed to upgrade or reconfigure something at one of the sites, which was bad practice.

I had to convince the higher-level management that routine, scheduled switching between centres was the smallest possible risk: any issues would be found under controlled circumstances, with the entire team present and actively monitoring for problems. The alternative was being in a continuous state of unknown risk, waiting for it to inevitably happen at a time nobody was prepared for. Full failover tests are essential.

The database divergence took me two weeks to sort out, though of course I had things up and running for 98% of their customers far sooner. But when all your foreign keys are utterly out of whack between master and slave because non-propagated INSERTs are all over the place, you can imagine what that's like: rebuilding a new database with new identifiers, figuring out how to merge the lost data back into the live database, and resolving all the conflicts on unique data. Next project they did, moans of "let's just use MySQL" were met with my meanest possible expression. "SQL Server or Postgres - take it or leave it" :D
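For anyone curious what "diverged" actually looks like, here's a minimal sketch of the sort of check you end up writing - plain Python over two DB-API connections, with made-up table and column names rather than anything from a real system - comparing per-row checksums on each side to separate rows that are merely behind from rows that genuinely conflict:

import hashlib

def row_fingerprints(conn, table, key, columns):
    """Map primary-key value -> hash of the row's contents on one database.
    conn is any DB-API connection; table/key/column names are made up for
    the example (and built by string formatting, so illustration only)."""
    cur = conn.cursor()
    cur.execute(f"SELECT {key}, {', '.join(columns)} FROM {table}")
    return {row[0]: hashlib.sha256(repr(row[1:]).encode()).hexdigest()
            for row in cur.fetchall()}

def diff_tables(conn_a, conn_b, table, key, columns):
    """Split rows into 'only on one side' (usually just lag - replayable)
    and 'present on both sides but different' (genuinely diverged)."""
    a = row_fingerprints(conn_a, table, key, columns)
    b = row_fingerprints(conn_b, table, key, columns)
    only_in_a = set(a) - set(b)
    only_in_b = set(b) - set(a)
    conflicting = {pk for pk in set(a) & set(b) if a[pk] != b[pk]}
    return only_in_a, only_in_b, conflicting

# e.g. diff_tables(master_conn, slave_conn, "bookings", "booking_id",
#                  ["passenger", "flight_no", "status"])

Rows missing from one side are usually just replication lag and can be replayed; rows present on both sides with different contents are the ones that turn a restore into a two-week job.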
 
Soldato
Joined
9 Dec 2006
Posts
9,246
Location
@ManCave
Flight's just been cancelled... Going to be a very unhappy 5 year old in the morning!

Only allowed to rebook up until the 10th of June, which is useful considering the summer holidays don't start till the middle of July! >_<

Am guessing our travel insurance isn't going to cover the cost of the hotel we're currently staying in or the parking for the week since we've used it so that's £150 down the drain, grr!
They're giving out full refunds. More to the point, just fly with another airline. #problemsolved
 
Soldato
Joined
9 Dec 2006
Posts
9,246
Location
@ManCave
I can offer an example. Please note two things: first, this is only an illustration of how one power supply issue could cause this, not a claim that it is what actually happened. Secondly, this is an explanation, not an excuse.

snip .. the partial failover / database divergence explanation .. snip

Anyway, that's just an example of how a power supply failure at one place could collapse the whole system. If anyone is inclined to write a post saying how this shouldn't happen or listing ways to prevent it - feel free. Just don't write it as if I'm the person arguing against you! ;)

Sounds familiar - data not syncing correctly could cause this. In that case a simple fix would be: if the server(s) in data centre #1 cannot be contacted, fail over to data centre #2, then propagate the changes back once data centre #1 is online (rough sketch below).

If it's a worldwide company they should have multiple data centres with the same information at all times.

Just shows how some companies do not test...
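Something like this, very roughly (the helper objects are made up, and it skips the genuinely hard part - the same record being changed in both places while they're apart):

import json
import time

# Very rough sketch of "use data centre #2 while #1 is down, then propagate
# once it's back". dc1/dc2 are hypothetical client objects with an insert()
# method - invented purely for illustration.

RECONCILE_LOG = "pending_for_dc1.jsonl"

def write_booking(record, dc1, dc2):
    """Try the primary; on failure, write to the secondary and queue the
    change so it can be replayed to the primary later."""
    try:
        dc1.insert("bookings", record)
    except ConnectionError:
        dc2.insert("bookings", record)
        with open(RECONCILE_LOG, "a") as log:
            log.write(json.dumps({"ts": time.time(), "record": record}) + "\n")

def replay_to_dc1(dc1):
    """Once data centre #1 is reachable again, replay the queued writes.
    The hard part this sketch ignores: what if #1 also took writes for the
    same booking while it looked 'down' from here? That's the divergence
    problem from the earlier posts."""
    with open(RECONCILE_LOG) as log:
        for line in log:
            dc1.insert("bookings", json.loads(line)["record"])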
 
Caporegime
Joined
18 Oct 2002
Posts
26,083
I've been tasked with setting up a Secondary DR site for my company, and this post scares me to death :)

Nate

It's a lot more straightforward if it's a genuine DR site and you have an RPO measured in hours. If you're trying to run it in a way that gets you availability then things become more complicated and you need to be looking at adding a third site to act as a witness/quorum for database services so that you never have an even number of systems voting for whoever has the most up-to-date data.
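As a toy illustration of the witness point - nothing vendor-specific, just the counting argument - here's why two voters can stalemate at 1 vs 1 while adding a third always lets one side reach a majority:

def elect_primary(cluster_size, votes):
    """votes maps each reachable voter to the node it believes holds the
    freshest data. A primary needs a strict majority of the WHOLE cluster,
    not just of whoever answered - otherwise both halves of a split could
    each promote their own primary."""
    needed = cluster_size // 2 + 1
    tally = {}
    for choice in votes.values():
        tally[choice] = tally.get(choice, 0) + 1
    winners = [c for c, n in tally.items() if n >= needed]
    return winners[0] if winners else None

# Two-site cluster, network split: each side only sees its own vote.
# 1 < 2, so neither side can safely promote itself.
print(elect_primary(2, {"dc1": "dc1"}))                    # None
print(elect_primary(2, {"dc2": "dc2"}))                    # None

# Three voters (two sites plus a small witness). The side that can still
# reach the witness gets 2 of 3 and fails over cleanly; the isolated site
# gets 1 of 3 and stands down.
print(elect_primary(3, {"dc2": "dc2", "witness": "dc2"}))  # dc2
print(elect_primary(3, {"dc1": "dc1"}))                    # None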
 
Soldato
Joined
20 Jul 2004
Posts
3,614
Location
Dublin, Ireland
It's a lot more straightforward if it's a genuine DR site and you have an RPO measured in hours. If you're trying to run it in a way that gets you availability then things become more complicated and you need to be looking at adding a third site to act as a witness/quorum for database services so that you never have an even number of systems voting for whoever has the most up-to-date data.

Yeah - currently I've pushed it back to management to solidly define the SLAs - DR is the easiest to implement, though Active/Active sites are desired from above, supposedly for simplicity, when it is anything but.

Nate
 
Caporegime
Joined
18 Oct 2002
Posts
26,083
Yeah - currently I've pushed it back to management to solidly define the SLAs - DR is the easiest to implement, though Active/Active sites are desired from above, supposedly for simplicity, when it is anything but.

Nate

Yeah, that's quite a different beast. If you have services that can cluster at a software level then you're in a much better position. As soon as you have a legacy piece of software where all the availability is being bolted on with VMware features, everything gets a fair bit more complicated.
 
Soldato
Joined
6 Oct 2004
Posts
18,325
Location
Birmingham
I'm at T5. There are still a lot of cancellations and delays.
My flight is 2.5 hours late but is going, so not too bad.

Is it still rammed & chaotic there?

Have to say I'm not looking forward to getting through that in the morning (assuming I don't get another notification at 4am telling me it's cancelled)
 
Man of Honour
Joined
28 Nov 2007
Posts
12,736
Is it still rammed & chaotic there?

Have to say I'm not looking forward to getting through that in the morning (assuming I don't get another notification at 4am telling me it's cancelled)
Not too bad to be fair. A lot of flights cancelled so volume of people not that bad.
 