WhatsApp Down?

Soldato
Joined
29 Dec 2014
Posts
5,780
Location
Midlands
Too early to speculate that this is related to the Facebook whistleblower due in front of the Senate today?

I think that it's plausible that this might have been caused by the actions of a disgruntled employee...

The reason I think it's plausible is the nature of the change: at 15:40 Facebook made a change to its external routing (BGP) which withdrew a large number of BGP prefixes, some of which covered its DNS and internal infrastructure ranges - which is weird. It's weird because a change like that (anything involving important public prefixes) would normally require several levels of approval, and it would also be subject to health checks (pre- and post-change) and automated rollbacks if any dashboards went red during or after the change.
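
If it helps picture what observers outside Facebook were seeing, here's a rough Python sketch (stdlib only) of the two checks everyone was running: does facebook.com still resolve, and are the authoritative nameservers even reachable at the IP level? The nameserver names and IPs below are illustrative of the kind of targets you'd probe (as they were reported around the time), so don't treat them as gospel.

```python
# Quick outage probe: (1) can the local resolver still resolve facebook.com,
# and (2) do the authoritative nameserver IPs answer a ping at all?
# During the outage both failed, because the BGP prefixes covering those IPs
# had been withdrawn and the packets had nowhere to go.
import socket
import subprocess

# Illustrative nameserver IPs (a-d.ns.facebook.com) - verify them yourself.
FACEBOOK_NS = {
    "a.ns.facebook.com": "129.134.30.12",
    "b.ns.facebook.com": "129.134.31.12",
    "c.ns.facebook.com": "185.89.218.12",
    "d.ns.facebook.com": "185.89.219.12",
}

def resolves(name: str) -> bool:
    """True if the local resolver can still turn the name into an address."""
    try:
        socket.getaddrinfo(name, 443)
        return True
    except socket.gaierror:
        return False

def pingable(ip: str) -> bool:
    """True if a single ICMP echo gets a reply (Linux-style ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    print("facebook.com resolves:", resolves("facebook.com"))
    for ns, ip in FACEBOOK_NS.items():
        print(f"{ns} ({ip}) reachable:", pingable(ip))
```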

It just seems fishy that this failed in a way that was essentially non-recoverable and lasted so long. The fact that the whistleblower thing is going on at the same time might be a complete coincidence, but it's also a fact that one of the main causes of cyber/DoS attacks is disgruntled/upset employees.
 
Soldato
Joined
21 Jan 2010
Posts
22,170
I think that it's plausible that this might have been caused by the actions of a disgruntled employee...

The reason I think it's plausible is the nature of the change: at 15:40 Facebook made a change to its external routing (BGP) which withdrew a large number of BGP prefixes, some of which covered its DNS and internal infrastructure ranges - which is weird. It's weird because a change like that (anything involving important public prefixes) would normally require several levels of approval, and it would also be subject to health checks (pre- and post-change) and automated rollbacks if any dashboards went red during or after the change.

It just seems fishy that this failed in a way that was essentially non-recoverable and lasted so long. The fact that the whistleblower thing is going on at the same time might be a complete coincidence, but it's also a fact that one of the main causes of cyber/DoS attacks is disgruntled/upset employees.
It is a much more interesting story.

I'd argue it's more likely that BGP changes are few and far between and someone done goofed. We've all felt that dread when a remote device stops responding to pings. Messing up the config on a router at the far end makes anything other than physical access or fully redundant out-of-band access futile.
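
The usual guard against exactly that is a commit-confirm style rollback timer (Junos has "commit confirmed", and people abuse "reload in" on Cisco kit for the same effect): the box itself arms a timer when the change goes live and reverts on its own unless you can get back in and confirm. Rough Python sketch of the idea below - the apply/restore/confirm hooks are hypothetical stand-ins, not a real vendor API.

```python
# Commit-confirm sketch: the router (not the operator's laptop) arms a
# rollback timer when a risky change is applied, and only keeps the new
# config if the operator manages to reconnect and confirm in time.
# apply_candidate / restore_snapshot / operator_has_confirmed are
# hypothetical hooks for whatever the platform actually provides.
import time

CONFIRM_WINDOW_S = 600  # ten minutes to prove you haven't locked yourself out

def commit_confirmed(apply_candidate, restore_snapshot, operator_has_confirmed):
    apply_candidate()                        # push the risky config live
    deadline = time.monotonic() + CONFIRM_WINDOW_S
    while time.monotonic() < deadline:
        if operator_has_confirmed():         # operator got back in and confirmed
            return "kept"
        time.sleep(5)
    restore_snapshot()                       # no confirmation: assume we broke
    return "rolled back"                     # access and revert on-box
```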
 
Soldato
Joined
30 Sep 2005
Posts
16,546
I thought the same thing. It would have had to go through various levels of approval, and each and every time something was missed?

Now, it could be that someone just made a simple mistake, but things like that are few and far between. Change control is pretty damn good at places like Facebook. The code, files, scripts etc. are part of the change, so it's doubtful that multiple people missed the error.
 
Soldato
Joined
29 Dec 2014
Posts
5,780
Location
Midlands
I'd argue it's more likely that BGP changes are few and far between and someone done goofed. We've all felt that dread when a remote device stops responding to pings. Messing up the config on a router at the far end makes anything other than physical access or fully redundant out-of-band access futile.

Yeah I mean it's possible that someone just 'screwed up'

The problem with that, though, is that Facebook have some of the world's best network automation - a lot (if not all) of their changes are modelled and tested before they're deployed, and if anything goes wrong they're normally rolled back automatically or halted mid-deployment.

What's also weird is that Facebook's network is massive and highly distributed - it's very unusual for a single change to affect their entire infrastructure across the globe at once, especially with things like public internet-facing prefixes. Stuff like that would normally have to be approved and checked numerous times, and that's before any automated checks.
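
To be clear about what I mean by "rolled back automatically or halted mid-deployment", the general shape of that kind of health-gated rollout is something like the sketch below - this is the generic pattern, not Facebook's actual tooling, and the push/revert/health hooks are hypothetical.

```python
# Generic staged, health-gated rollout: update devices in waves, let the
# dashboards soak after each wave, and halt + revert the lot if anything
# goes red. Not Facebook's tooling - just the general pattern described
# above. push_change / revert_change / health_ok are hypothetical hooks.
import time

WAVES = [
    ["edge-pop-canary-1"],                   # a single canary device first
    ["edge-pop-eu-1", "edge-pop-eu-2"],      # then a small region
    ["edge-pop-us-1", "edge-pop-us-2"],      # then wider and wider waves
]

def staged_rollout(push_change, revert_change, health_ok, soak_seconds=300):
    touched = []
    for wave in WAVES:
        for device in wave:
            push_change(device)
            touched.append(device)
        time.sleep(soak_seconds)             # give the dashboards time to react
        if not all(health_ok(device) for device in touched):
            for device in reversed(touched): # any red dashboard: put it all back
                revert_change(device)
            return "rolled back"
    return "deployed"
```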
 
Soldato
Joined
21 Oct 2011
Posts
21,592
Location
ST4
Well, far be it from me to be a CT nutter, but it does seem a bit coincidental that this happened the day after that kicked off, conveniently burying the story in the media.

Just a coincidence. Much like Covid-19 starting in a Wuhan wet market when there's one of the only labs in the world dealing with this particular virus just down the road.
 
Soldato
Joined
21 Jan 2010
Posts
22,170
Yeah I mean it's possible that someone just 'screwed up'

The problem with that, though, is that Facebook have some of the world's best network automation - a lot (if not all) of their changes are modelled and tested before they're deployed, and if anything goes wrong they're normally rolled back automatically or halted mid-deployment.

What's also weird is that Facebook's network is massive and highly distributed - it's very unusual for a single change to affect their entire infrastructure across the globe at once, especially with things like public internet-facing prefixes. Stuff like that would normally have to be approved and checked numerous times, and that's before any automated checks.
I don't doubt it - I can just imagine a conversation going something like...

"We are going to deploy all this uber cool stuff to automate/security check/double and counter sign any config changes with auto regression and failover backups plus simulation"
"OK where do we start?"
"The devices we mess about with most"
And ergo, BGP peering routers that get updated once in a blue moon drop to the bottom of the list.

I imagine the network administrators dealing with BGP are the Rayban wearing, rollerblading kind who don't need no config or change approval :cool:
 
Soldato
Joined
13 Apr 2013
Posts
12,399
Location
La France
I don't doubt it - I can just imagine a conversation going something like...

"We are going to deploy all this uber cool stuff to automate/security check/double and counter sign any config changes with auto regression and failover backups plus simulation"
"OK where do we start?"
"The devices we mess about with most"
And ergo, BGP peering routers that get updated once in a blue moon drop to the bottom of the list.

I imagine the network administrators dealing with BGP are the Rayban wearing, rollerblading kind who don't need no config or change approval :cool:

You must have worked in telecoms.

“Should we test this in the network test bed or on a small cluster of quiet cells in the middle of the night first?”
“Nah, it’ll be fine. Let’s do central London first during peak network load.”

A few moments later…

“Ah… ****! Roll it back and delete the system access logs so no-one knows it was us!”
 
Soldato
Joined
29 Dec 2014
Posts
5,780
Location
Midlands
I don't doubt it - I can just imagine a conversation going something like...

"We are going to deploy all this uber cool stuff to automate/security check/double and counter sign any config changes with auto regression and failover backups plus simulation"
"OK where do we start?"
"The devices we mess about with most"
And ergo, BGP peering routers that get updated once in a blue moon drop to the bottom of the list.

I imagine the network administrators dealing with BGP are the Rayban wearing, rollerblading kind who don't need no config or change approval :cool:

lol

I know a bunch of the network engineers at Facebook - a lot of them are straight-talking Russians, but funnily enough, right now none of them are talking :D

Facebook's edge network is pretty awesome, to be honest - they have over 160 PoPs globally and the DCs to go with it. I'm just having a hard time picturing a normal, scheduled change taking it all down at once for six hours... It's just weird.
 
Soldato
Joined
13 Apr 2013
Posts
12,399
Location
La France
lol

I know a bunch of the network engineers at Facebook - a lot of them are straight-talking Russians, but funnily enough, right now none of them are talking :D

Facebook's edge network is pretty awesome, to be honest - they have over 160 PoPs globally and the DCs to go with it. I'm just having a hard time picturing a normal, scheduled change taking it all down at once for six hours... It's just weird.

Overload test gone wrong? T-Mobile Germany once went down for almost 48 hours after an overload test on their shiny new Huawei HLRs went sideways on them.
 
Man of Honour
Joined
20 Sep 2006
Posts
33,991
Are we expecting Facebook to explain what went wrong? They're bound to have a bunch of angry shareholders and business customers. Or are they going to be able to hide in the dark like other 'cloud' companies do after a severe outage?
 