WhatsApp Down?

Soldato
Joined
29 Dec 2014
Posts
5,780
Location
Midlands
Too early to speculate that this is related to the Facebook whistleblower due in front of the Senate today?

I think that it's plausible that this might have been caused by the actions of a disgruntled employee...

The reason I think it's plausible is the nature of the change: at 15:40 Facebook made a change to its external routing (BGP) which withdrew a large number of BGP prefixes, some of which covered its DNS and internal infrastructure ranges - which is weird. It's weird because a change like that (anything involving important public prefixes) would normally require several levels of approval, and it would also be subject to health checks (pre- and post-change) and automated rollbacks if any dashboards went red during or after the change.
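
If it helps picture what observers outside Facebook were seeing, here's a rough Python sketch (stdlib only) of the two checks everyone was running: does facebook.com still resolve, and are the authoritative nameservers even reachable at the IP level? The nameserver names and IPs below are illustrative of the kind of targets you'd probe (as they were reported around the time), so don't treat them as gospel.

```python
# Quick outage probe: (1) can the local resolver still resolve facebook.com,
# and (2) do the authoritative nameserver IPs answer a ping at all?
# During the outage both failed, because the BGP prefixes covering those IPs
# had been withdrawn and the packets had nowhere to go.
import socket
import subprocess

# Illustrative nameserver IPs (a-d.ns.facebook.com) - verify them yourself.
FACEBOOK_NS = {
    "a.ns.facebook.com": "129.134.30.12",
    "b.ns.facebook.com": "129.134.31.12",
    "c.ns.facebook.com": "185.89.218.12",
    "d.ns.facebook.com": "185.89.219.12",
}

def resolves(name: str) -> bool:
    """True if the local resolver can still turn the name into an address."""
    try:
        socket.getaddrinfo(name, 443)
        return True
    except socket.gaierror:
        return False

def pingable(ip: str) -> bool:
    """True if a single ICMP echo gets a reply (Linux-style ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    print("facebook.com resolves:", resolves("facebook.com"))
    for ns, ip in FACEBOOK_NS.items():
        print(f"{ns} ({ip}) reachable:", pingable(ip))
```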

It just seems fishy that this failed in a way that was essentially non-recoverable and lasted so long. The fact that the whistleblower thing is going on at the same time might be a complete coincidence, but it's also a fact that one of the main causes of cyber/DoS attacks is disgruntled/upset employees.
 
Soldato
Joined
21 Jan 2010
Posts
22,170
I think that it's plausible that this might have been caused by the actions of a disgruntled employee...

The reason I think it's plausible is the nature of the change: at 15:40 Facebook made a change to its external routing (BGP) which withdrew a large number of BGP prefixes, some of which covered its DNS and internal infrastructure ranges - which is weird. It's weird because a change like that (anything involving important public prefixes) would normally require several levels of approval, and it would also be subject to health checks (pre- and post-change) and automated rollbacks if any dashboards went red during or after the change.

It just seems fishy that this failed in a way that was essentially non-recoverable and lasted so long. The fact that the whistleblower thing is going on at the same time might be a complete coincidence, but it's also a fact that one of the main causes of cyber/DoS attacks is disgruntled/upset employees.
It is a much more interesting story.

I'd argue it's more likely that BGP changes are few and far between and someone done goofed. We've all felt that dread when a remote device stops responding to pings. Messing up the config on a router at the far end makes anything other than physical access or fully redundant out-of-band access futile.
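
The usual guard against exactly that is a commit-confirm style rollback timer (Junos has "commit confirmed", and people abuse "reload in" on Cisco kit for the same effect): the box itself arms a timer when the change goes live and reverts on its own unless you can get back in and confirm. Rough Python sketch of the idea below - the apply/restore/confirm hooks are hypothetical stand-ins, not a real vendor API.

```python
# Commit-confirm sketch: the router (not the operator's laptop) arms a
# rollback timer when a risky change is applied, and only keeps the new
# config if the operator manages to reconnect and confirm in time.
# apply_candidate / restore_snapshot / operator_has_confirmed are
# hypothetical hooks for whatever the platform actually provides.
import time

CONFIRM_WINDOW_S = 600  # ten minutes to prove you haven't locked yourself out

def commit_confirmed(apply_candidate, restore_snapshot, operator_has_confirmed):
    apply_candidate()                        # push the risky config live
    deadline = time.monotonic() + CONFIRM_WINDOW_S
    while time.monotonic() < deadline:
        if operator_has_confirmed():         # operator got back in and confirmed
            return "kept"
        time.sleep(5)
    restore_snapshot()                       # no confirmation: assume we broke
    return "rolled back"                     # access and revert on-box
```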
 
Soldato
Joined
30 Sep 2005
Posts
16,546
I thought the same thing. It would have had to go through various levels of approval, and each and every time something was missed?

Now, it could be that someone just made a simple mistake, but things like that are few and far between. Change control is pretty damn good at places like Facebook. The code, files, scripts etc. are part of the change, so it's doubtful that multiple people missed the error.
 
Soldato
Joined
29 Dec 2014
Posts
5,780
Location
Midlands
I'd argue it's more likely that BGP changes are few and far between and someone done goofed. We've all felt that dread when a remote device stops responding to pings. Messing up the config on a router at the far end makes anything other than physical access or fully redundant out-of-band access futile.

Yeah I mean it's possible that someone just 'screwed up'

The problem with that, though, is that Facebook have some of the world's best network automation - a lot (if not all) of their changes are modelled and tested before they're deployed, and if anything goes wrong they're normally rolled back automatically or halted mid-deployment.

What's also weird is that Facebook's network is massive and highly distributed - it's very unusual for a single change to affect their entire infrastructure across the globe at once, especially with things like public internet-facing prefixes. Stuff like that would normally have to be approved and checked numerous times, and that's before any automated checks.
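
To be clear about what I mean by "rolled back automatically or halted mid-deployment", the general shape of that kind of health-gated rollout is something like the sketch below - this is the generic pattern, not Facebook's actual tooling, and the push/revert/health hooks are hypothetical.

```python
# Generic staged, health-gated rollout: update devices in waves, let the
# dashboards soak after each wave, and halt + revert the lot if anything
# goes red. Not Facebook's tooling - just the general pattern described
# above. push_change / revert_change / health_ok are hypothetical hooks.
import time

WAVES = [
    ["edge-pop-canary-1"],                   # a single canary device first
    ["edge-pop-eu-1", "edge-pop-eu-2"],      # then a small region
    ["edge-pop-us-1", "edge-pop-us-2"],      # then wider and wider waves
]

def staged_rollout(push_change, revert_change, health_ok, soak_seconds=300):
    touched = []
    for wave in WAVES:
        for device in wave:
            push_change(device)
            touched.append(device)
        time.sleep(soak_seconds)             # give the dashboards time to react
        if not all(health_ok(device) for device in touched):
            for device in reversed(touched): # any red dashboard: put it all back
                revert_change(device)
            return "rolled back"
    return "deployed"
```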
 
Soldato
Joined
21 Oct 2011
Posts
21,592
Location
ST4
Well, far be it from me to be a CT nutter, but it does seem a bit coincidental that this happened the day after that kicked off, conveniently burying the story in the media.

Just a coincidence. Much like Covid-19 starting in a Wuhan wet market when there's one of the only labs in the world dealing with this particular virus just down the road.
 
Soldato
Joined
21 Jan 2010
Posts
22,170
Yeah I mean it's possible that someone just 'screwed up'

The problem with that, though, is that Facebook have some of the world's best network automation - a lot (if not all) of their changes are modelled and tested before they're deployed, and if anything goes wrong they're normally rolled back automatically or halted mid-deployment.

What's also weird is that Facebook's network is massive and highly distributed - it's very unusual for a single change to affect their entire infrastructure across the globe at once, especially with things like public internet-facing prefixes. Stuff like that would normally have to be approved and checked numerous times, and that's before any automated checks.
I don't doubt it - I can just imagine a conversation going something like...

"We are going to deploy all this uber cool stuff to automate/security check/double and counter sign any config changes with auto regression and failover backups plus simulation"
"OK where do we start?"
"The devices we mess about with most"
And ergo, BGP peering routers that get updated once in a blue moon drop to the bottom of the list.

I imagine the network administrators dealing with BGP are the Rayban wearing, rollerblading kind who don't need no config or change approval :cool:
 
Soldato
Joined
13 Apr 2013
Posts
12,399
Location
La France
I don't doubt it - I can just imagine a conversation going something like...

"We are going to deploy all this uber cool stuff to automate/security check/double and counter sign any config changes with auto regression and failover backups plus simulation"
"OK where do we start?"
"The devices we mess about with most"
And ergo, BGP peering routers that get updated once in a blue moon drop to the bottom of the list.

I imagine the network administrators dealing with BGP are the Rayban wearing, rollerblading kind who don't need no config or change approval :cool:

You must have worked in telecoms.

“Should we test this in the network test bed or on a small cluster of quiet cells in the middle of the night first?”
“Nah, it’ll be fine. Let’s do central London first during peak network load.”

A few moments later…

“Ah… ****! Roll it back and delete the system access logs so no-one knows it was us!”
 
Soldato
Joined
29 Dec 2014
Posts
5,780
Location
Midlands
I don't doubt it - I can just imagine a conversation going something like...

"We are going to deploy all this uber cool stuff to automate/security check/double and counter sign any config changes with auto regression and failover backups plus simulation"
"OK where do we start?"
"The devices we mess about with most"
And ergo, BGP peering routers that get updated once in a blue moon drop to the bottom of the list.

I imagine the network administrators dealing with BGP are the Rayban wearing, rollerblading kind who don't need no config or change approval :cool:

lol

I know a bunch of the network engineers at Facebook - a lot of them are straight-talking Russians, but funnily enough, right now none of them are talking :D

Facebook's edge network is pretty awesome, to be honest - they have over 160 PoPs globally and the DCs to go with it. I'm just having a hard time picturing a normal, scheduled change taking it all down at once for six hours... It's just weird.
 
Soldato
Joined
13 Apr 2013
Posts
12,399
Location
La France
lol

I know a bunch of the network engineers at Facebook - a lot of them are straight-talking Russians, but funnily enough, right now none of them are talking :D

Facebook's edge network is pretty awesome, to be honest - they have over 160 PoPs globally and the DCs to go with it. I'm just having a hard time picturing a normal, scheduled change taking it all down at once for six hours... It's just weird.

Overload test gone wrong? T-Mobile Germany once went down for almost 48 hours after an overload test on their shiny new Huawei HLRs went sideways on them.
 
Man of Honour
Joined
20 Sep 2006
Posts
33,991
Are we expecting Facebook to explain what went wrong? They're bound to have a bunch of angry shareholders and business customers. Or are they going to be able to hide in the dark like other 'cloud' companies do after a severe outage?
 