Hey... I intended this to be a quick post because the info will be in the RFO arriving shortly, but I've taken a bit more time to write something here too, seeing as writing techy hosting stuff is how I got my MoH status. It's just better when it's not about my own company's situation, but either way...
A certain company recently told me that downtime reports were false positives caused by lost wifi packets, even after I'd escalated the request. So, yeah... I understand any lack of trust in what a hosting company says.
Live chat say one of the DigitalOcean London servers is off, and has been since this morning. Apparently they have no idea when it will be back, and they don't fail over or have any contingency plans, so if the server is off then all customers on that server are off. Seems a poor setup to me.
The live chat and support ticket attitude was a total lack of concern or care, totally the opposite of how Dom was when I moved over.
I've read your live chat and I thought Nick was polite and apologetic? It's a bit unfair to paraphrase/assume things that simply weren't said... I know it's obviously frustrating when something is offline, especially if (hypothetically speaking...) you have also dealt with a company that may as well let Trump run their communications.
Maybe we just need to update the status page to reassure people that we have standby plans, but that actually implementing them is a pending decision. That invites more questions, though, and hence more workload, so it has its own issues.
- One thing to clear up slightly now there's more time: you asked whether emails would be 'rejected'. Technically no: they're queued (and will have been resent by now) by the sending mailserver, i.e. you won't lose any messages, but they wouldn't arrive until after the system was back. So Nick was right that email wouldn't be working until the system was fixed, if that makes sense. There's a rough sketch of that retry behaviour just after this list.
- You didn't actually ask about backups, but I see what you're getting at. 95% of the time, live migration tech on DO and GCP avoids lengthy problems because maintenance can be more or less transparent. But this was an obscure network card problem which in turn caused other issues, making any quick fix tricky.
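For anyone curious what 'queued, not rejected' looks like in practice, here's a rough Python sketch of what a sending mailserver does when the destination is down. The hostname and timings are made-up placeholders, and real MTAs like Postfix or Exim keep retrying for days before they bounce anything:

```python
import smtplib
import time

# Rough sketch of a sending mailserver's behaviour when the receiving
# server is down: keep the message queued and retry on a back-off
# schedule. Host, addresses and timings are illustrative placeholders.
def deliver_with_retries(sender, recipient, message,
                         host="mx.example.com", attempts=5, delay=300):
    for _ in range(attempts):
        try:
            with smtplib.SMTP(host, timeout=30) as smtp:
                smtp.sendmail(sender, recipient, message)
                return True                  # delivered once the host is back
        except (OSError, smtplib.SMTPException):
            time.sleep(delay)                # message stays queued, not lost
            delay *= 2                       # wait longer between attempts
    return False                             # only now would a bounce go back
```

So from the recipient's point of view, mail simply arrives late rather than disappearing.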
As for what happened... we had one server offline from 12:51 to 3:51pm because of a hardware failure, then the subsequent replacement and filesystem check. We keep several copies of all data, including twice-daily offsite backups, but the host cPanel server needs to be live and kicking too.
Restoring from scratch is possible, but it's a last resort because it still incurs some data loss on systems that are being actively used, even when most of the user-facing data is stored centrally.
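Purely to illustrate the cadence (this isn't our actual tooling, and the paths and offsite host are placeholders), the twice-daily offsite copies work roughly like this:

```python
import subprocess
from datetime import datetime, timezone

# Illustrative only -- not our real backup stack. Shows the shape of a
# twice-daily offsite copy: sync the local backup set to a separate
# location, timestamped so multiple copies are retained.
def offsite_backup(src="/backup/latest/",
                   dest="backups@offsite.example.com:/vault"):
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
    subprocess.run(
        ["rsync", "-a", "--delete", src, f"{dest}/{stamp}/"],
        check=True,  # fail loudly so a missed backup gets noticed
    )

# Scheduled twice a day (e.g. 02:00 and 14:00) to give the cadence above.
```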
I have a direct line to senior teams at DO, but until the engineer physically looking at the box has the answer and has fully fixed/replaced it, there isn't much else we can immediately decide.
I'll always follow the route of least data loss combined with the quickest fix time. I was about to begin a restore process in parallel, but fortunately we got a resolution first.
The last time I recommended a globally loadbalanced, industry-leading, ultra-reliable mail service (Gmail) to a client, Google had a meltdown the next day!
@visibleman yes, we can quickly spin up new instances (rough sketch below), and we generally run small instances to contain any disruption to a small set of people.
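As a hedged sketch of what 'spin up a new instance' means via DigitalOcean's public v2 API (the region, size and image values are just examples, not our actual build, and real provisioning also restores config and data on top):

```python
import os
import requests

# Sketch only: create a replacement droplet through DigitalOcean's v2 API.
# Assumes DO_TOKEN holds an API token; the spec values are example choices.
def spin_up_replacement(name):
    resp = requests.post(
        "https://api.digitalocean.com/v2/droplets",
        headers={"Authorization": f"Bearer {os.environ['DO_TOKEN']}"},
        json={
            "name": name,
            "region": "lon1",        # stay in London with the rest of the fleet
            "size": "s-2vcpu-4gb",   # small instances keep the blast radius small
            "image": "ubuntu-22-04-x64",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["droplet"]["id"]
```

Keeping instances small is the design choice that matters here: a single box failing affects a small set of people rather than everyone.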
Hope that helps explain things and sorry to anyone else affected too