High availability across physical sites without a single point of failure

Associate · Joined 18 Nov 2008 · Posts 2,430 · Location Liverpool
I'm researching a project that needs a web-hosting solution with the highest possible uptime, without fail. That means cross-site servers are a must, and due to the nature of the project an active-passive setup is the most likely: one site is responsible for hosting until a fault occurs or a manual intervention is made, at which point the passive site takes over and the two swap roles.

The problem I have is that the load balancer is going to be a single point of failure. Now, DigitalOcean did a great article on this, but unfortunately solved the problem using a "floating IP", which is a problem for two reasons:
  • It's a product, not a technology, so we'd be tied to a provider of said floating IP
  • The service responsible for dealing with the floating IP still represents a single point of failure

That said, their diagram did perfectly illustrate what I'm trying to achieve (albeit with an internal IP, whereas mine would be public):
[Image: ha-diagram-animated.gif]


One other common solution is to have two (or more) load balancers, with a DNS record for each site attached to domain.com, so that every user has the chance to try every load balancer if one goes down. However, this seems to have its own problems (a rough sketch of the client-side fallback it relies on follows the list):
  • Some clients will not rotate to the next IP if the first fails
  • The timeout before a client fails over to a second IP can be large
  • This approach puts a lot of responsibility on the user
  • If a change needs to be made, the TTL of the DNS record becomes important
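
To put that client-side fallback into concrete terms, here's a rough Python sketch of what a well-behaved client would have to do with multiple A records (the host name, port and 5-second timeout are illustrative, not taken from any particular client):

```python
# Rough sketch: walk every address published for a name and fall back to the
# next one when a connection attempt fails.
import socket

def connect_any(host: str, port: int = 443, timeout: float = 5.0) -> socket.socket:
    """Try each address DNS returns for `host`, in order, until one connects."""
    last_err = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)  # without this, a dead IP can stall for a long time
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            last_err = err  # move on to the next published address
    raise ConnectionError(f"all addresses for {host} failed") from last_err

# connect_any("domain.com")  # many real clients never bother going down the list
```

Browsers and OS resolvers each implement (or skip) this retry logic differently, which is exactly why so much responsibility ends up with the user.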

Finally, another option is to use a virtual IP. However, because the redundant server(s) need to be in a different geographical area, the same-subnet limitation of virtual IPs makes this approach unsuitable.


Am I missing a fundamental, commonly known approach to a cross-site setup with no single point of failure? Thanks in advance :)
 

Deleted member 138126

You need two pairs of load balancers, and the load balancers need to support GSLB.

Pairs, in case one load balancer fails: each pair is configured for HA so the second unit can take over almost instantly.

GSLB uses DNS to give you the floating-IP functionality without an actual floating IP (stretching a public IP between two data centres is going to be extremely difficult and expensive). You delegate a subdomain of your domain (e.g. gslb.domain.com) to the load balancers, and then within the load balancer config you create a record (e.g. website.gslb.domain.com) with an IP at each datacentre. The load balancers act as the DNS servers for that subdomain, and the GSLB algorithm looks at the source of the traffic and decides which datacentre's address to hand back (it's intended to provide localised content, i.e. if the client is in Europe, send them to the European server). You can also say "always send it to primary, unless it is down, in which case send it to secondary".
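
To make that "primary unless it's down" policy concrete, here's a toy Python sketch of the decision a GSLB responder makes per query. The IP addresses, the port-443 TCP probe and the helper names are all made up for illustration; real GSLB health checks are far richer.

```python
# Toy sketch of a "primary unless it's down" GSLB answer for
# website.gslb.domain.com. Addresses and the TCP probe are illustrative only.
import socket

SITES = [
    ("primary",   "198.51.100.10"),  # datacentre A
    ("secondary", "203.0.113.10"),   # datacentre B
]

def healthy(ip: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Crude TCP health check; a real load balancer probes HTTP(S), scripts, etc."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def answer_for_query() -> str:
    """The A record the GSLB responder would hand back right now."""
    for _name, ip in SITES:
        if healthy(ip):
            return ip
    return SITES[0][1]  # both down: answer primary and let the client fail

# The answer is served with a very low TTL so clients re-ask shortly after a failover.
```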

Finally, you replace your website DNS A record with a CNAME that points to website.gslb.domain.com, and you make sure it has a low TTL.
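
Once it's in place you can sanity-check the chain and the TTLs, e.g. with the third-party dnspython package (the names below are the example ones from this post, with www.domain.com standing in for your website record; they're not real):

```python
# Check the CNAME points at the delegated GSLB name and that the TTLs are low.
# Requires the third-party dnspython package: pip install dnspython
import dns.resolver

cname = dns.resolver.resolve("www.domain.com", "CNAME")
print("CNAME ->", list(cname)[0].target, "TTL:", cname.rrset.ttl)

a = dns.resolver.resolve("website.gslb.domain.com", "A")
print("A records:", [r.address for r in a], "TTL:", a.rrset.ttl)

# If either TTL is up in the hundreds of seconds, failover will be at least that
# slow for clients whose resolvers honour it (and slower for the ones that don't).
```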

A10 load balancers can do all of this; they are really good and extremely cheap for what you get.
 
Associate (OP) · Joined 18 Nov 2008 · Posts 2,430 · Location Liverpool
Hmm, so what's the failover procedure in the event of the primary load balancer failing? How does the client know to look for the secondary balancer?
 
Associate · Joined 19 Sep 2014 · Posts 630
But does that not then have the same issues as with any DNS based solution (caching of DNS records, slow TTLs, client side oddities etc.)?

Some. You tend to use a very low TTL with a DNS failover solution. Some resolvers will ignore it, though.

There's no other practical solution, to be honest; this is what most people do.
 
Associate · Joined 20 Oct 2002 · Posts 318 · Location UK
Disclaimer: I've only skimmed through this...

pfSense CARP deals with it in a slightly different way, although you will need three public IP addresses: each HA instance has its own dedicated IP, and the floating IP is picked up by whichever one is active. It's explained better here than on the pfSense site...
https://www.howtoforge.com/how-to-configure-a-pfsense-2.0-cluster-using-carp
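
For the curious, the takeover decision is roughly this (a hedged Python sketch of the CARP/VRRP idea, not pfSense's actual implementation; the interval and dead-time values are illustrative):

```python
# Sketch of the CARP/VRRP decision: the backup promotes itself and claims the
# floating IP if it stops hearing the master's advertisements.
import time

ADVERT_INTERVAL = 1.0             # master advertises roughly once a second
DEAD_AFTER = 3 * ADVERT_INTERVAL  # miss ~3 adverts and the master is presumed dead

def should_become_master(last_advert_heard: float) -> bool:
    """Backup node: promote once the master has been silent for too long."""
    return (time.monotonic() - last_advert_heard) > DEAD_AFTER

# On promotion the backup claims the shared address (gratuitous ARP and so on),
# which only works when both nodes sit on the same subnet.
```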

edit: sorry, I've read it this time and had missed the requirement for two data centres, you can ignore my message now ;)
 

Deleted member 138126

Damn, was really hoping there would be something I was missing.

Oh well, thanks for the help :)

You have two load balancers per site, you should have at least two front-end servers per site, and if it's that important you should have redundant data centre Internet feeds, so how often exactly is this failure going to occur? You have to be very careful when designing resilient systems that you aren't engineering for things that will rarely happen, because you end up introducing a lot of unnecessary complexity that creates a whole new set of problems you aren't aware of.

I've seen the Google site 404 for crying out loud. So if Google can fail from time to time, you should be designing the simplest solution that provides an agreed and realistic RTO (Recovery Time Objective) in the event of something as major as either a data centre loss, or at least the total loss of connectivity to that data centre.

A low DNS TTL is in my opinion a pretty valid trade-off in this situation.
 

Deleted member 138126

But does that not then have the same issues as with any DNS based solution (caching of DNS records, slow TTLs, client side oddities etc.)?

It is DNS-based, but the DNS is being actively managed by the load balancers, i.e. no human intervention.
 
Associate · Joined 27 May 2014 · Posts 1,160 · Location Surrey
I'm with rotor on this.

You need to understand the cost of the 9's versus the cost of downtime.

It's great to want 99.9999% uptime, but does the cost warrant the expenditure when 99.99% would probably do?
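
Just to put numbers on it, a quick Python back-of-envelope (leap years ignored):

```python
# Downtime budget per year for each level of availability.
MINUTES_PER_YEAR = 365 * 24 * 60

for label, availability in [("three nines", 0.999), ("four nines", 0.9999),
                            ("five nines", 0.99999), ("six nines", 0.999999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: ~{downtime:.1f} minutes of downtime allowed per year")

# three nines ~525.6 min, four ~52.6, five ~5.3, six ~0.5: every extra 9 buys
# 90% less allowable downtime and typically costs a lot more engineering.
```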

And yup, I've seen a Google 404 as well, lol
 
Associate · Joined 19 Sep 2014 · Posts 630
If you're not familiar with them, look up the definitions of terms such as RTO and RPO. This is how you define your disaster recovery requirements, and from those you design the solution that meets them.

Typically an organisation will create bands of DR capability, each with an associated RTO/RPO value, for example:

Not important: RPO = 24 hours, RTO = 7 days
Fairly important: RPO = 2 hours, RTO = 24 hours
Important: RPO = <5 minutes, RTO = 4 hours
Critical: RPO = 0, RTO = 0

Now there are various ways you would associate a 'band' with your system, but usually in larger organisations it comes down to cost, e.g. how much money per second/minute/hour/day you are going to lose if the system goes down. The type and amount of infrastructure you need as you reduce your RTO/RPO increases in cost and potentially complexity, so the initial capital expenditure and potential ongoing revenue costs have to cover the perceived risk.
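
As a purely hypothetical illustration of that trade-off (the RTOs come from the bands above, but every cost figure and outage rate below is made up):

```python
# Hypothetical numbers: pick the cheapest band once you add the yearly
# infrastructure cost to the worst-case cost of an outage (downtime cost x RTO).
BANDS = [
    # (name, RTO in hours, illustrative yearly infrastructure cost)
    ("Not important",    7 * 24, 5_000),
    ("Fairly important", 24,     20_000),
    ("Important",        4,      80_000),
    ("Critical",         0,      400_000),
]

def cheapest_band(downtime_cost_per_hour: float, outages_per_year: float = 1.0):
    def total_cost(band):
        _name, rto_hours, infra_cost = band
        return infra_cost + downtime_cost_per_hour * rto_hours * outages_per_year
    return min(BANDS, key=total_cost)

print(cheapest_band(downtime_cost_per_hour=100))     # cheap system: a long RTO wins
print(cheapest_band(downtime_cost_per_hour=50_000))  # revenue-critical: pay for a low RTO
```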

So back to your OP and "highest possible availability": you can get an RTO/RPO of zero if you throw enough money at it. At least in theory.



Edit: as a real-world example, and something to get you thinking...

Take the lowest band above, "Not important": an RTO of 7 days pretty much still dictates that you have HA at every tier in the application stack, as it's highly unlikely you'd be able to replace a single point of failure (a single server) within 7 days. Unless, of course, you keep a cold standby in stock or have some kind of special agreement with a supplier to have an identical server on hand. Between raising a purchase order, getting approval for the spend, the vendor's SLA on delivering new servers, and getting someone to check/unpack, rack and stack, load the OS and all the other bits and bobs, most organisations would find 7 days a challenge.

I've worked in orgs where the lowest category was an RTO in excess of 30 days. In other words, if there's a big disaster, this junk is the last priority: all the more important systems will most likely take up to 30 days or more to get back online before the unimportant stuff is even thought about.
 
Caporegime · Joined 18 Oct 2002 · Posts 26,098
You're trying to make something really resilient and then sort of tying one arm behind your back by architecting something that is active/passive - i.e. you need to keep throwing resources at the primary if you want to scale.

Have a read of these to get an idea of architecting something that can cope with failure, rather than trying to prevent failures from occurring. It's relevant even if you don't want to use AWS:

https://d0.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf
https://d0.awsstatic.com/whitepapers/aws-web-hosting-best-practices.pdf

Also read https://www.amazon.co.uk/dp/149192912X/
 
Associate (OP) · Joined 18 Nov 2008 · Posts 2,430 · Location Liverpool
Interesting. The background for this project is healthcare, so there are certain scenarios where availability needs to be guaranteed. But I definitely take the point about not over-complicating the system in pursuit of the elusive five 9's.

Perhaps the DNS solution is the way to go, then. Now to find out exactly which implementation of it is the most appropriate.

Thanks for all the information guys. And for the book recommendation Caged :)
 
Associate · Joined 19 Sep 2014 · Posts 630
You're never going to get five 9's across two data centres. The internet is too unpredictable and it's out of your control.
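
Rough numbers on why (a Python back-of-envelope; the 60-second TTL and 30-second detection time are assumptions, not measurements):

```python
# How many DNS-based failovers fit inside a five-nines downtime budget.
SECONDS_PER_YEAR = 365 * 24 * 3600
five_nines_budget = SECONDS_PER_YEAR * (1 - 0.99999)  # ~315 seconds per year

ttl = 60         # clients keep resolving the dead site for up to this long
detection = 30   # health checks noticing the failure and switching the answer
per_failover = ttl + detection

print(f"budget ~{five_nines_budget:.0f} s/year; each failover burns ~{per_failover} s; "
      f"~{five_nines_budget / per_failover:.1f} failovers and the year's budget is gone")
```

And that's before counting anything the internet in between does to you.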
 
Soldato · Joined 4 Dec 2002 · Posts 3,941 · Location Bourne, Lincs
Sorry if this has been mentioned and I missed it, but what about your WAN links? Are they redundant? While they might be from different ISPs, are they using BT and Virgin for the last-mile fibre?

Does the fibre come in at different entry points and run to the demarc via different routes, etc.?
 