Monday 24 December 2018

#livingInTheFuture: Things I couldn't do x years ago . . .

I've made a few tweets with the hashtag of #livingInTheFuture

This was inspired by some books I had when I was growing up which showed visions of what the future would be like in 20 or 30 years' time. They showed everyone wearing spacesuit-like jumpsuits with jetpacks, but also some prophetic visions such as 'huge TVs that can hang on your wall like paintings' or 'personal communicators that you can carry with you'.

Even things from Star Trek have come to pass - like talking to the computer or the personal comms device (again). There are problems with this, as our current technology in some areas has overtaken that of The Original Series (TOS) with Kirk and Spock - why do they need a comms panel on the wall of the ship, for example? No mobiles?

So, as I was paying with my smart watch at a drive-through, sitting in my electric hybrid car, I thought 'I wouldn't have been able to do this a few years ago', so:

It's now 2018 - what can I do now, tech-wise, that I couldn't have done in 2017? 2016? 2015?

(This is based on either roughly when these things were 'available to adopt' for regular people in the UK - or alternatively when I got one. Some of these may be biased towards certain vendors!)

Any other favourites? Comment away . . .

2018 - Unlock my hotel room with my phone without needing a key - Hello Hilton Digital Key
2017 - Drive to and from the dog-walking downs in an electric hybrid car using no petrol - Hello Mitsubishi Outlander PHEV
2016 - Use VR (easily) at home: Hello Sony PSVR
2015 - Pay for purchases with my phone or watch - Hello Apple Pay
2014 - Control the lights and heating in my house remotely using my phone - hello HomeKit
2013 - Log into my phone using my thumb - hello Touch ID
2012 - Watch TV and play games in 3D - Thanks LG and Sony PlayStation 3
2011 - Talk to my smartphone rather than through it - Hello Siri
2010 - Chat to the entire world 140 characters at a time - Hello Twitter
2009 - Use my mobile phone as an airline boarding pass - Cheers, British Airways
2008 - Read a book on a portable electronic device with eInk - Hello Amazon Kindle
2007 - Watch TV that I missed by downloading it from the internet - Hello iPlayer
2006 - Play games wirelessly, standing up using the whole body - Hello Nintendo Wii
2005 - Have my car navigate for me - hello mainstream SatNav/GPS
2004 - Use reliable two-way video calling over the internet with no charges - hello Skype
2003 - Be unable to cross the Atlantic on a commercial flight in less than 3 hours - goodbye Concorde.
2002 - Keep most of my music on a small portable player - goodbye CDs, hello iPod
2001 - Look things up on a user-generated online encyclopaedia - hello Wikipedia!

HA is NOT DR - no, really it isn't!

"Let's sort out the NFRs of this system!"
"OK, First we do HA/DR"
"Errr . .which one do you actually want to do first - HA or DR?"

Garratt's 1st Law of availability: HA is not DR
(Garratt's 2nd Law of availability: DR is not HA)

HA=High Availability

'Availability' is a measure of when the system is available. If you try to use the system (by making a request to it), then it's available if it can take your request.

'Taking' the request can mean either processing it immediately, or accepting it and processing it later. More on this below.

Usually, HA means that if one (or more) components of the system stop working, or are lost/destroyed or are taken out of service for maintenance, the system carries on running and therefore the availability of the system (the proportion of time it is available vs the proportion of time when it isn't) is high.
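
To put a number on 'the proportion of time', here's a minimal sketch (plain Python, with illustrative downtime figures of my own rather than targets from any real system) of availability expressed as a percentage of the year:

```python
# Availability = time the system could take requests / total time.
# Illustrative figures only.

HOURS_PER_YEAR = 24 * 365

def availability(downtime_hours: float) -> float:
    """Availability as a fraction of the year, given total downtime."""
    return (HOURS_PER_YEAR - downtime_hours) / HOURS_PER_YEAR

# Roughly: 87.6h down = 99% ('two nines'), 8.76h = 99.9%, 0.876h = 99.99%
for downtime in (87.6, 8.76, 0.876):
    print(f"{downtime:>6.3f} hours down per year -> {availability(downtime):.3%} available")
```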

DR=Disaster Recovery

This is how you recover from a disaster which affects your system. More detail on what a 'disaster' might be is given below.

A disaster may be the loss of many components, or of one large 'component', due to physical factors, e.g. a flood or fire. (The usual example is 'what if a plane lands on the data centre?' - this is not a useful example as it hardly ever happens; floods, fires and power outages are much more likely.)

A disaster may also be data corruption (deliberate or accidental), someone deploying the wrong version of an update, or some other non-physical cause.

These are not the same thing - HA is not DR!

In one sentence: "A disaster is something that happens to your system that HA cannot recover from".

Assignment: Compare and Contrast HA and DR

To try to bring out some more examples, below are a number of differences between the two...

HA is usually active/active. DR may need to be active/passive

Everyone wants active/active. Why would you not want instant recovery? Why wait for the recovery site to 'recover from cold'?

Let's consider the following situations, in an active/active setup, that HA cannot recover from autonomously.

  • Replicated corruption
We have two copies of our data, one on each site. This protects us against loss of a site. If we make a change to data on site 1, the change is copied instantly to site 2.

If we corrupt the data on site 1, the corruption is copied instantly to site 2. Now both sites are corrupt - how do we recover?

Someone makes a bad software update on site 1. The change is copied to site 2. Now neither site will start up. What now?
  • Split-Brain (Dissociation)
We have two copies of our data, one on each site. This protects us against loss of a site. If we make a change to data on site 1, the change is copied instantly to site 2. If we make a change on site 2, this is copied instantly to site 1.

Now let's say we lose the link between sites 1 and 2. We make a change to customer #123 on site 1. We now have two different copies of the data. Which is right? We then make another change to the same customer on site 2. Which is right now?

When we restore the link - which side is 'right'? We effectively have data corruption. How do we recover?

At this point, we have corrupted data or a corrupted configuration on both sites. We have nowhere to go.

If we had an offline copy or 'last known good' then we can shut down the 'live' system and move to the 'last known good' one. This may take some time to start up the 'passive' copy, but it's a lot easier than trying to fix corrupted data!
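
As a toy illustration of the split-brain problem above (plain Python with made-up customer data, not a real replication protocol), here are two 'sites' diverging once the link between them is lost:

```python
# Two sites with instant replication while the link is up.
site1, site2 = {}, {}
link_up = True

def write(local, remote, key, value):
    local[key] = value
    if link_up:
        remote[key] = value   # change is copied instantly to the other site

write(site1, site2, "customer_123", "1 High Street")
assert site1 == site2         # both sites agree

link_up = False               # the inter-site link fails

write(site1, site2, "customer_123", "2 Mill Lane")    # change made on site 1
write(site2, site1, "customer_123", "9 The Green")    # change made on site 2

# When the link is restored, which value is 'right'? Neither site can tell -
# this is effectively data corruption, and HA alone cannot sort it out.
print(site1["customer_123"], "vs", site2["customer_123"])
```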

HA is usually automatic/autonomous, DR should have human intervention

The usual way of implementing HA is to have some redundant duplicates of components, for example application servers. If you need 6, have 7 or 8. Balance the load across all of them. If you lose one (or two) then the rest pick up the load. The load balancer will detect that one is 'down' and will not send requests to it.

Monitoring software can detect if a component is 'down' (e.g. if it has crashed) and can attempt to restart it. In this case, the load balancer will detect that it is 'back up' and route requests to it again.

All of this happens automatically. Even at 3am. Most of the time, the users will not even be aware.
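
As a sketch of that 'detect and route around' behaviour (hypothetical server names and a /health endpoint I'm assuming exists - not any particular load balancer product):

```python
import urllib.request

# Hypothetical application servers sitting behind the load balancer.
APP_SERVERS = [
    "http://app1.internal:8080",
    "http://app2.internal:8080",
    "http://app3.internal:8080",
]

def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """True if the server answers its (assumed) /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False          # crashed, unreachable or timed out

def healthy_servers() -> list[str]:
    """Requests are only ever routed to servers that pass the check."""
    return [s for s in APP_SERVERS if is_healthy(s)]

# A failed server simply drops out of this list - and reappears once
# monitoring has restarted it. No human needed, even at 3am.
print(healthy_servers())
```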

DR is a lot more 'visible'. For things to be bad enough that a DR situation occurs, people have usually been impacted (although not always).

Often the system as a whole has been 'down' for a period, or is operating in an impacted state, e.g. running slowly, not offering all functions, or 'lose one more component and we're offline'.

At this point someone needs to say 'Invoke the DR plan'. This can be obvious ('The police just called - Data Centre 1 is flooded') or a judgement call ('If we can't fix the database in the next 30 minutes, we will invoke DR').

The decision to invoke the DR plan is usually taken by a human. Many of the actions needed to invoke the DR plan rely on human actions as well.

HA usually has no user impact. DR may be visible to the users.

When a component fails in an HA system, requests are routed to other components and the system carries on. When components are maintained/patched/upgraded, they are done one-at-a-time so that the rest can carry on processing requests. Users are unaware of this.

In a disaster situation, it's usually visible. Requests cannot be processed (or not all requests can be processed). Responses may be incorrect. The system may be behaving unpredictably. One reason for invoking DR is that the system is not seen to be behaving correctly and needs to be 'shut down before it can cause any more damage'.

HA is normally near-instantaneous. DR takes time (RTO)

HA is usually achieved by re-routing requests away from failed components to active ones. These components are 'hot standby' or 'active/active' redundant. There is effectively no delay.

In a DR situation, requests may not be processed for some time - usually a small number of hours.

DR usually has a 'Recovery Time Objective' or 'RTO', which is the target for how long it takes from the system going down to it being recovered.

HA does not usually result in data loss. DR might (RPO)

HA often switches requests between multiple redundant components. These components may have multiple copies of the application data. If one fails, others have copies. There is no data loss.

In a DR situation, there may be data loss. If data is copied asynchronously between sites, there may be a small amount of lost data (e.g. subsecond) if the primary site is lost. 

Where DR is invoked due to data corruption, the system may be rolled back to a 'last known good' data point which may be minutes or even hours ago.

In either of these cases, the system has a 'Recovery Point Objective' or 'RPO' which is the state or point to which the system is recovered.

This might be a time (usually equivalent to the state in which the system was last backed up) - for example: "System will be restored to the last backup. Backups are taken on an hourly basis". 

It might also be expressed in terms of data e.g. 'Last committed transaction' where transactions are synchronously replicated across sites.
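
To put some (made-up) numbers on RPO, here's a small sketch comparing the two recovery points above - restoring the last hourly backup versus failing over to an asynchronous replica:

```python
from datetime import datetime, timedelta

# Illustrative figures only.
failure_time = datetime(2018, 12, 24, 14, 47)

# Recovery point 1: restore the last hourly backup (taken on the hour).
last_backup = failure_time.replace(minute=0, second=0, microsecond=0)
print("Data lost restoring the backup:   ", failure_time - last_backup)  # 0:47:00

# Recovery point 2: fail over to an asynchronous replica lagging ~2 seconds.
replica_lag = timedelta(seconds=2)
print("Data lost failing over to replica:", replica_lag)                 # 0:00:02
```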

Due to the seriousness of rolling back to the last backup and the resulting data loss, many organisations plan to fix corruption using a 'fix forward' approach, where the corrupted data is left in the system and gradually corrected in place. The corrections are kept in the system and are audited.

HA situations can happen regularly - DR never should.

If systems are built at large scale, individual components will fail. There is a 'mean time between failures' for most hardware components which predicts the average time a component will last before failing. Things like disc drives just wear out. We plan for these with redundant copies of components and we replace them when they fail. HA approaches mean that our users don't see these failures.
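
A quick back-of-the-envelope sketch (with an assumed MTBF figure, not a vendor specification) of why component failure is routine once you have enough components:

```python
# With enough components, individual failures become a statistical certainty.
mtbf_hours = 1_000_000            # assumed mean time between failures per drive
fleet_size = 1_000                # drives in the estate
hours_per_year = 24 * 365

expected_failures_per_year = fleet_size * hours_per_year / mtbf_hours
print(f"Expected drive failures per year: {expected_failures_per_year:.1f}")  # ~8.8
```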

DR is something we hope will never happen. It's like having insurance. No-one really plans for a data centre to burn down once every 10 years, or for their systems to be hacked or infected with a virus. Like insurance though, we have DR provisioning because we never know ...

With HA, the system appears unaffected. In a DR situation, things might not be 100% 'Business as Usual'

HA usually recovers from a failure of a component. DR often recovers from the failure of a system.

If you have 10 components and you add 2 for HA, that doesn't cost too much. Building a whole second data centre with a copy of all your components is expensive. So you may want to look at other approaches when in 'DR' mode.

Remember: DR should never happen. And if there is a good reason (fire/flood), it's perfectly acceptable to tell your customers 'Look, we've had a disaster, we are in recovery'.

If your house burnt down, you'd tell people you were living in a hotel and that you couldn't have them over for dinner, wouldn't you?

  • Reduced Availability
Simply put, this means that your system in DR state cannot process requests as quickly, or cannot process as many requests. It may be that your live system has 10 servers but your DR system only has 5 (remember - you don't expect it to ever be invoked).

  • Alternate Availability
This is where not all services are available as usual and you have made alternative arrangements.

For example 'We can't offer account opening on-line at the moment. Please contact your local bank branch'.

From an IT point of view, changes can be made. Following 9/11, the BBC put out a 'reduced graphics' version of their site, with the emphasis on text and information rather than video and graphics, because their servers were overloaded by the number of people wanting information.
  • Deferred Availability
The 'Thanks for your request - we'll process it in a while and come back to you' approach.

This is where queueing mechanisms or similar are used. The system cannot handle all requests in real time, but may be able to process some overnight when demand is low. It may be that you can't print the event tickets out immediately, but you can send copies by email the next day, for example.
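
As a minimal sketch of the deferred approach (a toy in-process queue with invented messages - in practice this would be a durable message queue or similar):

```python
import queue

requests = queue.Queue()

def accept(request: str) -> str:
    """Take the request now; don't process it yet."""
    requests.put(request)
    return "Thanks - we'll email your tickets tomorrow."

def overnight_batch():
    """Work through everything queued up, when demand is low."""
    while not requests.empty():
        print("Processing deferred request:", requests.get())

print(accept("print tickets for booking #42"))
overnight_batch()
```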