Monday 24 December 2018

#livingInTheFuture: Things I couldn't do x years ago . . .

I've made a few tweets with the hashtag of #livingInTheFuture

This was inspired by some books I had when I was growing up which showed visions of what the future would be like in 20 or 30 years' time. They showed everyone wearing spacesuit-like jumpsuits with jetpacks, but also some prophetic visions such as 'huge TVs that can hang on your wall like paintings' or 'personal communicators that you can carry with you'.

Even things from Star Trek have come to pass - like talking to the computer or the personal comms device (again). There are problems with this, as our current technology in some areas has overtaken that of The Original Series (TOS) with Kirk and Spock - why do they need a comms panel on the wall of the ship, for example? No mobiles?

So, as I was paying with my smart watch at a drive-through, sitting in my electric hybrid car, I thought 'I wouldn't have been able to do this a few years ago', so:

It's now 2018 - what can I do now, tech-wise, that I couldn't have done in 2017? 2016? 2015?

(This is based on either roughly when these things were 'available to adopt' for regular people in the UK - or alternatively when I got one. Some of these may be biased towards certain vendors!)

Any other favourites? Comment away . . .

2018 - Unlock my hotel room with my phone without needing a key - Hello Hilton Digital Key
2017 - Drive to and from the dog-walking downs in an electric hybrid car using no petrol - Hello Mitsubishi Outlander PHEV
2016 - Use VR (easily) at home: Hello Sony PSVR
2015 - Pay for purchases with my phone or watch - Hello Apple Pay
2014 - Control the lights and heating in my house remotely using my phone - hello HomeKit
2013 - Log into my phone using my thumb - hello Touch ID
2012 - Watch TV and play games in 3D - Thanks LG and Sony PlayStation 3
2011 - Talk to my smartphone rather than through it - Hello Siri
2010 - Chat to the entire world 140 characters at a time - Hello Twitter
2009 - Use my mobile phone as an airline boarding pass - Cheers, British Airways
2008 - Read a book on a portable electronic device with eInk - Hello Amazon Kindle
2007 - Watch TV that I missed by downloading it from the internet - Hello iPlayer
2006 - Play games wirelessly, standing up using the whole body - Hello Nintendo Wii
2005 - Have my car navigate for me - hello mainstream SatNav/GPS
2004 - Use reliable two-way video calling over the internet with no charges - hello Skype
2003 - Be unable to cross the Atlantic on a commercial flight in less than 3 hours - goodbye Concorde.
2002 - Keep most of my music on a small portable player - goodbye CDs, hello iPod
2001 - Look things up on a user-generated online encyclopaedia - hello Wikipedia!

HA is NOT DR - no, really it isn't!

"Let's sort out the NFRs of this system!"
"OK, First we do HA/DR"
"Errr . .which one do you actually want to do first - HA or DR?"

Garratt's 1st Law of availability: HA is not DR
(Garratt's 2nd Law of availability: DR is not HA)

HA=High Availability

'Availability' is a measure of when the system is available. If you try to use the system (by making a request to it), then it's available if it can take your request.

'Taking' the request can mean either processing it immediately, or accepting it and processing it later. More on this below.

Usually, HA means that if one (or more) components of the system stop working, or are lost/destroyed or are taken out of service for maintenance, the system carries on running and therefore the availability of the system (the proportion of time it is available vs the proportion of time when it isn't) is high.
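
To put a number on 'the proportion of time', here's a minimal sketch (plain Python, with illustrative downtime figures of my own rather than targets from any real system) of availability expressed as a percentage of the year:

```python
# Availability = time the system could take requests / total time.
# Illustrative figures only.

HOURS_PER_YEAR = 24 * 365

def availability(downtime_hours: float) -> float:
    """Availability as a fraction of the year, given total downtime."""
    return (HOURS_PER_YEAR - downtime_hours) / HOURS_PER_YEAR

# Roughly: 87.6h down = 99% ('two nines'), 8.76h = 99.9%, 0.876h = 99.99%
for downtime in (87.6, 8.76, 0.876):
    print(f"{downtime:>6.3f} hours down per year -> {availability(downtime):.3%} available")
```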

DR=Disaster Recovery

This is how you recover from a disaster which affects your system. More detail on what a 'disaster' might be is given below.

A disaster may be the loss of many components, or of one large 'component', due to physical factors, e.g. a flood or fire. (The usual example is 'what if a plane lands on the data centre?' - this is not a useful example as it hardly ever happens; floods, fires and power outages are much more likely.)

A disaster may also be data corruption (deliberate or accidental), someone deploying the wrong version of an update, or some other non-physical cause.

These are not the same thing - HA is not DR!

In one sentence: "A disaster is something that happens to your system that HA cannot recover from".

Assignment: Compare and Contrast HA and DR

To try to bring out some more examples, below are a number of differences between the two...

HA is usually active/active. DR may need to be active/passive

Everyone wants active/active. Why would you not want instant recovery? Why wait for the recovery site to 'recover from cold'?

Let's consider the following situations, in an active/active setup, that HA cannot recover from autonomously.

  • Replicated corruption
We have two copies of our data, one on each site. This protects us against loss of a site. If we make a change to data on site 1, the change is copied instantly to site 2.

If we corrupt the data on site 1, the corruption is copied instantly to site 2. Now both sites are corrupt - how do we recover?

Someone makes a bad software update on site 1. The change is copied to site 2. Now neither site will start up. What now?
  • Split-Brain (Dissociation)
We have two copies of our data, one on each site. This protects us against loss of a site. If we make a change to data on site 1, the change is copied instantly to site 2. If we make a change on site 2, this is copied instantly to site 1.

Now let's say we lose the link between sites 1 and 2. We make a change to customer #123 on site 1. We now have two different copies of the data. Which is right? We then make another change to the same customer on site 2. Which is right now?

When we restore the link - which side is 'right'? We effectively have data corruption. How do we recover?

At this point, we have corrupted data or a corrupted configuration on both sites. We have nowhere to go.

If we had an offline copy or 'last known good' then we can shut down the 'live' system and move to the 'last known good' one. This may take some time to start up the 'passive' copy, but it's a lot easier than trying to fix corrupted data!
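
As a toy illustration of the split-brain problem above (plain Python with made-up customer data, not a real replication protocol), here are two 'sites' diverging once the link between them is lost:

```python
# Two sites with instant replication while the link is up.
site1, site2 = {}, {}
link_up = True

def write(local, remote, key, value):
    local[key] = value
    if link_up:
        remote[key] = value   # change is copied instantly to the other site

write(site1, site2, "customer_123", "1 High Street")
assert site1 == site2         # both sites agree

link_up = False               # the inter-site link fails

write(site1, site2, "customer_123", "2 Mill Lane")    # change made on site 1
write(site2, site1, "customer_123", "9 The Green")    # change made on site 2

# When the link is restored, which value is 'right'? Neither site can tell -
# this is effectively data corruption, and HA alone cannot sort it out.
print(site1["customer_123"], "vs", site2["customer_123"])
```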

HA is usually automatic/autonomous, DR should have human intervention

The usual way of implementing HA is to have some redundant duplicates of components, for example application servers. If you need 6, have 7 or 8. Balance the load across all of them. If you lose one (or two) then the rest pick up the load. The load balancer will detect that one is 'down' and will not send requests to it.

Monitoring software can detect if a component is 'down' (e.g. if it has crashed) and can attempt to restart it. In this case, the load balancer will detect that it is 'back up' and route requests to it again.

All of this happens automatically. Even at 3am. Most of the time, the users will not even be aware.
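
As a sketch of that 'detect and route around' behaviour (hypothetical server names and a /health endpoint I'm assuming exists - not any particular load balancer product):

```python
import urllib.request

# Hypothetical application servers sitting behind the load balancer.
APP_SERVERS = [
    "http://app1.internal:8080",
    "http://app2.internal:8080",
    "http://app3.internal:8080",
]

def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """True if the server answers its (assumed) /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False          # crashed, unreachable or timed out

def healthy_servers() -> list[str]:
    """Requests are only ever routed to servers that pass the check."""
    return [s for s in APP_SERVERS if is_healthy(s)]

# A failed server simply drops out of this list - and reappears once
# monitoring has restarted it. No human needed, even at 3am.
print(healthy_servers())
```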

DR is a lot more 'visible'. For things to be bad enough that a DR situation occurs, people have usually been impacted (although not always).

Often the system as a whole has been 'down' for a period, or is operating in an impacted state, e.g. running slowly, not offering all functions, or 'lose one more component and we're offline'.

At this point someone needs to say 'Invoke the DR plan'. This can be obvious ('The police just called - Data Centre 1 is flooded') or a judgement call ('If we can't fix the database in the next 30 minutes, we will invoke DR').

The decision to invoke the DR plan is usually taken by a human. Many of the actions needed to invoke the DR plan rely on human actions as well.

HA usually has no user impact. DR may be visible to the users.

When a component fails in an HA system, requests are routed to other components and the system carries on. When components are maintained/patched/upgraded, they are done one-at-a-time so that the rest can carry on processing requests. Users are unaware of this.

In a disaster situation, it's usually visible. Requests cannot be processed (or not all requests can be processed). Responses may be incorrect. The system may be behaving unpredictably. One reason for invoking DR is that the system is not seen to be behaving correctly and needs to be 'shut down before it can cause any more damage'.

HA is normally near-instantaneous. DR takes time (RTO)

HA is usually achieved by re-routing requests away from failed components to active ones. These components are 'hot standby' or 'active/active' redundant. There is effectively no delay.

In a DR situation, requests may not be processed for some time - usually a small number of hours.

DR usually has a 'Recovery Time Objective' or 'RTO', which is the target for how long it takes from the system going down to it being recovered.

HA does not usually result in data loss. DR might (RPO)

HA often switches requests between multiple redundant components. These components may have multiple copies of the application data. If one fails, others have copies. There is no data loss.

In a DR situation, there may be data loss. If data is copied asynchronously between sites, there may be a small amount of lost data (e.g. subsecond) if the primary site is lost. 

Where DR is invoked due to data corruption, the system may be rolled back to a 'last known good' data point which may be minutes or even hours ago.

In either of these cases, the system has a 'Recovery Point Objective' or 'RPO' which is the state or point to which the system is recovered.

This might be a time (usually equivalent to the state in which the system was last backed up) - for example: "System will be restored to the last backup. Backups are taken on an hourly basis". 

It might also be expressed in terms of data e.g. 'Last committed transaction' where transactions are synchronously replicated across sites.
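
To put some (made-up) numbers on RPO, here's a small sketch comparing the two recovery points above - restoring the last hourly backup versus failing over to an asynchronous replica:

```python
from datetime import datetime, timedelta

# Illustrative figures only.
failure_time = datetime(2018, 12, 24, 14, 47)

# Recovery point 1: restore the last hourly backup (taken on the hour).
last_backup = failure_time.replace(minute=0, second=0, microsecond=0)
print("Data lost restoring the backup:   ", failure_time - last_backup)  # 0:47:00

# Recovery point 2: fail over to an asynchronous replica lagging ~2 seconds.
replica_lag = timedelta(seconds=2)
print("Data lost failing over to replica:", replica_lag)                 # 0:00:02
```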

Due to the seriousness of rolling back to the last backup and the resulting data loss, many organisations plan to fix corruption using a 'fix forward' approach, where the corrupted data is left in the system and gradually corrected in place. The corrections are kept in the system and are audited.

HA situations can happen regularly - DR never should.

If systems are built at large scale, individual components will fail. There is a 'mean time between failures' for most hardware components which predicts the average time a component will last before failing. Things like disc drives just wear out. We plan for these with redundant copies of components and we replace them when they fail. HA approaches mean that our users don't see these failures.
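
A quick back-of-the-envelope sketch (with an assumed MTBF figure, not a vendor specification) of why component failure is routine once you have enough components:

```python
# With enough components, individual failures become a statistical certainty.
mtbf_hours = 1_000_000            # assumed mean time between failures per drive
fleet_size = 1_000                # drives in the estate
hours_per_year = 24 * 365

expected_failures_per_year = fleet_size * hours_per_year / mtbf_hours
print(f"Expected drive failures per year: {expected_failures_per_year:.1f}")  # ~8.8
```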

DR is something we hope will never happen. It's like having insurance. No-one really plans for a data centre to burn down once every 10 years, or for their systems to be hacked or infected with a virus. Like insurance though, we have DR provisioning because we never know ...

With HA, the system appears unaffected. In a DR situation, things might not be 100% 'Business as Usual'

HA usually recovers from a failure of a component. DR often recovers from the failure of a system.

If you have 10 components and you add 2 for HA, that doesn't cost too much. Building a whole second data centre with a copy of all your components is expensive. So you may want to look at other approaches when in 'DR' mode.

Remember: DR should never happen. And if there is a good reason (fire/flood), it's perfectly acceptable to tell your customers 'Look, we've had a disaster, we are in recovery'.

If your house burnt down, you'd tell people you were living in a hotel and that you couldn't have them over for dinner, wouldn't you?

  • Reduced Availability
Simply put, this means that your system in DR state cannot process requests as quickly, or cannot process as many requests. It may be that your live system has 10 servers but your DR system only has 5 (remember - you don't expect it to ever be invoked).

  • Alternate Availability
This is where not all services are available as usual and you have made alternative arrangements.

For example 'We can't offer account opening on-line at the moment. Please contact your local bank branch'.

From an IT point of view, changes can be made. Following 9/11, the BBC put out a 'reduced graphics' version of their site, with the emphasis on text and information rather than video and graphics, because their servers were overloaded by the number of people wanting information.
  • Deferred Availability
The 'Thanks for your request - we'll process it in a while and come back to you' approach.

This is where queueing mechanisms or similar are used. The system cannot handle all requests in real time, but may be able to process some overnight when demand is low. It may be that you can't print the event tickets out immediately, but you can send copies by email the next day, for example.
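
As a minimal sketch of the deferred approach (a toy in-process queue with invented messages - in practice this would be a durable message queue or similar):

```python
import queue

requests = queue.Queue()

def accept(request: str) -> str:
    """Take the request now; don't process it yet."""
    requests.put(request)
    return "Thanks - we'll email your tickets tomorrow."

def overnight_batch():
    """Work through everything queued up, when demand is low."""
    while not requests.empty():
        print("Processing deferred request:", requests.get())

print(accept("print tickets for booking #42"))
overnight_batch()
```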