In light of the recent failures I though I would share my findings based on some AWS investigations I have carried out. There seems to have been a mind shift on how we provision and deploy infrastructures in the new Cloud world which has lead to some poor decisions being made.
With the increasing adoption of AWS to build and maintain applications and databases it is becoming commonplace to view a virtual server as a commodity that can be created and thrown away as required. Before AWS when servers where purchased as physical machines and hosted in dedicated datacentres or co-location suites every part of the hardware was investigated before going live to ensure there was no single point of failure
- Are the hard disk’s mirrored?
- Is there a separate power supply and is each power supply connected to a separate power sources
- How may NIC cards are there and are they connected to a different switch?
In the new world of AWS all that is needed is the press of a few buttons and a server appears with some storage attached to it. Very little thought is put into what happens when things fail, the common assumption is Amazon will take care of everything and if something fails we can always start a new instance.
What happens if you can’t start a new instance? In recent AWS failures there has been a rush to allocate new instances and Amazon has had to throttle requests to enable it to cope with the load.
22nd Oct 1:02 PM PDT We continue to work to resolve the issue affecting EBS volumes in a single availability zone in the US-EAST-1 region. The AWS Management Console for EC2 indicates which availability zone is impaired. EC2 instances and EBS volumes outside of this availability zone are operating normally. Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery. Customers receiving this error can retry failed requests.
(from the AWS status page)
I’m using Multi-AZ RDS that will keep my data available.
Multi-AZ RDS instances are within the same Amazon region. A failure of a region could affect both the primary and backup instances of your data. If the backup instance is available a failover to it can take a least 3 minutes possibly a lot long depending on the time taken to perform a crash recovery on the data. If this failover is part of a large outage this crash recovery could take longer than you expect because of high loads with other instances failing over.
I’m snapshotting the EBS volumes so I will be able to recreate the databases from that?
EBS snapshots are stored in EBS, in both of the recent big outages EBS failures have been one of the main root causes. Your backup may not be there when you are need it and if it is you may not be able to create a new volume from it because of EBS load or problems.
I’ve got replication Slaves in other regions I can fail over to those
Manually failing over to slaves is a complex process. Depending on your replication topology you could have slaves with different numbers of transactions application to them. Before you can allow your application to start again you need to complete the following
- Ensure all the remaining slaves are in a consistent point. Find the slave with the highest number of transactions applied to it and manually apply missing transactions to the remaining slaves until the are all consistent
- Manually reconfigure the replication topology and ensure transactions are flowing down to all the slaves
- Manually reconfigure your application to point to the new master.
While doing this your application is down. The length of time to do this will depend on the amount of data involved and the complexity of your replication topology.
What solutions exist to help
Obviously I’m biased toward Continuent Tungsten as they pay my wages but the reason I work for them is because I believe it is one of the best solutions available and it works. There are others out there MHA, Galera etc which I have used but I believe simple managed Async-replication is at the moment the best solution.