A lightning strike in Dublin knocked Amazon and Microsoft data centres offline for a few hours, and it took some time to get all the services restored.
Although the outage affected Netflix, Foursquare and a few others, Druva cloud services were thankfully completely unaffected. Here is a short note on how we managed to keep our promised SLA.
I think it is plain ignorance or poor planning to assume 100% availability of the underlying infrastructure. Just like any hardware, the AWS infrastructure is prone to failures, but knowing the potential failure points can help improve availability.
Since it is a backup service, we have divided our cloud design into three parts based on their availability and durability guarantees:
And here are some design changes we incorporated to avoid downtime:
We certainly paid more for improved availability, but there are simple design changes that can help save money as well. For example, the 3-way replication increased our compute (EC2) cost by over 200%, but because of the extra spare capacity we could increase the data stored per instance, which was earlier restricted to maintain a good cache-vs-on-disk ratio.
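To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not our actual cost model: the function names, the 1.5x density gain and the 99% per-instance availability are all hypothetical, and the availability formula assumes replicas fail independently, which a real outage such as a whole data centre going dark may not respect.

```python
def relative_cost_per_tb(replicas: int, density_gain: float) -> float:
    """Compute cost per stored TB, relative to a single-copy baseline.

    replicas:     number of instances serving each piece of data
                  (3 for 3-way replication)
    density_gain: factor by which data stored per instance grows
                  (1.0 means unchanged); hypothetical parameter
    """
    if replicas < 1 or density_gain <= 0:
        raise ValueError("replicas must be >= 1 and density_gain > 0")
    # More replicas multiply the instance count; packing more data
    # onto each instance divides it again.
    return replicas / density_gain


def replicated_availability(single: float, replicas: int) -> float:
    """Probability that at least one replica is up.

    Assumes replica failures are independent (an idealisation).
    """
    if not 0.0 <= single <= 1.0 or replicas < 1:
        raise ValueError("single must be in [0, 1] and replicas >= 1")
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # 3-way replication with unchanged density: 3x cost, a 200% increase.
    print(relative_cost_per_tb(3, 1.0))
    # Same replication, but each instance holds 1.5x the data (assumed).
    print(relative_cost_per_tb(3, 1.5))
    # Three replicas that are each up 99% of the time (assumed).
    print(replicated_availability(0.99, 3))
```

With unchanged density, three replicas cost three times the baseline, which is the 200% increase mentioned above. If each instance could hold 1.5 times as much data, the same replication would cost only twice the baseline per stored TB, so part of the extra spend is recovered.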