A lightning strike in Dublin knocked Amazon and Microsoft data centres offline for a few hours, and it took some time for all the services to be restored.
Although it affected Netflix, Foursquare and a few others, Druva's cloud services were thankfully completely unaffected. Here is a short note on how we managed to keep our promised SLA.
I think it is plain ignorance, or poor planning, to assume 100% availability of the underlying infrastructure. Like any hardware, the AWS infrastructure is prone to failures, but knowing these potential failure points can help improve availability.
Since ours is a backup service, we have divided our cloud design into three parts based on availability and durability guarantees (a rough sketch follows the list):
- Config (most available): Configuration data stored in Amazon RDS
- Meta-data: Druva dedupe file-system spanning Cassandra nodes
- Data (most durable): Stored in S3
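As a rough illustration of this split (the names and routing function below are hypothetical, not our actual code), the tiering can be thought of as a simple mapping from data class to backing store and the guarantee it is optimised for:

```python
# Illustrative sketch only: the mapping and names are hypothetical, not Druva code.
STORAGE_TIERS = {
    "config":   {"store": "Amazon RDS (multi-AZ)",          "optimised_for": "availability"},
    "metadata": {"store": "Cassandra (dedupe file-system)", "optimised_for": "availability"},
    "data":     {"store": "Amazon S3",                      "optimised_for": "durability"},
}

def store_for(data_class):
    """Return the backing store used for a given class of backup data."""
    return STORAGE_TIERS[data_class]["store"]

if __name__ == "__main__":
    for cls in ("config", "metadata", "data"):
        print(cls, "->", store_for(cls))
```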
And some design changes we incorporated to avoid downtime:
- Multi-zone replication: Both RDS and the Cassandra nodes are replicated across three availability zones. We run Cassandra in full-consistency mode and rely heavily on its self-healing in case of service failures (see the first sketch after this list).
- Reduced dependency on EBS: EBS is a software abstraction over underlying SAN storage, and two independent EC2 instances may share the same SAN for their EBS volumes. Given this shared failure domain, we shifted from EBS to local instance storage for meta-data.
- Extra spare copies in S3: We also maintain some extra redundancy on top of S3 for the most-referenced blocks. This is essentially to ride out the random (but infrequent) S3 time-outs and improve the durability of the most frequently referenced data (see the second sketch after this list).
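For the metadata tier, "full consistency" in practice means choosing consistency levels so that reads always see the latest write; with three replicas, QUORUM writes plus QUORUM reads (2 + 2 > 3) achieve this, and ALL is stricter still. A minimal sketch with the DataStax Python driver (keyspace, table and column names are hypothetical, and this is not our actual client code):

```python
# Sketch: explicit consistency levels against a 3-replica Cassandra cluster,
# one node per availability zone. Schema and data are hypothetical.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["10.0.1.10", "10.0.2.10", "10.0.3.10"])  # one contact point per AZ
session = cluster.connect("dedupe_fs")

# QUORUM write: at least two of the three zone replicas must acknowledge.
insert = SimpleStatement(
    "INSERT INTO block_index (block_hash, object_id, block_offset) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, ("9f86d081deadbeef", "file-42", 0))

# QUORUM read: together with QUORUM writes this guarantees read-your-writes,
# while still tolerating the loss of a single availability zone.
select = SimpleStatement(
    "SELECT object_id, block_offset FROM block_index WHERE block_hash = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(select, ("9f86d081deadbeef",)).one()
```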
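The extra S3 redundancy amounts to writing the hottest blocks under more than one key, so a read that times out on one copy can fall back to the other. A rough sketch with boto3 (the bucket name and key layout are made up for illustration):

```python
# Sketch: store a hot dedupe block under a primary and a spare key, and fall
# back to the spare copy if the primary read fails or times out.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, ReadTimeoutError

s3 = boto3.client("s3", config=Config(read_timeout=5, retries={"max_attempts": 1}))
BUCKET = "druva-blocks-example"  # hypothetical bucket

def put_block(block_hash, payload):
    # Write the block twice: the primary copy plus a spare for hot blocks.
    s3.put_object(Bucket=BUCKET, Key=f"blocks/{block_hash}", Body=payload)
    s3.put_object(Bucket=BUCKET, Key=f"spare/{block_hash}", Body=payload)

def get_block(block_hash):
    # Try the primary copy first; on an error or timeout, fall back to the spare.
    for key in (f"blocks/{block_hash}", f"spare/{block_hash}"):
        try:
            return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        except (ClientError, ReadTimeoutError):
            continue
    raise RuntimeError("both copies unavailable")
```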
We certainly paid more for the improved availability, but there are simple design changes that can help save costs as well. For example, the 3-way replication increased our compute (EC2) cost by over 200%, but because of the extra spare capacity we could increase the amount of data stored per instance, which was previously restricted to maintain a good cache-vs-on-disk ratio.
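A back-of-envelope illustration of that trade-off (the fleet sizes and capacities below are made up for the example, not actual Druva figures):

```python
# Hypothetical numbers to illustrate the trade-off, not actual Druva figures.
nodes_before         = 10    # single-copy metadata fleet
data_per_node_before = 1.0   # TB per node, capped by the cache-vs-on-disk ratio

nodes_after          = nodes_before * 3          # 3-way replication -> 3x the nodes
data_per_node_after  = data_per_node_before * 2  # spare capacity lets each node hold more

compute_cost_increase = (nodes_after - nodes_before) / nodes_before       # 2.0 -> +200%
unique_data_before    = nodes_before * data_per_node_before               # 10 TB
unique_data_after     = nodes_after * data_per_node_after / 3             # 20 TB (÷3 for replicas)
data_increase         = (unique_data_after - unique_data_before) / unique_data_before

print(f"compute cost up {compute_cost_increase:.0%}; "
      f"unique data stored up {data_increase:.0%}")
```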