High Availability and Amazon AWS

A lightning strike in Dublin knocked Amazon and Microsoft data centres offline for a few hours, and it took some time to get all the services restored.

Although the outage affected Netflix, Foursquare and a few others, Druva cloud services were thankfully completely unaffected. Here is a short note on how we managed to keep our promised SLA.

I think it is plain ignorance or poor planning to assume 100% availability of the underlying infrastructure. Just like any hardware, the AWS infrastructure is prone to failures, but knowing the potential failure points can help improve availability.

Since ours is a backup service, we have divided our cloud design into 3 parts based on their availability and durability guarantees:

  • Config (most available): Configuration data stored in Amazon RDS
  • Metadata: Druva dedupe file system spanning Cassandra nodes
  • Data (most durable): Stored in S3

Here are some design changes we incorporated to avoid downtime:

  • Multi-Zone replication: Both RDS and Cassandra nodes are replicated across 3 availability zones. We use Cassandra in full-consistency mode and rely heavily on its self-healing in case of service failures.
  • Reduced dependency on EBS: EBS is a software abstraction over underlying SAN storage, and two independent EC2 instances may share the same SAN for EBS. Given this, we shifted our focus from EBS to local storage for metadata.
  • Extra spare copies in S3: We also maintain some extra redundancy on top of S3 for the most referenced blocks. This is essentially to avoid the random (but infrequent) S3 time-outs and to improve the durability of the most heavily referenced data.

We certainly paid more for improved availability, but there are simple design changes which can help save money as well. For example, the 3-way replication increased our compute (EC2) cost by over 200%, but because of the extra spare capacity we could increase the data stored per instance, which was earlier restricted in order to maintain a good cache-vs-on-disk ratio.

High Performance Deduplication

Time and again, enterprise customers, especially those migrating from competing solutions, ask us about the scalability of Druva inSync. Since the launch of v4.0, inSync has scaled exceptionally well, especially for large deployments. The software has succeeded where the majority of competing solutions have either failed or turned off deduplication.

About a week ago, at the request of a large customer, we started testing one of the competing solutions. We tested the software on 1 million files totalling 2TB, of which 48% was duplicate data. inSync finished the backup in about 22 hours, while the competing software is still backing up.

inSync doesn’t treat deduplication as an “integration”; the whole software was designed around deduplication and CDP. There is NO flag to turn off dedupe and there never will be.

This article focuses on my thoughts on how Druva succeeds where the majority of others fail.

Why Source Deduplication Fails to Scale for the Majority of Vendors
The biggest bottleneck for the performance scalability of deduplication is random disk I/O. Almost all dedupe systems include a database that stores the block-hash index, and this index must be consulted for every hash check. A server-class magnetic disk usually has a latency of 8-12ms, which restricts hash lookups to about 100/sec and throttles dedupe performance drastically.

Now, when the data set is small, the entire index can reside in memory and hence the hash checks are much faster. As the index grows, I/O congestion brings down the software’s capacity to perform inline deduplication.
Consider this: just about 1,000 users can create over 10 billion blocks for backup. Checking them at a rate of 100/sec would take more than 3 years.
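
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It assumes one random disk read per hash check, no caching at all, and a 10 ms seek (the midpoint of the 8-12ms range above). The function names are purely illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def lookups_per_second(seek_latency_ms: float) -> float:
    """Hash checks per second if every check costs one random disk read."""
    return 1000.0 / seek_latency_ms


def years_to_check(total_blocks: int, rate_per_sec: float) -> float:
    """Time needed to check every block hash at the given rate, in years."""
    return total_blocks / rate_per_sec / SECONDS_PER_YEAR


rate = lookups_per_second(10.0)                # assumed 10 ms per random read
years = years_to_check(10_000_000_000, rate)   # 10 billion blocks
print(f"{rate:.0f} lookups/sec, {years:.2f} years")
# prints: 100 lookups/sec, 3.17 years
```

At exactly 10 billion blocks the figure is a little over three years, and it only grows as the block count does.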

Learnings from Storage Guys
Data Domain had an interesting approach. They optimized their inline dedupe performance for backup streams. Since the backups were mostly of servers with a few large files, and the data mostly arrived as long streams in tar format, Data Domain used a simple index read-ahead algorithm to load the relevant parts of the index before the stream’s block hashes reached the server. Since the streams changed by less than 10% between two consecutive backups, the algorithm helped deduplicate them at a very fast pace.
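
The read-ahead idea can be sketched roughly as follows. This is a toy illustration of the general technique, not Data Domain's actual implementation: the class name, the `window` parameter, and the in-memory list standing in for the on-disk index are all assumptions made for the example.

```python
class ReadAheadIndex:
    """Toy hash index that prefetches neighbouring entries on every disk hit."""

    def __init__(self, stored_hashes, window=4):
        # Entries are kept in the order the previous backup stream wrote them.
        self.on_disk = list(stored_hashes)
        # Stands in for the on-disk lookup structure (hash -> position).
        self.position = {h: i for i, h in enumerate(self.on_disk)}
        self.window = window
        self.cache = set()      # unbounded here; a real cache would evict
        self.disk_reads = 0

    def contains(self, block_hash):
        if block_hash in self.cache:
            return True                      # served from memory
        self.disk_reads += 1                 # one simulated random read
        pos = self.position.get(block_hash)
        if pos is None:
            return False                     # new block, nothing to prefetch
        # Load this entry and the ones that followed it in the last backup.
        self.cache.update(self.on_disk[pos:pos + self.window])
        return True


previous = [f"h{i}" for i in range(8)]
index = ReadAheadIndex(previous, window=4)
stream = ["h0", "h1", "h2", "new", "h4", "h5", "h6", "h7"]  # one changed block
duplicates = sum(index.contains(h) for h in stream)
print(duplicates, index.disk_reads)
# prints: 7 3
```

Without the read-ahead, the same stream would cost eight simulated disk reads instead of three. The benefit depends entirely on the new stream arriving in roughly the same order as the old one.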

Solid State Disks
A simple solution to the random-I/O problem is to use SSDs to store the index. Although we did tweak certain features to support SSDs, the solution wasn’t complete because of the size limitation they impose.

Two Step Approach for Druva: No-SQL + HyperCache
The “Data Domain approach” did not work for us as our data was much more random and coming from different sources. But on the flip side we had much more knowledge of the data formats we were backing up.
The first step towards scalability was to get rid of the inbuilt SQL database, which imposed a lot of latency because of SQL query serialization and execution. We replaced PostgreSQL with Oracle Berkeley DB (BDB), an embedded NoSQL database, which improved performance and is much simpler to maintain.

The second major innovation was HyperCache – a selective in-memory cache of the index. HyperCache consists of both a positive and a negative cache, which remember the most probable and the least probable hashes for the ongoing backup. HyperCache uses an ever-learning algorithm that considers parameters such as the time, frequency and probability of a hash when deciding whether to cache it.
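
To illustrate the general idea of pairing a positive cache with a negative cache, here is a minimal sketch. It is not the HyperCache algorithm: it swaps the learning logic for plain LRU eviction, and the class name, the `capacity` parameter, and the Python set standing in for the on-disk index are all assumptions.

```python
from collections import OrderedDict


class TwoSidedCache:
    """Remembers hashes known to be present (True) and known to be absent (False)."""

    def __init__(self, disk_index, capacity=1024):
        self.disk_index = disk_index    # a set standing in for the on-disk index
        self.capacity = capacity
        self.known = OrderedDict()      # hash -> present?, oldest entry first
        self.disk_reads = 0

    def _remember(self, block_hash, present):
        self.known[block_hash] = present
        self.known.move_to_end(block_hash)
        if len(self.known) > self.capacity:
            self.known.popitem(last=False)   # evict the least recently used

    def contains(self, block_hash):
        if block_hash in self.known:
            self.known.move_to_end(block_hash)
            return self.known[block_hash]    # positive or negative hit
        self.disk_reads += 1                 # simulated random disk read
        present = block_hash in self.disk_index
        self._remember(block_hash, present)
        return present

    def add(self, block_hash):
        # A newly stored block must overwrite any stale negative entry.
        self.disk_index.add(block_hash)
        self._remember(block_hash, True)


cache = TwoSidedCache({"a", "b"}, capacity=4)
results = [cache.contains(h) for h in ["a", "x", "a", "x", "b"]]
print(results, cache.disk_reads)
# prints: [True, False, True, False, True] 3
```

The negative side is what saves a disk read for a hash that is repeatedly looked up but not yet stored. The `add` method shows the one subtlety: a negative entry has to be corrected as soon as the block is written.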

The Result
The result was an 85% reduction in disk I/O, using 4GB of RAM for every 1TB of data stored. The reduction in I/O translates to 4X better scalability, and the solution can easily scale to thousands of users with a linear improvement in scalability/performance.
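
For a rough sense of what these ratios mean when sizing a server, here is a small sketch. It assumes the 4GB-per-1TB ratio scales linearly and reads the 85% figure as the share of hash checks that no longer touch the disk. Both are simplifying assumptions, and the function names are illustrative only.

```python
RAM_GB_PER_TB = 4       # cache RAM per TB of stored data, from the figures above
IO_REDUCTION = 0.85     # share of disk I/O avoided, from the figures above


def cache_ram_gb(stored_tb: float) -> float:
    """RAM to set aside for the cache, assuming the ratio scales linearly."""
    return RAM_GB_PER_TB * stored_tb


def disk_lookups_after_cache(hash_checks: int) -> float:
    """Hash checks that still reach the disk under the assumed reduction."""
    return hash_checks * (1 - IO_REDUCTION)


print(cache_ram_gb(10))                                # RAM in GB for 10TB stored
print(round(disk_lookups_after_cache(1_000_000)))      # lookups still hitting disk
```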

Use of SSDs further improves performance by 6X. The inSync core has been modified to keep only the most active part of the database index on SSDs and to optimize it for solid state drives.

Sales Meetings and My iPad

Yesterday I got a ping from a PC magazine editor asking about my tablet usage and how I see the adoption in the enterprise. This was actually great timing (my brother had given me a gift of an iPad2 almost the same week it was released).

I’ve been travelling pretty heavily over the last 2 months, and had an opportunity to meet and learn from a lot of customers. The best companion during this travel has been my iPad. Here’s my small list of productivity apps/accessories which I use the most:

  • inSync iPad app – lets me access all files and all the versions backed up from my laptop
  • Smart cover – makes typing emails simple
  • Easy Sign app – makes signing documents easy (I realized this was the only reason I used my fax/scanner)
  • Kindle app – bought 2 books: The Idea Book & The Upside of Irrationality
  • iPad VGA adapter – allows me to present directly from my iPad
  • Evernote – for taking meeting notes

I’m currently a happy beta tester of the new version of the inSync app for iPad which allows offline access of folders marked as “favorites”, and this was an absolute life saver when I was travelling in Europe and didn’t want to pay heavy roaming charges :)

I see tablets as great access devices which, in my opinion, will affect two markets the most: desktops and printers. Although I am still adjusting to carrying one extra device, IMO the laptop may soon become more of an in-office/limited-mobility device.

For Laptop Backup – Be Your Own Customer

The phrase “eat your own dog food” or the more palatable “drink your own champagne” gets bandied about a lot in Silicon Valley. (I swear I’ve heard it almost as much as “win-win” and “pure play”). I’ve worked at a number of companies where we’ve said this, and done it, with varying degrees of success. No matter the level of success, it’s an important process, since it forces your company to be more customer focused.

At Druva, we walk a mile in our customers’ shoes, and every employee uses inSync for laptop backup. Prior to any release, we go through the upgrade or installation process ourselves. This allows us to feel any pain first and make any necessary adjustments so that we avoid transferring that pain to customers. By using the product on a daily basis, we can put it through its paces and watch it for performance, scalability, ease of use and so on. It’s been fun and we’ve learned a lot, not only about laptop backup but also about using the iPad and iPhone interface. It also allows anyone within the company to make suggestions about the products as they use them, which is key to our ideas on innovation, as we saw in our inaugural Druva Hack Day. We’ve had lots of great ideas from all of our employees, and our product roadmap is pretty exciting.

I actually had to recover a presentation I was working on that I lost recently. I felt the cold panic, the “Oh No!” moment (ok, I didn’t say “No”, I said something else, that rhymed with “truck”, but keeping this blog at a PG-13 rating is important). I recovered the file easily using inSync. It just worked. Very often in Marketing, we will tout certain features and throw out statistics on product performance, but the one thing I realized in using inSync is that, when we say it’s the most simplified laptop backup solution out there, it’s true. And it is lightning fast. It’s a good feeling to be able to stand behind your product like this, and we also hope that customers know that we’re in the trenches, eating dog food or drinking champagne with them.