Innovation Series, Tech/Engineering

How Druva reduced its cloud storage OPEX with an S3 database

Building an S3 based database to save cloud operational cost

The Druva Data Resiliency Cloud provides cyber, data, and operational resilience without any hardware, software, or associated complexity. It uses AWS's IaaS platform for its compute and storage capabilities. As with any SaaS, resource utilization drives Druva's cost of operation, so it is important to optimize utilization to save on billing expenses.

Druva Data Resiliency Cloud protects more than 200 Petabytes of data. This data is stored in the form of objects on AWS S3 — an object store optimized for throughput. There is also metadata associated with these objects that is stored in DynamoDB — a low latency key value database. Usage of S3 and DynamoDB contributes substantially to the storage costs. The following table represents pricing for the two storage offerings.

Cost component               DynamoDB    S3
Writes (per million)         $0.180      $5.00
Reads (per million)          $0.036      $0.40
Storage (per GB per month)   $0.25       $0.021

Reference: S3 pricing | DynamoDB pricing | AWS performance

As the table shows, writes and reads on DynamoDB are roughly 27 and 11 times cheaper, respectively, than on S3. However, storing data on DynamoDB is about 12 times more costly.

For
X = number of writes (in millions),
Y = number of reads (in millions),
Z = number of GBs stored:

Cost of DynamoDB = ($0.180*X) + ($0.036*Y) + ($0.25*Z)
Cost of S3 = ($5*X) + ($0.4*Y) + ($0.021*Z)
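
These formulas can be captured in a small helper to make the trade-off concrete. The dollar rates are the per-million-request and per-GB figures used above; the function and variable names are ours, not Druva's:

```python
# Monthly cost models from the formulas above.
# x, y: millions of write/read requests; z: GB stored.
def dynamodb_cost(x, y, z):
    return 0.180 * x + 0.036 * y + 0.25 * z

def s3_cost(x, y, z):
    return 5.0 * x + 0.4 * y + 0.021 * z

# The per-operation ratios quoted in the text:
print(f"S3 writes cost {5.0 / 0.180:.1f}x DynamoDB writes")      # ~27.8x
print(f"S3 reads cost {0.4 / 0.036:.1f}x DynamoDB reads")        # ~11.1x
print(f"DynamoDB storage costs {0.25 / 0.021:.1f}x S3 storage")  # ~11.9x
```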

Sometime last year, Druva's billing analysis showed that approximately 75% of Druva's cloud usage cost comes from data storage and access. While looking to optimize costs, we started analyzing our storage patterns. A straightforward approach to saving cost could be to move everything from DynamoDB to S3, since S3 storage is cheaper. But Druva performs data deduplication, which results in random access to data, and such data could not be moved to a relatively high latency storage like S3. However, further characterization of the data stored in DynamoDB showed that about half of it was accessed sequentially, such as metadata related to objects, files, snapshots, etc.

Movement of this metadata to S3 without compromising on performance would result in significant storage cost reduction. In addition, with certain IO optimization techniques there could be further savings in access costs.

Figure 1: Cost comparison

 

Factors to consider

S3 IO optimization: The PUT/GET APIs cost the same irrespective of object size, i.e., writing a single byte costs the same as writing a one-gigabyte object. Since IO to S3 is costlier, one way to save cost is to batch multiple data records together in a single object.
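
One way to batch records, sketched below, is to length-prefix each record and concatenate them into a single blob that is then uploaded as one object. This is an illustration, not Druva's actual format; the eventual PUT/GET through an S3 client such as boto3 is omitted:

```python
import struct

def pack_records(records: list) -> bytes:
    """Concatenate records into one blob, each prefixed with a 4-byte length."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)

def unpack_records(blob: bytes) -> list:
    """Recover the individual records from a batched blob."""
    records, off = [], 0
    while off < len(blob):
        (n,) = struct.unpack_from(">I", blob, off)
        off += 4
        records.append(blob[off:off + n])
        off += n
    return records

batch = pack_records([b"rec-1", b"record-two", b"r3"])
# A single PUT uploads `batch`; a single GET later retrieves all three records.
assert unpack_records(batch) == [b"rec-1", b"record-two", b"r3"]
```

With an offset index kept alongside, a byte-range GET can also fetch one record out of the batch without downloading the whole object.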

Figure 2: Batching

 

If we can batch 27 data units in a single S3 object, the IO cost will be equal to that of DynamoDB — the more the merrier. 

Cost of S3 = ($5*X/N) + ($0.4*Y/N) + ($0.021*Z)
Where N = number of records batched in a single S3 object. Note that batching reduces the number of requests, not the amount of data stored, so the storage term is unchanged.
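
The break-even batch size falls out of the rates directly: setting the batched S3 write cost $5/N equal to DynamoDB's $0.180 and solving for N gives roughly the figure quoted above (a sketch using the per-million-request rates):

```python
import math

S3_WRITE_RATE = 5.0     # $ per million PUT requests
DDB_WRITE_RATE = 0.180  # $ per million DynamoDB writes

# Batched S3 write cost per record is $5/N; solve 5/N = 0.180 for N.
break_even = S3_WRITE_RATE / DDB_WRITE_RATE
print(f"break-even batch size: {break_even:.1f} records")  # ~27.8
# From ceil(27.8) = 28 records per object onward, S3 writes are cheaper.
assert math.ceil(break_even) == 28
```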

S3 — The database: S3 can serve higher throughput, thereby masking the effect of higher latency, especially for sequential access. The batching of multiple records in a single object further helps in throughput. But how can an object store be used as a database?

This is exactly the problem that Druva needed to solve; S3 had to become a database. 

Potential savings: As mentioned earlier, Druva Data Resiliency Cloud performs frequent random data lookups to find duplicates, and needs a low latency storage for good performance. Some of the data operations need atomic access as well as the ability to perform conditional updates. These scenarios are supported by DynamoDB, and therefore, it’s best to keep such data on DynamoDB.

The infrequently and sequentially accessed data, like some metadata, is a good candidate to be moved to S3. Our access patterns suggested that close to 50% of the total contents from DynamoDB could be moved to S3. This would reduce the cost of DynamoDB by half.

Cost of DynamoDB = ($0.180*X/2) + ($0.036*Y/2) + ($0.25*Z/2)

At the same time, this will result in an S3 cost of (assuming batching of 50 records):

Cost of S3 = ($5*(X/2)/50) + ($0.4*(Y/2)/50) + ($0.021*Z/2)
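
The before/after costs of the split can be computed directly from these formulas. The workload numbers below are purely hypothetical, chosen only to exercise the arithmetic; the actual savings depend on the mix of IO and storage in a given workload:

```python
def cost_all_dynamodb(x, y, z):
    # x, y: millions of writes/reads per month; z: GB stored
    return 0.180 * x + 0.036 * y + 0.25 * z

def cost_after_split(x, y, z, batch=50):
    # Half the data stays on DynamoDB; half moves to S3 with batched IO.
    half_ddb = 0.180 * x / 2 + 0.036 * y / 2 + 0.25 * z / 2
    half_s3 = 5.0 * (x / 2) / batch + 0.4 * (y / 2) / batch + 0.021 * z / 2
    return half_ddb + half_s3

x, y, z = 100.0, 500.0, 200_000.0  # hypothetical monthly workload
before, after = cost_all_dynamodb(x, y, z), cost_after_split(x, y, z)
print(f"before ${before:,.0f}, after ${after:,.0f}, "
      f"saving {100 * (1 - after / before):.0f}%")
```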

Since storage drove the overall cost, we were convinced about the savings. Further, our study showed that 50 data units could fit into a single S3 object, resulting in cost-effective IO access.

Implementation

Druva Data Resiliency Cloud protects petabytes of data from millions of devices, and the corresponding metadata is huge. To serve such large amounts of data, any database should be able to:

  • Store and retrieve values efficiently
  • Scale capacity on demand
  • Scale performance on demand
  • Be cost effective

The most important component of a database is its index, which is used to search data and is maintained in sorted order for quick retrieval of information. If the index grows very large, operations slow down. Once written to S3, frequently updating and re-writing the index can get expensive, as IOs on S3 are costlier. Therefore, it is better to build a new index for bulk data first and then merge it with the existing index.

Figure 3: Conversion from Tree to object and vice versa

 

A data structure is needed to maintain these indices. B-Trees support efficient search, reads, and writes for large systems, which made them a good choice for our use case. The leaf nodes of the tree contain the actual data, and the non-leaf nodes act as an index. Figure 3 shows a B-Tree for an individual table being serialized and stored as objects in S3 (and deserialized on the way back). Figure 4 shows the creation of new tables and their merge with old tables.
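
The "build a new index, then merge" step can be sketched with sorted key-to-location maps standing in for the serialized B-Tree objects. This is a deliberate simplification; the real structure stores leaf data and interior index nodes, and the names below are illustrative:

```python
def merge_indices(old: dict, new: dict) -> dict:
    """Merge a freshly built index into the existing one; newer entries win.
    The result is kept in sorted key order, as a B-Tree index would be."""
    merged = {**old, **new}
    return dict(sorted(merged.items()))

# Keys map to (s3_object_key, offset) locations of records batched in S3.
old_index = {"file-a": ("obj-1", 0), "file-c": ("obj-1", 512)}
new_index = {"file-b": ("obj-2", 0), "file-c": ("obj-2", 256)}  # update
merged = merge_indices(old_index, new_index)
assert list(merged) == ["file-a", "file-b", "file-c"]
assert merged["file-c"] == ("obj-2", 256)  # the newer location wins
```

Merging sorted indices this way means the bulk of new writes land in a fresh, cheap-to-write object, and the expensive re-write of the merged index happens once rather than per record.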

Figure 4: Table merge

 

Based on the number of metadata attributes to be stored, there could be multiple such trees. For example, data protection applications can have individual trees for:

  • Directory structure-related information
  • Filesystem snapshot related information

Figure 5: Multiple trees for different attributes

 

Figure 5 shows a directory with multiple files being copied to the cloud filesystem. Desired metadata is converted into separate tables, which are stored as B-Trees.

It should be noted that all this database management requires additional memory, storage, and computational power.

Conclusion 

The following table illustrates the cost when data and the corresponding IO are moved from DynamoDB to S3. It represents a pattern similar to one of the workloads protected by Druva. An average batching of 50 records in a single S3 object is assumed.

In the above example, with IO and storage moved to S3, the total savings are close to $175,000 (73%) per month. For simplicity, the cost of the additional resources consumed by database management activity is not considered, since it is only a fraction of the savings.

Traditionally, the emphasis is on choosing the right type of storage for cost optimization. With this approach, we suggest introducing a layer of optimization that changes the existing access and storage patterns themselves to reduce cloud operational costs.

Next steps

Looking to learn more about the technical innovations and best practices powering cloud backup and data management? Visit the Innovation Series section of Druva’s blog archive.

Looking for a career where you can shape the future of cloud data protection? Druva is the right place for you! Collaborate with talented, motivated, passionate individuals in a friendly, fast-paced environment; visit the careers page to learn more.

About the author

An experienced campaigner with performance expertise across the storage, data protection, and HPC domains. At Druva, Kiran's team looks after the performance of Druva's Data Resiliency Cloud, providing inputs to create the most optimized solutions to challenging engineering problems.

LinkedIn: https://www.linkedin.com/in/kiran-nalawade/
Email: kiran.nalawade@druva.com