Innovation Series

Building high-performance cloud data protection

Sajeesh Nair, Head of Performance Engineering

There are several challenges to solve in building a successful cloud-based data protection platform, with performance and scalability being especially important. The performance dynamics of a cloud backup solution are completely different from those of traditional on-premises products, and they require a massive shift in how we architect backup solutions.

Druva is the first to build a cloud-native backup solution. In this blog, we cover the factors affecting the performance of cloud backups (for datacenter workloads), and how we at Druva solved them. We break these down into four key focus areas:

  1. Network
  2. RTO anxiety 
  3. Hyperscale cloud file system
  4. TCO balance

1. Network

Any architecture for cloud-based backup must take into consideration the network between customer data centers and the cloud. Moving terabytes of data across a network is a hard problem to solve. In addition, the quality of the network can vary greatly across customers and their ISPs. This raises the question: how do we design a product that reliably meets RPOs and RTOs in the face of these impediments?

Bandwidth optimization:

Bandwidth is always a scarce commodity, and less bandwidth intuitively means lower performance. However, there are a few ways to design around this:

  • Compression at the source — It is well known that the basic way to improve payload performance over the network is to compress it. On-premises backups often rely on compression to reduce storage needs, and hence compress at the backup device. In the case of cloud backups, it becomes imperative to compress at the source to save bandwidth. There are several compression algorithms that can be used, and it is important to understand the tradeoffs between them. Algorithms are usually compared on speed (data compressed per unit time), the compression ratio they yield, and the CPU footprint they require; for a given algorithm, a higher compression ratio requires more CPU. Each algorithm has its strengths: some compress data at a high ratio, but do so at slower speeds, resulting in lower throughput.

To increase backup efficiency, we implemented methods to skip already compressed files such as zip archives and images. Attempting to compress these formats usually costs CPU cycles without yielding any meaningful reduction.

Choosing the right compression algorithm for the use case is key to performance. Learn more about various compression algorithms on the GitHub page here.
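To make these tradeoffs concrete, here is a small, illustrative Python sketch (not Druva’s agent code) that profiles throughput and compression ratio across zlib levels from the standard library, and shows the kind of sampling heuristic an agent could use to skip data that is already compressed. The thresholds, sample size, and test payloads are arbitrary choices for illustration.

    # Illustrative only: compare zlib levels on speed vs. ratio, and use a
    # small leading sample to decide whether compressing a payload is worth
    # the CPU at all (zip archives, images, etc. rarely shrink further).
    import os
    import time
    import zlib

    def profile_levels(payload, levels=(1, 6, 9)):
        """Print throughput (MB/s) and compression ratio for each zlib level."""
        for level in levels:
            start = time.perf_counter()
            compressed = zlib.compress(payload, level)
            elapsed = time.perf_counter() - start
            mb_per_s = len(payload) / (1024 * 1024) / elapsed
            ratio = len(payload) / len(compressed)
            print(f"level {level}: {mb_per_s:8.1f} MB/s, ratio {ratio:.2f}x")

    def looks_incompressible(payload, sample_size=64 * 1024):
        """Heuristic: if a leading sample saves less than 5%, treat the
        payload as already compressed and skip compression entirely."""
        sample = payload[:sample_size]
        return len(zlib.compress(sample, 1)) > 0.95 * len(sample)

    if __name__ == "__main__":
        text_like = b"backup telemetry line 000123 status=ok\n" * 50_000
        already_packed = os.urandom(256 * 1024)   # stands in for zip/jpeg content
        profile_levels(text_like)
        print("skip text-like payload?   ", looks_incompressible(text_like))
        print("skip already-packed data? ", looks_incompressible(already_packed))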

  • Deduplication — This process aims to remove duplicate data. By sending only unique data over the network, we further optimize bandwidth usage. Unlike on-premises solutions, Druva performs a dedupe check at the source, powered by our patented intelligent global deduplication. This improves deduplication ratios by checking for duplicates across all the data a customer has backed up with Druva to a given AWS region. Our backup sizes are typically only one-third of the customer’s source data. A simplified sketch of such a source-side check follows below.
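The sketch below illustrates the general idea of a source-side dedup check: chunk the data, fingerprint each chunk, and send only chunks whose fingerprints the cloud-side index does not already hold. The chunk size, the hashing scheme, and the in-memory cloud_index set are simplifications for illustration, not Druva’s patented implementation.

    # Illustrative source-side dedup: only chunks the cloud has never seen
    # are yielded for upload; duplicates contribute just a fingerprint
    # reference. `cloud_index` stands in for the global dedup index.
    import hashlib

    CHUNK_SIZE = 4 * 1024 * 1024   # 4 MiB chunks, purely illustrative

    def chunks(path):
        """Yield fixed-size chunks of a file."""
        with open(path, "rb") as f:
            while block := f.read(CHUNK_SIZE):
                yield block

    def unique_chunks(path, cloud_index):
        """Yield (fingerprint, chunk) pairs only for data the cloud lacks."""
        for block in chunks(path):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in cloud_index:
                cloud_index.add(digest)
                yield digest, block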

Network latency: 

  • Maximize bandwidth usage — In data path-heavy applications like backup, network latency limits how efficiently the available bandwidth can be used. Network packets have to wait longer to be acknowledged (at the TCP layer), so the sending application can fill only part of the available network pipe. One way to solve this would be to use more TCP connections. However, opening a large number of TCP connections does not scale from a cloud standpoint: Druva runs a multi-tenant cloud that hosts tens of thousands of backups each day, so this approach would produce thousands, or even millions, of TCP connections and choke the servers. It is therefore important to improve bandwidth utilization with an optimal number of connections. We solve this problem in two ways. First, Druva’s proprietary protocol (Druva RPC) multiplexes several payloads over a single TCP connection. Second, we use a large number of workers to keep many packets in flight without waiting for acknowledgments, compensating for the delays caused by latency. This enables our agents to use available bandwidth efficiently and overcome the impact of network latency.
  • Request clubbing — The best way to avoid latency is to minimize the number of network calls. For this, we club metadata and data calls at multiple layers; this is built into the Druva RPC protocol. Clubbing requires agents to bundle their calls into one request, which the server then unbundles into multiple service calls. Once again, tradeoffs need to be made to ensure that the benefit of minimizing calls outweighs the cost of bundling and unbundling. A simplified sketch of this batching follows below.
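As a simplified illustration of request clubbing, the sketch below buffers small metadata calls and flushes them as a single framed request once the batch is full or has aged past a threshold, and shows the server side unbundling the frame back into individual calls. The framing, thresholds, and the send_over_wire stand-in are illustrative only; the real Druva RPC transport, and the connection multiplexing described above, are proprietary and not shown here.

    # Illustrative request clubbing: many logical calls ride in one network
    # request. `send_over_wire` stands in for the real transport.
    import json
    import time

    MAX_BATCH_SIZE = 64        # flush after this many calls...
    MAX_BATCH_AGE = 0.05       # ...or after 50 ms, whichever comes first

    class RequestClubber:
        def __init__(self, send_over_wire):
            self._send = send_over_wire
            self._batch = []
            self._oldest = None

        def submit(self, op, **params):
            """Queue one logical call; flush if the batch is full or stale."""
            if not self._batch:
                self._oldest = time.monotonic()
            self._batch.append({"op": op, "params": params})
            full = len(self._batch) >= MAX_BATCH_SIZE
            stale = time.monotonic() - self._oldest >= MAX_BATCH_AGE
            if full or stale:
                self.flush()

        def flush(self):
            if self._batch:
                self._send(json.dumps(self._batch).encode())  # one network call
                self._batch = []

    def unbundle(payload):
        """Server side: split one request back into individual service calls."""
        return [(call["op"], call["params"]) for call in json.loads(payload)]

    # Usage: 200 metadata lookups become a handful of network calls.
    calls_on_wire = []
    clubber = RequestClubber(send_over_wire=calls_on_wire.append)
    for i in range(200):
        clubber.submit("stat", path=f"/vm01/disk0/block{i}")
    clubber.flush()
    print(f"{len(calls_on_wire)} network calls instead of 200")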

2. RTO anxiety

A common concern when people think about electric cars is range anxiety: the fear that the vehicle has insufficient range to reach its destination and will strand its occupants. For most commutes and trips, however, this is a non-issue. We observed a similar perception mismatch with customers considering cloud-based backup. The prevalent mental model of the cloud and of networks leads most people to believe that restores over the network will be slow. In practice, the majority of customer RTO requirements are easily met with Druva’s solution, and customers don’t need to buy expensive backup appliances just to address RTO concerns. In addition, for customers with narrow RTOs, Druva offers CloudCache, which accelerates restores in low-bandwidth, high-latency scenarios by maintaining a local copy of the data.

To address this perception, we did a detailed exercise in collaboration with AWS. We were able to demonstrate that up to 500 miles from an AWS region, and even with five percent network packet loss, there is only a marginal impact on restore times compared to local restores. Latency only starts to play a role if a customer on the U.S. East Coast tries to restore data from the U.S. West Coast (over 2,500 miles away). This is unlikely, since customers back up their data to the closest AWS region to get the best possible performance. Here is the detailed writeup of the exercise done in collaboration with AWS.

3. Hyperscale cloud file system

In the sections above, we focused on how to efficiently send data from the customer’s premises to the cloud. On the other end of this pipeline is Druva’s own cloud-native file system. We have thousands of backups happening every day on the Druva Cloud, which means terabytes of source data are processed daily. The Druva Cloud is built on a robust cloud-native file system powered by the AWS DynamoDB and S3 services.

It is important to ensure the cloud architecture supports such high backup throughput. The cloud storage layer also provides self-healing properties, which means continuous consistency and integrity checks run against customers’ restore points. This ensures a healthy copy of the data is always ready when a restore is needed. All of these jobs add up to many compute cores, hundreds of S3 PUTs per second, and heavy network throughput. Architecting a system with these characteristics requires rethinking data protection entirely.
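As a rough illustration of this metadata-in-DynamoDB, data-in-S3 pattern, the sketch below (using boto3) writes deduplicated chunks to S3 keyed by their fingerprints and records which chunks make up a file in a DynamoDB table. The bucket name, table name, and item schema are hypothetical and are not Druva’s actual file-system layout; running it requires AWS credentials, a configured region, and pre-created resources.

    # Hypothetical layout: chunk payloads in S3, file/snapshot metadata in
    # DynamoDB. Names and schema are illustrative, not Druva's design.
    import hashlib

    import boto3

    s3 = boto3.client("s3")
    metadata_table = boto3.resource("dynamodb").Table("backup-file-metadata")  # hypothetical
    CHUNK_BUCKET = "backup-chunk-store"  # hypothetical

    def store_chunk(chunk):
        """Write one deduplicated chunk to S3, keyed by its fingerprint."""
        digest = hashlib.sha256(chunk).hexdigest()
        s3.put_object(Bucket=CHUNK_BUCKET, Key=f"chunks/{digest}", Body=chunk)
        return digest

    def record_file(snapshot_id, path, chunk_digests):
        """Record which chunks make up a file in a given restore point."""
        metadata_table.put_item(
            Item={
                "snapshot_id": snapshot_id,   # partition key (illustrative)
                "path": path,                 # sort key (illustrative)
                "chunks": chunk_digests,
            }
        )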

4. TCO balance

One of the key expectations customers have from a cloud- and SaaS-based solution is cost effectiveness. A lot of thought is put into designing a solution that keeps total cost of ownership (TCO) for customers as low as possible. However, cost and performance usually sit on opposite sides of a tradeoff. There are a few ways to strike the right cost-to-performance balance:

  1. Optimize unit performance — The proxies we install on customers’ premises run on customer-provided hardware/VMs. Hence, it is important to keep their footprint low without compromising on RPO requirements (performance). In practice, this means optimizing throughput per unit of footprint: we tune our agents to use fewer storage reads and CPU cycles, and less memory and network bandwidth (as detailed above), and we focus on metrics like GB per hour of throughput per CPU core.
  2. Cloud compute — Backup is a compute-heavy process in the cloud, primarily due to encryption and metadata operations. On AWS, more compute equates to higher cost, so sizing and efficient provisioning become key to achieving cost efficiency. This also requires deploying instance types that provide a good cost-to-performance balance.
  3. Storage costs — S3 storage costs are kept in check by intelligently tiering customers’ data with long-term retention. In this case, performance is deliberately traded off in favor of lower costs, typically for older or archived data for which customers have more relaxed RTOs. S3 PUT costs are kept in check with optimizations like metadata clubbing and packing multiple data objects into one S3 object (a simplified sketch follows this list). This once again requires understanding the performance tradeoff.
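As a rough sketch of the object-packing idea above, the snippet below packs several chunks into one blob, records each chunk’s offset and length, uploads the blob with a single S3 PUT, and reads an individual chunk back with a ranged GET. The bucket and key names are hypothetical, and a real implementation would also persist the index and handle retention and garbage collection.

    # Hypothetical object packing: one PUT carries many chunks, and ranged
    # GETs read individual chunks back without fetching the whole pack.
    import boto3

    s3 = boto3.client("s3")
    PACK_BUCKET = "backup-chunk-packs"   # hypothetical

    def upload_pack(pack_key, chunks):
        """Upload many chunks as one S3 object; return digest -> (offset, length)."""
        index, blob, offset = {}, bytearray(), 0
        for digest, data in chunks.items():
            index[digest] = (offset, len(data))
            blob.extend(data)
            offset += len(data)
        s3.put_object(Bucket=PACK_BUCKET, Key=pack_key, Body=bytes(blob))  # one PUT
        return index

    def read_chunk(pack_key, offset, length):
        """Fetch a single chunk back with a ranged GET."""
        byte_range = f"bytes={offset}-{offset + length - 1}"
        return s3.get_object(Bucket=PACK_BUCKET, Key=pack_key, Range=byte_range)["Body"].read()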

Conclusion

Building a successful cloud-based data protection platform requires solving the challenges above. Solutions to these problems can often pull in opposite directions. For example, higher throughput may require more powerful infrastructure, but that may increase customer TCO. The key to customer satisfaction lies in finding an optimal solution that meets cost expectations without sacrificing performance. Any application with a data path that traverses large networks will likely have to solve these architectural challenges.

With more than a decade of innovation, Druva offers a backup solution that helps customers meet their RTO and RPO requirements in the cloud. Download the new eBook to learn how your organization can adopt best practices to reduce RPO/RTO.