Tech/Engineering, Innovation Series

Ensuring Performance at a Massive Scale — the Druva Difference

Arati Joshi, Sr. Principal Engineer

Druva is the industry's leading, fully managed, 100% SaaS data resilience platform that protects your data wherever it lives, backed by a $10M guarantee. Druva eliminates the need for costly hardware, software, and services through a simple and agile cloud-native architecture that delivers unmatched security, availability, and scale.

As we expand Druva’s capabilities with the latest cutting-edge technologies and architectures, it is essential to ensure guaranteed performance at scale. This blog highlights some of our differentiated approaches to performance engineering.

Performance Engineering at Druva

Druva currently drives more than 4 billion backups a year across a wide spectrum of workloads, and that number is growing exponentially. To deliver an improved customer experience, we continuously innovate by adopting new technologies, and each one must be validated to confirm it delivers the desired performance at Druva's massive scale. This is where Druva's performance engineering process comes into play.

We at Druva consider performance engineering a foundational component of innovation and a continuous process through the development lifecycle — not an afterthought. Druva’s data resiliency cloud is continuously validated for performance and resilience through a tiered multi-stage process that’s designed to optimize every API and workflow. This ensures performance improvements in every aspect of the cloud platform.

[Figure: Druva's tiered, multi-stage performance validation process]

Let’s dive deeper into the last stage, which qualifies Druva’s cloud at its massive production scale. At a high level, this stage comprises three sub-phases:

  1. Modeling production scale through learning and heuristics

  2. Simulating production load for rigorous validation

  3. Enhancements based on learnings from performance tests

Modeling Production Scale

The Druva Data Resiliency Cloud is powered by native algorithms enhanced through a rich and deep integration with AWS. You can read more about our innovative approach to cloud efficiencies in our Building S3-based database to save cloud operational cost blog.

Before running a performance test, it is important to model the production workload accurately. Backups are not uniform over a day: there are peak hours driven by customer usage patterns, and there are spikes even within an hour due to backup schedules. SaaS solutions have to adapt to this kind of changing load. At peak hours, Druva's scalable architecture handles a load that is a multiple of the base load.

[Figure: daily backup load pattern, showing peak hours and intra-hour spikes]

As a SaaS platform, we have continuous visibility and insights into backup trends, which helps us in accurate workload modeling.

Our homegrown workload model uses a wide array of parameters like:

  • Pattern of active jobs over time

  • Pattern of backed-up data over time

  • Workload distribution of backup data

  • API concurrency

The more refined the workload model, the more faithfully the validation tests reproduce production behavior.

Validating at Production Scale

Our performance tests are carefully architected to ensure that a product can serve the load without any impact on its performance. A well-defined and iterative set of goals drives our approach to performance validation.

Our first level goal is to measure and monitor key performance metrics, including: 

  • Throughput in terms of data processed per unit of time

  • API latencies

  • Resource usage footprint

  • Responsiveness to load changes

The next goal is to drive continuous optimization, such as:

  • Proactive mitigation of potential performance bottlenecks

  • Improved cloud efficiencies

  • Identification of further improvements and optimizations

Our fundamental test methodology is to simulate the modeled workload. Backups are inherently data-intensive operations. To validate performance at scale, we have developed modern simulation approaches, such as a lightweight load generator written in Go, to meet three main objectives:

  • To trigger a load of thousands of jobs

  • To generate API load on Druva cloud in the same pattern as production

  • To simulate load patterns across multiple customers and system operations

Best Practices for Validating Performance at Scale

Performance tests provide valuable information about the product’s behavior at production scale. However, it is essential to conduct the tests and interpret the results correctly.

Here are some of the performance engineering best practices from our experience of running one of the largest cloud-native SaaS platforms.

  • While testing in a cloud environment, remember that each service carries its own service quotas and other limits. It is essential to first understand the infrastructure capacity and estimate the maximum theoretical performance that capacity allows.

  • Given those infrastructure limits, it is not always possible to execute end-to-end flows at scale, so it is essential to focus on and prioritize the core logic.

  • Average response times are essential, but not always sufficient; percentile numbers provide better insight. The plot below shows that average performance in Test-2 is degraded by 60%, yet the median response times of Test-1 and Test-2 are identical. This indicates that only a few samples (outliers) slowed down in Test-2, not the whole distribution.

[Figure: response-time averages and percentiles for Test-1 and Test-2]
  • Use visualizations to get a clear picture of the system's behavior. The chart below is an illustrative example of the effect of request queuing: the difference between the submitted and active API counts indicates the queued requests, and once the queue is full, the submitted request count stays constant.

[Figure: submitted vs. active API requests over time, showing the effect of queuing]

Conclusion

Not all SaaS platforms are created equal. To deliver performance at massive scale, a SaaS platform needs to start with a fully cloud-native architecture and layer on equally cloud-native approaches to software development and validation.

Unlike waterfall-style processes built around monolithic architectures sub-optimally retrofitted to the cloud, a true SaaS platform treats performance engineering as a continuous process throughout the product development cycle. Performance validation at scale is critical for ensuring that customers truly benefit from the innovations.