Tech/Engineering

Enabling fast queries at scale in the Druva data lake

March 24, 2020 Vinay Punera, Staff Software Engineer, Druva Labs

At Druva we perform more than 4 million backups every single day. As data continues to be the driving force for any organization, we are committed to developing innovative capabilities that are designed to benefit our customers and their data.

Due to the billions of backup events generated by the sheer volume and scale of unstructured data that we backup, it’s also important to have a central storage pool for each customer that can enable analytics and insights. Furthermore, it’s critical to deploy a primary query engine that is compatible with the repository to enable faster queries at scale.

Introducing Druva data lake

Druva data lake is designed to be the centerpiece of the Druva data architecture. By instrumenting the data lake, it is now possible to have a central place for data that can be used by Druva products for analytics requirements. This data lake can also be utilized by data science and machine learning teams for faster experiments.

We use Amazon Web Services (AWS) S3 as a storage layer for the data lake which enables cost optimization for data compared to Hadoop HDFS-based solution. In addition, we also use Apache Parquet file format to store data, which is a columnar storage format for efficient data storage. Our data is partitioned based on hour-value to support range queries.

The data lake holds backup events and is consumed by various internal applications. One of the applications that we will discuss further is our ‘backup metadata search’ application.

Enabling data lake access to internal applications

To access the data lake, we provide an abstraction to consumer applications, which is a query engine between consumers and the data lake.

diagram-data-lake

The scale at which the metadata search service queries the data lake is about 400 thousand queries per hour. This stems from one of many internal consumer applications. The volume of data will also increase — growing over time to petabytes. We also expect an increase in consumers of the data lake. To offer data query at such scale, the query engine must be capable of handling that amount of data.

To build such a capability and have it perform at such scale, we had some critical design criteria to consider:

  • Reduced storage costs,
  • Sub-second query response time; and
  • On-demand scalability

The data lake also contains Hive tables that are the first layer of abstraction to query data, which can be queried using various tools like:

  • Hive on Tez / Hive on Spark
  • Athena
  • Presto 0.215

Presto serves as the query engine for Druva data lake

Presto is an open source, distributed SQL query engine for running interactive, analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto allows for querying data where it lives, which in our case, is in Hive tables created on the data lake.

We chose Presto as our query engine for the data lake because:

  • Presto provided the highest query per second (~100 queries per second) among the available options.
  • Presto with Parquet allowed for pushing predicates to file level, which allows us to save on S3 reads.
  • Hive on Tez / Hive on Spark are a good fit for long-running queries asking for aggregation on large amounts of data.
  • During the evaluation, the Athena query engine was based on Presto 0.172, lacking some of the optimizations done by the Presto community over time. For instance, there were several Hive connector changes to Presto 0.215.

Understanding the queries beforehand

We understood the type of queries coming from our metadata search consumer and what we found is that we have a requirement for a large number of short-lived queries. What this means is that the query will ask for a particular backup event for a device within a particular time range. These queries do not ask for any aggregation but rather raw data from the data lake of a particular time interval.

Storage format in data lake

For storage formatting in the Druva data lake, we use Apache Parquet which allows Presto to apply predicates at the file level, fetching only what is needed. Parquet’s columnar storage format takes up less space in comparison to a row-based file format. Additionally, we perform gzip compression on Parquet files which further reduces the file size. The result is that the AWS S3 storage and read costs are well optimized.

Presto architecture

Originally, Presto is designed to support a coordinator-worker architecture which enables it to distribute load among multiple worker nodes which results in faster query response times.

diagram-presto

We initially opted for the Presto setup in AWS EMR, but later realized that after a certain load, the coordinator node slowed down and could not accept any more queries. For instance, in an EMR configuration where the master is m5.2xlarge type and 10 worker nodes of m5.xlarge type, we observed throttling in the master node. We decided to deploy Presto in a way so that it scales as queries increase, thus removing the coordinator node bottleneck entirely.

Below is a diagram that illustrates the Presto deployment architecture:

presto-deployment-architecture

  • AWS ECS is used to deploy custom Presto standalone containers.
  • AWS Application Load Balancer ensures scaling is done at the right threshold.
    • For instance, minimum nodes are set to 3, and if the ECS cluster usage goes above 60%, then a node is added to balance it. For load-balancing, AWS CloudWatch metrics are used.
  • Presto Hive connector leverages a separately deployed metastore to retrieve information about the data stored in the data lake.

Performance statistics

SNo Instance type Number of Nodes Number of concurrent queries Total number of queries Queries per second
1 m5.xlarge 10 500 25K ~24
2 m5.xlarge 20 500 25K ~50
3 m5.xlarge 40 500 25K ~110

Performance graph (Production)

Max nodes are set to 50, but the system hardly uses half of that.

performance-graph

Deploying Presto as the primary query engine for Druva data lake is enabling faster queries, extending capabilities throughout Druva products, and providing benefits for Druva customers.

While designing the system to perform at scale, we acquired some learnings around the concepts of data lake, storage formats, Presto and the cost aspects associated with it.

Key takeaways :

  • To perform faster reads, decide what you need to query and sort your data by the fields needed for filtering the data.
  • If you are not doing aggregations at the data lake level which requires processing large amounts of data, then it doesn’t need a coordinator-worker deployment.
  • Choosing the right storage layer is important. In our case, we had a flexible SLA for querying Presto which allowed us to use S3 instead of HDFS, which in turn saved costs and scale with data.
  • Choose a file format to store data. In our case, we use Parquet file format to store data which allows for faster reads and cost savings at the same time.
  • Presto is great for ad hoc queries, we never intended to re-invent that, but instead, we tweaked it for scalability and cost-effectiveness.

Druva continues to move the needle within cloud data protection and backup. Learn how Druva’s federated search capabilities are enhancing the efforts of forensic investigation teams.