Innovation Series

Building scalable data pipelines to harness the power of a data lake

Vinay Punera, Staff Software Engineer, Druva Labs

At Druva, we perform more than seven million backups every single day. As data continues to be the driving force for any organization, we are committed to developing innovative capabilities that are designed to benefit our customers and their data.

Due to the billions of backup events generated by the sheer volume and scale of unstructured data that we back up, it’s also important to have a central storage pool for each customer that can enable analytics and insights. We refer to this as a “data lake.” 

The Druva data lake

Druva’s data lake is designed to be the centerpiece of our data architecture. It provides a central place for data that Druva products can use to meet their analytics requirements, and that data science and machine learning teams can use for faster experiments.

Data in the Druva data lake comes from various sources like files and streams. Different consumers depend on the data lake for use cases such as data insights, metadata search, machine learning, and more.

The diagram below illustrates the high-level flow of how different consumers utilize the data lake. 

Different uses of the data lake

Enabling access to big data

The Druva data lake holds data from various sources, and not every consumer wants, or needs, access to the entire pool. For instance, a metadata analytics team may only want data coming from the backup file event metadata stream, and perhaps only a few specific attributes of it. To tackle this, we build big data pipelines that extract only the data relevant to each consumer.

To understand this, we’ll need to explore the concepts of data pipelines, ETL, and why it is important to build a scalable data pipeline to enable access to big data.

What is a data pipeline?

A data pipeline is a series of steps for moving data from a source system to a destination system. Typically, it consists of three key elements: a source, a series of data processing steps, and a destination. A data pipeline may or may not include data processing as part of its data movement. The following diagram illustrates a high-level data pipeline example.

Data pipeline

Elements of a data pipeline

A data pipeline starts with reading data from a source system, which can be in the form of files, a message queue, databases, etc. The extracted data may go through one or more processing steps which can include filtering data based on a certain timeframe, converting data types of some data attributes to match the destination system's format, aggregating attributes, and others. In the last step, the data is written into the destination system. 
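
To make these elements concrete, here is a minimal single-machine sketch in Scala; the file paths and the "eventId,epochSeconds,sizeBytes" record layout are made up for illustration. It reads from a source file, filters records to a timeframe and converts an attribute's type, and writes the result to a destination file.

    import java.io.PrintWriter
    import scala.io.Source

    object SimplePipeline {
      def main(args: Array[String]): Unit = {
        // Source: raw CSV lines of the form "eventId,epochSeconds,sizeBytes" (hypothetical)
        val source = Source.fromFile("/tmp/events_source.csv")
        val lines = try source.getLines().toList finally source.close()

        // Processing step 1: keep only events from the last 24 hours
        val cutoff = System.currentTimeMillis() / 1000 - 24 * 3600
        val recent = lines.map(_.split(",")).filter(cols => cols(1).toLong >= cutoff)

        // Processing step 2: convert the size attribute to the destination's unit (MB)
        val transformed = recent.map { cols =>
          val sizeMb = cols(2).toLong / (1024.0 * 1024.0)
          s"${cols(0)},${cols(1)},$sizeMb"
        }

        // Destination: write the processed records to another file
        val out = new PrintWriter("/tmp/events_destination.csv")
        try transformed.foreach(out.println) finally out.close()
      }
    }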

What is ETL?

Extract Transform Load (ETL) is a special case of the data pipeline concept where the process also includes data processing steps to transform data before loading it to the destination system. The destination system is typically a data warehouse from which consumers can query data easily.

ETL

When do you need a big data pipeline?

The term “big data” is typically used to emphasize the volume of data for a data pipeline use case. The volume can be described as events per second in a streaming data pipeline, or the size of data in a batch-based data pipeline. Terms like “big data pipeline,” “big data ETL pipeline,” or “big data ETL” are used interchangeably.

If the data is only a few GBs, the data processing/transformation step can be as simple as a Python script, which will handle transformations like aggregations easily. However, if the data volume is in the hundreds of GBs and constantly growing, the data processing engine must be capable of handling that volume, and it becomes a crucial part of the pipeline.

Data processing in a big data pipeline

Data processing can include simple steps like data type checks and type conversions, as well as complex aggregations such as grouping or joins between two datasets. When these operations are applied to hundreds of GBs of data, choosing the right data processing engine becomes crucial. Below are some of the many engines available for large-scale data processing.

  • Apache Hadoop MapReduce
    • MapReduce is the processing layer of the Hadoop ecosystem. It is a highly scalable distributed data processing framework, but it is comparatively slow for large datasets because it writes intermediate data to disk as part of the MapReduce process.
  • Apache Spark 
    • Spark focuses on in-memory computations and leverages its distributed data structure, known as the RDD (resilient distributed dataset). Spark includes batch processing, stream processing, and machine learning capabilities. For batch processing, it is considerably faster than the traditional MapReduce approach, which makes Spark popular for distributed batch processing.
  • Apache Flink
    • Flink is based on distributed stream processing, and is usually preferred for stream-based data pipelines.

Almost all of these available large-scale data processing engines are based on the concept of distributed computing and are capable of handling huge amounts of data. 
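
As a small illustration of the RDD abstraction mentioned above, the sketch below runs a distributed aggregation with Spark in local mode; the toy data and the per-customer totals are invented purely for the example.

    import org.apache.spark.sql.SparkSession

    object RddSketch {
      def main(args: Array[String]): Unit = {
        // Local Spark session for illustration; in production this would run on a cluster
        val spark = SparkSession.builder()
          .appName("rdd-sketch")
          .master("local[*]")
          .getOrCreate()

        // An RDD is a collection that Spark partitions across workers and processes in memory
        val events = spark.sparkContext.parallelize(Seq(
          ("customer-a", 120L), ("customer-b", 80L), ("customer-a", 40L)
        ))

        // A distributed aggregation: total backup size per customer (toy data)
        val totals = events.reduceByKey(_ + _).collect()
        totals.foreach { case (customer, size) => println(s"$customer -> $size") }

        spark.stop()
      }
    }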

Data pipeline use cases

We will now discuss two big data pipeline use cases for the Druva data lake that illustrate the different steps in building scalable data pipelines.

Data lake formation

The Druva data lake holds hundreds of TBs of data which can be structured or semi-structured. All this data comes from various sources including files, streams, databases, and more.

To ingest all this data into the data lake, a data pipeline is created for each source. Some of these pipelines include a transformation step (for example, those reading from file-based data sources), whereas others write the data as is (for example, a telemetry stream that writes data to the lake directly from various applications).

An ETL process per data source is used to transform different data formats into a single columnar file format, usually Apache Parquet. Streaming sources usually load the data directly to the data lake without a transformation step, for example an AWS Kinesis stream.

Below are the three key steps of an ETL job. 

  • Extract — Data is extracted from a source location, such as a file system or object store like Amazon S3
  • Transform — Extracted data is then transformed into the desired structure/schema for the destination system
  • Load — Finally, transformed data is loaded into the destination system, usually in a columnar file format

Data sources to the data lake
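
As a rough sketch of such a per-source ETL job, the Spark code below extracts a file-based source, transforms it to the lake schema, and loads it as date-partitioned Parquet. The S3 paths, column names, and partition column are hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_date}

    object FileSourceToLake {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("file-source-to-lake").getOrCreate()

        // Extract: read the raw file-based source (hypothetical S3 location and CSV format)
        val raw = spark.read
          .option("header", "true")
          .csv("s3://example-raw-bucket/backup-file-events/")

        // Transform: cast attributes to the lake schema and derive a date partition column
        val shaped = raw
          .withColumn("size_bytes", col("size_bytes").cast("long"))
          .withColumn("event_date", to_date(col("event_timestamp")))

        // Load: write to the data lake as Parquet, partitioned by date
        shaped.write
          .mode("append")
          .partitionBy("event_date")
          .parquet("s3://example-data-lake/backup-file-events/")

        spark.stop()
      }
    }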

Data lake consumption

As the Druva data lake holds a huge amount of data, accessing relevant data becomes like finding a needle in a haystack. To overcome this, ETL jobs transform data from the lake into only the attributes each consumer needs, so consumers can focus on the data relevant to them.

The diagram below illustrates how an ETL pipeline is utilized for data lake consumption.

ETL pipeline in use for data lake consumption

In a simple backup file metadata analytics use case, different stages from the above diagram are as follows.

  • Extract — Data is extracted from the lake; this can include files or the Hive tables over them, typically partitioned by date.
  • Transform — Data transformation includes tasks like reducing attributes, aggregations, data type corrections, etc. Typically, Spark jobs are written for data transformation as most of the analytics requirements don't require real-time access to data. Batch processing makes the most sense in this scenario.
  • Load — Data is then written to the destination system, typically AWS S3, and Hive tables are created over it to make it queryable with tools like Presto or AWS Athena.
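
A hedged sketch of such a consumption-side Spark job is shown below: it extracts one date partition from the lake, keeps only the attributes the consumer needs, aggregates them, and loads the result to an S3 location that a Hive table (queried via Presto or Athena) could point at. All paths and column names here are assumptions for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, sum}

    object MetadataAnalyticsEtl {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("metadata-analytics-etl").getOrCreate()

        // Extract: read only one date partition from the lake (hypothetical path and columns)
        val events = spark.read
          .parquet("s3://example-data-lake/backup-file-events/")
          .where(col("event_date") === "2021-01-01")

        // Transform: keep only the attributes this consumer needs and aggregate them
        val perCustomer = events
          .select("customer_id", "file_type", "size_bytes")
          .groupBy("customer_id", "file_type")
          .agg(count("*").as("file_count"), sum("size_bytes").as("total_bytes"))

        // Load: write the result to the consumer's S3 location; a Hive table defined over
        // this path makes it queryable with Presto or Athena
        perCustomer.write
          .mode("overwrite")
          .parquet("s3://example-analytics-bucket/backup-metadata-summary/event_date=2021-01-01/")

        spark.stop()
      }
    }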

Tools to implement a big data pipeline

Common questions we receive include which tool/framework is best suited for writing an ETL job, which programming language should be used, and which orchestration tool is best. The points below answer most of these questions.

Handling big data processing 

How to best handle big data processing is dependent on the use case in question. For instance, at Druva we use the following.

  • AWS Kinesis streams and data analytics are used to load streaming data directly into the data lake in the Parquet file format.
  • For transformations on big data in AWS S3, we run Apache Spark jobs to load data into, and process data from, the data lake.
  • AWS Glue and AWS EMR are used to deploy Spark jobs. For long-running jobs we prefer EMR, while Glue is used for short-running and ad hoc ETL jobs. AWS EMR gives us a full Hadoop cluster to work with, whereas Glue is a serverless AWS offering where we pay only per usage in DPUs (Data Processing Units).

The preferred programming language for writing Spark ETL code is Scala, with Maven for building the code. Scala is JVM-based and offers a number of advantages, including easier debugging and understanding of Spark code, support for both functional and object-oriented paradigms, and type safety.
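
As a small illustration of the type-safety point (the BackupEvent case class and its fields are made up), Spark's typed Dataset API lets the Scala compiler catch schema mistakes at compile time rather than at runtime.

    import org.apache.spark.sql.SparkSession

    // A made-up record type for illustration; the compiler now knows the schema
    case class BackupEvent(customerId: String, fileType: String, sizeBytes: Long)

    object TypedEtlExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("typed-etl-example")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val events = Seq(
          BackupEvent("customer-a", "pdf", 1048576L),
          BackupEvent("customer-b", "jpg", 524288L)
        ).toDS()

        // A typo such as e.sizeByte would fail at compile time instead of at runtime
        val largeEvents = events.filter(e => e.sizeBytes > 600000L)
        largeEvents.show()

        spark.stop()
      }
    }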

Storage and file format

AWS S3 is the storage layer for the data lake, which saves costs over traditional disk-based storage. Additionally, we store the data in the Apache Parquet columnar file format, which gives us better compression than row-based formats. Most of the data warehouse systems used for querying also rely on AWS S3 for data storage, with Hive tables defined over it.
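
The snippet below is a minimal sketch of how a Hive table could be registered over Parquet files already sitting in S3 so that query engines can find them; the database, table, columns, and S3 location are all hypothetical.

    import org.apache.spark.sql.SparkSession

    object RegisterLakeTable {
      def main(args: Array[String]): Unit = {
        // Hive support is needed so the table definition lands in the shared metastore
        val spark = SparkSession.builder()
          .appName("register-lake-table")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("CREATE DATABASE IF NOT EXISTS lake")

        // Define an external Hive table over Parquet files in S3 (names and path are hypothetical)
        spark.sql(
          """
            |CREATE EXTERNAL TABLE IF NOT EXISTS lake.backup_file_events (
            |  customer_id STRING,
            |  file_type   STRING,
            |  size_bytes  BIGINT
            |)
            |PARTITIONED BY (event_date STRING)
            |STORED AS PARQUET
            |LOCATION 's3://example-data-lake/backup-file-events/'
          """.stripMargin)

        // Pick up date partitions that already exist under the S3 prefix
        spark.sql("MSCK REPAIR TABLE lake.backup_file_events")

        spark.stop()
      }
    }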

The diagram below illustrates how data is loaded and queried from the data lake; it also applies to most data warehouse systems after an ETL process. 

Data loaded and queried from the data lake

Orchestration and monitoring

Druva’s ETL jobs recur on time-based triggers. To orchestrate ETL jobs, we use Apache Airflow and AWS Glue triggers, and Grafana is used to create charts for monitoring the jobs.

Key takeaways

Developers should choose a data processing technique based on what their particular use case demands. For instance, for real-time data needs, stream processing can be an attractive solution; otherwise, batch processing offers its own benefits. Organizations will continue to produce more and more data, so choosing the right file format and storage for your needs is important. Understand how data will be consumed by the target systems, and implement a big data pipeline strategy accordingly.

Druva strives to consistently update our products to provide customers with the functionality and expertise to thrive in the cloud era. Read Vinay Punera’s blog from March 2020 for more about the data lake and how to leverage its capabilities for fast querying, and discover Druva’s latest enhancements in the Tech/Engineering section of the blog archive.