Building a Versatile Data Pipeline for a SaaS Cloud Platform

Rakesh Landge, Principal Engineer

Data silos are isolated data repositories, common in organizations where teams store and manage data independently. When silos are not managed properly, they can become a liability instead of an asset. Building a data pipeline is an effective way to manage data silos.

This article explores the design of a data pipeline that feeds a warehouse with data from multiple heterogeneous stores within a SaaS product ecosystem. We will also go through the results and benefits of implementing this design. 

Problems with Data Silos 

In the dynamic landscape of SaaS cloud platforms based on microservices architecture, data silos are an all-too-common challenge. 

In a microservices setup, each microservice maintains its own data store, leading to fragmented data and difficulties in achieving a unified view of the organization's information. Accessing data across multiple microservices can result in slow retrieval and increased latency. Implementing uniform security and access control can be complex, and addressing data ownership and accountability issues is an additional hurdle.

A robust data pipeline is not just a necessity but a strategic asset in such an environment. This pipeline must be adaptable to a multitude of use cases, ranging from data analytics and dashboarding to reporting and machine learning. To meet these diverse demands, data must be collected from various subsystems and centralized in a warehouse or data lake. 

Key Considerations for the Pipeline Design

To ensure the data pipeline can handle these diverse use cases effectively, several key considerations should guide its design.

  1. Scalability and elasticity

    • Scalability to seamlessly handle growing data volumes, microservices, and user demands.

    • Autoscaling mechanisms to adapt to changing workloads.

  2. Performance

    • Ability to handle full loads or just incremental data from diverse sources to meet Service Level Agreements (SLAs).

  3. Flexibility

    • Flexibility to integrate different data sources and the ability to adapt to changing business requirements.

    • Ability to fetch data from multiple data sources but store it in the same table in the warehouse.

  4. Fault tolerance

    • Error handling and retry mechanisms to gracefully handle transient failures.

  5. DevOps and CI/CD Integration

    • Automated configuration management and infrastructure provisioning to streamline pipeline maintenance.

Data Pipeline Solution

A data pipeline can take various forms, such as batch processing or stream processing. For the use cases we've mentioned, a batch processing pipeline, where a high volume of data is moved at regular intervals, is a good fit. Several options are available, including AWS-native services such as AWS Database Migration Service (DMS), AWS Data Pipeline, and AWS Glue, orchestration frameworks such as Apache Airflow, and third-party solutions like Hevo.

Data Pipeline Based on AWS Glue and Redshift

If the product tech stack is based on the AWS platform, then AWS Glue stands out as the preferred choice for data processing. It is a fully managed serverless platform that provides elastic scalability and offers cost optimization through a pay-as-you-go model. It is also optimized to work seamlessly with cloud-based data sources and destinations.

Redshift, a go-to choice for data warehousing on AWS, is a petabyte-scale data warehouse that can handle both structured and semi-structured data. It supports materialized views, which can be leveraged to create a flattened view of raw data by joining multiple tables in the warehouse according to business logic. This is very useful for our use cases, as the views provide fast retrieval of enriched data for downstream processing.
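
To make this concrete, the hypothetical snippet below creates such a flattened view through the Redshift Data API. The cluster, database, table, and column names are assumptions used only for illustration.

```python
# Hypothetical sketch: creating a flattened materialized view in Redshift by
# joining two warehouse tables. Cluster, table, and column names are assumptions.
import boto3

redshift_data = boto3.client("redshift-data")

CREATE_VIEW_SQL = """
CREATE MATERIALIZED VIEW mv_backup_report AS
SELECT b.backup_id,
       b.status,
       b.completed_at,
       d.device_name,
       d.tenant_id
FROM backups b
JOIN devices d ON b.device_id = d.device_id;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="reporting-cluster",   # assumed cluster name
    Database="warehouse",                    # assumed database
    DbUser="etl_user",                       # assumed database user
    Sql=CREATE_VIEW_SQL,
)
print("Submitted statement:", response["Id"])
```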

AWS Glue Components

  • Data Catalog

    • Stores metadata for Glue jobs to access data sources.

    • Contains connections, which are verified links between Glue and data sources, including Redshift (the destination).

  • Glue Jobs

    • Customizable scripts containing programming logic for ETL operations.

  • Glue Triggers

    • Run on a schedule (defined with a cron expression), on demand, or upon a job event (see the sketch after this list).

  • Glue Workflows

    • Provides job orchestration, essential for our use case.

    • Workflows can be scheduled to run using Glue Triggers.
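
To make the trigger and workflow pieces concrete, here is a minimal, hypothetical sketch of how a workflow, a scheduled trigger, and a conditional trigger might be wired together with boto3. The job, workflow, and schedule names are assumptions, and the referenced jobs are presumed to already exist.

```python
# Hypothetical sketch: wiring a Glue workflow to a scheduled and a conditional
# trigger with boto3. Job, workflow, and trigger names are illustrative, and
# the jobs "source-sync-job" and "upsert-job" are assumed to exist already.
import boto3

glue = boto3.client("glue")

# A workflow to orchestrate the pipeline jobs.
glue.create_workflow(
    Name="reporting-pipeline-workflow",
    Description="Orchestrates source sync, upsert, and view refresh jobs",
)

# A scheduled trigger that starts the first job of the workflow every hour.
glue.create_trigger(
    Name="reporting-pipeline-hourly-trigger",
    WorkflowName="reporting-pipeline-workflow",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",            # hourly, in Glue's cron syntax
    Actions=[{"JobName": "source-sync-job"}],
    StartOnCreation=True,
)

# A conditional trigger that starts the upsert job once the sync job succeeds.
glue.create_trigger(
    Name="reporting-pipeline-upsert-trigger",
    WorkflowName="reporting-pipeline-workflow",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "source-sync-job",
                "State": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "upsert-job"}],
    StartOnCreation=True,
)
```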

Automated Data Pipeline Provisioning

A pipeline provisioning service is created to manage the data pipeline by providing APIs to create, update, and delete pipeline components such as Glue Jobs, connections, triggers, workflows, Redshift tables, and materialized views.

This service allows automated creation of pipelines with minimal effort and time, making them easy to manage without DevOps involvement. The service is implemented as an AWS Lambda function that leverages the AWS Glue, S3, and Redshift Data SDKs to manage pipeline components.

Input to the provisioning service is a standardized metadata file in JSON format, which acts as a registry of the pipeline components.

The provisioning service parses the JSON and creates the necessary components for Glue data processing. It stores metadata registry files on Amazon S3 for:

  • Further processing during job runs

  • Identifying changes during an update operation

  • Cleaning up resources during a delete operation

  • Retrieving the latest version of the file during a get operation
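
A heavily simplified, hypothetical sketch of such a provisioning Lambda is shown below. It covers only the create path for Glue jobs and the registry upload; the field names, bucket, role, and script locations are assumptions, and a real service would also create connections, triggers, workflows, Redshift tables, and views, and handle updates, deletes, and errors.

```python
# Hypothetical sketch of the provisioning Lambda: parse the metadata registry,
# create the Glue jobs it describes, and archive the registry file to S3.
# Bucket, role, script location, and registry field names are assumptions.
import json
import boto3

glue = boto3.client("glue")
s3 = boto3.client("s3")

REGISTRY_BUCKET = "pipeline-metadata-registry"   # assumed bucket name


def handler(event, context):
    # Assumes an API Gateway proxy integration posting the registry JSON.
    registry = json.loads(event["body"])

    # Create one Glue job per table definition in the registry.
    for table in registry.get("tables", []):
        glue.create_job(
            Name=f"sync-{table['sourceName']}",
            Role="arn:aws:iam::123456789012:role/GluePipelineRole",  # assumed role
            Command={
                "Name": "glueetl",
                "ScriptLocation": "s3://pipeline-scripts/source_sync.py",
                "PythonVersion": "3",
            },
            GlueVersion="4.0",
            Connections={"Connections": [table["connectionName"]]},
            DefaultArguments={"--registry_key": registry["pipelineName"] + ".json"},
            NumberOfWorkers=4,
            WorkerType="G.1X",
        )

    # Store the registry on S3 for job runs, updates, deletes, and gets.
    s3.put_object(
        Bucket=REGISTRY_BUCKET,
        Key=f"{registry['pipelineName']}/registry.json",
        Body=json.dumps(registry).encode("utf-8"),
    )
    return {"statusCode": 200, "body": json.dumps({"status": "created"})}
```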

Metadata Registry Elements

  • version: Schema version.

  • connections: List of data source connections with fields for the connection name, DB type, DB host, port, DB name, username, password, and so on.

  • tables: List of table definitions with fields for source name, target name, connection name, list of columns, unique keys, the column used for incremental sync, etc. Column entries can have fields for source column name, target column name, data type, default value, etc.

  • materializedViews: List of Redshift materialized views with a field for name and select query.

  • constants: Optional section to create a table of Constants/Enums/Status used in the pipeline with fields for table name, list of columns, and list of data rows.
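
Tying these elements together, a registry for a single pipeline might look roughly like the Python representation below (the dict that would be serialized to the JSON registry file). All field names and values are illustrative, not a definitive schema.

```python
# Illustrative metadata registry for one pipeline, expressed as the Python dict
# that would be serialized to the JSON registry file. Names, values, and the
# "pipelineName" field are hypothetical; the real schema is defined by the
# provisioning service.
registry = {
    "version": "1.0",
    "pipelineName": "reporting-pipeline",
    "connections": [
        {
            "name": "orders-db",
            "dbType": "postgresql",
            "dbHost": "orders.cluster-abc.us-east-1.rds.amazonaws.com",
            "port": 5432,
            "dbName": "orders",
            "username": "etl_reader",
            # password omitted here; store credentials securely
        }
    ],
    "tables": [
        {
            "sourceName": "orders",
            "targetName": "fact_orders",
            "connectionName": "orders-db",
            "uniqueKeys": ["order_id"],
            "incrementalColumn": "updated_timestamp",
            "columns": [
                {"sourceName": "order_id", "targetName": "order_id", "dataType": "bigint"},
                {"sourceName": "status", "targetName": "order_status", "dataType": "varchar"},
            ],
        }
    ],
    "materializedViews": [
        {
            "name": "mv_order_report",
            "selectQuery": "SELECT o.order_id, o.order_status, t.tenant_name "
                           "FROM fact_orders o JOIN dim_tenants t ON o.tenant_id = t.tenant_id",
        }
    ],
}
```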

Incremental Sync

AWS Glue doesn't provide a built-in change data capture (CDC) mechanism, so incremental sync of transactional data needs special handling.

The source table should have a column that identifies changed records: for example, an id or created_timestamp column for immutable data, or an updated_timestamp column for transactional data. By identifying such a column and listing it in the metadata registry, the pipeline can keep the last sync value of each table in persistent storage (such as Redshift) so that records can be fetched incrementally based on that value.
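
A minimal sketch of how a sync job might turn the registry entry and the stored last sync value into an extraction query is shown below. The field names and the example table are assumptions.

```python
# Hypothetical sketch: building an incremental extraction query from the
# registry entry and the last sync value persisted in the warehouse.
# Field names and the example table are assumptions; values come from the
# trusted registry and sync-state store, not from user input.
from typing import Optional


def build_incremental_query(table_def: dict, last_sync_value: Optional[str]) -> str:
    """Return the SELECT used to pull only new or changed rows from the source."""
    source = table_def["sourceName"]
    inc_col = table_def.get("incrementalColumn")

    if inc_col and last_sync_value is not None:
        # Incremental sync: fetch rows changed after the last successful run.
        return f"SELECT * FROM {source} WHERE {inc_col} > '{last_sync_value}'"
    # First run, or no incremental column configured: full load.
    return f"SELECT * FROM {source}"


# Example: a transactional table keyed on updated_timestamp.
table_def = {"sourceName": "orders", "incrementalColumn": "updated_timestamp"}
print(build_incremental_query(table_def, "2024-01-01 00:00:00"))
# SELECT * FROM orders WHERE updated_timestamp > '2024-01-01 00:00:00'
```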

Limitation: hard-deleted records in a table identified for incremental sync

There is currently no clean way to identify hard-deleted rows in source tables.

One approach could be to add a soft-delete column to the source table, but such an addition might require changes to existing business service logic.

If triggers are supported on the source data store, another approach is to add a post-delete trigger that inserts the primary key of the deleted record into a custom deleted_<source_table>_rows table. This table is synced to the warehouse like any other and can be used later during pipeline processing jobs to mark deleted rows in the warehouse table.
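
For example, on a PostgreSQL source, the post-delete trigger could be registered roughly as follows. The table, function, and connection details are assumptions, and a MySQL source would use its own AFTER DELETE trigger syntax.

```python
# Hypothetical sketch: capturing hard deletes on a PostgreSQL source table with
# an AFTER DELETE trigger. Table, function, and connection details are assumptions.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS deleted_orders_rows (
    order_id   BIGINT,
    deleted_at TIMESTAMP DEFAULT NOW()
);

CREATE OR REPLACE FUNCTION capture_deleted_orders() RETURNS trigger AS $$
BEGIN
    INSERT INTO deleted_orders_rows (order_id) VALUES (OLD.order_id);
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS orders_post_delete ON orders;
CREATE TRIGGER orders_post_delete
    AFTER DELETE ON orders
    FOR EACH ROW EXECUTE FUNCTION capture_deleted_orders();  -- PostgreSQL 11+ syntax
"""

with psycopg2.connect("dbname=orders user=etl_admin") as conn:
    with conn.cursor() as cur:
        # The deleted_orders_rows table is then synced to the warehouse
        # like any other source table.
        cur.execute(DDL)
```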

Data Processing Workflow

It is very important for pipeline jobs to run in a specific order to establish a reliable data transfer mechanism, so workflows are created to run the jobs in a particular sequence.

  • Source Sync Job workflow

    1. Read metadata from S3 and extract table and connection sections.

    2. Create a connection map using Glue connections for username, password, and URL.

    3. Fetch the last sync values, and then connect to Redshift and store source data in staging tables.

  • Upsert Job workflow

    1. Read metadata and create upsert and post-upsert queries for each table.

    2. Include conditional clauses for incremental sync based on unique keys.

    3. Execute queries using the Redshift Data API client; update last sync values and truncate staging tables (a sketch of these steps follows this list).

  • Refresh Materialized Views Job workflow

    1. Read metadata to extract materialized views and connection sections.

    2. Use the Redshift Data API client to iterate over the materialized views, execute refresh queries, and wait for completion using the client's describe_statement API.
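
As a rough illustration of the upsert and refresh steps, the sketch below performs a delete-then-insert upsert from a staging table, truncates staging, and refreshes a materialized view through the Redshift Data API, polling describe_statement until each statement completes. Cluster, database, table, and view names are assumptions.

```python
# Hypothetical sketch of the upsert and refresh steps using the Redshift Data API.
# Cluster, database, table, and view names are assumptions for illustration.
import time
import boto3

redshift_data = boto3.client("redshift-data")

CLUSTER = "reporting-cluster"
DATABASE = "warehouse"
DB_USER = "etl_user"


def run_sql(sql: str) -> None:
    """Submit a statement and wait for it to finish using describe_statement."""
    statement_id = redshift_data.execute_statement(
        ClusterIdentifier=CLUSTER, Database=DATABASE, DbUser=DB_USER, Sql=sql
    )["Id"]
    while True:
        status = redshift_data.describe_statement(Id=statement_id)["Status"]
        if status == "FINISHED":
            return
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"Statement {statement_id} ended with status {status}")
        time.sleep(2)


# Upsert: delete rows that have a newer version in staging, then insert from staging.
run_sql("""
    DELETE FROM fact_orders
    USING staging_orders
    WHERE fact_orders.order_id = staging_orders.order_id
""")
run_sql("INSERT INTO fact_orders SELECT * FROM staging_orders")
run_sql("TRUNCATE TABLE staging_orders")

# Refresh the dependent materialized view so downstream reads see the new data.
run_sql("REFRESH MATERIALIZED VIEW mv_order_report")
```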

Key Results

Since its inception, Druva’s SaaS platform has been designed to be 100% cloud-native and is based on the AWS platform. At Druva, we have built a data pipeline based on the above solution to deliver a unified reporting experience to our customers.

  • Scale metrics

    • ~250 tables present in 12+ diverse data stores

    • AWS Aurora and RDS source data stores with MySQL and PostgreSQL as DB engines

    • 3 data pipelines operating simultaneously in isolation to move contextual product data to the warehouse

    • More than 1 billion records, comprising multiple terabytes of transactional data, synced per day

  • Job configs

    • Sync frequency - Hourly

    • Spark-type Glue jobs for source data sync (max 4 workers) and Python shell for other jobs

With the above configuration, 60M records consisting of ~45GB of reporting data get synced within 40 minutes.

Advantages

  • Once the metadata registry is ready, the pipeline can be provisioned within minutes without any DevOps involvement.

  • Multiple data pipelines can be created and run in isolation, without impacting the execution of other pipelines.

  • Source data stores can be present in the same or different AWS accounts and regions.

  • Data can be aggregated from multiple data sources and stored in a single table in the warehouse.

  • Schema modifications can be made through changes to the metadata registry; the provisioning service applies them based on the delta from the previous version.

  • If data needs to be curated before it is stored in the warehouse, Glue's out-of-the-box or custom transformations can be applied during job runs.

  • Data can be retained for the long term in warehouse tables even if it gets deleted in source tables.

Conclusion

A well-designed data pipeline serves as a strategic asset for any data-driven organization. By carefully considering the diverse use cases and by following best practices in design and implementation, you can build a data pipeline that empowers your organization to extract maximum value from its data assets. Leveraging cloud-native solutions like AWS Glue and Redshift, along with automation for pipeline provisioning, streamlines the process of handling data from heterogeneous sources and delivering actionable insights to meet the evolving needs of the business. Remember that a successful data pipeline is not static; it evolves with the organization's data needs and technological advancements, ensuring it remains a valuable asset for years to come.