
Pioneer: a Druva solution for state graph workflows

Complex applications can often be broken into independent functions that communicate with each other and must run in a specific sequence. To simplify these workflows, the Druva team created Pioneer, a development framework for state graph workflows that lets developers manage, control, and evaluate the efficiency of those functions. Developers can then focus on implementing business logic without worrying about concurrency, error handling, or other basic functional primitives.

What is Pioneer?

Pioneer is a state machine in which the state is the primary abstraction. A state is typically an independent function of the application. A developer specifies the sequence of states in a configuration, and Pioneer builds a graph from that sequence in which each node is a state. Pioneer then controls the application workflow by running the states in order and passing data forward from one state to the next over communication channels. Each state can be defined with its own concurrency and error handling.

Pioneer supports the following state types:

  • Lambda state: Invokes Lambda functions or API calls
  • Local state: Calls local functions (Go only)
  • Batch state: Batches or groups data
  • Map state: Converts an array of data into individual elements
  • Choice state: Makes a conditional choice, similar to if-else
  • HTTP/S state: Handles HTTP and HTTPS request calls
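
To make the map and batch states more concrete, here is a minimal Go sketch of their semantics. It is an illustration under stated assumptions rather than Pioneer's actual code: the function names `mapState`, `batchState`, and `runMapThenBatch` are hypothetical, the data is limited to integers, and the batch size is assumed to be greater than zero.

```go
package main

import "fmt"

// mapState is a hypothetical stand-in for a map state: it reads arrays from
// in and emits each element individually on the returned channel.
func mapState(in <-chan []int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for arr := range in {
			for _, v := range arr {
				out <- v
			}
		}
	}()
	return out
}

// batchState is a hypothetical stand-in for a batch state: it groups
// incoming elements into slices of at most size elements (size must be > 0).
func batchState(in <-chan int, size int) <-chan []int {
	out := make(chan []int)
	go func() {
		defer close(out)
		batch := make([]int, 0, size)
		for v := range in {
			batch = append(batch, v)
			if len(batch) == size {
				out <- batch
				batch = make([]int, 0, size)
			}
		}
		if len(batch) > 0 {
			out <- batch // flush the final, possibly smaller, batch
		}
	}()
	return out
}

// runMapThenBatch feeds the arrays through a map state followed by a batch
// state and collects the resulting batches.
func runMapThenBatch(arrays [][]int, size int) [][]int {
	in := make(chan []int)
	go func() {
		defer close(in)
		for _, a := range arrays {
			in <- a
		}
	}()
	var batches [][]int
	for b := range batchState(mapState(in), size) {
		batches = append(batches, b)
	}
	return batches
}

func main() {
	// Two input arrays are flattened and regrouped into batches of two.
	fmt.Println(runMapThenBatch([][]int{{1, 2, 3}, {4, 5}}, 2)) // prints [[1 2] [3 4] [5]]
}
```

Chaining the two functions shows how a map state and a batch state can reshape data between other states without those states knowing about the regrouping.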

Why Pioneer?

Although workflow management tools such as Airflow and AWS Step Functions exist, we built Pioneer for our own distinct needs. Unlike Airflow, Pioneer is designed and optimized primarily for data pipelining use cases. AWS Step Functions lets users sequence Lambda functions and provides a workflow monitoring interface. Pioneer does not have a monitoring interface, but it provides detailed telemetry of the workflow and better error handling capabilities.

How does Pioneer work?

Pioneer coordinates the application workflow, in which a state is a single unit of work. A Pioneer configuration is a map of all possible states and the transitions between them. Pioneer processes this configuration in the steps below.

The life cycle of Pioneer

  • Define states in JSON: A developer defines the states and their transitions in JSON. Each state has its own attributes, which depend on the state type.
  • Create state graph: Pioneer parses the JSON and builds a state graph in which states are nodes and transitions are edges.
  • Execute graph: Pioneer launches instances of every state in the graph with the specified concurrency and establishes a communication channel for each edge. The entry point of execution is the start state defined in the configuration. When the start state receives a request, it executes and passes its output to the next state in sequence via the communication channel. Subsequent states transform the data and propagate it through the graph. Once the start state has finished processing, it sends a terminate signal, which propagates through the graph to subsequent states. When a state receives the signal, all instances of that state are terminated.
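
The life cycle above can be sketched in Go roughly as follows. This is a simplified illustration, not Pioneer's implementation: the JSON layout (`start`, `states`, `next`, `concurrency`), the `handlers` table, and the function names are all assumptions made for the example, and the sketch handles only a linear graph. Closing a channel plays the role of the terminate signal.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"sync"
)

// StateDef and Config mirror a hypothetical JSON layout, not Pioneer's real schema.
type StateDef struct {
	Next        string `json:"next"`        // next state name; "" marks the last state
	Concurrency int    `json:"concurrency"` // number of parallel instances
}

type Config struct {
	Start  string              `json:"start"`
	States map[string]StateDef `json:"states"`
}

// handlers stands in for the work each named state performs.
var handlers = map[string]func(int) int{
	"double": func(v int) int { return v * 2 },
	"inc":    func(v int) int { return v + 1 },
}

// buildSequence validates the config and returns the states in execution
// order. This sketch only handles a linear graph.
func buildSequence(cfg Config) ([]string, error) {
	var order []string
	seen := map[string]bool{}
	for name := cfg.Start; name != ""; name = cfg.States[name].Next {
		def, ok := cfg.States[name]
		if !ok || seen[name] || def.Concurrency < 1 || handlers[name] == nil {
			return nil, fmt.Errorf("invalid state %q", name)
		}
		seen[name] = true
		order = append(order, name)
	}
	if len(order) == 0 {
		return nil, fmt.Errorf("no start state defined")
	}
	return order, nil
}

// runState launches the configured number of instances of one state. Closing
// in acts as the terminate signal; out is closed once every instance has
// exited, which propagates the signal to the next state.
func runState(fn func(int) int, concurrency int, in <-chan int, out chan<- int) {
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for v := range in {
				out <- fn(v)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
}

// Execute parses the config, wires one channel per edge, feeds the inputs to
// the start state, and returns the sorted outputs of the last state.
func Execute(rawConfig string, inputs []int) ([]int, error) {
	var cfg Config
	if err := json.Unmarshal([]byte(rawConfig), &cfg); err != nil {
		return nil, err
	}
	order, err := buildSequence(cfg)
	if err != nil {
		return nil, err
	}
	source := make(chan int)
	var in <-chan int = source
	for _, name := range order {
		out := make(chan int) // one channel per edge
		runState(handlers[name], cfg.States[name].Concurrency, in, out)
		in = out
	}
	go func() {
		defer close(source) // terminate signal once all requests are sent
		for _, v := range inputs {
			source <- v
		}
	}()
	var results []int
	for v := range in {
		results = append(results, v)
	}
	sort.Ints(results) // parallel instances may reorder items
	return results, nil
}

func main() {
	config := `{"start":"double","states":{"double":{"next":"inc","concurrency":2},"inc":{"next":"","concurrency":1}}}`
	out, err := Execute(config, []int{1, 2, 3})
	fmt.Println(out, err) // prints [3 5 7] <nil>
}
```

Validating the whole configuration before launching any goroutine keeps a bad config from leaving half-started states behind. The results are sorted only because parallel instances of a state may finish in any order.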

The following example use cases show how Pioneer can be applied to workflow management.

Example workflow use cases

  • Simple sequence workflow: A workflow that executes states in a given sequence. The output of each state is passed as input to the next state, and the output of the last state is the result.

  • Parallel processing: A workflow in which multiple instances of a state are spawned. The instances run in parallel and are independent of each other. Each instance receives different data and passes its output to the channel. The concurrency value is specified per state in the configuration.

  • Branching: A workflow in which a decision determines which state to call next. The choice state handles such cases by evaluating fields in the incoming data. The evaluation expression is defined in the choice state's configuration.

  • Mapping and batching: Pioneer supports splitting an array of data into individual elements using the map state and aggregating data using the batch state.

  • Error handling: Pioneer provides a basic default error handler for all states. When a state encounters an error, the specified error handler is called and no subsequent states are executed. A state can be retried multiple times before the error handler is called.

  • Telemetry: A distinctive feature of Pioneer is the detailed statistics it provides about state execution. Metrics including execution time, wait time, and error count are logged for each state. By analyzing these stats, developers can fine-tune the performance of their sequence.
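
As an illustration of the error handling behavior described above, the following Go sketch retries a failing state a fixed number of times, calls an error handler only after the last retry fails, and skips every later state. The helper names (`runWithRetry`, `runSequence`), the `errTransient` value, and the retry policy are assumptions made for this example. Pioneer's own retry options are not documented here.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient is a hypothetical error used to simulate a failing state.
var errTransient = errors.New("transient failure")

// runWithRetry calls fn up to maxRetries+1 times and invokes onError only if
// every attempt fails.
func runWithRetry(fn func(int) (int, error), input, maxRetries int, onError func(error)) (int, bool) {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		var out int
		if out, err = fn(input); err == nil {
			return out, true
		}
	}
	onError(err)
	return 0, false
}

// runSequence passes the input through the states in order. If a state still
// fails after its retries, the handler is called and no later state runs.
func runSequence(states []func(int) (int, error), input, maxRetries int, onError func(error)) (int, bool) {
	value := input
	for _, state := range states {
		out, ok := runWithRetry(state, value, maxRetries, onError)
		if !ok {
			return 0, false
		}
		value = out
	}
	return value, true
}

func main() {
	calls := 0
	// flaky fails on its first two calls and succeeds on the third.
	flaky := func(v int) (int, error) {
		calls++
		if calls < 3 {
			return 0, errTransient
		}
		return v * 10, nil
	}
	addOne := func(v int) (int, error) { return v + 1, nil }
	handler := func(err error) { fmt.Println("error handler called:", err) }

	result, ok := runSequence([]func(int) (int, error){flaky, addOne}, 4, 2, handler)
	fmt.Println(result, ok) // prints 41 true
}
```

With two retries allowed, the flaky state succeeds on its third attempt, so the handler is never called and the sequence completes.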

Key takeaways

Pioneer coordinates an application workflow by pipelining its functions. It is well suited to pipelining Lambda functions and offers the following benefits:

  • Quick workflow updates: It is easy to add or remove a state from the pipeline by simply updating the configuration.
  • Code simplification: Pioneer implements basic operations like concurrency, branching, and error handling. This removes boilerplate code that would otherwise be repeated across functions.
  • Performance insights: Pioneer provides statistics for functions including total calls, number of errors, execution time, wait time, and more. Developers can use these stats to increase or decrease the concurrency and output buffer count of each function and so improve performance.
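
The statistics mentioned above could be gathered along the lines of the following Go sketch. It is a hedged illustration only: the `Stats` type, its field names, and the `instrumentedState` wrapper are hypothetical, and the way Pioneer actually measures wait time is an assumption. Here, wait time is the time an instance spends blocked waiting for input, and execution time is the time spent inside the state function.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Stats holds hypothetical per-state counters.
type Stats struct {
	mu       sync.Mutex
	Calls    int
	Errors   int
	ExecTime time.Duration // time spent inside the state function
	WaitTime time.Duration // time spent blocked waiting for input
}

// record adds one call's measurements under the lock.
func (s *Stats) record(wait, exec time.Duration, err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.Calls++
	if err != nil {
		s.Errors++
	}
	s.WaitTime += wait
	s.ExecTime += exec
}

// instrumentedState runs the given number of instances of fn, writes results
// to an output channel with the given buffer size, and records stats.
func instrumentedState(fn func(int) (int, error), in <-chan int, concurrency, buffer int, stats *Stats) <-chan int {
	out := make(chan int, buffer)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				waitStart := time.Now()
				v, ok := <-in
				if !ok {
					return // input closed: terminate this instance
				}
				wait := time.Since(waitStart)
				execStart := time.Now()
				res, err := fn(v)
				stats.record(wait, time.Since(execStart), err)
				if err == nil {
					out <- res
				}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// square is an example state function that rejects negative input.
func square(v int) (int, error) {
	if v < 0 {
		return 0, fmt.Errorf("negative input %d", v)
	}
	return v * v, nil
}

// collect runs square over the inputs and returns the sum of the successful results.
func collect(inputs []int, concurrency, buffer int, stats *Stats) int {
	in := make(chan int)
	go func() {
		defer close(in)
		for _, v := range inputs {
			in <- v
		}
	}()
	sum := 0
	for r := range instrumentedState(square, in, concurrency, buffer, stats) {
		sum += r
	}
	return sum
}

func main() {
	stats := &Stats{}
	sum := collect([]int{1, 2, -3, 4}, 2, 4, stats)
	fmt.Println("sum:", sum, "calls:", stats.Calls, "errors:", stats.Errors)
	fmt.Println("exec:", stats.ExecTime, "wait:", stats.WaitTime)
}
```

In a setup like this, a high wait time relative to execution time suggests the state is starved for input, so the upstream state may need more concurrency or a larger output buffer. A low wait time with a high execution time points at the state itself.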

At Druva, we use Pioneer for data pipelining use cases such as data enrichment workflows, which include operations such as named entity recognition (NER), topic classification, sentiment analysis, scoring and relevance, and more. Pioneer fits these operations well because each one runs independently and concurrently while following a specific sequence. Pioneer's telemetry helps us optimize performance by analyzing execution and wait time.

Druva is at the forefront of cloud-based innovation, and we consistently update our products to provide customers with next-generation functionality. To learn more about other applications we have built, such as Jarvis, an intelligent bot for enhanced workflow management, see the Tech/Engineering section of the blog archive.