Innovation Series, Tech/Engineering

Achieving >1TB/hr backup speed by implementing the core client-side data pipeline in Rust

Druva backs up and restores data to and from a variety of data sources (cloud apps, file servers, NAS, SQL servers, etc.) that vary in size as well as complexity. To serve our customers better, we continually look for technologies that can make our core backup and restore pipelines flexible, fast, and efficient.

Motivation and requirements for porting our core pipeline

An agent is an executable running on a customer’s device that is responsible for the backup and restore of data to and from the Druva Data Resiliency Cloud. Existing Druva agents are written in Python or Golang and use asynchronous programming (a form of concurrent programming) to speed up operations for their specific workloads.

However, one thing common across all agents is the complex data pipeline transporting data and uploading it to the Druva Cloud, and vice versa for restore.

We decided to carve out this common piece of the data pipeline and bundle it as a library to be used by all our agents. Because this library is the core of our agents, we wanted the data pipeline to have the following features:

  • Performant in terms of network IOPS
  • Scales with the device resources (CPU/memory)
  • Bindings for Golang/Python
  • Mature library support for asynchronous programming
  • Cross-platform support
  • Preferably no garbage collection, as we want tighter control over resources

Reasons for selecting Rust

The Rust language caught our attention and we chose it for the following reasons:

  1. Non-garbage collected (which means much tighter control over resources)
  2. Performance close to that of C
  3. Compiles to native object code
  4. Easily compiles down to a shared library, allowing source-less distribution of the SDK (Software Development Kit), along with C-API (C-based Application Programming Interface) support
  5. Very strong safety guarantees (compared to C and C++) for memory and for highly concurrent code
  6. Built-in support for asynchronous programming

Challenges with adopting the Rust ecosystem and their solutions

Buffer copying

Python and Golang are Garbage Collected (GC) languages, meaning memory is released automatically for allocated objects that are no longer in use by any part of the program. Rust is a non-GC language: there is no runtime collector, and the memory of an allocated object is released deterministically when its owner goes out of scope or when the code frees it explicitly.

The problem with interfacing GC and non-GC runtimes is that data crossing the language boundary typically needs to be copied. This is not a problem for primitive data types, but for large buffers, such as those moving through our core data pipeline, the copies become expensive.

Solution:

To solve this problem, we relied on the Rust layer for all allocations and deallocations of memory. The Go/Python layer allocates a buffer by calling the Rust library's API, then fills the allocated buffer with data through further Rust library API calls. Once the data is uploaded to the Druva Cloud, the buffer is freed by another call into the Rust library. In this way, memory management lies completely with the Rust layer, buffers need not be copied across the language boundary, and the Python and Golang layers never have to garbage-collect these buffers.
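The ownership scheme described above can be sketched roughly as follows. This is a minimal illustration, not the actual SDK: the function names (pipeline_buf_alloc, pipeline_buf_consume, pipeline_buf_free) are hypothetical, the "upload" is replaced by a stand-in that just sums the bytes in place, and the caller is assumed to write through the returned pointer directly, since the article does not give the shape of the fill calls.

```rust
use std::slice;

// Hypothetical C-ABI surface for Rust-owned buffers. All names here are
// illustrative assumptions. A real shared library would additionally export
// these symbols unmangled so that Go (cgo) or Python (ctypes/cffi) can find them.

/// Allocate `len` zeroed bytes owned by the Rust side and return a raw pointer.
/// The foreign caller only ever holds this pointer; it never owns the memory.
pub extern "C" fn pipeline_buf_alloc(len: usize) -> *mut u8 {
    let boxed: Box<[u8]> = vec![0u8; len].into_boxed_slice();
    Box::into_raw(boxed) as *mut u8
}

/// Stand-in for the upload step: reads the buffer in place (no copy) and
/// returns the sum of its bytes.
///
/// Safety: `buf` and `len` must come from `pipeline_buf_alloc` and the buffer
/// must not have been freed yet.
pub unsafe extern "C" fn pipeline_buf_consume(buf: *const u8, len: usize) -> u64 {
    if buf.is_null() {
        return 0;
    }
    let bytes = unsafe { slice::from_raw_parts(buf, len) };
    bytes.iter().map(|&b| u64::from(b)).sum()
}

/// Return the buffer to Rust, which rebuilds the Box and drops it.
///
/// Safety: `buf` and `len` must match a prior `pipeline_buf_alloc` call, and
/// this must be called exactly once per buffer.
pub unsafe extern "C" fn pipeline_buf_free(buf: *mut u8, len: usize) {
    if buf.is_null() {
        return;
    }
    let fat = std::ptr::slice_from_raw_parts_mut(buf, len);
    drop(unsafe { Box::from_raw(fat) });
}
```

The point of the design is that the same allocator both creates and destroys every buffer, so the GC on the Go/Python side never sees this memory and nothing has to be duplicated when the data changes hands.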

Pressure on the OS due to system threads

This problem is specific to Golang, as Python typically runs on a single thread due to the Global Interpreter Lock (GIL). In Golang, a go-routine that blocks in a system call or a foreign function call occupies a system thread for the duration of the call, and the runtime may spawn additional threads to keep other go-routines running. Golang mitigates this by reusing already spawned threads as much as possible. But if N requests are made to the Rust layer concurrently, up to N threads will be spawned, and N can exceed 500. Multiple such processes could exist on the same host, which can put pressure on the OS.

Solution:

To mitigate this issue, we treated the Rust layer as a Remote Procedure Call (RPC) server and the Golang layer as an RPC client. The interface is then modeled on RPC-style request/response. Each request is given a unique ID and enqueued on a channel. A dispatcher at the other end of the channel forwards requests to the Rust side. The dispatcher is the one that calls the Rust library API, so it still ties up a system thread, but only one thread per dispatcher rather than one per request.

Once the request is handled, the pipeline enqueues the response on a queue that a receiver go-routine polls. Using the request ID, the receiver routes the response to the go-routine that made the request.

The effect of this is that the Go side can spawn many go-routines without incurring the system thread penalty. There can be multiple dispatcher and receiver go-routines to scale to a high number of requests.
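The request/response multiplexing described above can be illustrated with a small sketch. It is written in Rust only to keep a single language for the examples in this post; in the real setup the client side is Golang, with go-routines and Go channels where this sketch uses threads and std::sync::mpsc. The types Request, Response, and Client, and the function pipeline_call (a stand-in for the blocking call into the pipeline), are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

pub struct Request {
    pub id: u64,
    pub payload: Vec<u8>,
}

pub struct Response {
    pub id: u64,
    pub bytes_handled: usize,
}

/// Hypothetical stand-in for the blocking call into the data pipeline.
fn pipeline_call(req: Request) -> Response {
    Response { id: req.id, bytes_handled: req.payload.len() }
}

/// Map from request ID to the channel on which the caller is waiting.
type Pending = Arc<Mutex<HashMap<u64, Sender<Response>>>>;

#[derive(Clone)]
pub struct Client {
    next_id: Arc<AtomicU64>,
    pending: Pending,
    requests: Sender<Request>,
}

impl Client {
    /// Start one dispatcher and one receiver; both stop once every Client is dropped.
    pub fn start() -> Client {
        let (req_tx, req_rx) = mpsc::channel::<Request>();
        let (resp_tx, resp_rx) = mpsc::channel::<Response>();
        let pending: Pending = Arc::new(Mutex::new(HashMap::new()));

        // Dispatcher: the only thread that makes the blocking pipeline call.
        thread::spawn(move || {
            for req in req_rx {
                if resp_tx.send(pipeline_call(req)).is_err() {
                    break;
                }
            }
        });

        // Receiver: routes each response back to its caller using the request ID.
        let routes = Arc::clone(&pending);
        thread::spawn(move || {
            for resp in resp_rx {
                let reply = routes.lock().unwrap().remove(&resp.id);
                if let Some(reply) = reply {
                    let _ = reply.send(resp);
                }
            }
        });

        Client { next_id: Arc::new(AtomicU64::new(1)), pending, requests: req_tx }
    }

    /// Called from any number of concurrent callers; blocks until the matching response arrives.
    pub fn call(&self, payload: Vec<u8>) -> Response {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        let (reply_tx, reply_rx) = mpsc::channel();
        // Register the reply channel before sending, so the response cannot race ahead of it.
        self.pending.lock().unwrap().insert(id, reply_tx);
        self.requests.send(Request { id, payload }).expect("dispatcher stopped");
        reply_rx.recv().expect("receiver stopped")
    }
}
```

However many callers invoke call concurrently, only the single dispatcher thread ever sits inside pipeline_call. Running several dispatcher/receiver pairs, as the post mentions, would raise throughput while keeping the thread count bounded by the number of dispatchers rather than the number of requests.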


Key takeaways

Rust helped us deliver a fast and robust common data pipeline for all workloads. Adopting the language enabled Druva to avoid a host of memory-related flaws. Porting the core data pipeline to Rust also delivered clear performance benefits compared to our existing agents in Golang/Python. We were able to achieve a little above 1 TB/hr of backup speed with the current architecture. For comparison, with Golang we clocked 800 GB/hr, and with Python around 500 GB/hr.

However, apart from the domain-related problems described above, Rust has a steep learning curve. Onboarding new developers onto our team takes three to six months. But given the velocity at which we are delivering new features, it’s worth the investment.

Next steps

Looking to learn more about the technical innovations and best practices powering cloud backup and data management? Visit the Innovation Series section of Druva’s blog archive.