Druva’s backup solutions were architected from the beginning to be cloud-native and built on AWS. Druva has built many applications on top of these cloud-native backup solutions which deliver additional value to customers; among these are solutions to help ensure customer cyber resiliency and resistance to ransomware. One of Druva’s new and innovative features to predict and prevent malicious attacks on customer data is Unusual Data Activity (UDA) detection.
In a typical attack, a malicious user or software modifies data in a suspicious manner on a device. This modification is considered UDA, and Druva inSync, Druva’s SaaS-based platform for protection and management across endpoints and cloud applications, leverages UDA detection to provide reports which help identify suspicious activity, such as:
- Large number of files deleted or added
- Unwarranted modification of files
- Suspicious encryption of files
This UDA feature was available primarily to inSync customers because it depended on, and was tightly coupled with, the endpoint backup framework. In short, this means the following:
- The coupling took the form of REST APIs. The workflow of finishing a backup, detecting anomalies, and submitting the result back to the inSync Master service ran synchronously, with two-way REST API communication between the inSync Master service and the UDA service.
- The endpoint backup framework is made up of multiple smaller services that facilitate backup and recovery. These include the syncer service, master config server, backup manager service, API service, node service, storage service, and user portal service.
Current coupling and design
UDA modules build an artificial intelligence (AI) model based on the number of files added, updated, or deleted in a backup. The model learns the backup activity on each device, and a separate model is built for every endpoint device. Backup metadata is generated during an active backup session and pushed synchronously, along with the data encryption key, to the UDA master module, which adds it to the AI model. The UDA master module then processes the backup information using the AI model.
If any device is found to have anomalies, the UDA master makes a synchronous REST API call to the inSync Master service to inform the administrator and add the anomaly to an alert.
The REST API call is delivered via the active backup session. In this design, the inSync product depended on the UDA module to process the provided backup metadata and respond. Because backup is the core functionality and UDA is an add-on feature, this coupling created challenges.
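To make the idea of a per-device model concrete, here is a minimal sketch. It assumes a simple z-score baseline over the total number of changed files, which stands in for Druva's actual AI model (the article does not describe it). The class name `DeviceBaseline`, the threshold, and the minimum history length are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


class DeviceBaseline:
    """Per-device history of backup change counts (hypothetical helper)."""

    def __init__(self, threshold=3.0, min_history=5):
        self.threshold = threshold      # z-score above which a backup is flagged
        self.min_history = min_history  # backups needed before anything is flagged
        self.history = defaultdict(list)

    def observe(self, device_id, added, updated, deleted):
        """Record one backup; return True if it looks unusual for this device."""
        total = added + updated + deleted
        past = self.history[device_id]
        unusual = False
        if len(past) >= self.min_history:
            mu = mean(past)
            sigma = pstdev(past) or 1.0  # avoid dividing by zero on a flat history
            unusual = (total - mu) / sigma > self.threshold
        if not unusual:
            past.append(total)  # learn only from backups that look normal
        return unusual
```

Each device keeps its own history, so a burst of deletions that is routine on a build server can still be flagged on a laptop that normally changes a handful of files per backup.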
Developing a common layer: UDA
As Druva has multiple product lines (Phoenix, inSync, CloudRanger, and sfApex), engineers wanted to build a common service layer that would host all value-added features and cover all products.
So, Druva developed and added a new module on top of inSync and Phoenix. This was not a simple task as the flow of information needed to be handled correctly without complicating data management.
UDA previously worked on a push model, in which products pushed backup metadata, along with the data encryption key, via REST API calls. The UDA module then pushed anomaly results back to the products via REST API, creating a dependency on the product. This model produces a cyclic dependency between the inSync product and the UDA module, as seen in the first graphic above.
Because the inSync and UDA modules were coupled, upgrading one depended directly on the other, and engineers needed to push new versions of both modules simultaneously.
Deployment to the production environment is automated. The UDA build pipeline is configured in a Continuous Integration (CI) tool. As soon as the engineering team pushes a branch to the internal GitLab server, the integrated code passes through the automated Lint – Build – Test pipeline, at which point the new build is ready to be qualified by the QA team. After the QA team certifies the artifacts, they hand over the certified artifact index to the CloudOps team, which submits the artifact index to the Terraform suite. Finally, this suite updates the production service to the version given in the artifact index.
In addition, the design suffered from an issue called ‘module slowness propagation’: the UDA modules had to keep pace with the product and provision enough compute resources to avoid any slowness or lag in execution.
New architecture with Amazon SNS and SQS
The first step toward decoupling was to remove the API calls fired from inSync to the UDA module, turning the push framework into a pull framework. In the new pull framework, REST API calls are made by UDA to the inSync product to fetch data.
Pull framework design
With the new changes in place, products began maintaining backup metadata along with the tracking of backup jobs. This solved the issue of dependency direction but left open the question of which device’s information needed to be pulled for processing, and when. Engineers addressed this problem by adding asynchronous message passing with Amazon Simple Notification Service (SNS) and Amazon Simple Queue Service (SQS).
After adding SNS to the inSync product and SQS to the UDA module, backup-complete notifications can be pushed synchronously by the product and processed asynchronously by the UDA module. With this approach, inSync or Phoenix does not need to know which product or module is pulling backup metadata, which solves the problem of direct coupling between the product and the UDA module.
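The change in dependency direction can be sketched as follows. This is an illustration only: the interface `MetadataSource`, the stand-in `InMemoryProduct`, and the function `uda_process` are hypothetical names, and an in-process call replaces the real REST request so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class BackupMetadata:
    device_id: str
    added: int
    updated: int
    deleted: int


class MetadataSource(Protocol):
    """What UDA expects from any product (hypothetical interface)."""

    def get_backup_metadata(self, device_id: str, backup_id: str) -> BackupMetadata: ...


class InMemoryProduct:
    """Stand-in for a product such as inSync; it holds no reference to UDA."""

    def __init__(self):
        self._store = {}

    def record_backup(self, device_id, backup_id, added, updated, deleted):
        # The product keeps metadata alongside its own backup-job tracking.
        self._store[(device_id, backup_id)] = BackupMetadata(
            device_id, added, updated, deleted
        )

    def get_backup_metadata(self, device_id, backup_id):
        return self._store[(device_id, backup_id)]


def uda_process(source: MetadataSource, device_id: str, backup_id: str) -> int:
    """UDA pulls metadata when it is ready and returns the total change count."""
    meta = source.get_backup_metadata(device_id, backup_id)
    return meta.added + meta.updated + meta.deleted
```

Only the UDA side refers to the product here, so the product can be built and released without importing or calling anything from UDA.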
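A rough sketch of this publish and consume flow is shown below. It assumes the `sns` and `sqs` arguments are boto3 clients (for example `boto3.client("sns")` and `boto3.client("sqs")`), that the queue is subscribed to the topic without raw message delivery, and that the event fields (`product`, `device_id`, `backup_id`) are illustrative rather than Druva's actual schema. The function names are hypothetical.

```python
import json


def publish_backup_complete(sns, topic_arn, product, device_id, backup_id):
    """Product side: announce a finished backup without knowing who listens."""
    event = {"product": product, "device_id": device_id, "backup_id": backup_id}
    sns.publish(TopicArn=topic_arn, Message=json.dumps(event))


def drain_queue_once(sqs, queue_url, handle_event):
    """UDA side: receive one batch, process each event, then delete it."""
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    handled = 0
    for msg in resp.get("Messages", []):
        envelope = json.loads(msg["Body"])       # SNS wraps the payload in JSON
        event = json.loads(envelope["Message"])  # the product's original event
        handle_event(event)
        # Delete only after successful handling so failures are redelivered.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        handled += 1
    return handled
```

The publisher returns as soon as SNS accepts the message, while the consumer can run `drain_queue_once` in a loop at whatever pace its compute allows. A slow consumer therefore lengthens the queue, not the backup.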
Decoupling solved the issues of dependency, and the decoupled UDA module enabled engineers to independently develop, test, and release new versions of the modules in the production environment.
After adding SNS/SQS to UDA, backup-complete notifications are processed quickly, efficiently, and scalably. The UDA module begins by provisioning a minimum level of compute resources on Amazon ECS, scales up as the number of SNS events increases at peak times, and scales down during quiet periods to save compute resources.
Currently, the UDA service handles up to 5,000 backup events per minute, and can scale automatically, reaching up to 15,000 events per minute in peak load. Druva estimates these load capabilities will continue to increase in 2021.
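The arithmetic behind this kind of scaling can be sketched as below. In practice, ECS service scaling is configured through Application Auto Scaling policies, not hand-written code, so this function only illustrates the sizing logic. The per-task capacity of 500 events per minute and the task limits are assumed values, not figures from Druva.

```python
import math


def desired_task_count(events_per_minute, per_task_capacity=500,
                       min_tasks=2, max_tasks=40):
    """Return how many tasks are needed for the load, clamped to [min, max]."""
    needed = math.ceil(events_per_minute / per_task_capacity)
    return max(min_tasks, min(max_tasks, needed))
```

Under these assumed values, a load of 5,000 events per minute calls for 10 tasks and a peak of 15,000 calls for 30, while an idle service falls back to the minimum of 2.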
How end users interact with the products has not changed. However, by delivering the enhanced capabilities of UDA to both inSync and Phoenix, Druva has improved scalability, ease of data management, and simplicity of maintenance.