
Pushing Limits, Uncovering Strengths: Stress Testing Redefined

Rajat Gandhi, Sr. Staff SDET

Stress testing is a process used to determine how software, a system, or any of its components behaves under extreme conditions. The goal of stress testing is to identify the limits of the software and to learn about potential failures that could occur under extreme load or heavy traffic. This process is commonly used in software engineering to test the stability, quality, and performance of software systems.

At Druva, we have created a stress-testing environment for validating backup load. This setup simulates production load and validates how our system behaves under heavy traffic.

At Druva, we run a containerized architecture: a set of Amazon ECS clusters, each with multiple Amazon EC2 instances. Amazon Web Services (AWS) offers various Amazon EC2 instance types, and adopting them for production requires thorough validation. With this stress setup and the metrics in place, we can qualify the suitability of different instance types for production environments. By analyzing performance metrics such as CPU usage, memory usage, and IOPS, we can assess the capability of each instance type to meet the requirements of our production workload. This validation process helps ensure that the chosen instance types can effectively handle the workload demands and provide the required performance and reliability.
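Per-instance metrics for such a comparison can be pulled programmatically. Below is a minimal sketch using boto3 and Amazon CloudWatch; the instance IDs and the candidate instance types are hypothetical:

import datetime

import boto3

cloudwatch = boto3.client('cloudwatch')

def mean_cpu(instance_id, hours=8):
    # Average CPUUtilization for one instance over the last `hours` hours
    end = datetime.datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=end - datetime.timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=['Average'],
    )
    points = stats['Datapoints']
    return sum(p['Average'] for p in points) / len(points) if points else None

# Compare candidate instance types running the same stress workload
for label, iid in [('m5a.4xlarge', 'i-0aaaaaaaa'), ('m6a.4xlarge', 'i-0bbbbbbbb')]:
    print(label, mean_cpu(iid))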

This stress test serves several purposes in achieving our goals. For example:

Application Changes

  • Validating backup orchestrator workflow-related code modifications.

  • Introducing a new application layer. 

  • Improving performance and load handling, for example, optimizing a DB query.

Infrastructure Changes

  • Upgrading container OS versions.

  • Upgrading third-party components like the HAProxy version.

  • Adopting a new Amazon EC2 instance type without impacting any existing architecture or workloads.

Packages and Library Upgrades

  • Validating SSL, cryptography, and Python upgrades.

Architecture Diagram Of Druva Backup Stress Setup

[Figure: Stress testing architecture]

Test Case

Analyzing the performance of Druva’s backup control plane while simultaneously spawning 200,000 backups

Context

Druva presently handles over 12 million backups per day across diverse workloads, and that volume continues to grow rapidly. To uphold the reliability of our releases, we conduct internal stress tests to ensure that there is no performance degradation. This stress test aims to evaluate the performance and reliability of the Druva backup control plane under extreme load conditions, specifically when handling simultaneous backup operations from 200,000 devices.

Challenges for the stress test

  1. Simulating 200,000 Devices: Creating actual devices for such a large-scale test is impractical due to high costs and logistical challenges.

  2. Concurrency: Executing concurrent backup jobs to load the control plane.

  3. Network Connection Limitations: A single machine with a single IP address can only establish a limited number of connections. The test must overcome this constraint to simulate high concurrency.

  4. Cost Management: Amazon DynamoDB calls during backups incur costs, necessitating careful monitoring and optimization to manage expenses effectively during the test.

  5. Monitoring Metrics: This involves monitoring system metrics, response times, error rates, and resource utilization during stress test runs, and creating reports to assess performance.

Implementation and Integration

To simulate a production-like environment, we created 200,000 dummy clients instead of actual devices. These clients handle authentication and session creation on data plane instances. Moreover, they mimic common actions performed by the actual agent, such as transmitting device details and retrieving the backup configuration.

Code Snippet:

# Connect and authenticate a dummy device against the control plane
_start_time1 = timeit.default_timer()
srv = gclient.GClient(fqdn, ip, port, localip)
srv.connect()
rc = srv.v15.auth.authenticate('2.2', 'unused', str(did), did, True, cid, 0.0)
_end_time1 = timeit.default_timer() - _start_time1

### Execute pre-backup RPCs (device details, backup configuration)
_end_time4 = pre_backupRPCs(srv, cid)

### Handover to the data plane node
_start_time2 = timeit.default_timer()
redirection, mtserver = srv.v18.misc.handover(BACKUP_TASK, None, 0, False)
_end_time2 = timeit.default_timer() - _start_time2

 

To address concurrency, we used asynchronous events and callback-based mechanisms, and developed a gevent-based client process to exert load on the control plane. This GClient communicates using the Druva RPC protocol and adheres to the following specifications:

  • Utilizes SSL for communication.

  • Does not verify the SSL certificate of the server to conserve memory on the client side.

  • Communicates with Druva’s RPC using HTTP headers.

  • Does not enforce sequencing of responses.

  • Does not handle packet chunking; it assumes that full requests and responses fit within a single send/receive operation.

  • Primarily employed to stress test connection limits of the Druva RPC server.

Code Snippet:

import socket
import ssl

# RPC and _Method are Druva-internal helpers; a hypothetical sketch of them
# follows this snippet

class GClient():
    def __init__(self, fqdn, ip, port, localip):
        self.address = (ip, port)
        self.localip = localip
        self.fqdn = fqdn
        self.sock = None
        self.ssl_sock = None
        self._rpc = None

    def connect(self):
        # Connect and wrap the socket in SSL
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock = s
        # Bind to a specific local IP; port 0 lets the OS pick an ephemeral port
        self.sock.bind((self.localip, 0))
        self.sock.connect(self.address)
        self.sslwrap()

    def sslwrap(self):
        # Assumes the socket is already connected
        context = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
        context.verify_mode = ssl.CERT_NONE  # Do not verify cert to reduce RAM consumption
        self.ssl_sock = context.wrap_socket(self.sock, server_hostname='inSyncBackup-' + self.fqdn)
        self._rpc = RPC(self.ssl_sock, self.address)

    def read(self):
        return self._rpc._read_packet()

    def shutdown(self):
        try:
            self._rpc = None
            if self.sock is not None:
                self.sock.shutdown(socket.SHUT_RDWR)
                self.sock.close()
            self.sock = None
        except Exception:
            # The socket may already be closed; ignore shutdown errors
            pass

    def _issue_request(self, methodname, args, kwargs):
        response = self._rpc.issue_request(methodname, args, kwargs)
        return response

    # To emulate ServerProxy-style (config.name) method calling
    def __getattr__(self, name):
        return _Method(self._issue_request, name)

    def __repr__(self):
        try:
            return 'Client object address=%s' % str(self.address)
        except Exception:
            return ''

    __str__ = __repr__
    __unicode__ = __repr__
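The RPC and _Method helpers used above are Druva-internal and not shown in the post. Under the constraints listed earlier (HTTP-style headers, no packet chunking, no response sequencing), a minimal hypothetical equivalent might look like the following; the wire format shown is an illustrative assumption, not Druva's actual protocol:

import json

class RPC:
    def __init__(self, sock, address):
        self.sock = sock  # already SSL-wrapped by GClient.sslwrap()
        self.address = address

    def issue_request(self, methodname, args, kwargs):
        body = json.dumps({'method': methodname, 'args': args, 'kwargs': kwargs})
        # HTTP-style framing; assumes the whole request fits in one send
        request = 'POST /rpc HTTP/1.1\r\nContent-Length: %d\r\n\r\n%s' % (len(body), body)
        self.sock.sendall(request.encode())
        return self._read_packet()

    def _read_packet(self):
        # Single recv: no chunking or response sequencing is handled
        return self.sock.recv(65536)

class _Method:
    # Chains attribute access so srv.v15.auth.authenticate(...) resolves to a
    # single dotted method name before the request is issued
    def __init__(self, send, name):
        self.__send = send
        self.__name = name

    def __getattr__(self, name):
        return _Method(self.__send, '%s.%s' % (self.__name, name))

    def __call__(self, *args, **kwargs):
        return self.__send(self.__name, args, kwargs)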

 

To overcome network connection constraints on a single Amazon EC2 instance and to handle more than 65,000 sockets at any given time, we made the following modifications:

  • Adjusted the port range to accommodate a wider range of connections and implemented TCP TIME_WAIT recycling to optimize network resource utilization. 

  • Additional elastic IP addresses were assigned to the Amazon EC2 instance to manage the workload effectively.

  • To avoid exceeding the socket limit of 65,000 per IP address, each process was configured to utilize a single IP. 

This strategy enabled us to simulate increased loads and effectively stress-test the system; a sketch of the per-process IP assignment follows the snippet below.

Code Snippet:

# Widen the ephemeral port range and enable TCP TIME_WAIT recycling, if not
# already set in sysctl.conf (note: tcp_tw_recycle was removed in Linux 4.12;
# on newer kernels use net.ipv4.tcp_tw_reuse instead)
import subprocess
subprocess.call(['sudo', 'sysctl', '-w', 'net.ipv4.ip_local_port_range=1024 65535'])
subprocess.call(['sudo', 'sysctl', '-w', 'net.ipv4.tcp_tw_recycle=1'])
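To spread the load across IPs, each worker process is pinned to one local IP and handles its own slice of the device range. Here is a minimal sketch of that assignment; the IP addresses and the run_backups driver are hypothetical:

from multiprocessing import Process

LOCAL_IPS = ['10.0.1.10', '10.0.1.11', '10.0.1.12']  # secondary/elastic IPs
DEVICES_PER_PROCESS = 65000

def worker(pnum, localip, start, end):
    # Every socket opened by this process binds to localip
    # (see GClient.connect(), which calls sock.bind((self.localip, 0)))
    run_backups(pnum, localip, start, end)  # assumed driver around ProcessBackup

procs = []
for pnum, ip in enumerate(LOCAL_IPS):
    start = pnum * DEVICES_PER_PROCESS
    p = Process(target=worker, args=(pnum, ip, start, start + DEVICES_PER_PROCESS))
    p.start()
    procs.append(p)
for p in procs:
    p.join()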

 

The test script accepts the cloud FQDN, the device ranges (start and end device IDs with customer ID and port), and an optional cap on parallel backups.

Code Snippet:

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Stress Test for Backup')
    parser.add_argument('cloud_fqdn', help='cloud name')
    parser.add_argument('device_info',
                        help='Device configuration in the format - '
                             '"start,end,cid1,port;start,end,cid2,port"')
    parser.add_argument('--max_parallel_backups', default=50, type=int,
                        help='Maximum parallel backups allowed, to ensure the task ends')
    args = parser.parse_args()
    main(args.device_info, args.max_parallel_backups, args.cloud_fqdn)

 

As part of a backup, the data plane (Spot instances) generates metadata that is saved in Amazon DynamoDB. These Amazon DynamoDB calls incur cloud costs, so to avoid them during test execution, we mocked the data plane API calls.

Code Snippet:

# Mock share (backup folder) configuration sent through the mocked data plane API
shareName = 'backuptest-folder'
share = {
    'excludeFiles': u'abc.txt',
    'excludeFolders': u'Temp;Temporary Internet Files',
    'platform': 'linux',
    'folderToBackup': u'/home/users/Downloads',
    'guid': u'ABJSJ121212',
    'excludeExtention': u'*.exe'  # key name matches the internal API field
}

# Time the ShareBegin metadata call
_st_shareBegin = timeit.default_timer()
srv.inSync0114.MD.ShareBegin(share)
_et_shareBegin = timeit.default_timer() - _st_shareBegin

 

As part of the test, we established a queue of 200,000 devices slated for backup. Backup operations were sustained for 8 hours, during which the queue iteratively processed all the backup devices, ensuring continuous backup activity.

Code Snippet:

def __start_backups(self):
    gs = []
    log.debug("Started new iteration for backup for pnum %s", self.pnum)
    while not self.backupQueue.empty():
        try:
            did = self.backupQueue.get()
            gevent.sleep(0)  # yield so other greenlets can run
            g = self.pool.spawn(sync_file, did, self.cid, self.localip,
                                self.port, self.fqdn, ProcessBackup.stats)
            gs.append(g)
        except Exception:
            log.exception("Error while spawning")
        else:
            self.__done += 1
    else:
        # Queue drained: wait for all in-flight backup greenlets to finish
        gevent.joinall(gs)

def __populate_queue(self):
    log.debug("Populating backup queue for pnum %s", self.pnum)
    for did in range(self.start, self.end):
        try:
            self.backupQueue.put(did, block=True, timeout=1)
        except Exception:
            log.exception("Error while populating queue for process %s", self.pnum)

 

Graphs and Stress Metrics 

We ran tests on a dedicated control plane and a dedicated Amazon RDS instance to ensure that nothing else skewed the benchmark results.

Druva runs a Telegraf service on its Amazon EC2 instances that collects metrics and transmits them to InfluxDB. We leveraged the data gathered by this service to generate performance graphs, providing detailed insights from our stress test runs.
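Querying those metrics back out of InfluxDB for graphing is straightforward. Below is a minimal sketch using the influxdb Python client; the host, database name, and the Telegraf "cpu" measurement are assumptions about the setup:

from influxdb import InfluxDBClient

client = InfluxDBClient(host='influxdb.internal', port=8086, database='telegraf')
result = client.query(
    "SELECT mean(usage_user) FROM cpu "
    "WHERE time > now() - 8h GROUP BY time(5m)"
)
for point in result.get_points():
    print(point['time'], point['mean'])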

[Graph: Stress testing performance]

 

Metrics

 

Test Type                        Baseline
Avg_Bkp_per_Min                  ~5K
Control Plane                    m5a.4xlarge (16 core, 64 GB)
Control Plane CPU mean core      14
Control Plane Mem GB Max         12
Amazon RDS DB                    m5.xlarge
Amazon RDS DB CPU mean Core      1
Amazon RDS DB Mem GB             13.1
Amazon RDS DB Conn Min, Max      300, 420
Amazon RDS DB Write IOPS         1660
Device Count                     200,000
Max Data Plane Nodes Spawned     15

Benefits

We calculated several per-phase timing metrics during the test (a sketch of aggregating them follows this list).

  1. Connect time: The time taken to establish a connection with the Druva control plane.

  2. PreBackup Time: The time taken for session creation.

  3. Handover Time: Time taken to connect to the data plane node.

  4. File Backup Time: Time taken for syncing files.
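A minimal sketch of aggregating those per-phase timings into summary statistics; the sample lists below are illustrative placeholders for the values recorded by the timers shown earlier:

# Illustrative placeholders; the real harness fills these from the recorded
# _end_time* values per device
connect_times = [0.12, 0.15, 0.30]
prebackup_times = [0.40, 0.55, 0.90]
handover_times = [0.20, 0.25, 0.60]
file_backup_times = [5.1, 6.4, 12.8]

def summarize(samples):
    # Return avg/p95/max for one list of phase timings (seconds)
    ordered = sorted(samples)
    return {
        'avg': sum(ordered) / len(ordered),
        'p95': ordered[int(0.95 * (len(ordered) - 1))],
        'max': ordered[-1],
    }

for name, samples in [('connect', connect_times), ('prebackup', prebackup_times),
                      ('handover', handover_times), ('file_backup', file_backup_times)]:
    print(name, summarize(samples))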

Additionally, we monitored the following metrics:

  1. Average backups per minute

  2. Amazon RDS (Relational Database Service) IOPS (input/output operations per second)

  3. Amazon RDS CPU usage

  4. Amazon RDS connections

  5. Control plane CPU usage

  6. Control plane memory usage

Monitoring the above-mentioned metrics allowed us to identify potential bottlenecks in the product and infrastructure. For example, the following can be analyzed:

  • An elevation in connect time may signal potential issues with HAProxy.

  • An uptick in PreBackup time might imply potential issues with the control plane.

  • Monitoring RDS statistics is critical, especially considering the IO-intensive nature of backups. 

This practice enables us to gauge the effect of control plane changes on Amazon RDS performance.

By closely monitoring these metrics, we can proactively detect and address performance issues, ensuring the reliability and efficiency of our system.
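One way to act on these metrics is to compare each run against the recorded baseline automatically. A hypothetical sketch, using the baseline numbers from the table above and an illustrative 20% tolerance:

BASELINE = {'avg_bkp_per_min': 5000, 'cp_cpu_mean_core': 14, 'rds_write_iops': 1660}

def find_regressions(current, baseline=BASELINE, tolerance=0.20):
    regressions = {}
    for metric, base in baseline.items():
        value = current.get(metric)
        if value is None:
            continue
        # For throughput, lower is worse; for resource usage, higher is worse
        if metric == 'avg_bkp_per_min':
            worse = value < base * (1 - tolerance)
        else:
            worse = value > base * (1 + tolerance)
        if worse:
            regressions[metric] = (base, value)
    return regressions

# Example: throughput dropped and write IOPS spiked in this run
print(find_regressions({'avg_bkp_per_min': 3800, 'rds_write_iops': 2100}))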

Conclusion

With this stress testing on the Druva backup control plane:

  • We stressed the system with backups from 200,000 devices.

  • The test drove CPU utilization above 90% for certain control plane services and memory utilization to roughly 80% on the backup Amazon RDS, while spawning 10 or more data plane nodes.

Under Pressure, We Measure Performance