
Pushing Limits, Uncovering Strengths: Stress Testing Redefined

Rajat Gandhi, Sr. Staff SDET

Stress testing is a process used to determine how software, a system, or any of its components behaves under extreme conditions. The goal of stress testing is to identify the limits of the software and to learn about potential failures that could occur under extreme load or heavy traffic. This process is commonly used in software engineering to test the stability, quality, and performance of software systems.

At Druva, we have created a stress-testing environment for validating backup load. This setup simulates production load and validates how our system behaves under heavy traffic.

At Druva, we run a containerized architecture: a set of Amazon ECS clusters, each with multiple Amazon EC2 instances. Amazon Web Services (AWS) offers various Amazon EC2 instance types, and adopting them for production requires thorough validation. With this stress setup and the metrics in place, we can qualify the suitability of different instance types for production environments. By analyzing performance metrics such as CPU usage, memory usage, and IOPS, we can assess the capability of each instance type to meet the requirements of our production workload. This validation process helps ensure that the chosen instance types can effectively handle the workload demands and provide the required performance and reliability.
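Per-instance metrics for such a comparison can be pulled programmatically. Below is a minimal sketch using boto3 and Amazon CloudWatch; the instance IDs and the candidate instance types are hypothetical:

import datetime

import boto3

cloudwatch = boto3.client('cloudwatch')

def mean_cpu(instance_id, hours=8):
    # Average CPUUtilization for one instance over the last `hours` hours
    end = datetime.datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=end - datetime.timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=['Average'],
    )
    points = stats['Datapoints']
    return sum(p['Average'] for p in points) / len(points) if points else None

# Compare candidate instance types running the same stress workload
for label, iid in [('m5a.4xlarge', 'i-0aaaaaaaa'), ('m6a.4xlarge', 'i-0bbbbbbbb')]:
    print(label, mean_cpu(iid))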

This stress test serves several purposes in achieving our goals. For example:

Application Changes

  • Validating backup orchestrator workflow-related code modifications.

  • Introducing a new application layer. 

  • Improving performance and load handling, for example, optimizing a DB query.

Infrastructure Changes

  • Upgrading container OS versions.

  • Upgrading third-party components like the HAProxy version.

  • Adopting a new Amazon EC2 instance type without impacting any existing architecture or workloads.

Packages and Library Upgrades

  • Validating SSL, cryptography, and Python upgrades.

Architecture Diagram Of Druva Backup Stress Setup

[Figure: Stress testing architecture]

Test Case

Analyzing the performance of Druva’s backup control plane while simultaneously spawning 200,000 backups

Context

Druva presently handles over 12 million backups per day across diverse workloads, and that volume continues to grow rapidly. To uphold the reliability of our releases, we conduct internal stress tests to ensure that there is no performance degradation. This stress test aims to evaluate the performance and reliability of the Druva backup control plane under extreme load conditions, specifically when handling simultaneous backup operations from 200,000 devices.

Challenges for the stress test

  1. Simulating 200,000 Devices: Creating actual devices for such a large-scale test is impractical due to high costs and logistical challenges.

  2. Concurrency: Executing concurrent backup jobs to load the control plane.

  3. Network Connection Limitations: A single machine with a single IP address can only establish a limited number of connections. The test must overcome this constraint to simulate high concurrency.

  4. Cost Management: Amazon DynamoDB calls during backups incur costs, necessitating careful monitoring and optimization to manage expenses effectively during the test.

  5. Monitoring Metrics: This involves monitoring system metrics, response times, error rates, and resource utilization during stress test runs, and creating reports to assess performance.

Implementation and Integration

To simulate a production-like environment, we created 200,000 dummy clients instead of actual devices. These clients handle authentication and session creation on data plane instances. Moreover, they mimic common actions performed by the actual agent, such as transmitting device details and retrieving the backup configuration.

Code Snippet:

# Connect and authenticate a dummy device against the control plane
_start_time1 = timeit.default_timer()
srv = gclient.GClient(fqdn, ip, port, localip)
srv.connect()
rc = srv.v15.auth.authenticate('2.2', 'unused', str(did), did, True, cid, 0.0)
_end_time1 = timeit.default_timer() - _start_time1

### Execute pre-backup RPCs (device details, backup configuration)
_end_time4 = pre_backupRPCs(srv, cid)

### Handover to the data plane node
_start_time2 = timeit.default_timer()
redirection, mtserver = srv.v18.misc.handover(BACKUP_TASK, None, 0, False)
_end_time2 = timeit.default_timer() - _start_time2

 

To address concurrency, we used asynchronous events and callback-based mechanisms, and developed a gevent-based client process to exert load on the control plane. This GClient communicates using the Druva RPC protocol and adheres to the following specifications:

  • Utilizes SSL for communication.

  • Does not verify the SSL certificate of the server to conserve memory on the client side.

  • Communicates with Druva’s RPC using HTTP headers.

  • Does not enforce sequencing of responses.

  • Does not handle packet chunking; it assumes that full requests and responses fit within a single send/receive operation.

  • Primarily employed to stress test connection limits of the Druva RPC server.

Code Snippet:

import socket
import ssl

# RPC and _Method are Druva-internal helpers; a hypothetical sketch of them
# follows this snippet

class GClient():
    def __init__(self, fqdn, ip, port, localip):
        self.address = (ip, port)
        self.localip = localip
        self.fqdn = fqdn
        self.sock = None
        self.ssl_sock = None
        self._rpc = None

    def connect(self):
        # Connect and wrap the socket in SSL
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock = s
        # Bind to a specific local IP; port 0 lets the OS pick an ephemeral port
        self.sock.bind((self.localip, 0))
        self.sock.connect(self.address)
        self.sslwrap()

    def sslwrap(self):
        # Assumes the socket is already connected
        context = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
        context.verify_mode = ssl.CERT_NONE  # Do not verify cert to reduce RAM consumption
        self.ssl_sock = context.wrap_socket(self.sock, server_hostname='inSyncBackup-' + self.fqdn)
        self._rpc = RPC(self.ssl_sock, self.address)

    def read(self):
        return self._rpc._read_packet()

    def shutdown(self):
        try:
            self._rpc = None
            if self.sock is not None:
                self.sock.shutdown(socket.SHUT_RDWR)
                self.sock.close()
            self.sock = None
        except Exception:
            # The socket may already be closed; ignore shutdown errors
            pass

    def _issue_request(self, methodname, args, kwargs):
        response = self._rpc.issue_request(methodname, args, kwargs)
        return response

    # To emulate ServerProxy-style (config.name) method calling
    def __getattr__(self, name):
        return _Method(self._issue_request, name)

    def __repr__(self):
        try:
            return 'Client object address=%s' % str(self.address)
        except Exception:
            return ''

    __str__ = __repr__
    __unicode__ = __repr__
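The RPC and _Method helpers used above are Druva-internal and not shown in the post. Under the constraints listed earlier (HTTP-style headers, no packet chunking, no response sequencing), a minimal hypothetical equivalent might look like the following; the wire format shown is an illustrative assumption, not Druva's actual protocol:

import json

class RPC:
    def __init__(self, sock, address):
        self.sock = sock  # already SSL-wrapped by GClient.sslwrap()
        self.address = address

    def issue_request(self, methodname, args, kwargs):
        body = json.dumps({'method': methodname, 'args': args, 'kwargs': kwargs})
        # HTTP-style framing; assumes the whole request fits in one send
        request = 'POST /rpc HTTP/1.1\r\nContent-Length: %d\r\n\r\n%s' % (len(body), body)
        self.sock.sendall(request.encode())
        return self._read_packet()

    def _read_packet(self):
        # Single recv: no chunking or response sequencing is handled
        return self.sock.recv(65536)

class _Method:
    # Chains attribute access so srv.v15.auth.authenticate(...) resolves to a
    # single dotted method name before the request is issued
    def __init__(self, send, name):
        self.__send = send
        self.__name = name

    def __getattr__(self, name):
        return _Method(self.__send, '%s.%s' % (self.__name, name))

    def __call__(self, *args, **kwargs):
        return self.__send(self.__name, args, kwargs)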

 

To overcome network connection constraints on a single Amazon EC2 instance and to handle more than 65,000 sockets at any given time, we made the following modifications:

  • Adjusted the port range to accommodate a wider range of connections and implemented TCP TIME_WAIT recycling to optimize network resource utilization. 

  • Additional elastic IP addresses were assigned to the Amazon EC2 instance to manage the workload effectively.

  • To avoid exceeding the socket limit of 65,000 per IP address, each process was configured to utilize a single IP. 

This strategy enabled us to simulate increased loads and effectively stress-test the system; a sketch of the per-process IP assignment follows the snippet below.

Code Snippet:

# Widen the ephemeral port range and enable TCP TIME_WAIT recycling, if not
# already set in sysctl.conf (note: tcp_tw_recycle was removed in Linux 4.12;
# on newer kernels use net.ipv4.tcp_tw_reuse instead)
import subprocess
subprocess.call(['sudo', 'sysctl', '-w', 'net.ipv4.ip_local_port_range=1024 65535'])
subprocess.call(['sudo', 'sysctl', '-w', 'net.ipv4.tcp_tw_recycle=1'])
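To spread the load across IPs, each worker process is pinned to one local IP and handles its own slice of the device range. Here is a minimal sketch of that assignment; the IP addresses and the run_backups driver are hypothetical:

from multiprocessing import Process

LOCAL_IPS = ['10.0.1.10', '10.0.1.11', '10.0.1.12']  # secondary/elastic IPs
DEVICES_PER_PROCESS = 65000

def worker(pnum, localip, start, end):
    # Every socket opened by this process binds to localip
    # (see GClient.connect(), which calls sock.bind((self.localip, 0)))
    run_backups(pnum, localip, start, end)  # assumed driver around ProcessBackup

procs = []
for pnum, ip in enumerate(LOCAL_IPS):
    start = pnum * DEVICES_PER_PROCESS
    p = Process(target=worker, args=(pnum, ip, start, start + DEVICES_PER_PROCESS))
    p.start()
    procs.append(p)
for p in procs:
    p.join()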

 

The test script accepts the cloud FQDN, the device ranges (start and end device IDs with customer ID and port), and an optional cap on parallel backups.

Code Snippet:

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Stress Test for Backup')
    parser.add_argument('cloud_fqdn', help='cloud name')
    parser.add_argument('device_info',
                        help='Device configuration in the format - '
                             '"start,end,cid1,port;start,end,cid2,port"')
    parser.add_argument('--max_parallel_backups', default=50, type=int,
                        help='Maximum parallel backups allowed, to ensure the task ends')
    args = parser.parse_args()
    main(args.device_info, args.max_parallel_backups, args.cloud_fqdn)

 

As part of a backup, the data plane (Spot instances) generates metadata that is saved in Amazon DynamoDB. These Amazon DynamoDB calls incur cloud costs, so to avoid them during test execution, we mocked the data plane API calls.

Code Snippet:

# Mock share (backup folder) configuration sent through the mocked data plane API
shareName = 'backuptest-folder'
share = {
    'excludeFiles': u'abc.txt',
    'excludeFolders': u'Temp;Temporary Internet Files',
    'platform': 'linux',
    'folderToBackup': u'/home/users/Downloads',
    'guid': u'ABJSJ121212',
    'excludeExtention': u'*.exe'  # key name matches the internal API field
}

# Time the ShareBegin metadata call
_st_shareBegin = timeit.default_timer()
srv.inSync0114.MD.ShareBegin(share)
_et_shareBegin = timeit.default_timer() - _st_shareBegin

 

As part of the test, we established a queue of 200,000 devices slated for backup. Backup operations were sustained for 8 hours, during which the queue iteratively processed all the backup devices, ensuring continuous backup activity.

Code Snippet:

def __start_backups(self):
    gs = []
    log.debug("Started new iteration for backup for pnum %s", self.pnum)
    while not self.backupQueue.empty():
        try:
            did = self.backupQueue.get()
            gevent.sleep(0)  # yield so other greenlets can run
            g = self.pool.spawn(sync_file, did, self.cid, self.localip,
                                self.port, self.fqdn, ProcessBackup.stats)
            gs.append(g)
        except Exception:
            log.exception("Error while spawning")
        else:
            self.__done += 1
    else:
        # Queue drained: wait for all in-flight backup greenlets to finish
        gevent.joinall(gs)

def __populate_queue(self):
    log.debug("Populating backup queue for pnum %s", self.pnum)
    for did in range(self.start, self.end):
        try:
            self.backupQueue.put(did, block=True, timeout=1)
        except Exception:
            log.exception("Error while populating queue for process %s", self.pnum)

 

Graphs and Stress Metrics 

We ran tests on a dedicated control plane and a dedicated Amazon RDS instance to ensure that nothing else skewed the benchmark results.

Druva runs a Telegraf service on its Amazon EC2 instances that collects metrics and transmits them to InfluxDB. We leveraged the data gathered by this service to generate performance graphs, providing detailed insights from our stress test runs.
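Querying those metrics back out of InfluxDB for graphing is straightforward. Below is a minimal sketch using the influxdb Python client; the host, database name, and the Telegraf "cpu" measurement are assumptions about the setup:

from influxdb import InfluxDBClient

client = InfluxDBClient(host='influxdb.internal', port=8086, database='telegraf')
result = client.query(
    "SELECT mean(usage_user) FROM cpu "
    "WHERE time > now() - 8h GROUP BY time(5m)"
)
for point in result.get_points():
    print(point['time'], point['mean'])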

[Graph: Stress testing performance]

 

Metrics

 

Test Type                        Baseline
Avg_Bkp_per_Min                  ~5K
Control Plane                    m5a.4xlarge (16 core, 64 GB)
Control Plane CPU mean core      14
Control Plane Mem GB Max         12
Amazon RDS DB                    m5.xlarge
Amazon RDS DB CPU mean Core      1
Amazon RDS DB Mem GB             13.1
Amazon RDS DB Conn Min, Max      300, 420
Amazon RDS DB Write IOPS         1660
Device Count                     200,000
Max Data Plane Nodes Spawned     15

Benefits

We calculated several per-phase timing metrics during the test (a sketch of aggregating them follows this list).

  1. Connect time: The time taken to establish a connection with the Druva control plane.

  2. PreBackup Time: The time taken for session creation.

  3. Handover Time: Time taken to connect to the data plane node.

  4. File Backup Time: Time taken for syncing files.
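A minimal sketch of aggregating those per-phase timings into summary statistics; the sample lists below are illustrative placeholders for the values recorded by the timers shown earlier:

# Illustrative placeholders; the real harness fills these from the recorded
# _end_time* values per device
connect_times = [0.12, 0.15, 0.30]
prebackup_times = [0.40, 0.55, 0.90]
handover_times = [0.20, 0.25, 0.60]
file_backup_times = [5.1, 6.4, 12.8]

def summarize(samples):
    # Return avg/p95/max for one list of phase timings (seconds)
    ordered = sorted(samples)
    return {
        'avg': sum(ordered) / len(ordered),
        'p95': ordered[int(0.95 * (len(ordered) - 1))],
        'max': ordered[-1],
    }

for name, samples in [('connect', connect_times), ('prebackup', prebackup_times),
                      ('handover', handover_times), ('file_backup', file_backup_times)]:
    print(name, summarize(samples))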

Additionally, we monitored the following metrics:

  1. Average backups per minute

  2. Amazon RDS (Relational Database Service) IOPS (input/output operations per second)

  3. Amazon RDS CPU usage

  4. Amazon RDS connections

  5. Control plane CPU usage

  6. Control plane memory usage

Monitoring the above-mentioned metrics allowed us to identify potential bottlenecks in the product and infrastructure. For example, the following can be analyzed:

  • An elevation in connect time may signal potential issues with HAProxy.

  • An uptick in PreBackup time might imply potential issues with the control plane.

  • Monitoring RDS statistics is critical, especially considering the IO-intensive nature of backups. 

This practice enables us to gauge the effect of control plane changes on Amazon RDS performance.

By closely monitoring these metrics, we can proactively detect and address performance issues, ensuring the reliability and efficiency of our system.
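One way to act on these metrics is to compare each run against the recorded baseline automatically. A hypothetical sketch, using the baseline numbers from the table above and an illustrative 20% tolerance:

BASELINE = {'avg_bkp_per_min': 5000, 'cp_cpu_mean_core': 14, 'rds_write_iops': 1660}

def find_regressions(current, baseline=BASELINE, tolerance=0.20):
    regressions = {}
    for metric, base in baseline.items():
        value = current.get(metric)
        if value is None:
            continue
        # For throughput, lower is worse; for resource usage, higher is worse
        if metric == 'avg_bkp_per_min':
            worse = value < base * (1 - tolerance)
        else:
            worse = value > base * (1 + tolerance)
        if worse:
            regressions[metric] = (base, value)
    return regressions

# Example: throughput dropped and write IOPS spiked in this run
print(find_regressions({'avg_bkp_per_min': 3800, 'rds_write_iops': 2100}))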

Conclusion

With this stress testing on the Druva backup control plane:

  • We stressed the system with backups from 200,000 devices.

  • The test drove CPU utilization above 90% for certain control plane services and memory utilization to roughly 80% on the backup Amazon RDS, while spawning 10 or more data plane nodes.

Under Pressure, We Measure Performance