Data protection-as-a-service begins with a resilient architecture

Stephen Manley, CTO

The evolution from “products” to “services” has disrupted the IT infrastructure industry in a way that new hardware and software never could. Since a service model shifts responsibility for delivering business outcomes from the customer to the vendor, as-a-service companies are adopting different architectures. The biggest shift has been a focus on global resiliency. While resiliency was always an important requirement for standalone products, it becomes an obsession when you manage thousands of companies’ data.

Of the infrastructure services, data protection-as-a-service (DPaaS) must be the most resilient because it is a customer’s last line of defense. As data infrastructure becomes even more important, DPaaS must deliver service levels that exceed what customers can cobble together with legacy products. DPaaS cannot simply run traditional backup solutions at scale but must be re-architected for comprehensive resiliency across storage, compute, network, geography, and management. When you’re personally responsible for your customers’ data protection, everything changes.

Why does data protection resiliency matter so much?

When data became mission critical, so did data protection. Businesses no longer accept weekly backups with 2-3% failure rates and no self-service because every step of the application lifecycle depends on creating and recovering data quickly.

DevOps and application teams create backups before any significant change, so they can recover quickly if something goes wrong. They need an always-on, self-service backup and recovery service. Even for static applications, administrators need reliable, frequent backups so they can truncate database logs without overrunning their storage capacity.

Extreme environments like edge/IoT, high-performance computing, and unstructured data lakes require reliable backups to keep pace with the influx of data. They often have limited bandwidth, small backup windows, and high rates of data change, so if even a single backup fails, the change rate can make it impossible for the backup process to ever “catch up” again.

Since businesses depend on data protection to be always-on, even in the face of hardware, software, and human errors, customers should be able to:

  • Create new backups anytime
  • Restore backups anytime
  • Meet their recovery point objective (RPO) and recovery time objective (RTO)

Legacy data protection approaches are not resilient

Today, backup teams lack the people, technology, and processes to deliver a truly resilient solution.

Traditional vendors sell siloed products that the backup team must then synthesize into a solution. Each backup team tries to create an internal service, but lacks the technology and processes to build a true “as-a-service” solution.

First, each product measures only its own resiliency. For example, a deduplication appliance can protect the data it stores from corruption or loss, but either the appliance or the backup application could still suffer an outage. During that downtime, backups and restores will fail. The product reports success even as the backup team fails, because no single product delivers the complete solution.

Second, even integrated on-premises products are constrained by their physical limitations. Clustered backup appliances are resilient to node failures, but data center failures (e.g. natural disasters, ransomware attacks) can compromise their availability. Even in the absence of a disaster, the most scalable cluster eventually hits physical limits that prevent it from meeting customers’ requirements.

Third, resiliency requires globally integrated management to maintain and upgrade the infrastructure. Today, to fix bugs, support new workloads, or integrate new appliances, backup teams run cascading upgrades of backup clients, servers, and appliances. Those upgrades may disable services for hours at a time, and because sites are interdependent (e.g. to replicate backups), the disruption may be global. With siloed products and distributed backup teams, it is almost impossible to provide a globally resilient data protection service.

How to architect a resilient DPaaS

A resilient DPaaS architecture must deliver an integrated solution for storage, compute, networking, geography, and management.

Data protection resilience begins by building a reliable, scalable storage layer. Not only must the backup data be protected from underlying disk hardware errors, but it must also have a resilient, immutable metadata layer to prevent data loss from deduplication errors and ransomware attacks. Simply “building on reliable storage” like Amazon S3 is insufficient because the protection storage layer (e.g. deduplication, encryption, and catalog) must be as reliable as the underlying object storage. Metadata and data both matter.
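
As a rough illustration of what a resilient, verifiable metadata layer implies, the sketch below shows a content-addressed catalog entry (hypothetical names, not any particular product’s design): because a chunk’s identity is derived from its contents, corruption in either the data or the metadata can be detected on every read.

```python
import hashlib

def make_catalog_entry(chunk: bytes, object_key: str) -> dict:
    """Content-address a backup chunk so its catalog entry is verifiable."""
    digest = hashlib.sha256(chunk).hexdigest()
    return {
        "chunk_id": digest,        # identity comes from the content, not the location
        "object_key": object_key,  # where the chunk lives in object storage
        "size": len(chunk),
    }

def verify_chunk(entry: dict, data: bytes) -> bool:
    """Detect silent corruption in either the data or its metadata entry."""
    return hashlib.sha256(data).hexdigest() == entry["chunk_id"]
```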

Since so many workloads cannot afford extended backup downtime, the backup/recovery process must be both modular and restartable. When processing millions of concurrent backups, compute resources will fail in the middle of operations. A resilient service should auto-detect failed processes, restart each operation from where it left off, and never affect any other backup process.
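
A minimal sketch of what “modular and restartable” can mean in practice, assuming a hypothetical per-job checkpoint store and an idempotent upload function:

```python
def run_backup(job_id, chunks, upload, checkpoints):
    """Resume a backup job from its last recorded checkpoint.

    checkpoints: a durable key/value store; a plain dict stands in here.
    upload:      a function that durably writes one chunk (assumed idempotent).
    """
    start = checkpoints.get(job_id, 0)       # 0 if the job has never run
    for index in range(start, len(chunks)):
        upload(chunks[index])                # safe to repeat after a crash
        checkpoints[job_id] = index + 1      # persist progress for this job only
    # A replacement worker that picks up job_id re-reads the checkpoint and
    # continues from there; other jobs are unaffected because state is per-job.
```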

The network is the most common cause of backup failures, so a DPaaS architecture should minimize its exposure to outages by transmitting as little data as possible. Source-based deduplication reduces the amount of data sent over the network. Furthermore, a modular, restartable backup architecture minimizes not only data reprocessing but also network retransmission.
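
Source-based deduplication itself is simple to sketch (a simplified illustration, not Druva’s actual protocol): the source fingerprints each chunk and ships full data only for fingerprints the service has not already stored.

```python
import hashlib

def backup_with_source_dedup(chunks, known_fingerprints, send_chunk, send_reference):
    """Send full data only for chunks the service has never seen.

    known_fingerprints: fingerprints already stored by the service
                        (in a real system this is a remote metadata lookup).
    """
    for chunk in chunks:
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint in known_fingerprints:
            send_reference(fingerprint)        # a few bytes over the network
        else:
            send_chunk(fingerprint, chunk)     # full payload, first time only
            known_fingerprints.add(fingerprint)
```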

DPaaS must be resilient not only to system outages but also to data-center failures. When there is a massive outage, customers need data protection more than ever. Therefore, the backup processing must not be tied to one data center. Most importantly, the backup service must store data across multiple regions, so customers’ backups are safe regardless of what happens.
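
One way to picture “not tied to one data center” is a write path that lands the same backup object in buckets in several regions. The sketch below uses hypothetical bucket names and writes synchronously for clarity; a production service would more likely replicate asynchronously.

```python
import boto3

# Hypothetical bucket names, one per region the service protects data in.
REGION_BUCKETS = {
    "us-east-1": "backups-us-east-1",
    "eu-west-1": "backups-eu-west-1",
}

def store_across_regions(object_key: str, payload: bytes) -> None:
    """Write the same backup object into buckets in multiple regions."""
    for region, bucket in REGION_BUCKETS.items():
        s3 = boto3.client("s3", region_name=region)
        s3.put_object(Bucket=bucket, Key=object_key, Body=payload)
```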

Finally, DPaaS cannot incur downtime for system or software upgrades. Features, bug fixes, and upgrades must be transparent to customers. A resilient service does not restrict users with “upgrade outages” or “garbage collection windows.” Instead, the restartable, modular system enables rolling upgrades, so the only way users know something has changed is when they enjoy new functionality.

Druva’s resilient DPaaS architecture

Druva built its DPaaS with resiliency woven through the architecture, so that the service is both simple and scalable.

For backup resiliency, Druva’s cloud-native file system splits the metadata and data. Metadata is stored in DynamoDB while the data resides in object storage. The split ensures that backups are immutable, isolated from ransomware, and verifiably correct. Druva stores backups in 15 AWS regions, so customers can be confident that their data will be preserved in the face of any geographical threat.
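
To make the metadata/data split concrete, here is a minimal sketch of a write path with that shape, using hypothetical bucket and table names rather than Druva’s actual schema:

```python
import hashlib
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.client("dynamodb")

DATA_BUCKET = "dpaas-backup-data"         # hypothetical object storage bucket
METADATA_TABLE = "dpaas-backup-metadata"  # hypothetical DynamoDB table

def write_chunk(backup_id: str, chunk: bytes) -> None:
    """Store the chunk's bytes in object storage and its record in the metadata store."""
    chunk_id = hashlib.sha256(chunk).hexdigest()
    s3.put_object(Bucket=DATA_BUCKET, Key=chunk_id, Body=chunk)
    dynamodb.put_item(
        TableName=METADATA_TABLE,
        Item={
            "backup_id": {"S": backup_id},
            "chunk_id": {"S": chunk_id},
            "size": {"N": str(len(chunk))},
        },
    )
```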

To ensure that backup and restore processes succeed, both the data and processing layers are modular. Druva’s metadata and data layer scale independently in the cloud, so there is always capacity and performance available for customer backups, no matter how large or urgent. By using containers and serverless functions, Druva can start (or restart) backups at any time on any infrastructure. Those containers and functions also enable Druva to transparently upgrade the service, so customers automatically get the most up-to-date functionality. Druva’s global source-based deduplication uses the file system’s high-performance centralized metadata to transfer the minimum amount of data, so the service minimizes its dependency on the network.

Druva’s cloud platform runs over five million backups and restores a day because it was built to be resilient to errors and outages – not just at a component level, but as an integrated service. The storage, network, compute, and management work together globally to keep customers’ data safe, secure, and available.

Conclusion

Data protection has become a mission critical requirement for most organizations. Every part of the organization — developers, operations, compliance, and security — depends on data protection working 24×7. Unfortunately, it is nearly impossible for a team to build a resilient data protection service from legacy components. The architecture, deployment, and processes need to be designed from the ground up.

Data protection-as-a-service redefines the resiliency of cloud data protection. Instead of building “reliable” storage or backup appliance silos, it integrates storage, compute, networking, geography, and management. Since each of those elements can fail and cause outages, the system builds in global resiliency. Instead of measuring “storage uptime” or a “percentage of successful backups,” Druva users get what they need: protected data that is always on, whenever and wherever they need it. Druva redesigned every part of data protection. That’s what happens when you stop selling products and start taking responsibility for keeping your customers safe.

Learn more about the Druva Cloud Platform and how it can help improve your business resiliency while reducing IT cost and complexity.