Understanding Self-Healing Storage

The primary objective of a data storage system is to persist data permanently (or at least until it is explicitly destroyed). But hardware is imperfect: disks fail and servers crash, and those failures lead to inconsistencies in the file-system metadata. The traditional ways of dealing with such errors require the system to go offline, which is not a pleasant scenario.

Implementing storage systems involves addressing some peculiar challenges. Components can fail at many different points along the data access path, and these failures may be temporary or permanent.

Traditionally, each operating system has its own flavor of storage software architecture, including file systems, to address this challenge. A file system ensures that a user’s data is accessible under a user-defined namespace on the underlying disk(s). UFS, ext2, NTFS, XFS, and VxFS are some examples of file systems.

Among their other features, file systems maintain metadata: a structured way for the OS (or other system-level software) to manage additional information about the file system. For example, a file is typically represented internally by an inode, which holds the list of disk blocks that contain the file’s actual data. Inodes also hold other attributes, such as the file’s modification time-stamp, its size, and its access control lists. Similarly, a folder’s inode contains a list of its children and their inode numbers.

A file system also has a free block map to manage disk blocks.
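To make the idea concrete, here is a minimal sketch of those structures in Python. The names and fields are illustrative only; real file systems keep fixed-size on-disk records, not Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical, simplified metadata structures for illustration only.

@dataclass
class Inode:
    inode_number: int
    size: int = 0
    mtime: float = 0.0                                     # modification time-stamp
    acl: List[str] = field(default_factory=list)           # access control entries
    data_blocks: List[int] = field(default_factory=list)   # disk blocks holding file data

@dataclass
class DirectoryInode(Inode):
    # A folder's inode: maps child names to their inode numbers
    children: Dict[str, int] = field(default_factory=dict)

# Free block map: which disk blocks are still unallocated
free_block_map = set(range(1024))
```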


A single file-system operation, such as creating a file, involves multiple metadata update operations; in the case illustrated below, that includes creating a new inode and its directory entry in the parent folder.

[Figure: metadata updates involved in creating a new file]

As in any multi-step operation, failures can occur at any point: the process may crash, the disk may lose an update, and so on. These failures can manifest as the operating system not showing the created file, an inability to create a file with the same name, or a resource leak.
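A rough sketch of such a non-atomic create (hypothetical names, not any real file system’s code) shows the window where a crash leaves things half-done:

```python
import time

# Hypothetical in-memory stand-ins for on-disk metadata
inode_table = {}         # inode number -> inode attributes
directory_entries = {}   # (parent inode number, name) -> child inode number
next_inode_number = 100

def create_file(parent_inode: int, name: str) -> int:
    """Naive, non-atomic file create: two separate metadata updates."""
    global next_inode_number
    ino = next_inode_number
    next_inode_number += 1

    # Update 1: allocate and persist the new inode
    inode_table[ino] = {"size": 0, "mtime": time.time(), "blocks": []}

    # <-- a crash here leaves an inode with no directory entry:
    #     the file never shows up, yet its resources are leaked

    # Update 2: add the directory entry in the parent folder
    directory_entries[(parent_inode, name)] = ino
    return ino
```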

There are two standard ways to deal with such inconsistencies:

  1. Use transactional mechanisms to update multiple metadata objects. These transactions typically adhere to ACID guarantees (Atomicity, Consistency, Isolation, Durability), so general failures arising from a process or system crash are handled cleanly: the entire transaction is rolled back or rolled forward using redo/undo transaction logging. (A minimal sketch follows this list.)
  2. Use ordered updates. Here, multiple updates are ordered in such a way that, at any point, a partial set of updates is safe from an overall system-behavior perspective. Periodically, however, the incomplete or partial updates need to be cleaned up to reclaim the space they hold. (For more on ordered updates, see the seminal paper, Soft Updates, in ACM Transactions on Computer Systems, Vol. 18, No. 2, May 2000.)
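As a minimal illustration of approach 1, here is a toy redo-logging sketch. It is purely hypothetical and in-memory; a real implementation would write its log records to persistent storage before applying the updates.

```python
# Toy redo-logging sketch: every update is logged first, then applied.
# On recovery, committed transactions are rolled forward; transactions
# without a commit record are simply discarded (rolled back).

metadata = {}    # the "on-disk" metadata store
redo_log = []    # append-only log of (txn_id, key, new_value) records

def txn_update(txn_id, updates):
    for key, value in updates.items():
        redo_log.append((txn_id, key, value))    # log before applying
    redo_log.append((txn_id, "COMMIT", None))    # commit record
    for key, value in updates.items():           # apply after commit
        metadata[key] = value

def recover():
    """Roll forward committed transactions; ignore incomplete ones."""
    committed = {t for (t, k, _) in redo_log if k == "COMMIT"}
    for txn_id, key, value in redo_log:
        if key != "COMMIT" and txn_id in committed:
            metadata[key] = value

# Example: create a file as one transaction over two metadata objects
txn_update("txn-1", {"inode:101": {"size": 0}, "dirent:/docs/a.txt": 101})
```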

For example, to illustrate a “safe” order of updates, adding a new data block to a file in Druva Storage proceeds as follows (a rough sketch follows the list):

  • The block is stored persistently,
  • Dedupe indexes are created, and
  • A block map of the file is updated.
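Here is that ordering sketched with hypothetical names (this is not Druva’s actual implementation):

```python
def add_block_to_file(file_id, block_data, block_store, dedupe_index, block_maps):
    """Ordered updates: every prefix of this sequence is 'safe'.

    Hypothetical sketch -- a crash after any step leaves at worst an
    unreferenced block or index entry (a space leak to clean up later),
    never a block map pointing at data that doesn't exist.
    """
    block_id = hash(block_data)               # stand-in for a content hash

    # 1. Persist the block itself first
    block_store[block_id] = block_data

    # 2. Then create the dedupe index entry that refers to it
    dedupe_index[block_id] = {"refcount": 1}

    # 3. Only now update the file's block map to reference the block
    block_maps.setdefault(file_id, []).append(block_id)
```

Reversing the order would risk a block map that points at data a crash never persisted, which is exactly the kind of inconsistency the ordering avoids.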

Traditionally, file systems have relied on offline utilities such as fsck (file system consistency check) or chkdsk to fix such metadata inconsistencies and restore sanity. Because these tools are mostly offline, they imply downtime, or an outage, for the file system. Depending on the circumstances, that outage can be extended, which frustrates both users and IT.

Naturally, we felt we needed to create a better answer.

Druva Storage

Druva products deploy our own file system, the Druva cloud file system. Its key features are:

  • Source-side data de-duplication (dedupe)
  • Continuous data protection
  • Compressed and encrypted data storage in transit and at rest
  • Policy-based data retention

Druva cloud file system addresses several key concerns:

Durability

The Druva cloud file system, hosted on the Amazon public cloud, uses AWS S3 to store data. Amazon S3 is designed to provide 99.999999999% durability of objects and to sustain the concurrent loss of data in two facilities.

Druva cloud file system also uses the AWS DynamoDB service to manage its file-system metadata. Amazon DynamoDB synchronously replicates data across three facilities within an AWS Region.

When inSync is hosted on-premise, Druva cloud file system uses the local file system to store data and an embedded BerkeleyDB database engine to manage metadata. For on-premise data and database reliability, we rely on underlying disk subsystem reliability mechanisms (that is, RAID). It’s also possible to achieve redundancy via the dual-destination backup feature in inSync.

Availability

The Druva inSync cloud service is hosted on Amazon EC2 instances and is accessible over the WAN. It hosts hundreds to thousands of devices, with backups happening across the globe. At this scale, extended outages for cleaning up inconsistencies anywhere in the system are simply not acceptable, so the service must be available at all times; failover remains seamless despite EC2 failures.

On-premise Druva inSync runs inside our customers’ data centers, where availability is no less of a concern; again, tens of thousands of devices are backed up regularly to Druva inSync. There, it’s possible to achieve availability with the aforementioned dual-destination backup feature.

Self-Healing Storage

The Druva cloud file system may also face crashes, in the form of process failures, network disconnects, and so on. In addition, database entries can be lost due to disk corruption or other failures, and misconfigured anti-virus software can also wreak havoc. At such large scale, bringing down services to regularly detect and correct inconsistencies is simply not feasible. This brings us to the idea of self-healing storage.

It’s important for inSync that Druva cloud file system continues to serve both backup and restore requests, despite any possible storage inconsistencies. That’s what we’re selling, after all.

Our most common use case is a dead laptop whose user wants their data back as soon as possible. The last thing anyone wants to see under such circumstances is a restore failure. To address this restorability concern, as a regular inSync maintenance procedure, a restore is simulated for the latest snapshot of each device. This guarantees that if a restore is attempted for that snapshot, it won’t fail due to any kind of metadata inconsistency.

If an inconsistency is detected during the simulated restore process, it is purged. This ensures that the snapshot remains restorable, though it may be missing a few files. inSync also follows up on that inconsistency report with a full backup, to guarantee that the subsequent snapshot is clean and fully restorable. Thus, Druva storage ensures restorable snapshots for the mobile devices and laptops being backed up.
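As a hedged illustration of what such a restorability check could look like (hypothetical names and data layout; the actual inSync procedure isn’t spelled out here), a simulated restore might do something like:

```python
def verify_snapshot_restorable(snapshot, block_store, schedule_full_backup):
    """Simulated restore over a device's latest snapshot (hypothetical sketch).

    Walks the snapshot's file metadata, confirms every referenced data block
    is readable, purges inconsistent entries so the snapshot stays restorable
    (possibly minus a few files), and requests a follow-up full backup.
    """
    purged = []
    for name, block_ids in list(snapshot.items()):
        if any(b not in block_store for b in block_ids):
            del snapshot[name]               # purge the inconsistent file entry
            purged.append(name)
    if purged:
        schedule_full_backup()               # make the next snapshot clean
    return purged

# Example: one file references a block that was lost
snapshot = {"report.docx": ["b1", "b2"], "notes.txt": ["b3"]}
block_store = {"b1": b"...", "b2": b"..."}   # "b3" is missing
lost = verify_snapshot_restorable(snapshot, block_store,
                                  lambda: print("full backup scheduled"))
print(lost)    # ['notes.txt'] -- the snapshot remains restorable without it
```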

There are other possible inconsistencies which may not impact the restore process, but may prevent compaction or incremental backups of a device. To detect, report, and fix them, the Druva cloud file system has its own fsck functionality.
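As an illustration (again hypothetical; Druva’s actual fsck internals aren’t described in this post), such a background check might detect and reclaim unreferenced blocks like this:

```python
def storage_fsck(block_store, dedupe_index, block_maps, fix=False):
    """Background consistency check (hypothetical sketch).

    Finds blocks that exist in storage but are referenced by no file's block
    map -- harmless for restores, but they waste space and can get in the way
    of compaction. With fix=True, the orphans are reclaimed.
    """
    referenced = {b for blocks in block_maps.values() for b in blocks}
    orphans = [b for b in block_store if b not in referenced]
    for b in orphans:
        print(f"fsck: orphaned block {b}")
        if fix:
            block_store.pop(b, None)
            dedupe_index.pop(b, None)
    return orphans
```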

Both of these mechanisms run in the background during off-peak hours, as a regular, scheduled maintenance procedure.

At the scale at which Druva storage operates, it’s almost impossible to manually detect and fix metadata inconsistencies. Having it automated and self-healing is the only way our serviceability could scale at these levels. And we at Druva understand it well!


Shekhar Deshkar

Shekhar Deshkar leads Druva’s storage engineering team in the capacity of Chief Architect. He has been with Druva for more than four years. Prior to Druva, he worked at Marvell Inc. and Symantec Corporation (formerly Veritas). Shekhar’s primary area of focus has been file systems and related storage technologies, including caching, transactional systems, snapshotting, clustering, distributed file-system protocols, and flash/SSDs. He enjoys taking challenges head-on in the areas of concurrency, scalability, and performance of distributed storage systems, and he loves it most when his work simplifies day-to-day life for Druva customers.
