When operating at scale, even backup systems need backups. If a disaster ever occurs, IT needs to be able to restore a historical snapshot of users’ data to keep the business up and running. But a single corrupted file among the millions or even billions of files in the backup snapshot can cause a restore error, leading to serious delays in getting the business back up and running. A truly reliable backup system needs to overcome storage inconsistencies to ensure that every snapshot is always restorable.
When a process fails, well-intentioned antivirus software goes awry, or hardware failures corrupt database entries, stored data and metadata may become inconsistent. It then becomes difficult for backup systems to identify the correct data to restore and to restore it without error. Old approaches to these challenges leave businesses vulnerable to restore errors and extended downtime, resulting in IT headaches and lost productivity.
Backup systems operate by taking snapshots that record the contents of a file system at a given point in time, so that the data can be restored and made available in the event of disaster. The snapshots are incremental: rather than capturing the entire contents of the file system each time, they record only what has changed in each file and directory since the previous snapshot. This saves space while still maintaining a complete record of the file system being backed up. However, the incremental nature of the backups means that file corruption can become a difficult problem to solve.
For example, imagine that you have saved a file called A.txt. The file was last modified at time T1, and then backed up by a snapshot at time T2. But for some reason, that snapshot at T2 is corrupted—perhaps a network glitch interrupted the file backup momentarily, right when A.txt was being captured. Because the version of A.txt in the T2 backup is corrupted, an attempt to restore that snapshot will end in a restore failure.
The problem compounds over time because of the incremental nature of the snapshots. The next snapshot, at time T3, will reference the same corrupted version of A.txt as the previous one—and so will each future snapshot. This single file can cause the entire restore operation for each timestamp to fail, even though the corrupted file is just one tiny data point among millions.
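The propagation described above can be illustrated with a minimal sketch. It assumes a hypothetical content-addressed block store (a plain dictionary keyed by SHA-256 digest) and snapshots that map file paths to digests; none of these names or structures come from Druva's implementation. Because A.txt does not change between T2 and T3, both snapshots reference the same stored block, so corrupting that one block breaks both restores.

```python
import hashlib

# Hypothetical block store for illustration: digest -> stored bytes.
store = {}

def put(data: bytes) -> str:
    """Store a block and return the digest that snapshots use to reference it."""
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest

def take_snapshot(previous: dict, changed: dict) -> dict:
    """Incremental snapshot: reuse previous references, store only changed files."""
    snapshot = dict(previous)
    for path, data in changed.items():
        snapshot[path] = put(data)
    return snapshot

def restorable(snapshot: dict) -> bool:
    """A snapshot restores only if every referenced block exists and matches its digest."""
    for digest in snapshot.values():
        data = store.get(digest)
        if data is None or hashlib.sha256(data).hexdigest() != digest:
            return False
    return True

# T2 captures both files; T3 only sees a change to B.txt and reuses A.txt's reference.
t2 = take_snapshot({}, {"A.txt": b"hello", "B.txt": b"world"})
t3 = take_snapshot(t2, {"B.txt": b"world v2"})

# Simulate corruption of the single stored block for A.txt.
store[t2["A.txt"]] = b"garbage"

results = {"T2": restorable(t2), "T3": restorable(t3)}
print(results)
```

In this sketch one damaged block is enough to make every snapshot that references it fail, which is the behavior the example with A.txt describes.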
Druva’s self-healing storage system protects against these restore failures. This is accomplished by regularly simulating restores of backup snapshots to verify their viability and root out any data or metadata that could cause a failure.
In the example above, Druva would simulate a restore for the T3 snapshot and find that A.txt can’t be restored. The self-healing mechanism would then remove the metadata entry for A.txt from snapshots T2 and T3, ensuring that those snapshots remain available for restore even though they’ll be missing A.txt.
In addition, the self-healing mechanism forces a full backup, so that the next snapshot is clean and fully restorable, and replaces the corrupted version of A.txt with a clean one. This keeps every snapshot restorable and preserves business continuity even when data or metadata is corrupted.
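The self-healing steps described above (simulate a restore, drop the metadata entries that cannot be restored, then force a full backup) can be sketched as follows. This is an illustration under stated assumptions, not Druva's implementation: the dictionary-based store, the snapshot layout, and the function names `simulate_restore`, `self_heal`, and `full_backup` are all hypothetical.

```python
import hashlib

def digest_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def simulate_restore(snapshot: dict, store: dict) -> list:
    """Dry run: return paths whose blocks are missing or fail their checksum."""
    bad = []
    for path, digest in snapshot.items():
        data = store.get(digest)
        if data is None or digest_of(data) != digest:
            bad.append(path)
    return sorted(bad)

def self_heal(snapshots: dict, store: dict):
    """Remove unrestorable entries from each snapshot; signal that a full backup is needed."""
    removed = {}
    for name, snapshot in snapshots.items():
        bad = simulate_restore(snapshot, store)
        for path in bad:
            del snapshot[path]
        if bad:
            removed[name] = bad
    return removed, bool(removed)

def full_backup(source_files: dict, store: dict) -> dict:
    """Re-upload every file from the source, overwriting any corrupted block."""
    snapshot = {}
    for path, data in source_files.items():
        digest = digest_of(data)
        store[digest] = data
        snapshot[path] = digest
    return snapshot

# Hypothetical source device and backup state.
source = {"A.txt": b"hello", "B.txt": b"world"}
store = {}
t2 = full_backup(source, store)
t3 = dict(t2)                      # nothing changed, so T3 reuses T2's references
store[t2["A.txt"]] = b"garbage"    # the stored copy of A.txt becomes corrupted

snapshots = {"T2": t2, "T3": t3}
removed, needs_full = self_heal(snapshots, store)
if needs_full:
    snapshots["T4"] = full_backup(source, store)
```

After the run, T2 and T3 restore without A.txt, and the forced full backup (T4 in this sketch) contains a clean copy of it.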
This protection isn’t limited to inconsistencies that impact the restore process. Other inconsistencies could prevent compaction or incremental backups of a device, but Druva has its own file system consistency check (fsck) functionality to detect, report, and fix these inconsistencies.
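The text does not say what Druva's consistency check inspects, so the following is only a generic illustration of the detect, report, and fix pattern. It assumes two hypothetical kinds of inconsistency: dangling references (snapshot metadata that points to a block missing from the store) and orphaned blocks (stored blocks that no snapshot references).

```python
def check_consistency(snapshots: dict, store: dict) -> dict:
    """Detect and report inconsistencies between snapshot metadata and stored blocks."""
    referenced = {d for snap in snapshots.values() for d in snap.values()}
    dangling = sorted(
        (name, path)
        for name, snap in snapshots.items()
        for path, digest in snap.items()
        if digest not in store
    )
    orphaned = sorted(set(store) - referenced)
    return {"dangling": dangling, "orphaned": orphaned}

def repair(snapshots: dict, store: dict, report: dict) -> None:
    """Fix what the report found: drop dangling entries and delete orphaned blocks."""
    for name, path in report["dangling"]:
        del snapshots[name][path]
    for digest in report["orphaned"]:
        del store[digest]

# Hypothetical state: T1 references a block "d2" that is gone, and "d9" is unreferenced.
snapshots = {
    "T1": {"A.txt": "d1", "B.txt": "d2"},
    "T2": {"A.txt": "d1", "B.txt": "d3"},
}
store = {"d1": b"a", "d3": b"b2", "d9": b"stale"}

report = check_consistency(snapshots, store)
repair(snapshots, store, report)
after = check_consistency(snapshots, store)
```

Separating the report from the repair keeps the check usable as a read-only audit before anything is modified.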
Beyond the self-healing mechanisms discussed so far, Druva’s custom file system exceeds the capabilities of traditional file systems through an innovative approach to data durability and availability. Key features of the Druva file system include:
Additionally, the Druva cloud file system uses Amazon DynamoDB, a fast and flexible NoSQL database, to manage file system metadata. This enables Druva to provide consistent, single-digit millisecond latency at any scale.
High availability is essential, and it is provided by Amazon Elastic Compute Cloud (EC2), a highly scalable cloud computing environment. This cloud service approach means that extended outages are not needed to clean up inconsistencies or to operate at scale.
Learn more about how self-healing storage plays a crucial part in a Data Management-as-a-Service platform.