Effective Data Recovery | File-Level Restores [Part 1]

Rakesh Sharma, Sr. Staff Software Engineer

Introduction

The ability to restore individual files is one of the most basic functions of a data backup and protection tool. As easy as it may seem, doing this for a file located inside a VM is resource-intensive and time-consuming. Even to restore a small file, users must restore the contents of the entire VM and then pick the file out of the restored data.

To get around this problem, we planned to use File-Level Restore, or simply FLR. Part 1 of this blog series explains what FLR is and how we implemented it. Part 2 will discuss the improvements we made to the FLR process to achieve faster execution speeds.

What is FLR?

FLR stands for File-Level Restore. It refers to the ability to restore individual files or folders from a virtual disk file that is stored in the cloud. This eliminates the need to download the entire virtual disk and attach it to a virtual machine. FLR provides a more efficient and granular way to recover specific files or folders, saving time and resources compared to traditional full-disk restores.

Significance of FLR

File-Level Restore (FLR) plays a vital role in providing customers enhanced control over the data restoration process. Without FLR, even for a file as small as 16 MB, users would have to restore the entire virtual machine, leading to the potential download of multiple terabytes of data from the cloud. FLR offers a streamlined and efficient solution for selectively restoring specific files from a virtual disk backup.

The high adoption rate of FLR (accounting for approximately 50% of all restores) underscores the importance of this feature and its frequent utilization by customers.

Before diving into the solution, let's look at the tech stack we used and a few key terms that appear throughout the explanation.

Technology stack that we used

  • Language: Python

  • FUSE: User-space module implemented in Python

  • Loop Devices

Terminology

  • FLR: File-Level Restore.

  • Virtual disk: A virtual disk is a file that appears as a physical disk drive to the guest operating system. Virtual hard disk files store information such as the operating system, program files, and data files.

  • Disk Offset: An offset into a disk is the byte position within that disk, usually counted from 0; thus "offset 240" is the 241st byte in the disk.

  • File Offset: An offset into a file is the byte position within that file, usually counted from 0. The important thing to note is that a File Offset must be converted to a Disk Offset before the data can be read, and this conversion is done by the underlying filesystem.

  • Target/Target VM: Used interchangeably to refer to a target virtual machine where data has to be restored.

Original Solution

To access a file stored within a virtual disk, it is necessary to determine the disk offsets of the data blocks corresponding to that file. In the best case, the blocks of the file are stored sequentially. In the worst case, the data blocks may be scattered across the entire virtual disk.

File blocks to disk blocks mapping
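
To make the mapping concrete, here is a minimal sketch that translates a file offset into a disk offset using a list of extents. The `Extent` type, the `file_to_disk_offset` helper, and the sample layout are hypothetical and exist only for illustration; real filesystems keep this mapping in their own on-disk structures, and in FLR the mounted filesystem does this work for us.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Extent:
    """A contiguous run of file data on disk (hypothetical structure)."""
    file_offset: int  # where this run starts inside the file
    disk_offset: int  # where this run starts on the virtual disk
    length: int       # number of bytes in the run


def file_to_disk_offset(extents: List[Extent], file_offset: int) -> int:
    """Translate a byte offset in the file to a byte offset on the disk."""
    for ext in extents:
        if ext.file_offset <= file_offset < ext.file_offset + ext.length:
            return ext.disk_offset + (file_offset - ext.file_offset)
    raise ValueError(f"file offset {file_offset} is not mapped")


# Worst case from the text: consecutive file blocks live far apart on disk.
scattered = [
    Extent(file_offset=0, disk_offset=1_048_576, length=4096),
    Extent(file_offset=4096, disk_offset=8192, length=4096),
]
```

With this layout, byte 4100 of the file is the fifth byte of the second extent, so it resolves to disk offset 8196 even though the first file block sits much further into the disk.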

On a Linux machine, it is possible to treat a regular file as a block device by utilizing a Loop Device. The Loop Device can treat the virtual disk as a block device on the local Linux machine and mount its volumes with the appropriate filesystem type. Once the volumes are successfully mounted, we gain the ability to read any desired file(s) or folder(s) stored within them. The mounted filesystem handles the conversion from file offsets to disk offsets seamlessly.
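
As a rough sketch of this step, the helpers below build the `losetup` and `mount` command lines involved. They only construct the commands and do not run them, since both require root on a Linux host. The helper names, the image path, the loop device name, and the mount point are all hypothetical, and the read-only flags are our assumption about what is sensible for a backup image.

```python
from typing import List


def losetup_command(image_path: str) -> List[str]:
    """Build a command that attaches image_path to the first free loop device.

    --find picks a free device, --show prints its name (e.g. /dev/loop0),
    --read-only protects the backup image, and --partscan asks the kernel
    to expose the partitions as /dev/loopNpM.
    """
    return ["losetup", "--find", "--show", "--read-only", "--partscan", image_path]


def mount_command(partition_dev: str, mount_point: str, fs_type: str) -> List[str]:
    """Build a command that mounts one partition read-only."""
    return ["mount", "-t", fs_type, "-o", "ro", partition_dev, mount_point]


if __name__ == "__main__":
    # Printed rather than executed: running them needs root privileges.
    print(" ".join(losetup_command("/tmp/flr/disk.img")))
    print(" ".join(mount_command("/dev/loop0p1", "/mnt/flr", "ext4")))
```

In a real run, the device name printed by `losetup --show` would be fed into the `mount` call instead of the hard-coded `/dev/loop0p1` used here.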

Fortunately, even if the virtual disk is not locally available, we can still mount its volumes on the local machine using Loop Devices and FUSE (Filesystem in Userspace). This enables us to access and work with the contents of the virtual disk without requiring its physical presence on the local system.

FUSE 

FUSE (Filesystem in Userspace) is a software interface that allows non-privileged users to create their own file systems on Unix and Unix-like operating systems without modifying kernel code. It achieves this by running file system code in user space while providing a bridge to the kernel interfaces.

In our use case, FUSE plays a crucial role in redirecting read system calls to cloud storage instead of serving them locally. This allows us to simulate file system operations according to our requirements.

Reading a single block of data looks like this. 

Single read request

The agent process is a process running on the local machine and is responsible for serving the reads from the cloud.
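
The read path can be sketched as follows. The `read(path, size, offset, fh)` signature mirrors the one used by the fusepy library's `Operations` class, but the sketch deliberately does not import FUSE so that it runs anywhere. `InMemoryCloud`, `fetch_block`, `VirtualDiskReader`, and the 1 MiB block size are assumptions made for illustration and are not a description of the production agent.

```python
BLOCK_SIZE = 1024 * 1024  # assumed block size of the backed-up disk


class InMemoryCloud:
    """Stand-in for cloud storage (hypothetical); counts block downloads."""

    def __init__(self, data: bytes):
        self._data = data
        self.fetches = 0

    def fetch_block(self, index: int) -> bytes:
        self.fetches += 1
        start = index * BLOCK_SIZE
        return self._data[start:start + BLOCK_SIZE]


class VirtualDiskReader:
    """Serves reads of the virtual disk file by pulling blocks on demand."""

    def __init__(self, cloud: InMemoryCloud, disk_size: int):
        self._cloud = cloud
        self._disk_size = disk_size

    def read(self, path: str, size: int, offset: int, fh: int) -> bytes:
        # Clamp the request so it never runs past the end of the disk.
        end = min(offset + size, self._disk_size)
        chunks = []
        pos = offset
        while pos < end:
            # Find the block that holds this position and where in it we are.
            index, within = divmod(pos, BLOCK_SIZE)
            block = self._cloud.fetch_block(index)
            take = min(BLOCK_SIZE - within, end - pos)
            chunks.append(block[within:within + take])
            pos += take
        return b"".join(chunks)
```

A 4 KB read that straddles a block boundary triggers exactly two block fetches here, which is the point of the design: only the blocks the filesystem actually asks for are downloaded.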

By adopting this approach, we can selectively read the desired file(s) or folder(s) without the need to download the entire virtual disk. For instance, if we only need to restore a 16 MB file, there's no need to retrieve the entire 2 TB virtual disk. This approach significantly reduces the amount of data transferred.

However, it's important to note that this solution has limitations in terms of download speed. Specifically, the FLR performance was approximately 20 GBPH (gigabytes per hour).
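
To put that figure in perspective, the small calculation below converts it into restore times. It assumes decimal units (1 GB = 1000 MB) and ignores mount and setup overhead; the function names are ours, chosen for illustration.

```python
FLR_RATE_GB_PER_HOUR = 20  # throughput figure quoted above


def restore_seconds(size_gb: float,
                    rate_gb_per_hour: float = FLR_RATE_GB_PER_HOUR) -> float:
    """Estimated transfer time in seconds, ignoring mount and setup overhead."""
    return size_gb / rate_gb_per_hour * 3600


def rate_mb_per_second(rate_gb_per_hour: float = FLR_RATE_GB_PER_HOUR) -> float:
    """Convert GB/hour to MB/second, assuming 1 GB = 1000 MB."""
    return rate_gb_per_hour * 1000 / 3600
```

Under these assumptions, 20 GBPH is roughly 5.6 MB/s: a 16 MB file transfers in about 3 seconds, while a 100 GB file takes about 5 hours.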

To address this limitation, we implemented a solution using duplicate loop devices per disk, which we’ll describe in the second part of this blog.

After downloading a block, the next step is to copy it to the target virtual machine. However, it is undesirable to download the entire file before initiating the write operation on the target VM. 

Initially, we used a VMware Tools API called InitiateFileTransferToGuest to copy files to the target VM. This API accepts a source_path and destination_path and handles both reading and writing data to the target VM. While this API sufficed for small file transfers, it performed poorly on transfers in the gigabyte range. To address this limitation, we implemented a custom Reader/Writer Pipeline, which significantly improved the efficiency and performance of transferring gigabytes of data to the target VM.
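
The general idea of such a pipeline can be sketched with a bounded queue between a reader thread and a writer loop, so that writing to the target starts as soon as the first block arrives. This is a generic producer/consumer sketch and not our actual implementation; `run_pipeline`, its parameters, and the `max_in_flight` default are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable

_DONE = object()  # sentinel that marks the end of the stream


def run_pipeline(blocks: Iterable[bytes],
                 write: Callable[[bytes], None],
                 max_in_flight: int = 8) -> int:
    """Stream blocks from a reader to a writer; returns bytes written."""
    buf = queue.Queue(maxsize=max_in_flight)
    errors = []

    def reader() -> None:
        try:
            for block in blocks:
                buf.put(block)  # waits here when the writer falls behind
        except Exception as exc:  # surface download errors to the caller
            errors.append(exc)
        finally:
            buf.put(_DONE)

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    written = 0
    while True:
        item = buf.get()
        if item is _DONE:
            break
        write(item)  # e.g. send the block to the target VM
        written += len(item)
    thread.join()
    if errors:
        raise errors[0]
    return written
```

The bounded queue keeps memory use flat regardless of file size: at most `max_in_flight` blocks are buffered, and a slow writer naturally throttles the downloader.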

Next steps

Although FLR helped us eliminate the need to download the entire virtual disk to restore a single file, its poor performance for large file restores became a hindrance.

Improving the performance of large file transfers was a key element of our FLR implementation. Stay tuned for Part 2 of this blog, where we talk about the improvements that we made.

To learn more about Druva’s technical innovations and how we deliver the best cloud-based backup and restore solution on the market, visit the tech/engineering section of the blog archive.