Tech/Engineering

Data Deduplication Among Untrusted Entities

Srikiran Gottipati, Senior Technical Director - Engineering

This is the third in our ongoing patent blog series. Read the previous one here.

Most storage systems implement data deduplication as a feature to reduce the total storage consumed. Deduplication is a method of storing a single copy of each unique block of data, even when that block appears multiple times within a single file or across multiple files.
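To make this concrete, here is a toy sketch in Python. The block contents, SHA-256 as the fingerprint function, and the in-memory dictionaries are all illustrative choices, not details of any particular implementation:

```python
import hashlib

def fingerprint(block: bytes) -> str:
    """Content-derived fingerprint of a data block."""
    return hashlib.sha256(block).hexdigest()

# Two files that happen to share one block.
files = {
    "a.txt": [b"header", b"common-block", b"trailer-a"],
    "b.txt": [b"common-block", b"trailer-b"],
}

block_store = {}  # fingerprint -> block: one physical copy per unique block
file_maps = {}    # file name -> ordered list of fingerprints

for name, blocks in files.items():
    file_maps[name] = [fingerprint(b) for b in blocks]
    for b in blocks:
        block_store.setdefault(fingerprint(b), b)

print(len(block_store))  # 4 physical blocks back 5 logical blocks
```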

When such a data block is shared across multiple files, the files themselves could belong to different owners or storage clients. This sharing of blocks across clients in a deduplicated storage can be a source of data security breaches if not appropriately addressed by the storage protocol.

Our newly awarded patent solves this security problem. 

Before diving into the problem in detail, let’s understand how server-side and client-side deduplication work.

Server-Side Deduplication

In server-side deduplication, the data is deduplicated by the storage server once it receives the write request with data from the client. 

So, when a client needs to write a data block to storage:

  1. The client sends the data block and related metadata to the storage server

  2. The storage server

    1. Computes the fingerprint of the data block

    2. Looks up the fingerprint in a hash store, and

      1. If the fingerprint matches, the block is a deduplicated block. In that case, the server adds the metadata to point to the existing shared block

      2. If no matches are found, it means the block is non-deduplicated. In this scenario, the server

        1. Commits the block to storage

        2. Adds the metadata to point to the newly committed block

        3. Adds the previously unseen fingerprint to the hash store
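Here is a minimal sketch of that flow, with an in-memory hash store, SHA-256 fingerprints, and a dictionary standing in for the data store (all illustrative choices):

```python
import hashlib

hash_store = {}  # fingerprint -> block reference
data_store = {}  # block reference -> committed block
metadata = []    # (client metadata, block reference) pairs

def server_write(block: bytes, meta: dict) -> str:
    """Server-side dedup: the client ships the full block every time."""
    fp = hashlib.sha256(block).hexdigest()  # fingerprinting consumes server CPU
    ref = hash_store.get(fp)
    if ref is None:                   # previously unseen fingerprint
        ref = f"blk-{len(data_store)}"
        data_store[ref] = block       # commit the block to storage
        hash_store[fp] = ref          # remember the new fingerprint
    metadata.append((meta, ref))      # point metadata at the (possibly shared) block
    return ref
```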

It’s important to remember that when data deduplication is implemented on the server side, additional resources are consumed:

  • CPU overhead from fingerprint calculation, which can limit the scalability of the storage.

  • Network overhead, because shared blocks are still sent to the storage server even though they are never committed.

Client-Side Deduplication

In client-side deduplication, the client presents the fingerprint of the data in a pre-commit request. Based on the fingerprint, the storage server decides whether the client needs to commit the block or can reference an existing block (with the same fingerprint). 

So, when a client needs to write a data block to storage:

  1. The client computes the fingerprint of the data block and sends the fingerprint to the storage server

  2. The storage server

    1. First looks up the fingerprint in a hash store and, if it finds a match, sends a reference to the existing shared block

    2. Otherwise sends a reference to the to-be-committed block, and a pre-signed URL for the client to commit the block to the data store

  3. Then, the client

    1. For a deduplicated block, sends the shared block reference and related metadata to the server

    2. And for a non-deduplicated block, commits the block to the data store using the pre-signed URL and sends the committed block reference and related metadata to the server

  4. Finally, the server

    1. For a deduplicated block, adds the metadata to point to the existing shared block

    2. And for a non-deduplicated block, adds the metadata to point to the newly committed block and adds the previously unseen fingerprint to the hash store
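Sketched in code, the handshake looks roughly like this; the `precommit` and `commit_metadata` names are made up for illustration, and a dictionary write stands in for the pre-signed URL upload:

```python
import hashlib

hash_store = {}  # server state: fingerprint -> block reference
data_store = {}  # the backing data store

# --- server side ---
def precommit(fp: str) -> dict:
    if fp in hash_store:                # deduplicated block
        return {"dedup": True, "ref": hash_store[fp]}
    ref = f"blk-{len(data_store)}"      # to-be-committed block
    return {"dedup": False, "ref": ref, "upload_url": ref}  # stand-in for a pre-signed URL

def commit_metadata(fp: str, ref: str, meta: dict) -> None:
    hash_store.setdefault(fp, ref)      # previously unseen fingerprints enter the hash store

# --- client side ---
def client_write(block: bytes, meta: dict) -> str:
    fp = hashlib.sha256(block).hexdigest()  # fingerprinting moves to the client
    resp = precommit(fp)
    if not resp["dedup"]:
        data_store[resp["upload_url"]] = block  # commit directly to the data store
    commit_metadata(fp, resp["ref"], meta)
    return resp["ref"]
```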

Compared to server-side deduplication, this process has several benefits for the storage system:

  • CPU overhead on the server is reduced, as fingerprint calculation is off-loaded to the client

  • Network overhead is reduced, as shared blocks are detected without being sent to storage

  • The server processes far less data, as blocks are committed directly to the data store

However, since the client claims to have the data but presents only its fingerprint, never the data itself, the server cannot ascertain whether the client really has the data it claims to have.

So, the server has to inherently trust the client to follow the protocol and not make false claims about the fingerprints or the data it owns.

The Trust Problem with Client-Side Deduplication

A rogue client can conceivably attack a deduplicated storage in several ways to either corrupt or gain access to data owned by other clients.

Fingerprint Fishing

A rogue client can fish for data blocks by presenting random fingerprints and falsely claiming ownership of the data block that matches the fingerprint. And if the hash store contains the fingerprint, the rogue client will have gained shared read access to the data block.
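Against the unmodified protocol above, the probe costs the attacker nothing but pre-commit requests. Here is a sketch, reusing the illustrative `precommit` from the earlier example; in practice the guesses would be fingerprints of known or suspected content (say, published file hashes), since blind guesses against a cryptographic hash are hopeless:

```python
import hashlib, os

def fish_for_blocks(candidates: list[str]) -> list[str]:
    """Probe the hash store with fingerprints of data this client never had."""
    hits = []
    for fp in candidates:
        resp = precommit(fp)  # the illustrative precommit from the earlier sketch
        if resp["dedup"]:
            hits.append(resp["ref"])  # a reference to someone else's block
    return hits

# Blind probing, for illustration only; real attacks would use fingerprints
# of content the attacker suspects is already stored.
guesses = [hashlib.sha256(os.urandom(32)).hexdigest() for _ in range(1000)]
print(fish_for_blocks(guesses))
```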

Fingerprint Squatting

A rogue client can prevent storage from storing actual data blocks by presenting random fingerprints and claiming to have uploaded the data when the hash store doesn’t contain the fingerprint. Any future writes or copies of that file/block will automatically point to non-existent data.

Fingerprint Ghosting

A rogue client can hide or corrupt specific files before their first backup by calculating the fingerprint of the actual data and by falsely claiming to have uploaded the data without actually doing so. Any future writes or copies of that file/block will automatically point to non-existent data.

Fingerprint Scrubbing

A rogue client can scrub a specific file before it is first committed to storage by calculating the fingerprint of the actual data and uploading spurious data instead. Any future writes or copies of that file/block will automatically point to the scrubbed data.

Fingerprint Forgery

A rogue client can forge a specific file before it is first committed to storage by calculating the fingerprint of the actual data and uploading forged data instead. Any future writes or copies of that file/block will automatically point to the forged data.
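Squatting and ghosting exploit the other half of the handshake: the rogue client completes the metadata step but never exercises the pre-signed URL. Scrubbing and forgery are the upload-side variants, where the client does use the URL but commits spurious or forged data under a genuine fingerprint. A sketch of ghosting, again reusing the illustrative functions from the client-side example:

```python
import hashlib

def ghost_block(victim_block: bytes, meta: dict) -> None:
    """Register a genuine fingerprint while withholding the data.

    Squatting is the same move with made-up fingerprints instead.
    """
    fp = hashlib.sha256(victim_block).hexdigest()
    resp = precommit(fp)
    if not resp["dedup"]:
        # Deliberately skip the upload to the pre-signed URL...
        # ...but complete the metadata step as if it had happened.
        commit_metadata(fp, resp["ref"], meta)
    # Every future honest write of this block now dedups against a
    # reference that points at nothing.
```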

The Solution to the Trust Problem

Verification is the way to defeat rogue clients. The storage server needs a means to verify the claim a client makes when it presents a fingerprint as part of the pre-commit request. This is done by requiring the client to corroborate its claim by presenting a second fingerprint, computed over a uniquely modified version of the data.

With the modified client-side deduplication scheme, when a client needs to write a data block to the storage:

  • The client sends an auth token to the server

  • The server then decrypts the customer encryption key (cek) using the auth token

  • The client now computes the fingerprint (let’s name this f1) of the data block and sends the fingerprint to the storage server

  • Next, the server computes an encryption key (ekey) from the fingerprint and the customer encryption key (cek) and sends the ekey to the client

  • The client then computes the fingerprint (let’s name this f2) after encrypting the data block with the ekey and sends both fingerprints (f2, f1) to the storage server

  • Next, the server

    • Looks up the fingerprint (f2) in a hash store, and

      • If f2 is found but the associated f1 doesn’t match the presented f1

        • Reads the block from storage and decrypts the block

        • Computes the fingerprint of the decrypted data block

        • If the computed fingerprint matches the stored f1 (error: bad request)

          • Deems the current request as rogue

        • Else (non-deduplicated block)

          • Deems the stored block as rogue

          • Deletes the fingerprint pair (f2, f1) from the hash store

          • Treats it as a non-deduplicated block request (see below)

      • If found and matches the associated f1 (deduplicated block)

        • Sends the shared block reference to the client

      • If not found (non-deduplicated block)

        • Sends a reference to the to-be-committed block, and

        • A pre-signed URL for the client to commit the block

  • The client

    • For a deduplicated block

      • Sends the shared block reference and related metadata to the server

    • For a non-deduplicated block

      • Commits the block to storage using the pre-signed URL

      • Sends the committed block reference and related metadata to the server

  • The server

    • For a deduplicated block

      • Adds the metadata to point to the existing shared block

    • For a non-deduplicated block

      • Adds the metadata to point to the newly committed block

      • Adds the previously unseen fingerprint pair (f2, f1) to the hash store
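Condensed into code, the server’s verification step might look as follows. This is a sketch under several simplifying assumptions that the patent does not prescribe: HMAC-SHA256 stands in for the key derivation, XOR for a real symmetric cipher, a single customer encryption key is assumed throughout, and the mismatch branch follows one natural reading of the flow above:

```python
import hashlib, hmac

hash_store = {}  # f2 -> (f1, block reference)
data_store = {}  # block reference -> encrypted block

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def derive_ekey(f1: str, cek: bytes) -> bytes:
    """ekey from the plaintext fingerprint and the customer encryption key."""
    return hmac.new(cek, f1.encode(), hashlib.sha256).digest()

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder for a real symmetric cipher (XOR undoes itself)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def handle_precommit(f2: str, f1: str, cek: bytes) -> dict:
    entry = hash_store.get(f2)
    if entry is None:                    # not found: non-deduplicated block
        ref = f"blk-{len(data_store)}"
        return {"dedup": False, "ref": ref, "upload_url": ref}
    stored_f1, ref = entry
    if stored_f1 == f1:                  # found and matching: deduplicated block
        return {"dedup": True, "ref": ref}
    # Found, but the f1s disagree: someone is lying. Verify against the data.
    plaintext = xor_cipher(data_store[ref], derive_ekey(stored_f1, cek))
    if fingerprint(plaintext) == stored_f1:
        raise ValueError("bad request: current request deemed rogue")
    del hash_store[f2]                   # stored block deemed rogue: evict the pair
    ref = f"blk-{len(data_store)}"       # and treat this as a fresh commit
    return {"dedup": False, "ref": ref, "upload_url": ref}
```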

The modified client-side deduplication scheme treats clients as untrusted entities and requires them to corroborate the claims they make, by way of fingerprints, about the data they hold. The data encryption and the subsequent fingerprint computation introduce additional overhead for the client. This encryption overhead, and the additional request-response round trip per write, can be reduced by computing the second fingerprint over a fixed, deterministic transformation of the data, such as compression or bit inversion.
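For instance, with compression as the fixed transformation (zlib is an arbitrary choice here; bit inversion would work the same way), the client can compute both fingerprints in a single pass, with no ekey round trip:

```python
import hashlib, zlib

def fingerprints(block: bytes) -> tuple[str, str]:
    """Both fingerprints locally: f2 over a deterministic transform of the data."""
    f1 = hashlib.sha256(block).hexdigest()
    f2 = hashlib.sha256(zlib.compress(block)).hexdigest()
    return f2, f1
```

The key property is that the transformation is fixed and reproducible by the server, so a claim can still be checked against the actual stored data.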

Conclusion

Client-side deduplication reduces server and network overhead, which in turn shortens backup duration. And with the use of pre-signed URLs, the storage architecture is split into separate control and data services, allowing new and multiple data stores to be supported in the backend. Not only has this made Druva’s backups more secure, it has also made them less costly.

This is a testament to Druva's work culture that allows innovative solutions to challenging problems. If this interests you, take a look at our careers page for your opportunity to join Druva.