Tech/Engineering

Data Deduplication Among Untrusted Entities

Srikiran Gottipati, Senior Technical Director - Engineering

This is the third in our ongoing patent blog series. Read the previous one here.

Most storage systems implement data deduplication as a feature to reduce the total storage consumed. Deduplication is a method of storing a single copy of each unique block of data, even when that block appears multiple times within a single file or across multiple files.
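To make this concrete, here is a toy sketch in Python. The block contents, SHA-256 as the fingerprint function, and the in-memory dictionaries are all illustrative choices, not details of any particular implementation:

```python
import hashlib

def fingerprint(block: bytes) -> str:
    """Content-derived fingerprint of a data block."""
    return hashlib.sha256(block).hexdigest()

# Two files that happen to share one block.
files = {
    "a.txt": [b"header", b"common-block", b"trailer-a"],
    "b.txt": [b"common-block", b"trailer-b"],
}

block_store = {}  # fingerprint -> block: one physical copy per unique block
file_maps = {}    # file name -> ordered list of fingerprints

for name, blocks in files.items():
    file_maps[name] = [fingerprint(b) for b in blocks]
    for b in blocks:
        block_store.setdefault(fingerprint(b), b)

print(len(block_store))  # 4 physical blocks back 5 logical blocks
```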

When such a data block is shared across multiple files, the files themselves could belong to different owners or storage clients. This sharing of blocks across clients in a deduplicated storage can be a source of data security breaches if not appropriately addressed by the storage protocol.

Our newly awarded patent solves this security problem. 

Before diving into the problem in detail, let’s understand how server-side and client-side deduplication work.

Server-Side Deduplication

In server-side deduplication, the data is deduplicated by the storage server once it receives the write request with data from the client. 

So, when a client needs to write a data block to storage:

  1. The client sends the data block and related metadata to the storage server

  2. The storage server

    1. Computes the fingerprint of the data block

    2. Looks up the fingerprint in a hash store, and

      1. If the fingerprint matches, the block is a deduplicated block. In that case, the server adds the metadata to point to the existing shared block

      2. If no matches are found, it means the block is non-deduplicated. In this scenario, the server

        1. Commits the block to storage

        2. Adds the metadata to point to the newly committed block

        3. Adds the previously unseen fingerprint to the hash store
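Here is a minimal sketch of that flow, with an in-memory hash store, SHA-256 fingerprints, and a dictionary standing in for the data store (all illustrative choices):

```python
import hashlib

hash_store = {}  # fingerprint -> block reference
data_store = {}  # block reference -> committed block
metadata = []    # (client metadata, block reference) pairs

def server_write(block: bytes, meta: dict) -> str:
    """Server-side dedup: the client ships the full block every time."""
    fp = hashlib.sha256(block).hexdigest()  # fingerprinting consumes server CPU
    ref = hash_store.get(fp)
    if ref is None:                   # previously unseen fingerprint
        ref = f"blk-{len(data_store)}"
        data_store[ref] = block       # commit the block to storage
        hash_store[fp] = ref          # remember the new fingerprint
    metadata.append((meta, ref))      # point metadata at the (possibly shared) block
    return ref
```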

It’s important to remember that when data deduplication is implemented on the server side, additional resources are consumed:

  • CPU overhead from fingerprint calculation, which can limit the scalability of the storage.

  • Network overhead, because shared blocks are still sent to the storage server even though they are never committed.

Client-Side Deduplication

In client-side deduplication, the client presents the fingerprint of the data in a pre-commit request. Based on the fingerprint, the storage server decides whether the client needs to commit the block or can reference an existing block (with the same fingerprint). 

So, when a client needs to write a data block to storage:

  1. The client computes the fingerprint of the data block and sends the fingerprint to the storage server

  2. The storage server

    1. First looks up the fingerprint in a hash store and, if it finds a match, sends a reference to the existing shared block

    2. Otherwise sends a reference to the to-be-committed block, and a pre-signed URL for the client to commit the block to the data store

  3. Then, the client

    1. For a deduplicated block, sends the shared block reference and related metadata to the server

    2. And for a non-deduplicated block, commits the block to the data store using the pre-signed URL and sends the committed block reference and related metadata to the server

  4. Finally, the server

    1. For a deduplicated block, adds the metadata to point to the existing shared block

    2. And for a non-deduplicated block, adds the metadata to point to the newly committed block and adds the previously unseen fingerprint to the hash store
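Sketched in code, the handshake looks roughly like this; the `precommit` and `commit_metadata` names are made up for illustration, and a dictionary write stands in for the pre-signed URL upload:

```python
import hashlib

hash_store = {}  # server state: fingerprint -> block reference
data_store = {}  # the backing data store

# --- server side ---
def precommit(fp: str) -> dict:
    if fp in hash_store:                # deduplicated block
        return {"dedup": True, "ref": hash_store[fp]}
    ref = f"blk-{len(data_store)}"      # to-be-committed block
    return {"dedup": False, "ref": ref, "upload_url": ref}  # stand-in for a pre-signed URL

def commit_metadata(fp: str, ref: str, meta: dict) -> None:
    hash_store.setdefault(fp, ref)      # previously unseen fingerprints enter the hash store

# --- client side ---
def client_write(block: bytes, meta: dict) -> str:
    fp = hashlib.sha256(block).hexdigest()  # fingerprinting moves to the client
    resp = precommit(fp)
    if not resp["dedup"]:
        data_store[resp["upload_url"]] = block  # commit directly to the data store
    commit_metadata(fp, resp["ref"], meta)
    return resp["ref"]
```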

Compared to server-side deduplication, this process has several benefits for the storage system:

  • CPU overhead on the server is reduced, as fingerprint calculation is off-loaded to the client

  • Network overhead is reduced, as shared blocks are detected without being sent to storage

  • The server processes far less data, as blocks are committed directly to the data store

However, since the client claims to have the data but presents only its fingerprint, never the data itself, the server cannot ascertain whether the client really has the data it claims to have.

So, the server has to inherently trust the client to follow the protocol and not make false claims about the fingerprints or the data it owns.

The Trust Problem with Client-Side Deduplication

A rogue client can conceivably attack a deduplicated storage in several ways to either corrupt or gain access to data owned by other clients.

Fingerprint Fishing

A rogue client can fish for data blocks by presenting random fingerprints and falsely claiming ownership of the data block that matches the fingerprint. And if the hash store contains the fingerprint, the rogue client will have gained shared read access to the data block.
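Against the unmodified protocol above, the probe costs the attacker nothing but pre-commit requests. Here is a sketch, reusing the illustrative `precommit` from the earlier example; in practice the guesses would be fingerprints of known or suspected content (say, published file hashes), since blind guesses against a cryptographic hash are hopeless:

```python
import hashlib, os

def fish_for_blocks(candidates: list[str]) -> list[str]:
    """Probe the hash store with fingerprints of data this client never had."""
    hits = []
    for fp in candidates:
        resp = precommit(fp)  # the illustrative precommit from the earlier sketch
        if resp["dedup"]:
            hits.append(resp["ref"])  # a reference to someone else's block
    return hits

# Blind probing, for illustration only; real attacks would use fingerprints
# of content the attacker suspects is already stored.
guesses = [hashlib.sha256(os.urandom(32)).hexdigest() for _ in range(1000)]
print(fish_for_blocks(guesses))
```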

Fingerprint Squatting

A rogue client can prevent storage from storing actual data blocks by presenting random fingerprints and claiming to have uploaded the data when the hash store doesn’t contain the fingerprint. Any future writes or copies of that file/block will automatically point to non-existent data.

Fingerprint Ghosting

A rogue client can hide or corrupt specific files before their first backup by calculating the fingerprint of the actual data and by falsely claiming to have uploaded the data without actually doing so. Any future writes or copies of that file/block will automatically point to non-existent data.

Fingerprint Scrubbing

A rogue client can scrub a specific file before it is first committed to storage by calculating the fingerprint of the actual data and uploading spurious data instead. Any future writes or copies of that file/block will automatically point to the scrubbed data.

Fingerprint Forgery

A rogue client can forge a specific file before it is first committed to storage by calculating the fingerprint of the actual data and uploading forged data instead. Any future writes or copies of that file/block will automatically point to the forged data.
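Squatting and ghosting exploit the other half of the handshake: the rogue client completes the metadata step but never exercises the pre-signed URL. Scrubbing and forgery are the upload-side variants, where the client does use the URL but commits spurious or forged data under a genuine fingerprint. A sketch of ghosting, again reusing the illustrative functions from the client-side example:

```python
import hashlib

def ghost_block(victim_block: bytes, meta: dict) -> None:
    """Register a genuine fingerprint while withholding the data.

    Squatting is the same move with made-up fingerprints instead.
    """
    fp = hashlib.sha256(victim_block).hexdigest()
    resp = precommit(fp)
    if not resp["dedup"]:
        # Deliberately skip the upload to the pre-signed URL...
        # ...but complete the metadata step as if it had happened.
        commit_metadata(fp, resp["ref"], meta)
    # Every future honest write of this block now dedups against a
    # reference that points at nothing.
```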

The Solution to the Trust Problem

Verification is the way to defeat rogue clients. The storage server needs a means to verify the claim a client makes when it presents a fingerprint as part of the pre-commit request. This is done by requiring the client to corroborate its claim by presenting a second fingerprint, computed over a uniquely modified version of the data.

With the modified client-side deduplication scheme, when a client needs to write a data block to the storage:

  • The client sends an auth token to the server

  • The server then decrypts the customer encryption key (cek) using the auth token

  • The client now computes the fingerprint (let’s name this f1) of the data block and sends the fingerprint to the storage server

  • Next, the server computes an encryption key (ekey) from the fingerprint and the customer encryption key (cek) and sends the ekey to the client

  • The client then computes the fingerprint (let’s name this f2) after encrypting the data block with the ekey and sends both fingerprints (f2, f1) to the storage server

  • Next, the server

    • Looks up the fingerprint (f2) in a hash store, and

      • If f2 is found but the associated f1 doesn’t match the presented f1

        • Reads the block from storage and decrypts the block

        • Computes the fingerprint of the decrypted data block

        • If the computed fingerprint matches the stored f1 (error: bad request)

          • Deems the current request as rogue

        • Else (non-deduplicated block)

          • Deems the stored block as rogue

          • Deletes the fingerprint pair (f2, f1) from the hash store

          • Treats it as a non-deduplicated block request (see below)

      • If found and matches the associated f1 (deduplicated block)

        • Sends the shared block reference to the client

      • If not found (non-deduplicated block)

        • Sends a reference to the to-be-committed block, and

        • A pre-signed URL for the client to commit the block

  • The client

    • For a deduplicated block

      • Sends the shared block reference and related metadata to the server

    • For a non-deduplicated block

      • Commits the block to storage using the pre-signed URL

      • Sends the committed block reference and related metadata to the server

  • The server

    • For a deduplicated block

      • Adds the metadata to point to the existing shared block

    • For a non-deduplicated block

      • Adds the metadata to point to the newly committed block

      • Adds the previously unseen fingerprint pair (f2, f1) to the hash store
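Condensed into code, the server’s verification step might look as follows. This is a sketch under several simplifying assumptions that the patent does not prescribe: HMAC-SHA256 stands in for the key derivation, XOR for a real symmetric cipher, a single customer encryption key is assumed throughout, and the mismatch branch follows one natural reading of the flow above:

```python
import hashlib, hmac

hash_store = {}  # f2 -> (f1, block reference)
data_store = {}  # block reference -> encrypted block

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def derive_ekey(f1: str, cek: bytes) -> bytes:
    """ekey from the plaintext fingerprint and the customer encryption key."""
    return hmac.new(cek, f1.encode(), hashlib.sha256).digest()

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder for a real symmetric cipher (XOR undoes itself)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def handle_precommit(f2: str, f1: str, cek: bytes) -> dict:
    entry = hash_store.get(f2)
    if entry is None:                    # not found: non-deduplicated block
        ref = f"blk-{len(data_store)}"
        return {"dedup": False, "ref": ref, "upload_url": ref}
    stored_f1, ref = entry
    if stored_f1 == f1:                  # found and matching: deduplicated block
        return {"dedup": True, "ref": ref}
    # Found, but the f1s disagree: someone is lying. Verify against the data.
    plaintext = xor_cipher(data_store[ref], derive_ekey(stored_f1, cek))
    if fingerprint(plaintext) == stored_f1:
        raise ValueError("bad request: current request deemed rogue")
    del hash_store[f2]                   # stored block deemed rogue: evict the pair
    ref = f"blk-{len(data_store)}"       # and treat this as a fresh commit
    return {"dedup": False, "ref": ref, "upload_url": ref}
```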

The modified client-side deduplication scheme treats clients as untrusted entities and requires them to corroborate the claims they make, by way of fingerprints, about the data they hold. The data encryption and the subsequent fingerprint computation introduce additional overhead for the client. This encryption overhead, and the additional request-response round trip per write, can be reduced by computing the second fingerprint over a fixed, deterministic transformation of the data, such as compression or bit inversion.
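For instance, with compression as the fixed transformation (zlib is an arbitrary choice here; bit inversion would work the same way), the client can compute both fingerprints in a single pass, with no ekey round trip:

```python
import hashlib, zlib

def fingerprints(block: bytes) -> tuple[str, str]:
    """Both fingerprints locally: f2 over a deterministic transform of the data."""
    f1 = hashlib.sha256(block).hexdigest()
    f2 = hashlib.sha256(zlib.compress(block)).hexdigest()
    return f2, f1
```

The key property is that the transformation is fixed and reproducible by the server, so a claim can still be checked against the actual stored data.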

Conclusion

Client-side deduplication reduces server and network overhead, which in turn shortens backup duration. And with the use of pre-signed URLs, the storage architecture is split into separate control and data services, allowing new and multiple data stores to be supported in the backend. Not only has this made Druva’s backups more secure, it has also made them less costly.

This is a testament to Druva's work culture that allows innovative solutions to challenging problems. If this interests you, take a look at our careers page for your opportunity to join Druva.