This is the third in our ongoing patent blog series. Read the previous one here.
Most storage systems implement data deduplication as a feature to reduce total storage capacity. Deduplication is a method of storing a single copy for each unique block of data even when a block appears multiple times in a single file and/or across multiple files.
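As a rough illustration of how blocks are identified, deduplicating systems typically fingerprint each block with a cryptographic hash of its content. The hash function and block size below (SHA-256 over 4 KiB blocks) are illustrative assumptions, not the specifics of any particular product:

```python
import hashlib

def fingerprint(block: bytes) -> str:
    # Content fingerprint of a fixed-size data block.
    return hashlib.sha256(block).hexdigest()

# Two files that happen to contain the same 4 KiB block.
block_in_file_a = b"\x00" * 4096
block_in_file_b = b"\x00" * 4096

# Identical content yields identical fingerprints, so the storage system
# only needs to keep one physical copy of the block.
assert fingerprint(block_in_file_a) == fingerprint(block_in_file_b)
```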
When such a data block is shared across multiple files, the files themselves could belong to different owners or storage clients. This sharing of blocks across clients in deduplicated storage can be a source of data security breaches if not appropriately addressed by the storage protocol.
Our newly awarded patent solves this security problem.
Before digging into the problem in detail, let’s look at how server-side and client-side deduplication work.
Server-Side Deduplication
In server-side deduplication, the data is deduplicated by the storage server once it receives the write request with data from the client.
So, when a client needs to write a data block to storage (a code sketch of this flow follows the steps below):
The client sends the data block and related metadata to the storage server
The storage server:
  Computes the fingerprint of the data block
  Looks up the fingerprint in a hash store, and
  If the fingerprint matches, the block is a deduplicated block; in that case, the server adds the metadata to point to the existing shared block
  If no match is found, the block is non-deduplicated; in this scenario, the server:
    Commits the block to storage
    Adds the metadata to point to the newly committed block
    Adds the previously unseen fingerprint to the hash store
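A minimal sketch of this server-side flow, with in-memory dictionaries standing in for the hash store, block store, and file metadata (all names are illustrative assumptions, not a real API):

```python
import hashlib

class ServerSideDedupServer:
    """Sketch of the server-side deduplication write path."""

    def __init__(self):
        self.hash_store = {}    # fingerprint -> block reference
        self.block_store = {}   # block reference -> block data
        self.metadata = {}      # (file_id, offset) -> block reference

    def write_block(self, file_id: str, offset: int, data: bytes) -> None:
        # The server computes the fingerprint itself (CPU overhead), and the
        # full block always crosses the network even if it is already stored.
        fp = hashlib.sha256(data).hexdigest()

        ref = self.hash_store.get(fp)
        if ref is None:
            # Previously unseen block: commit it and remember its fingerprint.
            ref = f"block-{len(self.block_store)}"
            self.block_store[ref] = data
            self.hash_store[fp] = ref
        # Either way, the file metadata points at the (possibly shared) block.
        self.metadata[(file_id, offset)] = ref
```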
It’s important to remember that when data deduplication is implemented on the server side, additional resources are consumed:
CPU overhead from fingerprint calculation, which can limit the scalability of the storage
Network overhead, because shared blocks are sent to storage but never committed
Client-Side Deduplication
In client-side deduplication, the client presents the fingerprint of the data in a pre-commit request. Based on the fingerprint, the storage server decides whether the client needs to commit the block or can reference an existing block (with the same fingerprint).
So, when a client needs to write a data block to storage (again, a code sketch follows the steps below):
The client computes the fingerprint of the data block and sends the fingerprint to the storage server
The storage server:
  First looks up the fingerprint in a hash store and sends a reference to the existing shared block if it finds a match
  Otherwise sends a reference to the to-be-committed block, and a pre-signed URL for the client to commit the block to the data store
Then, the client:
  For a deduplicated block, sends the shared block reference and related metadata to the server
  For a non-deduplicated block, commits the block to the data store using the pre-signed URL and sends the committed block reference and related metadata to the server
Finally, the server:
  For a deduplicated block, adds the metadata to point to the existing shared block
  For a non-deduplicated block, adds the metadata to point to the newly committed block and adds the previously unseen fingerprint to the hash store
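A comparable sketch of the client-side flow is shown below. The class and method names, the in-memory stores, and the placeholder pre-signed URL are again assumptions for illustration only:

```python
import hashlib

def upload_to_data_store(url: str, data: bytes) -> None:
    # Placeholder for an HTTP PUT of the block to the pre-signed URL.
    pass

class DedupControlServer:
    """Sketch of the control-plane server for client-side deduplication."""

    def __init__(self):
        self.hash_store = {}   # fingerprint -> block reference
        self.metadata = {}     # (file_id, offset) -> block reference

    def pre_commit(self, fingerprint: str) -> dict:
        ref = self.hash_store.get(fingerprint)
        if ref is not None:
            # Deduplicated: the client never uploads the block.
            return {"dedup": True, "ref": ref}
        # Non-deduplicated: hand back a reference for the to-be-committed
        # block plus a (hypothetical) pre-signed URL for the data store.
        new_ref = f"block-{len(self.hash_store)}"
        return {"dedup": False, "ref": new_ref,
                "upload_url": f"https://datastore.example/{new_ref}?sig=..."}

    def commit(self, file_id: str, offset: int, fingerprint: str,
               ref: str, dedup: bool) -> None:
        # Record which block this part of the file points to; remember the
        # fingerprint only when a new block was actually committed.
        self.metadata[(file_id, offset)] = ref
        if not dedup:
            self.hash_store[fingerprint] = ref

def client_write(server: DedupControlServer, file_id: str, offset: int,
                 data: bytes) -> None:
    # The client, not the server, computes the fingerprint.
    fp = hashlib.sha256(data).hexdigest()
    reply = server.pre_commit(fp)
    if not reply["dedup"]:
        # Only previously unseen blocks cross the network, and they go
        # straight to the data store rather than through the storage server.
        upload_to_data_store(reply["upload_url"], data)
    server.commit(file_id, offset, fp, reply["ref"], reply["dedup"])
```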
Compared with server-side deduplication, this process has several benefits for the storage server:
CPU overhead is reduced, as fingerprint calculation is off-loaded to the client
Network overhead is reduced, as shared blocks are detected without being sent to storage
The server does much less data processing, as data is committed directly to the data store
However, since the client claims to have the data and presents only its fingerprint, never the actual data, to the storage server, the server cannot ascertain whether the client really has the data it claims to have.
So, the server has to inherently trust the client to follow the protocol and not make any false claims about the fingerprints or the data that it owns.
The Trust Problem with Client-Side Deduplication
A rogue client can conceivably attack deduplicated storage in several ways, either to corrupt data owned by other clients or to gain access to it.
Fingerprint Fishing
A rogue client can fish for data blocks by presenting random fingerprints and falsely claiming ownership of the data block that matches the fingerprint. And if the hash store contains the fingerprint, the rogue client will have gained shared read access to the data block.
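To make the attack concrete, the sketch below (reusing the hypothetical DedupControlServer above) shows a rogue client probing pre-commit requests with fingerprints of data it does not possess. In practice the guesses would more plausibly be fingerprints of suspected or known content rather than purely random values, since random collisions against a strong hash are negligible:

```python
import os
import hashlib

def fingerprint_fishing(server, file_id: str, attempts: int) -> list:
    """Hypothetical rogue client fishing for blocks it never owned."""
    stolen_refs = []
    for offset in range(attempts):
        guess = hashlib.sha256(os.urandom(32)).hexdigest()  # fingerprint guess
        reply = server.pre_commit(guess)
        if reply["dedup"]:
            # The hash store knows this fingerprint: the rogue client now
            # attaches someone else's block to its own file metadata.
            server.commit(file_id, offset, guess, reply["ref"], dedup=True)
            stolen_refs.append(reply["ref"])
    return stolen_refs
```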
Fingerprint Squatting
A rogue client can prevent storage from storing actual data blocks by presenting random fingerprints and claiming to have uploaded the data when the hash store doesn’t contain the fingerprint. Any future writes or copies of that file/block will automatically point to non-existent data.
Fingerprint Ghosting
A rogue client can hide or corrupt specific files before their first backup by calculating the fingerprint of the actual data and falsely claiming to have uploaded the data. Any future writes or copies of that file/block will automatically point to non-existent data.
Fingerprint Scrubbing
A rogue client can scrub a specific file before it is first committed to storage by calculating the fingerprint of the actual data and uploading spurious data instead. Any future writes or copies of that file/block will automatically point to the scrubbed data.
Fingerprint Forgery
A rogue client can forge a specific file before it is first committed to storage by calculating the fingerprint of the actual data and uploading forged data instead. Any future writes or copies of that file/block will automatically point to the forged data.
The Solution to the Trust Problem
Verification is an effective way to deal with rogue clients. The storage server needs a way to verify the claim a client makes when it presents a fingerprint as part of the pre-commit request. This is done by requiring the client to corroborate its claim with a second fingerprint, computed over a uniquely modified version of the data.
With the modified client-side deduplication scheme, when a client needs to write a data block to storage (a code sketch of this flow follows these steps):
The client sends an auth token to the server
The server then decrypts the customer encryption key (cek) using the auth token
The client now computes the fingerprint (let’s name this f1) of the data block and sends the fingerprint to the storage server
Next, the server computes an encryption key (ekey) from the fingerprint and the customer encryption key (cek) and sends the ekey to the client
The client then computes the fingerprint (let’s name this f2) after encrypting the data block with the ekey and sends both fingerprints (f2, f1) to the storage server
Next, the server:
  Looks up the fingerprint (f2) in a hash store, and
  If f2 is found but the presented f1 doesn't match the f1 associated with it:
    Reads the block from storage and decrypts it
    Computes the fingerprint of the decrypted data block
    If that fingerprint matches the stored f1 (error: bad request):
      Deems the current request as rogue
    Else (non-deduplicated block):
      Deems the stored block as rogue
      Deletes the fingerprint pair (f2, f1) from the hash store
      Treats it as a non-deduplicated block request (see below)
  If f2 is found and the presented f1 matches the associated f1 (deduplicated block):
    Sends the existing shared block reference to the client
  If f2 is not found (non-deduplicated block):
    Sends a reference to the to-be-committed block, and
    A pre-signed URL for the client to commit the block
The client:
  For a deduplicated block:
    Sends the shared block reference and related metadata to the server
  For a non-deduplicated block:
    Commits the block to storage using the pre-signed URL
    Sends the committed block reference and related metadata to the server
The server:
  For a deduplicated block:
    Adds the metadata to point to the existing shared block
  For a non-deduplicated block:
    Adds the metadata to point to the newly committed block
    Adds the previously unseen fingerprint pair (f2, f1) to the hash store
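The sketch below condenses the client's corroboration step and the server's verification logic from the steps above, assuming a single customer encryption key (cek) for simplicity. The key derivation (HMAC-SHA256) and the XOR-keystream "encryption" are stand-ins chosen so the example is self-contained, not algorithms prescribed by the patent, and the commit is modeled as a direct call rather than an upload through a pre-signed URL:

```python
import hashlib
import hmac

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def derive_ekey(f1: str, cek: bytes) -> bytes:
    # ekey is derived from the block fingerprint and the customer encryption
    # key (cek); HMAC-SHA256 is an illustrative choice, not a prescribed one.
    return hmac.new(cek, f1.encode(), hashlib.sha256).digest()

def encrypt(data: bytes, key: bytes) -> bytes:
    # Toy deterministic cipher (XOR with a hash-derived keystream), used only
    # so the sketch is self-contained. Not for production use.
    blocks = (hashlib.sha256(key + i.to_bytes(8, "big")).digest()
              for i in range(0, len(data), 32))
    keystream = b"".join(blocks)[:len(data)]
    return bytes(a ^ b for a, b in zip(data, keystream))

decrypt = encrypt  # XOR with the same keystream is its own inverse

class VerifyingDedupServer:
    """Sketch of the server side of the modified (verified) scheme."""

    def __init__(self, cek: bytes):
        self.cek = cek
        self.hash_store = {}    # f2 -> (f1, block reference)
        self.block_store = {}   # block reference -> encrypted block

    def ekey_for(self, f1: str) -> bytes:
        return derive_ekey(f1, self.cek)

    def pre_commit(self, f1: str, f2: str) -> dict:
        entry = self.hash_store.get(f2)
        if entry is not None:
            stored_f1, ref = entry
            if stored_f1 == f1:
                # Claim corroborated: deduplicated block.
                return {"dedup": True, "ref": ref}
            # f2 is known but the presented f1 disagrees: verify the stored block.
            plain = decrypt(self.block_store[ref], self.ekey_for(stored_f1))
            if fingerprint(plain) == stored_f1:
                # Stored block is genuine, so the current request is rogue.
                raise ValueError("bad request: fingerprint claim rejected")
            # The stored block itself was rogue: evict it and fall through.
            del self.hash_store[f2]
        # Non-deduplicated block: the client must commit the encrypted block.
        new_ref = f"block-{len(self.block_store)}"
        return {"dedup": False, "ref": new_ref}

    def commit(self, f1: str, f2: str, ref: str, encrypted_block: bytes) -> None:
        # In the real protocol the block travels via a pre-signed URL;
        # here it is passed in directly for brevity.
        self.block_store[ref] = encrypted_block
        self.hash_store[f2] = (f1, ref)

def verified_client_write(server: VerifyingDedupServer, data: bytes) -> str:
    f1 = fingerprint(data)                  # first fingerprint, over the plain data
    ekey = server.ekey_for(f1)              # server derives ekey from f1 and the cek
    encrypted = encrypt(data, ekey)
    f2 = fingerprint(encrypted)             # second, corroborating fingerprint
    reply = server.pre_commit(f1, f2)
    if not reply["dedup"]:
        server.commit(f1, f2, reply["ref"], encrypted)
    return reply["ref"]
```

The point of the scheme is visible in pre_commit: a client that knows only f1, but not the data itself, cannot produce the matching f2, so its claim can be rejected before any metadata is made to point at a shared block.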
The modified data deduplication scheme implemented on the client side treats clients as untrusted entities and requires them to corroborate the claims they make about data by way of fingerprints. The data encryption and the subsequent fingerprint computation introduce additional overhead for the client. This data encryption overhead, and the additional request-response roundtrip per write, can be reduced by computing the second fingerprint over a fixed deterministic transformation of the data, such as compression or bit inversion.
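As a rough sketch of that optimization, the second fingerprint could be computed over a fixed, public transformation such as bit inversion, which removes the key-derivation roundtrip while still requiring possession of the actual data (the specific transform here is only an example):

```python
import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def bit_invert(data: bytes) -> bytes:
    # A fixed, deterministic transformation of the data: flip every bit.
    return bytes(b ^ 0xFF for b in data)

def corroborating_fingerprints(data: bytes) -> tuple:
    # Both fingerprints are computed locally in one pass over the data,
    # with no key derivation and no extra request-response roundtrip.
    f1 = fingerprint(data)
    f2 = fingerprint(bit_invert(data))
    return f1, f2
```

A rogue client that knows only f1 still cannot produce the matching f2 without the actual data.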
Conclusion
Client-side deduplication reduces server and network overhead, which in turn shortens backup duration. Also, with the use of pre-signed URLs, the storage architecture is now split into separate control and data services, allowing new and multiple data stores to be supported in the backend. Not only has this made Druva’s backups more secure, it has also made them less costly.
This is a testament to Druva's work culture that allows innovative solutions to challenging problems. If this interests you, take a look at our careers page for your opportunity to join Druva.