Data De-duplication

The Gartner Report (here) says storage data de-duplication and virtualization are the two main technologies driving innovation in storage management software this year. This makes sense, considering that corporate data is growing at a whopping 60% annual rate (Microsoft Report here).

Server Backup

Data is very rarely common between production servers of different types. It's not difficult to imagine that an Exchange email server may not have the same content as an Oracle database server. But data is largely duplicated within file servers, Exchange servers, and, say, a bunch of ERP servers (development and test). This duplication creates potential bottlenecks in the bandwidth and storage used for backup.

Existing players have offered two solutions to this problem:

  1. Traditional single-instancing at the backup server to filter out common content, e.g. Microsoft Single Instance Service (in the Data Center edition). This saves only storage cost, depending on the level at which commonalities are filtered: file, block, or byte. A big player in this space is Data Domain. These solutions don't have a client component; they just save storage space.
  2. Newer, innovative solutions like Avamar (now with EMC) and PureDisk (now with Veritas), which try to filter content at the backup-server level before the data goes to the (remote) store. This makes them much better suited for remote-office backups; they save both bandwidth and storage.

But there are unsolved problems with both these approaches as well (which also explains the poor market response to these products):

  1. Most of the time, simple block-checksum matching fails to find common data, because the common data may not fall on block boundaries. E.g. if you insert a single byte into a file, all subsequent blocks shift, every block checksum changes, and the approach fails (see the sketch after this list).
  2. Checksum calculation is costly and makes backups CPU-intensive.
  3. These approaches target storage cost, not backup time or bandwidth, which are more critical.
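
A minimal sketch of the boundary-shift problem (illustrative Python, not any vendor's actual code): hash fixed-size blocks of a buffer, insert one byte at the front, and no block checksums match.

```python
import hashlib

BLOCK_SIZE = 8

def fixed_block_hashes(data: bytes) -> set:
    """Checksum every fixed-size block of the data."""
    return {
        hashlib.md5(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    }

original = b"the quick brown fox jumps over the lazy dog"
modified = b"X" + original   # one inserted byte shifts every block

common = fixed_block_hashes(original) & fixed_block_hashes(modified)
print(len(common))  # 0 -- no block checksum survives the shift
```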

PC Backups

The problem is much more complex at the PC level, as duplicate data is distributed among users and can be as high as 90% in some cases. Emails, documents, and similar file formats create a large pool of duplicate data across users.



Also, since 50% of PC backup data consists of large email files, this problem is particularly difficult to solve using the simple file-based de-duplication techniques used for servers.

Druvaa inSync v2.0 uses an on-wire (distributed) de-duplication technique which senses duplicate data before the backup starts and hence skips it. This is transparent to the user; all he notices is a 10x boost in backup speed with over 90% reduction in bandwidth and storage usage.

How it works

This technology creates and maintains a global "Single Instance" file system at the backup server. Each time a user wants to back up a file, the inSync client prepares a file fingerprint (using a linear-polynomial-based hash) and compares it with the server. After the server responds, the backup transfers only the "unique" data within the file.
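
A minimal sketch of such an exchange, assuming a hypothetical `Server` class and plain SHA-1 chunk fingerprints in place of the actual (unpublished) fingerprinting:

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    # Stand-in fingerprint; the real system uses its own polynomial hash.
    return hashlib.sha1(chunk).hexdigest()

class Server:
    """Global single-instance store: exactly one copy per unique chunk."""
    def __init__(self):
        self.store = {}                      # fingerprint -> chunk bytes

    def unknown(self, fps):
        # Tell the client which fingerprints the store has never seen.
        return {fp for fp in fps if fp not in self.store}

    def put(self, fp, chunk):
        self.store[fp] = chunk

def backup_file(server, chunks):
    fps = [fingerprint(c) for c in chunks]
    needed = server.unknown(fps)             # ask before sending any data
    for fp, chunk in zip(fps, chunks):
        if fp in needed:                     # upload only the unique chunks
            server.put(fp, chunk)
            needed.discard(fp)               # avoid re-sending repeats
    return fps                               # manifest: the file as a list of fps
```

Only the fingerprint list crosses the wire for duplicate data, which is where the bandwidth and time savings come from.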



The (patent-pending) advanced file-fingerprinting makes it computationally very easy to filter common content: the same paragraphs in different documents, the same CCed email, media-rich corporate presentations, etc. This cuts backup time by 10x and reduces bandwidth and storage utilization by 90%.
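
One published way to make shared paragraphs line up regardless of their offset in a file is content-defined chunking: cut chunk boundaries wherever a rolling polynomial hash of the last few bytes hits a fixed bit pattern, so boundaries follow content rather than position. A toy sketch of that idea (illustrative only; this is not Druvaa's actual algorithm):

```python
WINDOW = 16                    # bytes in the rolling window
BOUNDARY_MASK = (1 << 6) - 1   # cut when low 6 bits are zero (~64 B avg chunk)
BASE = 257                     # polynomial base
MOD = 1 << 32

def chunks(data: bytes) -> list:
    """Split data at content-defined boundaries via a rolling polynomial hash."""
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    h, start, out = 0, 0, []
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                    # add the newest byte
        if i >= WINDOW and (h & BOUNDARY_MASK) == 0:
            out.append(data[start:i + 1])           # content-defined cut point
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out
```

Because the hash depends only on the last WINDOW bytes, once a boundary falls inside a paragraph that two documents share, all later cut points within it coincide, and the shared text hashes to identical chunks in both files.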

Other Interesting Features

Another good use of the Global Single Instance File System is Continuous Data Protection. After starting a restore, the user can see how his files changed over time, which gives him the option to restore point-in-time data from any point in the past. The marketing name for the feature is "Eternity. Never lose a file. Ever." A long name, but it serves its meaning :)
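
Since every backup reduces to a manifest of chunk fingerprints, point-in-time restore can be a lookup of the newest manifest at or before the requested time. A minimal sketch under that assumption (hypothetical structure; `store` is a fingerprint-to-chunk mapping like the one in the earlier sketch):

```python
import bisect

class VersionHistory:
    """Point-in-time versions of one file, kept as (timestamp, manifest)
    pairs; a manifest is the ordered list of chunk fingerprints."""
    def __init__(self):
        self.times = []       # backup timestamps, in ascending order
        self.manifests = []   # manifest recorded at each timestamp

    def record(self, ts: float, manifest: list):
        self.times.append(ts)
        self.manifests.append(manifest)

    def restore(self, store: dict, ts: float) -> bytes:
        """Rebuild the file as it existed at (or just before) time ts."""
        i = bisect.bisect_right(self.times, ts) - 1
        if i < 0:
            raise KeyError("no version at or before that time")
        return b"".join(store[fp] for fp in self.manifests[i])
```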

Business Opportunities

The same technology/product can be stripped down to back up PDAs and scaled up to back up servers. A good use case would be reducing the backup time for a bunch of related remote servers.


Jaspreet Singh

Jaspreet bootstrapped the company while defining the product, sales and marketing strategies that have resulted in Druva's early and impressive success. Prior to founding Druva, Jaspreet was a member of the storage foundation group at Veritas.


  1. Ankur P 8 years ago


    Your claims about Avamar technology are not correct. Their technique is neither block-based, nor is their checksum computation "expensive". They use a technique they call "sticky byte factoring", which uses simple rolling checksums to generate variable-sized chunks of data and to find differences. See their patent application for details: You can see that their technique is highly suited for remote-office situations. PureDisk used to do fixed-size chunking, but I have heard that they have also started moving towards an algorithm very much like Avamar's.

    Also, when you say “linear polynomial based hash”, I guess you mean Rabin fingerprinting. Rsync also uses similar fingerprinting to reduce latencies.

    – Ankur.

  2. Jaspreet 8 years ago


    Thanks for the information on Avamar. Yup, I also came to know about that some time back.

    The approach followed by rsync, Avamar and PureDisk is to narrow down the delta from the previous backup of the same data.

    We are creating fingerprints which can be matched with any "similar" data from any source. The important steps are to:

    1. choose block boundaries whose hashes can be checked against data from other sources.
    2. compute this and check it against the server in real time.

    This makes sense for laptop backups, where duplicates exist between machines. The pitch is time, not storage space.

  3. Davish Bhardwaj 7 years ago

    Your blog is very interesting.
    I began my research on data deduplication after reading your blog.
    But now that I'm done with my research on dedupe, I want to know where to showcase it.
    Actually, I have developed an algorithm for variable-size block deduplication which I think is the best in the storage industry in terms of deduplication ratio, with an additional blend of fast processing.
    Can anybody here help me out?

    Thanks in advance.
