Data De-duplication

June 15, 2008
Jaspreet Singh, Founder and CEO

According to a Gartner report, storage data de-duplication and virtualization are the two main technologies driving innovation in storage management software this year. This makes sense, considering that corporate data is growing at a whopping 60% annual rate, according to a Microsoft report.

Server Backup

Data is rarely common between production servers of different types. It's not difficult to imagine that an Exchange email server does not hold the same content as an Oracle database server. But data is largely duplicated within groups of similar servers: file servers, Exchange servers, or, say, a bunch of ERP servers (development and test). This duplication creates potential bottlenecks for the bandwidth and storage used for backup.

Existing players have offered two solutions to this problem:

  1. Traditional single-instancing at the backup server to filter out common content, e.g. Microsoft Single Instance Service (in Data center edition). This saves only storage cost, and how much it saves depends on the level at which commonalities are filtered: file, block, or byte. A big player in this space is Data Domain. These solutions don't have a client component; they just save storage space.
  2. Newer solutions like Avamar (now with EMC) and PureDisk (now with Veritas), which try to filter content at the backup server level before the data goes to the (remote) store. This makes them much better suited for remote-office backups. They save both bandwidth and storage.

But there are three unsolved problems with both these approaches (which also explains the poor response to these products in the market):

  1. Most of the time, simple block checksum matching fails to find common data, because the common data may not fall on block boundaries. E.g. if you insert a single byte into a file, all the data after it shifts, every following block changes, and the block checksum approach fails.
  2. Checksum calculation is very costly and makes backups CPU-intensive.
  3. These approaches target storage cost, not time and bandwidth, which are more critical.
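
The first problem is easy to reproduce. The sketch below is only an illustration, not any vendor's implementation: the 4 KB block size, SHA-1 digests and the helper names are all assumptions. It compares a file with two edited copies, one with a byte appended and one with a byte inserted at the front.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks and return one SHA-1 digest per block."""
    return [hashlib.sha1(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def shared_blocks(old: bytes, new: bytes) -> int:
    """Count blocks of `new` whose digest already exists among the blocks of `old`."""
    known = set(block_digests(old))
    return sum(1 for d in block_digests(new) if d in known)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK_SIZE))
appended = original + b"x"   # change at the end: existing blocks stay aligned
inserted = b"x" + original   # change at the start: every block boundary shifts

print(shared_blocks(original, appended))  # 16 (all of the original blocks match)
print(shared_blocks(original, inserted))  # 0 (nothing matches after the shift)
```

Both edited copies differ from the original by a single byte, yet the fixed-block comparison treats the second one as entirely new data.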

PC Backups

The problem is much more complex at the PC level, as duplicated data is distributed among users and is as high as 90% in some cases. Emails, documents and similar file formats create a large pool of duplicate data between users.

Also, since 50% of PC backup data is large email files, this problem is particularly difficult to solve using the simple file-based de-duplication techniques used by servers.

Druvaa inSync v2.0 uses an on-wire (distributed) de-duplication technique that senses duplicate data before the backup starts and hence skips it from the backup. This is transparent to the user; all the user notices is a 10 times boost in backup speed with over 90% reduction in bandwidth and storage usage.

How it works

This technology creates and maintains a global “Single Instance” File System at the backup server. Each time a user wants to back up a file, the inSync client prepares a file fingerprint (using a linear polynomial-based hash) and sends it to the server for comparison. After the server sends a response, the backup happens only for the “unique” data within the file.

The patented advanced file-fingerprinting makes it computationally very easy to filter common content such as the same paragraphs in different documents, the same CCed email, media-rich corporate presentations etc. This cuts down backup time by 10 times and reduces bandwidth and storage utilization by 90%.
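
To make the fingerprint-then-upload exchange concrete, here is a minimal sketch under stated assumptions. It is not the inSync algorithm, whose details are not described here. It uses a generic polynomial rolling hash to pick chunk boundaries from the content itself, SHA-256 as the chunk identity, and a hypothetical in-memory `Server` class standing in for the single-instance store. All constants and names are made up for illustration.

```python
import hashlib
import random

WINDOW = 48                       # bytes in the rolling window (assumed)
BASE = 257                        # polynomial base (assumed)
MOD = (1 << 61) - 1               # large prime modulus (assumed)
MASK = (1 << 11) - 1              # cut when the low 11 bits are zero (~2 KB chunks)
POW = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def chunks(data: bytes):
    """Yield content-defined chunks: boundaries depend on the bytes, not on offsets."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


class Server:
    """Toy single-instance store: each unique chunk is kept once, keyed by digest."""

    def __init__(self):
        self.store = {}

    def missing(self, digests):
        """Answer a fingerprint query with the digests that are not stored yet."""
        return {d for d in digests if d not in self.store}

    def put(self, digest, chunk):
        self.store[digest] = chunk


def backup(server: Server, data: bytes) -> int:
    """Back up `data`, uploading only unknown chunks; return the bytes uploaded."""
    pieces = [(hashlib.sha256(c).digest(), c) for c in chunks(data)]
    wanted = server.missing(d for d, _ in pieces)
    sent = 0
    for digest, chunk in pieces:
        if digest in wanted:
            server.put(digest, chunk)
            wanted.discard(digest)  # a chunk repeated within the file is sent once
            sent += len(chunk)
    return sent


rng = random.Random(1)
doc = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
server = Server()
first = backup(server, doc)          # empty store: everything is uploaded
second = backup(server, b"x" + doc)  # another user's copy, one byte inserted
print(first, second)
```

Because the chunk boundaries follow the content, the one-byte insertion only disturbs the chunk around it, and the second backup uploads a small fraction of the file. A real system would also record, per file, the ordered list of chunk digests needed to rebuild it; that bookkeeping is left out here.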

Other Interesting Features

Another good use of the Global Single Instance File System is continuous data protection. After starting a restore, the user can see how the files changed over time, which gives the option to restore point-in-time data from any point in the past. The marketing name for the feature is “Eternity. Never lose a file. Ever.” A long name, but it serves its purpose 🙂

Business Opportunities

The same technology/product can be stripped down to back up PDAs and scaled up to back up servers. A good use case would be reducing the backup time for a bunch of related remote servers.