News/Trends

Why you shouldn’t compare backup vendors on their deduplication ratios

W. Curtis Preston, Chief Technology Evangelist

Comparing deduplication ratios across backup vendors is a deeply flawed way to judge a product's ability to deduplicate data. The ratios cannot be compared directly because the inputs, the methods, and the reporting all differ from product to product. That has been true since the very beginning of deduplication, and it is truer now than ever. In this blog, we'll take a deeper look at why that is and describe the proper way to compare vendors' deduplication capabilities.

The inputs are different

The concept of deduplication ratios was born in the early days of target deduplication. You purchased a product like a Data Domain appliance and sent hundreds of terabytes of backups to an NFS mount, after which the appliance would deduplicate the data. You compared the volume of data sent by the backup product to the amount of disk used on the appliance, and that ratio was used to justify this new type of product.

Even in those early days, however, you couldn’t compare the advertised deduplication ratio of different products, because you had no idea how they created that number. The biggest reason for this was that you had no idea what type of backups each vendor sent to their appliance, or the change rates that they introduced after each backup — if any. If they wanted to make their deduplication ratio look better, they would simply perform a full backup every time with no change rate. Perform 100 full backups with no change and you have a 100:1 dedupe ratio!
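To make that arithmetic concrete, here is a minimal sketch in Python, with invented numbers, of how the headline ratio is computed: logical bytes sent to the appliance divided by physical bytes actually stored.

```python
# Minimal sketch of how a target appliance's dedupe ratio is computed.
# Numbers are illustrative only; real appliances measure this internally.

def dedupe_ratio(logical_bytes_sent: float, physical_bytes_stored: float) -> float:
    """Ratio of data sent to the appliance vs. disk actually consumed."""
    return logical_bytes_sent / physical_bytes_stored

TB = 10**12
full_backup = 10 * TB      # one full backup of a 10 TB data set
num_fulls = 100            # 100 identical full backups, zero change between them

logical = full_backup * num_fulls   # 1,000 TB sent to the appliance
physical = full_backup              # only one unique copy is actually stored

print(f"{dedupe_ratio(logical, physical):.0f}:1")   # -> 100:1
```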

Even if a vendor attempts to mimic a real production environment, with a mix of structured and unstructured data and a reasonable amount of change, it will not match the data mix and change rate of your environment. You have a different mixture of structured and unstructured data and a different change rate. Some of your data might even be encrypted, which renders deduplication completely ineffective.

Finally, since every vendor uses a different set of test data, how can you possibly compare the results when the ingredients are completely different? The answer is you can’t.

The methods are different

How and where dedupe is performed also affects the ratio. Some vendors perform occasional full backups, others perform full-file incremental backups forever (i.e. a changed byte in a file will cause the entire file to be backed up), while others perform block-level incremental backups forever (i.e. backing up only the actual bytes that changed). 

A product that deduplicates backup data containing occasional full backups will appear to have a better deduplication ratio than one that deduplicates backups from an incremental-forever approach. In the same way, a product that deduplicates full-file incremental backups will appear to have a greater deduplication ratio than one deduplicating block-level incremental-forever backups. The reason is simple: occasional full backups create more duplicate data, and full-file incremental backups create more duplicate data than block-level incremental backups.
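To illustrate the point (this is a rough sketch with assumed numbers, not a model of any particular product), the snippet below takes a 100 TB data set with a 1 TB daily change and compares how much logical data each backup style sends over 30 days. The unique data is roughly the same in every case, so the apparent ratio is driven almost entirely by the method.

```python
# Rough illustration: identical source data and change, three backup methods.
# All numbers are assumptions for illustration, not measurements of any product.

TB = 1
data_set = 100 * TB        # primary data set
daily_change = 1 * TB      # unique new/changed bytes per day (1% change rate)
days = 30
file_amplification = 5     # assumption: a full-file incremental re-sends ~5x the
                           # changed bytes because whole files are backed up

# Logical data sent to the dedupe system by each method
weekly_fulls = (days // 7 + 1) * data_set + (days - days // 7 - 1) * daily_change
full_file_forever = data_set + (days - 1) * daily_change * file_amplification
block_level_forever = data_set + (days - 1) * daily_change

# The dedupe system ends up storing roughly the same unique data regardless of method
physical = data_set + (days - 1) * daily_change

for name, logical in [("weekly fulls", weekly_fulls),
                      ("full-file incremental forever", full_file_forever),
                      ("block-level incremental forever", block_level_forever)]:
    print(f"{name:32s} apparent ratio ~ {logical / physical:4.1f}:1")
```

With the same unique data on disk, the "weekly fulls" method reports roughly a 4:1 ratio while block-level incremental forever reports roughly 1:1, purely because of how much duplicate data each method generates.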

This is, of course, why I am writing this blog post in the first place. Potential customers and analysts often ask us what our average deduplication ratio is. Because Druva performs block-level incremental backups forever, most of the redundant data is eliminated before our deduplication system ever sees it. Our deduplication ratios are therefore significantly lower than those of competitors that do things differently, and yet we usually end up using less disk than they do.

The reporting is different

How a vendor chooses to describe their backup and deduplication processes can inflate their deduplication ratios to the point of absurdity. As an example, I'm thinking of a particular source-side deduplication vendor. They say that each of their backups behaves like a full backup from a restore perspective, so they consider each block-level incremental backup the equivalent of a full backup. (I have no problem with that claim; it's something we would also say.) The problem is that they also counted each backup as a full when calculating their deduplication ratio. This is how that vendor came to advertise deduplication ratios of 400:1. Comparing that number to a typical target deduplication vendor, which advertises dedupe ratios in the neighborhood of 20:1, is obviously worthless. How a vendor chooses to report its numbers drastically affects any ratio you might see.
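Here is a back-of-the-envelope sketch of that reporting effect, with invented numbers rather than any vendor's actual figures: if every block-level incremental is counted as though a full backup had been sent, the logical side of the ratio balloons while the physical side stays the same.

```python
# Illustration of how counting each incremental as a "full" inflates the ratio.
# Invented numbers; not the reporting of any specific vendor or product.

TB = 1
data_set = 100 * TB
daily_change = 1 * TB
days = 365

# What was actually sent: one full, then block-level incrementals forever
actually_sent = data_set + (days - 1) * daily_change

# What gets reported if every backup is counted as a synthetic full
reported_as_fulls = days * data_set

# Unique data actually stored after deduplication
physical = data_set + (days - 1) * daily_change

print(f"honest ratio:   {actually_sent / physical:5.1f}:1")     # ~1:1
print(f"reported ratio: {reported_as_fulls / physical:5.1f}:1")  # ~79:1
# With a smaller change rate and longer retention, the reported
# number climbs into the hundreds.
```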

The correct way to compare deduplication systems

There are two essential elements to a proper comparison of deduplication systems: perform your own tests, and compare the size of your full backup to the amount of disk actually used after all backups. Let me explain.

First, you absolutely cannot compare the efficacy of different products simply by looking at the seemingly arbitrary numbers each vendor gives you; the previous three sections explained why. This means you have to do your own testing with your own data. If you're going to perform multiple tests and compare the results, you must ensure that the inputs are identical; the only thing you change is the dedupe product. The best way to do this is to create a test set of data that mimics your production data, and then automate the changes to that data. If every test run sees the same data in the full backup and in the subsequent incremental backups, comparing the results is a valid way to measure the efficacy of the different products.
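As one possible starting point, here is a minimal sketch of generating a repeatable test corpus with a controlled daily change rate. The directory, file count, file sizes, and 1% change rate are placeholders you would tune to mimic your own production mix (and purely random data is pessimistic for dedupe, so you would ideally seed the corpus with copies of real file types).

```python
# Sketch: build a repeatable test corpus and apply a controlled daily change.
# Paths, sizes, and the 1% change rate are placeholders; seeding the RNG means
# every product under test sees byte-for-byte identical data.
import os
import random

ROOT = "/data/dedupe-test"     # hypothetical test directory
NUM_FILES = 1_000
FILE_SIZE = 10 * 1024 * 1024   # 10 MiB per file
CHANGE_RATE = 0.01             # 1% of files modified per simulated day

def build_corpus(seed: int = 42) -> None:
    """Create the initial data set deterministically."""
    rng = random.Random(seed)
    os.makedirs(ROOT, exist_ok=True)
    for i in range(NUM_FILES):
        with open(os.path.join(ROOT, f"file_{i:05d}.bin"), "wb") as f:
            f.write(rng.randbytes(FILE_SIZE))

def apply_daily_change(day: int, seed: int = 42) -> None:
    """Overwrite a deterministic 1% subset of files to simulate daily change."""
    rng = random.Random(seed + day)
    changed = rng.sample(range(NUM_FILES), int(NUM_FILES * CHANGE_RATE))
    for i in changed:
        path = os.path.join(ROOT, f"file_{i:05d}.bin")
        with open(path, "r+b") as f:
            f.seek(rng.randrange(FILE_SIZE // 2))
            f.write(rng.randbytes(64 * 1024))   # rewrite a 64 KiB region

# Usage: call build_corpus() once, then for each simulated day run a backup
# with the product under test and call apply_daily_change(day) before the next.
```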

Finally — and most importantly — the only thing that matters in the end is how much disk is actually used by the deduplication system. If you have a 100 TB data set and back it up for 90 days, how much disk was actually used by the backup system? If one system used 150 TB of disk and another system used 75 TB of disk, the latter system is twice as good at eliminating duplicate data as the former system.
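That comparison is trivial to compute yourself once you have the measurements; here is a quick sketch using the numbers above.

```python
# The only comparison that matters: disk actually consumed after the retention
# period, measured against the size of a single full backup. Numbers taken
# from the 100 TB example above.

full_backup_tb = 100
retention_days = 90

disk_used = {"system A": 150, "system B": 75}   # TB consumed after 90 days

for name, used in disk_used.items():
    print(f"{name}: {used} TB used to protect {full_backup_tb} TB "
          f"for {retention_days} days "
          f"({full_backup_tb / used:.2f} TB protected per TB of backup disk)")
```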

So, for goodness' sake, don't ask a vendor what their dedupe ratio is. Instead, tell each prospective vendor what your data set looks like: how big it is, the ratio of structured to unstructured data, and your typical daily change rate on incremental backups. Then ask them how much disk they would allocate as a starting point for your backup system. This will give you a ballpark estimate of the efficacy of their dedupe system. Then, of course, do your own testing to verify that number. There is really no substitute for that hard work.

To learn more about deduplication and its importance for your cloud backup strategy, watch Druva’s video below.