Deduplication

Deduplication Definition

Deduplication refers to a method of eliminating a dataset’s redundant data. In a secure data deduplication process, a deduplication assessment tool identifies extra copies of data and deletes them, so a single instance can then be stored.

Data deduplication software analyzes data to identify duplicate byte patterns. It verifies that the single stored byte pattern is correct and valid, then uses that stored pattern as a reference. Any further request to store the same byte pattern results in an additional pointer to the previously stored pattern.

What is deduplication?

Data deduplication allows users to reduce redundant data and manage backup activity more effectively. The results are more efficient backups, cost savings, and load-balancing benefits.

Data deduplication meaning

There is more than one kind of data deduplication. In its most basic form, the deduplication process happens at the level of single files, eliminating identical files. This is also called single instance storage (SIS) or file-level deduplication.
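
As a minimal sketch of file-level deduplication, the Python below keeps one stored copy per distinct file content and records a pointer for every file name. The function name `single_instance_store`, the file names, and the choice of SHA-256 are assumptions made for illustration, not a description of any particular product.

```python
import hashlib

def single_instance_store(files):
    """Hypothetical helper: files maps a file name to its content (bytes)."""
    store = {}    # content hash -> the single stored copy of that content
    catalog = {}  # file name -> content hash (the pointer to the stored copy)
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # only the first copy is kept
        catalog[name] = digest
    return store, catalog

# Illustrative data: the same report in many inboxes, plus one unrelated file.
report = b"financial outlook report" * 1000
files = {f"inbox_{i}/report.pdf": report for i in range(500)}
files["inbox_0/notes.txt"] = b"meeting notes"

store, catalog = single_instance_store(files)
print(len(catalog), "files cataloged,", len(store), "copies stored")
```

Two files that differ by even one byte get different hashes here, which is why file-level deduplication cannot reclaim space from nearly identical files.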

At the next level, deduplication identifies and eliminates redundant segments of data that are the same, even when the files they’re in are not entirely identical. This is called block-level deduplication or sub-file deduplication, and it frees up storage space. When most people say deduplication, they are referring to block-level deduplication. If they are referring to file-level deduplication, they will use that modifier.

Most block-level deduplication occurs at fixed block boundaries, but there is also variable-length deduplication or variable block deduplication, where data is split up at non-fixed block boundaries. Once the dataset has been split into a series of small pieces of data, referred to as chunks or shards, the rest of the process usually remains the same.
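
The difference between fixed and variable block boundaries can be sketched as follows. This is an illustration under stated assumptions: the boundary rule (cut where a CRC-32 checksum of the last 16 bytes has its low bits equal to zero), the function names, and the block sizes are all hypothetical choices, and real systems typically use a faster rolling hash. The example inserts one byte at the start of a dataset and counts how many chunks each method still recognizes.

```python
import random
import zlib

def fixed_chunks(data, size=64):
    # Fixed boundaries: cut every `size` bytes regardless of content.
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, window=16, mask=0x3F):
    # Variable boundaries: cut wherever the checksum of the last `window`
    # bytes matches a pattern, so boundaries follow the content itself.
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        long_enough = end - start >= window
        if long_enough and zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks

rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(20000))
shifted = b"X" + original  # the same data with one byte inserted in front

fixed_shared = len(set(fixed_chunks(original)) & set(fixed_chunks(shifted)))
total_variable = len(set(variable_chunks(original)))
variable_shared = len(set(variable_chunks(original)) & set(variable_chunks(shifted)))

print("fixed blocks still matching:", fixed_shared)
print("variable chunks still matching:", variable_shared, "of", total_variable)
```

With fixed boundaries the inserted byte shifts every later block, so previously stored blocks no longer match. With content-defined boundaries the chunker falls back into step shortly after the insertion.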

The deduplication system runs each shard through a cryptographic hashing algorithm, such as SHA-1 or SHA-256 (a member of the SHA-2 family), which produces an alphanumeric value (referred to as a hash) for the shard. The value of that hash is then checked against a hash table or hash database to see if it has been seen before. If it has never been seen before, the new shard is written to storage and the hash is added to the hash table/database; if it has been seen before, the shard is discarded and an additional reference is added to the hash table/database.
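
The hash lookup described above can be sketched in a few lines of Python. The class name `DedupStore`, its methods, and the 4096-byte fixed block size are assumptions for illustration; an in-memory dictionary stands in for the hash table or hash database.

```python
import hashlib

class DedupStore:
    """Hypothetical in-memory store: one copy per unique shard, plus references."""

    def __init__(self):
        self.chunks = {}    # hash -> shard bytes (written once)
        self.refcount = {}  # hash -> number of references to that shard

    def put(self, shard):
        digest = hashlib.sha256(shard).hexdigest()
        if digest not in self.chunks:
            # Never seen before: write the shard and record its hash.
            self.chunks[digest] = shard
            self.refcount[digest] = 1
        else:
            # Seen before: discard the shard, add a reference instead.
            self.refcount[digest] += 1
        return digest

    def write_file(self, data, block_size=4096):
        # The returned list of hashes is all that is needed to rebuild the file.
        return [self.put(data[i:i + block_size])
                for i in range(0, len(data), block_size)]

    def read_file(self, recipe):
        return b"".join(self.chunks[digest] for digest in recipe)

store = DedupStore()
version_1 = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
version_2 = b"A" * 4096 + b"B" * 4096 + b"D" * 4096  # only the last block differs

recipe_1 = store.write_file(version_1)
recipe_2 = store.write_file(version_2)
print("blocks written:", len(recipe_1) + len(recipe_2))
print("unique blocks stored:", len(store.chunks))
```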

What are the benefits of deduplication in data backup?

Imagine how many times you make a tiny change to a document. A traditional file-level incremental backup will back up the entire file, even though you may have changed only one byte. Every critical business asset has the potential to hold duplicate data. In many organizations, up to 80 percent of corporate data is duplicate.

A customer using target deduplication (also called target-side deduplication), where the deduplication process runs inside a storage system once the native data is stored there, can save a lot of money on storage, cooling, floor space, and maintenance. A customer using source deduplication (also called source-side deduplication, or client-side deduplication), where redundancy is identified at the source before being sent across the network, can save money both on storage and network bandwidth. This is because the redundant segments of data are identified before being transmitted.
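
To see why source deduplication saves bandwidth, consider this sketch. The `BackupServer` class, its `missing` and `upload` methods, and the 4096-byte block size are hypothetical stand-ins for a real client/server protocol; the point is only that the client sends hashes first and then transmits just the blocks the server does not already hold.

```python
import hashlib

class BackupServer:
    """Hypothetical target: stores blocks keyed by their hash."""

    def __init__(self):
        self.chunks = {}

    def missing(self, digests):
        # Tell the client which hashes the server has never seen.
        return {d for d in digests if d not in self.chunks}

    def upload(self, digest, block):
        self.chunks[digest] = block

def source_side_backup(server, data, block_size=4096):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    needed = server.missing(digests)  # only hashes cross the network here
    bytes_sent = 0
    for digest, block in zip(digests, blocks):
        if digest in needed:
            server.upload(digest, block)
            needed.discard(digest)  # do not resend a repeat within this backup
            bytes_sent += len(block)
    return digests, bytes_sent

server = BackupServer()
monday = b"".join(bytes([i]) * 4096 for i in range(10))          # 10 distinct blocks
tuesday = monday[:3 * 4096] + bytes([99]) * 4096 + monday[4 * 4096:]  # 1 block changed

_, sent_monday = source_side_backup(server, monday)
_, sent_tuesday = source_side_backup(server, tuesday)
print("bytes sent on first backup:", sent_monday)
print("bytes sent on second backup:", sent_tuesday)
```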

Source deduplication works very well with cloud storage and can improve backup speed notably. By reducing the amount of data and network bandwidth backup processes demand, deduplication streamlines the backup and recovery process. To decide when to use deduplication, consider if your business could benefit from these improvements.

What is a real-life deduplication example?

Imagine the manager of a business sends the same 1 MB file, a financial outlook report with graphics, to all 500 members of the team. The company's email server is now storing all 500 copies of that file. If all email inboxes are then backed up, all 500 copies are saved, eating up 500 MB of server space. Even a basic file-level data deduplication system would save just one instance of the report. Every other instance just refers back to that single stored copy. This means the final bandwidth and storage burden on the server is only the 1 MB of unique data.

Another example is what happens when companies perform full-file incremental backups of files where only a few bytes have changed, and occasionally perform full backups due to age-old design challenges in backup systems. A 100 TB file server would create 800 TB of backups just from eight weekly fulls, and probably another 8 TB or so of incremental backups over the same amount of time. A good deduplication system can reduce this 808 TB down to less than 100 TB, without lowering restore speed.

How does deduplication ratio to percentage work?

The deduplication ratio refers to the ratio of the amount of data that would be transmitted or stored without deduplication, vs the amount stored with deduplication. Deduplication can have a great impact on the backup size, reducing it by up to 25:1 in a standard enterprise backup setting. Obviously this depends on how much duplicative data exists and how efficient the file deduplication algorithm is.
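
Converting a ratio to a percentage is simple arithmetic: a ratio of N:1 means the stored data is 1/N of the original, so the space saved is (1 - 1/N) × 100 percent. The helper names below are assumptions for illustration.

```python
def dedupe_ratio(bytes_before, bytes_after):
    # Ratio of data without deduplication to data stored with it.
    return bytes_before / bytes_after

def reduction_percent(ratio):
    # Convert an N:1 ratio into the percentage of space saved.
    if ratio < 1:
        raise ValueError("a deduplication ratio cannot be below 1:1")
    return (1 - 1 / ratio) * 100

for ratio in (2, 10, 25, 400):
    print(f"{ratio}:1 saves {reduction_percent(ratio):.2f}% of the space")
```

The conversion shows diminishing returns: moving from 2:1 to 10:1 saves far more additional space than moving from 25:1 to 400:1.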

However, a customer’s deduplication ratio can represent an inaccurate picture of the effectiveness of a dedupe system. If you backed up the same file 400 times, you would get a dedupe ratio of 400:1, but that speaks more to the inefficiency of your storage system, vs saying anything about how good your dedupe system is. When comparing different dedupe

What is post-process deduplication?

Post-process deduplication (PPD) characterizes a system in which deduplication software identifies and deletes redundant data only after it resides in a target deduplication data storage system. This technique may be necessary if it is not feasible or efficient to delete duplicate data during transfer or beforehand. This is also sometimes referred to as asynchronous deduplication, as the dedupe process is often performed as backups are being written, but each segment is only deduped after it is first written to storage.
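
A rough sketch of the post-process approach follows, assuming a hypothetical `PostProcessTarget` class with a landing area for raw writes. Blocks are written in full first, and a separate pass later hashes them and reclaims the space taken by duplicates.

```python
import hashlib

class PostProcessTarget:
    """Hypothetical target system that dedupes only after data has landed."""

    def __init__(self):
        self.landing = []  # raw blocks exactly as they were written
        self.chunks = {}   # hash -> unique block, filled by the dedupe pass
        self.recipe = []   # ordered hashes needed to restore the data

    def write(self, block):
        # Ingest path: no hashing or lookups, just store the block.
        self.landing.append(block)

    def dedupe_pass(self):
        # Later pass: identify duplicates among the landed blocks.
        reclaimed = 0
        for block in self.landing:
            digest = hashlib.sha256(block).hexdigest()
            if digest in self.chunks:
                reclaimed += len(block)  # duplicate, its space can be freed
            else:
                self.chunks[digest] = block
            self.recipe.append(digest)
        self.landing.clear()
        return reclaimed

    def restore(self):
        return b"".join(self.chunks[d] for d in self.recipe)

a, b, c = b"A" * 4096, b"B" * 4096, b"C" * 4096
target = PostProcessTarget()
for block in (a, a, b, a, c):
    target.write(block)

landed_bytes = sum(len(x) for x in target.landing)
reclaimed_bytes = target.dedupe_pass()
print("bytes landed before dedupe:", landed_bytes)
print("bytes reclaimed by dedupe pass:", reclaimed_bytes)
```

The trade-off visible here is that the target must briefly hold the full, undeduplicated data before the pass runs.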

How to implement deduplication

The best way to implement data deduplication technology will change depending on the user’s data protection goals, the data deduplication vendors used, and the sort of deduplication application in question. For example, a backup deduplication appliance or storage solution often includes deduplication technology and therefore has a much different implementation process than a freestanding deduplication software tool.

However, document deduplication technology is generally deployed either at the target or at the source. The differences here concern not just where, but when — before storage in the backup system or after the data is already there — the deduplication process takes place.

How does deduplication encryption work?

There is an intimate relationship between deduplication and encryption because a tool can only detect duplicate data and delete it if it can read that data. This means that any encryption must always happen after the dedupe process. If it were to happen before the dedupe process, no duplicate data would be found.
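
The ordering can be illustrated with a toy example. The `toy_encrypt` function below is NOT a secure cipher; it is a hypothetical stand-in that, like many real ciphers, mixes a fresh random nonce into every encryption, which is the property that matters here. The sketch assumes that behavior and compares the two orderings.

```python
import hashlib
import os

def toy_encrypt(key, plaintext):
    # Illustration only, not secure: XOR with a hash-derived keystream.
    # A fresh random nonce makes every ciphertext different.
    nonce = os.urandom(16)
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))

def unique_if_encrypted_first(key, blocks):
    # Encrypt, then hash the ciphertext: identical blocks no longer look alike.
    return len({hashlib.sha256(toy_encrypt(key, b)).hexdigest() for b in blocks})

def unique_if_deduped_first(key, blocks):
    # Hash the readable data, then encrypt only the blocks that are kept.
    store = {}
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = toy_encrypt(key, block)
    return len(store)

key = b"example key"
blocks = [b"same block of data" * 100] * 3  # three identical blocks

encrypted_first = unique_if_encrypted_first(key, blocks)
deduped_first = unique_if_deduped_first(key, blocks)
print("blocks stored when encrypting first:", encrypted_first)
print("blocks stored when deduplicating first:", deduped_first)
```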

Druva data deduplication solutions

Druva defines its patented approach to global data deduplication using these five unique qualities:

Global deduplication. Druva compares all of a given customer’s backup data against all other data from that customer, even from other locations. This reduces duplicate data more than any other vendor.
Source-side deduplication (i.e., client-side deduplication). Druva’s deduplication process starts at the client, not the backup system, reducing how much data must be transmitted over the network. However, Druva does use its service running in the cloud to do most of the work, in order to reduce the load on the client.
Block-level, sub-file analysis. This level of deduplication allows the tool to identify duplicate data within files.
Awareness of data-generating applications. Druva inSync searches inside application data for duplicate data.
Scalable performance. Druva’s deduplication locates and deletes duplicate data far beyond a single user, scaling across multiple devices and users.

Discover Druva's innovative products to optimize data storage, improve network performance, and accelerate backup on the deduplication page of the website.

