Deduplication

Deduplication Definition

Deduplication is a method of reducing redundant data by identifying duplicate pieces of information and storing only a single, unique instance. Instead of saving repeated copies, a deduplication system keeps one copy and replaces the rest with lightweight references (pointers) back to the stored original.

Deduplication is widely used in backup and storage systems because it can dramatically lower storage consumption and reduce the amount of data that needs to be transferred across networks—especially in environments where the same files, email attachments, operating system images, or repeated versions of data appear across many users and devices.

What is deduplication?

Data deduplication allows users to reduce redundant data and manage backup activity more effectively, delivering faster backups, lower costs, and reduced load on networks and storage.

Data deduplication explained

Data deduplication comes in different forms, and the main difference between them is how granularly the system looks for duplicates.

File-level deduplication (single instance storage)

In its simplest form, deduplication happens at the level of entire files—eliminating identical files. This is often called single instance storage (SIS) or file-level deduplication.

Block-level (sub-file) deduplication

At the next level, deduplication identifies and removes redundant segments of data even when the overall files aren’t identical. This is known as block-level deduplication (or sub-file deduplication). When most people say “deduplication,” they’re referring to block-level deduplication because it typically provides much greater storage reduction.

Fixed vs. variable block deduplication

Most block-level deduplication uses fixed block boundaries, where files are split into equal-sized chunks. Some systems instead use variable-length (variable block) deduplication, where chunk boundaries are derived from the data's content rather than from fixed offsets. Content-derived boundaries mean that inserting or deleting a few bytes shifts only nearby boundaries, so duplicates later in the file can still be found.
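To make the idea of variable boundaries concrete, here is a minimal Python sketch of content-defined chunking using a simple additive rolling checksum. The window size, boundary mask, and chunk limits are illustrative choices; real systems use stronger rolling hashes (such as Rabin fingerprints):

```python
# Sketch of content-defined (variable-length) chunking.
# A rolling checksum over a small sliding window decides where chunk
# boundaries fall, so changing bytes early in a file shifts only
# nearby boundaries instead of every boundary after them.

WINDOW = 16                    # bytes in the rolling window
MASK = 0x3F                    # cut when the low 6 checksum bits are zero
MIN_CHUNK, MAX_CHUNK = 32, 1024  # illustrative size limits

def chunk_variable(data: bytes):
    """Split data into variable-length chunks at content-defined points."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = (rolling + byte) & 0xFFFFFFFF
        if i >= WINDOW:                      # slide the window forward
            rolling = (rolling - data[i - WINDOW]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_CHUNK and (rolling & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])  # boundary found: emit chunk
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])          # trailing partial chunk
    return chunks
```

Because a boundary depends only on the bytes inside the window, an edit early in a file disturbs the chunking only locally, and the chunks after it still match previously stored copies.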

After splitting data into chunks (sometimes called shards), the system creates a unique fingerprint for each chunk using a hashing algorithm (such as SHA-1, SHA-2, or SHA-256). If that fingerprint has been seen before, the system stores a reference to the existing chunk instead of writing it again. If it’s new, the chunk is stored and added to the index.
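The split-hash-index loop described above can be sketched in a few lines of Python. The fixed 4 KB chunk size and the class name here are illustrative, not any particular product's implementation:

```python
# Minimal sketch of block-level deduplication with fixed-size chunks:
# each chunk is fingerprinted with SHA-256, and only chunks whose
# fingerprints have not been seen before are written to the store.
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed block size

class DedupStore:
    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes (the "index")

    def write(self, data: bytes):
        """Store data; return the list of fingerprints (the recipe)."""
        recipe = []
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:      # new chunk: store it
                self.chunks[fp] = chunk
            recipe.append(fp)              # duplicate: reference only
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its recipe of fingerprints."""
        return b"".join(self.chunks[fp] for fp in recipe)
```

Writing the same file twice adds no new chunks to the store; the second write only records references back to the chunks stored the first time.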

What are the benefits of deduplication in data backup?

Deduplication can have a major impact on storage and backup efficiency because a large percentage of corporate data is often duplicated. For example, many organizations store repeated versions of documents, common operating system files, and duplicated email attachments across many users and devices.

Key benefits include:

Lower storage consumption

By storing only unique data, deduplication reduces the capacity required to retain backups over time—especially in environments with frequent backups and long retention.

Reduced bandwidth and faster backups (with source-side dedupe)

If deduplication happens at the source (before data is sent across the network), the system can transmit only unique data. This reduces network load and can improve backup speed—particularly useful for cloud storage and remote offices.

Lower costs across infrastructure

Less stored data often means lower spending on:

  • Storage hardware or cloud storage

  • Cooling and floor space (for on-prem systems)

  • Maintenance and operational overhead

Improved backup and recovery efficiency

Because backups are smaller, deduplication can streamline backup workflows and make it easier to meet backup windows and retention goals.

What is a real-life deduplication example?

Imagine the manager of a business sends out 500 copies of the same 1 MB file, a financial outlook report with graphics, to the whole team. The company’s email server is now storing all 500 copies of that file. If all email inboxes are then backed up, all 500 copies are saved, eating up 500 MB of server space. Even a basic file-level deduplication system would save just one instance of the report; every other instance simply refers back to that single stored copy. The bandwidth and storage burden drops to roughly 1 MB, the size of the one unique copy.

Another example is what happens when companies perform full-file incremental backups of files where only a few bytes have changed, and periodically perform full backups due to age-old design challenges in backup systems. A 10 TB file server would create 80 TB of backups just from eight weekly fulls, and probably another 8 TB or so of incremental backups over the same period. A good deduplication system can reduce this roughly 88 TB down to less than 10 TB, without lowering restore speed.

How do deduplication ratios and percentages work?

A deduplication ratio compares:

  • How much data would be stored or transferred without deduplication, versus

  • How much data is stored or transferred with deduplication

For example, a 10:1 dedupe ratio means that 10 units of original data were reduced to 1 unit of stored data, which is equivalent to a 90% reduction in storage.
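The arithmetic for translating a ratio into a percentage of space saved is straightforward, as this small Python sketch (with hypothetical function names) shows:

```python
# Converting a deduplication ratio to a percentage of space saved:
# a ratio of r:1 means only 1/r of the original data is stored,
# so the savings are (1 - 1/r) * 100 percent.

def dedupe_ratio(logical_bytes: float, stored_bytes: float) -> float:
    """Ratio of data before dedupe to data actually stored."""
    return logical_bytes / stored_bytes

def savings_percent(ratio: float) -> float:
    """Percent of storage saved at a given dedupe ratio."""
    return (1 - 1 / ratio) * 100
```

A 10:1 ratio stores 1 unit for every 10 units of logical data, a 90% saving; note that going from 10:1 to 20:1 only improves savings from 90% to 95%, which is why very large ratios add diminishing value.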

Why ratios can be misleading

A ratio can look impressive for reasons that don’t reflect real-world efficiency. For instance, if you back up the same file 400 times, you might see a 400:1 ratio—but that may say more about how repetitive your backups are than how strong your dedupe algorithm is.

Practical tip: when evaluating deduplication, consider ratios alongside actual outcomes like backup windows, restore performance, bandwidth savings, and long-term storage growth.

What is post-process deduplication?

Post-process deduplication (PPD) describes systems that remove redundant data only after data has already landed in the target storage system. It may be used when it isn’t feasible or efficient to dedupe during transfer.

Post-process dedupe is sometimes referred to as asynchronous deduplication because the dedupe step occurs after the initial write: backups land in storage at full size first, and redundant segments are removed afterward. The trade-off is that the target must temporarily hold the un-deduplicated data.
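As a rough illustration, post-process dedupe can be thought of as a separate pass over chunks that have already landed on the target at full size; this toy Python sketch uses illustrative names:

```python
# Toy sketch of post-process (asynchronous) deduplication: data is
# written to the target at full size first, and a later pass replaces
# duplicate chunks with references to a single stored copy.
import hashlib

def post_process_dedupe(stored_chunks):
    """Rewrite a list of raw chunks as (recipe, unique_chunks).

    stored_chunks: chunks as they landed on the target, duplicates and all.
    Returns a recipe of fingerprints plus the deduplicated chunk store.
    """
    unique = {}    # fingerprint -> chunk (one copy per unique chunk)
    recipe = []    # ordered fingerprints to rebuild the original stream
    for chunk in stored_chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        unique.setdefault(fp, chunk)   # keep only the first copy
        recipe.append(fp)
    return recipe, unique
```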

How to implement deduplication

The best way to implement deduplication depends on your goals and your environment, especially whether you’re implementing dedupe inside a backup storage system, as an appliance, or as part of a broader software platform.

In general, deduplication is deployed in one of two places:

  • At the source (before data is sent): best when bandwidth is a constraint, or you want faster cloud backups.

  • At the target (after data is received): best when you want storage savings without changing what is transmitted.
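The source-side variant can be sketched as a simple exchange in which the client fingerprints its chunks locally and ships only the ones the target does not already hold; the function and parameter names below are hypothetical:

```python
# Illustrative sketch of source-side deduplication: the client
# fingerprints chunks before transmission and sends only those the
# target has not already stored, saving network bandwidth.
import hashlib

def backup(client_chunks, target_index):
    """Send only chunks whose fingerprints are unknown to the target.

    client_chunks: list of bytes (already-chunked data on the source)
    target_index:  dict of fingerprint -> chunk held by the target
    Returns the number of bytes actually transmitted.
    """
    sent_bytes = 0
    for chunk in client_chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in target_index:     # target does not have this chunk
            target_index[fp] = chunk   # only new chunks cross the wire
            sent_bytes += len(chunk)
    return sent_bytes
```

A second backup of mostly unchanged data transmits only the chunks that are genuinely new, which is the bandwidth saving described above.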

When deciding which approach to use, consider:

  • Network bandwidth and remote office connectivity

  • Data change rate (how much data changes day-to-day)

  • Retention requirements and storage growth

  • Restore performance and recovery objectives (like RPO and RTO)

  • Operational complexity and scale

How does deduplication encryption work?

Deduplication and encryption are closely related because deduplication systems must be able to read data to detect duplicates. If data is encrypted before deduplication occurs, the encrypted output will look different—even if the original plaintext was identical—so the system won’t find duplicates.

As a result, in many architectures deduplication must occur before encryption. After the dedupe process identifies and stores unique data, encryption can then be applied to protect the stored content.
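The effect is easy to demonstrate: identical plaintext chunks share a fingerprint, but encrypting each copy with a fresh random nonce yields ciphertexts whose fingerprints differ. The XOR "cipher" in this Python sketch is a toy stand-in for a real cipher, used only to show the ordering problem:

```python
# Why dedupe usually runs before encryption: identical plaintexts
# share a fingerprint, but encryption with a per-message random
# nonce makes every ciphertext (and its fingerprint) unique.
# toy_encrypt is NOT real encryption; it only mimics the property
# that real ciphers randomize their output.
import hashlib
import os

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def toy_encrypt(plaintext: bytes, key: bytes) -> bytes:
    nonce = os.urandom(16)                          # fresh per message
    stream = hashlib.sha256(key + nonce).digest()   # toy keystream seed
    keystream = (stream * (len(plaintext) // 32 + 1))[:len(plaintext)]
    return nonce + bytes(p ^ k for p, k in zip(plaintext, keystream))
```

Two encryptions of the same chunk produce different outputs, so a dedupe engine looking at ciphertext finds no duplicates to remove.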

Druva’s approach to data deduplication

Druva uses a global approach to deduplication designed to reduce storage and bandwidth across a customer’s entire environment, not just within a single device or site.

Key elements include:

Global deduplication

Druva compares a customer’s backup data across locations and sources to reduce duplicates at a broader scope than systems limited to local dedupe.

Source-side deduplication

Druva deduplication starts at the client to reduce how much data must be transmitted over the network. To reduce load on endpoints, Druva’s cloud service performs much of the dedupe processing.

Block-level (sub-file) analysis

Block-level dedupe enables detection of duplicates within files—not only identical full files—typically delivering higher storage reduction.

Awareness of data-generating applications

Application awareness helps identify redundancy inside application-generated data, improving dedupe effectiveness in common enterprise data sources.

Scalable performance

Druva is designed to scale across many users, devices, and locations to support enterprise-wide optimization.

FAQs

What is deduplication?

Deduplication is a process that reduces redundant data by storing one unique instance and replacing duplicates with references to the original.

What is data deduplication used for?

Data deduplication is commonly used in backups and storage systems to reduce storage consumption, lower bandwidth usage, and improve backup efficiency.

What is the difference between file-level and block-level deduplication?

File-level deduplication removes duplicate entire files (single instance storage). Block-level deduplication removes duplicate segments within files, even when the full files aren’t identical.

What is source-side deduplication?

Source-side deduplication finds and eliminates duplicates before data is transmitted to backup storage. This can reduce both network usage and storage consumption, making it useful for cloud backups and remote offices.

What is target-side deduplication?

Target-side deduplication removes duplicates after data has already been stored in the target storage system. It reduces storage usage but typically does not reduce the bandwidth required to transmit data.

What is post-process deduplication?

Post-process deduplication removes duplicates after data has landed in storage, often asynchronously. This is useful when deduplication during transfer isn’t feasible.

What is a deduplication ratio?

A deduplication ratio compares how much data would be stored without deduplication to how much is stored with deduplication. Higher ratios indicate more reduction, but ratios can be misleading depending on how repetitive the data is.

How do deduplication and encryption work together?

Because deduplication must read data to identify duplicates, encryption typically happens after deduplication. If data is encrypted first, duplicates are harder to detect because encrypted data appears unique.

Does Druva offer deduplication?

Yes. Druva uses global, source-side, block-level deduplication designed to reduce storage and bandwidth across a customer’s environment, while scaling across many devices and locations.
