What is data deduplication?
If you work in IT and are responsible for backing up or transferring large amounts of data, you’ve probably heard the term data deduplication. In this post, we define what “data deduplication” means and explain why it is a fundamental requirement when migrating your organization’s data to the cloud.
First, the basics
At its simplest, data deduplication is a technique for eliminating redundant data in a data set. During deduplication, extra copies of the same data are deleted, leaving only one copy to be stored. The data is analyzed to identify duplicate byte patterns and to confirm that the stored instance is the only copy. Duplicates are then replaced with a reference that points to the stored chunk.
Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (think about how often you make only small changes to a PowerPoint file), the amount of duplicate data can be significant. In some companies, 80 percent of corporate data is duplicated across the organization. Reducing the amount of data that has to be transmitted across the network and stored can yield significant savings in storage costs and backup time, in some cases up to 50 percent.
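To make the mechanism concrete, here is a minimal sketch in Python of a store that keeps one copy of each unique chunk and replaces repeats with references. The `ChunkStore` class, its method names, the 4-byte chunk size, and the use of SHA-256 as the fingerprint are all assumptions chosen for illustration; production systems use much larger chunks and their own index formats.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store that keeps one copy of each unique chunk."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # tiny on purpose, so duplicates are easy to see
        self.chunks = {}              # digest -> chunk bytes (each stored once)

    def put(self, data):
        """Split data into fixed-size chunks and return a list of references."""
        refs = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store only if not seen before
            refs.append(digest)                    # every occurrence becomes a reference
        return refs

    def get(self, refs):
        """Rebuild the original data by following the references."""
        return b"".join(self.chunks[ref] for ref in refs)


store = ChunkStore()
refs = store.put(b"ABCDABCDABCDWXYZ")
```

In this run the 16-byte input produces four references but only two stored chunks (`ABCD` and `WXYZ`), and `store.get(refs)` returns the original bytes.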
A real-world example
Consider an email server that contains 100 instances of the same 1 MB file attachment, for example a sales presentation with graphics sent to everyone on the global sales staff. Without data deduplication, if everyone backs up their email inbox, all 100 instances of the presentation are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is referenced back to the one saved copy, reducing storage and bandwidth demand to only 1 MB.
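The arithmetic in this example can be checked with a short Python sketch. The mailbox names, the all-zero stand-in for the attachment, and whole-file SHA-256 fingerprints are assumptions made for illustration, and 1 MB is taken as 1024 × 1024 bytes.

```python
import hashlib

MB = 1024 * 1024
attachment = bytes(MB)  # stand-in for the 1 MB sales presentation

# 100 mailboxes, each holding its own copy of the same attachment.
mailboxes = {f"user{n}": [attachment] for n in range(100)}

# Without deduplication, every copy is stored.
naive_bytes = sum(len(f) for files in mailboxes.values() for f in files)

# With deduplication, each mailbox keeps a reference to one stored copy.
unique_files = {}  # digest -> file bytes
references = {}    # user -> list of digests
for user, files in mailboxes.items():
    references[user] = []
    for f in files:
        digest = hashlib.sha256(f).hexdigest()
        unique_files.setdefault(digest, f)
        references[user].append(digest)

dedup_bytes = sum(len(f) for f in unique_files.values())
```

Here `naive_bytes` comes to 100 MB and `dedup_bytes` to 1 MB, matching the figures above; the 100 references themselves are only a few dozen bytes each.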
Data deduplication solutions evolve to meet the need for speed
While data deduplication is a common concept, not all deduplication techniques are the same. Early breakthroughs in data deduplication were designed for the challenge of the time: reducing storage capacity and bringing more reliable data backup to servers and tape. One example is Quantum’s use of file-based or fixed-block-based storage, which focused on reducing storage costs. Appliance vendors like Data Domain further improved on storage savings by using target-based and variable-block-based techniques that only required backing up changed data segments rather than all segments. This provided yet another layer of efficiency to maximize storage savings.
As data deduplication efficiency improved, new challenges arose. How do you back up more and more data across the network without impacting overall network performance? Avamar addressed this challenge with variable-block deduplication and source-based deduplication, compressing data before it ever left the server and reducing network traffic, the amount of data stored on disk, and the time to back up. With this step forward, deduplication became more than simply storage savings; it addressed overall performance across networks, ensuring that even in environments with limited bandwidth, data had a chance to be backed up in a reasonable time.
Another step-function improvement to data deduplication was achieved by Druva when it addressed data redundancies at the object level (versus the file level), and solved for deduplication across distributed users at a global scale.
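The difference between fixed-block and variable-block deduplication is easiest to see when a byte is inserted at the front of a file. The Python sketch below is an illustration only, not any vendor's algorithm: the function names, the 16-byte window, the divisor of 64, and the simple byte-sum boundary rule are all assumptions (real products use stronger rolling hashes plus minimum and maximum chunk sizes).

```python
import hashlib


def fixed_chunks(data, size=64):
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, divisor=64):
    """Cut wherever the sum of the last `window` bytes is divisible by `divisor`."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward by one byte
        # Only consider a cut once a full window of bytes has been seen.
        if i >= window - 1 and rolling % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared(old, new):
    """Number of distinct chunks the new version can reuse from the old one."""
    return len(set(old) & set(new))


# Deterministic pseudo-random test data (4,096 bytes), then a one-byte insertion.
original = b"".join(hashlib.sha256(bytes([n])).digest() for n in range(128))
edited = b"!" + original

fixed_shared = shared(fixed_chunks(original), fixed_chunks(edited))
variable_shared = shared(variable_chunks(original), variable_chunks(edited))
```

Because the variable-block boundaries depend only on nearby content, every chunk after the first lines up again in the edited file, so only the changed region needs to be sent or stored. With fixed blocks, the one-byte insertion shifts every block boundary, and the edited file shares few if any blocks with the original.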
Advances in global data deduplication to manage massive volumes of data
By the early 2000s, business data was becoming global, real-time, and mobile. IT teams were challenged to back up and protect massive volumes of corporate data across a range of endpoints and locations with increased efficiency and scale. To address this challenge, Druva pioneered the concept of “app-aware” deduplication, which analyzes data at the file object level to identify duplicates in attachments and emails, even tracing them down to their origin folder. The approach added significant gains in accuracy and performance for data backups, lowering the barrier for companies to efficiently manage and protect large volumes of data.
Data deduplication offers a new foundation for data governance
Today, as cloud adoption reaches a tipping point and companies are increasingly moving their data storage to virtual cloud environments, data deduplication plays a more strategic role than simply saving on storage costs. In combination with cloud-based object storage architecture, efficient data deduplication opens up new opportunities to do more with stored data.
A key example is data governance. With global data deduplication techniques, massive volumes of data can be backed up and stored in the cloud, and made available to IT (and the C-Suite) to address compliance, data regulation, and real-time business insights. This is done by creating a time-indexed file system that uses metadata to store only the unique data required. The time-indexed view of data means that you now have historical context for information, and data is always indexed and ready for forensics teams. This is a radical departure from the traditional “backup to the graveyard” approach, in which data is written as a serial stream of incremental or full backups. Additionally, being able to understand and analyze the data that a set of users holds in common helps IT understand data usage patterns and further optimize data redundancies across users in distributed environments.
Advanced data deduplication is now helping address two competing forces that threaten to impede fast-growing enterprises: managing the massive increase in corporate data created outside the traditional firewall, and meeting the growing need to govern data across its lifecycle by time zone, user, device, and file type.
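One way to picture a time-indexed view over deduplicated data is a metadata index that maps each file path to a list of (timestamp, fingerprint) entries, with the content itself stored once per unique fingerprint. The Python sketch below works under those assumptions; the `TimeIndexedStore` class and its methods are hypothetical and are not a description of Druva's file system.

```python
import hashlib


class TimeIndexedStore:
    """Toy store: unique content is kept once, metadata records each version."""

    def __init__(self):
        self.blobs = {}    # digest -> content bytes (stored once)
        self.history = {}  # path -> list of (timestamp, digest) metadata entries

    def backup(self, timestamp, path, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # unchanged content adds no data
        self.history.setdefault(path, []).append((timestamp, digest))

    def restore(self, path, as_of):
        """Return the file as it looked at time `as_of`, or None if it did not exist."""
        versions = [v for v in self.history.get(path, []) if v[0] <= as_of]
        if not versions:
            return None
        _, digest = max(versions, key=lambda v: v[0])  # latest version not after as_of
        return self.blobs[digest]


store = TimeIndexedStore()
store.backup(1, "plan.txt", b"draft")
store.backup(2, "plan.txt", b"draft")  # unchanged: only a metadata entry is added
store.backup(3, "plan.txt", b"final")
```

After these three backups the store holds two blobs and three metadata entries, and `store.restore("plan.txt", as_of=2)` returns the draft. A serial stream of full backups would have to be replayed to answer the same question.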
Why Druva leads in its approach to global data deduplication
Druva’s patented global data deduplication approach has four unique attributes:
- It is performed on the client (versus the server), thereby reducing the amount of data that needs to be shipped over the network
- The analysis is done at the sub-file or block level to find duplicate data within a file
- It is aware of the applications from which data is generated, looking inside files to find duplicate data
- Druva’s deduplication scales beyond a single user to find duplicate data (say, an email sent to an entire organization) across multiple users and devices
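Three of these attributes (client-side processing, block-level analysis, and deduplication across users) can be sketched together in a few lines of Python. This is a simplified illustration, not Druva's patented implementation: the `DedupServer` class, the `client_backup` function, the 4,096-byte block size, and the SHA-256 fingerprints are all assumptions, and application awareness is not modeled.

```python
import hashlib


class DedupServer:
    """Hypothetical backup server with one block index shared by all users."""

    def __init__(self):
        self.chunks = {}     # digest -> block bytes, shared globally
        self.manifests = {}  # (user, filename) -> ordered list of digests

    def missing(self, digests):
        """Tell the client which fingerprints the server has never seen."""
        return {d for d in digests if d not in self.chunks}

    def upload(self, blocks_by_digest):
        self.chunks.update(blocks_by_digest)

    def commit(self, user, filename, digests):
        self.manifests[(user, filename)] = list(digests)

    def restore(self, user, filename):
        return b"".join(self.chunks[d] for d in self.manifests[(user, filename)])


def client_backup(server, user, filename, data, block_size=4096):
    """Fingerprint blocks on the client and send only blocks the server lacks."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    by_digest = dict(zip(digests, blocks))
    payload = {d: by_digest[d] for d in server.missing(digests)}
    server.upload(payload)
    server.commit(user, filename, digests)
    return sum(len(b) for b in payload.values())  # bytes actually sent


deck = b"".join(bytes([n]) * 4096 for n in range(4))  # four distinct 4 KB blocks
server = DedupServer()
sent_alice = client_backup(server, "alice", "deck.pptx", deck)
sent_bob = client_backup(server, "bob", "deck.pptx", deck)
sent_bob_edit = client_backup(server, "bob", "deck-v2.pptx", deck + b"!")
```

In this run the first user uploads all 16,384 bytes, the second user uploads nothing for the identical file, and the edited copy costs a single new byte, because only fingerprints cross the network for blocks the server already holds.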