
Understanding Data Deduplication — and Why It’s Critical for Moving Data to the Cloud

If you work in IT and are responsible for backing up or transferring large amounts of data, you’ve probably heard the term data deduplication. Here’s a clear definition of what “data deduplication” means, and why it is a fundamental requirement for moving data to the cloud.

First, the basics

At its simplest, data deduplication is a technique for eliminating redundant data in a data set. In the process of deduplication, extra copies of the same data are deleted, leaving only one copy to be stored. Data is analyzed to identify duplicate byte patterns, which confirms that the retained instance really is identical to the copies being removed. The duplicates are then replaced with a reference that points to the stored chunk.
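The mechanism described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the fixed 4-byte chunk size, the use of SHA-256 to identify byte patterns, and the names dedup_store and restore are all choices made for this example, not a description of any particular product.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def dedup_store(data, store):
    """Split data into fixed-size chunks, keep one copy of each unique chunk
    in `store`, and return the ordered list of references (hashes)."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # stored only the first time it is seen
        refs.append(digest)              # every occurrence becomes a reference
    return refs


def restore(refs, store):
    """Rebuild the original data by following the references."""
    return b"".join(store[r] for r in refs)


store = {}
refs = dedup_store(b"ABCDABCDABCDWXYZ", store)
print(len(refs), "references,", len(store), "stored chunks")
```

Here the 16-byte input yields four references but only two stored chunks, and restore(refs, store) returns the original bytes unchanged.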

Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (think about how often you make only small changes to a PowerPoint file or share another important business asset), the amount of duplicate data can be significant. In some companies, 80% of corporate data is duplicated across the organization. Reducing the amount of data that must be transmitted across the network and stored can significantly cut storage costs and backup times, with savings of up to 90% in some cases.

A real-world example

Consider an email server that contains 100 instances of the same 1 MB file attachment, say, a sales presentation with graphics that was sent to everyone on the global sales staff. Without data deduplication, if everyone backs up their email inbox, all 100 instances of the presentation are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy, reducing storage and bandwidth demand to only 1 MB.
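The arithmetic in this example can be checked with a short Python sketch. It assumes whole-file deduplication keyed by a SHA-256 hash, and it uses a block of zero bytes as a stand-in for the presentation; both are choices made for illustration. The references themselves take a little space, which the example ignores.

```python
import hashlib

MB = 1024 * 1024
attachment = bytes(MB)  # stand-in for the 1 MB presentation (all zero bytes)
mailboxes = [attachment for _ in range(100)]  # 100 inboxes hold the same file

# Without deduplication: every copy is stored in full.
naive_bytes = sum(len(a) for a in mailboxes)

# With deduplication: store each distinct attachment once, keyed by its hash,
# and keep only a reference for every inbox that contains it.
unique = {}
references = []
for a in mailboxes:
    digest = hashlib.sha256(a).hexdigest()
    unique.setdefault(digest, a)
    references.append(digest)

dedup_bytes = sum(len(a) for a in unique.values())
print(naive_bytes // MB, "MB without deduplication")  # 100
print(dedup_bytes // MB, "MB with deduplication")     # 1
```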

Data deduplication evolves to meet the need for speed

While data deduplication is a common concept, not all deduplication techniques are the same. Early breakthroughs in data deduplication were designed for the challenge of the time: reducing the storage capacity required and bringing more reliability to data backup to servers and tape. One example is Quantum’s use of file-based or fixed-block-based storage, which focused on reducing storage costs. Appliance vendors like Data Domain further improved on storage savings by using target-based and variable-block-based techniques that only required backing up changed data segments rather than all segments, providing yet another layer of efficiency to maximize storage savings.
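The difference between fixed-block and variable-block techniques is easiest to see when a file changes slightly. The Python sketch below compares the two after a single byte is inserted at the start of some data. It is a deliberately simplified illustration: the variable-size chunker cuts after every newline byte as a stand-in for the rolling-hash boundary detection that production systems typically use, and the function names are hypothetical.

```python
def fixed_chunks(data, size=4):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_chunks(data, delimiter=b"\n"):
    """Cut a chunk after every delimiter byte, so boundaries follow content."""
    chunks, start = [], 0
    for i in range(len(data)):
        if data[i:i + 1] == delimiter:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes with no delimiter
    return chunks


def shared(old, new):
    """Count chunks of the new version that already exist in the old one."""
    seen = set(old)
    return sum(1 for c in new if c in seen)


original = b"alpha\nbravo\ncharlie\ndelta\n"
edited = b"X" + original  # one byte inserted at the front

fixed_reused = shared(fixed_chunks(original), fixed_chunks(edited))
content_reused = shared(content_chunks(original), content_chunks(edited))
print(fixed_reused, content_reused)  # 0 3
```

With fixed blocks, the inserted byte shifts every boundary and nothing can be reused. With content-defined boundaries, only the first chunk changes and the other three are already stored.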

As data deduplication efficiency improved, new challenges arose. How do you back up more and more data across the network without impacting overall network performance? Avamar addressed this challenge with variable-block deduplication and source-based deduplication, compressing data before it ever left the server, thereby reducing network traffic, the amount of data stored on disk, and the time it took to back up. With this step forward, deduplication became more than simply storage savings; it addressed overall performance across networks, ensuring that even in environments with limited bandwidth, data had a chance to be backed up in a reasonable time.

Another step-function improvement in data deduplication was achieved by Druva when it addressed data redundancies at the object level (versus the file level) and solved for deduplication across distributed users at a global scale.

Advances in data deduplication to manage massive volumes of data

By the early 2000s, business data was becoming global, real-time, and mobile. IT teams were challenged to back up and protect massive volumes of corporate data across a range of endpoints and locations with increased efficiency and scale. To address this challenge, Druva pioneered the concept of “app-aware” deduplication, which analyzes data at the file object level to identify file duplicates in attachments, emails, or even down to the folder from which they originate. The approach added significant gains in accuracy and performance for data backups, lowering the barrier for companies to efficiently manage and protect large volumes of data.

Data deduplication offers a new foundation for data governance

Today, as cloud adoption reaches a tipping point and companies have begun moving their data storage to a virtual cloud environment, data deduplication plays a more strategic role than simply saving on storage costs. In combination with cloud-based object storage architecture, efficient data deduplication is opening up new opportunities to do more with stored data.

One example is data governance. With global deduplication techniques, massive volumes of data can be backed up and stored in the cloud, and made available to IT (and the C-Suite) to address compliance, data regulation, and real-time business insights. This is done by creating a time-indexed file system that uses metadata to store only the unique data required. The time-indexed view of data means that you now have historical context for information, and data is always indexed and ready for forensics teams. This is a radical departure from the traditional “backup to the graveyard” approach, in which data is written as a serial stream of incremental or full backups. Additionally, being able to understand and analyze the data that a set of users has in common helps IT understand data usage patterns and further reduce data redundancies across users in distributed environments.
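One way to picture a time-indexed store that keeps only unique data is the Python sketch below. It is a hypothetical model, not Druva's design: the class name TimeIndexedStore, its methods, and the use of integer timestamps are all assumptions made for illustration. Content is stored once under its hash, while a small metadata index records which content each path pointed to at each point in time.

```python
import hashlib
from bisect import bisect_right


class TimeIndexedStore:
    """Hypothetical store: unique content plus a time-ordered metadata index."""

    def __init__(self):
        self.blobs = {}     # digest -> content, stored once
        self.versions = {}  # path -> sorted list of (timestamp, digest)

    def put(self, path, timestamp, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)
        history = self.versions.setdefault(path, [])
        history.append((timestamp, digest))
        history.sort()

    def get_as_of(self, path, timestamp):
        """Return the content of `path` as it was at `timestamp`, or None."""
        history = self.versions.get(path, [])
        times = [t for t, _ in history]
        idx = bisect_right(times, timestamp)
        if idx == 0:
            return None  # the file did not exist yet
        return self.blobs[history[idx - 1][1]]


fs = TimeIndexedStore()
fs.put("report.txt", 1, b"draft")
fs.put("report.txt", 5, b"final")
fs.put("copy.txt", 6, b"final")  # same content: no new blob is stored

print(fs.get_as_of("report.txt", 3))  # b'draft'
print(len(fs.blobs))                  # 2
```

Because every version is reachable through the index, a question such as "what did this file contain on a given date" becomes a lookup rather than a restore from a chain of incremental backups.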

Advanced data deduplication is now helping address two competing forces that threaten to impede fast-growing enterprise businesses: managing the massive increase in corporate data created outside the traditional firewall, and solving for the growing need to govern data across its lifecycle by time zone, user, device, and file type.

Why Druva leads in its approach to data deduplication

Druva’s patented global data deduplication approach has four unique attributes:

  • It is performed on the client (versus the server), thereby reducing the amount of data needed to be shipped over the network.
  • The analysis is done at the sub-file or block-level to find duplicate data within a file.
  • It is aware of the applications from which data is generated. That is, Druva inSync looks inside files such as an Outlook email file, leveraging MAPI, to find duplicate data in email attachments.
  • Druva’s deduplication scales beyond a single user to find duplicate data (say, an email sent to an entire organization) across multiple users and devices.
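Three of these attributes (client-side processing, block-level analysis, and deduplication across users) can be illustrated together in a short Python sketch. It is a generic model under stated assumptions, not Druva's patented implementation: the DedupServer class, the backup function, the 4-byte block size, and the SHA-256 fingerprints are all hypothetical, and application awareness is left out. The client hashes blocks locally, asks the server which ones it lacks, and sends only those.

```python
import hashlib


class DedupServer:
    """Hypothetical backup target holding one copy of each unique block."""

    def __init__(self):
        self.blocks = {}     # digest -> block bytes, shared by all users
        self.manifests = {}  # (user, filename) -> ordered list of digests

    def missing(self, digests):
        return {d for d in digests if d not in self.blocks}

    def upload(self, user, filename, digests, new_blocks):
        self.blocks.update(new_blocks)
        self.manifests[(user, filename)] = digests


def backup(server, user, filename, data, block_size=4):
    """Client side: hash blocks locally and send only those the server lacks.
    Returns the number of payload bytes sent over the (simulated) network."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    needed = server.missing(digests)
    new_blocks = {d: b for d, b in zip(digests, blocks) if d in needed}
    server.upload(user, filename, digests, new_blocks)
    return sum(len(b) for b in new_blocks.values())


server = DedupServer()
sent_alice = backup(server, "alice", "deck.ppt", b"AAAABBBBCCCC")
sent_bob = backup(server, "bob", "deck-v2.ppt", b"AAAABBBBDDDD")
print(sent_alice, sent_bob)  # 12 4
```

The second user sends only the one block the server has not seen, even though the file belongs to a different user and device.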

