Product, Tech/Engineering

Green-ness of Data De-duplication

The Storage Hunger

Sales of disk-based storage systems crossed 2,500 petabytes in 2008, up 58.1% year-over-year (one petabyte = 1 million GB). These figures do not include the direct-attached storage that comes pre-loaded with PCs or servers.[1] This is understandable, as 1 TB (1,000 GB) NAS/SAN storage devices are now a commodity. The top three vendors in this space are HP, IBM and EMC, with market shares of approximately 29%, 20% and 14% respectively.[2] The overall consumption doubles when this storage is backed up 🙂

Energy Consumption

On average, a datacenter consumes 100 watts per square foot, and the best solid-state storage consumes about 5 watts per 1M IOPS.[3] This puts the total annual cost of maintaining (cooling + power) a 1 TB disk array at about USD 2,500 (at 16¢/kWh and 20 GB of average daily usage). That makes the annual energy cost of the newly bought storage roughly USD 5 billion! And backing up this 5-billion-dollar inventory surely adds a couple of billion more.

Data De-duplication

Data de-duplication technology stores only a single copy of duplicate data. There are two important aspects of any data de-duplication solution/product –

  1. Scope of duplicate discovery – File-level / Sub-File level / Block level
  2. Point of duplicate discovery – Source / Target
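To make the "scope" idea concrete, here is a minimal sketch of block-level de-duplication (purely illustrative Python; the function names and 4 KB block size are my own choices, not any vendor's design): data is split into fixed-size blocks, each block is hashed, and only one copy of each unique block is stored, along with a "recipe" of hashes to rebuild the original.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size blocks; real products often use variable-size chunking

def dedupe(data: bytes):
    """Split data into blocks and keep a single copy of each unique block.

    Returns (store, recipe): store maps block hash -> block bytes,
    recipe is the ordered list of hashes needed to rebuild the data.
    """
    store, recipe = {}, []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # duplicate blocks are stored only once
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe):
    """Reassemble the original data from the block store and the recipe."""
    return b"".join(store[d] for d in recipe)
```

Ten identical 4 KB blocks consume the space of one; the recipe (a list of hashes) is the only per-copy overhead. File-level dedup is the same idea with the whole file as the "block"; sub-file/block-level dedup catches duplicates that file-level hashing misses.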

Most storage vendors that use data de-duplication provide block-level duplicate removal at the target (i.e. when the data reaches the storage). But it's not very difficult to imagine that source-level removal of sub-file or block-level duplicates would be much better, for two reasons –

  1. Sending less (de-duplicated) data saves time and bandwidth (apart from storage)
  2. Duplicate discovery is more effective at the source, where you have access to the structured data
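Reason 1 can be sketched as a simple protocol (illustrative Python; the `Target` class and method names are my own, not any product's API): before transmitting anything, the source asks the target which block hashes it already holds, then sends only the blocks the target is missing.

```python
import hashlib

BLOCK_SIZE = 4096

class Target:
    """Stand-in for the backup server's block store."""
    def __init__(self):
        self.blocks = {}

    def missing(self, digests):
        """Report which of these block hashes the target doesn't have yet."""
        return [d for d in digests if d not in self.blocks]

    def receive(self, blocks):
        self.blocks.update(blocks)

def source_side_backup(data: bytes, target: Target) -> int:
    """De-duplicate at the source: send only blocks the target lacks.

    Returns the number of payload bytes actually transmitted.
    """
    chunks = {hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest(): data[i:i + BLOCK_SIZE]
              for i in range(0, len(data), BLOCK_SIZE)}
    needed = target.missing(list(chunks))
    target.receive({d: chunks[d] for d in needed})
    return sum(len(chunks[d]) for d in needed)
```

A second backup of unchanged data transfers only the hash list, not the data itself, which is where the time and bandwidth savings come from.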

Considering Microsoft’s report on de-duplication assessment [4] –

  1. 20-30% duplication is easily visible even in structured data sources like ERP databases
  2. 40-80% duplication can be seen in file servers and mail servers.
  3. 60-90% duplication can be seen across different PCs. (Just my observation and opinion)

On average, a conservative 30% duplicate removal could save $1.6B in storage energy and another $2B in bandwidth and backup costs.
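As a sanity check, the savings claim is a straight percentage of the energy estimate above (a toy calculation using the post's own numbers; the cited $1.6B presumably folds in backup energy as well):

```python
ANNUAL_STORAGE_ENERGY_COST = 5e9   # USD, from the energy estimate above
DEDUPE_RATIO = 0.30                # conservative duplicate fraction

# 30% of the fleet's storage energy bill disappears with the duplicate blocks;
# roughly the figure cited once backup-side energy is included.
energy_savings = ANNUAL_STORAGE_ENERGY_COST * DEDUPE_RATIO
print(f"${energy_savings / 1e9:.1f}B")  # prints "$1.5B"
```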

De-duplication and Druvaa

We see Druvaa inSync as a product/platform that provides de-duplicated (at source) backup for PCs, PDAs and servers. The current version is available only for PCs, and we can easily see up to 90% savings in time and cost (bandwidth and storage) for enterprises. I just don’t see a reason why all storage and backup vendors wouldn’t do it. EMC and NetApp have already announced de-duplication as an additionally licensable technology on their arrays (target-based). No major vendor except EMC has announced agent/source-based de-dup though.[6] Surely, Druvaa has a good lead and is cashing in on it 🙂