Well, first, let me confess that inSync v3.1 took much more time than we had planned. We had initially planned to release inSync v3.1 by July '09 and the Phoenix public beta by September '09.
In short – We are working on a new storage engine, codenamed Blackbird (after the legendary SR-71). The new engine will use application-specific deduplication technology to improve performance and bandwidth/storage savings.
Initially planned for inSync v3.1 and Phoenix v1.0, it will now be available in the next major releases.
The longer version – For the past two years, we have been experimenting with different algorithms for global, source-based data deduplication. While releasing inSync v2.0, we settled on chunk-based (variable-block) data deduplication, for the simple reason that it was tough to find matching data blocks at natural block boundaries across different users. We also worked on performance, which improved steadily over time.
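For readers unfamiliar with the technique, here is a minimal sketch of variable-block (content-defined) chunking. The running hash, window, and ~4 KiB target chunk size are illustrative choices for this post, not what inSync actually ships; a production engine would use a proper windowed rolling hash such as a Rabin fingerprint.

```python
import hashlib

def chunk_boundaries(data: bytes, min_size: int = 48, mask: int = 0xFFF):
    """Yield (start, end) offsets of variable-size chunks.

    A boundary is declared wherever the hash of the bytes since the
    last boundary matches a bit mask, so boundaries depend on content,
    not on fixed offsets. After an insertion, chunking resynchronizes
    at the next boundary instead of shifting every block.
    """
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF  # toy running hash (illustrative)
        if i - start + 1 >= min_size and (h & mask) == mask:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)

def dedup_store(data: bytes, store: dict) -> list:
    """Split data into chunks and store each unique chunk once, by SHA-1."""
    recipe = []
    for s, e in chunk_boundaries(data):
        digest = hashlib.sha1(data[s:e]).hexdigest()
        store.setdefault(digest, data[s:e])  # duplicate chunks stored once
        recipe.append(digest)
    return recipe
```

Backing up the same data twice adds nothing to the store; the second backup only produces a "recipe" of chunk references.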
While the approach was reasonably accurate, there was scope for significant improvement. We realized that 90% of the backup data on customer PCs comes from documents and PST files, so something focussed specifically on PST files could dramatically improve deduplication performance.
Also, while working on Phoenix, we came across a bigger challenge: finding duplicates across different data sources within the enterprise. We soon realized that a simple block-based approach would not take us very far. We also realized that most vendors use fixed- or variable-size block/chunk hashing techniques. This works well for them because they treat backups as "byte streams", and the only way to remove duplicates from a byte stream is fixed- or variable-size data deduplication.
Looking at the various data types and the possible ways to improve, we could clearly see two fundamental changes in our approach that could bring a paradigm shift in data deduplication –
- For accuracy – Application aware data deduplication
- For performance – Hierarchical block based deduplication
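One plausible reading of the second bullet is a two-level index: check a whole-file hash first, and only fall back to per-block hashing when the file is new. This is my own hypothetical sketch of the idea, not Blackbird's actual design (fixed-size blocks are used here purely for brevity):

```python
import hashlib

def backup_file(data: bytes, file_index: dict, chunk_store: dict,
                chunk_size: int = 4096) -> list:
    """Hierarchical lookup: whole-file hash first, then per-chunk hashes.

    A hit at the file level skips chunking entirely, so the common case
    (an unchanged file) costs one hash lookup instead of thousands of
    chunk lookups.
    """
    file_digest = hashlib.sha1(data).hexdigest()
    if file_digest in file_index:            # whole file seen before
        return file_index[file_digest]
    recipe = []
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        d = hashlib.sha1(chunk).hexdigest()
        chunk_store.setdefault(d, chunk)     # unique chunks stored once
        recipe.append(d)
    file_index[file_digest] = recipe
    return recipe
```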
Application-aware deduplication can actually pinpoint duplicates across PST file attachments and normal office documents.
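To make that concrete, here is a hypothetical sketch using Python's standard `email` module. A plain MIME message stands in for one mail inside a PST file; real PST parsing is far more involved and not shown. Hashing the decoded attachment bytes means the same document is recognized whether it sits on disk or inside a mailbox:

```python
import email
import hashlib

def attachment_digests(raw_message: bytes) -> list:
    """Parse a MIME message and hash each attachment payload separately.

    Illustrative only: treats any named MIME part as an attachment and
    hashes its decoded bytes, so an attached document gets the same
    digest as the identical standalone file.
    """
    msg = email.message_from_bytes(raw_message)
    digests = []
    for part in msg.walk():
        if part.get_filename():                      # named part = attachment
            payload = part.get_payload(decode=True) or b""
            digests.append(hashlib.sha1(payload).hexdigest())
    return digests
```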
On the PC side, the majority of the data is office documents and email files. This makes it simpler to introduce the new approach, but a lot of work is still needed to productise it. For Phoenix, the problem is much bigger and will take some more time to solve.
The new engine should be ready soon. It will ship first in inSync v4.0 early next year, and then in Phoenix v2.0. In the next few posts, I will try to share some benchmark data.