
Filesystems vs. Databases

Chaitra Chowta, Druva Team

It’s interesting to see how far databases have come and how clearly they have overshadowed filesystems for storing structured or unstructured information. Technically, both support the basic features necessary for data access.

For example, both of them:

  • Manage data to ensure its integrity and quality
  • Allow shared access by a community of users
  • Use well-defined schema for data-access
  • Support a query language


But filesystems seriously lack some of the critical features necessary for managing data. Let’s take a look at some of these features.

Transaction support

Atomic transactions guarantee complete failure or success of an operation. This is especially needed when there is concurrent access to the same data set. This is one of the basic features provided by all databases.

But most filesystems don’t have this feature. Only lesser-known filesystems, such as Transactional NTFS (TxF), Sun ZFS, and Veritas VxFS, support it. Most of the popular open source filesystems (including ext3, xfs, reiserfs) journal their own metadata for crash consistency but do not expose multi-operation transactions to applications.
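
To make the all-or-nothing guarantee concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names in it and the transfer helper are illustrative assumptions, not something taken from a particular product.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two rows atomically: both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        return True
    except ValueError:
        return False  # the debit above was rolled back together with everything else

def demo():
    # Hypothetical data, kept in memory so the sketch leaves no files behind.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
    conn.commit()
    ok = transfer(conn, "alice", "bob", 30)       # succeeds
    failed = transfer(conn, "alice", "bob", 500)  # fails and leaves no partial debit
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    return ok, failed, balances
```

The second transfer debits the source row before it discovers the shortfall, yet the rollback undoes that debit. Getting the same behaviour from two plain file writes would require the application to implement its own journaling.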

Fast Indexing

Databases allow indexing based on any attribute or data property (e.g. SQL columns). This allows fast retrieval of data based on the indexed attribute. This functionality is not offered by most filesystems, e.g. you can’t quickly access “all files created after 2 PM today.”

Desktop search tools like Google Desktop or Mac Spotlight offer this functionality. But to do so, they have to scan and index the complete filesystem and store the information in an internal relational database.
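
The scan-then-index approach can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular desktop search tool works: the table layout and function names are made up, and it uses modification time because creation time is not available portably.

```python
import os
import sqlite3
import tempfile

def build_index(root, conn):
    """Scan `root` once and record each file's path, size and modification time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER, mtime REAL)")
    # The index on mtime is what makes time-based lookups cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS files_by_mtime ON files (mtime)")
    with conn:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                conn.execute(
                    "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                    (path, st.st_size, st.st_mtime))

def modified_after(conn, timestamp):
    """Return paths modified after `timestamp` (seconds since the epoch), oldest first."""
    rows = conn.execute(
        "SELECT path FROM files WHERE mtime > ? ORDER BY mtime", (timestamp,))
    return [path for (path,) in rows]

def demo():
    # Two throwaway files with hand-set modification times (hypothetical values).
    with tempfile.TemporaryDirectory() as root:
        for name, mtime in (("old.txt", 1000), ("new.txt", 2000)):
            path = os.path.join(root, name)
            with open(path, "w") as f:
                f.write("x")
            os.utime(path, (mtime, mtime))
        conn = sqlite3.connect(":memory:")
        build_index(root, conn)
        return [os.path.basename(p) for p in modified_after(conn, 1500)]
```

The expensive part is the full walk in build_index, which also has to be repeated or kept up to date as files change. That is exactly the cost a database avoids by maintaining its indexes as part of every write.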

Snapshots

A snapshot is a point-in-time copy or view of the data. Snapshots are needed for backup applications, which need consistent point-in-time copies of data.

Transactional and journaling capabilities enable most databases to offer snapshots without stopping access to the data. Most filesystems, however, don’t provide this feature (ZFS and VxFS being the only exceptions). The backup software has to depend on either the running application or the underlying storage for snapshots.
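
As a small illustration of a snapshot taken while the database stays usable, the sketch below uses the online backup call in Python's sqlite3 module (Connection.backup, available since Python 3.7). The notes table and the helper names are assumptions made for the example.

```python
import sqlite3

def snapshot(source):
    """Copy `source` into a new in-memory database as of this moment."""
    copy = sqlite3.connect(":memory:")
    source.backup(copy)  # consistent copy of everything committed so far
    return copy

def count_notes(conn):
    return conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]

def demo():
    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE notes (body TEXT)")
    live.execute("INSERT INTO notes VALUES ('before snapshot')")
    live.commit()

    snap = snapshot(live)

    # The live database keeps accepting writes; the snapshot does not see them.
    live.execute("INSERT INTO notes VALUES ('after snapshot')")
    live.commit()
    return count_notes(live), count_notes(snap)
```

A backup tool can read from the snapshot at its own pace while the application carries on writing to the live database.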

Clustering

Advanced databases like MySQL also offer clustering capabilities. MySQL offers shared-nothing clusters using synchronous replication. This helps databases scale and support larger and more fault-tolerant production environments.

Filesystems still don’t support this option. The only exceptions are Veritas CFS and GFS (Open Source).

Replication

Replication is a commodity feature in databases and forms the basis for disaster-recovery plans. Filesystems still have to evolve to handle it.

Relational View of Data

Filesystems store files and other objects only as a stream of bytes, and have little or no information about the data stored in the files. Filesystems also provide only a single way of organizing the files, namely via directories and file names. The associated attributes are also limited in number, e.g. type, size, author, creation time, etc. This does not help in managing related data, as disparate items do not have any relationships defined.

Databases, on the other hand, offer an easy way to relate stored data. They also offer a flexible query language (SQL) to retrieve the data. For example, it is possible to query a database for “contacts of all persons who live in Acapulco and sent emails yesterday”, but impossible in the case of a filesystem.
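
The Acapulco query might look like the sketch below. The two tables, their columns and the sample rows are invented for illustration, and “yesterday” is passed in as a fixed date so the result is reproducible.

```python
import sqlite3

def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: people, and the emails they sent.
    conn.executescript("""
        CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT, city TEXT, phone TEXT);
        CREATE TABLE emails (
            id INTEGER PRIMARY KEY,
            sender_id INTEGER REFERENCES persons(id),
            sent_on TEXT
        );
        INSERT INTO persons VALUES
            (1, 'Ana',  'Acapulco', '555-0101'),
            (2, 'Luis', 'Acapulco', '555-0102'),
            (3, 'Mei',  'Pune',     '555-0103');
        INSERT INTO emails VALUES
            (1, 1, '2024-05-01'),
            (2, 2, '2024-04-30'),
            (3, 3, '2024-05-01');
    """)
    # Relate the two tables through sender_id, then filter on city and date.
    rows = conn.execute("""
        SELECT DISTINCT p.name, p.phone
        FROM persons AS p
        JOIN emails AS e ON e.sender_id = p.id
        WHERE p.city = ? AND e.sent_on = ?
        ORDER BY p.name
    """, ("Acapulco", "2024-05-01"))
    return rows.fetchall()
```

Only Ana both lives in Acapulco and sent an email on the given date, so she is the single row returned. With plain files, the same answer would need application code that knows the format of both the address book and the mail store.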

Filesystems need to evolve and provide capabilities to relate different data sets. This will let application writers use native filesystem capabilities to relate data. A good effort in this direction has been Microsoft WinFS.

Conclusion

There are features that databases have that filesystems could truly benefit from, and there is no reason why future filesystems cannot borrow them. In fact, in many cases they already are.

Find out more about the pros and cons of different architecture models by accessing this report: Choosing the Right Model for Enterprise Backup & Recovery

Disclosure

Druva inSync uses a proprietary filesystem to store and index the backed-up data. The metadata for the filesystem is stored in an embedded MySQL database. The database-driven model was chosen to store additional identifiers with each block: size, hash and time. This helps the filesystem with:

  • Block Size: Divide files into variable-sized blocks
  • Data deduplication: Store a single copy of duplicate blocks
  • Temporal Filesystem: Store time information with each block. This enables faster time-based restores.
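
To show how size, hash and time stored per block can support deduplication and time-based restores, here is a toy sketch. It is not Druva's implementation: it assumes SQLite instead of embedded MySQL, fixed-size blocks instead of variable-sized ones, SHA-256 as the hash, and table and function names invented for the example.

```python
import hashlib
import sqlite3
import time

BLOCK_SIZE = 4  # tiny fixed size, chosen only so the demo is easy to follow

def open_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE blocks (hash TEXT PRIMARY KEY, size INTEGER, data BLOB);
        CREATE TABLE file_blocks (
            name TEXT, stored_at REAL, seq INTEGER,
            hash TEXT REFERENCES blocks(hash),
            PRIMARY KEY (name, stored_at, seq)
        );
    """)
    return conn

def store(conn, name, data, now=None):
    """Split `data` into blocks; keep each distinct block once, keyed by its hash."""
    now = time.time() if now is None else now
    with conn:
        for seq, start in enumerate(range(0, len(data), BLOCK_SIZE)):
            chunk = data[start:start + BLOCK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # A block whose hash is already present is not stored again.
            conn.execute("INSERT OR IGNORE INTO blocks VALUES (?, ?, ?)",
                         (digest, len(chunk), chunk))
            conn.execute("INSERT INTO file_blocks VALUES (?, ?, ?, ?)",
                         (name, now, seq, digest))

def restore(conn, name, as_of):
    """Rebuild the newest version of `name` stored at or before `as_of`."""
    row = conn.execute(
        "SELECT MAX(stored_at) FROM file_blocks WHERE name = ? AND stored_at <= ?",
        (name, as_of)).fetchone()
    if row[0] is None:
        return None  # nothing had been stored by that time
    rows = conn.execute(
        "SELECT b.data FROM file_blocks AS f JOIN blocks AS b ON b.hash = f.hash "
        "WHERE f.name = ? AND f.stored_at = ? ORDER BY f.seq", (name, row[0]))
    return b"".join(data for (data,) in rows)

def demo():
    conn = open_store()
    store(conn, "a.txt", b"AAAABBBBAAAA", now=1.0)  # blocks AAAA, BBBB, AAAA
    store(conn, "a.txt", b"AAAACCCC", now=2.0)      # AAAA is reused, CCCC is new
    unique = conn.execute("SELECT COUNT(*) FROM blocks").fetchone()[0]
    return unique, restore(conn, "a.txt", 1.5), restore(conn, "a.txt", 2.0)
```

Five blocks are written across the two versions but only three distinct ones are kept, and either version can be rebuilt by asking for the state as of a given time.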