Tech/Engineering

The real big data — protecting your NAS systems

Stephen Manley, CTO

Big data is changing the world, just not in the way we expected. When big data hype swept through the IT landscape, the market deified new tools, even as customers struggled to figure out how and where they would use them. Meanwhile, the size of core business data sets exploded across virtually every industry, so IT had to process, store, and protect exponentially larger amounts of business-critical data.

A decade of hypergrowth has broken traditional unstructured data management, so organizations need a new approach to protect the real “big data” — surveillance video, medical images, genomics research, geologic exploration, and more.

Big data, not big tools

While everybody was trying to figure out which tool could make sense of their data lake, their NAS systems exploded.

At the peak of the big data hype cycle, organizations expected that tools like Hadoop would revolutionize their businesses. One prominent customer’s team believed that if they aggregated enough data into a Hadoop cluster, the system would spring to life and deliver answers to questions they didn’t even know to ask. While the crowd flocked to ever-newer tools (Hadoop to Cassandra to MongoDB to Apache Spark), most still didn’t know what to do with them.

Under the radar, the size of data in virtually every industry has exploded in the last decade. From high-resolution 2D seismic images to 3D medical scans to 4K UHD media, core business applications are processing more data than ever. Most of that data is stored on network-attached storage (NAS) systems, which has fueled exponential data growth in the scale-out NAS market. Today, customers find themselves storing petabytes of data that drives their business and connects them with their customers.

Big data, big challenges

The challenge of managing “big data” on these NAS systems begins with the size, but it doesn’t end there.

For most applications, the data has a short half-life of value but a long retention period. It needs high-performance processing initially, but once that processing completes, it is unlikely to be accessed again. Still, the data must be retained for compliance or internal governance, and many companies hold it for 10, 20, or even 99+ years. Unfortunately, most companies store all of this data on NAS systems that offer neither the cost/performance nor the cost/capacity they need. With such a large volume of data, “one-size-fits-all” is not acceptable.

Unfortunately, the data is also poorly suited to movement. Most of it is video, image, or machine-generated binary data, so traditional compression and deduplication techniques are ineffective. Without the ability to shrink the data, customers must move GB- to TB-sized files around their environment, and most environments lack the internal bandwidth for moving that much data.

Given the challenges of managing the active data, IT struggles to even consider backup. Some store snapshots to protect against accidental errors and hope that nothing happens to the system with their only copy of data. Others double the cost of their storage and replicate the data to a second NAS system. Rarely do they follow the 3-2-1 rule (3 copies, 2 types of media, 1 offsite).

Requirements for success

In the past 18 months, a growing number of customers have stopped trying to overcome these challenges with brute force and have instead rethought their approach to managing their big data applications.

They begin by setting requirements, because doing so reveals how far their current design falls short of the business’s needs:

  • Storage: Enable high-performance processing while minimizing the cost of storing data that is not being used.
  • Retrieval: When older data needs to be retrieved, find it based on characteristics of the data rather than arbitrary identifiers such as file names, which are usually assigned automatically.
  • Protection: Follow the 3-2-1 rule to protect against common failures — user/application errors, system failures, external disasters, and cyberthreats.
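
To make the 3-2-1 rule concrete, here is a minimal sketch of a compliance check in Python. The Copy descriptor (media type, offsite flag) is a hypothetical stand-in for whatever copy inventory your backup tooling actually exposes.

```python
# Minimal 3-2-1 check: 3 copies, 2 types of media, 1 offsite.
# The Copy fields below are illustrative, not tied to any product.
from dataclasses import dataclass

@dataclass
class Copy:
    name: str
    media: str      # e.g. "nas", "object", "tape"
    offsite: bool

def meets_3_2_1(copies: list[Copy]) -> bool:
    """Return True if the copies satisfy 3 copies / 2 media types / 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )

copies = [
    Copy("production", media="nas", offsite=False),
    Copy("dr-replica", media="nas", offsite=True),
    Copy("cloud-backup", media="object", offsite=True),
]
print(meets_3_2_1(copies))  # True
```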

The patches and workarounds on their legacy systems fail to meet even lowered expectations, which opens the architectural discussion for the first time in a decade.

Production architecture for big data

Customers successfully managing big data share an emerging architectural pattern.

Storage caching replaces storage tiering. With the growing disparity between high-performance (storage-class memory) and high-capacity (HDD) storage, there is little value in middle tiers (e.g., SSD) for these data sets. The data should live in high-capacity storage; when needed, it is pulled into the performance tier, processed, and then deleted. That way, the high-performance tier does not waste cycles shuffling data through intermediate tiers. The advent of faster networks (e.g., 400GbE) makes that movement possible at scale.
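
Here is a minimal sketch of that cache-style workflow, assuming a local NVMe scratch directory stands in for the performance tier and a NAS or object mount stands in for the capacity tier; the paths and the process() callable are illustrative placeholders.

```python
# Pull a file from the capacity tier into fast scratch, process it,
# then delete the scratch copy so the performance tier stays free.
import shutil
from contextlib import contextmanager
from pathlib import Path

CAPACITY_TIER = Path("/mnt/nas/archive")   # assumed capacity-tier mount
PERFORMANCE_TIER = Path("/nvme/scratch")   # assumed local NVMe scratch

@contextmanager
def cached(relative_path: str):
    """Copy a file into the performance tier, yield it, then remove it."""
    src = CAPACITY_TIER / relative_path
    dst = PERFORMANCE_TIER / relative_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    try:
        yield dst
    finally:
        dst.unlink(missing_ok=True)        # release the fast tier immediately

def process(path: Path) -> None:
    print(f"processing {path} ({path.stat().st_size} bytes)")

with cached("seismic/survey-042.segy") as local_file:
    process(local_file)
```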

They separate metadata and data. Data access always begins with metadata, as applications search for and load a file by name or tag. Unfortunately, by co-locating metadata and data, traditional file systems constrain how much metadata developers can store and how quickly they can access it. Separating the metadata makes it simpler to enrich content (e.g., video and images), categorize data, and search for what matters.
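
As an illustration, here is a minimal sketch of a metadata store kept apart from the data, using an in-memory SQLite table. The table, columns, and object URIs are hypothetical; a production deployment would more likely use a scale-out NoSQL store, but the principle of searching by attributes rather than file names is the same.

```python
# Searchable metadata lives in a small database; the bulk data stays
# on NAS or object storage and is referenced by URI.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE file_metadata (
        object_uri TEXT PRIMARY KEY,   -- where the data actually lives
        modality   TEXT,               -- e.g. 'mri', 'seismic', '4k-video'
        patient_id TEXT,
        captured   TEXT,               -- ISO-8601 timestamp
        size_bytes INTEGER
    )
""")
db.execute(
    "INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?)",
    ("s3://imaging/scan-8841.dcm", "mri", "P-2207",
     "2023-05-14T09:30:00", 512_000_000),
)

# Find candidate files by their characteristics, not by file name.
rows = db.execute(
    "SELECT object_uri FROM file_metadata WHERE modality = ? AND captured >= ?",
    ("mri", "2023-01-01"),
).fetchall()
print(rows)  # [('s3://imaging/scan-8841.dcm',)]
```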

While customers are implementing this architecture with different components, the principles remain the same:

  • High-performance edge platforms — Central GPU-driven processing with storage-class memory and/or distributed edge devices
  • Inexpensive, scalable, centralized storage — scale-out NAS or object
  • Metadata stores — NoSQL databases holding core and/or extended data attributes
  • Location — On-premises, but shifting to the cloud

Protection architecture for big data

There is no silver bullet for protecting vast unstructured data repositories. Successful organizations use a multi-tier protection strategy to meet their RPO/RTO objectives, retention requirements, and cost targets.

Protection begins with snapshots and replicas. Customers satisfy urgent recoveries — user and application errors — from the snapshots. Meanwhile, they run a cost/benefit analysis to determine which applications and datasets require the rapid recovery of a disaster recovery replica.

They meet compliance and long-term retention requirements with offsite backups, which prioritize cost optimization over performance. First, the backups minimize the load on production storage. Second, the backups are stored on the lowest-cost, coldest storage. Finally, the backups offer rich metadata access, so customers can identify exactly what they need without touching the underlying data.
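
For example, a backup catalog with rich metadata lets customers decide what to recall from cold storage without reading a byte of the data itself. A minimal sketch follows; the catalog entries and fields are hypothetical.

```python
# Query a backup catalog by metadata; only the matching backup IDs
# ever need to be recalled from cold object storage.
from datetime import date

catalog = [
    {"backup_id": "bk-101", "dataset": "genomics", "tag": "run-2019-q3",
     "tier": "cold-object", "expires": date(2119, 1, 1)},
    {"backup_id": "bk-102", "dataset": "genomics", "tag": "run-2024-q1",
     "tier": "cold-object", "expires": date(2124, 1, 1)},
    {"backup_id": "bk-103", "dataset": "media", "tag": "broadcast-4k",
     "tier": "cold-object", "expires": date(2044, 1, 1)},
]

def find_backups(dataset: str, tag: str) -> list[str]:
    """Return the backup IDs to recall; the backed-up data is never touched."""
    return [entry["backup_id"] for entry in catalog
            if entry["dataset"] == dataset and entry["tag"] == tag]

print(find_backups("genomics", "run-2019-q3"))  # ['bk-101']
```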

Not surprisingly, the backup architecture looks like the production architecture:

  • High-performance edge platforms — local snapshots
  • Inexpensive, scalable centralized storage — cold object storage
  • Metadata stores — databases to store the backup catalog
  • Location — the cloud

Conclusion

While everybody was waiting for Hadoop, Cassandra, or Spark to turn their big data dreams into a reality, unstructured big data became a core part of their business.

In virtually every industry, core business applications use PBs of data that is stored on NAS systems. After years of scaling without investing, organizations find themselves struggling to store, process, and protect their most important data.

As companies adopt new unstructured data architectures — centralized storage, distributed compute, and separated data and metadata — they must also address their protection challenges. Short-term recoveries should be handled locally, while long-term retention (10, 20, even 99+ years) demands a backup architecture that delivers the lowest cost.

It is possible to manage and protect your unstructured data — it’s time to expect more.

Learn how Druva simplifies NAS backup with the cloud.