Tech/Engineering

The real big data — protecting your NAS systems

Stephen Manley, CTO

Big data is changing the world, just not in the way we expected. When big data hype swept through the IT landscape, the market deified new tools, even as customers struggled to figure out how and where they would use them. Meanwhile, the size of core business data sets exploded across virtually every industry, so IT had to process, store, and protect exponentially larger amounts of business-critical data.

A decade of hypergrowth has broken traditional unstructured data management, so organizations need a new approach to protect the real “big data” — surveillance video, medical images, genomics research, geologic exploration, and more.

Big data, not big tools

While everybody was trying to figure out which tool could make sense of their data lake, their NAS systems exploded.

At the peak of the big data hype cycle, organizations expected that tools like Hadoop would revolutionize their businesses. One prominent customer’s team believed that if they aggregated enough data into a Hadoop cluster, the system would spring to life and deliver answers to questions they didn’t even know to ask. While the crowd flocked to ever-newer tools (Hadoop to Cassandra to MongoDB to Apache Spark), most still didn’t know what to do with them.

Under the radar, the size of data in virtually every industry has exploded in the last decade. From high-resolution 2D seismic images to 3D medical scans to 4K UHD media, core business applications are processing more data than ever. Most of that data is stored on network-attached storage (NAS) systems, which has fueled exponential data growth in the scale-out NAS market. Today, customers find themselves storing petabytes of data that drives their business and connects them with their customers.

Big data, big challenges

The challenge of managing “big data” on these NAS systems begins with the size, but it doesn’t end there.

For most applications, the data has a short half-life of value but a long retention period. It needs high-performance processing initially, but once that processing completes, it is unlikely to be accessed again. Still, the data must be retained for compliance or internal governance, and many companies hold it for 10, 20, or even 99+ years. Unfortunately, most companies store all of this data on NAS systems that offer neither the cost/performance nor the cost/capacity they need. With such a large volume of data, “one-size-fits-all” is not acceptable.

Unfortunately, the data is also poorly suited to movement. Most of it is video, image, or machine-generated binary data, so traditional compression and deduplication techniques are ineffective. Without the ability to shrink the data, customers must move GB- to TB-sized files around their environment, and most environments lack the internal bandwidth for moving that much data.

Given the challenges of managing the active data, IT struggles to even consider backup. Some store snapshots to protect against accidental errors and hope that nothing happens to the system with their only copy of data. Others double the cost of their storage and replicate the data to a second NAS system. Rarely do they follow the 3-2-1 rule (3 copies, 2 types of media, 1 offsite).

Requirements for success

In the past 18 months, a growing number of customers have stopped trying to overcome these challenges with brute force and have instead rethought their approach to managing their big data applications.

They begin by setting requirements, because doing so reveals how far their current design falls short of the business’s needs:

  • Storage: Enable high-performance processing while minimizing the cost of storing data that is not being used.
  • Retrieval: When older data needs to be retrieved, find it based on characteristics of the data rather than arbitrary identifiers such as file names, which are usually assigned automatically.
  • Protection: Follow the 3-2-1 rule to protect against common failures — user/application errors, system failures, external disasters, and cyberthreats.
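
To make the 3-2-1 rule concrete, here is a minimal sketch of a compliance check in Python. The Copy descriptor (media type, offsite flag) is a hypothetical stand-in for whatever copy inventory your backup tooling actually exposes.

```python
# Minimal 3-2-1 check: 3 copies, 2 types of media, 1 offsite.
# The Copy fields below are illustrative, not tied to any product.
from dataclasses import dataclass

@dataclass
class Copy:
    name: str
    media: str      # e.g. "nas", "object", "tape"
    offsite: bool

def meets_3_2_1(copies: list[Copy]) -> bool:
    """Return True if the copies satisfy 3 copies / 2 media types / 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )

copies = [
    Copy("production", media="nas", offsite=False),
    Copy("dr-replica", media="nas", offsite=True),
    Copy("cloud-backup", media="object", offsite=True),
]
print(meets_3_2_1(copies))  # True
```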

The patches and workarounds on their legacy systems fail to meet even lowered expectations, which opens the architectural discussion for the first time in a decade.

Production architecture for big data

Customers successfully managing big data share an emerging architectural pattern.

Storage caching replaces storage tiering. With the growing disparity between high-performance (storage-class memory) and high-capacity (HDD) storage, there is little value in middle tiers (e.g., SSD) for these data sets. The data should live in high-capacity storage; when needed, it is pulled into the performance tier, processed, and then deleted. That way, the high-performance tier does not waste cycles shuffling data through intermediate tiers. The advent of faster networks (e.g., 400GbE) makes that movement possible at scale.
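
Here is a minimal sketch of that cache-style workflow, assuming a local NVMe scratch directory stands in for the performance tier and a NAS or object mount stands in for the capacity tier; the paths and the process() callable are illustrative placeholders.

```python
# Pull a file from the capacity tier into fast scratch, process it,
# then delete the scratch copy so the performance tier stays free.
import shutil
from contextlib import contextmanager
from pathlib import Path

CAPACITY_TIER = Path("/mnt/nas/archive")   # assumed capacity-tier mount
PERFORMANCE_TIER = Path("/nvme/scratch")   # assumed local NVMe scratch

@contextmanager
def cached(relative_path: str):
    """Copy a file into the performance tier, yield it, then remove it."""
    src = CAPACITY_TIER / relative_path
    dst = PERFORMANCE_TIER / relative_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    try:
        yield dst
    finally:
        dst.unlink(missing_ok=True)        # release the fast tier immediately

def process(path: Path) -> None:
    print(f"processing {path} ({path.stat().st_size} bytes)")

with cached("seismic/survey-042.segy") as local_file:
    process(local_file)
```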

They separate metadata and data. Data access always begins with metadata, as applications search for and load a file by name or tag. Unfortunately, by co-locating metadata and data, traditional file systems constrain how much metadata developers can store and how quickly they can access it. Separating the metadata makes it simpler to enrich content (e.g., video and images), categorize data, and search for what matters.
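
As an illustration, here is a minimal sketch of a metadata store kept apart from the data, using an in-memory SQLite table. The table, columns, and object URIs are hypothetical; a production deployment would more likely use a scale-out NoSQL store, but the principle of searching by attributes rather than file names is the same.

```python
# Searchable metadata lives in a small database; the bulk data stays
# on NAS or object storage and is referenced by URI.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE file_metadata (
        object_uri TEXT PRIMARY KEY,   -- where the data actually lives
        modality   TEXT,               -- e.g. 'mri', 'seismic', '4k-video'
        patient_id TEXT,
        captured   TEXT,               -- ISO-8601 timestamp
        size_bytes INTEGER
    )
""")
db.execute(
    "INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?)",
    ("s3://imaging/scan-8841.dcm", "mri", "P-2207",
     "2023-05-14T09:30:00", 512_000_000),
)

# Find candidate files by their characteristics, not by file name.
rows = db.execute(
    "SELECT object_uri FROM file_metadata WHERE modality = ? AND captured >= ?",
    ("mri", "2023-01-01"),
).fetchall()
print(rows)  # [('s3://imaging/scan-8841.dcm',)]
```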

While customers are implementing this architecture with different components, the principles remain the same:

  • High-performance edge platforms — Central GPU-driven processing with storage-class memory and/or distributed edge devices
  • Inexpensive, scalable, centralized storage — scale-out NAS or object
  • Metadata stores — NoSQL databases holding core and/or extended data attributes
  • Location — On-premises, but shifting to the cloud

Protection architecture for big data

There is no silver bullet for protecting vast unstructured data repositories. Successful organizations use a multi-tier protection strategy to meet their RPO/RTO objectives, retention requirements, and cost targets.

Protection begins with snapshots and replicas. Customers satisfy urgent recoveries — user and application errors — from the snapshots. Meanwhile, they run a cost/benefit analysis to determine which applications and datasets require the rapid recovery of a disaster recovery replica.

They meet compliance and long-term retention requirements with offsite backups, which prioritize cost optimization over performance. First, the backups minimize the load on production storage. Second, the backups are stored on the lowest-cost, coldest storage. Finally, the backups offer rich metadata access, so customers can identify exactly what they need without touching the underlying data.
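
For example, a backup catalog with rich metadata lets customers decide what to recall from cold storage without reading a byte of the data itself. A minimal sketch follows; the catalog entries and fields are hypothetical.

```python
# Query a backup catalog by metadata; only the matching backup IDs
# ever need to be recalled from cold object storage.
from datetime import date

catalog = [
    {"backup_id": "bk-101", "dataset": "genomics", "tag": "run-2019-q3",
     "tier": "cold-object", "expires": date(2119, 1, 1)},
    {"backup_id": "bk-102", "dataset": "genomics", "tag": "run-2024-q1",
     "tier": "cold-object", "expires": date(2124, 1, 1)},
    {"backup_id": "bk-103", "dataset": "media", "tag": "broadcast-4k",
     "tier": "cold-object", "expires": date(2044, 1, 1)},
]

def find_backups(dataset: str, tag: str) -> list[str]:
    """Return the backup IDs to recall; the backed-up data is never touched."""
    return [entry["backup_id"] for entry in catalog
            if entry["dataset"] == dataset and entry["tag"] == tag]

print(find_backups("genomics", "run-2019-q3"))  # ['bk-101']
```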

Not surprisingly, the backup architecture looks like the production architecture:

  • High-performance edge platforms — local snapshots
  • Inexpensive, scalable centralized storage — cold object storage
  • Metadata stores — databases to store the backup catalog
  • Location — the cloud

Conclusion

While everybody was waiting for Hadoop, Cassandra, or Spark to turn their big data dreams into a reality, unstructured big data became a core part of their business.

In virtually every industry, core business applications use PBs of data that is stored on NAS systems. After years of scaling without investing, organizations find themselves struggling to store, process, and protect their most important data.

As companies adopt new unstructured data architectures — centralized storage, distributed compute, and separated data and metadata — they must also address their protection challenges. Short-term recoveries should be handled locally, while long-term retention (10, 20, even 99+ years) demands a backup architecture that delivers the lowest cost.

It is possible to manage and protect your unstructured data — it’s time to expect more.

Learn how Druva simplifies NAS backup with the cloud.