Innovation Series

Realistic Synthetic Data at Scale: Safely Using Production Data to Generate Models

Mehul Sheth, Sr. Principal Engineer

“But the lab results were promising! We could handle more load than this in our test environments!”

No one wants to hear such excuses after a production issue. Actually, no one wants a production issue in the first place! One of the many reasons defects leak into production is poor test data quality. If the test data is not representative of production data, certain edge cases are missed and the product's behavior in production differs from the expected behavior, which can result in outages or data loss. This blog describes the process of generating realistic synthetic data at scale, informed by production data, highlights the challenges in generating meaningful data, and explains how these challenges were overcome at Druva.

Druva offers 100% SaaS data protection and management as a service. Our products handle petabyte-scale customer data and must scale in production as customer needs grow. This blog details our approach and the reasoning behind the decisions we took, so you get a glimpse of our journey along with the final outcome. By the end of the blog, you will understand how to create a model of production data without exposing that data, and how to use it to generate synthetic test data that is production-like, realistic, and at scale.

Process


Fig. 1: Process 

 

As shown in Figure 1, we first identified the characteristics of data that impact product behavior, i.e. the criteria to study, such as the size of the files being backed up, file names and paths, etc. We then analyzed the production data and created a model based on the identified criteria. This model was used to generate data at scale.

Do we take one peek at production data, build a fancy model, and generate test data from it for ages to come? No! This is a continuous process: we need to analyze production data at reasonable intervals and update the model. The interval can be time-based, such as once a month or quarter, or event-based, such as when a significant new client is added or whenever the business sees fit.

Outcome

These efforts produced the following key results: 

  1. Better insights into how our customers use our products, what type of data they back up (with customers' data privacy as our top priority, of course), and how they access this data. Where customers could make better use of our products, we raised action items to proactively help them or to adjust policies. 

  2. Better architectural decisions based on production findings rather than assumptions. These included optimized storage for certain file types, e.g. videos, because video content in our customer data has grown significantly. In addition, software-imposed checks or limits can be introduced to prevent misuse of the product.

Production Data Modeling


Fig. 2: Production Data Modeling

 

The major challenge in mining production data was getting access without breaking legal contracts with customers. In this implementation, no customer files were ever opened, even in code. The code reads only metadata, such as file sizes and extensions. The next challenge was selecting "enough" data and scaling the sample so that key findings could be extrapolated. 

An example of scaling: a new prospect wants to back up a Google shared drive with a TB of data, but production has so far only seen a maximum drive size of a few hundred GBs. How do we generate 10x the observed production load?


Fig. 3: Sample Selections based on Statistical Criteria

 

The first step is to identify which data needs to be parsed. Each blue dot in the image represents a folder or file system. The X-axis denotes the total size of the dataset and the Y-axis the total number of files in it. As the total dataset size increases, the number of files increases and the range of total files widens as well. Four points from this plot were selected, as highlighted: some from around the median, some at the 90th percentile on both X and Y, and the two extremes.


Fig. 4: Density plots

 

On a 2D density graph, three more spots of interest were observed. The darker areas show where sets are concentrated and where the majority of data points lie (the mode of the data). A few samples from around these seven areas were selected; we read their metadata and built a model for all the criteria that matter to us.

DIRS_PER_DIR = {0.5:(0,1),…}

FILES_PER_DIR = {0.2:(0,3),0.5:(3,5),0.75:(5,12)…1:(60,401)}

Fig. 5: Dictionary representation of sample features of our model

 

For example, the files-per-directory model is read as follows: 20% of directories in our data set have between 0 and 2 files; 30% (0.5 minus 0.2) have between 3 and 4; 25% have between 5 and 11; and so on, up to a maximum of 400 files per directory.
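As a rough illustration, a bucketed model like this can be sampled by drawing a uniform random number and finding the first cumulative-probability key that covers it. The sketch below uses only the buckets shown in Fig. 5 (the elided middle buckets are omitted) and is an assumption about how such a dictionary could be consumed, not Druva's actual code.

```python
import random

# Bucketed model as in Fig. 5: cumulative probability -> half-open (low, high) range.
# Only the buckets shown in the blog are included; the elided ones are omitted.
FILES_PER_DIR = {0.2: (0, 3), 0.5: (3, 5), 0.75: (5, 12), 1.0: (60, 401)}

def sample_files_per_dir(model=FILES_PER_DIR):
    """Draw a file count for one directory from the bucketed distribution."""
    r = random.random()
    for cum_prob in sorted(model):
        if r <= cum_prob:
            low, high = model[cum_prob]
            return random.randrange(low, high)  # (0, 3) yields 0, 1, or 2 files
    low, high = model[max(model)]               # floating-point safety net
    return random.randrange(low, high)

# Sample 1,000 directories and sanity-check the first bucket's share.
counts = [sample_files_per_dir() for _ in range(1000)]
print(sum(c < 3 for c in counts) / len(counts))  # roughly 0.2
```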


Fig. 6: Sample features - Visualization

 

If you observe closely, most of the small files are text-based, like JSON or plain text (though even binary files such as PNGs can be small), while the large files are all binary. The middle bucket is a mix of text files, like CSS and JSON, and binary files, like PDF and XLSX.

Look at the plot of files at a given depth: 10% of the files were at a depth of 1, and only 5% of the files were at a depth greater than 15. Similarly, half the directories have no subdirectories, and 96% of the directories have fewer than 9 subdirectories.

Suppose the model is used to generate an array of 100 files for simplicity; it will place 5 files at a depth of at least 15. Now, fitting the other models of files per directory and directories per directory, it is very unlikely that any directories exist at depth 15, so the distributions need to be tweaked slightly so that, for example, at least one subdirectory at level 10 holds 10 files (accounting for the last two buckets combined). The code makes such modifications at runtime and produces an output that best matches the distributions in the model. Smaller data sets need more iterations to reach this balance, and larger data sets take more time per iteration.

Synthetic Data Generation


Fig. 7: Synthetic Data Generation

 

Now that we have a model in place, suppose we want to generate data mimicking production for a new feature release, for example, a feature that moves data to a different storage tier. The requirements also note that because this tier offers storage at a much lower price, clients are expected to store more data in it, so it must be tested with 2x the production load.

  1. You start by deciding the size of the target filesystem; this is the input to the code

  2. The code divides the total size into files of different sizes fitting the distribution

  3. Depending on the total number of files, the directory distribution is defined, that is total directories at each level

  4. Files are distributed in these directories 

  5. File names are generated

  6. Extensions are mapped to the files

At each step, the arrays are validated. If validation fails, the process repeats and modifies the input model.

Finally, the best-fit arrays are chosen and passed to the next step, i.e. data generation.
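A highly simplified sketch of this generate-and-validate loop is shown below for the file-size step alone; the function names, tolerance, and bucket values are illustrative assumptions, not Druva's implementation.

```python
import random

TOLERANCE = 0.05        # illustrative: accept up to 5% overshoot of the target size
MAX_ITERATIONS = 50     # illustrative retry budget

def generate_candidate(total_size, size_buckets):
    """Step 2: draw file sizes from (low, high) byte ranges until the target total is reached."""
    sizes, generated = [], 0
    while generated < total_size:
        low, high = random.choice(size_buckets)
        size = random.randint(low, high)
        sizes.append(size)
        generated += size
    return sizes

def validate(sizes, total_size):
    """Accept the candidate only if its overshoot is within tolerance of the target."""
    return (sum(sizes) - total_size) / total_size <= TOLERANCE

def best_fit_sizes(total_size, size_buckets):
    """Retry generation until a candidate validates, keeping the closest fit as a fallback."""
    best = None
    for _ in range(MAX_ITERATIONS):
        candidate = generate_candidate(total_size, size_buckets)
        if validate(candidate, total_size):
            return candidate
        if best is None or abs(sum(candidate) - total_size) < abs(sum(best) - total_size):
            best = candidate
    return best  # fall back to the closest candidate found

# Example: split ~1 GB into small and large files drawn from two hypothetical buckets.
sizes = best_fit_sizes(10**9, [(1_000, 50_000), (50_000, 5_000_000)])
print(len(sizes), sum(sizes))
```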

These are a few algorithms that were explored and applied for data generation.

Text generation

Random text

This is a very simple and fast approach: random text is generated from ASCII characters. The disadvantage is that the data is garbled; it doesn't make sense and contains no meaningful words. The advantage is that you can control the size of the data you need very accurately (useful for generating file names).
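A minimal sketch of this approach; the alphabet and the usage example are illustrative choices:

```python
import random
import string

def random_text(size):
    """Generate `size` characters of random printable ASCII: fast and exact-length, but meaningless."""
    alphabet = string.ascii_letters + string.digits + " "
    return "".join(random.choice(alphabet) for _ in range(size))

print(random_text(12))  # e.g. a 12-character file name stem
```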

Obfuscated using Caesar's cipher

This is also simple and fast; however, you will need a corpus, i.e. a selection of words of different lengths. If you want a three-character word and you select "MAN" from your bag of words and shift each character by three, MAN becomes P-D-Q; ZEBRA becomes E-J-G-W-F with a shift of five. This only weakly protects the original corpus, since for a limited vocabulary it is easy to guess the original word even without knowing the shift. However, if your corpus is public data, you need not bother applying the cipher at all; simply select words from the bag that match the required size.
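A small sketch of the shifting step, with a tiny hypothetical corpus keyed by word length:

```python
import random

# Hypothetical corpus bucketed by word length; a real corpus would be far larger.
CORPUS = {3: ["MAN", "CAT", "SUN"], 5: ["ZEBRA", "APPLE", "HOUSE"]}

def caesar(word, shift):
    """Shift each A-Z character by `shift`, wrapping around the alphabet."""
    return "".join(chr((ord(c) - ord("A") + shift) % 26 + ord("A")) for c in word)

def obfuscated_word(length, shift):
    """Pick a corpus word of the requested length and weakly obfuscate it."""
    return caesar(random.choice(CORPUS[length]), shift)

print(caesar("MAN", 3))    # PDQ, as in the example above
print(caesar("ZEBRA", 5))  # EJGWF
print(obfuscated_word(5, 5))
```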

Statistical model

If you have a large corpus of words, you can parse it and save the distribution with probabilities. Example:

  • Corpus: THE, THIN, THIS (only 3 words for simplicity)

Prediction:

  • 1st Character: T (100% Probability)

  • 2nd Character: H (100% Probability)

  • 3rd Character: E (33% Probability) or I (67% Probability)

  • 4th Character: - (33%) or N (33%) or S (33%)

Possible predictions: THE, THI, THEN, THIN, THES, THIS 

Some of the predictions are not from the corpus and may not be dictionary words at all, such as "THES."
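A minimal sketch of such a per-position character model, built from the three-word corpus above; the end-of-word marker "-" and the function names are illustrative:

```python
import random
from collections import defaultdict

def build_position_model(corpus):
    """Count which characters appear at each position; '-' marks the end of a word."""
    counts = defaultdict(lambda: defaultdict(int))
    longest = max(len(w) for w in corpus)
    for word in corpus:
        padded = word + "-" * (longest + 1 - len(word))
        for i, ch in enumerate(padded):
            counts[i][ch] += 1
    return counts

def predict_word(model):
    """Pick a character per position according to the observed frequencies; stop at '-'."""
    out = []
    for pos in sorted(model):
        chars = list(model[pos])
        weights = [model[pos][c] for c in chars]
        ch = random.choices(chars, weights=weights)[0]
        if ch == "-":
            break
        out.append(ch)
    return "".join(out)

model = build_position_model(["THE", "THIN", "THIS"])
print(predict_word(model))  # e.g. THIS, THIN, THE, or a non-word such as THES
```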

Markov chain

The last algorithm implemented is similar to the previous one, except that if the third character is selected as I, the fourth character will be N or S with 50% probability each, i.e. each choice depends on what came before. In our case, the implementation is based on words, not characters. 

There are also three variations of the algorithm: a unigram is a one-word sequence, a bigram is a two-word sequence, and a trigram is a three-word sequence. A unigram does not depend on the previous word. For example, "statistics" is a unigram (n = 1), "machine learning" is a bigram (n = 2), and "natural language processing" is a trigram (n = 3). If the selected combination doesn't match any known pattern, you select random words; the same approach is used to initialize the chain. These algorithms are used for textual content, such as notepad files or file and directory names.
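As an illustration, a word-level bigram chain can be built and sampled as below; the toy corpus and the random-word fallback are assumptions for the sketch, not the production corpus:

```python
import random
from collections import defaultdict

def build_bigram_chain(text):
    """Map each word to the list of words observed immediately after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain, words

def generate_text(chain, words, length):
    """Walk the chain; fall back to a random word when a word has no known successor."""
    current = random.choice(words)          # initialize the chain with a random word
    output = [current]
    for _ in range(length - 1):
        successors = chain.get(current)
        current = random.choice(successors) if successors else random.choice(words)
        output.append(current)
    return " ".join(output)

corpus = "the quick brown fox jumps over the lazy dog the quick dog sleeps"
chain, words = build_bigram_chain(corpus)
print(generate_text(chain, words, 8))
```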

For binary data, i.e. files like videos, images, or PDFs

Random bytes are generated and packed into a file of the desired size, and compression is controlled by selecting the range of the random numbers. It is a simple but very effective solution. Range and compressibility have an inverse relationship, i.e. the wider the range, the lower the compression ratio you get.
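A rough sketch of this idea: restricting how many distinct byte values appear makes the output more repetitive and therefore more compressible. The byte ranges and sizes below are illustrative.

```python
import random
import zlib

def random_binary(size, byte_range=256):
    """Generate `size` random bytes drawn from `byte_range` distinct values;
    a narrower range produces more repetitive, more compressible data."""
    return bytes(random.randrange(byte_range) for _ in range(size))

for byte_range in (4, 64, 256):
    data = random_binary(100_000, byte_range)
    ratio = len(data) / len(zlib.compress(data))
    print(f"range={byte_range:3d}  compression ratio ~ {ratio:.2f}")
```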

Conclusion

To obtain production-like test data, the relevant characteristics that impact product behavior were identified, analyzed from production, and modeled. The model was then used to generate test data for lower environments. These data sets were used in feature, scale, and performance testing.

Depending on the use case, a similar characterization of production workloads can be carried out in other industries and problem domains to generate better, production-informed test data.

Next Steps

Learn more about the technical innovations and best practices powering cloud backup and data management on the Innovation Series section of Druva’s blog archive.