Tech/Engineering, Innovation Series

Data mining the Druva Data Resiliency Cloud

April 22, 2020 Gaurav Dhadiwal, Principal Engineer, Druva Labs

The Druva Data Resiliency Cloud gives our customers a single, unified repository of all of their data so that they can do more with their backup data. From analytics to compliance, our federated search capabilities are built into the core of our platform, and leverage keyword extraction techniques to optimize your results.

Keyword extraction or identification is a technique to identify the important terms or phrases in a document that help describe the content of the document. The output terms or phrases are often known as keywords, key phrases, key terms, or key segments.

Keyword extraction helps in understanding the context of an entire document. More than 75% of the data generated is unstructured, and keyword extraction enables us to classify unstructured data based on the similarity of keywords. For a data scientist, finding articles, papers, or blog posts in a huge data lake is not a trivial job. Keyword extraction can help identify the set of relevant documents, which can then be indexed and served by keyword.

A huge data lake or data residing in the cloud can be indexed and clustered into datasets based on keywords extracted from document content, which makes searching for documents faster. Keyword extraction also helps detect inappropriate comments, feedback, or chat messages so that administrators can block them early.

A typical keyword extraction algorithm has two main phases: candidate selection and scoring/ranking candidates.

Candidate selection

The crux of keyword extraction lies in selecting the right keywords. The first step is to identify all words, terms, and phrases that could potentially be keywords. There are a number of ways to identify keyword candidates. One is a brute-force method that considers all words, using external knowledge bases like WordNet or Wikipedia as a reference source of good or bad keyphrases.

Other common heuristic methods include frequency calculation and the removal of stop words (like “a,” “and,” “but,” “how,” “or,” “what,” etc.) and punctuation to identify keyword candidates. Examples of these algorithms are n-grams, TF-IDF, word2vec, TextRank, and RAKE. These algorithms need little to no knowledge of the language of the underlying document. Other methods, by contrast, do need such knowledge: they identify keywords based on part-of-speech (POS) tagging, and after tagging, noun phrases or verb phrases are selected as keyword candidates. Libraries like spaCy and NLTK are used for POS tagging.

Scoring/ranking candidates

The final phase of keyword extraction is ranking or scoring the keywords selected in the candidate selection phase. Various methods are used to rank and score the keywords and then drop those that fall outside the accepted range. Methods such as minimum and maximum frequency thresholds or the degree of words are used to decide the rank of each keyword.
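The frequency-threshold idea can be sketched in a few lines of Python. This is only an illustration: the function name `filter_by_frequency` and its `min_count`/`max_count` parameters are hypothetical, and real systems usually combine thresholds with other scoring signals.

```python
from collections import Counter


def filter_by_frequency(candidates, min_count=2, max_count=None):
    """Keep candidates whose frequency lies within [min_count, max_count].

    Hypothetical helper for illustration; returns (candidate, count) pairs
    ranked by descending count, then alphabetically.
    """
    counts = Counter(candidates)
    kept = {
        cand: n
        for cand, n in counts.items()
        if n >= min_count and (max_count is None or n <= max_count)
    }
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


candidates = ["cloud", "data", "cloud", "backup", "data", "cloud", "era"]
ranked = filter_by_frequency(candidates, min_count=2)
```

Raising `min_count` drops rare candidates, while setting `max_count` drops terms that are so common they carry little meaning.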

In the next section, we will elaborate on one POS-based technique (spaCy) and one non-POS based technique (RAKE).

spaCy

spaCy is a free, open-source library for Natural Language Processing (NLP) in Python. spaCy provides features such as tokenization, lemmatization, part-of-speech tagging, sentence recognition, and word-to-vector transformations. One can use POS tagging for keyword candidate selection. POS tagging is the process of marking the grammatical category (e.g., noun, verb, adverb, adjective, etc.) of each word. Once a sentence has been tagged, you can select adjective and noun combinations as good candidates for keywords. One can also use a list of verbs to form candidates for keyword selection.

For the text “The cloud has changed everything,” POS tagging will look like:

Text	POS	Tag
The	DET	DT
cloud	NOUN	NN
has	AUX	VBZ
changed	VERB	VBN
everything	PRON	NN

If we select words with the noun tag (NN) as candidates for keywords, then “cloud” and “everything” become keywords.
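The selection step can be sketched as follows, using the tagged tokens from the table above as input. The helper names (`noun_phrase_candidates`, `tag_with_spacy`) and the choice of Penn Treebank tag sets are assumptions made for illustration. The optional `tag_with_spacy` function shows one way the triples could be produced with spaCy, and it assumes spaCy and the `en_core_web_sm` model are installed.

```python
# (text, coarse POS, fine-grained tag) triples copied from the table above.
TAGGED = [
    ("The", "DET", "DT"),
    ("cloud", "NOUN", "NN"),
    ("has", "AUX", "VBZ"),
    ("changed", "VERB", "VBN"),
    ("everything", "PRON", "NN"),
]

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags
ADJ_TAGS = {"JJ", "JJR", "JJS"}           # Penn Treebank adjective tags


def noun_phrase_candidates(tagged):
    """Group consecutive adjective/noun tokens into phrases that end in a noun."""
    candidates, run = [], []
    # The empty sentinel triple flushes the final run.
    for text, _pos, tag in list(tagged) + [("", "", "")]:
        if tag in NOUN_TAGS or tag in ADJ_TAGS:
            run.append((text, tag))
            continue
        # Drop trailing adjectives so that each phrase ends in a noun.
        while run and run[-1][1] in ADJ_TAGS:
            run.pop()
        if run:
            candidates.append(" ".join(word for word, _ in run))
        run = []
    return candidates


def tag_with_spacy(text, model="en_core_web_sm"):
    """Optional: build the same triples with spaCy (assumes the model is installed)."""
    import spacy

    nlp = spacy.load(model)
    return [(tok.text, tok.pos_, tok.tag_) for tok in nlp(text)]


keywords = noun_phrase_candidates(TAGGED)
```

Tags assigned by a spaCy model can differ between model versions, so the table values are used directly here rather than relying on a live tagger.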

Though POS tagging is a sound approach for selecting keyword candidates, it is a compute-intensive and time-consuming process. spaCy is best suited to applications that need more language-specific keyword identification and where the text is small, such as tweet analysis, where input tweets have a fixed character limit.

Rapid automatic keyword extraction (RAKE)

Rapid automatic keyword extraction (RAKE) is an efficient method to extract keywords from individual documents. RAKE splits a document at ‘stop’ words and punctuation marks. The word sequences that remain after removal of stop words and punctuation become the candidate phrases. The final result is further refined based on word degree. The score of a phrase is the sum of the scores of the individual words it contains. A word's degree is the number of times the word appears plus the number of additional words it appears with in candidate phrases, and a word's score is its degree divided by its frequency (the number of times it appears).

For the text “The cloud has changed everything. Now it is changing the way you protect and manage your data. Druva delivers data protection and management for the cloud era,” RAKE splits the text at punctuation and ‘stop’ words, which results in the following candidates for keywords:

cloud, changed everything, changing, way, protect, manage, data, Druva delivers data protection, management, and cloud era
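The splitting step above can be reproduced with a short Python sketch. The stop list here is a deliberately tiny, illustrative assumption chosen to cover the sample text (real stop lists are much longer), and `candidate_phrases` is a hypothetical helper name.

```python
import re

# Tiny illustrative stop list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "has", "now", "it", "is", "you", "and", "your", "for"}


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Split text at punctuation and stop words; return the remaining phrases."""
    phrases = []
    # Punctuation always ends a phrase, so split on it first.
    for fragment in re.split(r"[.,;:!?()]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stop_words:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases


TEXT = (
    "The cloud has changed everything. Now it is changing the way you "
    "protect and manage your data. Druva delivers data protection and "
    "management for the cloud era."
)
phrases = candidate_phrases(TEXT)
```

Passing a larger `stop_words` set yields shorter phrases. For example, adding "delivers" would split the longest candidate into "druva" and "data protection".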

Then, based on the degree and frequency of each word in the phrases, a score is assigned to each candidate phrase. The scores are used to rank the candidates so that only the top few are selected as keywords.
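A minimal sketch of this scoring step, applied to the candidate list above, is shown below. It assumes the scheme from the original RAKE formulation: a word's score is its degree divided by its frequency, and a phrase's score is the sum of its word scores. The name `rank_phrases` is hypothetical.

```python
from collections import defaultdict

# Candidate phrases from the example above, lowercased.
PHRASES = [
    "cloud", "changed everything", "changing", "way", "protect", "manage",
    "data", "druva delivers data protection", "management", "cloud era",
]


def rank_phrases(phrases):
    """Score each phrase as the sum of degree/frequency over its words."""
    freq = defaultdict(int)    # number of times a word appears
    degree = defaultdict(int)  # appearances plus co-occurring words
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            freq[word] += 1
            # len(words) counts the word itself plus its neighbours.
            degree[word] += len(words)
    scores = {
        phrase: sum(degree[w] / freq[w] for w in phrase.split())
        for phrase in set(phrases)
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranked = rank_phrases(PHRASES)
top_three = ranked[:3]
```

With this input, the top three are "druva delivers data protection" (14.5), "changed everything" (4.0), and "cloud era" (3.5). Longer phrases tend to score higher, which is why a short stop list produces lengthy keywords.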

RAKE provides a fast method to identify keywords; however, a limited set of ‘stop’ words makes the keywords lengthy. One can extend the ‘stop’ word list for the application being developed to get better keyword results. RAKE is suitable for applications that index and search thousands of large documents.

Conclusion

The success of a keyword extraction algorithm depends on selecting the right keyword candidates. In our experiments, we observed that for a 10 KB text document, POS tagging by spaCy took around 200 ms, while phrase extraction using spaCy took around 300 ms in total for the same text. RAKE took around 35 ms.

If you consider computation time, RAKE is the clear winner. However, keywords generated by spaCy are closer to the desired results. POS tagging provides better control over selecting noun phrases; however, this method is compute-intensive and cost-intensive. Splitting text based on ‘stop’ words and heuristics to form phrases resulted in a faster method to identify keywords in our experiments. An algorithm like RAKE relies on a limited set of ‘stop’ words to form keywords. One way to improve the results is to extend the ‘stop’ word list based on the application.

Learn more about how Druva Labs is building highly scalable and optimized solutions. Additionally, discover how the Druva Data Resiliency Cloud’s patented cloud architecture provides centralized management and a consolidated view of all your data.
