Data mining the Druva Data Resiliency Cloud

Gaurav Dhadiwal, Principal Engineer, Druva Labs

The Druva Data Resiliency Cloud gives our customers a single, unified repository of all of their data so that they can do more with their backup data. From analytics to compliance, our federated search capabilities are built into the core of our platform, and leverage keyword extraction techniques to optimize your results.

Keyword extraction or identification is a technique to identify the important terms or phrases in a document that help describe the content of the document. The output terms or phrases are often known as keywords, key phrases, key terms, or key segments.

Keyword extraction helps in understanding the context of an entire document. More than 75% of the data generated is unstructured, and keyword extraction lets us classify unstructured data based on the similarity of keywords. For a data scientist, finding articles, papers, or blog posts in a huge data lake is a non-trivial job. Keyword extraction can help identify the set of relevant documents, which can then be indexed and served by keyword.

A huge data lake, or data residing in the cloud, can be indexed and clustered into datasets based on keywords extracted from document content, which makes searching for documents faster. Keyword extraction also helps flag inappropriate comments, feedback, or chat messages so that administrators can block them early.

A typical keyword extraction algorithm has two main phases: candidate selection and scoring/ranking candidates.

Candidate selection

The crux of keyword extraction lies in selecting the right keywords. The first step is to identify all the words, terms, and phrases that could potentially be keywords. There are a number of ways to identify keyword candidates. One is a brute-force method that considers all words, using external knowledge bases like WordNet or Wikipedia as a reference source of either good or bad keyphrases.

Other common heuristic methods include frequency calculation and the removal of stop words (like “a,” “and,” “but,” “how,” “or,” “what,” etc.) and punctuation to identify keyword candidates. Examples of these algorithms are n-grams, TF-IDF, word2vec, TextRank, and RAKE. These algorithms need little to no knowledge of the language of the underlying document. Other methods, by contrast, do need that knowledge: they identify keywords based on part-of-speech (POS) tagging, and after tagging, noun phrases or verb phrases are selected as keyword candidates. Libraries like spaCy and NLTK are used for POS tagging.

Scoring/ranking candidates

The final phase of keyword extraction is ranking or scoring the keywords selected in the candidate selection phase. Various methods are used to rank and score the keywords, and keywords whose scores fall outside the accepted range are then dropped. Methods such as minimum and maximum frequency thresholds, or the degree of words, are used to decide the rank of each keyword.
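As a rough illustration of frequency-threshold ranking, the sketch below counts how often each candidate occurs, drops candidates outside a minimum/maximum range, and ranks the rest. The function name rank_by_frequency, the threshold values, and the sample candidates are all assumptions made for this example, not values from our experiments.

```python
from collections import Counter


def rank_by_frequency(candidates, min_freq=2, max_freq=5):
    """Rank candidates by frequency, keeping only those within the thresholds.

    min_freq and max_freq are illustrative defaults (assumptions).
    """
    counts = Counter(candidates)
    # Drop candidates that are too rare or too common to be useful keywords.
    kept = {c: n for c, n in counts.items() if min_freq <= n <= max_freq}
    # Most frequent first; ties are broken alphabetically for stable output.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical candidate list: "data" is too frequent, "era" too rare.
sample = ["cloud"] * 3 + ["backup"] * 2 + ["era"] + ["data"] * 9
print(rank_by_frequency(sample))  # [('cloud', 3), ('backup', 2)]
```

In practice the thresholds would be tuned to the size of the document and the application.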

In the next sections, we elaborate on one POS-based technique (using spaCy) and one non-POS-based technique (RAKE).

spaCy

spaCy is a free, open-source library for Natural Language Processing (NLP) in Python. It provides features such as tokenization, lemmatization, part-of-speech tagging, sentence recognition, and word-to-vector transformations. One can use POS tagging for keyword candidate selection. POS tagging is the process of marking the grammatical category (e.g., noun, verb, adverb, adjective) of each word. Once a sentence has been tagged, adjective and noun combinations can be selected as good candidates for keywords. One can also use a list of verbs to form keyword candidates.

For the text “The cloud has changed everything,” POS tagging will look like:

Text        POS    Tag
The         DET    DT
cloud       NOUN   NN
has         AUX    VBZ
changed     VERB   VBN
everything  PRON   NN

If we select words with the noun tag (NN) as candidates for keywords, then “cloud” and “everything” become keywords.
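A minimal sketch of this selection step is shown below. The helper names select_candidates and tag_with_spacy are hypothetical, and the tagging helper assumes that spaCy and its en_core_web_sm English model are installed. The tags a model assigns can vary between model versions, so the sample rows are copied from the table above rather than generated.

```python
def select_candidates(tagged_tokens, wanted_tags=("NN", "NNS", "NNP", "NNPS")):
    """Keep the tokens whose fine-grained tag is one of the noun tags."""
    return [text.lower() for text, pos, tag in tagged_tokens if tag in wanted_tags]


def tag_with_spacy(text, model_name="en_core_web_sm"):
    """Return (text, coarse POS, fine-grained tag) for each token.

    Assumes spaCy and the named model are installed; the import is kept
    inside the function so the rest of the sketch runs without spaCy.
    """
    import spacy

    nlp = spacy.load(model_name)
    return [(token.text, token.pos_, token.tag_) for token in nlp(text)]


# Rows copied from the table above.
TAGGED = [
    ("The", "DET", "DT"),
    ("cloud", "NOUN", "NN"),
    ("has", "AUX", "VBZ"),
    ("changed", "VERB", "VBN"),
    ("everything", "PRON", "NN"),
]

print(select_candidates(TAGGED))  # ['cloud', 'everything']
```

With spaCy installed, select_candidates(tag_with_spacy("The cloud has changed everything")) runs the same selection on freshly tagged text.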

Although POS tagging is a sound approach to selecting keyword candidates, it is a compute-intensive and time-consuming process. spaCy is best suited to applications that need more language-specific keyword identification and where the text is small, such as tweet analysis, where input tweets have a fixed character limit.

Rapid automatic keyword extraction (RAKE)

Rapid automatic keyword extraction (RAKE) is an efficient method to extract keywords from individual documents. RAKE splits the document at ‘stop’ words and punctuation marks. The word segments that remain after the removal of stop words and punctuation become candidate phrases. The final result is further refined based on word degree. The score of a phrase is the sum of the scores of the individual words it contains. The score of an individual word is its degree (the number of times the word appears plus the number of additional words it appears with) divided by its frequency (the number of times it appears).

For the text, “The cloud has changed everything. Now it is changing the way you protect and manage your data. Druva delivers data protection and management for the cloud era,” RAKE splits the text at punctuation and ‘stop’ words, which results in the following candidates for keywords:

cloud, changed everything, changing, way, protect, manage, data, Druva delivers data protection, management, and cloud era
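The splitting step can be sketched in a few lines of plain Python. This is an illustrative sketch, not the reference RAKE implementation: the function name candidate_phrases is hypothetical, and STOP_WORDS is a minimal stop list assumed for this one example text (a real application would use a much larger list).

```python
import re

# Minimal stop list chosen for this example only (an assumption).
STOP_WORDS = {"the", "has", "now", "it", "is", "you", "and", "your", "for"}

TEXT = (
    "The cloud has changed everything. Now it is changing the way you "
    "protect and manage your data. Druva delivers data protection and "
    "management for the cloud era"
)


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    # Punctuation always ends a phrase, so split into fragments first.
    for fragment in re.split(r"[.,;:!?()\"]+", text):
        current = []
        for word in fragment.split():
            if word.lower() in stop_words:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word.lower())
        if current:
            phrases.append(" ".join(current))
    return phrases


print(candidate_phrases(TEXT))
```

With this stop list, the function returns the ten candidates listed above, lowercased.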

Then, based on the degree and frequency of each word in the phrases, a score is assigned to each candidate. This score is used to rank the candidates and select a few of the phrases identified.

RAKE provides a fast method to identify keywords. However, a limited ‘stop’ word list makes the extracted keywords lengthy. One can extend the ‘stop’ word list to suit the application being developed and get better keyword results. RAKE is suitable for applications that index and search thousands of large documents.
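As a worked example of the scoring step, the sketch below scores the candidate phrases from the sample text using the degree-over-frequency definition given earlier. The function name rake_scores is hypothetical, and the candidate list is hard-coded (lowercased) so the block stands on its own.

```python
from collections import defaultdict

# Candidate phrases from the sample text, lowercased.
CANDIDATES = [
    "cloud", "changed everything", "changing", "way", "protect", "manage",
    "data", "druva delivers data protection", "management", "cloud era",
]


def rake_scores(phrases):
    """Score each phrase as the sum of degree/frequency over its words."""
    freq = defaultdict(int)    # number of times a word appears
    degree = defaultdict(int)  # appearances plus co-occurring words
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            freq[word] += 1
            # len(words) counts the word itself plus its neighbours.
            degree[word] += len(words)
    word_score = {word: degree[word] / freq[word] for word in freq}
    return {p: sum(word_score[w] for w in p.split()) for p in phrases}


scores = rake_scores(CANDIDATES)
ranked = sorted(scores.items(), key=lambda item: -item[1])
for phrase, score in ranked[:3]:
    print(f"{score:5.1f}  {phrase}")
```

For example, “data” appears twice and co-occurs with three other words, so its degree is 5 and its score is 2.5. The top three phrases are “druva delivers data protection” (14.5), “changed everything” (4.0), and “cloud era” (3.5).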

Conclusion

The success of a keyword extraction algorithm depends on selecting the right keyword candidates. In our experiments, we observed that for a 10 KB text document, POS tagging with spaCy took around 200 ms, and the total time for phrase extraction using spaCy was around 300 ms for the same text. RAKE took around 35 ms.

If you consider computation time, RAKE is the clear winner. However, the keywords generated by spaCy are closer to the desired results. POS tagging provides better control over selecting noun phrases, but it is compute-intensive and therefore costly. Splitting text at ‘stop’ words and using heuristics to form phrases was the faster way to identify keywords in our experiments. An algorithm like RAKE relies on a limited set of ‘stop’ words to form keywords, so one way to improve its results is to extend the ‘stop’ word list to suit the application.

Learn more about how Druva Labs is building highly scalable and optimized solutions. Additionally, discover how the Druva Data Resiliency Cloud’s patented cloud architecture provides centralized management and a consolidated view of all your data.