
Tech/Engineering

NER Natural Language Processing Model - Which is Best?

November 04, 2020 Bhagyashri Shitole, Software Engineer, Druva Labs

What is NER?

In a text document, some terms represent specific entities that are more informative and have a unique context. Named Entity Recognition (NER) is an information extraction method that automatically identifies named entities and classifies them into predefined categories, such as people, locations, organizations, times, quantities, percentages, and monetary values. NER is used in many applications of Natural Language Processing (NLP), and helps to answer questions such as the following:

Which organization is mentioned in the article?
Which person is referred to in the email?
Which location is referred to in a review?

How does NER work?

We humans naturally recognize named entities including people, locations, organizations, and so on. For example, “Druva, a data protection company headquartered in California, was founded by Jaspreet Singh and Milind Borate.”

PERSON(s): Jaspreet Singh, Milind Borate

ORGANIZATION: Druva

LOCATION: California

For computers, however, recognizing entity types in human languages is not that simple. NLP is a subfield of Artificial Intelligence (AI) that helps machines process human language. NLP studies the structure and rules of the language and constructs intelligent systems capable of analyzing text or speech.
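
To make the goal of NER concrete, the hand-labeled example above can be written down as the kind of structured output an NER system is expected to produce. The sketch below is only an illustration: the Entity class and the locate() helper are hypothetical names, not part of any library discussed in this article, and the character offsets come from a plain string search rather than from a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One named entity: its text, its category, and where it sits in the sentence."""
    text: str
    label: str
    start: int  # index of the first character
    end: int    # index one past the last character


def locate(sentence, annotations):
    """Turn (text, label) pairs into Entity spans by searching the sentence."""
    entities = []
    for text, label in annotations:
        start = sentence.find(text)
        if start != -1:  # skip annotations that do not occur in the sentence
            entities.append(Entity(text, label, start, start + len(text)))
    return sorted(entities, key=lambda e: e.start)


sentence = ("Druva, a data protection company headquartered in California, "
            "was founded by Jaspreet Singh and Milind Borate.")

# Labels assigned by hand, as in the example above.
annotations = [
    ("Jaspreet Singh", "PERSON"),
    ("Milind Borate", "PERSON"),
    ("Druva", "ORGANIZATION"),
    ("California", "LOCATION"),
]

entities = locate(sentence, annotations)
for e in entities:
    print(e.label, e.text, (e.start, e.end))
```

Each library in the sections that follow returns the same information in its own form: the entity text plus a label, usually with access to the position in the original text.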

Here are my takeaways from a closer look at open-source NLP libraries.

Natural Language Toolkit (NLTK)

NLTK provides all the components needed to build an NER pipeline:

1. The raw text of the document is split into sentences using the sentence segmenter.
2. Sentences are divided into words using a tokenizer.
3. Words are assigned part-of-speech (POS) tags, which are helpful in entity detection.

Noun Phrase Chunking (np-chunking) is used for entity identification. Using POS-tagged sentences, np-chunking divides each sentence into individual noun phrases. The chunk and tag labels involved are:

NP: Noun Phrase (DT: Determiner, JJ: Adjective, NN: Noun)
VBD: Verb, past tense
IN: Preposition or subordinating conjunction

A Noun Phrase (NP) chunk represents one of the entity types.
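
The idea behind np-chunking can be sketched in a few lines of plain Python. This is not NLTK's implementation (NLTK offers nltk.RegexpParser, which accepts a grammar such as NP: {<DT>?<JJ>*<NN>}); it is a minimal stand-in that applies the same rule, an optional determiner followed by any number of adjectives and then a noun. The function name np_chunk and the hand-tagged sample sentence are assumptions made for illustration.

```python
def np_chunk(tagged):
    """Group (word, tag) pairs into NP chunks matching: DT? JJ* NN."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DT":      # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":   # zero or more adjectives
            j += 1
        if j < n and tagged[j][1] == "NN":      # the noun that closes the phrase
            chunks.append(("NP", [word for word, _ in tagged[i:j + 1]]))
            i = j + 1
        else:
            # No noun phrase starts here: keep the token with its own POS tag.
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks


# POS tags assigned by hand for this illustration.
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

chunks = np_chunk(tagged_sentence)
noun_phrases = [" ".join(words) for label, words in chunks if label == "NP"]
print(noun_phrases)  # ['the little yellow dog', 'the cat']
```

The verb (VBD) and preposition (IN) stay outside the chunks, so only the two noun phrases remain as candidates for entity labeling.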

NLTK provides a classifier trained to identify named entities. It is accessed through the function nltk.ne_chunk(), which assigns labels such as PERSON, LOCATION, and ORGANIZATION.

import nltk

# One-time downloads: tokenizer, POS tagger, NE chunker, and word list
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

data = "Druva a data protection company headquartered in California was founded by Jaspreet Singh and Milind Borate."

# Split the text into tokens, then tag each token with its part of speech
tokens = nltk.word_tokenize(data)
pos_tags = nltk.pos_tag(tokens)

# ne_chunk returns a tree; named entities are the subtrees that carry a label
chunks = nltk.ne_chunk(pos_tags)
for chunk in chunks:
    if hasattr(chunk, 'label'):
        print(' '.join(c[0] for c in chunk), chunk.label())

Druva GPE
California GPE
Jaspreet Singh PERSON
Milind Borate PERSON

Spacy

Spacy is a Python framework that is fast and easy to use. Its pre-trained models make predictions that depend strongly on the examples they were trained on, so they might need some tuning for your use case.

During processing, Spacy first tokenizes the raw text, assigns POS tags, identifies relations between tokens (such as subject or object), and labels named ‘real-world’ objects such as persons, organizations, or locations. It then returns the processed text with its linguistic annotations and entities. Spacy's NER does not use the output of the tagger or parser, so you can skip those pipeline components during processing, as shown below.

import spacy

data = "Druva a data protection company headquartered in California was founded by Jaspreet Singh and Milind Borate."

# Skip the tagger and parser, which NER does not depend on
nlp = spacy.load('en_core_web_sm', disable=["tagger", "parser"])
for ent in nlp(data).ents:
    print(ent.text, ent.label_)

Druva GPE
California GPE
Jaspreet Singh PERSON
Milind Borate ORG

Stanford Core NLP (Stanza)

The Stanford NER classifier is also known as the CRF classifier because it provides a general implementation of a linear-chain Conditional Random Field (CRF) sequence model.

NER processing pipeline:

1. The tokenizer splits the raw text into sentences and words.
2. The Multi-Word Token (MWT) expansion module expands a token into multiple syntactic words. This step applies only to languages with multi-word tokens, such as French or German; the English example here does not use it.
3. The NER classifier receives the annotated data and assigns a label such as PERSON, ORGANIZATION, or LOCATION to each entity.

import stanza

# One-time download of the English models
stanza.download("en")

data = "Druva a data protection company headquartered in California was founded by Jaspreet Singh and Milind Borate."
# Only the tokenizer and the NER processor are needed here
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

doc = nlp(data)
for sent in doc.sentences:
    for ent in sent.ents:
        print(ent.text, ent.type)

Druva ORG
California GPE
Jaspreet Singh PERSON
Milind Borate PERSON

Polyglot

Polyglot NER does not use human-annotated training datasets. Instead, it uses huge unlabeled datasets (like Wikipedia) in which entities are inferred automatically from hyperlinks.
The following example shows how entities are identified by cross-linking with Wikipedia.

<ENTITY url="https://en.wikipedia.org/wiki/Michael_I._Jordan"> Michael Jordan </ENTITY> is a professor at <ENTITY url="https://en.wikipedia.org/wiki/University_of_California,_Berkeley"> Berkeley </ENTITY>
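
As a rough sketch of how hyperlink markup like the line above can be turned into annotations without human labeling, the snippet below pulls out each linked mention together with the Wikipedia page it points to. This is not Polyglot's actual code: the regular expression and the extract_mentions() helper are hypothetical and written only to illustrate the idea.

```python
import re

# Matches <ENTITY url="...">mention</ENTITY> and captures the URL and the mention text.
ENTITY_RE = re.compile(r'<ENTITY url="([^"]+)">\s*(.*?)\s*</ENTITY>')


def extract_mentions(marked_up):
    """Return (mention text, linked page title) pairs found in the markup."""
    mentions = []
    for match in ENTITY_RE.finditer(marked_up):
        url, mention = match.group(1), match.group(2)
        page_title = url.rsplit("/", 1)[-1]  # last path segment of the URL
        mentions.append((mention, page_title))
    return mentions


marked_up = (
    '<ENTITY url="https://en.wikipedia.org/wiki/Michael_I._Jordan"> Michael Jordan </ENTITY> '
    'is a professor at '
    '<ENTITY url="https://en.wikipedia.org/wiki/University_of_California,_Berkeley"> Berkeley </ENTITY>'
)

mentions = extract_mentions(marked_up)
for mention, page_title in mentions:
    print(mention, "->", page_title)
```

The linked page disambiguates the mention: "Michael Jordan" here resolves to the professor's page, and the type of that page (a person, an organization, a place) can then serve as the entity label.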

Polyglot's object-oriented implementation simplifies its use for NLP features.

Applied to raw text, the model returns a processed text object that exposes sentences, words, entities, and POS tags.

from polyglot.text import Text

data = "Druva, a data protection company headquartered in California was founded by Jaspreet Singh and Milind Borate."

text = Text(data, hint_language_code='en')
for each in text.entities:
    print(' '.join(each), each.tag)

California I-LOC
Jaspreet Singh I-PER
Milind Borate I-PER

Comparison

1. Performance

The experiments used a 250 KB input text file and a test machine with 2 cores and 4 GB of memory.

Metric	NLTK	Spacy	Stanza	Polyglot
Time (sec)	14	3.15	184	7.6
CPU usage	1 core 100%	1 core 100%	2 cores 100%	1 core 100%
Memory	340 MB	1.1 GB	1.6 GB	150 MB
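
The article does not describe how these figures were collected. One way to take comparable measurements is sketched below with the standard library only. The measure() helper and the stand-in fake_ner() function are hypothetical; in practice each library's NER call would be passed in place of fake_ner. Note the limits of this approach: tracemalloc only sees memory allocated through Python's allocator, so memory used inside a library's compiled extensions will not be counted, and tracing itself slows the code down.

```python
import time
import tracemalloc


def measure(ner_fn, text):
    """Run ner_fn(text) once; return its result, elapsed seconds, and peak traced MB."""
    tracemalloc.start()
    start = time.perf_counter()
    result = ner_fn(text)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes / (1024 * 1024)


def fake_ner(text):
    """Stand-in for a real NER call: returns the capitalized words only."""
    return [word for word in text.split() if word.istitle()]


text = ("Druva a data protection company headquartered in California "
        "was founded by Jaspreet Singh and Milind Borate.")

result, elapsed, peak_mb = measure(fake_ner, text)
print(f"{len(result)} candidates, {elapsed:.6f} s, peak {peak_mb:.3f} MB")
```

Running every library through the same wrapper on the same input file keeps the timing comparison consistent, even though the absolute numbers depend on the machine.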

2. Features

The table below provides guidelines on when to consider using specific models.

Criterion	NLTK	Spacy	Stanza	Polyglot
Beginner friendly	yes	yes	yes	yes
Multi-language support	yes	yes	yes	yes
Entity categories	7	18	3/4/7	3
CPU efficient application	yes	yes	no	yes
Model	Supervised	Supervised	Supervised	Semi-Supervised
Programming Language	Python	Python	Python/Java	Python

3. Accuracy:

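NER results for the same input vary from library to library, because each experiment used the pre-trained model that ships with that library. In the examples above, NLTK and Spacy label Druva as GPE while Stanza labels it as ORG, Spacy labels Milind Borate as ORG, and Polyglot does not detect Druva at all.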
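
A simple way to see where the libraries agree is to tally the labels each one assigned to the same entity. The sketch below uses only the outputs printed in the examples earlier in this article. The label_votes() helper is hypothetical, and the NORMALIZE mapping is an assumption: it treats Polyglot's I-LOC as equivalent to GPE and I-PER as PERSON so that the tag sets can be compared at all.

```python
from collections import Counter

# Outputs copied from the NLTK, Spacy, Stanza, and Polyglot examples in this article.
results = {
    "NLTK": {"Druva": "GPE", "California": "GPE",
             "Jaspreet Singh": "PERSON", "Milind Borate": "PERSON"},
    "Spacy": {"Druva": "GPE", "California": "GPE",
              "Jaspreet Singh": "PERSON", "Milind Borate": "ORG"},
    "Stanza": {"Druva": "ORG", "California": "GPE",
               "Jaspreet Singh": "PERSON", "Milind Borate": "PERSON"},
    "Polyglot": {"California": "I-LOC",
                 "Jaspreet Singh": "I-PER", "Milind Borate": "I-PER"},
}

# Assumption: a rough mapping of Polyglot's tags onto the labels the other libraries use.
NORMALIZE = {"I-LOC": "GPE", "I-PER": "PERSON", "I-ORG": "ORG"}


def label_votes(results):
    """Count, per entity text, how many libraries assigned each (normalized) label."""
    votes = {}
    for entities in results.values():
        for text, label in entities.items():
            votes.setdefault(text, Counter())[NORMALIZE.get(label, label)] += 1
    return votes


votes = label_votes(results)
for text, counter in votes.items():
    found_by = sum(counter.values())
    print(f"{text}: {dict(counter)} (found by {found_by} of {len(results)} libraries)")
```

A single sentence is far too small a sample to rank the libraries, but the same tally run over a labeled test set from your own domain gives a quick picture of which pre-trained model fits your data.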

Conclusion

These observations cover NLTK, Spacy, CoreNLP (Stanza), and Polyglot, each using the pre-trained models that the library provides. Many other open-source libraries can also be used for NLP.

NLTK is one of the oldest and most widely adopted libraries for research and educational purposes. Spacy is object-oriented with customizable options, works fast, and is considered the current industry standard. Stanford CoreNLP is slow for NLP production usage, but can integrate with NLTK to boost CPU efficiency. Polyglot is a lesser-known library, but it is efficient, straightforward, and fast. Using Polyglot is similar to using Spacy, and it is a good choice for projects involving languages that Spacy does not support. Unlike the other libraries, Polyglot works better at processing unusual or informal text or speech where natural language rules are not followed.

Explore the many ways Druva’s innovative solutions are enabling a range of next-generation cloud-based applications, such as those for neural networks, in the Tech/Engineering section of the blog archive.
