nervaluate — The Ultimate Way to Benchmark NER Models

Ruman
13 min read · Mar 31, 2024


Evaluating a machine learning model is really important. We need to make sure it works well and understand its limitations. This is especially true for NER models, whether they’re based on transformers, LSTMs, spaCy, or anything else. That’s where “nervaluate” comes in handy.

In this blog, we’ll explore “nervaluate,” a great Python package for evaluating NER models.


Content Outline

  • Why should you even care about this?
  • Let’s Look at Some Code
  • Model Evaluation Result
  • Deep Dive into nervaluate’s Metrics
  • Conclusion

Why should you even care about this?


Let me share a personal experience.

A few months back, several folks on our ML team were solving the same NER problem. Some team members were experimenting with Transformer-based models, while others were using spaCy, and a few were developing custom models.

When it came to evaluating these models, we relied heavily on metrics like F1 score, precision, recall at the entity level, and overall accuracy.

However, we encountered a significant issue. Even when, for example, a Transformer-based model and a spaCy model both boasted an F1 score of 0.95, we discovered vastly different characteristics between the two when debugging them to identify limitations. These nuances were not evident from a single F1 score.

Consequently, we found ourselves in need of a more unified approach to assess these models comprehensively. After conducting research, we stumbled upon a remarkable Python package that offered exactly what we were looking for.

Also, I’m writing about this package because there isn’t much content available about it. I believe that tools like this, which significantly simplify the lives of ML practitioners, deserve more recognition. So, let’s dive into it!

Let’s Look at Some Code

The installation process is straightforward —

pip install nervaluate

Using “nervaluate” is just as straightforward. You’ll need to provide the ground truth, the model predictions, the tags, and a loader.

Here’s an example:

from nervaluate import Evaluator

# gt_labels and pred_labels must both be in the same format
gt_labels = [] # ground truth
pred_labels = [] # model prediction

evaluator = Evaluator(gt_labels, pred_labels, tags=['LOC','PER'], loader="default")

# call the evaluator to get model evaluation result
results, results_by_tag = evaluator.evaluate()

The output of results and results_by_tag will be explored in the "Model Evaluation Result" section.

To create an Evaluator object, you need to provide the following arguments:

  • gt_labels: The ground truth labels.
  • pred_labels: The predicted labels from your NER model.
  • tags: A list containing all the entity labels or classes.
  • loader: Specifies the format of gt_labels and pred_labels, which can be "default", "list", or "conll".

Support for Different Formats

As mentioned earlier, we can use this package to evaluate different NER models, be they Transformer-based, spaCy, or anything else.

nervaluate gives us the flexibility to use data in different formats to evaluate model performance. We can define the format using the loader argument when creating the Evaluator object.

What our team did was write a script that converts every model’s predictions into one format. We prefer the prodigy-style list of entity spans, and we use that to evaluate the models.
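To illustrate, here’s a minimal sketch of what such a conversion script might look like. The bio_to_spans helper is hypothetical (it’s our own glue code, not part of nervaluate); it assumes token-level BIO tags as input and produces spans with inclusive token offsets:

# Hypothetical helper: convert token-level BIO tags into prodigy-style span dicts.
# Uses inclusive token offsets; whichever convention you pick, apply it to both
# the ground truth and the predictions.
def bio_to_spans(bio_tags):
    spans = []
    start, label = None, None
    for i, tag in enumerate(bio_tags):
        if tag.startswith("B-"):
            if label is not None:  # close any span that is still open
                spans.append({"label": label, "start": start, "end": i - 1})
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue  # the current span keeps growing
        else:  # "O" tag, or a stray "I-" with a different label (ignored in this sketch)
            if label is not None:
                spans.append({"label": label, "start": start, "end": i - 1})
            start, label = None, None
    if label is not None:  # close a span that runs to the last token
        spans.append({"label": label, "start": start, "end": len(bio_tags) - 1})
    return spans

print(bio_to_spans(["O", "O", "B-PER", "I-PER", "O"]))
# [{'label': 'PER', 'start': 2, 'end': 3}]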

Here are three different formats —

i. spaCy (Prodigy-style lists of spans)

Use loader="default" for this format.

gt_labels = [
    [{"label": "PER", "start": 2, "end": 4}],
    [{"label": "LOC", "start": 1, "end": 2},
     {"label": "LOC", "start": 3, "end": 4}]
]

pred_labels = [
    [{"label": "PER", "start": 2, "end": 4}],
    [{"label": "LOC", "start": 1, "end": 2},
     {"label": "LOC", "start": 3, "end": 4}]
]
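If your predictions come from a spaCy pipeline, converting doc.ents into this shape is straightforward. Here’s a minimal sketch (assuming the en_core_web_sm model is installed; note that spaCy’s ent.start / ent.end are token indices with an exclusive end, so make sure your ground truth uses the same convention):

import spacy

nlp = spacy.load("en_core_web_sm")

def doc_to_spans(text):
    # Build prodigy-style span dicts from spaCy's predicted entities.
    doc = nlp(text)
    return [{"label": ent.label_, "start": ent.start, "end": ent.end} for ent in doc.ents]

pred_labels = [doc_to_spans(t) for t in ["John lives in Berlin.", "Paris and London are cities."]]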

ii. Transformer-based token level (nested lists of NER labels)

Use loader="list" for this format.

gt_labels = [
    ['O', 'O', 'B-PER', 'I-PER', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O'],
]

pred_labels = [
    ['O', 'O', 'B-PER', 'I-PER', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O'],
]
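These nested lists can be plugged straight into the Evaluator with loader="list" (reusing gt_labels and pred_labels defined just above):

from nervaluate import Evaluator

evaluator = Evaluator(gt_labels, pred_labels, tags=['PER', 'LOC'], loader="list")
results, results_by_tag = evaluator.evaluate()

print(results["strict"]["f1"])  # 1.0 here, since the predictions match the ground truth exactly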

iii. CoNLL-style tab-delimited strings

Use loader="conll" for this format. Never heard of CoNLL before 🤷🏼‍♂️

gt_labels = "word\tO\nword\tO\nword\tB-PER\nword\tI-PER\n"

pred_labels = "word\tO\nword\tO\nword\tB-PER\nword\tI-PER\n"
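If your data lives as parallel token/tag lists, building these strings is trivial. A small sketch (plain Python, not part of nervaluate):

tokens = ["John", "lives", "in", "Berlin", "."]
tags = ["B-PER", "O", "O", "B-LOC", "O"]

# One "token<TAB>tag" pair per line, newline-terminated.
gt_labels = "".join(f"{tok}\t{tag}\n" for tok, tag in zip(tokens, tags))
print(repr(gt_labels))  # 'John\tB-PER\nlives\tO\nin\tO\nBerlin\tB-LOC\n.\tO\n'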

Model Evaluation Result


To evaluate the model’s performance, we invoke the evaluate method of the Evaluator object, which returns two dictionaries. For reference, here’s the code again:

# call the evaluator to get model evaluation result
results, results_by_tag = evaluator.evaluate()

The results dictionary contains the overall evaluation result, and results_by_tag provides a detailed entity-level evaluation result.

Let’s look at both of them:

i. Overall Evaluation Result

This provides an overview of the overall result, aggregated over all entity types.

{
    "ent_type": {
        "correct": 19944,
        "incorrect": 115,
        "partial": 0,
        "missed": 1050,
        "spurious": 50,
        "possible": 21109,
        "actual": 20109,
        "precision": 0.9917947187826346,
        "recall": 0.9448102705007343,
        "f1": 0.9677325440341598
    },
    "partial": {
        "correct": 19788,
        "incorrect": 0,
        "partial": 271,
        "missed": 1050,
        "spurious": 50,
        "possible": 21109,
        "actual": 20109,
        "precision": 0.9907752747525983,
        "recall": 0.9438391207541806,
        "f1": 0.9667378329855887
    },
    "strict": {
        "correct": 19692,
        "incorrect": 367,
        "partial": 0,
        "missed": 1050,
        "spurious": 50,
        "possible": 21109,
        "actual": 20109,
        "precision": 0.9792630165597493,
        "recall": 0.9328722345918803,
        "f1": 0.9555048765102625
    },
    "exact": {
        "correct": 19788,
        "incorrect": 271,
        "partial": 0,
        "missed": 1050,
        "spurious": 50,
        "possible": 21109,
        "actual": 20109,
        "precision": 0.9840369983589438,
        "recall": 0.9374200577952532,
        "f1": 0.9601630355669852
    }
}

ii. Entity Evaluation Result

This result offers a comprehensive and granular analysis, providing insights at the entity level.

{
    "LOC": {
        "ent_type": {
            "correct": 2112,
            "incorrect": 27,
            "partial": 0,
            "missed": 32,
            "spurious": 14,
            "possible": 2171,
            "actual": 2153,
            "precision": 0.9809568044588945,
            "recall": 0.9728235836020267,
            "f1": 0.9768732654949122
        },
        "partial": {
            "correct": 2139,
            "incorrect": 0,
            "partial": 0,
            "missed": 32,
            "spurious": 14,
            "possible": 2171,
            "actual": 2153,
            "precision": 0.9934974454249884,
            "recall": 0.9852602487333026,
            "f1": 0.9893617021276596
        },
        "strict": {
            "correct": 2112,
            "incorrect": 27,
            "partial": 0,
            "missed": 32,
            "spurious": 14,
            "possible": 2171,
            "actual": 2153,
            "precision": 0.9809568044588945,
            "recall": 0.9728235836020267,
            "f1": 0.9768732654949122
        },
        "exact": {
            "correct": 2139,
            "incorrect": 0,
            "partial": 0,
            "missed": 32,
            "spurious": 14,
            "possible": 2171,
            "actual": 2153,
            "precision": 0.9934974454249884,
            "recall": 0.9852602487333026,
            "f1": 0.9893617021276596
        }
    },
    "PER": {
        "ent_type": {
            "correct": 1764,
            "incorrect": 0,
            "partial": 0,
            "missed": 64,
            "spurious": 0,
            "possible": 1828,
            "actual": 1764,
            "precision": 1,
            "recall": 0.9649890590809628,
            "f1": 0.9821826280623608
        },
        "partial": {
            "correct": 1764,
            "incorrect": 0,
            "partial": 0,
            "missed": 64,
            "spurious": 0,
            "possible": 1828,
            "actual": 1764,
            "precision": 1,
            "recall": 0.9649890590809628,
            "f1": 0.9821826280623608
        },
        "strict": {
            "correct": 1764,
            "incorrect": 0,
            "partial": 0,
            "missed": 64,
            "spurious": 0,
            "possible": 1828,
            "actual": 1764,
            "precision": 1,
            "recall": 0.9649890590809628,
            "f1": 0.9821826280623608
        },
        "exact": {
            "correct": 1764,
            "incorrect": 0,
            "partial": 0,
            "missed": 64,
            "spurious": 0,
            "possible": 1828,
            "actual": 1764,
            "precision": 1,
            "recall": 0.9649890590809628,
            "f1": 0.9821826280623608
        }
    }
}
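Both results and results_by_tag are plain nested dictionaries (results keyed by evaluation method, results_by_tag keyed by entity label first), so pulling out a single number is straightforward:

# Overall strict F1
print(results["strict"]["f1"])

# Precision of LOC entities under the partial-matching scheme
print(results_by_tag["LOC"]["partial"]["precision"])

# Number of missed PER entities (the same under every scheme)
print(results_by_tag["PER"]["strict"]["missed"])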

The results and results_by_tag dictionaries provide us with ten different evaluation metrics:

"correct"
"incorrect"
"partial"
"missed"
"spurious"
"possible"
"actual"
"precision"
"recall"
"f1"

These ten metrics are calculated under four distinct evaluation methods:

"ent_type" 
"partial"
"strict"
"exact"

Too much to process 🥱?

If you’re feeling 😕 right now, don’t worry, I’ll make the metrics clear in the next 👇🏼 section.

Deep Dive into nervaluate’s Metrics


So far we have looked at the sample code, the formats for providing ground truth data and model predictions, and the evaluation results generated by the nervaluate package, which performs overall and entity-level evaluation.

Now, we will explore the metrics calculated by nervaluate, which are similar for both overall and entity-level evaluation.

In the results, we observed ten metrics measured in four different ways (four different evaluation methods).

The ten metrics are:

  • Correct (COR): The model prediction exactly matches the ground truth.
  • Incorrect (INC): The model prediction does not match the ground truth at all.
  • Partial (PAR): The model prediction overlaps with the ground truth but is not an exact match.
  • Missing (MIS): The model failed to predict an entity that is present in the ground truth (false negative).
  • Spurious (SPU): The model predicted an entity that is not present in the ground truth (false positive).
  • Possible (POS): The total number of entities present in the ground truth.
  • Actual (ACT): The total number of predictions made by the model.
  • Precision: The precision metric, calculated differently depending on the evaluation method.
  • Recall: The recall metric, also calculated differently based on the evaluation method.
  • F1: The F1 score, calculated in the typical way.

Among the ten metrics, five serve as key metrics for categorizing errors and evaluating the NER model: Correct (COR), Incorrect (INC), Partial (PAR), Missing (MIS), and Spurious (SPU).

Before moving ahead, it’s important to understand how two metrics, Possible (POS) and Actual (ACT), are computed. Here are their formulas:

Possible (POS) = COR + INC + PAR + MIS (the total number of entities annotated in the ground truth)
Actual (ACT) = COR + INC + PAR + SPU (the total number of entity predictions produced by the model)
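You can sanity-check these two formulas against the strict block of the overall results shown earlier:

strict = results["strict"]

possible = strict["correct"] + strict["incorrect"] + strict["partial"] + strict["missed"]
actual = strict["correct"] + strict["incorrect"] + strict["partial"] + strict["spurious"]

print(possible)  # 19692 + 367 + 0 + 1050 = 21109, matching strict["possible"]
print(actual)    # 19692 + 367 + 0 + 50   = 20109, matching strict["actual"]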

The metrics Actual (ACT) and Possible (POS) play a crucial role in calculating Precision and Recall depending on the chosen evaluation schema/method.

It’s important to note that the treatment of Missing (MIS) and Spurious (SPU) predictions remains consistent across all evaluation schemas, regardless of whether the Strict, Exact, Partial, or Type evaluation method is employed.

Four Evaluation Methods for the Above Metrics

Strict Evaluation Method

According to the Strict Evaluation method, a model prediction is considered Correct (COR) only when both the predicted entity label and the predicted entity string match the ground truth exactly; otherwise, it is Incorrect (INC).

In other words, for a prediction to be counted as Correct, the model must accurately identify the entity type (label) as well as the exact character span (string) of the entity, without any deviations from the ground truth data.

In the Strict Evaluation schema, Partial matches are considered as Incorrect(INC) when computing precision and recall metrics.

nervaluate strict evaluation method

For example, in scenario 4, the prediction is Correct (COR) since both the “DRUG” label and the “phenytoin” string are correct. However, in scenario 2, despite the “DRUG” label being right, the partial string match (“of warfarin” vs “warfarin”) results in an Incorrect (INC) prediction under this strict criterion.

Any deviation from the ground truth, even partially, is deemed incorrect in this rigorous evaluation method.

Precision and Recall under the Strict Evaluation schema:

Precision = COR / ACT
Recall = COR / POS
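Plugging in the strict counts from the overall results shown earlier, the numbers check out:

strict = results["strict"]

precision = strict["correct"] / strict["actual"]    # 19692 / 20109 ≈ 0.9793
recall = strict["correct"] / strict["possible"]     # 19692 / 21109 ≈ 0.9329
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.9555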

Let’s consider a real-world scenario:

The Strict Evaluation schema would be ideal for evaluating an NER model in a document AI use case, where the model has to recognize various entities in a document.

For example, address extraction from documents is crucial for database verification. The entire address “1600 Amphitheatre Parkway, Mountain View, CA 94043, USA” must be recognized as a single entity (ADDRESS).

Strict evaluation is essential to assess the performance of NER models in applications where accurate predictions of both entity labels and entity boundaries are crucial, and false positives (FP) are costly.

Exact Evaluation Method

The Exact Evaluation schema focuses solely on the accuracy of the predicted entity string boundaries, disregarding the entity label or type.

A model prediction is marked as Correct (COR) if the predicted entity string matches the ground truth entity string span precisely, regardless of whether the predicted entity label is correct or not.

In the Exact Evaluation schema, Partial matches are considered as Incorrect(INC) when computing precision and recall metrics.

nervaluate exact evaluation method

For example, in scenario 3, the prediction is Correct (COR) despite the incorrect “BRAND” label because the “propranolol” string is right.

However, like the Strict schema, the Exact method does not allow partial string matches, so scenario 2 is Incorrect (INC) because “of warfarin” doesn’t exactly match “warfarin”. Only the entity string boundaries matter in this evaluation.

Precision and Recall Formulas for the Exact Evaluation schema:

Precision = COR / ACT
Recall = COR / POS

Let’s consider a real-world scenario:

The Exact Evaluation schema would be ideal for evaluating an NER model used for entity linking against a knowledge graph (e.g., one built from Wikipedia/Wikidata). For entity linking systems, precisely identifying the boundaries of entity mentions in text is crucial, regardless of whether the entity type is predicted correctly.

For example, if a knowledge graph contains the entity “Albert Einstein”, a NER model must recognize any mention of his full name “Albert Einstein” as the same entity, in order to correctly link it to the existing node in the knowledge graph. Even if the predicted type is “Person” instead of “Scientist”, as long as the character span of “Albert Einstein” is identified accurately, it can still be correctly linked.

Partial Evaluation Method

The Partial Evaluation Method combines aspects of the Strict and Exact Evaluation methods. Unlike the Strict and Exact methods, which consider partial matches as Incorrect (INC) when computing precision and recall, the Partial Evaluation Method accounts for these partial matches.

Under this method, a model prediction is considered Correct (COR) if the predicted entity string exactly matches the ground truth entity string, irrespective of the predicted entity label or type (similar to the Exact method). However, the Partial Evaluation Method goes a step further by introducing a new category called Partial (PAR). If the predicted entity string partially overlaps with the ground truth entity string, regardless of the entity label, the prediction is classified as Partial (PAR) instead of being marked as Incorrect (INC).

nervaluate partial evaluation schema

For example, scenarios 2 and 5 are marked as Partial (PAR) because the predicted entity strings (“of warfarin” and “oral contraceptives”) partially overlap with the respective ground truth entity strings. Unlike the Strict and Exact schemas, which would consider these Incorrect (INC), the Partial method accounts for such partial overlaps by categorizing them as Partial (PAR) instead of treating them as completely incorrect predictions.

Precision and Recall Formulas for the Partial Evaluation schema:

Precision = (COR + 0.5 × PAR) / ACT
Recall = (COR + 0.5 × PAR) / POS

During partial evaluation, a partial match receives half credit: only 50% of the Partial (PAR) count is added to the Correct count when computing precision and recall.
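Again using the overall results from earlier, the partial block works out like this:

partial = results["partial"]

precision = (partial["correct"] + 0.5 * partial["partial"]) / partial["actual"]
# (19788 + 0.5 * 271) / 20109 ≈ 0.9908

recall = (partial["correct"] + 0.5 * partial["partial"]) / partial["possible"]
# (19788 + 0.5 * 271) / 21109 ≈ 0.9438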

Let’s consider a real-world scenario:

A case where the Partial Evaluation schema would be ideal is named entity recognition applied to clinical notes in electronic health records. In clinical text, doctors may refer to diseases, medications, or procedures using shorthand, abbreviations, or synonyms. The Partial Evaluation schema accounts for partial matches between predictions and ground truth labels, which is important in this domain.

For example, if a clinical note refers to “congestive heart failure” and the NER model recognizes the entity as just “heart failure”, this would be considered a partial match under the Partial Evaluation schema. Even though the model did not predict the full entity string, it did identify part of the entity mention (“heart failure”) correctly. This allows the model to still receive partial credit during evaluation.

Considering partial matches is valuable for NER models used in domains like clinical NLP, where language used can be non-standard and abbreviations are common. The Partial Evaluation schema provides a more robust evaluation approach for such use cases.

Type Evaluation Method

In the Type Evaluation method, a prediction is considered Correct (COR) only if the predicted entity type (label) exactly matches the ground truth and there is some degree of overlap between the entity strings.

In simpler terms, partial overlaps are counted as correct (COR) when both the predicted entity label and the ground truth entity label match.
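To make “a degree of overlap” concrete, here’s a hypothetical check for whether two spans (in the prodigy-style token-offset format from earlier, with inclusive ends assumed) overlap at all; together with a matching label, this is essentially the condition the Type method looks for:

def spans_overlap(a, b):
    # Two spans share at least one token if the later start is not past the earlier end.
    return max(a["start"], b["start"]) <= min(a["end"], b["end"])

gold = {"label": "DRUG", "start": 10, "end": 10}  # "warfarin" (illustrative offsets)
pred = {"label": "DRUG", "start": 9, "end": 10}   # "of warfarin"
print(spans_overlap(gold, pred) and gold["label"] == pred["label"])  # True -> COR under Type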

nervaluate type evaluation schema

For example, in the screenshot provided above, in scenario 2, the prediction is considered Correct (COR) because there is an overlap between the strings (“warfarin” and “of warfarin”) and the predicted entity labels are identical. However, in scenarios 3 and 5, despite the overall similarity between the entity strings, they are considered Incorrect (INC) because the entity labels do not match.

Precision and Recall Formulas for the Type Evaluation schema:

Precision = COR / ACT
Recall = COR / POS

The formulas are the same as those used for the Strict and Exact evaluation methods; only the definition of Correct changes.

Summing up all the evaluation methods in one table:

nervaluate different evaluation methods
  • As you can see, the treatment of Missing (MIS) and Spurious (SPU) predictions remains consistent across all evaluation schemas, regardless of whether Strict, Exact, Partial, or Type evaluation is performed.
  • Only the Partial Evaluation schema uses the Partial (PAR) category when computing scores; in the Strict and Exact schemas such predictions are counted as Incorrect (INC), while in the Type schema they count as Correct (COR) when the label matches.

If you’ve read this far, 🙌 🫡, and yes, it’s over…

Conclusion

In conclusion, evaluating NER models is crucial to ensure their effectiveness and understand their limitations. “Nervaluate” emerges as a valuable tool in this regard, offering a unified approach to assess model performance.

Through our exploration of “nervaluate,” we’ve seen its straightforward implementation and flexibility across different model types and data formats. By providing detailed evaluation metrics at both the overall and entity levels, it offers insights into model strengths and weaknesses that go beyond simple metrics like F1 score.

We thoroughly looked at the evaluation results provided by “nervaluate” and went through both the overall and entity-level assessments. We then looked into the various metrics provided in these results and explored the different evaluation methods: Strict, Exact, Partial, and Type. Each method offers unique insights into model performance, considering factors such as exact matches, partial overlaps, missing entities, and correctness of entity types.

Ultimately, “nervaluate” gives ML practitioners the means to perform rigorous NER model evaluation, enabling informed decisions during model iteration. With its comprehensive evaluation metrics and flexible usage, it stands as an invaluable resource for benchmarking NER models.

If you’re interested in learning how to fine-tune a PyTorch model, feel free to explore my blog on the topic.

Ultimate Guide to Fine-Tuning in PyTorch
