BERT Score Explained

Ruman
9 min read · May 17, 2024


Let’s dive deep into BERTScore, a widely used metric for evaluating the quality of text generated by language models.

Outline

  • What is BERTScore?
  • How is BERTScore Computed?
  • Why BERTScore over BLEU or ROUGE?
  • When to use BERTScore?
  • Conclusion

What is BERTScore?

BERTScore is a metric used to evaluate the quality of text generated by language models or machine translation systems. It computes a similarity score between the generated text and one or more reference texts, indicating how well the generated text captures the semantics of the references.

The key idea behind BERTScore is its use of the BERT model to calculate semantic similarities between reference and generated texts. BERT, a state-of-the-art language model, introduced a paradigm shift in natural language processing by providing contextual embeddings for words in a sentence.

What do I mean by contextual embedding?

Unlike traditional static word embeddings, which assign the same vector representation to a word regardless of its context, BERT generates unique embeddings for the same word depending on the surrounding words and the sentence’s overall context.

For example, consider the following reference and generated texts:

reference_text = "I love my dog."
generated_text = "Dog is running in the garden."

In this case, BERT will produce two completely different embeddings for the word “dog” in the two sentences, reflecting the distinct contexts in which it appears. In the reference text, “dog” refers to a beloved pet, while in the generated text, it signifies an animal engaged in an activity (running) in a particular location (garden).

By leveraging these contextual embeddings, BERTScore can capture deep semantic similarities between words and phrases, even when their surface forms differ. It accomplishes this by calculating the cosine similarities between the contextual embeddings of words in the reference and generated texts, and then aggregating these similarities into overall precision, recall, and F1 scores.
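
To make this concrete, here is a minimal sketch (assuming the Hugging Face transformers package and the bert-base-uncased checkpoint, which are illustrative choices rather than anything mandated by BERTScore) that extracts the contextual embedding of the word “dog” from each sentence and compares them:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def dog_embedding(sentence: str) -> torch.Tensor:
    # Return the contextual embedding of the token "dog" in the sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("dog")]

emb_ref = dog_embedding("I love my dog.")
emb_gen = dog_embedding("Dog is running in the garden.")

# The two vectors are related but not identical, because each depends on its context.
cos = torch.nn.functional.cosine_similarity(emb_ref, emb_gen, dim=0)
print(f"cosine similarity between the two 'dog' embeddings: {cos.item():.3f}")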

How BERTScore is Computed

Figure: BERTScore computation, step-by-step process.

BERTScore is calculated by computing cosine similarities between the token embeddings (vector representations) of the ground truth (reference) text and the generated text, obtained from a pre-trained BERT model. The higher these similarities, the more similar the generated text is to the reference, and the higher the BERTScore.

Here’s a step-by-step process for calculating the BERT score:

i. Tokenize the input texts

Both the reference text and the generated text are tokenized using the BERT tokenizer, which splits the text into subword units (tokens) that are recognized by the BERT model.
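
As a rough illustration, assuming the Hugging Face transformers tokenizer for bert-base-uncased (any BERT-family tokenizer works the same way), step i might look like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

reference_text = "I love my dog."
generated_text = "Dog is running in the garden."

# The tokenizer splits each text into subword units known to BERT,
# e.g. ['i', 'love', 'my', 'dog', '.'] for the reference.
print(tokenizer.tokenize(reference_text))
print(tokenizer.tokenize(generated_text))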

ii. Obtain Contextual Embeddings

The generated and reference text tokens are passed through a pre-trained BERT model to obtain contextual embeddings for each token in the sequences.

At the core of BERTScore lies the use of contextual word embeddings from a pre-trained BERT model. Unlike static word embeddings like Word2Vec, BERT produces different embeddings for the same word based on the context it appears in, allowing it to capture rich semantic and syntactic information.
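
Continuing the same sketch, step ii passes the token IDs through the pre-trained model and reads off one contextual vector per token (the BERTScore implementation typically drops the special [CLS] and [SEP] tokens before matching):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_embeddings(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # last_hidden_state has shape (1, seq_len, hidden_dim); take the single sequence.
        return model(**inputs).last_hidden_state[0]

ref_emb = token_embeddings("I love my dog.")            # one 768-dim vector per token
gen_emb = token_embeddings("Dog is running in the garden.")
print(ref_emb.shape, gen_emb.shape)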

iii. Greedy Matching

This matching process is called “greedy” in the sense that for each token, it greedily selects the most similar counterpart token in the other sentence based on the cosine similarity between their contextual embeddings.

Here’s how it works:

1. For each token t_i in the generated sentence, find the token t_j in the reference sentence that has the maximum cosine similarity with t_i based on their contextual embeddings.

  • Match t_i to t_j
  • Calculate the cosine similarity score for this match sim(t_i, t_j)

2. For each token t_j in the reference sentence, find the token t_i in the generated sentence that has the maximum cosine similarity with t_j.

  • Match t_j to t_i
  • Calculate sim(t_j, t_i)

3. The matching is done independently for each token: every token is simply paired with its most similar counterpart in the other sentence, so several tokens can end up matched to the same counterpart.

4. The similarity scores sim(t_i, t_j) and sim(t_j, t_i) for all the greedily matched token pairs are accumulated.

This greedy matching is all about “maximizing the matching similarity score” by always choosing the most similar token match at each step, in both directions between the sentences.

The accumulated similarity scores from this greedy bidirectional matching are then used to calculate the BERTScore precision and recall values.

So in essence, the greedy process locks each token to its highest-similarity counterpart in the other sentence, maximizing the overall similarity score across all matched token pairs.
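
A small self-contained sketch of this step, using random tensors as stand-ins for the real token embeddings from step ii:

import torch

ref_emb = torch.randn(5, 768)   # stand-in for the reference token embeddings
gen_emb = torch.randn(7, 768)   # stand-in for the generated token embeddings

# Normalize so that dot products become cosine similarities.
ref = torch.nn.functional.normalize(ref_emb, dim=-1)
gen = torch.nn.functional.normalize(gen_emb, dim=-1)

sim = ref @ gen.T                  # sim[i, j] = cos(x_i, x̂_j)

ref_best = sim.max(dim=1).values   # best generated match for each reference token
gen_best = sim.max(dim=0).values   # best reference match for each generated token
# ref_best feeds the recall computation, gen_best feeds the precision computation.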

iv. Precision, Recall and F1

Let’s first fix some notation:

Let x = [x1, x2, …, x|x|] be the tokens of the reference sentence

Let x̂ = [x̂1, x̂2, …, x̂|x̂|] be the tokens of the generated sentence

x_i and x̂_j represent the contextual embeddings of the i-th token in x and j-th token in x̂ respectively.

Recall Calculation (R_BERT)

Mathematical Formulation:

R_BERT = (1 / |x|) * Σ_{x_i ∈ x} max_{x̂_j ∈ x̂} x_i^T x̂_j

For each token x_i in the reference sentence x:

  • Find the token x̂_j in the generated sentence x̂ that has the maximum cosine similarity x_i^T x̂_j with x_i.
  • Accumulate this maximum similarity score

Recall is the sum of these maximum similarity scores, normalized by the length |x| of the reference (i.e., their average).

So recall captures how well the reference tokens are covered/matched by the generated tokens on average.
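
As a toy illustration (the similarity values below are made up), recall is simply the mean of the per-reference-token best-match scores:

# One best-match cosine similarity per reference token (toy values).
ref_best = [0.9, 0.8, 0.7, 0.6]
recall = sum(ref_best) / len(ref_best)
print(f"R_BERT = {recall:.3f}")   # 0.750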

A few use cases where higher recall is desirable:

  • Summarization: When evaluating summarization systems, we generally prefer higher recall. The generated summary should ideally cover as many of the key points/semantics from the reference summary as possible, even if it uses somewhat different wording.
  • Open-ended Generation: For open-ended language generation tasks like story/article writing, we typically want high recall with respect to the reference texts. The generated output should capture most of the core meaning and concepts, even if done with some variation.
  • Translation: For machine translation evaluation, higher recall can be preferred. The translation should preserve most of the meaning from the reference, even if the wording differs.

Precision Calculation (P_BERT)

Mathematical Formulation:

P_BERT = (1 / |x̂|) * Σ_{x̂_j ∈ x̂} max_{x_i ∈ x} x_i^T x̂_j

For each token x̂_j in the generated sentence x̂:

  • Find the token x_i in the reference x that has the maximum cosine similarity x_i^T x̂_j with x̂_j.
  • Accumulate this maximum similarity score

Precision is the sum of these maximum similarity scores, normalized by the length |x̂| of the generated sentence (i.e., their average).

So precision captures how well the generated tokens are matched/covered by the reference tokens on average.
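
And the mirror-image toy sketch for precision, averaging over the generated tokens instead:

# One best-match cosine similarity per generated token (toy values).
gen_best = [0.95, 0.85, 0.70, 0.65, 0.60]
precision = sum(gen_best) / len(gen_best)
print(f"P_BERT = {precision:.3f}")   # 0.750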

A few use cases where higher precision is desirable:

  • Grammatical Error Correction: When evaluating a system that corrects grammatical errors in text, we want high precision. This means the corrections made by the system should closely match the reference/ground truth corrections. Even a small semantic drift is undesirable.
  • Code Generation: If we are evaluating code generation systems, high precision is important. The generated code needs to very closely match the reference implementation, as even small deviations can lead to errors.
  • Question Answering: For factual question answering systems, the generated answers should have high precision w.r.t the reference answers to be considered correct.

F1 Calculation

Mathematical Formulation:

F_BERT = 2 * P_BERT * R_BERT / (P_BERT + R_BERT)

F1 is the harmonic mean of precision and recall.
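
Putting the two together (toy values from the sketches above):

precision, recall = 0.75, 0.75
f1 = 2 * precision * recall / (precision + recall)
print(f"F_BERT = {f1:.3f}")   # 0.750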

So, in summary: BERTScore computes recall by greedy matching from reference to generated, and precision by greedy matching from generated to reference, using the maximum cosine similarity between contextual embeddings.
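
In practice you rarely need to implement any of this by hand: the bert-score package (pip install bert-score), the reference implementation released with the paper, wraps the whole pipeline. A minimal usage sketch:

from bert_score import score

generated = ["Dog is running in the garden."]
references = ["I love my dog."]

# Returns one precision, recall and F1 value per candidate/reference pair.
P, R, F1 = score(generated, references, lang="en", verbose=False)
print(f"P = {P.item():.3f}, R = {R.item():.3f}, F1 = {F1.item():.3f}")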

Why BERTScore?


BERTScore was created to address some of the limitations of traditional evaluation metrics like BLEU and ROUGE for tasks involving text generation or machine translation.

The main issues with metrics like BLEU were:

  • They rely on strict surface-level matching of n-grams between the generated and reference texts, without considering contextual similarities in meaning.
  • They can correlate poorly with human judgments of quality, especially for longer sequences.
  • They struggle to account for legitimate variations in word choice, order, etc. that preserve the same meaning.

BERTScore, in contrast, uses contextualized word embeddings from a pretrained BERT model to capture semantic similarities. This allows it to go beyond surface-level matching and better align with how humans understand semantic similarity in text.

So in essence, BERTScore provided a more meaningful evaluation by prioritizing semantic equivalence over shallow surface matching, better representing how humans evaluate language generation quality.

For example:

generated_text = "The dog quickly ran across the park." 
reference_text = "A quick brown dog raced through the park."

BLEU :

BLEU is based on the modified n-gram precision. It calculates the precision of matching n-grams between the generated and reference texts. For this example, the BLEU score would be relatively low since there is limited word overlap.

ROUGE:

ROUGE (in its ROUGE-L variant) is based on the longest common subsequence between the generated and reference texts. For this example, the ROUGE score would be higher than BLEU, since words like "dog" and "park" appear in both texts.

BERTScore:

BERTScore uses contextual embeddings from a pre-trained BERT model to calculate semantic similarity scores. For this example, BERTScore is likely to give a relatively high score, since the generated and reference texts convey essentially the same meaning despite the different word choices and order.

So while BLEU and ROUGE heavily penalize the generated text for not having high surface-level similarity, BERTScore better accounts for the underlying semantic equivalence, giving a score more aligned with how a human might evaluate it.

The scores could look something like:

BLEU = 0.25 
ROUGE = 0.67
BERTScore = 0.92

This example shows how BERTScore provides a more meaningful evaluation by prioritizing meaning over surface form compared to the older n-gram matching metrics.
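
For the curious, here is one illustrative (not authoritative) way to compute all three metrics on this pair, assuming the nltk, rouge-score and bert-score packages are installed; the exact numbers will depend on the model, n-gram weights and smoothing choices:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score

generated = "The dog quickly ran across the park."
reference = "A quick brown dog raced through the park."

# BLEU: n-gram precision on whitespace tokens, with smoothing for short texts.
bleu = sentence_bleu([reference.split()], generated.split(),
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-L: F-measure of the longest common subsequence.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, generated)["rougeL"].fmeasure

# BERTScore: semantic similarity from contextual embeddings.
_, _, f1 = score([generated], [reference], lang="en")

print(f"BLEU = {bleu:.2f}, ROUGE-L = {rouge_l:.2f}, BERTScore F1 = {f1.item():.2f}")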

When to use BERTScore

BERTScore is an important metric for any text generation task where strict lexical matching is insufficient and capturing semantic similarity in a way that aligns with human judgments is crucial.

Here are some key use cases where employing BERTScore can be beneficial:

  • Machine Translation: BERTScore provides a more meaningful evaluation of machine translation outputs compared to older metrics like BLEU, as it can account for legitimate variations in word choice and order while preserving the core meaning.
  • Text Summarization: For evaluating abstractive text summarization systems, BERTScore is well-suited as it can capture whether a summary preserves the key semantics of the original text despite using different surface forms.
  • Open-ended Text Generation: In tasks like story/article writing, dialog generation, or other open-ended language generation scenarios, BERTScore allows for a robust assessment by focusing on meaning retention rather than just lexical overlap with references.
  • Image Captioning: BERTScore has been effectively used to evaluate the quality of captions generated by image captioning models, by measuring their semantic similarity with ground truth captions.
  • Data-to-Text Generation: For tasks that involve generating natural language descriptions from structured data (e.g., weather reports, sports summaries), BERTScore can help assess how well the generated texts capture the intended meanings.
  • Grammatical Error Correction: BERTScore’s precision variant (BERTScore-P) provides a useful signal for evaluating grammatical error correction systems by capturing semantic equivalence between the system’s corrections and the reference corrections.
  • and more…

If you enjoyed this article, your applause would be greatly appreciated!

This article is a continuation of the “Benchmarking AI Models” series. Stay tuned for more articles like this! 😊


Conclusion


BERTScore is an evaluation metric that has gained significant traction in the field of natural language generation (NLG) and understanding. Its core strength lies in its ability to capture semantic similarities between generated and reference texts, going beyond the surface-level lexical matching employed by traditional metrics like BLEU and ROUGE. By leveraging contextual embeddings from large pre-trained language models like BERT, BERTScore can effectively model semantic equivalences that align with human judgments, even in the presence of variations in word choice and order.

This capability makes BERTScore particularly valuable for a wide range of text generation tasks, including machine translation, summarization, open-ended generation, image captioning, and data-to-text generation, among others.

While not a perfect metric, BERTScore’s ability to capture meaningful semantic similarities has made it a popular choice in the field, driving progress in the development of more human-like language generation systems.
