Precision, Recall and F1 Explained [With 10 ML Use Cases]

Ruman
8 min read · Apr 15, 2024


Explore precision, recall, and F1 metrics, and 10 ML use cases where prioritizing precision or recall is crucial for reliable models.


Outline

  • Introduction
  • Confusion Matrix
  • Precision, Recall and F1 with Code
  • Precision in 5 Different ML Use Cases
  • Recall in 5 Different ML Use Cases
  • Conclusion

Introduction

When building and deploying machine learning models, it's crucial to have the right performance metrics to measure their effectiveness. While business metrics capture the overall impact, for classification tasks, metrics like precision, recall, and F1-score provide deeper insight into the model's behaviour.

Precision and recall are widely used by ML practitioners to evaluate the performance of classification models. However, optimizing for one metric over the other involves trade-offs, and the choice often depends on the specific business objectives.

In this article, we’ll dive deep into understanding precision, recall, and F1-score, how they are interpreted, and when to prioritize each metric. We’ll also explore 10 real-world machine learning use cases and see how these performance metrics are applied in practice.

Confusion Matrix

The confusion matrix is a fundamental tool for understanding and visualizing the performance of a machine learning model. It provides a clear breakdown of how the model performs across different classes or categories.

The confusion matrix is a 2x2 table (for binary classification; N x N for multi-class) that captures four key counts:

  1. True Positives (TP): These are the cases where the model correctly identified the positive instances.
  2. True Negatives (TN): These are the cases where the model correctly identified the negative instances.
  3. False Positives (FP): These are the cases where the model incorrectly identified the negative instances as positive.
  4. False Negatives (FN): These are the cases where the model incorrectly identified the positive instances as negative.

The confusion matrix can be represented as follows:

                      Predicted Positive    Predicted Negative
Actual Positive               TP                    FN
Actual Negative               FP                    TN

(Image source: https://www.evidentlyai.com/classification-metrics/confusion-matrix)
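
If you want to pull these four counts out of a model's predictions, scikit-learn's confusion_matrix makes it straightforward. A minimal sketch (the labels below are made up for illustration):

from sklearn.metrics import confusion_matrix

# Toy ground-truth and predicted labels (illustrative only)
y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=2, TN=2, FP=1, FN=1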

For multi-class classification problems, the confusion matrix takes on a larger structure: an N x N table, where N is the number of unique classes in the dataset. It can be represented as follows:

(Image source: https://www.researchgate.net/figure/Confusion-matrix-for-multi-class-classification-The-confusion-matrix-of-a_fig7_314116591)
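
Here's a quick sketch of what that looks like in code for three classes (again, the labels are made up):

from sklearn.metrics import confusion_matrix

# Toy 3-class labels (illustrative only)
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]

# Rows are actual classes, columns are predicted classes;
# the diagonal holds the correct predictions for each class
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))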

Understanding the values in the confusion matrix is crucial, as they form the basis for calculating the precision, recall, and F1-score which we’ll be discussing next.

Precision, Recall and F1

Photo by Guillermo Ferla on Unsplash

Precision

Precision is a measure of the model's ability to correctly identify positive instances. It answers the question:

Out of all the instances the model identified as positive, how many were actually positive?

Mathematically, precision is calculated as:

Precision = TP / (TP + FP)

Precision gives you an idea of how reliable your model’s positive predictions are. A high precision indicates that when the model predicts a positive outcome, it is likely to be correct.
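
As a quick sanity check with made-up counts (say the model flags 100 emails as spam and 90 of them really are spam):

# Illustrative counts: 90 true positives, 10 false positives
tp, fp = 90, 10

precision = tp / (tp + fp)  # 90 / 100
print(precision)  # 0.9 -> 90% of the positive predictions were correct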

Recall

Recall, on the other hand, measures the model's ability to find all the positive instances. It answers the question:

Out of all the actual positive instances, how many did the model correctly identify?

Recall is calculated as:

Recall = TP / (TP + FN)

Recall tells you how thoroughly your model captures the actual positives. A high recall means the model identifies most of the positive instances in the dataset.
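
Again with made-up counts (say there are 120 actual spam emails and the model catches 90 of them):

# Illustrative counts: 90 true positives, 30 false negatives
tp, fn = 90, 30

recall = tp / (tp + fn)  # 90 / 120
print(recall)  # 0.75 -> the model found 75% of the actual positives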

F1-score

The F1-score is the harmonic mean of precision and recall, a way of averaging that penalizes extreme values. It provides a single balanced metric that considers both precision and recall. The F1-score ranges from 0 to 1, with 1 being the best.

The formula for F1-score is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The F1-score is useful when you want to have a single metric that gives you an overall sense of the model’s performance, especially when precision and recall are both important for your use case.

Because the F1-score balances precision and recall, it is more sensitive to the lower of the two values: it will not be high if there is a significant imbalance between them.
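
You can see this sensitivity with a couple of made-up precision/recall pairs:

# F1 as the harmonic mean; the values below are illustrative
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9, 0.9), 2))   # 0.9  -> balanced, high score
print(round(f1(0.99, 0.2), 2))  # 0.33 -> dragged down by the low recall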

Using scikit-learn to compute these metrics

It’s a piece of cake 😄

from sklearn.metrics import precision_recall_fscore_support


def calculate_metrics(y_true, y_pred):
    """
    Calculates precision, recall, and F1-score.

    Args:
        y_true (list or numpy.ndarray): Ground truth (correct) target values.
        y_pred (list or numpy.ndarray): Estimated target values.

    Returns:
        precision (float): Precision score.
        recall (float): Recall score.
        f1 (float): F1-score.
    """
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
    return precision, recall, f1


# Example usage
y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]

precision, recall, f1 = calculate_metrics(y_true, y_pred)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

The code above calculates precision, recall, and F1-score using the precision_recall_fscore_support function from the scikit-learn library.

The function takes the ground truth labels (y_true) and the predicted labels (y_pred) as input, and returns the precision, recall, and F1-score.
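
One detail worth flagging: the average parameter controls how per-class scores are combined, and the right choice depends on your problem. A quick sketch of the common options (same toy labels as above):

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]

# 'binary'   : score for the positive class (label 1) only
# 'macro'    : unweighted mean over classes (treats all classes equally)
# 'weighted' : mean over classes, weighted by class frequency
for avg in ("binary", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(f"{avg:>8}: precision={p:.2f}, recall={r:.2f}, f1={f1:.2f}")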

Precision in 5 Different ML Use Cases


Spam Detection

In spam detection, precision is a crucial metric. The goal is to identify as many spam emails as possible, but it’s also important to avoid flagging legitimate emails as spam. High precision ensures that when the model predicts an email as spam, it is highly likely to be an actual spam email.

This is important to maintain user trust and avoid frustration from false positive spam detections.

Credit Card Fraud Detection

For credit card fraud detection, precision is a key performance indicator. Banks want to accurately identify fraudulent transactions while minimizing the number of legitimate transactions that are incorrectly flagged as fraud.

High precision in this use case helps reduce customer friction and maintain a positive user experience.

Content Moderation

In content moderation systems, such as those used by social media platforms, precision is vital. The goal is to accurately identify and remove harmful or inappropriate content, while ensuring that legitimate content is not mistakenly taken down.

High precision helps maintain platform health and user trust.

Medical Diagnosis

In medical diagnosis, precision is crucial when screening for rare or serious conditions. For example, in early-stage cancer detection, high precision helps ensure that patients who receive a positive diagnosis are truly positive cases, minimizing unnecessary stress and additional testing.

Autonomous Driving

In autonomous driving systems, precision is paramount when it comes to object detection and classification. The model needs to precisely identify pedestrians, other vehicles, and obstacles to make safe driving decisions.

False positives (incorrectly identifying an object) can lead to unnecessary and potentially dangerous actions by the self-driving car.

Maintaining high precision ensures that the model's positive predictions are reliable and can be trusted, which is essential for delivering valuable and trustworthy machine learning models.

By focusing on precision, you can build models that minimize the number of false positives, which is often the primary concern in these high-stakes applications.
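
In practice, a common way to prioritize precision is to raise the model's decision threshold. A minimal sketch using scikit-learn's precision_recall_curve (the 0.95 target and the toy scores are assumptions for illustration; a real system would validate the chosen threshold on held-out data):

import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.45, 0.7, 0.6, 0.3])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Find the first (lowest) threshold whose precision meets the target,
# which keeps as much recall as possible at that precision level
target = 0.95
idx = np.argmax(precision[:-1] >= target)
print(f"threshold={thresholds[idx]:.2f}, "
      f"precision={precision[idx]:.2f}, recall={recall[idx]:.2f}")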

Recall in 5 Different ML Use Cases


Loan Approval

In loan approval systems, recall is an important metric to consider. The goal is to identify as many qualified applicants as possible, even if it means accepting a higher number of loan applications that are eventually rejected.

High recall ensures that the model does not miss out on potentially creditworthy borrowers, which can be crucial for the business’s growth and profitability.

Predictive Maintenance

In predictive maintenance applications, recall is a key performance indicator. The objective is to identify as many malfunctioning or soon-to-fail assets as possible, to enable proactive maintenance and prevent unplanned downtime.

High recall ensures that the model catches a larger proportion of the faulty units, even if it means occasionally flagging some healthy assets as well.

Patient Diagnosis

In medical diagnosis, particularly for serious or life-threatening conditions, recall is of utmost importance. Physicians want to ensure that the diagnostic model identifies as many positive cases as possible, even if it means having a higher number of false positives that require further investigation.

Minimizing the number of missed diagnoses (false negatives) is crucial in this context.

Cybersecurity Threat Detection

In cybersecurity applications, such as intrusion detection systems, recall is a critical metric. The goal is to identify as many potential cyber threats as possible, even if it means having a higher number of false alarms.

High recall ensures that the model does not overlook any potentially malicious activities, which can have severe consequences for the organization’s security.

Recommendation Systems

In recommendation systems, recall can be an important metric, depending on the use case. For example, in a music streaming service, the recommendation model should aim to surface as many relevant song suggestions as possible, even if it means including some less-than-optimal recommendations.

High recall ensures that users are exposed to a wide range of potentially appealing content, which can improve user engagement and satisfaction.

In these use cases, recall is prioritized because the primary objective is to minimize the number of false negatives, even if it leads to a higher number of false positives.

Maintaining high recall ensures that the model identifies as many positive instances as possible, which matters wherever missing a positive case is costly. This delivers a comprehensive model that can be trusted to surface all relevant cases, even if it occasionally flags some false positives along the way.

The examples above illustrate how recall can be a critical performance indicator, depending on the specific needs and constraints of the machine learning problem at hand.
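
A related knob when recall is the priority is the F-beta score, which generalizes F1: beta > 1 weights recall more heavily than precision (beta = 2 is a common choice). A small sketch with made-up labels:

from sklearn.metrics import fbeta_score

# Toy labels: the model over-predicts the positive class,
# so recall is perfect (1.0) but precision is only 0.6
y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 1, 1]

print(f"F1: {fbeta_score(y_true, y_pred, beta=1):.2f}")  # 0.75
print(f"F2: {fbeta_score(y_true, y_pred, beta=2):.2f}")  # 0.88 -> rewards the high recall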

This article is a continuation of the “Benchmarking AI Models” series. Stay tuned for more articles like this! 😊


Conclusion

We explored the confusion matrix and how to calculate precision, recall, and F1-score. We looked at different use cases where optimizing precision or recall is more important. Precision ensures reliable positive predictions, while high recall identifies more positive instances — the appropriate metric depends on the specific requirements of the application.

If you enjoyed this article, your applause would be greatly appreciated!


Ruman

Senior ML Engineer | Sharing what I know, work on, learn and come across :) | Connect with me @ https://www.linkedin.com/in/rumank/