A new report in the Classification Quality family: you can now use the Evidently open-source Python library to generate a Classification Report. It summarizes and helps you explore the quality of classification models. It works for probabilistic and non-probabilistic classifiers, binary and multi-class alike.
What is it?
Like the Regression Performance Report, this new Classification report helps you understand machine learning model performance in production. You can run it whenever you have the ground truth labels available to evaluate the classification quality.
The report answers the following questions:
- How well does the model perform in production?
- Did it change since the model training or a particular past period?
- Where does the model assign the wrong labels? Can specific features explain the misclassification?
You are reading a blog about an early Evidently release. This functionality has since been improved, and additional Reports and Test Suites are available. You can browse available reports and check out the current documentation for details.
How does it work?
To analyze your classification model's performance, you need to prepare model logs as a pandas DataFrame. You should include the input features, predicted labels (or probabilities), and actual labels.
You can run the report with a single DataFrame to analyze the performance of a chosen model.
If you want to compare the current model performance with the training or past period, you should prepare two DataFrames. You can also use a single DataFrame and indicate the corresponding rows that belong to the "Reference" and "Current" ("Production") data.
Evidently will treat "Reference" data as a basis for comparison.
Once you prepare the data, call the report in the Jupyter notebook and include the ClassificationPreset. You can browse example notebooks in the documentation.
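To make the expected input concrete, here is a minimal sketch of preparing such logs with pandas. The column names (`prediction`, `target`, `feature_1`, `feature_2`) are illustrative, not required by the library; you map your own column names when configuring the report.

```python
import pandas as pd

# Hypothetical model logs: input features, predicted labels, ground truth
logs = pd.DataFrame({
    "feature_1": [0.1, 0.4, 0.2, 0.9, 0.5, 0.7],
    "feature_2": [10, 20, 15, 30, 25, 40],
    "prediction": [0, 0, 1, 1, 0, 1],   # predicted labels (or probabilities)
    "target":     [0, 1, 1, 1, 0, 1],   # actual labels
})

# Split a single frame into "Reference" (basis for comparison) and "Current"
reference = logs.iloc[:3]
current = logs.iloc[3:]
print(reference.shape, current.shape)
```

You then pass the two frames to the report that includes the ClassificationPreset; the exact import paths and call signatures depend on your Evidently version, so check the current documentation.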
How to work with the performance report
First, the report shows a few standard quality metrics relevant for classification models: Accuracy, Precision, Recall, and F1-score. For probabilistic classification, it also includes ROC AUC and LogLoss.
By default, the report assigns the label with the maximum predicted probability. In binary classification, this is the class with a predicted probability of 0.5 or higher. In multi-class, it is the top predicted class. You can also set a custom classification threshold as a parameter to recalculate the metrics.
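This default decision rule is easy to reproduce outside the report. A quick sketch with scikit-learn, on made-up probabilities and labels, shows how moving the threshold away from the default 0.5 recalculates the metrics:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities for the positive class
proba = np.array([0.1, 0.4, 0.55, 0.7, 0.9, 0.3])
y_true = np.array([0, 0, 1, 1, 1, 0])

# Default decision rule: label = 1 when probability >= 0.5
labels_default = (proba >= 0.5).astype(int)

# A custom, lower threshold trades precision for recall
labels_custom = (proba >= 0.35).astype(int)

print(precision_score(y_true, labels_default), recall_score(y_true, labels_default))
print(precision_score(y_true, labels_custom), recall_score(y_true, labels_custom))
```

Here, dropping the threshold to 0.35 turns the prediction at 0.4 into an extra positive: recall stays the same on this toy data, but precision falls from 1.0 to 0.75.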
Quality Metrics appear both for Reference and Current datasets. This way, you can compare the performance and understand how it changed.
As an example, we will use a probabilistic classification model built using the Breast Cancer Prediction dataset (see example notebooks here).
Here we can quickly notice that the performance remains great in Production:
Next, the report generates plots to detail the model performance and the types of errors the model makes.
Class Representation plots show the number of objects of each class in the Reference and Production datasets. They quickly highlight whether any class has become more or less frequent. Such a change often explains performance decay.
Next is the Confusion Matrix, which visualizes the classification errors and their type. Our sample model is highly accurate, so nothing draws our attention here.
Quality metrics by class. We also show each model quality metric for the individual classes.
Sometimes, high overall accuracy can hide that the model fails on some rare classes. This plot helps quickly reveal the difference in performance.
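The same effect is easy to demonstrate with scikit-learn on a toy multi-class example (the labels below are made up): a rare class drags down per-class recall while overall accuracy stays high.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

# Hypothetical labels; class 2 is rare (only two objects)
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 0]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 2, 0]

# Overall accuracy looks fine: 8 out of 10 correct
print(accuracy_score(y_true, y_pred))  # 0.8

# The confusion matrix and per-class recall reveal the weak spot
cm = confusion_matrix(y_true, y_pred)
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred, zero_division=0)
print(rec)  # recall per class: the rare class 2 is only at 0.5
```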
Then, we have a Class Separation plot for the probabilistic models. It shows how well the model distinguishes each given class from the rest, taking into account predicted probabilities.
The plot helps notice specific outliers in the predictions and visually understand if the model gets confused on a particular class.
The idea is to show both the model accuracy and the quality of its calibration.
Let's imagine that most of the correct predictions for a given class concentrate at a 0.7 probability level. There are no objects above this threshold, and there are no misclassified objects of a different class in the vicinity. It seems like we did a great job assigning the labels!
But in fact, this suggests we can further improve the model calibration. In an ideal world, we should observe a high density of correct predictions close to 1.
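You can check this intuition with scikit-learn's `calibration_curve` on made-up data: if all correct positive predictions sit at 0.7, the observed fraction of positives in that probability bin (1.0) is well above the mean predicted probability (0.7), meaning the model is underconfident.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Toy example: every positive is predicted at 0.7, every negative at 0.3,
# and all labels are assigned correctly
y_true = np.array([1] * 10 + [0] * 10)
proba = np.array([0.7] * 10 + [0.3] * 10)

frac_pos, mean_pred = calibration_curve(y_true, proba, n_bins=5)
print(mean_pred)  # [0.3 0.7]
print(frac_pos)   # [0. 1.] -> 100% positives at a predicted 0.7: underconfident
```

For a well-calibrated model, the two arrays would match; the gap at the top bin is the room for improvement.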
You can also refer to this plot to choose the best probability threshold for each class. Depending on the use case, error tolerance and costs vary. You might want to assign a given label not at the default 50% but at a different probability level.
In our example, the model is well-calibrated, but the split is not clear-cut. For the "malignant" class, there is some confusion. The model makes a few errors in the "middle" of the probability range.
Depending on our business process, we might make different choices regarding the threshold.
Suppose the model is used to prioritize cases for expert review by a physician. A false negative means that we will miss a malignant tumor. To avoid this, we might set the threshold as low as 35%. Then, we label all predictions with a probability higher than 35% as suspicious and send them for immediate manual expert review.
You can also compare how this changes over time and make adjustments to your business logic.
We also include the classic ROC and Precision-Recall curves to aid in this analysis. They are not very informative in our example but can provide more insight for less accurate models.
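If you want to reproduce these curves outside the report, scikit-learn provides them directly (the labels and probabilities below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.2, 0.6, 0.55, 0.8, 0.1, 0.9])

# Points of the ROC curve: false positive rate vs. true positive rate
fpr, tpr, roc_thresholds = roc_curve(y_true, proba)

# Points of the Precision-Recall curve
precision, recall, pr_thresholds = precision_recall_curve(y_true, proba)

# Single-number summary of the ROC curve
auc = roc_auc_score(y_true, proba)
print(auc)
```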
How to work with the Precision-Recall table
Next, we have a dedicated Precision-Recall table for probabilistic models.
As we just discussed, the trade-off between precision and recall is a classic choice required to design a business process around your model output.
In some cases, you want to increase the recall. This way, you will find more instances of the target class at the cost of dealing with some false positives, just like in the malignant tumor example above.
In other cases, you would prefer to minimize these false-positive errors.
For example, suppose you are sending out marketing campaigns and building a model to identify the clients likely to convert. According to your contact policy, you do not want to spam your clients with irrelevant offers. You prefer to send fewer offers, but only to those who are indeed likely to accept. In this case, you can set the threshold higher to increase precision at the cost of recall.
No model is perfect, and you usually have to discuss this trade-off with business stakeholders.
The Precision-Recall table helps make and interpret this choice. The table shows possible combinations for different probability thresholds and prediction coverage. It provides better visibility and understanding of the implications beyond looking at the standard curves.
You can see the relative share (top%) and absolute number (Count) of predictions that remain after the cut-off at a certain probability threshold (Prob). The table then shows the corresponding model quality metrics for each combination. It includes Precision, Recall, the share of True Positives (TP), and False Positives (FP).
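A simplified version of such a table is easy to build with pandas. The sketch below, on made-up scored logs sorted by predicted probability, computes precision, recall, TP, and FP for several coverage cut-offs:

```python
import numpy as np
import pandas as pd

# Hypothetical scored logs, already sorted by predicted probability (descending)
proba = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])

rows = []
for top in (0.2, 0.5, 1.0):          # share of predictions kept after the cut-off
    k = int(len(proba) * top)        # absolute number of predictions kept
    tp = int(y_true[:k].sum())       # true positives inside the cut-off
    rows.append({
        "top%": top, "count": k, "prob": proba[k - 1],
        "TP": tp, "FP": k - tp,
        "precision": tp / k,
        "recall": tp / int(y_true.sum()),
    })

table = pd.DataFrame(rows)
print(table)
```

Reading such a table row by row makes the trade-off tangible: keeping the top 20% gives perfect precision but low recall, while covering everything maximizes recall at the cost of precision.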
How to work with the Classification Quality table
The report includes one more table: Classification Quality By Feature.
It serves a similar goal to the Error Bias table in the Regression Performance Report: helping you understand whether specific feature values can explain the model errors.
For each feature, we visualize how well the model performs. We plot the distribution of our True Positives, True Negatives, False Positives, and False Negatives alongside the feature value range.
This way, we can visually explore if a specific type of misclassification is connected to the particular values of a given feature. It helps reveal low-performance segments and analyze if you can address or minimize the error through model retraining or output post-processing.
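The underlying idea can be sketched in a few lines of pandas: label each row as TP/TN/FP/FN and look at the outcome distribution across feature value bins. The data and bin edges below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "feature": [1, 2, 3, 4, 5, 6, 7, 8],
    "target":  [0, 0, 1, 0, 1, 1, 0, 1],
    "pred":    [0, 1, 1, 0, 0, 1, 0, 1],
})

def outcome(row):
    # Classify each prediction as TP, FN, TN, or FP
    if row["target"] == 1:
        return "TP" if row["pred"] == 1 else "FN"
    return "TN" if row["pred"] == 0 else "FP"

df["outcome"] = df.apply(outcome, axis=1)

# Distribution of outcomes across feature value bins: a cluster of errors
# in one bin would point at a low-performance segment
df["bin"] = pd.cut(df["feature"], bins=[0, 4, 8])
print(df.groupby(["bin", "outcome"], observed=True).size().unstack(fill_value=0))
```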
In this case, both our Reference and Production models perform well. We do not notice any pattern that explains the rare errors they make.
For probabilistic classification, we show a set of plots that map predicted probabilities alongside feature values. It helps understand not only the type of error but also explore the exact distribution of the model output.
When should I use this report?
Like with Regression, we recommend using this report whenever you can or want to check on your models. To get a full picture, you can also combine it with the Data Drift and Target Drift reports.
Here are our suggestions on when to use it:
- To analyze the results of the model test, trial run, or A/B test.
- To generate regular reports on the model performance in production for both data scientists and business stakeholders.
- To trigger model retraining and/or explore if retraining is likely to improve the performance.
- To debug model performance in production or analyze its improvement potential.
To implement model quality checks as part of the pipeline, you can use Test Suites. They allow explicitly stating the expectations on model quality to get a clear pass/fail result.
How can I try it?
Go to GitHub, and explore the tool in action using sample notebooks.
For the most recent update on the functionality, explore the docs and examples folder on GitHub.
And if you like it, spread the word!
Want to stay in the loop?
Sign up for the User newsletter to get updates on new features, integrations, and code tutorials. No spam, just good old release notes.