July 28, 2022
Last Updated: September 19, 2022

Evidently 0.1.8: Machine Learning Performance Reports for Classification Models

Evidently

A new report in the performance tabs family: now, you can use Evidently to summarize and explore the quality of classification models. It works for probabilistic and non-probabilistic classifiers, binary and multi-class alike.

What is it?

Like our Regression Performance Report, this new report helps you understand machine learning performance in production. You can run it whenever ground truth labels are available to evaluate the classification quality.

The report answers the following questions:

  • How well does the model perform in production?
  • Did it change since the model training or a particular past period?
  • Where does the model assign the wrong labels? Can specific features explain the misclassification?

If the actual labels are not yet known, you can use the Data Drift or Numerical Target Drift reports instead to analyze the model quality using only leading indicators.

How does it work?

To analyze your classification model's performance, you need to prepare model logs as a pandas DataFrame. You should include the input features, predicted labels (or probabilities), and actual labels.

Features, predictions, actual class, quality

You can run the report with a single DataFrame to analyze the performance of a chosen model.

If you want to compare the current model performance with the training or past period, you should prepare two DataFrames. You can also use a single DataFrame and indicate the corresponding rows that belong to the "Reference" and "Production" ("Current") data.

Evidently will treat "Reference" data as a basis for comparison.

Two datasets as input to generate the report.
In the updated version of the library, we refer to the second dataset as "Current" instead of "Production".

Once you prepare the data, call the report in the Jupyter notebook as ClassificationPerformanceTab, or ProbClassificationPerformanceTab for probabilistic classification.

How to work with the performance report

First, the report shows a few standard quality metrics relevant for classification models: Accuracy, Precision, Recall, and F-1 Score. For probabilistic classification, it also indicates ROC AUC and LogLoss.

By default, we assign the label with the maximum predicted probability. In binary classification, that is the class with a predicted probability of 0.5 or higher. In multi-class, it is the top predicted class. Right now, you cannot choose a custom threshold to recalculate the metrics. We will add these levers later on.
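To make the default labeling rule concrete, here is a minimal sketch in plain Python. It is not Evidently's internal code: the function names and the sample probabilities are made up for illustration, and the `threshold` argument exists only to show the rule, since the report itself uses the fixed default.

```python
def assign_label(class_probs):
    """Multi-class: return the class with the highest predicted probability."""
    return max(class_probs, key=class_probs.get)


def assign_binary_label(prob_positive, threshold=0.5):
    """Binary: return 1 when the positive-class probability reaches the threshold."""
    return 1 if prob_positive >= threshold else 0


# Hypothetical model output for one Iris-like object.
multi = {"setosa": 0.1, "versicolor": 0.7, "virginica": 0.2}
top_class = assign_label(multi)

# Hypothetical positive-class probabilities for three objects.
binary_labels = [assign_binary_label(p) for p in (0.49, 0.5, 0.91)]

print(top_class, binary_labels)
```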

Quality Metrics appear both for Reference and Production datasets.
This way, you can compare the performance and understand how it changed.
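If you want to sanity-check the headline numbers by hand, the sketch below computes the same kind of metrics for a binary model on two small, made-up datasets. It is an illustration under simple assumptions (labels are 0/1, class 1 is the positive class), not the report's implementation, and all names and data are hypothetical.

```python
import math


def binary_metrics(actual, predicted, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def log_loss(actual, prob_positive, positive=1, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for a, p in zip(actual, prob_positive):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p) if a == positive else -math.log(1 - p)
    return total / len(actual)


# Made-up "Reference" and "Current" logs with the same actual labels.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
reference_pred = [1, 0, 1, 0, 0, 1, 1, 0]
reference_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]
current_pred = [1, 0, 0, 0, 0, 1, 1, 0]

reference_metrics = binary_metrics(actual, reference_pred)
current_metrics = binary_metrics(actual, current_pred)
reference_log_loss = log_loss(actual, reference_prob)

for name in reference_metrics:
    print(name, round(reference_metrics[name], 3), round(current_metrics[name], 3))
```

Printing the two columns side by side mirrors the comparison the report makes between Reference and Production.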

As an example, we will use a probabilistic classification model built using the Breast Cancer Prediction dataset (see the Jupyter notebook here).

Here we can quickly notice that the performance remains stable in Production:

Model quality metrics for two models.

Next, the report generates plots to detail the model performance and the types of errors the model makes.

Class Representation plots show the number of objects of each class in the Reference and Production datasets. They quickly highlight whether any of the classes became more or less frequent. Such a change often explains performance decay.

Class representation plots for two models
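The same check can be sketched in a few lines of plain Python: count the share of each class in both datasets and look at the difference. The label lists below are invented for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Share of each class among the given labels."""
    counts = Counter(labels)
    return {cls: counts[cls] / len(labels) for cls in counts}


def share_change(reference_labels, current_labels):
    """Current share minus reference share, per class."""
    ref = class_shares(reference_labels)
    cur = class_shares(current_labels)
    classes = sorted(set(ref) | set(cur))
    return {cls: cur.get(cls, 0.0) - ref.get(cls, 0.0) for cls in classes}


# Made-up actual labels: "malignant" becomes less frequent in the current data.
reference_labels = ["benign"] * 6 + ["malignant"] * 4
current_labels = ["benign"] * 8 + ["malignant"] * 2

change = share_change(reference_labels, current_labels)
print(change)
```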

Next is the Confusion Matrix, which visualizes the classification errors and their type. Our sample model is highly accurate, so nothing draws our attention here.

Confusion matrices for two models.
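For readers who have not built one by hand, a confusion matrix is just a count of (actual, predicted) pairs. The sketch below assumes rows are actual classes and columns are predicted classes; the labels are made up, and the orientation used in the report's plot may differ.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows follow the actual class, columns follow the predicted class."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in classes] for a in classes]


classes = ["benign", "malignant"]
actual = ["benign", "benign", "malignant", "malignant", "malignant", "benign"]
predicted = ["benign", "malignant", "malignant", "benign", "malignant", "benign"]

matrix = confusion_matrix(actual, predicted, classes)
for cls, row in zip(classes, matrix):
    print(cls, row)
```

Off-diagonal cells are the errors: here, one benign case flagged as malignant and one malignant case missed.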

Quality metrics by class. We also show each of the model quality metrics for the individual classes.

Sometimes, high overall accuracy can hide that the model fails on some rare classes. This plot helps quickly reveal the difference in performance.

Quality metrics by class.
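A small made-up example shows how this can happen. In the sketch below, overall accuracy is 95%, yet the model finds only half of the rare class. The class names and data are hypothetical.

```python
def per_class_metrics(actual, predicted):
    """Precision, recall, and support for every class seen in the data."""
    result = {}
    for cls in sorted(set(actual) | set(predicted)):
        tp = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)
        predicted_as_cls = sum(1 for p in predicted if p == cls)
        actual_cls = sum(1 for a in actual if a == cls)
        result[cls] = {
            "precision": tp / predicted_as_cls if predicted_as_cls else 0.0,
            "recall": tp / actual_cls if actual_cls else 0.0,
            "support": actual_cls,
        }
    return result


# 18 common objects, all correct; 2 rare objects, one of them missed.
actual = ["common"] * 18 + ["rare"] * 2
predicted = ["common"] * 18 + ["common", "rare"]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
by_class = per_class_metrics(actual, predicted)

print(accuracy)
print(by_class)
```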

Then, we have a Class Separation plot for the probabilistic models. It shows how well the model distinguishes each given class from the rest, taking into account predicted probabilities.

The plot helps notice specific outliers in the predictions and visually understand if the model gets confused on a particular class.

The idea is to show both the model accuracy and the quality of its calibration.

Well-calibrated model VS the model that needs to be calibrated.

Let's imagine that most of the correct predictions for a given class concentrate at a 0.7 probability level. There are no more objects above this threshold, and there are no misclassified objects of a different class in the vicinity. It seems like we did a great job assigning the labels!

But in fact, this suggests we can further improve the model calibration. In an ideal world, we should observe a high density of correct predictions close to 1.

You can also refer to this plot to choose the best probability threshold for each class. Depending on the use case, error tolerance and costs vary. You might want to assign a given label not at the default 50% but at a different probability level.

Class separation quality metrics.

In our example, the model is well-calibrated, but the split is not clear-cut. For the "malignant" class, there is some confusion. The model makes a few errors in the "middle" of the probability range.

Depending on our business process, we might make different choices regarding the threshold.

Suppose the model is used to prioritize cases for expert review by the physician. A false negative means that we will miss a malignant tumor. To avoid this, we might set the threshold as low as 35%. Then, we label all predictions with a probability higher than 35% as suspicious and send them for the immediate manual expert review.
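The effect of lowering the threshold can be checked outside the report with a few lines of Python. This is a sketch on invented data: the probabilities, labels, and function name are assumptions, and class 1 stands for "malignant".

```python
def review_stats(actual, prob_malignant, threshold):
    """Flag cases with probability above the threshold and score the result."""
    flagged = [p > threshold for p in prob_malignant]
    tp = sum(1 for a, f in zip(actual, flagged) if a == 1 and f)
    fp = sum(1 for a, f in zip(actual, flagged) if a == 0 and f)
    fn = sum(1 for a, f in zip(actual, flagged) if a == 1 and not f)
    return {
        "sent_to_review": tp + fp,
        "missed_malignant": fn,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Made-up logs: 4 malignant cases (1) and 6 benign cases (0).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
prob_malignant = [0.95, 0.80, 0.45, 0.38, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02]

default_stats = review_stats(actual, prob_malignant, threshold=0.5)
lowered_stats = review_stats(actual, prob_malignant, threshold=0.35)

print(default_stats)
print(lowered_stats)
```

In this toy data, the lower threshold catches every malignant case at the price of one extra benign case sent to the physician.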

You can also compare how this changes over time and make adjustments to your business logic.

We have more classic ROC curves and Precision-Recall curves to aid in this analysis. They are not very informative in our example but can provide more insight for less accurate models.

ROC curves
Precision-recall curve
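The ROC AUC value reported among the quality metrics summarizes the ROC curve in one number. One way to read it: it is the probability that a randomly chosen positive object gets a higher predicted probability than a randomly chosen negative one. The sketch below computes it that way on made-up data. It compares every positive with every negative, so it is only meant for small illustrative inputs.

```python
def roc_auc(actual, prob_positive):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for a, p in zip(actual, prob_positive) if a == 1]
    neg = [p for a, p in zip(actual, prob_positive) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and probabilities: one negative outranks one positive.
actual = [1, 1, 0, 1, 0, 0]
prob_positive = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

auc = roc_auc(actual, prob_positive)
print(round(auc, 3))
```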

How to work with the Precision-Recall table

Next, we have a dedicated Precision-Recall table for probabilistic models.

As we just discussed, the trade-off between precision and recall is a classic choice required to design a business process around your model output.

In some cases, you want to increase the recall. This way, you will find more instances of the target class at the cost of dealing with some false positives. This is the case in the malignant tumor example above.

Precision recall trade-off

In other cases, you would prefer to minimize these false-positive errors.

For example, you are sending out marketing campaigns and building a model to classify those likely to convert. According to your contact policy, you do not want to spam your clients with irrelevant offers. You prefer to send fewer offers but only to those who are indeed likely to accept. In this case, you can set the threshold higher to increase precision at the cost of recall.

No model is perfect, and you usually have to discuss this trade-off with business stakeholders.

The Precision-Recall table helps make and interpret this choice. The table shows possible combinations for different probability thresholds and prediction coverage. It provides better visibility and understanding of the implications beyond looking at the standard curves.

You can see the relative share (top%) and absolute number (Count) of predictions that remain after the cut-off at a certain probability threshold (Prob). The table then shows the corresponding model quality metrics for each combination. It includes Precision, Recall, the share of True Positives (TP), and False Positives (FP).

Precision-recall table
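To show how such a table can be derived, here is a rough sketch on made-up data. It ranks predictions by probability, cuts the list at every 20% of coverage, and computes the metrics for each cut. The column names follow the description above, but the exact layout and step size of the Evidently table are assumptions here, and TP and FP are shown as counts.

```python
def precision_recall_table(actual, prob_positive, step_percent=20):
    """One row per coverage level: top share, count, cut-off probability, metrics."""
    ranked = sorted(zip(prob_positive, actual), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(actual)
    rows = []
    for top in range(step_percent, 101, step_percent):
        count = len(ranked) * top // 100
        if count == 0:
            continue  # too few objects for this coverage level
        selected = ranked[:count]
        tp = sum(a for _, a in selected)
        rows.append({
            "top%": top,
            "count": count,
            "prob": selected[-1][0],  # lowest probability still included
            "tp": tp,
            "fp": count - tp,
            "precision": tp / count,
            "recall": tp / total_positives if total_positives else 0.0,
        })
    return rows


# Made-up logs: 1 marks the target class.
actual = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
prob_positive = [0.95, 0.9, 0.85, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

rows = precision_recall_table(actual, prob_positive)
for row in rows:
    print(row)
```

Reading the rows top to bottom shows the trade-off directly: precision falls from 1.0 to 0.4 as coverage grows, while recall rises to 1.0 once 60% of the predictions are included.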

How to work with the Classification Quality table

The report includes one more table: Classification Quality By Feature.

It serves a similar goal to the Error Bias table in the Regression Performance Report: it helps you understand whether specific feature values can explain the model errors.

For each feature, we visualize how well the model performs. We plot the distribution of our True Positives, True Negatives, False Positives, and False Negatives alongside the feature value range.

This way, we can visually explore if a specific type of misclassification is connected to the particular values of a given feature. It helps reveal low-performance segments and analyze if you can address or minimize the error through model retraining or output post-processing.

feature example

In this case, both our Reference and Production models perform well. We do not notice any pattern that explains the rare errors they make.
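The idea behind this view can be sketched without any plotting: split a feature into value ranges and count each prediction outcome per range. The feature values, cut points, and labels below are made up, and the bucketing scheme is an assumption for illustration rather than how the report draws its plots.

```python
import bisect
from collections import Counter


def outcome(actual, predicted, positive=1):
    """Classify a single prediction as TP, FP, FN, or TN."""
    if predicted == positive:
        return "TP" if actual == positive else "FP"
    return "FN" if actual == positive else "TN"


def outcomes_by_bucket(feature_values, actual, predicted, cut_points):
    """Count outcomes per feature range; cut_points must be sorted ascending."""
    table = {i: Counter() for i in range(len(cut_points) + 1)}
    for value, a, p in zip(feature_values, actual, predicted):
        bucket = bisect.bisect_right(cut_points, value)
        table[bucket][outcome(a, p)] += 1
    return table


# Hypothetical numerical feature with ranges: below 10, 10 to below 20, 20 and above.
feature_values = [8, 9, 12, 14, 15, 18, 22, 25, 27, 30]
actual = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

table = outcomes_by_bucket(feature_values, actual, predicted, cut_points=[10, 20])
for bucket, counts in table.items():
    print(bucket, dict(counts))
```

In this toy data, all errors fall into the middle range of the feature, which is the kind of low-performance segment the table is meant to surface.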

For probabilistic classification, we show a set of plots that map predicted probabilities alongside feature values. It helps understand not only the type of error but also explore the exact distribution of the model output.

feature example

When should I use this report?

As with Regression, we recommend using this report whenever you can or want to check on your models. To get a full picture, you can also combine it with the Data Drift and Target Drift reports.

Here are our suggestions on when to use it:

  1. To analyze the results of the model test, trial run, or A/B test.
  2. To generate regular reports on the model performance in production both for data scientists and business stakeholders.
  3. To trigger model retraining and/or explore if retraining is likely to improve the performance.
  4. To debug model performance in production or analyze its improvement potential.

How can I try it?

Go to GitHub, read the docs, and explore the tool in action using sample notebooks. Here you can look at the example for the Iris dataset (Jupyter notebook, Colab) and breast cancer dataset (Jupyter notebook, Colab).

For the most recent update on the functionality, explore the docs and examples folder on GitHub.

And if you like it, spread the word!

_______________

For any questions, contact us via hello@evidentlyai.com. This is an early release, so let us know of any bugs! You can also open an issue on GitHub.


https://www.linkedin.com/in/emelidral/
Emeli Dral

Co-founder and CTO

https://www.linkedin.com/in/elenasamuylova/
Elena Samuylova

Co-founder and CEO


