*This article is a part of the **Classification Metrics Guide**. *

A confusion matrix is easily the most popular method of visualizing the quality of classification models. You can also derive several other relevant metrics from it.

We will show how to build a confusion matrix using the open-source Evidently Python library.

Let’s dive in!

- A confusion matrix is a **table** that sums up the performance of a classification model. It works for binary and multi-class classification.
- The confusion matrix shows the number of **correct predictions**: true positives (TP) and true negatives (TN).
- It also shows the **model errors**: false positives (FP) are “false alarms,” and false negatives (FN) are missed cases.
- Using TP, TN, FP, and FN, you can calculate various classification quality metrics, such as precision and recall.

Get an instant model quality report

Looking to understand your ML model quality? Try Evidently, an open-source Python library with 4m+ downloads. Get an interactive performance report with just a couple of lines of code.

A **confusion matrix** is a table that summarizes the performance of a classification model by comparing its predicted labels to the true labels. It displays the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) of the model's predictions.

Here's an example of a confusion matrix:

Let’s unpack it step by step!

First, a quick reminder about the problem behind the matrix.

A **classification model** is a machine learning model that assigns predefined categories or classes (labels) to new input data. If you have only two classes, the classification is binary. If you have more, it is a multi-class problem.

Here are some examples of **binary classification** problems:

- Fraud detection: predicting if a payment transaction is fraudulent.
- Churn prediction: predicting if a user is likely to stop using the service.
- Lead scoring: predicting if a potential customer is likely to convert into paying.

In all these examples, there are two distinct classes that the classification model needs to predict:

- "Fraudulent" and "non-fraudulent" transactions.
- "Churner" and "non-churner" users.
- "Likely to convert" and "unlikely to convert" customers.

To create a **confusion matrix**, you first need to generate the model predictions for the input data and then get the **actual labels**.

This way, you can judge the correctness of each model prediction. Was a transaction truly fraudulent? Did the user leave the service? Did a customer make the purchase?

Once you know the actual classes, you can **count** the number of times the model was right or wrong.

To make it more specific, you can also count the different **types of errors**.

Let’s consider an example of a payment fraud detection model. There are two types of errors the model can make:

- It can make a **false alarm** and label an ordinary transaction as fraudulent.
- It can **miss real fraud** and label a fraudulent transaction as ordinary.

The first type of error is called a **false positive**. The second is called a **false negative**.

The words “positive” and “negative” refer to the target and non-target classes. In this example, fraud is our target. We refer to transactions flagged as fraudulent as “positives.”

The distinction between false positives and negatives is important because the consequences of errors are different. You might consider one error less or more harmful than the other.

When it comes to correct predictions, you also have two different things to be correct about:

- The model can detect **actual fraud**.
- The model can correctly label **normal transactions**.

The first type is a **true positive**. The second type is a **true negative**.

Understanding the different types of “correctness” is also valuable. You are likely more interested in how well the model can identify fraudulent transactions rather than how often the model is right overall.

All in all, for every model prediction, you get one of 4 possible outcomes:

- True Positive (TP)
- True Negative (TN)
- False Positive (FP)
- False Negative (FN)

A **confusion matrix** helps visualize the frequency of each of them in a single place. This way, you can grasp the number of correct predictions and errors of each type simultaneously.

To create the matrix, you simply need to draw a table. For binary classification, it is a 2x2 table with two rows and columns.

Rows typically show the actual classes, and columns show the predicted classes.

Then, you populate the matrix with the numbers of true and false predictions on a given dataset, calculated as shown above.
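The counting logic above can be sketched in a few lines of plain Python. The labels and layout follow the article's convention (rows are actual classes, with the positive class first, so true positives land in the top left); the tiny example dataset is made up for illustration:

```python
# Count TP, TN, FP, FN by comparing predictions to actual labels.
# "spam" is the positive (target) class in this toy example.
actual = ["spam", "not spam", "spam", "not spam", "spam"]
predicted = ["spam", "spam", "not spam", "not spam", "spam"]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == "spam" and p == "spam")
tn = sum(1 for a, p in pairs if a == "not spam" and p == "not spam")
fp = sum(1 for a, p in pairs if a == "not spam" and p == "spam")
fn = sum(1 for a, p in pairs if a == "spam" and p == "not spam")

# Rows: actual (positive, then negative); columns: predicted.
matrix = [[tp, fn],
          [fp, tn]]
```

Every prediction falls into exactly one of the four cells, so the cell counts always sum to the dataset size.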

Let’s look at a specific example!

To recap, we will go backward: look at an example of a pre-built confusion matrix and explain how to reach each element.

Let’s say we have an email **spam classification **model. It is a binary classification problem. The two possible classes are “spam” and “not spam.”

After training the model, we generated predictions for 10000 emails in the validation dataset. We already know the actual labels and can evaluate the quality of the model predictions.

Here is how the resulting matrix can look:

**True Positive (TP)**

- This is the top left (green) corner of the matrix.
- It shows the number of correctly identified positive cases. These are the cases where the actual label is positive, and the model correctly predicted it as positive.
- In spam detection, this is the number of correctly predicted spam emails.
- In the example, the number of true positives is 600.

**True Negative (TN)**

- This is the bottom right (green) corner of the matrix.
- It shows the number of correctly identified negative cases. These are the cases where the actual label is negative, and the model correctly predicted it as negative.
- In spam detection, this is the number of correctly predicted non-spam emails.
- In the example, the number of true negatives is 9000.

**False Positive (FP)**

- This is the bottom left (pink) corner of the matrix.
- It shows the number of incorrectly predicted positive cases. These are the cases where the actual label is negative, but the model predicted it as positive.
- To put it simply, these are **false alarms**. They are also known as **Type 1 errors**.
- In spam detection, this is the number of emails incorrectly labeled as spam. Think of regular emails sent to the spam folder in error.
- In the example, the number of false positives is 100.

**False Negative (FN)**

- This is the top right (pink) corner of the matrix.
- It shows the number of incorrectly predicted negative cases. In other words, these are the cases where the actual label is positive, but the model predicted it as negative.
- To put it simply, these are **missed cases**. They are also known as **Type 2 errors**.
- In spam detection, this is the number of missed spam emails that made their way into the primary inbox.
- In the example, the number of false negatives is 300.

Want to see a real example with data and code? Here is a tutorial on the employee churn prediction problem, “What is your model hiding?”. You will train two different classification models and explore how to evaluate and compare their quality.

The confusion matrix shows the **absolute number** of correct and false predictions. It is convenient when you want to get a sense of scale (“How many emails did we falsely send to a spam folder this week?”).

However, it is not always practical to use absolute numbers. To compare models or track their performance over time, you also need some **relative **metrics. The good news is you can derive such quality metrics directly from the confusion matrix.

Here are some commonly used metrics to measure the performance of the classification model.

**Accuracy** is the share of correctly classified objects in the total number of objects. In other words, it shows how often the model is right overall.

You can calculate accuracy by dividing all true predictions by the total number of predictions. Accuracy is a valuable metric with an intuitive explanation.

In our example above, accuracy is **(9000+600)/10000 = 0.96**. The model was correct in 96% of cases.

However, accuracy can be misleading for imbalanced datasets when one class has significantly more samples. In our example, we have many non-spam emails: 9100 out of 10000 are regular emails. The overall model “correctness” is heavily skewed to reflect how well the model can identify those non-spam emails. The accuracy number is not very informative if you are interested in catching spam.
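The accuracy calculation, and the imbalance caveat, can be checked directly with the counts from the spam example above. A quick sketch:

```python
# Counts from the spam example: TP, TN, FP, FN.
tp, tn, fp, fn = 600, 9000, 100, 300
total = tp + tn + fp + fn  # 10000 emails

# Accuracy: all correct predictions over all predictions.
accuracy = (tp + tn) / total  # (9000 + 600) / 10000 = 0.96

# The imbalance caveat: a trivial model that always predicts
# "not spam" would label all 9100 regular emails (TN + FP) correctly
# and still reach 91% accuracy without catching a single spam email.
always_negative_accuracy = (tn + fp) / total  # 0.91
```

The near-identical scores for the real model and the do-nothing baseline illustrate why accuracy alone is not informative on imbalanced data.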

**Precision** is the share of true positive predictions in all positive predictions. In other words, it shows how often the model is right when it predicts the target class.

You can calculate precision by dividing the number of correctly identified positives by the total number of positive predictions made by the model.

In our example above, precision is **600/(600+100) = 0.86**. When predicting “spam,” the model was correct in 86% of cases.

Precision is a good metric when the cost of false positives is high. If you prefer to avoid sending good emails to spam folders, you might want to focus primarily on precision.

**Recall, or true positive rate (TPR)**. Recall shows the share of true positive predictions made by the model out of all positive samples in the dataset. In other words, the recall shows how many instances of the target class the model can find.

You can calculate the recall by dividing the number of true positives by the total number of positive cases.

In our example above, recall is **600/(600+300)= 0.67**. The model correctly found 67% of spam emails. The other 33% made their way to the inbox unlabeled.

Recall is a helpful metric when the cost of false negatives is high. For example, you can optimize for recall if you do not want to miss any spam (even at the expense of falsely flagging some legitimate emails).
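Both formulas follow directly from the confusion matrix cells. A minimal sketch, using the counts from the spam example:

```python
# Counts from the spam example: TP, TN, FP, FN.
tp, tn, fp, fn = 600, 9000, 100, 300

# Precision: of everything the model flagged as spam, how much was spam?
precision = tp / (tp + fp)  # 600 / 700 ≈ 0.86

# Recall: of all actual spam, how much did the model catch?
recall = tp / (tp + fn)  # 600 / 900 ≈ 0.67
```

Note that the two metrics share the numerator (TP) but divide by different totals: precision by the predicted positives (a column of the matrix), recall by the actual positives (a row).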

To better understand how to strike a balance between metrics, read a separate chapter about Accuracy, Precision, and Recall and how to set a custom decision threshold. You can also read about other classification metrics, such as F1-Score and ROC AUC.

You can use a confusion matrix in multi-class classification problems, too. In this case, the matrix will have more than two rows and columns. Their number depends on the number of labels the model is tasked to predict.

Otherwise, it follows the same logic. Each row represents the instances in the actual class, and each column represents the instances in a predicted class. Rinse and repeat as many times as you need.

Let’s say you are classifying reviews that users leave on the website into 3 groups: “negative,” “positive,” and “neutral.” Here is an example of a confusion matrix for a problem with 3 classes:

In this confusion matrix, each row represents the actual review label, while each column represents the predicted review label.

Conveniently, the diagonal cells show correctly classified samples, so you can take them in at a glance. The off-diagonal cells show model errors.

Here is how you read the matrix:

- In the top row, the model correctly labeled 700 (out of all 1000) negative reviews.
- In the second row, the model correctly labeled 8300 (out of 8600) neutral reviews but misclassified 200 as negative and 100 as positive.
- In the third row, the model correctly predicted 300 (out of 400) positive reviews but misclassified 100 as neutral.
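The same row-wise reading can be done in code. A sketch of the 3-class matrix above, with per-class recall (the diagonal cell over the row total); the split of the 300 misclassified negative reviews is not stated in the text, so the 200/100 breakdown in the first row is an assumption for illustration:

```python
# Hypothetical 3-class confusion matrix from the review example.
# Rows: actual class; columns: predicted class.
labels = ["negative", "neutral", "positive"]
cm = [
    [700, 200, 100],   # actual negative (error split assumed)
    [200, 8300, 100],  # actual neutral
    [0,   100,  300],  # actual positive
]

# Per-class recall: correctly classified samples (diagonal cell)
# divided by all actual samples of that class (row total).
recall_per_class = {
    labels[i]: cm[i][i] / sum(cm[i]) for i in range(len(labels))
}
```

Reading per-class recall off the rows this way often reveals that a model handles the majority class (here, “neutral”) much better than the rare ones.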

You can read more about how to calculate Accuracy, Precision, and Recall for multi-class classification in a separate chapter.

A confusion matrix is typically used in post-training model evaluation. You can also use it in the assessment of production model quality.

In this case, you can generate two side-by-side matrices to compare the latest model quality with some reference period: say, past month, past week, or model validation period.

The main limitation of using the confusion matrix in production model evaluation is that you must get the **true labels** for every model prediction. This might be possible, for example, when subject matter experts (e.g., payment disputes team) review the model predictions after some time. However, often you only get feedback on some of the predictions or receive only partial labels.

Depending on your exact ML product, it might be more convenient to dynamically monitor specific metrics, such as **precision**. For example, in cases like payment fraud detection, you are more likely to send suspicious transactions for manual review and receive the true label quickly. This way, you can get the data for some of the confusion matrix's components faster than others.

Separately, it might also be useful to monitor the **absolute number of positive and negative labels** predicted by the model and the **distribution drift** in the model predictions. Even before you receive the feedback, you can detect a deviation in the model predictions (prediction drift): such as when a model starts to predict “fraud” more often. This might signal an important change in the model environment.

If you want to generate a confusion matrix for your data, you can easily do this with tools like sklearn.
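With scikit-learn, this is a one-function call. A minimal sketch on a tiny made-up dataset; passing `labels` with the positive class first keeps the layout used in this article (actual classes in rows, true positives in the top left):

```python
from sklearn.metrics import confusion_matrix

# Toy data: actual labels and model predictions.
y_true = ["spam", "not spam", "spam", "not spam", "spam", "not spam"]
y_pred = ["spam", "not spam", "not spam", "spam", "spam", "not spam"]

# Rows: actual class; columns: predicted class.
cm = confusion_matrix(y_true, y_pred, labels=["spam", "not spam"])
print(cm)
```

Without the `labels` argument, scikit-learn orders classes alphabetically, which would flip the matrix relative to the layout shown above.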

To get a complete classification quality report for your model, you can use Evidently, an open-source Python library that helps evaluate, test, and monitor ML models in production.

You will need to prepare a dataset that includes the predicted values for each class and the true labels, and pass it to the tool. You will instantly get an interactive report that includes a confusion matrix, accuracy, precision, and recall metrics, the ROC AUC score, and other visualizations. You can also integrate these model quality checks into your production pipelines.

Evidently allows calculating various additional Reports and Test Suites for model and data quality. To start, check out Evidently on GitHub and go through the Getting Started Tutorial.