*This article is a part of the **Classification Quality Metrics guide**.*

There are different ways to calculate accuracy, precision, and recall for multi-class classification. You can calculate metrics by each class or use macro- or micro-averaging. This chapter explains the difference between the options and how they behave in important corner cases.

We will also show how to calculate accuracy, precision, and recall using the open-source Evidently Python library.

Before diving in, ensure you know how accuracy, precision, and recall work in binary classification. If you need a refresher, there is a separate chapter in the guide.

- You can use accuracy, precision, and recall in multi-class classification.
- Accuracy works the same way as in binary classification, but there are different ways to calculate precision and recall.
- In a simple case, you can calculate **precision and recall by class**.
- If you have a lot of classes, you can also macro- or micro-average precision and recall.
- **Macro-averaging** gives equal weight to each class, while **micro-averaging** gives equal weight to each instance.
- When each data point is assigned a single class, micro-averaged precision and recall are the same and identical to accuracy.

Start with AI observability

Want to keep tabs on your classification models? Automate the quality checks with Evidently Cloud. Powered by the leading open-source Evidently library with 20m+ downloads.

Multi-class classification is a machine learning task that assigns the objects in the input data to **one of several** predefined categories.

In binary classification, you deal with two possible classes. For example, "spam" or "not spam" emails or "fraudulent" or "non-fraudulent" transactions.

In multi-class, you have multiple categories to choose from. Some example use cases:

- **Product categorization.** You might want to automatically assign each product on an e-commerce website to a specific category, such as "clothing," "electronics," or "home and garden."
- **User segmentation.** You might classify software product users into multiple known categories based on their usage patterns and upsell potential, such as "small business owners," "freelancers," and "enterprise."
- **Classification of support tickets.** When users submit support requests, you can automatically tag tickets by the topic, such as "technical issue," "billing inquiry," or "product feedback."

Having multiple classes brings more complexity to evaluating the model quality. This is what this chapter is about!

**Multi-class vs. multi-label.** In this chapter, we focus on multi-class classification. It is different from multi-label. In multi-class classification, there are multiple classes, but each object belongs to a single class. In multi-label classification, each object might belong to multiple categories simultaneously. The evaluation then works differently.

Accuracy is a popular performance metric in classification problems. The good news is that you can directly borrow the metric from binary classification and calculate it for multi-class in the same way.

**Accuracy** measures the proportion of correctly classified cases from the total number of objects in the dataset. To compute the metric, divide the number of correct predictions by the total number of predictions made by the model.

Let’s say we have a problem with 4 classes. The validation dataset contains 45 objects, and the true labels (**actual** classes) are distributed as follows: 15 objects of Class “A,” 5 of Class “B,” 10 of Class “C,” and 15 of Class “D.”

After we trained our model and generated the predictions for the validation dataset, we can evaluate the model quality. Here is the result we received:

Now, this colorful example might be mildly confusing because it shows all the model predictions **and** the actual labels. However, to calculate accuracy, we only need to know which predictions were correct and which were not. Accuracy is “blind” to specific classes.

We can simplify the illustration:

To calculate accuracy, divide all correct predictions by the total number of predictions.

In our case, the accuracy is 37/45 = 82%.
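As a minimal sketch in plain Python, the same calculation can be written as a small function over two label lists. The function name `accuracy` and the toy labels are illustrative assumptions. The last lines simply reuse the 37 correct predictions out of 45 from the example above.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical toy labels, only to show the call (4 correct out of 5).
toy_true = ["A", "B", "C", "A", "D"]
toy_pred = ["A", "A", "C", "A", "D"]
toy_accuracy = accuracy(toy_true, toy_pred)

# The chapter's example: 37 correct predictions out of 45.
example_accuracy = 37 / 45

print(f"{toy_accuracy:.2f}, {example_accuracy:.2f}")  # 0.80, 0.82
```

Note that the function never looks at which class a label belongs to. It only checks whether the prediction matches, which is why accuracy is "blind" to specific classes.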

**Accuracy is straightforward to interpret**. Did you make a model that classifies 90 out of 100 samples correctly? The accuracy is 90%! Did it classify 87? 87%!

However, accuracy has its downsides. While it does provide an estimate of the overall model quality, it disregards **class balance** and the **cost of different errors.**

Just like with binary classification, in multi-class, some classes might be more prevalent. It might be easier for the model to classify them – at the cost of minority classes. In this case, high accuracy can be confusing.

Say you are dealing with manufacturing defect prediction. For every new product on a manufacturing line, you assign one of the categories: “no defect,” “minor defect,” “major defect,” or “scrap.” You are most interested in finding defects: the goal is to proactively inspect and take faulty products off the line.

The model might be mostly correct in assigning the “no defect” and “scrap” labels but perform poorly when predicting the minor and major defects. Meanwhile, the accuracy metric might still be high because the model does well on the majority classes.

In this case, accuracy might not be a suitable metric.
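To make the effect concrete, here is a hedged illustration with made-up numbers. The class counts and the degenerate model that always predicts “no defect” are assumptions for the sake of the example, not data from this chapter.

```python
# Hypothetical label counts for 1,000 inspected products (made-up numbers).
actual_counts = {"no defect": 900, "minor defect": 50, "major defect": 30, "scrap": 20}

# A degenerate "model" that always predicts the majority class.
y_true = [label for label, n in actual_counts.items() for _ in range(n)]
y_pred = ["no defect"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(label):
    """Recall for one class: correctly found items / all items of that class."""
    found = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    return found / actual_counts[label]


print(f"accuracy = {accuracy:.0%}")                            # accuracy = 90%
print(f"minor defect recall = {recall('minor defect'):.0%}")   # minor defect recall = 0%
```

In this sketch, the accuracy looks high even though the model never finds a single defect.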

In our visual example, the model did not do a very good job of predicting Class "B." However, since there were only 5 instances of this class, it did not impact the accuracy dramatically.

To better understand the performance of the classifier, you need to look at other metrics like precision and recall. They can provide more detailed information about the types of errors the classifier makes for each class.

Precision and recall metrics are also not limited to binary classification. You can use them in multi-class classification problems as well.

However, there are different approaches to how to do that, each with its pros and cons.

- In the first case, you can calculate the precision and recall **for each class individually**. This way, you get multiple metrics based on the number of classes you have in the dataset.
- In the second case, you can **"average" precision and recall across all the classes** to get a single number. You can use different methods to average the precision, such as macro- or micro-averaging.

Let’s look at both approaches, starting with calculating the metrics by class.

The most intuitive way is to calculate the precision and recall by class. It follows the same logic as in binary classification.

The only difference is that when computing the recall and precision in binary classification, you focus on a single positive (target) class. With multi-class, there are many classes to predict. To overcome this, you can calculate precision and recall for each class in the dataset individually, each time treating this specific class as "positive." Accordingly, you will treat all other classes as a single "negative" class.

Let's come up with definitions!

**Precision for a given class** in multi-class classification is the fraction of instances correctly classified as belonging to a specific class out of all instances the model predicted to belong to that class.

In other words, precision measures the model's ability to identify instances of a particular class correctly.

**Recall for a given class** in multi-class classification is the fraction of instances in a class that the model correctly classified out of all instances in that class.

In other words, recall measures the model's ability to identify all instances of a particular class.
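The two definitions can be sketched in plain Python as follows. The function names and the toy labels are illustrative assumptions, and returning `None` when a class has no predictions (or no actual objects) is one possible convention for the undefined case, not the only one.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, FN for `positive`, treating all other classes as negative."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class; None when the denominator is zero."""
    tp, fp, fn = one_vs_rest_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Hypothetical toy labels, not the chapter's dataset.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "A", "C"]

for label in ["A", "B", "C"]:
    print(label, precision_recall(y_true, y_pred, label))
```

Each call treats one class as "positive" and lumps every other class into a single "negative" class, exactly as in the binary case.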

Let’s stick to the same example. Here is the reminder on how the model predictions look:

Say we want to calculate the precision and recall for **Class “A.”**

To calculate the **recall**, we divide the number of correct predictions of Class “A” by the total number of Class “A” objects in the dataset (both identified and not).

To calculate the **precision**, we divide the number of correct predictions of Class “A” by the total number of Class “A” predictions (true and false).

We can see that for Class “A,” the model is not doing badly.

- The recall is 13/15=87%. The model correctly identified 13 out of 15 class “A” objects.
- The precision is somewhat worse: 13/18=72%. The model predicted that a certain object belongs to class “A” 18 times but was correct only in 13 of them.

Now, let’s look at **Class “B.”** The results are much worse.

- The **recall in Class “B”** is only 1/5 = 20%. The model correctly identified only 1 example out of 5.
- The **precision in Class “B”** is 1/3 = 33%. The model predicted Class “B” 3 times but was correct only once.

For other classes, we follow a similar approach. We’ll skip the visuals, but here are the final results for all 4 classes:

- Class “A” recall: 13/15=87%. Class “A” precision: 13/18=72%.
- Class “B” recall: 1/5=20%. Class “B” precision: 1/3=33%.
- Class “C” recall: 9/10=90%. Class “C” precision: 9/10=90%.
- Class “D” recall: 14/15=93%. Class “D” precision: 14/14=100%.
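The per-class numbers can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the dictionary layout and key names are illustrative, and the number of Class “D” predictions (14) is inferred from the 45 total predictions minus the 18, 3, and 10 predictions made for the other classes.

```python
# Per-class counts from the example: true positives, total predictions
# of the class, and total actual objects of the class.
# The "predicted" value for D is inferred: 45 - 18 - 3 - 10 = 14.
counts = {
    "A": {"tp": 13, "predicted": 18, "actual": 15},
    "B": {"tp": 1, "predicted": 3, "actual": 5},
    "C": {"tp": 9, "predicted": 10, "actual": 10},
    "D": {"tp": 14, "predicted": 14, "actual": 15},
}

per_class = {}
for label, c in counts.items():
    precision = c["tp"] / c["predicted"]  # correct predictions / all predictions of the class
    recall = c["tp"] / c["actual"]        # correct predictions / all actual objects of the class
    per_class[label] = (precision, recall)
    print(f"Class {label}: precision = {precision:.0%}, recall = {recall:.0%}")
```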

When is calculating precision and recall by class a good idea?

- When you have specific **classes of interest.** Calculating the metrics by category is useful when you want to evaluate the performance of a particular class (or classes) and to know how well the classifier can distinguish this class from the others.
- It can also be helpful when you deal with **imbalanced classes**. Calculating precision and recall for the minority class (or classes) will clearly expose any issues.
- When you have a **small number of classes**. In this case, you can judge each metric individually and grasp them together. It is often the simplest solution.

However, there is a downside:

- Calculating precision and recall for each class can result in a **large number of performance metrics**. The more classes, the more metrics you get. They can be challenging to interpret and grasp simultaneously.

When you have a lot of classes, you might prefer to use macro or micro averages. They provide a more concise summary of the performance.

The idea is simple: instead of having those many metrics for every class, let’s reduce it to one “average” metric. However, there are differences in how you can implement it. The two popular approaches are macro- and micro-averaging.

Here is how you compute macro-averaged precision and recall:

- Calculate the number of true positives (TP), false positives (FP), and false negatives (FN) for each class.
- Compute precision and recall for each class as TP / (TP + FP) and TP / (TP + FN).
- **Average the precision and recall across all classes** to get the final macro-averaged precision and recall scores.

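Here are the formulas to average precision and recall across all classes:

$$\text{Macro-average Precision} = \frac{\text{Precision}_1 + \text{Precision}_2 + \dots + \text{Precision}_N}{N}$$

$$\text{Macro-average Recall} = \frac{\text{Recall}_1 + \text{Recall}_2 + \dots + \text{Recall}_N}{N}$$

where $N$ is the total number of classes, and $\text{Precision}_1, \text{Precision}_2, \dots, \text{Precision}_N$ and $\text{Recall}_1, \text{Recall}_2, \dots, \text{Recall}_N$ are the precision and recall values for each class.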

In short, first, you measure the metric by class, then you average it across classes.
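The three steps can be sketched in Python as follows. The helper name `macro_average` is an illustrative assumption. The (TP, FP, FN) counts are taken from the 4-class example in this chapter, where FP for a class is the number of predictions of that class minus its TP, and FN is the number of actual objects of that class minus its TP.

```python
def macro_average(values):
    """Unweighted arithmetic mean of per-class metric values."""
    values = list(values)
    return sum(values) / len(values)


# Step 1: (TP, FP, FN) per class, from the chapter's running example.
counts = {"A": (13, 5, 2), "B": (1, 2, 4), "C": (9, 1, 1), "D": (14, 0, 1)}

# Step 2: precision and recall for each class.
precisions = [tp / (tp + fp) for tp, fp, fn in counts.values()]
recalls = [tp / (tp + fn) for tp, fp, fn in counts.values()]

# Step 3: average across classes; every class gets the same weight of 1/N.
macro_precision = macro_average(precisions)
macro_recall = macro_average(recalls)

print(f"macro precision = {macro_precision:.3f}, macro recall = {macro_recall:.3f}")
```

Working from the raw counts rather than from rounded percentages avoids small rounding differences in the final averages.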

As an alternative, you can calculate **micro-average precision and recall.**

In this case, you must first calculate the total number of true positives (TP), false positives (FP), and false negatives (FN) across all classes:

- **Total True Positive** is the sum of true positive counts across all classes;
- **Total False Positive** is the sum of false positive counts across all classes;
- **Total False Negative** is the sum of false negative counts across all classes.

Then, calculate the precision and recall using these total counts.

- To calculate the **precision**, divide the total True Positives by the sum of total True Positives and False Positives.
- To calculate the **recall**, divide the total True Positives by the sum of total True Positives and False Negatives.

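The formulas for micro-average precision and recall are:

$$\text{Micro-average Precision} = \frac{\text{Total True Positive}}{\text{Total True Positive} + \text{Total False Positive}}$$

$$\text{Micro-average Recall} = \frac{\text{Total True Positive}}{\text{Total True Positive} + \text{Total False Negative}}$$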

In short, you sum up all TP, FP, and FN predictions across classes and calculate precision and recall jointly.

Now, if you look at the last two formulas closely, you will see that micro-average precision and micro-average recall will **arrive at the same number.**

The reason is that every False Positive for one class is a False Negative for another class. For example, if you misclassify Class “A” as Class “B,” it will be a False Negative for Class “A” (a missed instance) but a False Positive for Class “B” (incorrectly assigned as Class “B”).

Thus, the total number of False Negatives and False Positives in the multi-class dataset will be the same. (It would work differently for multi-label!).
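A short sketch makes this easy to check. It sums the per-class counts from the chapter's 4-class example (the tuple layout is an illustrative assumption) and computes both micro-averaged metrics from the totals.

```python
# (TP, FP, FN) per class, from the chapter's running example.
counts = {"A": (13, 5, 2), "B": (1, 2, 4), "C": (9, 1, 1), "D": (14, 0, 1)}

# Sum the counts across classes first...
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())

# ...then compute the metrics once, from the totals.
micro_precision = total_tp / (total_tp + total_fp)
micro_recall = total_tp / (total_tp + total_fn)

print(total_tp, total_fp, total_fn)                  # 37 8 8
print(f"{micro_precision:.2f} {micro_recall:.2f}")   # 0.82 0.82
```

The per-class FP and FN counts differ, but their totals match, so the two denominators are identical.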

There is a fundamental difference between macro- and micro-averaging in how they aggregate performance metrics.

**Macro-averaging** calculates each class's performance metric (e.g., precision, recall) and then takes the arithmetic mean across all classes. So, the macro-average gives equal **weight to each class**, regardless of the number of instances.

**Micro-averaging**, on the other hand, aggregates the counts of true positives, false positives, and false negatives across all classes and then calculates the performance metric based on the total counts. So, the micro-average gives equal **weight to each instance**, regardless of the class label and the number of cases in the class.

To illustrate this difference, let’s return to our example. We have 45 instances and 4 classes. The number of instances in each class is as follows:

- Class “A”: 15 instances
- Class “B”: 5 instances
- Class “C”: 10 instances
- Class “D”: 15 instances

We already estimated the recall and precision by class, so it will be easy to compute **macro-average** precision and recall. We sum them up and divide them by the number of classes.

Using macro-averaging, the average precision and recall across all classes would be:

- Macro-average precision = (0.72 + 0.33 + 0.9 + 1) / 4 = 0.74
- Macro-average recall = (0.87 + 0.2 + 0.9 + 0.93) / 4 = 0.725 ≈ 0.73

Each class equally contributes to the final quality metric.

Now, let’s look at **micro-averaging**. In this case, you need first to calculate the total counts of true positives, false positives, and false negatives across all classes. Then, you compute precision and recall using the total counts.

We already claimed that precision and recall would be the same in this case. Let’s visually demonstrate it.

Let’s start at the same point and follow the formulas. Here are the model predictions:

We first need to calculate the **True Positives** across each class. Since we arranged the predictions by the actual class, it is easy to count them visually.

The model correctly classified 13 samples of Class “A,” 1 sample of Class “B,” 9 samples of Class “C,” and 14 samples of Class “D.” The **total number** of True Positives is 37.

To calculate the recall, we also need to look at the total **False Negatives** number. To count them visually, we need to look at the “missed instances” that belong to each class but were missed by the model.

Here is how they are split across classes: the model missed 2 instances of class “A,” 4 instances of class “B,” and 1 instance each of classes “C” and “D.” The total number of False Negatives is 8.

Now, what about **False Positives**? A false positive is an instance of incorrect classification. The model said it was “B,” but was wrong? This is a False Positive for Class “B.”

This is a bit more complex to grasp visually, since we need to look at the color-coded predicted labels, and the errors are spread across classes.

However, what’s important is that we **look at the same erroneous predictions** as before! Each class’s False Positive is another class’s False Negative.

They are distributed differently: for example, our model often erroneously assigned class “A” but never class “D.” But the total number of False Negatives and False Positives is the same: 8.

As a result, both **micro-average** precision and recall are the same: 0.82. Both come from the same division: 37 / (37 + 8) = 37/45 = 0.82.

What’s even more interesting, **this number is the same as accuracy.** What we just did was divide the number of correct predictions by the total number of the (right and wrong) predictions. This is the accuracy formula!

For multi-class classification, micro-average precision equals micro-average recall and equals accuracy.
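The equality is easy to verify on a confusion matrix. In the sketch below, the diagonal and the row and column totals match the chapter's example, but the exact placement of the off-diagonal errors is an assumption made for illustration, since only the totals are known.

```python
labels = ["A", "B", "C", "D"]

# Rows are actual classes, columns are predicted classes.
# Diagonal and row/column totals follow the chapter's example;
# the placement of the off-diagonal errors is an ASSUMPTION.
confusion = [
    [13, 2, 0, 0],   # actual A
    [3, 1, 1, 0],    # actual B
    [1, 0, 9, 0],    # actual C
    [1, 0, 0, 14],   # actual D
]

n = len(labels)
total = sum(sum(row) for row in confusion)
tp = [confusion[i][i] for i in range(n)]
# False negatives: row total minus the diagonal cell.
fn = [sum(confusion[i]) - tp[i] for i in range(n)]
# False positives: column total minus the diagonal cell.
fp = [sum(confusion[r][i] for r in range(n)) - tp[i] for i in range(n)]

accuracy = sum(tp) / total
micro_precision = sum(tp) / (sum(tp) + sum(fp))
micro_recall = sum(tp) / (sum(tp) + sum(fn))

print(fp, fn)  # [5, 2, 1, 0] [2, 4, 1, 1]
print(accuracy == micro_precision == micro_recall)  # True
```

Every off-diagonal cell is counted once as a False Negative (in its row) and once as a False Positive (in its column). So both sums equal the total number of errors, whatever the placement of those errors is.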

**To sum up, how did micro- and macro-averaging work out for our examples?** The results were different:

- Macro-average precision is about 74% (the mean of 72%, 33%, 90%, and 100%).
- Macro-average recall is about 73% (the mean of 87%, 20%, 90%, and 93%).
- Micro-average precision and recall are both 82%.

**Macro-averaging results in a “worse” outcome** since it gives equal weight to each class. 1 out of 4 classes in our example has very low performance. This significantly impacts the score since it constitutes 25% of the final evaluation.

**Micro-averaging leads to a “better” metric.** It gives equal weight to each instance, and the number of objects in the worse-performing class is low. It only has 5 examples out of 45 total. In this case, their contribution to the overall score was lower.

A suitable metric depends on the specific problem and the importance of each class or instance.

**Macro-averaging** treats each class equally.

- It can be useful when **all classes are equally important**, and you want to know how well the classifier performs on average across them.
- It is also useful when you have an **imbalanced dataset** and want to ensure each class equally contributes to the final evaluation.

However, macro averaging can also distort the perception of performance.

- For example, it **can make the classifier look “worse”** due to low performance in an unimportant and small class since it will still contribute equally to the overall score.
- In the opposite scenario, it **can disguise poor performance** in the critical minority class when the overall number of classes is large. In this case, the “contribution” of each class is diluted. The classifier may still achieve high macro-averaged precision and recall by performing well on the majority classes but poorly on the minority class.

If classes have unequal importance, measuring precision and recall by class or weighing them by importance might be helpful.

**Micro-averaging** can be more appropriate when you want to account for the total number of misclassifications in the dataset. It gives equal weight to each instance and will have a higher score when the overall number of errors is low. (If this sounds like accuracy, it is because it is!)

However, micro-averaging can also **overemphasize the performance of the majority class,** especially when it dominates the dataset. In this case, micro-averaging can lead to inflated performance scores when the classifier performs well on the majority class but poorly (or very poorly) on the minority classes. If the class is small, you might not notice!

As a result, there is no single best metric. To choose the most suitable one, you need to consider the number of classes, their balance, and their relative importance.

There is one more option: in some scenarios, it might be appropriate to use **weighted averaging**. This approach takes into account the balance of classes. You weigh each class based on its representation in the dataset. Then, you compute precision and recall as a weighted average of the precision and recall in individual classes.

Simply put, it would work like macro-averaging, but instead of dividing precision and recall by the number of classes, you give each class a fair representation based on the proportion it takes in the dataset.

This approach is useful if you have an imbalanced dataset but want to assign larger importance to classes with more examples.
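Here is a minimal sketch of weighted averaging, assuming each class is weighted by its number of actual objects (its "support"). The helper name `weighted_average` is illustrative, and the counts are the (TP, FP, FN) values from the chapter's 4-class example.

```python
def weighted_average(values, supports):
    """Average of per-class values, weighted by the number of actual objects per class."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total


# (TP, FP, FN) per class, from the chapter's running example.
counts = {"A": (13, 5, 2), "B": (1, 2, 4), "C": (9, 1, 1), "D": (14, 0, 1)}

# Support = number of actual objects of the class = TP + FN.
supports = [tp + fn for tp, fp, fn in counts.values()]
precisions = [tp / (tp + fp) for tp, fp, fn in counts.values()]
recalls = [tp / (tp + fn) for tp, fp, fn in counts.values()]

weighted_precision = weighted_average(precisions, supports)
weighted_recall = weighted_average(recalls, supports)

print(f"weighted precision = {weighted_precision:.3f}")
print(f"weighted recall = {weighted_recall:.3f}")
```

With support-based weights, the small Class “B” contributes only 5/45 of the score instead of 1/4. One side effect of this particular weighting is that weighted recall reduces to the total True Positives divided by the total number of objects, which is the accuracy.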

You have different options when calculating quality metrics in multi-class classification.

- Calculating **precision and recall by class** is useful when you want to evaluate the performance of a classifier for a specific class of interest or when dealing with imbalanced classes, but it can result in a large number of performance metrics.
- When you have a large number of classes or want a more concise summary of overall performance, using **macro or micro averages** can be a better option. **Macro-averaging** shows average performance **across classes**, treating each class as equally important. **Micro-averaging** gives equal weight to every instance and shows average performance across all predictions. In the case of multi-class classification, micro-averaged precision, recall, and accuracy are the same.
- You might also consider using **weighted averaging**.
- You might prefer one metric over another depending on the class balance and their relative importance.

To quickly calculate and visualize accuracy, precision, and recall for your machine learning models, you can use Evidently, an open-source Python library to evaluate, test and monitor ML models in production.

You will need to prepare your dataset that includes predicted values for each class and true labels and pass it to the tool. You will instantly get an interactive report that shows accuracy, precision, recall, ROC curve and other visualizations of the model’s quality. You can also integrate these model quality checks into your production pipelines.

Evidently allows calculating various additional Reports and Test Suites for model and data quality. Check out Evidently on GitHub and go through the Getting Started Tutorial.

Try our open-source library with over 20 million downloads, or sign up to Evidently Cloud to run no-code checks and bring all the team to a single workspace to collaborate on AI quality.
