This article is a part of the Classification Quality Metrics guide.
There are different ways to calculate accuracy, precision, and recall for multi-class classification. You can calculate metrics by each class or use macro- or micro-averaging. This chapter explains the difference between the options and how they behave in important corner cases.
We will also show how to calculate accuracy, precision, and recall using the open-source Evidently Python library.
Before diving in, ensure you know how accuracy, precision, and recall work in binary classification. If you need a refresher, there is a separate chapter in the guide.
Want to keep tabs on your classification models? Automate the quality checks with Evidently Cloud. Powered by the leading open-source Evidently library with 20m+ downloads.
Multi-class classification is a machine learning task that assigns the objects in the input data to one of several predefined categories.
In binary classification, you deal with two possible classes. For example, "spam" or "not spam" emails or "fraudulent" or "non-fraudulent" transactions.
In multi-class, you have multiple categories to choose from. Some example use cases:
Having multiple classes brings more complexity to evaluating the model quality. This is what this chapter is about!
Multi-class vs. multi-label. In this chapter, we focus on multi-class classification. It is different from multi-label. In multi-class classification, there are multiple classes, but each object belongs to a single class. In multi-label classification, each object might belong to multiple categories simultaneously. The evaluation then works differently.
Accuracy is a popular performance metric in classification problems. The good news is that you can directly borrow the metric from binary classification and calculate it for multi-class in the same way.
Accuracy measures the proportion of correctly classified cases from the total number of objects in the dataset. To compute the metric, divide the number of correct predictions by the total number of predictions made by the model.
Suppose we have a problem with 4 classes. Here is the distribution of the true labels (actual classes) on the validation dataset.
After we trained our model and generated the predictions for the validation dataset, we can evaluate the model quality. Here is the result we received:
Now, this colorful example might be mildly confusing because it shows all the model predictions and the actual labels. However, to calculate accuracy, we only need to know which predictions were correct and which were not. Accuracy is “blind” to specific classes.
We can simplify the illustration:
To calculate accuracy, divide all correct predictions by the total number of predictions.
In our case, the accuracy is 37/45 = 82%.
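As a minimal sketch in plain Python, accuracy only compares each prediction with its true label and ignores which class was involved. The function name `accuracy` and the toy labels are illustrative assumptions; only the 37 and 45 come from the example above.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical toy labels, only to show the call.
toy_true = ["A", "A", "B", "C", "D"]
toy_pred = ["A", "B", "B", "C", "A"]
toy_accuracy = accuracy(toy_true, toy_pred)  # 3 correct out of 5

# The numbers from the example above: 37 correct predictions out of 45.
example_accuracy = 37 / 45
print(f"{example_accuracy:.2f}")  # prints 0.82
```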
Accuracy is straightforward to interpret. Did you make a model that classifies 90 out of 100 samples correctly? The accuracy is 90%! Did it classify 87? 87%!
However, accuracy has its downsides. While it does provide an estimate of the overall model quality, it disregards class balance and the cost of different errors.
Just like with binary classification, in multi-class, some classes might be more prevalent. It might be easier for the model to classify them, at the cost of minority classes. In this case, high accuracy can be misleading.
Say, you are dealing with manufacturing defect prediction. For every new product on a manufacturing line, you assign one of the categories: “no defect,” “minor defect,” “major defect,” or “scrap.” You are most interested in finding defects: the goal is to proactively inspect and take faulty products off the line.
The model might be mostly correct in assigning the “no defect” and “scrap” labels but perform poorly when predicting the actual defect classes. Meanwhile, the accuracy metric might still be high because the model does well on the majority classes.
In this case, accuracy might not be a suitable metric.
In our visual example, the model did not do a very good job of predicting Class "B." However, since there were only 5 instances of this class, it did not impact the accuracy dramatically.
To better understand the performance of the classifier, you need to look at other metrics like precision and recall. They can provide more detailed information about the types of errors the classifier makes for each class.
Precision and recall metrics are also not limited to binary classification. You can use them in multi-class classification problems as well.
However, there are different approaches to how to do that, each with its pros and cons.
Let’s look at these approaches, starting with calculating the metrics by class.
The most intuitive way is to calculate the precision and recall by class. It follows the same logic as in binary classification.
The only difference is that when computing the recall and precision in binary classification, you focus on a single positive (target) class. With multi-class, there are many classes to predict. To overcome this, you can calculate precision and recall for each class in the dataset individually, each time treating this specific class as "positive." Accordingly, you will treat all other classes as a single "negative" class.
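The one-vs-rest treatment described above can be sketched as a small counting function. This is an illustration, not a reference implementation: the name `one_vs_rest_counts` and the toy labels are assumptions made for this example.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN when `positive` is the target class
    and every other class is merged into a single negative class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted the class, but the object belongs elsewhere
        elif t == positive:
            fn += 1  # object of the class that the model missed
        else:
            tn += 1  # neither actual nor predicted label is the target class
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Hypothetical toy labels, not the dataset from the example.
toy_true = ["A", "A", "A", "B", "B", "C"]
toy_pred = ["A", "A", "B", "B", "A", "C"]

counts_a = one_vs_rest_counts(toy_true, toy_pred, positive="A")
counts_b = one_vs_rest_counts(toy_true, toy_pred, positive="B")
print(counts_a)
print(counts_b)
```

Calling the function once per class gives the inputs for per-class precision and recall.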
Let's come up with definitions!
Precision for a given class in multi-class classification is the fraction of instances correctly classified as belonging to a specific class out of all instances the model predicted to belong to that class.
In other words, precision measures the model's ability to identify instances of a particular class correctly.
Recall in multi-class classification is the fraction of instances in a class that the model correctly classified out of all instances in that class.
In other words, recall measures the model's ability to identify all instances of a particular class.
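Both definitions can be written compactly. The subscript notation is introduced here for illustration: TP_c, FP_c, and FN_c are the true positives, false positives, and false negatives counted when class c is treated as the positive class.

```latex
\[
\text{Precision}_c = \frac{TP_c}{TP_c + FP_c},
\qquad
\text{Recall}_c = \frac{TP_c}{TP_c + FN_c}
\]
```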
Let’s stick to the same example. Here is the reminder on how the model predictions look:
Say we want to calculate the precision and recall for Class “A.”
To calculate the recall, we divide the number of correct predictions of Class “A” by the total number of Class “A” objects in the dataset (both identified and not).
To calculate the precision, we divide the number of correct predictions of Class “A” by the total number of Class “A” predictions (true and false).
We can see that for Class “A,” the model is not doing badly.
Now, let’s look at Class “B.” The results are much worse.
For other classes, we follow a similar approach. We’ll skip the visuals, but here are the final results for all 4 classes:
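As a rough sketch, per-class values like these can be computed from a confusion matrix. The correct counts per class (13, 1, 9, 14) and the class sizes (15, 5, 10, 15) follow the counts used in this chapter. The exact split of the errors across predicted classes is an assumption made for illustration, so the precision values below depend on it; the recall values do not.

```python
# Rows are actual classes, columns are predicted classes.
# Diagonal and row totals follow the chapter's example.
# The off-diagonal split is an ASSUMPTION made for illustration.
confusion = {
    "A": {"A": 13, "B": 2, "C": 0, "D": 0},
    "B": {"A": 3, "B": 1, "C": 1, "D": 0},
    "C": {"A": 1, "B": 0, "C": 9, "D": 0},
    "D": {"A": 1, "B": 0, "C": 0, "D": 14},
}


def per_class_metrics(confusion):
    """Return {class: (precision, recall)} from a nested-dict confusion matrix."""
    classes = list(confusion)
    metrics = {}
    for c in classes:
        tp = confusion[c][c]
        actual_total = sum(confusion[c].values())                # TP + FN
        predicted_total = sum(confusion[a][c] for a in classes)  # TP + FP
        precision = tp / predicted_total if predicted_total else 0.0
        recall = tp / actual_total if actual_total else 0.0
        metrics[c] = (precision, recall)
    return metrics


by_class = per_class_metrics(confusion)
for name, (precision, recall) in by_class.items():
    print(f"Class {name}: precision={precision:.2f}, recall={recall:.2f}")
```

Returning 0.0 when a class is never predicted is one possible convention; other tools may raise a warning or leave the value undefined.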
When is calculating precision and recall by class a good idea?
However, there is a downside:
When you have a lot of classes, you might prefer to use macro or micro averages. They provide a more concise summary of the performance.
The idea is simple: instead of tracking separate metrics for every class, let’s reduce them to one “average” metric. However, there are differences in how you can implement it. The two popular approaches are macro- and micro-averaging.
Here is how you compute macro-averaged precision and recall:
Here are the formulas to average precision and recall across all classes:
where N is the total number of classes, and Precision1, Precision2, ..., PrecisionN and Recall1, Recall2, ..., RecallN are the precision and recall values for each class.
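Spelled out in LaTeX with the symbols defined in the previous sentence, macro-averaging is a plain arithmetic mean of the per-class values:

```latex
\[
\text{Precision}_{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} \text{Precision}_i,
\qquad
\text{Recall}_{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} \text{Recall}_i
\]
```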
In short, first, you measure the metric by class, then you average it across classes.
As an alternative, you can calculate micro-average precision and recall.
In this case, you must first calculate the total number of true positives (TP), false positives (FP), and false negatives (FN) across all classes:
Then, calculate the precision and recall using these total counts.
The formulas for micro-average precision and recall are:
In short, you sum up all TP, FP, and FN predictions across classes and calculate precision and recall jointly.
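In LaTeX, with TP_i, FP_i, and FN_i denoting the counts for class i (index notation assumed here for illustration) and N the number of classes, the micro-averaged metrics are:

```latex
\[
\text{Precision}_{\text{micro}} =
\frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} TP_i + \sum_{i=1}^{N} FP_i},
\qquad
\text{Recall}_{\text{micro}} =
\frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} TP_i + \sum_{i=1}^{N} FN_i}
\]
```

The two expressions differ only in the last sum of the denominator.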
Now, if you look at the last two formulas closely, you will see that micro-average precision and micro-average recall will arrive at the same number.
The reason is every False Positive for one class is a False Negative for another class. For example, if you misclassify Class “A” as Class “B,” it will be a False Negative for Class “A” (a missed instance) but a False Positive for Class “B” (incorrectly assigned as Class “B”).
Thus, the total number of False Negatives and False Positives in the multi-class dataset will be the same. (It would work differently for multi-label!)
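This property can be checked on arbitrary single-label data. The sketch below uses randomly generated labels (hypothetical data, fixed seed) and a helper named `total_fp_fn` that is made up for this example. Each wrong prediction is counted once as a False Positive for the predicted class and once as a False Negative for the actual class.

```python
import random


def total_fp_fn(y_true, y_pred):
    """Sum false positives and false negatives over all classes."""
    classes = set(y_true) | set(y_pred)
    pairs = list(zip(y_true, y_pred))
    total_fp = sum(
        sum(1 for t, p in pairs if p == c and t != c) for c in classes
    )
    total_fn = sum(
        sum(1 for t, p in pairs if t == c and p != c) for c in classes
    )
    return total_fp, total_fn


# Hypothetical random labels: each object has one true and one predicted class.
rng = random.Random(0)
labels = ["A", "B", "C", "D"]
y_true = [rng.choice(labels) for _ in range(200)]
y_pred = [rng.choice(labels) for _ in range(200)]

total_fp, total_fn = total_fp_fn(y_true, y_pred)
errors = sum(t != p for t, p in zip(y_true, y_pred))
print(total_fp == total_fn == errors)  # prints True
```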
There is a fundamental difference between macro- and micro-averaging in how they aggregate performance metrics.
Macro-averaging calculates each class's performance metric (e.g., precision, recall) and then takes the arithmetic mean across all classes. So, the macro-average gives equal weight to each class, regardless of the number of instances.
Micro-averaging, on the other hand, aggregates the counts of true positives, false positives, and false negatives across all classes and then calculates the performance metric based on the total counts. So, the micro-average gives equal weight to each instance, regardless of the class label and the number of cases in the class.
To illustrate this difference, let’s return to our example. We have 45 instances and 4 classes. The number of instances in each class is as follows:
We already estimated the recall and precision by class, so it will be easy to compute macro-average precision and recall. We sum them up and divide them by the number of classes.
Using macro-averaging, the average precision and recall across all classes would be:
Each class equally contributes to the final quality metric.
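A minimal sketch of the macro-average calculation follows. The true positive and false negative counts per class are the ones used in this chapter. Only the total number of false positives (8) is stated in the text, so the per-class split of false positives is an assumption, and the macro precision depends on it. The macro recall does not.

```python
# TP and FN per class follow the chapter's example.
# The per-class FP split is an ASSUMPTION (only the total of 8 is stated).
counts = {
    "A": {"tp": 13, "fp": 5, "fn": 2},
    "B": {"tp": 1, "fp": 2, "fn": 4},
    "C": {"tp": 9, "fp": 1, "fn": 1},
    "D": {"tp": 14, "fp": 0, "fn": 1},
}


def macro_average(counts):
    """Compute precision and recall per class, then take the unweighted mean.

    Zero denominators are not handled here; every class in this example
    has at least one predicted and one actual instance.
    """
    precisions = [c["tp"] / (c["tp"] + c["fp"]) for c in counts.values()]
    recalls = [c["tp"] / (c["tp"] + c["fn"]) for c in counts.values()]
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n


macro_precision, macro_recall = macro_average(counts)
print(f"macro precision={macro_precision:.3f}, macro recall={macro_recall:.3f}")
```

With these counts, the macro recall is 0.725, and the macro precision is about 0.74 under the assumed split. Class "B" pulls both averages down even though it holds only 5 of the 45 objects.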
Now, let’s look at micro-averaging. In this case, you first need to calculate the total counts of true positives, false positives, and false negatives across all classes. Then, you compute precision and recall using the total counts.
We already claimed that precision and recall would be the same in this case. Let’s visually demonstrate it.
Let’s start at the same point and follow the formulas. Here are the model predictions:
We first need to calculate the True Positives across each class. Since we arranged the predictions by the actual class, it is easy to count them visually.
The model correctly classified 13 samples of Class “A,” 1 sample of Class “B,” 9 samples of Class “C,” and 14 samples of Class “D.” The total number of True Positives is 37.
To calculate the recall, we also need to look at the total False Negatives number. To count them visually, we need to look at the “missed instances” that belong to each class but were missed by the model.
Here is how they are split across classes: the model missed 2 instances of Class “A,” 4 instances of Class “B,” and 1 instance each of Class “C” and Class “D.” The total number of False Negatives is 8.
Now, what about False Positives? A false positive is an instance of incorrect classification. The model said it was “B,” but was wrong? This is a False Positive for Class “B.”
This is a bit more complex to grasp visually, since we need to look at the color-coded predicted labels, and the errors are spread across classes.
However, what’s important is that we look at the same erroneous predictions as before! Each class’s False Positive is another class’s False Negative.
They are distributed differently: for example, our model often erroneously assigned class “A” but never class “D.” But the total number of False Negatives and False Positives is the same: 8.
As a result, both micro-average precision and recall are the same: 0.82. Here is how to compute it:
What’s even more interesting, this number is the same as accuracy. What we just did was divide the number of correct predictions by the total number of the (right and wrong) predictions. This is the accuracy formula!
For multi-class classification, micro-average precision equals micro-average recall and equals accuracy.
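The equality can be reproduced with the counts from the example. The sketch below uses only numbers stated in this chapter; the variable names are illustrative.

```python
# Counts stated in the example.
tp = {"A": 13, "B": 1, "C": 9, "D": 14}  # correct predictions per class
fn = {"A": 2, "B": 4, "C": 1, "D": 1}    # missed instances per class
total_fp = 8                             # stated total of false positives
total_objects = 45                       # size of the validation dataset

total_tp = sum(tp.values())
total_fn = sum(fn.values())

micro_precision = total_tp / (total_tp + total_fp)
micro_recall = total_tp / (total_tp + total_fn)
accuracy = total_tp / total_objects

print(f"{micro_precision:.2f} {micro_recall:.2f} {accuracy:.2f}")
# prints 0.82 0.82 0.82
```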
To sum up, how did micro- and macro-averaging work out for our examples? The results were different:
Macro-averaging results in a “worse” outcome since it gives equal weight to each class. 1 out of 4 classes in our example has very low performance. This significantly impacts the score since it constitutes 25% of the final evaluation.
Micro-averaging leads to a “better” metric. It gives equal weight to each instance, and the number of objects in the worse-performing class is low. It only has 5 examples out of 45 total. In this case, their contribution to the overall score was lower.
A suitable metric depends on the specific problem and the importance of each class or instance.
Macro-averaging treats each class equally.
However, macro averaging can also distort the perception of performance.
If classes have unequal importance, measuring precision and recall by class or weighing them by importance might be helpful.
Micro-averaging can be more appropriate when you want to account for the total number of misclassifications in the dataset. It gives equal weight to each instance and will have a higher score when the overall number of errors is low. (If this sounds like accuracy, it is because it is!)
However, micro-averaging can also overemphasize the performance of the majority class, especially when it dominates the dataset. In this case, micro-averaging can lead to inflated performance scores when the classifier performs well on the majority class but poorly (or very poorly) on the minority classes. If the class is small, you might not notice!
As a result, there is no single best metric. To choose the most suitable one, you need to consider the number of classes, their balance, and their relative importance.
One more option: in some scenarios, it might be appropriate to use weighted averaging. This approach takes into account the balance of classes. You weight each class based on its representation in the dataset. Then, you compute precision and recall as a weighted average of the precision and recall in individual classes.
Simply put, it works like macro-averaging, but instead of giving every class the same weight (1/N), you weight each class's precision and recall by the proportion of the dataset that the class takes up.
This approach is useful if you have an imbalanced dataset but want to assign larger importance to classes with more examples.
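Weighted averaging can be sketched in the same way as macro-averaging, with class shares as weights. As before, the true positive and false negative counts follow the chapter's example, while the per-class split of false positives is an assumption made for illustration. The function name `weighted_average` is also made up for this sketch.

```python
# TP and FN per class follow the chapter's example.
# The per-class FP split is an ASSUMPTION (only the total of 8 is stated).
counts = {
    "A": {"tp": 13, "fp": 5, "fn": 2},
    "B": {"tp": 1, "fp": 2, "fn": 4},
    "C": {"tp": 9, "fp": 1, "fn": 1},
    "D": {"tp": 14, "fp": 0, "fn": 1},
}


def weighted_average(counts):
    """Average per-class precision and recall, weighting each class
    by its share of the dataset (its number of actual instances)."""
    total = sum(c["tp"] + c["fn"] for c in counts.values())
    weighted_precision = 0.0
    weighted_recall = 0.0
    for c in counts.values():
        support = c["tp"] + c["fn"]  # actual instances of the class
        weight = support / total
        weighted_precision += weight * c["tp"] / (c["tp"] + c["fp"])
        weighted_recall += weight * c["tp"] / support
    return weighted_precision, weighted_recall


weighted_precision, weighted_recall = weighted_average(counts)
print(f"weighted precision={weighted_precision:.3f}")
print(f"weighted recall={weighted_recall:.3f}")
```

In single-label multi-class data, the weighted recall works out to the same value as accuracy (37/45 here), because each class's weight cancels against the denominator of its recall. The weighted precision generally differs.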
You have different options when calculating quality metrics in multi-class classification.
To quickly calculate and visualize accuracy, precision, and recall for your machine learning models, you can use Evidently, an open-source Python library to evaluate, test and monitor ML models in production.
You will need to prepare a dataset that includes the predicted values for each class and the true labels and pass it to the tool. You will instantly get an interactive report that shows accuracy, precision, recall, ROC curve and other visualizations of the model’s quality. You can also integrate these model quality checks into your production pipelines.
Evidently allows calculating various additional Reports and Test Suites for model and data quality. Check out Evidently on GitHub and go through the Getting Started Tutorial.
Try our open-source library with over 20 million downloads, or sign up to Evidently Cloud to run no-code checks and bring all the team to a single workspace to collaborate on AI quality.