A new report is released! You can now use the Evidently open-source library to analyze the performance of production ML models and explore their weak spots. Regression models are covered first.
What is it?
The Data Drift and Target Drift reports we released earlier explore the change of the model features and target. Both come in handy when you do not have an immediate feedback loop.
Still, sooner or later, we know the ground truth or actual labels. We can then analyze the model performance directly and compare it to what we have seen in training.
To quickly get a detailed view of the actual model quality, you can use our new Regression Performance report. It will work for any use case where you predict a continuous variable, such as demand or price prediction. There is similar functionality for classification models.
This new report helps you answer the following questions:
- How well does my model perform in production?
- Which errors does the model make?
- Are there specific data segments where the model performs differently?
- Did anything change compared to training?
The report provides a performance summary and a more in-depth look into the model errors. The goal is to go beyond averages and help you identify data slices that require attention.
You are reading a blog about an early Evidently release. This functionality has since been improved and simplified, and additional Report types are available. You can browse available reports and check out the current documentation for details.
How does it work?
Once again, you need to prepare two Pandas DataFrames. The first one is your Reference dataset. Usually, this is the training or test dataset that you initially used to evaluate the model performance. It should include the input features, the predictions, and the actual values (ground truth).
The second is your Current (production) dataset. Similarly, it should also include ground truth data. While your model input and predictions are usually logged at the moment of serving, the actual values might be known only later and stored by some external system.
The data capture setup can vary depending on the use case and infrastructure, and we expect you to merge the data on your own as needed. Evidently will work with a complete input dataset.
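As an illustration of that merge step, here is a minimal sketch with pandas. The column names (`request_id`, `temperature`, `prediction`, `target`) and the values are hypothetical; your logs will have their own keys and schema.

```python
import pandas as pd

# Hypothetical serving log: features and predictions captured at request time.
served = pd.DataFrame({
    "request_id": [1, 2, 3, 4],
    "temperature": [12.0, 18.5, 25.0, 31.0],
    "prediction": [110.0, 150.0, 240.0, 300.0],
})

# Hypothetical ground truth that arrives later from another system.
actuals = pd.DataFrame({
    "request_id": [1, 2, 3],
    "target": [100.0, 160.0, 210.0],
})

# The inner join keeps only the rows where the actual value is already known.
# validate="one_to_one" raises if a request id is duplicated on either side.
current = served.merge(actuals, on="request_id", how="inner", validate="one_to_one")
```

Request 4 has no actual value yet, so it is left out of `current` until the ground truth arrives.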
The tool also works with a single dataset. In this case, we do not perform a comparison between Reference and Current performance. Instead, you can analyze a single model.
Once you have imported one or two datasets, create the Report object in a Jupyter notebook and include the RegressionPreset.
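A rough sketch of what this can look like in code follows. Only `Report` and `RegressionPreset` are named in this post. The import paths, the `ColumnMapping` helper, and the `run` arguments are assumptions that match some 0.x releases of the library, so please check the documentation for the version you have installed. The `build_regression_report` wrapper is a hypothetical helper of ours.

```python
import pandas as pd


def build_regression_report(reference: pd.DataFrame, current: pd.DataFrame,
                            target: str = "target", prediction: str = "prediction"):
    """Run the regression preset on two DataFrames and return the report object."""
    # Imported inside the function so the sketch can be loaded even where
    # Evidently is not installed. These import paths are an assumption and
    # differ between library versions.
    from evidently import ColumnMapping
    from evidently.metric_preset import RegressionPreset
    from evidently.report import Report

    # Tell the library which columns hold the actual and predicted values.
    mapping = ColumnMapping(target=target, prediction=prediction)
    report = Report(metrics=[RegressionPreset()])
    report.run(reference_data=reference, current_data=current, column_mapping=mapping)
    return report
```

In a notebook, calling `show()` on the returned object is expected to render the report inline in the versions this sketch assumes.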
How to work with the performance report
First, the report shows a few standard model quality metrics relevant for regression models: Mean Error (ME), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE).
Metrics are shown both for Reference and Current datasets. This way, you can quickly understand the aggregate performance, and see if anything has changed between the two.
In this example built using the Bike Demand Prediction dataset, we can quickly see that the model performance went downhill:
For each quality metric, we also show one standard deviation of its value. This helps you understand how stable the metric (and thus the model performance) is.
For example, if the variance is high, you should be careful in interpreting the change in absolute value between Reference and Current as "improvement" or "deterioration" in model quality. It might also indicate that your evaluation sample is too small.
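To make the three metrics and their spread concrete, here is a small NumPy sketch. It is our own illustration, not the library's code. We assume the error is defined as prediction minus actual, and that the "standard deviation" of a metric is the per-row standard deviation of the values being averaged. The numbers are made up.

```python
import numpy as np


def regression_quality(actual, predicted):
    """Return {metric: (mean, std)} for ME, MAE and MAPE."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = predicted - actual          # positive means overestimation
    abs_error = np.abs(error)
    # MAPE is undefined where the actual value is zero, so skip those rows.
    nonzero = actual != 0
    pct_error = 100.0 * abs_error[nonzero] / np.abs(actual[nonzero])
    per_row = {"ME": error, "MAE": abs_error, "MAPE": pct_error}
    return {name: (float(v.mean()), float(v.std())) for name, v in per_row.items()}


# Made-up numbers: the "current" predictions are consistently too high.
reference_quality = regression_quality([100, 200, 300, 400], [110, 190, 310, 390])
current_quality = regression_quality([100, 200, 300, 400], [150, 260, 380, 470])
```

Comparing `reference_quality["MAE"]` with `current_quality["MAE"]` gives both the change in the average and a sense of how noisy each estimate is.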
You can change or add custom quality metrics. Our goal is to give you a helpful report that works out of the box to cover a variety of applications. However, if you want to add your own metric, you can implement it as well.
In some cases, the simple aggregate metrics are just enough. In others, you might notice some change and need to look for a root cause. Or you might want to explore ways to improve the model by understanding the details of its performance.
For this, the report provides some extra information.
First, we visually plot the predictions and actual values and show how they change over time. You can zoom in on specific periods to explore.
In this example, we can see what we already know: the model now makes many mistakes.
The report also plots the model error so that you can visually explore its change. To dig deeper, you can look at the distribution and Q-Q plot:
There are several insights you can grasp from analyzing these plots:
- Change in the size of the model error. We can understand if the model now performs better or worse by looking at how wrong the model is on average. You can prefer absolute or percentage error depending on the use case.
- Change in the error distribution. Do we tend to have many evenly "small" errors or a few "large" ones? Did this change between our training and production datasets? For example, if you notice a shift from being a little bit wrong all the time to being dramatically wrong in certain instances, it makes sense to investigate further.
- Error normality. Sometimes, you might expect the error to be distributed normally and want to track if it remains so. It is also important if you plan to run further statistical tests since some tests are only applicable to normal distributions.
- Presence of the unused signal. In the model training phase, we often use error analysis to find ways to improve the model. Ideally, the errors should look like random noise. If we see a non-random pattern, we might as well try to learn it and reduce the error! Similarly, if we notice a new pattern in the production model errors, this hints at an underlying shift. Retraining or rebuilding a model might help improve quality.
- Error bias. Not all errors are created equal. Ask your subject matter expert! For example, underestimating some value can be more harmful than overestimating, or vice versa. Looking only at absolute errors can disguise it. It is a good practice to analyze the error skew and how it changes over time, especially if some business logic is applied on top of the model prediction.
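Several of the checks above can be approximated numerically. The sketch below is our own illustration with synthetic data, and `error_diagnostics` is a hypothetical helper. It computes the share of overestimated rows, the skewness of the error, and the points of a normal Q-Q plot. It assumes the error is prediction minus actual and that the errors are not all identical.

```python
from statistics import NormalDist

import numpy as np


def error_diagnostics(actual, predicted):
    """Summarise error bias, skew and closeness to a normal distribution."""
    error = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    n = error.size
    centered = error - error.mean()
    std = error.std()  # assumed non-zero
    # Skewness above zero means a longer tail of overestimation.
    skewness = float((centered ** 3).mean() / std ** 3)
    # Q-Q points: standard normal quantiles against sorted standardised errors.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    observed = np.sort(centered / std)
    return {
        "share_overestimated": float((error > 0).mean()),
        "skewness": skewness,
        # Close to 1.0 when the Q-Q points fall on a straight line.
        "qq_correlation": float(np.corrcoef(theoretical, observed)[0, 1]),
        "qq_points": list(zip(theoretical.tolist(), observed.tolist())),
    }


# Synthetic example: unbiased, normally distributed errors.
rng = np.random.default_rng(0)
actual = rng.normal(100.0, 10.0, size=500)
predicted = actual + rng.normal(0.0, 5.0, size=500)
diagnostics = error_diagnostics(actual, predicted)
```

Running the same function on Reference and Current data and comparing the outputs is one way to track whether the error shape has changed.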
For many of these insights, the next step is to look for specific slices in the data. You want to locate the segments where the model error is too large or too different to explore and fix it.
To aid with this granular analysis, Evidently generates an Error Bias table.
How to interpret the Error Bias table
The table helps identify the areas of low or biased performance that can be described through specific features.
Here is how it works.
We look at each individual prediction and estimate the error. We single out two types of error:
- Overestimation. The model predicts the values that are higher than actual.
- Underestimation. The model predicts the values that are lower than actual.
Then, we identify the top-5% of predictions where the model overestimates and the top-5% where the model underestimates.
We treat the remaining 90% of observations as the majority where the error is close to average. Our goal is to highlight and analyze the extremes instead.
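The split into three groups can be sketched with quantiles of the signed error. This is a simplified illustration under our own assumptions (error as prediction minus actual, hypothetical column names and helper name), not the library's implementation.

```python
import numpy as np
import pandas as pd


def label_error_groups(df, target="target", prediction="prediction", tail=0.05):
    """Add the signed error and an OVER / UNDER / MAJORITY label to each row."""
    error = (df[prediction] - df[target]).to_numpy()
    low, high = np.quantile(error, [tail, 1 - tail])
    group = np.select(
        [error >= high, error <= low],   # largest and smallest signed errors
        ["OVER", "UNDER"],
        default="MAJORITY",
    )
    return df.assign(error=error, error_group=group)


# Made-up data: the actual value is constant, the errors run from -50 to 49.
example = pd.DataFrame({"target": 100.0, "prediction": 100.0 + np.arange(-50, 50)})
labeled = label_error_groups(example)
group_sizes = labeled["error_group"].value_counts()
```

Comparing the mean error per group (for instance with `labeled.groupby("error_group")["error"].mean()`) then shows how far the extremes sit from the majority.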
If there is no skew, the difference in the error size between the three groups will be marginal. If that is the case, there is little interest in exploring further: we have a great model!
In practice, we might observe some "fat tails" in error distribution and possibly a skew towards a given error type.
In our example, we make errors of all types. There is a slight bias towards overestimation, which we can notice by looking at the error in the Majority group. What is more worrisome is that the quality of the model went all the way down, both at the mean and at the extremes.
Where does the model fail?
When the model fails like ours, we want to understand something about the objects inside the 5% groups at both extremes.
Is there a relationship between the specific feature values and under- or overestimation? Are there particular segments (feature regions) in the data, where the error is consistently large and/or biased?
An example from the real estate price prediction problem for an online rental portal: we might consistently predict a lower-than-actual price for rental properties in certain areas, of a given apartment type and/or containing certain words in a description. When looking at the Error Bias table, we want to be able to find such groups.
There are some complex and computationally intensive ways to perform a similar search. However, as a reasonable proxy, we can make an analysis feature by feature.
That is precisely what the Error Bias table does. To describe "where the model fails", we look at the feature statistics in each of the groups we divided our observations into:
- In the top-5% of instances where the model overestimates (OVER)
- In the 90% of predictions in between (MAJORITY)
- In the top-5% of instances where the model underestimates (UNDER)
For each numerical feature, we look at the mean. For categorical features, we check the most common value.
To identify the segments of interest, one should look for the difference in feature values between the MAJORITY group and OVER or UNDER groups. When the difference is large, it means that the model performs differently if values of this feature are in a certain range.
For quicker search, you can sort the table using the column "Range(%)". When the percentage is high, this means that either or both of the "extreme" groups look different.
If we look closer at individual features with a large observed difference, we might find the specific segments where the model performs worse and decide what to do with them.
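Here is a simplified sketch of such a table in pandas. The per-group mean and most common value follow the description above. The "Range(%)" formula (spread of the group means as a share of the feature's full range) is our assumption for illustration and may differ from what the library computes. The helper name and all data are hypothetical.

```python
import numpy as np
import pandas as pd


def error_bias_table(df, features, target="target", prediction="prediction", tail=0.05):
    """Per-feature statistics in the OVER, UNDER and MAJORITY error groups."""
    error = (df[prediction] - df[target]).to_numpy()
    low, high = np.quantile(error, [tail, 1 - tail])
    group = pd.Series(
        np.select([error >= high, error <= low], ["OVER", "UNDER"], default="MAJORITY"),
        index=df.index,
    )
    rows = {}
    for name in features:
        column = df[name]
        if pd.api.types.is_numeric_dtype(column):
            stats = column.groupby(group).mean()
            span = column.max() - column.min()
            # Assumed spread measure, not necessarily the library's formula.
            spread = 100.0 * (stats.max() - stats.min()) / span if span else 0.0
        else:
            # Most common value per group for categorical features.
            stats = column.groupby(group).agg(lambda s: s.mode().iloc[0])
            spread = np.nan
        rows[name] = {**stats.to_dict(), "Range(%)": spread}
    table = pd.DataFrame.from_dict(rows, orient="index")
    return table.sort_values("Range(%)", ascending=False)


# Made-up data where the error grows with humidity.
i = np.arange(100)
data = pd.DataFrame({
    "humidity": i,
    "windspeed": i % 2,
    "season": np.where(i >= 50, "summer", "winter"),
    "target": 100.0,
    "prediction": 100.0 + (i - 50),
})
table = error_bias_table(data, ["humidity", "windspeed", "season"])
```

In this toy example, "humidity" ends up at the top of the sorted table because its group means differ the most relative to its range.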
For example, here we can see that for the feature "humidity" the Range was 3.67% in training, but now it is as high as 12.53%. Let's expand the corresponding row of the table!
We can then make a more specific observation. The model now significantly overestimates when the humidity values are between 60 and 80, even though we saw these values in training. Previously, the error was unbiased and similar on the whole range.
If we look further, we can understand that something happens with the weather. The model also overestimates the demand when the temperature is above 30°C.
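A similar per-range view can be reproduced by binning a feature and averaging the signed error in each bin. This is a sketch with made-up numbers, not the bike demand data. The bin edges, column names and helper name are hypothetical.

```python
import pandas as pd


def mean_error_by_bin(df, feature, bins, target="target", prediction="prediction"):
    """Mean signed error (prediction minus actual) and row count per feature bin."""
    error = df[prediction] - df[target]
    binned = pd.cut(df[feature], bins=bins)
    return error.groupby(binned, observed=False).agg(["mean", "count"])


# Made-up rows: the model overestimates when humidity is between 60 and 80.
sample = pd.DataFrame({
    "humidity": [30, 50, 65, 70, 75, 90],
    "target": [100.0, 100.0, 100.0, 100.0, 100.0, 100.0],
    "prediction": [100.0, 102.0, 140.0, 150.0, 145.0, 98.0],
})
by_humidity = mean_error_by_bin(sample, "humidity", bins=[0, 60, 80, 100])
```

A bin whose mean error is far from zero while its neighbours stay close to zero points to a segment worth a closer look.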
Altogether, this sums up to the explanation. We trained the model using the data from the cold months of the year. Now, we apply it in early summer. We observe new weather conditions and new behavior of the bike renters.
In this scenario, retraining the model is the easiest thing to do. If not enough data is collected yet, we might consider adding some corrective coefficients manually to compensate for the model flaws.
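One simple way to derive such a coefficient is to rescale predictions inside the affected segment so that their total matches the observed total. This is a hypothetical sketch with made-up numbers. It assumes a multiplicative correction is appropriate for your use case, which is a judgment call.

```python
import numpy as np
import pandas as pd


def segment_correction(df, mask, target="target", prediction="prediction"):
    """Ratio of actual to predicted totals inside the segment selected by mask."""
    segment = df[mask]
    return float(segment[target].sum() / segment[prediction].sum())


# Made-up recent data: the model overestimates on hot days.
recent = pd.DataFrame({
    "temperature": [20, 25, 32, 35],
    "target": [100.0, 120.0, 150.0, 170.0],
    "prediction": [100.0, 120.0, 200.0, 200.0],
})
hot = recent["temperature"] > 30
coefficient = segment_correction(recent, hot)

# Apply the correction only inside the segment and leave other rows untouched.
adjusted = recent.assign(
    prediction=np.where(hot, recent["prediction"] * coefficient, recent["prediction"])
)
```

A coefficient estimated on a handful of rows is fragile, so it makes sense to re-check it as more ground truth arrives.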
When should I use this report?
Here are our suggestions on when to use it. You can also combine it with the Data Drift and Target Drift reports to get a comprehensive picture.
1. To analyze the results of the model test.
You can use this report to present and explore the trial model's results, be it an offline or online test. You can then contrast it to the performance in training to interpret the results.
Though this is not the primary use case, you can also generate the report when comparing the model performance in an A/B test, or during a shadow model deployment.
2. Whenever you want (or can) check on your production model
You can run this report as a regular job to analyze and report on your production model performance.
You might want to put it on a schedule, e.g. to generate the report every week. In other cases, you might be limited by how and when you receive the ground truth. If it comes in batches, e.g. once the sales data is updated, you can run the report every time this happens.
The goal is to detect model decay and identify specific performance issues.
By manipulating the input data frame, you can also perform analysis on slices of data. It might be helpful to zoom in on the model performance on some segment of interest. For example, you can choose only the predictions for users from a specific region, time period, or both.
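Selecting such a slice is a plain DataFrame filter. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical production data with a region and a timestamp per prediction.
current = pd.DataFrame({
    "date": pd.to_datetime(["2021-05-01", "2021-05-20", "2021-06-03", "2021-06-10"]),
    "region": ["north", "south", "north", "north"],
    "target": [120.0, 80.0, 150.0, 170.0],
    "prediction": [118.0, 85.0, 190.0, 210.0],
})

# Combine a region filter with a time filter.
in_region = current["region"] == "north"
in_period = current["date"] >= pd.Timestamp("2021-06-01")
segment = current[in_region & in_period]
```

The resulting `segment` frame can then be used as the Current dataset in place of the full one.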
3. To trigger or decide on the model retraining
If you want to be precise with the model update process, you can identify a specific performance threshold to decide on the model retraining. This report can help you to check where you stand and if it is time to refresh the model.
If the model performance goes down, you can use this report together with the Data Drift one to identify the likely reason and decide how to act. Blind retraining cannot always solve the issues, especially if data quality is to blame.
At the same time, if the model performance is stable, you can also skip the update and save resources on the unnecessary retraining.
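A threshold rule of this kind can be as simple as the sketch below. The function name and the 20% tolerance are hypothetical, and the right metric and threshold depend on your use case.

```python
def should_retrain(reference_mae, current_mae, max_relative_increase=0.2):
    """Return True when MAE grew by more than the allowed share of the reference."""
    if reference_mae <= 0:
        raise ValueError("reference_mae must be positive")
    relative_increase = (current_mae - reference_mae) / reference_mae
    return relative_increase > max_relative_increase


# A 30% increase crosses the hypothetical 20% tolerance, a 10% increase does not.
retrain_needed = should_retrain(reference_mae=10.0, current_mae=13.0)
retrain_skipped = should_retrain(reference_mae=10.0, current_mae=11.0)
```

When the metric has a high variance, it may be worth requiring the threshold to be crossed on several consecutive runs before acting.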
4. To debug or improve model performance by identifying areas of high error
You can use the Error Bias table to identify the groups that contribute way more to the total error, or where the model under- or over-estimates.
There are different approaches to how you can address it to improve the model performance:
- Provide or label more data for the low-performing segments;
- Rebuild the model to account for these segments, with techniques like oversampling or reweighting to penalize over- or under-estimation for the identified group;
- Add business rules or post-processing for the identified segments;
- Pause the model, use a fallback approach or manual review for specific data points.
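As an example of the reweighting option, here is a minimal sketch that builds per-row training weights for an identified segment. The helper name and the weight of 3.0 are hypothetical. Many training APIs accept such per-row weights, but check how your framework expects them.

```python
import numpy as np


def segment_sample_weights(in_segment, segment_weight=3.0):
    """Give rows inside the low-performing segment a larger training weight."""
    in_segment = np.asarray(in_segment, dtype=bool)
    return np.where(in_segment, segment_weight, 1.0)


# Rows 2 and 3 belong to the hypothetical low-performing segment.
weights = segment_sample_weights([False, True, True, False])
```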
How can I try it?
Go to GitHub, and explore the tool in action using sample notebooks.
For the most recent update on the functionality, explore the docs and examples folder on GitHub.
And if you like the tool, spread the word!
Want to stay in the loop?
Sign up to the User newsletter to get updates on new features, integrations and code tutorials. No spam, just good old release notes.
For any questions, contact us at email@example.com. This is an early release, so let us know of any bugs! You can also open an issue on GitHub.