In the previous blog, we explored what monitoring an ML system means. It includes tracking software system performance, data inputs, ML model quality, and, ultimately, the business KPI. There is an overwhelming set of potential metrics to keep tabs on. How do you prioritize those?
In this blog, we'll try to introduce a pragmatic hierarchy.
Of course, the monitoring approach will vary based on the use case importance, the cost of every mistake, and particular business priorities. But if you take an "average" model, below is a blueprint you can use as a starting point to design ML-focused monitoring.
1. Logging
To state the obvious first: you need to log things.
If you have a prediction service, you need software system logs. These are timestamped events of what happened in your ML-powered application. Multiple mature software tools help push logs to a central system and analyze them. This logging is not ML-specific but rather depends on the serving setup.
In the context of an ML system, you need model prediction logs. You should record data on every prediction: the features that went in and the model output.
As a general rule, you should log every prediction. Some exceptions exist, such as edge models on a user's device or when privacy regulations prevent input logging. If you are in this situation, you should plan for some workarounds to collect and label at least some of the data.
Otherwise, you need to write every input and prediction to a database. Even if you do not have automated model monitoring, you can use this for manual performance analysis. You also need these logs to retrain the model later or debug whatever went wrong.
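As a minimal sketch, assuming a Python service and a local SQLite file as the store (the table and column names here are made up for illustration), a prediction log could look like this:

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical prediction log: one row per model call,
# storing the input features as JSON plus the model output.
conn = sqlite3.connect(":memory:")  # use a real file or DB server in practice
conn.execute(
    """CREATE TABLE IF NOT EXISTS prediction_log (
        ts TEXT, model_version TEXT, features TEXT, prediction REAL
    )"""
)

def log_prediction(features: dict, prediction: float, model_version: str = "v1") -> None:
    conn.execute(
        "INSERT INTO prediction_log VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), model_version,
         json.dumps(features), prediction),
    )
    conn.commit()

log_prediction({"age": 34, "plan": "pro"}, 0.87)
rows = conn.execute("SELECT features, prediction FROM prediction_log").fetchall()
```

Even this simple table is enough to replay past inputs for retraining, debugging, or ad hoc performance analysis later.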
Step 1. Log your model inputs and predictions to a database.
2. Software health
The first level of monitoring is very straightforward. You want to know if the system works.
You have a backend ML service, and it needs observability. A no-brainer place to start is to monitor your infrastructure and software system.
You'd need to instrument the service to collect performance and resource utilization data. Depending on the setup, you might choose to track things like latency, RPS, or the rate of batch job completion. Ask your DevOps team for details!
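For illustration, here is a minimal in-process sketch of that kind of instrumentation in plain Python. In practice you would use your existing observability stack (e.g., a Prometheus-style client); the class and function names here are invented.

```python
import time
from statistics import median

class ServiceMetrics:
    """Toy in-process collector for request count, errors, and latency."""
    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.latencies_ms = []

    def observe(self, func, *args, **kwargs):
        start = time.perf_counter()
        self.requests += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            self.errors += 1
            raise
        finally:
            # Record latency for every call, successful or not.
            self.latencies_ms.append((time.perf_counter() - start) * 1000)

metrics = ServiceMetrics()

def predict(x):  # stand-in for the real model endpoint
    return x * 2

result = metrics.observe(predict, 21)
p50_latency_ms = median(metrics.latencies_ms)
```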
Step 2. Add basic software checks to know that your service is working. Adjust the setup based on your infrastructure and service architecture.
3. ML model calls
How often is your model making predictions? You can add this sanity check to your baseline system dashboard. It will measure the number of model calls and responses explicitly.
Monitoring model calls will help track the demand for the service. You will understand if you need to scale it or if there are any adoption issues (is anyone even using it?).
Monitoring model responses will help detect situations when the software application works but the ML model is not applied. For example, when the model does not respond in time and you use a fallback instead.
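A sketch of the idea: count calls, model responses, and fallbacks separately, so you can spot when the application runs but the model is silently bypassed. The timeout flag and fallback value below are made up for illustration.

```python
counters = {"calls": 0, "model_responses": 0, "fallbacks": 0}

FALLBACK_PREDICTION = 0.0  # hypothetical safe default

def predict_with_fallback(model, features, timed_out: bool):
    """Call the model; fall back to a default if it did not answer in time."""
    counters["calls"] += 1
    if timed_out:
        counters["fallbacks"] += 1
        return FALLBACK_PREDICTION
    counters["model_responses"] += 1
    return model(features)

def model(features):  # stand-in for the real model
    return sum(features.values())

predict_with_fallback(model, {"a": 1, "b": 2}, timed_out=False)
predict_with_fallback(model, {"a": 1, "b": 2}, timed_out=True)
fallback_rate = counters["fallbacks"] / counters["calls"]
```

A rising fallback rate with a healthy service is exactly the situation these explicit counters are meant to surface.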
Step 3. Explicitly track the number of model calls and predictions made.
4. ML model outputs
Once you know that the ML service is functioning, it is time to determine how well it works. Here you want to understand two things. First, how well is the model performing? Second, did anything break or require attention?
There are different ways to judge the model output. Depending on what is available, you can track the model quality, the business KPI, or the best proxy, such as prediction drift.
ML model quality
You can directly evaluate the model quality if you get actual values or labeled data fast. There is no need for anything fancy. You predicted something—how many times were you right? How far off were your predictions?
This monitoring approach is very minimalistic, but it might be just enough. If you have broken pipelines, missing features, or outliers, you will almost certainly see the model quality drop. If you calculate the quality fast, you can know and react immediately.
It is still essential to make sure that you track the right metrics. For example, overall accuracy in an imbalanced binary classification task might disguise low model quality on the minority class. For regression, you can choose something easily interpretable, for instance, mean absolute error instead of RMSE.
If the quality goes down, you can send an alert. To set the performance threshold, you can consider the past model quality, how much you expect it to fluctuate, and the cost of the quality drop (does a 1% quality decline matter?).
Even if you do not have data on past performance, you can still compare the current quality against some reasonable baseline. For example, if you have a binary classification model, you can compare it against a dummy model that predicts the majority class. This way, you'll catch if the model is entirely unreasonable. (This is one of the heuristics we implemented in the Evidently tests, for example).
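A sketch of that baseline check: compare the live model's accuracy against a dummy model that always predicts the majority class. The data below is toy, for illustration only.

```python
from collections import Counter

y_true = [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]   # labeled outcomes
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]   # model predictions

def accuracy(truth, pred):
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

# Dummy baseline: always predict the most frequent class in the labels.
majority_class = Counter(y_true).most_common(1)[0][0]
dummy_pred = [majority_class] * len(y_true)

model_acc = accuracy(y_true, y_pred)   # 0.9
dummy_acc = accuracy(y_true, dummy_pred)  # 0.8
model_beats_dummy = model_acc > dummy_acc
```

If `model_beats_dummy` ever turns false, the model is doing no better than blind guessing, which is a clear alert condition even without historical performance data.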
Step 4.1. If you get the labels or actuals, monitor the ML model quality. Be sure to choose a suitable metric. Use your training data or earlier model performance to set the expectations.
Pro tip: If you detect a model quality drop, you might want to generate plots to explore what went wrong. For example, for the regression model, you might want to look at the error distribution to understand if your model overestimates or underestimates the target.
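A quick sketch of that diagnostic for regression: look at the signed errors to see whether the model systematically over- or underestimates. The numbers are toy values.

```python
from statistics import mean

actuals     = [100, 120, 90, 110, 130]
predictions = [ 95, 115, 85, 100, 120]

errors = [p - a for p, a in zip(predictions, actuals)]  # signed error
mean_error = mean(errors)       # bias: negative means the model underestimates
mean_abs_error = mean(abs(e) for e in errors)  # overall error magnitude
underestimates = mean_error < 0
```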
Business KPI
Evaluating the business impact of an ML model can be tricky. But whenever you can, you should track the metrics that the ML product truly drives. For example, you can monitor the click-through rate of a recommendation block or conversions. This provides the crucial feedback loop on the true model performance.
It is also useful to come up with some interpretable metrics to help business stakeholders understand how well the model is doing. Here, the main goal is, truly, communication.
For example, in a demand forecasting use case, the business stakeholder might want to make sure that the absolute error remains under a certain threshold. You can then publish a dashboard that shows the share of predictions that remain within these boundaries. The manager can keep an eye on it and see that the model behaves as expected and does not cause too much trouble.
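For the demand forecasting example, such an interpretable metric is easy to compute: the share of predictions whose absolute error stays under the agreed threshold. The numbers and the threshold below are illustrative.

```python
actual   = [100, 80, 120, 60, 90]
forecast = [104, 70, 118, 59, 75]
THRESHOLD = 10  # hypothetical agreed maximum absolute error, in units

within = [abs(a - f) <= THRESHOLD for a, f in zip(actual, forecast)]
share_within_threshold = sum(within) / len(within)  # the dashboard number
```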
Step 4.2. If you can, track the business KPI that the model is driving. If you cannot measure it directly, consider a proxy metric to help communicate the ML business impact and risks.
If you deal with a fairly uncomplicated use case, know the true model performance, and can measure business impact, you can stop the monitoring here. You can still run additional tests or generate dashboards on demand, but as long as you know the key metrics, you won't need other alerts. As simple as that.
Prediction drift
However, models operate in different scenarios. Sometimes you get the actual values fast: you predict delivery times and know the truth an hour later. But more often than not, you have to wait longer.
In such cases, you can calculate neither accuracy nor error rate. But you can still do something: analyze the model output as the best proxy.
First, you can come up with some expectations for the model predictions. These can be "obvious" sanity checks. For example, demand prediction should be non-negative and stay within a specific range.
You can codify these expectations and run tests whenever you get new predictions. Such checks are not very informative but will help trigger an investigation on time. If the tests fail, something is wrong: a low-quality input, an outdated model, or maybe a pipeline bug.
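Such sanity checks can be codified as simple assertions over each new batch of predictions. The bounds below are hypothetical.

```python
def check_predictions(preds, low=0.0, high=500.0):
    """Run 'obvious' sanity checks on a batch of demand forecasts.
    Returns a dict of check name -> passed, so a failure can trigger an alert."""
    return {
        "non_negative": all(p >= low for p in preds),
        "within_range": all(p <= high for p in preds),
        "not_constant": len(set(preds)) > 1,  # catches a "stuck" model
    }

good = check_predictions([12.0, 40.5, 230.0])
bad = check_predictions([-3.0, -3.0, -3.0])  # negative and constant
```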
Second, you can compare the distribution of the model predictions. This will tell you if the model outputs are similar to before.
Imagine you have a marketing personalization model that suddenly starts recommending the same item to half the client base. You'd want to detect it before you send a promo newsletter.
Here you need a baseline: the training data or predictions for some relevant past period, be it a day or a month. You can run a statistical hypothesis test or measure the distance between old and new distributions.
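A sketch of such a distribution comparison: compute the two-sample Kolmogorov-Smirnov statistic (the maximum gap between empirical CDFs) and flag drift when it exceeds a chosen threshold. In practice you would likely use scipy.stats.ks_2samp and a proper p-value; the pure-Python version and the threshold here are illustrative assumptions.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max distance between empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))

    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

reference = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5]   # predictions from a past period
current   = [0.7, 0.8, 0.8, 0.9, 0.9, 1.0]   # new predictions

stat = ks_statistic(reference, current)
prediction_drift = stat > 0.3  # hypothetical alerting threshold
```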
Note that prediction drift does not always mean the model is unreasonable. The changes might be seasonal, or the model might adequately react to shifts in the user base, etc. Because of this, you'd probably need to evaluate the model inputs to get the complete picture. However, if the outputs drift, it is almost always the signal to dig deeper. Thus, it has priority over input drift in our monitoring hierarchy.
Bottom line: if you do not know the true quality, prediction drift is often the second-best thing to look at.
Step 4.3. If you cannot measure the actual model performance, evaluate the model outputs as the best proxy. You can always formulate some sanity checks on what is "normal" for the model. If you have a reference dataset, monitor for prediction drift.
5. Data quality
The goal of output monitoring is to notice when something breaks. But if the model quality drops or predictions look weird, the next question is: what has changed?
A huge portion of model bugs happens due to data quality issues, so you might as well monitor for them proactively. This includes tracking data integrity, adherence to the data schema, detection of outliers, etc. In essence, you want to judge whether the model inputs are correct and complete.
Checks can be more or less thorough, depending on your resources and the model you deal with. If you have many features, you might want to prioritize them by importance and look at the key drivers.
Missing data. You can start by monitoring missing data for the whole dataset or specific features.
Next, schema compliance. Are all the features there? Do their types match? Are there new columns?
Feature ranges and stats. The more you know about the data stats, the better you understand what the model got as an input. Manually defining expected ranges for each feature can be cumbersome but is often helpful.
You can also apply some rules of thumb, e.g., expect +/-10% variance from the training data. Of course, ranges can be strict for some features and loose for others. You can also track specific descriptive statistics of a feature, such as the mean, max, or quantiles.
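These checks are easy to sketch for a single batch of rows. The schema and ranges below are invented for illustration (the ranges follow the +/-10% rule of thumb around hypothetical training bounds).

```python
import math

EXPECTED_SCHEMA = {"age": float, "income": float}
# Hypothetical expected ranges: training min/max widened by ~10%
EXPECTED_RANGES = {"age": (16.2, 99.0), "income": (0.0, 220_000.0)}

def check_batch(rows):
    """Return a list of (row_index, column, issue) for basic data quality checks."""
    issues = []
    for i, row in enumerate(rows):
        for col, typ in EXPECTED_SCHEMA.items():
            if col not in row or row[col] is None or (
                isinstance(row[col], float) and math.isnan(row[col])
            ):
                issues.append((i, col, "missing"))
            elif not isinstance(row[col], typ):
                issues.append((i, col, "wrong type"))
            else:
                low, high = EXPECTED_RANGES[col]
                if not low <= row[col] <= high:
                    issues.append((i, col, "out of range"))
    return issues

batch = [
    {"age": 34.0, "income": 52_000.0},   # clean row
    {"age": 150.0, "income": 48_000.0},  # age out of range
    {"age": 29.0, "income": None},       # missing value
]
issues = check_batch(batch)
```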
We've implemented ready-made presets that combine relevant checks together in Evidently, an open-source toolset for ML monitoring. You can explore the tool on GitHub or browse through the tests library in the documentation.
Correlations. Some data issues are less straightforward. Features can stay within range but behave abnormally. Tracking correlations is an excellent proxy to detect such situations.
Yes, correlation captures only linear relationships. But if the feature behavior changes, you will likely catch it by looking at linear dependencies. You can also track correlations between each feature and the target.
The idea is to compare the correlations in your baseline data to the current ones and react to major changes, e.g., if a strong positive correlation suddenly becomes negative.
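A sketch of that correlation check: compute the Pearson correlation of a feature with the target on the baseline and on the current data, then alert on a sign flip or a large shift. The data and the 0.5 shift threshold are arbitrary illustrations.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Baseline: feature and target strongly positively correlated
baseline_feature, baseline_target = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
# Current: the relationship has inverted
current_feature, current_target = [1, 2, 3, 4, 5], [10, 8, 6, 4, 2]

base_corr = pearson(baseline_feature, baseline_target)
curr_corr = pearson(current_feature, current_target)
sign_flip = base_corr * curr_corr < 0
large_shift = abs(base_corr - curr_corr) > 0.5  # hypothetical threshold
```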
Step 5. Evaluate input data quality and integrity. Depending on the use case, you can check for missing data, data schema compliance, range violations, feature stats, and correlation changes. You can limit these checks to the key drivers.
6. Data drift
Now, data drift, aka a change in the input distributions. This is an additional dimension to data quality and integrity. With quality, you care about the correctness of the input. With data drift, you care about detecting meaningful changes in otherwise correct inputs.
You don't always have to monitor for drift. But it is handy in two scenarios:
- When you cannot evaluate the true model quality. Just like with output drift, you can track input drift as a proxy for understanding the model relevance. You'd usually interpret data and prediction drift together to get the full picture.
- When you want to detect changes early. Even if you get the true labels fast, data drift monitoring can help detect an upcoming change before it affects the model quality.
The usual approach to drift detection is to compare distributions between the reference dataset and the current production data at the per-feature level. You can use statistical tests like Kolmogorov-Smirnov, distance metrics like Wasserstein distance, etc. You can also add a heuristic on top to judge whether the whole dataset drifted, e.g., if over 50% of the important features drifted.
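A sketch of the dataset-level heuristic: run a per-feature drift check (here a crude mean-shift rule stands in for a proper statistical test) and declare dataset drift when over 50% of the features drifted. All names and thresholds are illustrative.

```python
def feature_drifted(reference, current, rel_threshold=0.2):
    """Crude stand-in for a statistical test: flag drift if the mean
    shifted by more than rel_threshold of the reference mean."""
    ref_mean = sum(reference) / len(reference)
    cur_mean = sum(current) / len(current)
    return abs(cur_mean - ref_mean) > rel_threshold * abs(ref_mean)

reference_data = {
    "age":    [30, 32, 31, 29, 33],
    "income": [50, 52, 49, 51, 48],
    "visits": [3, 4, 3, 5, 4],
}
current_data = {
    "age":    [45, 47, 46, 44, 48],   # shifted up
    "income": [80, 82, 79, 81, 78],   # shifted up
    "visits": [3, 4, 4, 3, 4],        # stable
}

drift_flags = {
    f: feature_drifted(reference_data[f], current_data[f]) for f in reference_data
}
share_drifted = sum(drift_flags.values()) / len(drift_flags)
dataset_drift = share_drifted > 0.5  # the "over 50% of features" heuristic
```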
Further reading: if you want to dive deeper, here is a practical experiment that compares various approaches to data drift detection.
If you detect a distribution shift, it can signal that the model is applied to a new, unknown population (imagine an expanding user base) or that something is happening in the environment (new marketing campaigns affecting user behavior). Either way, it can affect the model quality. You can then evaluate the appropriate actions: from retraining the model to pausing it.
Step 6. If you do not get the labels immediately, consider monitoring for the input data drift. It can complement prediction drift monitoring and warn early about the model quality decay.
7. Advanced monitoring
The list is not complete!
You might need a more comprehensive monitoring system, especially for high-stakes use cases. Here are some of the additional things to consider:
- Performance by segments. You can check the performance on specific slices of data and proactively find low-performing segments.
- Model bias and fairness. Highly relevant in insurance, education, healthcare, etc.
- Outliers. In cases like credit scoring, you might monitor outliers to send them for manual review.
- Explainability. The ability to examine individual model decisions might be critical for the end-user experience, whether to generate trust in the model or to debug it.
Tips to start
This blog provides a basic blueprint, but every case is still unique.
You do not need to track all the possible metrics (alert fatigue is real!), but you should define and track the right ones. It is best to analyze the specific model risks and business goals to adjust the setup for your use case.
Here is a list of sample questions to think through.
To define the overall monitoring setup:
- How critical is the use case? You will need more comprehensive monitoring if the model brings a lot of revenue or contains specific risks such as unfair user treatment. If your only risk is suggesting an irrelevant movie, you can start with something simpler. (If you are Netflix, that might be a lot of money, still!)
- How quickly and easily do you get the labels? There is a major difference between tracking true model quality and judging it through proxies. It affects both monitoring "contents" and the infrastructure and ETL jobs you'd need to build and run.
- How many resources do you have to monitor and maintain the models? There might be trade-offs, and simpler architectures will be more robust, especially if you do not have a complete ML platform just yet.
- How do you plan to respond to the issues? You might want to get alerts and manually debug the issues or use monitoring as an automated trigger for retraining or model switching. The metrics and tests would be quite different between the two.
To think through specific metrics:
- Which metrics did you use to evaluate the models during training? It's a good place to start if the pre-deployment evaluation or A/B test was done right.
- Are there specific populations of interest that you care about? Instead of tracking a lot of metrics, you might rather need to track the same metric for different segments.
- What is your model decay profile? How frequently do you plan to update it? You can learn a lot even before you deploy the model, for example, by estimating how quickly the model ages. This can also help you choose the frequency of checks and the comparison windows.
- How complex are the pipelines? Where is the data coming from? If you use a lot of data sources and feature transformations, you might need to think through data quality checks from the start.
- What are your important features? You might need different metrics and tests when you have a few well-understood, important features compared to hundreds of weaker ones.
- What type of issues did your model have in the past? You can think back from known model failures to decide how you could have detected them in advance. You might add specific custom checks based on this, e.g., monitor closely the behavior of a particular feature that can signal the issues early.
In many cases, there is no need to overthink it. It is always better to start small and expand than to leave your models unmonitored!
If you have limited resources and are only starting with ML monitoring, your primary goal is to detect and learn about the issues. You need to log the predictions and define a set of basic tests.
Here is a reasonable metrics hierarchy to start building your checks:
- Software health. Monitor software service health or your batch jobs. If the service is down, there is no use in the model.
- Model calls. Track the number of model calls explicitly to evaluate its usage.
- Model outputs. Find the best possible way to track the model outputs. Model quality and business KPIs are the best signals. If you cannot measure them, tracking statistical prediction drift is often a good proxy.
- Input data quality. Check for data quality and integrity, especially if you have complex pipelines and multiple data sources. Consider tracking missing data, schema compliance, value ranges, and feature interactions.
- Data drift. If you do not get the labels fast, or deal with complex use cases, track changes in the input distributions to understand the changes in the model environment.
To think through the specific monitoring setup, consider the use case importance, error costs, how quickly you get the labels, and whether you plan to react to issues automatically or keep the human in the loop. If you operate in a high-stakes environment, you might need to expand your monitoring by tracking subpopulations, outliers, and model bias.
In the next blog, we'll cover different architectures for ML monitoring. Sign up to get an update once it's out!