Q&A: ML drift that matters. "How to interpret data and prediction drift together?"

TL;DR: Data and prediction drift often need contextual interpretation. In this blog post, we walk you through possible scenarios for when you detect these types of drift together or independently.

Production machine learning systems have issues. But sometimes, you cannot even learn about them! You are just making predictions without immediate feedback. In this case, it is common to keep tabs on the model inputs and outputs as a proxy for model performance.

But what if one of the alarms fires and the other does not? Let's look at how you can interpret it.

Let's first cover the bases.

Data drift is the change in the input data distributions. One can also call it feature or input drift. For example, your model uses weather data as inputs. It used to be freezing, and now it is tropical. Data drift detected!

Output drift, also called prediction drift, is the change in the model predictions, recommendations, or whatever else it returns. For example, the model rarely suggested shoppers buy sunglasses, and now they are in every recommendation block. Prediction drift detected!

You can evaluate both input and output drift with the help of statistical tests to compare the "new" data distributions to the "old." Or, you can rely on simpler descriptive statistics.
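To make the comparison concrete, here is a minimal sketch of one such check for a single numeric feature. It uses the Population Stability Index (PSI), which is only one of many possible drift metrics and not a method this post prescribes. The function name, the equal-width binning, and the 0.2 alert threshold (a common rule of thumb) are assumptions for illustration.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], n_bins: int = 10
) -> float:
    """PSI between two numeric samples, with equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clip out-of-range values into edge bins
            counts[idx] += 1
        # a small floor keeps log() defined when a bin is empty
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Made-up data: the "old" sample and a "new" sample shifted upward.
old_values = [float(x % 20) for x in range(200)]
new_values = [v + 10 for v in old_values]

ALERT_THRESHOLD = 0.2  # assumed rule of thumb, tune for your use case
drift_detected = population_stability_index(old_values, new_values) > ALERT_THRESHOLD
print("Drift detected:", drift_detected)
```

The same function can be applied to the model predictions instead of a feature, which gives you an output drift check built the same way.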

Assuming you have both tests set up, how can you interpret them together?

Scenario 1. The inputs drifted, but the predictions are stable.

It is natural to pay attention to the output drift as a first-class citizen. Many of the data issues would make the model predictions go rogue anyway.

But what if the model predictions are stable, but the data inputs shift significantly? Is it something you should worry about?

As usual with machine learning, it depends on the context.

There are two ways to interpret it: positive or negative.

Positive interpretation: the model is all set, and the drift does not matter!

Important features are stable. Depending on how you set up drift detection, the input drift detector might be overly sensitive to the less important features. Even if they shift, this does not affect the model predictions (and reality) much.
The model is robust and can adapt to drift. Even if things are a bit volatile, the model handles this drift well. For example, you trained it using a long enough period, and it has seen enough volatility in history to respond appropriately.

There is no need to adjust or retrain the model in either case. You might want to adjust the data drift detection approach, though. For example, you might limit drift detection only to the most important features, change the size of comparison windows, or pick a less sensitive statistical test.
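One way to limit drift detection to the most important features could look like the sketch below. It assumes you already have a per-feature drift score (for example, from a statistical test or a distance metric) and a feature importance value from your model. All feature names, scores, and the threshold are made up for illustration.

```python
from typing import Dict, List


def drifted_important_features(
    drift_scores: Dict[str, float],
    importances: Dict[str, float],
    score_threshold: float = 0.2,
    top_k: int = 3,
) -> List[str]:
    """Return the drifted features among the top_k most important ones."""
    top = sorted(importances, key=importances.get, reverse=True)[:top_k]
    return [name for name in top if drift_scores.get(name, 0.0) > score_threshold]


# Hypothetical inputs: only a low-importance feature has drifted.
importances = {"temperature": 0.5, "humidity": 0.3, "day_of_week": 0.15, "banner_id": 0.05}
drift_scores = {"temperature": 0.05, "humidity": 0.03, "day_of_week": 0.01, "banner_id": 0.9}

# banner_id drifted strongly, but it is outside the top 3, so nothing is flagged.
print(drifted_important_features(drift_scores, importances))
```

With this setup, a shift in a marginal feature no longer triggers an alert, while a shift in a key feature still does.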

Negative interpretation: the model is unreasonable!

Important features changed. You might look at the drift in more detail and confirm what your drift detector made clear: features that are key to the model are now behaving differently. That is something the model should respond to.
The model should have reacted, but it did not. In the simplest example, a model might not extrapolate well. We've seen such an example in the "How to Break a Model in 20 Days" tutorial: despite the weather becoming warmer, the model could not predict an increase in the demand for city bike rentals.

In this case, relying on the seeming stability of the model output would have been a mistake. You need to intervene to retrain or rebuild the model!
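A simple guard against this failure mode is to check how much of the new data falls outside the range the model saw in training. The sketch below is a hypothetical check, not the method used in the tutorial, and the temperature values are invented.

```python
from typing import Sequence


def out_of_range_share(
    training_values: Sequence[float], current_values: Sequence[float]
) -> float:
    """Share of current values that lie outside the min/max seen in training."""
    lo, hi = min(training_values), max(training_values)
    outside = sum(1 for v in current_values if v < lo or v > hi)
    return outside / len(current_values)


# Made-up temperatures: the model was trained on a cold period.
training_temperature = [-5.0, 0.0, 3.0, 8.0, 12.0]
current_temperature = [10.0, 14.0, 18.0, 21.0]

share = out_of_range_share(training_temperature, current_temperature)
print(share)  # 0.75: three of the four new values are warmer than anything in training
```

A high share combined with flat predictions is a hint that the model is being asked to extrapolate and is not doing it.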

Scenario 2. Both input data and model predictions drifted.

Panic mode! Everything changed overnight.

It might not be as bad as it sounds, though. Let's consider both cases.

Positive interpretation: the model handles drift well!

Important features changed. Yes, you are observing the drift. There is a meaningful change to the real-world phenomenon, and the model is operating in a non-typical setting. But…
The model is robust and reacts well. You don't always expect that from a machine learning model that only learns from available observations. But if built right, it can adequately respond to the changes. For example, it will monotonically increase the sales forecast following lowered prices. You might even incorporate such assumptions into your machine learning system design.

In this case, there is no need to intervene. Yes, the reality changed, and so did the model predictions, but the model behavior follows expectations. Think of our fictitious example, where an e-commerce system starts upselling sunglasses in response to the sunny weather outside. If changes continue accumulating, you might need to calibrate or rebuild the model, but for now, it's good to go!
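If you expect a monotonic response, such as a higher sales forecast at lower prices, you can test it directly by probing the model. The sketch below varies one feature while holding the others fixed and checks the direction of the predictions. The `predict` callable, the toy forecast, and the feature names are hypothetical stand-ins for your own model.

```python
from typing import Callable, Dict, Sequence


def is_monotonic_response(
    predict: Callable[[Dict[str, float]], float],
    base_row: Dict[str, float],
    feature: str,
    grid: Sequence[float],
    increasing: bool = True,
) -> bool:
    """Check that predictions move in one direction as `feature` grows over `grid`."""
    preds = [predict({**base_row, feature: value}) for value in sorted(grid)]
    pairs = list(zip(preds, preds[1:]))
    if increasing:
        return all(later >= earlier for earlier, later in pairs)
    return all(later <= earlier for earlier, later in pairs)


def toy_sales_forecast(row: Dict[str, float]) -> float:
    """Stand-in model: sales fall as price rises and grow with temperature."""
    return 1000.0 - 40.0 * row["price"] + 5.0 * row["temperature"]


base = {"price": 10.0, "temperature": 20.0}
# Lower prices should never lead to a lower forecast, so we expect a
# non-increasing response as the price goes up.
print(is_monotonic_response(toy_sales_forecast, base, "price", [5, 8, 10, 12], increasing=False))
```

Running a probe like this on the drifted input range is one way to tell the "model reacts well" case from the "things have gone rogue" case.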

Negative interpretation: things have gone rogue!

That is probably the first idea to cross your mind when all alarms fire. And that might be true sometimes.

Important features changed. Yes, the reality is new to the model. This was not seen in training or earlier history. And…
The model behavior is unreasonable. It does not know how to react to these changes, and the predictions are low quality and do not adapt in a "logical way." You will probably see the model error rate increase very soon.

In this case, you need to intervene. Start by investigating the causes, then choose the appropriate action: solve the data quality issues, retrain the model, or rebuild it.

Scenario 3. The model output drifted, but the input data did not.

This is a more concerning case, and one that might not feel right. What if the predictions drifted while the features look stable?

Output drift is always a solid signal to dig deeper. In most cases, this situation would mean some bug, data quality issue, or misconfigured drift detector.

In the first case, this is a symptom of an error. The drift detector does not work right.

For example, there might be a bug in the code that performs the drift test. You might accidentally point to the incorrect reference dataset. Or the evaluation job fails without an alert. It is best to pick reliable tooling!

If you run the prediction drift test after making some transformations to the model output, this transformation code or business logic might be the culprit, too.

In the second case, this is the signal to review the sensitivity of your drift test.
You often tune prediction drift and data drift thresholds independently. For example, you adjust your input drift detector to only react to major shifts to avoid alert fatigue. In this case, it might not fire when you have a lot of small changes in each feature. But this drift is real, and it already affects the model outputs.
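Here is a small sketch of how that can happen. It assumes you have one drift score per feature, and it compares a strict per-feature rule with an aggregate rule over all features. Both rules, their thresholds, and the scores are hypothetical.

```python
from statistics import mean
from typing import Dict


def strict_alert(scores: Dict[str, float], per_feature_threshold: float = 0.2) -> bool:
    """Fires only when at least one feature shows a major shift."""
    return any(score > per_feature_threshold for score in scores.values())


def aggregate_alert(scores: Dict[str, float], mean_threshold: float = 0.1) -> bool:
    """Fires when the average shift across all features is elevated."""
    return mean(scores.values()) > mean_threshold


# Made-up scores: every feature moved a little, none moved a lot.
scores = {"price": 0.15, "temperature": 0.14, "traffic": 0.16, "stock": 0.15}

print(strict_alert(scores))     # False: no single feature crosses 0.2
print(aggregate_alert(scores))  # True: the average shift is above 0.1
```

If the strict rule is the only one in place, the input drift alarm stays silent even though the combined change is enough to move the predictions.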

In another situation, the opposite might be true. Your output drift detector might be overly sensitive and react even to a minor variation. If you don't mind a few false positives, that might still be the best approach for highly important use cases.

In any case, this discrepancy between data and prediction drift is a good signal to run an investigation or review your settings.

Scenario 4. No drift detected.

Positive scenario: awesome! Let's grab a coffee.

Negative scenario: did the drift detection job even work?

Okay, let's not dig into that too much. Nevertheless, it's always a good practice to re-evaluate your monitoring approaches every once in a while to avoid false negatives.
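One lightweight safeguard is to alert on the monitoring job itself. The sketch below flags a drift job whose last successful run is missing or too old. Where the timestamp comes from (a scheduler, a log, a metadata table) and the 24-hour limit are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def drift_job_is_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> bool:
    """True if the drift job never succeeded or its last success is too old."""
    return last_success is None or now - last_success > max_age


# Hypothetical timestamps for illustration.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
last_run = now - timedelta(hours=30)

if drift_job_is_stale(last_run, now):
    print("No recent drift check: 'no drift detected' cannot be trusted")
```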

Summing up

Drift detection is a nuanced thing! Here are some takeaways on evaluating input and output drift of an ML model.

If you look only at one signal, the prediction drift is typically more important. If detected, it is a clear signal to investigate the reason.
Yet, if you look at them both, interpret them in context! Not all drift matters if your model can respond to changes well. On the other hand, the lack of output drift can be deceiving when the model cannot extrapolate. It might make sense to look at data and prediction drift together to get the complete picture.
Blind retraining when drift is detected can be suboptimal. You should consider two sets of risks and costs. First, unnecessary retraining of the model in response to harmless drift. Second, retraining the model on low-quality or non-representative data if the drift is real. It might make sense to perform the root-cause analysis and data comparison before hitting "retrain."
Your drift detection approach is key. You can get false positive or false negative alerts if you pick the wrong comparison windows or statistical tests. It often makes sense to evaluate historical data drift to tune the drift detection method for your use case.
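As a rough illustration of such tuning, you could replay a drift metric over past periods and count how often each candidate threshold would have fired. The weekly scores below are invented, and the function is a sketch rather than a complete backtest.

```python
from typing import Dict, Sequence


def alerts_per_threshold(
    historical_scores: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, int]:
    """Count how many past periods would have triggered an alert at each threshold."""
    return {t: sum(score > t for score in historical_scores) for t in thresholds}


# Made-up weekly drift scores computed on historical data.
weekly_scores = [0.03, 0.05, 0.04, 0.12, 0.06, 0.31, 0.05, 0.08]

print(alerts_per_threshold(weekly_scores, [0.05, 0.1, 0.2]))
# {0.05: 4, 0.1: 2, 0.2: 1}
```

Comparing these counts with the periods when the model quality actually dropped helps you pick a threshold that is neither too noisy nor too quiet.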

Stay tuned for our upcoming blog posts, where we discuss the statistical approaches behind drift detection!

Have a question about production machine learning? Join our Discord community and ask in the #ds-ml-questions channel.

Emeli Dral

Elena Samuylova
