When you deploy a machine learning model in production, it faces real-world data. As the environment changes, this data may start to differ from what the model saw during training. This phenomenon, known as data drift, can make the model less accurate.
In this article, we'll break down what data drift means, why it matters, and how it's different from similar concepts. We'll cover how to spot and deal with data drift to keep your models working well even when the data is always changing.
We’ll also show how to evaluate data drift using the open-source Evidently Python library.
TL;DR. Data drift is a shift in the distributions of the ML model input features.
Data drift is a change in the statistical properties and characteristics of the input data. It occurs when a machine learning model is in production, as the data it encounters deviates from the data the model was initially trained on or earlier production data.
This shift in input data distribution can lead to a decline in the model's performance. The reason is that a machine learning model can be expected to perform well on data similar to the data it was trained on. However, it might struggle to make accurate predictions or decisions if the data keeps changing and the model cannot generalize beyond what it has seen in training.
In simple terms, data drift is a change in the model inputs the model is not trained to handle. Detecting and addressing data drift is vital to maintaining ML model reliability in dynamic settings.
Let’s take a simple example.
Imagine a retail chain that uses machine learning to predict how many products of a particular type they need to stock in each of their stores. They trained their model using historical sales data from the past few years.
Until now, most of their sales have been in physical stores, and their model has become quite good at forecasting demand for in-store products. However, after the retailer ran a marketing campaign to promote their new mobile app, there's been a significant shift towards online sales, especially for some product categories.
The training data didn't have enough online sales information, so the model didn't perform as well for this segment. But it didn't matter much because online sales were a small part of their business. With the surge in online shopping, the quality of the model's forecasts has significantly dropped, affecting their ability to manage inventory effectively.
This shift in sales channels, from predominantly in-store to largely online, is an example of data drift.
A few terms related to data drift are concept drift, model drift, prediction drift, and training-serving skew. The topic of data drift also comes up in data engineering and data analysis outside machine learning, where it might take a different meaning. Let’s explore these differences in more detail.
A word of caution: none of these terms is strictly defined! Examining them helps grasp various factors that impact a machine learning model in production. However, in the real world, these semantic distinctions are rarely important. Multiple changes can happen simultaneously, and practitioners tend to use these terms interchangeably.
TL;DR. Data drift is a change in the input data. Concept drift is a change in input-output relationships. Both often happen simultaneously.
While data drift describes changes in the data distribution, concept drift relates to changes in the relationships between input and target variables. Put simply, concept drift means that whatever your model is predicting is itself changing.
Data drift might be a symptom of concept drift, and both often co-occur. However, it is not a must. In the example with retail sales forecasting above, there can easily be no concept drift. The shopping preferences of customers, both online and offline, might remain consistent. The same goes for the average basket size per channel. However, a new marketing campaign reshapes the distribution of shopper segments. The model quality might drop because a segment it performs worse on became larger – not because the patterns have changed. This is data drift without concept drift.
An example of concept drift could be a new competitor offering massive discounts on the same product groups your retail store sells. This could shift shopper behavior, resulting in a drastic decrease in the average basket size in offline stores. This could, in turn, lead to inaccurate forecasts. Another instance could be the onset of COVID-19, which transformed how people shopped and disrupted logistical patterns. In these cases, all previously created models became almost obsolete.
The difference: Data drift refers to the shifts in input feature distributions, whereas concept drift refers to shifts in the relationships between model inputs and outputs.
The similarity: Both data drift and concept drift can result in a decline in model quality and often coincide. In monitoring, data distribution drift can be a symptom of concept drift.
Want a deeper dive? Read an in-depth explainer about concept drift.
TL;DR. Data drift is a change in model inputs, while prediction drift is a change in the model outputs.
When discussing data drift, we typically refer to the input features that go into the model. Prediction drift, in comparison, is the distribution shift in the model outputs.
The shift in model outputs can signal changes in the environment or issues with the model quality. Often, this is the best proxy if you cannot directly measure the model performance. Imagine that a fraud model starts to predict fraud more often. Or, a pricing model is now showing significantly lower prices. The change in the model predictions is a good reason to investigate.
Prediction drift can signal many issues, from low-quality data inputs to concept drift in the modeled process. At the same time, prediction drift does not always imply model deterioration. It can also occur if the model adjusts well to the new environment. For example, if there is an actual increase in fraud attempts, you can expect that the distribution of the predicted fraud cases will look different. In this case, you could observe both feature and prediction drift without a decay in the model quality.
Want a deeper dive? Read how to interpret prediction and data drift together.
It's worth noting that sometimes practitioners refer to data drift or dataset shift as a joint term referring to both model inputs and output changes.
The difference: data drift refers to the changes in the model input data, while prediction drift refers to the changes in the model outputs.
The similarity: both data and prediction drift are useful techniques for production model monitoring in the absence of ground truth and can signal the change in the model environment.
TL;DR. Training-serving skew is a mismatch between training and production data. Data drift is a shift in the distribution of production data inputs over time.
Training-serving skew is a situation where there's a mismatch between the data the model was trained on and the data it encounters in production.
While environmental changes can contribute to this skew, it covers all possible discrepancies between the two datasets, including issues related to data preprocessing, feature engineering, and more. In addition, while data drift is usually a gradual process you encounter during model operations, training-serving skew typically shows up in the immediate post-deployment window.
For example, you can encounter a training-serving skew when the features available in training are not possible to compute in production or come with a delay. The model won't be able to perform as well if it lacks important attributes it was trained to consider.
You might also face training-serving skew if you train the model on a synthetic or external dataset that does not fully match the model application environment.
The difference: data drift refers to the gradual change in the input data distributions. Training-serving skew refers to the mismatch visible shortly after the start of the model production use and can include issues unrelated to the changes in the environment.
The similarity: in both cases, we refer to the changes in the input data. You might use similar distribution comparison techniques to detect input data drift and training-serving skew by contrasting production data with training.
TL;DR. Data drift refers to the change in data distributions in otherwise valid data. Data quality issues refer to the data bugs, inconsistencies, and errors.
Broadly speaking, data drift can include all sorts of changes and anomalies in data. For instance, you can consider changes in the data schema, missing values, inconsistent formatting, or incorrect inputs as examples of “data drift.” This is often the case in domains outside machine learning, such as database management and data analysis.
However, it is often beneficial to distinguish between data quality issues and data drift in machine learning. They have different implications and solutions.
Data quality issues refer to corrupted and incomplete data that might occur, for example, due to pipeline bugs or data entry errors. Data drift refers to the change in distributions in otherwise correct and valid data that might occur due to environmental shifts.
When you detect a data distribution shift, you can often trace it to a data quality issue. For example, if an entry error accidentally changes a feature's scale, you will notice a statistical shift in its distribution. Because of this, it helps to separate the two groups of checks. First, you verify the data quality: whether the data is complete, whether relevant features remain non-negative, and so on. Then, you apply data distribution checks to see if there is a statistical shift in the feature patterns. Otherwise, whenever drift is detected, you would first need to rule out data quality issues as a possible root cause.
The difference: Data quality concerns issues like missing values or errors in the data. Data drift refers to statistical changes in the data distribution, even if the data has high quality. Data quality issues can lead to observed data drift, but they are not the same thing.
The similarity: Both data quality issues and data drift can lead to model quality drops, and both refer to the changes in the data. Data drift detection techniques can often expose data quality issues.
TL;DR. Data drift refers to the change in the overall data distributions. Outlier detection is focused on identifying individual anomalies in the input data.
Drift detection concerns the "global" data distributions in the whole dataset. You want to detect whether the data has shifted significantly compared to a past period or to model training. The goal is to decide whether you can still trust the model to perform as expected and whether the environment remains similar.
Outlier detection serves a different purpose. You want to identify individual objects in the data that look different from the rest. Often, the goal is to act on the level of these individual objects. For example, you can ask a human expert to make a decision instead of the ML model or apply specific business logic for this particular output, such as declining to make a prediction. This often happens when the cost of an individual error is high.
Data drift and outliers can exist independently. For example, you can observe dataset drift without outliers or individual outliers without data drift. You’d typically design detection methods differently: drift detectors should be robust to some outliers, while outlier detectors should be sensitive enough to catch individual anomalies.
The difference: Data drift refers to changes in the overall data distribution, while outlier detection focuses on identifying individual unusual inputs in the data. Drift detection helps evaluate overall model reliability, while outlier detection helps discover inputs the model might be ill-equipped to handle.
The similarity: Both checks help monitor and understand the changes in input data. You can track the share of outliers in the data as an early signal of data drift.
Data drift is an important concept in production machine learning for a few reasons.
First, conceptually understanding that distribution drift can – and will – happen helps prepare to maintain the production ML model quality through model monitoring and retraining.
Second, tracking data distribution drift can be a technique to monitor the model quality in production when ground truth or true labels are unavailable.
Lastly, data drift analysis can help interpret and debug model quality drops and understand changes in the model environment.
TL;DR. Machine learning models are not "set it and forget it" solutions. Data will shift with time, which requires a model monitoring and retraining process.
You typically train ML models on specific datasets, expecting they'll perform well on unseen, real-world data. However, assuming that the data will remain static is often unrealistic.
Even if there are no drastic changes and events like significant marketing campaigns or COVID-19, you can expect minor variations to accumulate over time. For example, in sales demand forecasting for hundreds or thousands of different items, you can always expect new products to appear and customer preferences and market conditions to evolve.
As a result, complex real-world data will sooner or later deviate from the data used in training. Data (and concept) drift are built-in properties of a machine learning system – which explains the need for ongoing model maintenance.
Typically, you can combat this by regularly retraining the models on the new data to help the model stay up to date. This means you need to design a model update process and retraining schedule.
Additionally, you need a robust model monitoring setup to provide visibility into the current model quality and ensure you can intervene in time. This helps detect model quality issues in between updates or design the retraining process to happen on a trigger.
Tracking the true model quality (accuracy, mean error, etc.) is usually the best way to detect changes in the model. However, it is not always possible to measure the model quality directly due to feedback delays. In this case, you might need to design early monitoring using proxy metrics – and tracking data distribution drift is one of the options.
Want a deeper dive? Read an in-depth guide to ML model monitoring.
TL;DR. Feedback delay is a time lag between model predictions and receiving feedback on those predictions. Monitoring input data distribution drift is a valuable proxy when ground truth labels are unavailable.
Feedback delay can occur when there is a significant time gap between the model making a prediction and receiving feedback on the accuracy or correctness of that prediction.
For example, in a recommender system, there might be a delay between when the system recommends an item to a user and when we can determine if the user liked or interacted with that recommendation. This delay can vary from seconds (for some online interactions where users can immediately click or accept a recommended offer) to minutes, hours, or even days, depending on the nature of the task.
In tasks like payment fraud detection, it may be difficult to definitively label a user transaction as fraudulent or legitimate until you perform further investigation or get user feedback. Sometimes, ground truth labels may only become available weeks or months later, such as when a customer disputes a fraudulent transaction.
Having ground truth labels (Did the user buy a recommended product? Was it indeed a fraudulent transaction?) is crucial for evaluating the model quality in production. It is hard to make real-time decisions on the model quality otherwise. At the same time, say, a problematic fraud detection model may cause significant harm before issues are identified and resolved.
In these scenarios, you can set up early monitoring using proxy metrics. Monitoring input data distribution drift is one such method: you can apply different techniques to compare the distribution of the incoming data to previous batches. This helps detect significant environmental shifts that might affect the model performance before you measure it directly.
You can use various data distribution comparison techniques to evaluate changes in both model input features and model outputs. We’ll briefly cover different drift detection methods in the following section.
TL;DR. Analyzing input data distribution drift helps explain and locate the reasons for model quality drops, as well as notice important changes in the modeled process.
Evaluating data drift is also a useful technique for model troubleshooting and debugging. If you observe a model quality drop through a direct metric like accuracy, your next step is investigating the underlying cause. This usually boils down to looking for changes in the input data.
Data drift analysis helps understand and contextualize the changes in the model features and locate the source. In this scenario, you might not use drift detection as an alerting signal – however, you can employ the data drift analysis when debugging.
A simple approach is to run checks for per-feature distribution comparison and identify which features have shifted most significantly. Then, you can visually explore their distributions to interpret the change using your domain knowledge. Are users coming from a new source? Is there a new product group absent in the training data?
For example, you might notice a shift in a particular categorical feature that helps identify a new emerging user segment. You might also detect uncommunicated changes to the modeled process, such as new products added to the marketing promotions or users in new locations.
As we explored earlier, comparing the distributions of the input features and model output helps with early monitoring and debugging ML model decay.
But how exactly do you detect a change in the distributions? How “different” is different enough? Let’s review some of the possible methods.
A common way to compare two distributions is by looking at key statistics like the mean, median, variance, quantiles, etc.
For example, you can check if the current mean value of a numerical variable is within two standard deviations from the reference value. You can monitor the value of any individual statistics to react when it shifts.
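As an illustration, here is a minimal sketch of such a check in Python, assuming `reference` and `current` are arrays holding a single numerical feature (the function name, the synthetic data, and the two-standard-deviation threshold are all illustrative):

```python
import numpy as np

def mean_within_bounds(reference, current, n_std=2.0):
    """Flag a shift if the current mean falls outside
    n_std reference standard deviations of the reference mean."""
    ref_mean = np.mean(reference)
    ref_std = np.std(reference)
    return abs(np.mean(current) - ref_mean) <= n_std * ref_std

rng = np.random.default_rng(0)
reference = rng.normal(loc=100, scale=10, size=1000)
stable = rng.normal(loc=101, scale=10, size=1000)   # minor variation
shifted = rng.normal(loc=140, scale=10, size=1000)  # clear mean shift

print(mean_within_bounds(reference, stable))   # True
print(mean_within_bounds(reference, shifted))  # False
```

You can run the same comparison for the median, a chosen quantile, or the variance, depending on which statistic you expect to stay stable.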
Monitoring only summary statistics has downsides, especially when you watch many features simultaneously, as it can become noisy. Still, this is a viable strategy when you have expectations, e.g., about median or quantile values for particular features based on your domain knowledge.
You can also track feature range compliance, such as whether the values stay within a min-max range. This helps detect data quality issues, such as negative sales or sudden shifts in feature scale. However, this approach might not catch data drift when values remain within the expected range but show a different distribution pattern.
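A range compliance check can be sketched in a similar way. Here, the `daily_sales` data and the expected range are illustrative, assuming a sales-like feature that should stay non-negative:

```python
import numpy as np

def share_in_range(values, min_val, max_val):
    """Return the share of values that fall within the expected range."""
    values = np.asarray(values)
    return np.mean((values >= min_val) & (values <= max_val))

daily_sales = np.array([12, 0, 34, -5, 18, 7])  # -5 is a likely data bug
print(share_in_range(daily_sales, min_val=0, max_val=1000))  # ≈ 0.83
```

Alerting when this share drops below 1.0 (or below some tolerance) catches data quality bugs, but, as noted above, not distribution changes that stay inside the range.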
A more advanced drift detection approach involves statistical hypothesis testing. For example, you can use the Kolmogorov-Smirnov test for numerical features or the Chi-square test for categorical attributes. These tests help assess whether the difference between the two datasets is statistically significant.
If a difference is detected, it suggests that the two datasets come from distinct distributions rather than differing due to random sampling variation. The output of a statistical test (its "drift score") is a p-value that serves as a measure of confidence.
You might also need to consider the specifics of the data distribution to pick an appropriate test: for example, whether you expect the distribution to be normal.
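As a sketch, both tests mentioned above are available in SciPy; the synthetic samples, the category counts, and the 0.05 significance level below are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=500)
current = rng.normal(loc=0.5, scale=1.0, size=500)  # simulated mean shift

# Two-sample Kolmogorov-Smirnov test for a numerical feature
ks_result = stats.ks_2samp(reference, current)
drift_detected = ks_result.pvalue < 0.05

# Chi-square test for a categorical feature: compare observed category
# counts against frequencies expected from the reference proportions
ref_counts = np.array([300, 150, 50])   # category counts in reference
cur_counts = np.array([180, 220, 100])  # category counts in current
expected = ref_counts / ref_counts.sum() * cur_counts.sum()
chi2_stat, chi2_pvalue = stats.chisquare(cur_counts, f_exp=expected)
```

A low p-value in either test flags the feature as drifted for the chosen significance level.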
Detecting drift in individual features using statistical distribution tests often makes sense when you have a small number of important and interpretable features and deal with smaller datasets and high-stakes use cases like healthcare or education.
However, it's essential to remember that a statistically significant difference may not always be practically significant. A useful rule of thumb is that if you can detect the difference with a relatively small sample, it's likely important. Otherwise, the tests may be overly sensitive.
Another approach involves using distance metrics that quantitatively measure how far apart the distributions are. Distance metrics are valuable for understanding the extent of drift. A few commonly used distance metrics are Wasserstein Distance, Jensen-Shannon Divergence, or Population Stability Index (PSI), often used in credit risk modeling.
The output (the “drift score”) is usually a distance or divergence. The higher the value, the further apart the distributions are. Depending on the metric, you might work with an absolute scale or a scale from 0 to 1.
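For illustration, Wasserstein distance and Jensen-Shannon distance are available in SciPy, while PSI is simple enough to sketch by hand. The binning scheme and the small clipping constant below are common but illustrative choices:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import jensenshannon

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index over histogram bins derived from the reference."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, eps, None)  # avoid log(0) in empty bins
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(7)
reference = rng.normal(loc=0.0, scale=1.0, size=2000)
current = rng.normal(loc=1.0, scale=1.0, size=2000)  # one-sigma shift

w_dist = wasserstein_distance(reference, current)  # roughly the shift size
psi_score = psi(reference, current)                # > 0.2 often read as major shift

# Jensen-Shannon distance (square root of the divergence), 0 to 1 with base=2
edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=20)
ref_hist = np.histogram(reference, bins=edges)[0]
cur_hist = np.histogram(current, bins=edges)[0]
js_dist = jensenshannon(ref_hist, cur_hist, base=2)
```

Unlike a p-value, each of these scores grows with the size of the shift, which makes them convenient to plot over time.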
Want to read more about drift detection methods? Check out this blog.
The benefit of this approach is that it allows quantifying the change rather than testing the assumption that both samples come from the same distribution. You can treat the resulting metric as a reflection of “drift size” that you can observe over time.
When dealing with a large dataset, it is advisable to use distance metrics since statistical tests are likely to be overly sensitive in such scenarios. You can also choose to monitor “aggregate” drift by tracking the share of drifted features in the dataset as opposed to individual feature drifts.
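A sketch of such an aggregate check, running a per-feature KS test over a dictionary of feature arrays (the feature names, sample sizes, and the 0.05 threshold are illustrative):

```python
import numpy as np
from scipy import stats

def share_of_drifted_features(reference, current, alpha=0.05):
    """Share of numerical features flagged as drifted by a two-sample KS test."""
    drifted = [
        feature for feature in reference
        if stats.ks_2samp(reference[feature], current[feature]).pvalue < alpha
    ]
    return len(drifted) / len(reference)

rng = np.random.default_rng(1)
reference = {
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(50_000, 15_000, 500),
}
current = {
    "age": rng.normal(40, 10, 500),             # stable
    "income": rng.normal(65_000, 15_000, 500),  # shifted by one sigma
}
share = share_of_drifted_features(reference, current)
```

You can then alert when this share crosses a threshold, such as 50% of features drifted.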
Lastly, you can perform some drift checks using straightforward rules. For example, you might choose to be alerted whenever feature values fall outside a defined range or new categorical values appear in the data.
While such checks do not directly measure drift, they can serve as good alerting heuristics to notify about likely meaningful changes to investigate further.
Say you detected data drift in your input model features using your preferred method. What are the next steps?
The first step in addressing data drift is analysis. Before taking any action, you need to understand the nature of the shift.
You can start by comparing the visual distributions of the drifted features to explain the source change. Is it a genuine shift in the data, a data quality issue, or a false positive?
Sometimes, there is a legitimate business explanation, such as a product change, and no action is required. You may also wait to collect enough data for model retraining.
On the other hand, many issues manifesting as data drift might stem from data quality bugs. In this case, you need to locate and address the problem – for example, by contacting data producers or fixing the feature transformation pipeline.
The frequent presence of data quality issues in production pipelines is a good reminder of why using drift detection as an automatic retraining trigger might be suboptimal. Before using data to retrain, you must verify that the data is valid and complete – and can serve as "training" material.
You might also face false positives. If you observe unnecessary data drift alerts, you might adjust the sensitivity of your drift detection methods. A good rule of thumb is to consider feature importance and, for example, alert only to the drift in top model features.
If you face a true data drift, you might need to take specific actions, such as retraining your model or updating the decision process.
A common industry approach is to retrain models using the labeled data from the newly observed distribution. If labels are available and there is sufficient new data for retraining, this is usually a recommended course of action.
Depending on the extent of the drift, you might choose different strategies for the model update. For example, you can retrain the model using both old and new data. Alternatively, you can give higher weight to more recent data or drop the old data entirely. You can also re-run the feature engineering and model design process to develop a completely new approach.
In any case, it is essential to have a robust architecture for new model roll-out and a thorough pre-release model testing procedure – to confirm that the new model is good enough.
However, sometimes retraining is not feasible – for example, because you simply do not have the new labels to run the model update. In such cases, you might consider process interventions.
For example, you can stop the model. If the model's predictions are adversely affected by drift, you might need to halt its operation temporarily. You can do that for some affected segments, for example, particular product groups.
You can also consider modifying the decision process on top of the model output. For example, when running a fraud detection model, you might lower the decision threshold that flags transactions as suspicious to send more of those for manual review.
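As a minimal sketch of such an intervention (the routing function, scores, and threshold values are illustrative):

```python
def route_transaction(fraud_probability: float, review_threshold: float) -> str:
    """Send suspicious transactions for manual review instead of auto-approval."""
    return "manual_review" if fraud_probability >= review_threshold else "auto_approve"

scores = [0.15, 0.35, 0.55, 0.80]

# Usual operation: review only high-confidence fraud predictions
routed_default = [route_transaction(s, review_threshold=0.5) for s in scores]

# Under suspected drift: lower the threshold to review more transactions
routed_cautious = [route_transaction(s, review_threshold=0.3) for s in scores]
```

Lowering the threshold from 0.5 to 0.3 sends one extra transaction from this batch to manual review, trading reviewer workload for safety while the drift is investigated.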
What are classification decision thresholds? Read more in the guide.
You can also consider adding other business rules to filter out potentially unreliable model predictions, such as overriding some extreme predictions.
Finally, you can consider switching to other decision-making processes, from alternative fallback models to the human in the loop. For example, you can bring in domain experts to replace the automated decisions with good old human decision-making.
Addressing data distribution shifts doesn't have to be a reactive process. You can also consider adjusting your model design to be more resilient to data shifts.
A frequent approach is feature selection. Before utilizing some of the available features in model training, you can review their historical variability and filter out the features with significant historical drifts. You can also consider applying some feature engineering techniques, such as bucketing volatile numerical features into a limited number of categories.
Sometimes, it is worthwhile to choose a model that is less performant on a historical evaluation set but is more robust to data shifts. You can also consider using domain-based or causal models at the core of your system, which can be more reliable in case of data changes.
Want to read more about handling data drift? Check out the blog.
Evidently is an open-source Python library that helps implement testing and monitoring for production machine learning models. Evidently helps run various checks for your datasets and get interactive visual reports to analyze the results, including data drift.
You can choose from 20+ pre-built data drift detection methods and automatically generate interactive visualizations that help explore the changes visually.
You can easily integrate these checks into your existing ML workflows and pipelines.
Would you like to learn more? Check out the open-source Getting Started tutorial.
Don’t want to deal with deploying and maintaining the ML monitoring service on your own? Sign up for the Evidently Cloud, a SaaS ML observability platform built on top of Evidently open-source.
Join the waitlist ⟶