How to detect, evaluate and visualize historical drifts in the data

Last updated: July 16, 2025

Published: July 29, 2021

TL;DR: You can look at historical drift in data to understand how your data changes and choose the monitoring thresholds. Here is an example with Evidently, Plotly, MLflow, and some Python code.

We often talk about detecting drift on live data for production ML systems.

The goal is then to check if the current distributions deviate from training or some past period. When drift is detected, you know that your machine learning model operates in a relatively unknown space. It's time to do something about it.

But there are a lot of nuances here. How big does a drift have to be to count as DRIFT? Should I care if only 10% of my features drifted? Should I look at drift week-by-week or month-by-month?

The devil is in the details. The answer will heavily depend on the model, use case, how easy (or possible) it is to retrain, and how much the performance drop costs you.

Here is a process that might help you think through it.

Let's look at historical data drift!

Code example: if you prefer to head straight to the code, open this example Jupyter notebook.

⚠️ Disclaimer:
This example uses the Evidently API as available in version 0.6.7 or lower. Please ensure you are using the correct version when running this example. For updated and new examples, visit our documentation.

Why look at the past data drift?

Our goal is to learn drift dynamics. Basically, we want to answer: how much did our data change in the past?

This is useful for two reasons:

First, we can understand the model decay profile.

We shared a similar idea on how you could check for model retraining needs in advance. There, we looked at how model performance changes over time.

Now, we can look at how the data changes instead. You might prefer one approach over the other. Understanding data drift is especially helpful if we know we'll have to wait for the ground truth labels in production.

Assuming that the data changes at some constant pace, we can use this analysis to set our expectations. And prepare the proper infrastructure.

Is the use case highly dynamic? Should we get ready for frequent retraining and build automatic pipelines?

Second, this analysis can help us define the model monitoring triggers.

We might want to check for data drift once the model is in production. How sensitive should our triggers be? What needs to happen for us to react to drift?

To choose the thresholds and conditions, we need to understand how our data changed in the past. It is a delicate balance! We don't want too many false alarms, but we also want to react to meaningful changes.

We can run several drift checks on the past data (modeling different drift thresholds and monitoring windows) and explore the results.

Here is an example of how this can be done.

Defining drift detection logic

Let's take a bike-sharing dataset from Kaggle. We'll use it to explore the past drift in data. You can follow it in this example Jupyter notebook.

In a real-life setting, you can work similarly with your training data.

To decide on our drift detection logic, we should make a few assumptions.

First, we should define the comparison window.

For example, we can look at data month-by-month. This choice depends on your data understanding. How is the data generated? What is the real-world process behind it? How fast do you expect it to change?

Here are a few tips:

If your data has known seasonality, you can account for it. For example, compare week by week rather than day by day to avoid overreacting to weekend patterns. Or compare December data across different years.
If you have an idea of how quickly the model degrades, use it as a starting point. We shared this approach before. If you know that your model degrades in a month, you can look at the weekly drift to see how far in advance you can predict this drop and how it looks.
You can build several drift "views" and set different expectations. For example, look at both weekly and monthly drift and set different thresholds. The use case is what defines it: you can have multiple seasonalities or other known patterns.
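Comparison windows like these can be generated instead of typed by hand. The sketch below is one possible way to do it with the standard library. It assumes hourly records (so the last timestamp in a window is 23:00), and the helper names `monthly_windows` and `weekly_windows` are hypothetical, not part of the original notebook.

```python
import calendar
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def monthly_windows(year, first_month, last_month):
    """Return inclusive (start, end) timestamp strings, one per calendar month."""
    windows = []
    for month in range(first_month, last_month + 1):
        last_day = calendar.monthrange(year, month)[1]  # number of days in this month
        start = datetime(year, month, 1, 0, 0, 0)
        end = datetime(year, month, last_day, 23, 0, 0)  # last hourly record of the month
        windows.append((start.strftime(FMT), end.strftime(FMT)))
    return windows


def weekly_windows(first_day, n_weeks):
    """Return inclusive (start, end) timestamp strings for consecutive 7-day windows."""
    windows = []
    for k in range(n_weeks):
        start = first_day + timedelta(weeks=k)
        end = start + timedelta(days=6, hours=23)  # last hourly record of the week
        windows.append((start.strftime(FMT), end.strftime(FMT)))
    return windows


for start, end in monthly_windows(2011, 2, 4):
    print("month:", start, "->", end)

for start, end in weekly_windows(datetime(2011, 2, 7), 2):
    print("week: ", start, "->", end)
```

Keeping both views side by side makes it easy to run the same drift check at two granularities and compare the outcomes.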

You can always test a few assumptions and see how this changes the outcome.

Our second assumption is the drift detection threshold.

For example, we can evaluate drift for individual features using statistical tests, such as the Kolmogorov-Smirnov (K-S) test, and judge the results by their P-values. (There are other drift detection methods we can employ, such as distance metrics.)
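To make the idea concrete, here is a minimal, dependency-free sketch of a two-sample K-S check for one numerical feature. It is not how Evidently computes drift internally. It uses the asymptotic approximation for the P-value, which is only reasonable for fairly large samples, and the sample values are made up. In practice, a library routine such as `scipy.stats.ks_2samp` would usually be the better choice.

```python
import math


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(reference), sorted(current)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Step both CDFs past every observation equal to x (handles ties).
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided P-value based on the Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:
        return 1.0  # the series converges poorly here, and the P-value is ~1 anyway
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical readings of one feature in two periods.
january = [float(t) for t in range(0, 50)]
july = [float(t) for t in range(40, 90)]

d = ks_statistic(january, july)
p = ks_pvalue(d, len(january), len(july))
print("statistic:", d, "p-value:", p, "drift at 0.95 confidence:", p < 0.05)
```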

But to act on it or implement monitoring triggers, we often need some "yes" or "no" decision or an aggregate number for the whole dataset. To do this, we can implement some custom logic on top of the results of the statistical tests.

For example, we can look at the share of drifting features and only raise alarms if over 50% of those have a statistically significant change.

We can also manipulate the confidence level of statistical tests. The usual default is 0.95, but you might have your reasons to set it at 0.99 or 0.9, for example.
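As a rough sketch of this logic: a confidence level of 0.95 means a feature counts as drifted when its P-value is below 0.05, and the dataset counts as drifted when the share of such features exceeds the chosen threshold. The function names and the P-values below are hypothetical and only illustrate the rule.

```python
def drifted_features(p_values, confidence=0.95):
    """Names of the features whose P-value falls below 1 - confidence."""
    alpha = 1.0 - confidence
    return [name for name, p in p_values.items() if p < alpha]


def dataset_drift(p_values, confidence=0.95, threshold=0.5):
    """Return (drift_detected, share_of_drifting_features)."""
    share = len(drifted_features(p_values, confidence)) / len(p_values)
    return share > threshold, share


# Hypothetical P-values for one comparison window.
p_values = {"temp": 0.001, "atemp": 0.003, "humidity": 0.20,
            "windspeed": 0.04, "weather": 0.60}

print(dataset_drift(p_values))                   # default: 0.95 confidence, 50% threshold
print(dataset_drift(p_values, confidence=0.99))  # stricter test for each feature
```

With the same P-values, raising the confidence level from 0.95 to 0.99 drops "windspeed" from the drifting set, which is enough to flip the dataset-level decision.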

Here are a few tips:

You can focus on the key features only. You can assign different weights to features based on the importance or simply run the drift detection only for the top features. Not all drift is created equal. Often a variation in secondary features does not influence the key performance metrics.
If you have many weak features, you can look at the overall drift. In this case, you can raise the bar for declaring drift, both by using a higher confidence level and by requiring a larger share of drifting features.
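One way to express the first tip in code is to weight each feature by its importance when computing the drift share, or to restrict the check to a shortlist of top features. This is only a sketch: the function name, the weights, and the P-values are hypothetical, and the weights could come from any importance measure you trust.

```python
def weighted_drift_share(p_values, weights, confidence=0.95):
    """Share of total feature weight carried by the drifting features."""
    alpha = 1.0 - confidence
    total = sum(weights[name] for name in p_values)
    drifted = sum(weights[name] for name, p in p_values.items() if p < alpha)
    return drifted / total


# Hypothetical P-values and importance weights.
p_values = {"temp": 0.20, "humidity": 0.001, "windspeed": 0.002}
weights = {"temp": 7, "humidity": 2, "windspeed": 1}

# Two of three features drift, but they carry little weight.
print(weighted_drift_share(p_values, weights))

# Alternative: check only the top features.
top_features = ["temp"]
top_p_values = {name: p_values[name] for name in top_features}
print(weighted_drift_share(top_p_values, weights))
```

Here two thirds of the features drift, yet the weighted share is well below 50% because the most important feature is stable.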

In our example, we have several months of data.

Let's define our assumptions:

We use the first month as training data.
We perform drift comparisons on a month-by-month basis.
We will test drift both for individual features and for the whole dataset.
For dataset drift, we will set the confidence level for our statistical tests at 0.95. We will treat all features as equally important and consider that our dataset has drifted if more than 50% of the features drift.

How it looks in code

To see all the details, head to our example Jupyter notebook.

To implement this approach, we will use the following libraries:

JSON, pandas, and NumPy as standard libraries needed to work with data.
Plotly, to visualize our data drift.
Evidently, to calculate the drift using statistical tests.
MLflow, to log and record the results.

Once we import the libraries, we load the data. This is how it looks:

We define column mapping to specify the feature type. That is needed to perform the correct statistical tests using Evidently.

data_columns = {}
data_columns['numerical_features'] = ['weather', 'temp', 'atemp', 'humidity', 'windspeed']

We also define our reference data (the first month) and the following periods to evaluate for drift. We treat each month of data as an experiment.

reference_dates = ('2011-01-01 00:00:00','2011-01-28 23:00:00')

experiment_batches = [
    ('2011-02-01 00:00:00','2011-02-28 23:00:00'),
    ('2011-03-01 00:00:00','2011-03-31 23:00:00'),
    ('2011-04-01 00:00:00','2011-04-30 23:00:00'),
    ('2011-05-01 00:00:00','2011-05-31 23:00:00'),  
    ('2011-06-01 00:00:00','2011-06-30 23:00:00'), 
    ('2011-07-01 00:00:00','2011-07-31 23:00:00'), 
]
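These string pairs are meant to be used as slice bounds on the data. The sketch below shows how that works, assuming the DataFrame is indexed by a sorted pandas `DatetimeIndex`. The toy hourly data stands in for the real dataset. Note that label-based slicing with `.loc` includes both endpoints.

```python
import pandas as pd

# Toy hourly data standing in for the real dataset (values are arbitrary).
index = pd.date_range("2011-01-01 00:00:00", "2011-03-31 23:00:00",
                      freq=pd.Timedelta(hours=1))
raw_data = pd.DataFrame({"temp": range(len(index))}, index=index)

reference_dates = ("2011-01-01 00:00:00", "2011-01-28 23:00:00")
batch = ("2011-02-01 00:00:00", "2011-02-28 23:00:00")

# String bounds are parsed as timestamps, and both ends are included.
reference = raw_data.loc[reference_dates[0]:reference_dates[1]]
current = raw_data.loc[batch[0]:batch[1]]

print(len(reference), reference.index.min(), reference.index.max())
print(len(current), current.index.min(), current.index.max())
```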

Next, we implement two custom functions. They add logic on top of the results of the statistical tests provided by Evidently profiles.

The first one helps detect dataset drift. It returns a single TRUE or FALSE response on the overall drift. We can set the confidence level for statistical tests and choose the threshold for the share of drifting features. We can also get the share of drifting features by setting get_ratio to TRUE.

The second one helps detect feature drift. Evidently already returns P-values for the individual features in the JSON Profile output. We add this function to get a binary response for each feature: 1 for drift, 0 if not. We can set the confidence level for statistical tests. And we can still get the P-value for each feature by setting get_pvalues to TRUE.
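The full functions are in the notebook and are not reproduced here. As an illustration of the post-processing step only, the sketch below assumes the per-feature P-values have already been extracted from the profile output into a plain dict. The function name and the input format are hypothetical.

```python
def features_drift_from_pvalues(p_values, confidence=0.95, get_pvalues=False):
    """Return a list of (feature, value) pairs.

    value is the raw P-value if get_pvalues is True,
    otherwise 1 for drift and 0 for no drift.
    """
    alpha = 1.0 - confidence
    results = []
    for feature, p in p_values.items():
        value = p if get_pvalues else int(p < alpha)
        results.append((feature, value))
    return results


# Hypothetical P-values for one monthly comparison.
p_values = {"temp": 0.001, "humidity": 0.30, "windspeed": 0.04}

flags = features_drift_from_pvalues(p_values)
raw = features_drift_from_pvalues(p_values, get_pvalues=True)

print(flags)
print([x[1] for x in flags])  # one row of a drift table: values only
print(raw)
```

Returning pairs keeps the feature name attached to each result, while a list comprehension over the second element gives a row that can go straight into a table of historical results.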

You can build your custom logic following this example.

Visualizing feature drift

Let's start with the feature drift.

We call our function to evaluate the drift for individual features.

features_historical_drift = []

# Compare each monthly batch against the reference month.
for date in experiment_batches:
    drifts = detect_features_drift(raw_data.loc[reference_dates[0]:reference_dates[1]],
                           raw_data.loc[date[0]:date[1]],
                           column_mapping=data_columns,
                           confidence=0.95,
                           threshold=0.9)

    # Keep the second element of each result: the drift value for a feature.
    features_historical_drift.append([x[1] for x in drifts])

# One row per month, one column per feature.
features_historical_drift_frame = pd.DataFrame(features_historical_drift, columns = data_columns['numerical_features'])

We then use Plotly to visualize the results on a heatmap. We color the periods where we detect drift in red.

This is what we get:

Ouch! Not at all stable.

The truth is, we took a use case with high seasonality. Our data is literally about the weather. Temperature, humidity, wind speed patterns change a lot month by month.

This gives us a very clear signal that we need to factor in the most recent data and update the model often.

If we want to look at it in a more granular fashion, we can plot our P-values. We set get_pvalues to TRUE and then generate a new plot.

In case of more subtle changes, it can be helpful to see the P-values as gradients and not just the boolean TRUE/FALSE for drift.

Visualizing dataset drift

Let's now call a function to calculate the dataset drift.

Knowing how volatile the data is, we set our threshold high. We only call it DRIFT if 90% of features have a statistical change in distributions.

dataset_historical_drift = []

# One TRUE/FALSE result per monthly batch.
for date in experiment_batches:
    dataset_historical_drift.append(detect_dataset_drift(raw_data.loc[reference_dates[0]:reference_dates[1]],
                           raw_data.loc[date[0]:date[1]],
                           column_mapping=data_columns,
                           confidence=0.95,
                           threshold=0.9))  # call it drift only if 90% of features drift

That is how the results would look on a month-by-month basis.

You can also set a more granular view and make a plot using the share of drifting features within each month.

Depending on the use case, you can choose a different way to display this data, such as a bar plot.

Logging the drifts

Visualizations are helpful when our goal is to explore or share information.

But we might also want to simply log the numeric results of the feature and dataset drift tests elsewhere. For example, we want to log a drift value as a result of an experiment.

Or, we want to track it for a model in production.

As soon as we define our drift conditions, we can monitor if they are met. We want to get a boolean response on whether drift has occurred in production, or trigger alerts based on a certain threshold.

To log the drift results, we can use MLflow tracking. MLflow is a popular library for managing the ML lifecycle. In this case, we use Evidently and our custom function to generate the output (the dataset drift metric) and then log it with MLflow.
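The logging step could look roughly like the sketch below. It relies only on the MLflow tracking functions `set_experiment`, `start_run`, `log_param`, and `log_metric`. The helper name `log_drift_runs`, the experiment name, and the parameter and metric names are hypothetical. The tracking object is passed in as an argument so that the helper can also be exercised without an MLflow installation.

```python
def log_drift_runs(tracker, batches, drift_shares,
                   experiment_name="Historical dataset drift"):
    """Create one run per period and record its window and drift share.

    tracker is expected to be the mlflow module, or any object exposing the
    same set_experiment / start_run / log_param / log_metric functions.
    """
    tracker.set_experiment(experiment_name)
    for (begin, end), share in zip(batches, drift_shares):
        with tracker.start_run():
            # Parameters describe the comparison window of this run.
            tracker.log_param("begin", begin)
            tracker.log_param("end", end)
            # The metric is the share of drifting features in that window.
            tracker.log_metric("dataset_drift", share)
```

With MLflow installed, you would run `import mlflow` and then call `log_drift_runs(mlflow, experiment_batches, shares)`, where `shares` is the list of drift ratios computed for each batch.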

You can follow it in our example in the Jupyter notebook. It is an easy one!

Here is how the results would look in the MLflow interface. We have the dataset drift metric logged for each run.

Or here is an expanded view:

Adapting to your use case

You can adapt this example to your use case.

For instance, you can modify the drift detection logic to better account for the use case at hand. You can test several combinations and choose the appropriate windows and drift conditions.
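One simple way to compare combinations is to replay them over historical P-values and count how many periods would have raised an alert under each setting. The sketch below uses made-up P-values, and the helper name `count_alerts` is hypothetical.

```python
from itertools import product

# Hypothetical per-feature P-values for three historical periods.
history = {
    "2011-02": {"temp": 0.30, "humidity": 0.04, "windspeed": 0.60},
    "2011-03": {"temp": 0.02, "humidity": 0.008, "windspeed": 0.20},
    "2011-04": {"temp": 0.001, "humidity": 0.002, "windspeed": 0.03},
}


def count_alerts(history, confidence, threshold):
    """Number of periods where the share of drifting features exceeds the threshold."""
    alpha = 1.0 - confidence
    alerts = 0
    for p_values in history.values():
        share = sum(p < alpha for p in p_values.values()) / len(p_values)
        alerts += int(share > threshold)
    return alerts


for confidence, threshold in product([0.9, 0.95, 0.99], [0.5, 0.9]):
    print(f"confidence={confidence}, threshold={threshold}: "
          f"{count_alerts(history, confidence, threshold)} alert(s)")
```

A table like this makes the trade-off visible: looser settings fire in most periods, while stricter ones react only to the largest shifts.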

You can also use other Evidently Reports. For example, you can generate a performance report to log model quality metrics in addition to the data drift. You can then follow similar steps to visualize historical model performance using Plotly to analyze the results of your model decay tests.

Of course, this approach also works in production, especially when you have batch model runs. For example, you can run drift checks on your data and predictions on a regular basis and log the results with MLflow or simply write them to a database.

Have you been looking at your data drift? Let us know!