Batch inference and ML monitoring with Evidently and Prefect

Last updated: April 9, 2025
Published: October 25, 2023

In this tutorial, you will learn how to run batch ML model inference and deploy a model monitoring dashboard for production ML models.

It is a simple solution that uses two open-source tools: Evidently and Prefect. The tutorial contains an end-to-end code blueprint with all the steps: from model training to batch model serving and production ML monitoring.

You can copy the repository and adapt this reference architecture to your use case.

Code example: if you prefer to head straight to the code, open this example folder on GitHub.

⚠️ Disclaimer:
This example uses the Evidently API as available in version 0.6.7 or lower. Please ensure you are using the correct version when running this example. For updated and new examples, visit our documentation.

Background

Batch inference involves making predictions on a group of observations at once. You typically run batch prediction jobs on a schedule, such as hourly or daily. The predictions get stored in a database and are then accessible to consumers.

Batch inference is a good option when predictions are needed at set intervals and not necessarily in real time: such as periodic demand forecasting.

Production ML monitoring. Once you deploy an ML model in production, you must track its performance. This typically means tracking the ongoing model quality, like accuracy or mean error. However, it is not always possible due to delayed feedback. For instance, when forecasting next week's sales, you must wait until the end of that period to calculate the error.
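To make the delayed-feedback point concrete, here is a minimal sketch in plain Python. The period labels, the numbers, and the function name mean_absolute_error_for_labeled are illustrative assumptions, not part of the tutorial code. The error is computed only over the predictions whose actual values have already arrived.

```python
from typing import Dict, Optional


def mean_absolute_error_for_labeled(
    predictions: Dict[str, float],
    actuals: Dict[str, float],
) -> Optional[float]:
    """Return the MAE over periods that already have ground truth, else None."""
    labeled = [period for period in predictions if period in actuals]
    if not labeled:
        return None  # no feedback yet, so model quality cannot be measured
    total_error = sum(abs(predictions[p] - actuals[p]) for p in labeled)
    return total_error / len(labeled)


# Hypothetical weekly sales forecasts; week-3 actuals are not available yet.
predictions = {"week-1": 120.0, "week-2": 135.0, "week-3": 150.0}
actuals = {"week-1": 110.0, "week-2": 140.0}

mae = mean_absolute_error_for_labeled(predictions, actuals)
print(mae)
```

Until the actuals for the latest period arrive, that period contributes nothing to the quality metric, which is why earlier signals are useful.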

Because of this, you might also need early monitoring:

Data quality monitoring and validation help ensure that the model gets reliable and consistent data. Otherwise, missing values or corrupted rows may lead to low-quality forecasts.
Data and prediction drift checks help verify that the model operates in a familiar environment without significant changes or unexpected updates, such as the appearance of new products that were not in the training data.
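As a rough illustration of these two kinds of early checks, the sketch below uses only the Python standard library. It is a deliberately simplified stand-in and not how Evidently computes data quality or drift (Evidently applies statistical tests and distance metrics). The function names and the sample rows are hypothetical.

```python
from typing import Dict, List, Optional, Set

Row = Dict[str, Optional[str]]


def share_of_missing(rows: List[Row], column: str) -> float:
    """Fraction of rows where the column is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def unseen_categories(reference: List[Row], current: List[Row], column: str) -> Set[str]:
    """Values present in the current batch but absent from the reference data."""
    known = {row[column] for row in reference if row.get(column) is not None}
    seen = {row[column] for row in current if row.get(column) is not None}
    return seen - known


# Hypothetical data: product "C" never appeared in the reference data.
reference = [{"product": "A"}, {"product": "B"}, {"product": "A"}]
current = [{"product": "A"}, {"product": None}, {"product": "C"}, {"product": "B"}]

missing_share = share_of_missing(current, "product")
new_products = unseen_categories(reference, current, "product")
print(missing_share, new_products)
```

Both checks need only the input data, so they can run before any ground truth is available.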

For batch model inference, you can implement ML monitoring as scheduled batch jobs. In this tutorial, you will learn how to run such data validation, drift detection and ML model quality checks in a production pipeline.

Tutorial scope

This tutorial shows how to run batch ML inference and monitoring jobs and deploy a dashboard to track the model performance over time. By the end of this tutorial, you will know how to implement a batch ML monitoring architecture using:

Prefect to orchestrate the monitoring pipelines.
Evidently to calculate monitoring metrics and display them on a live dashboard.

You will be able to run the complete tutorial locally.

Batch ML monitoring with Evidently and Prefect

Here is a brief overview of the architecture. You will:

run ML scoring and monitoring jobs using Prefect,
generate data and model quality metrics using Evidently Reports,
store the calculated metrics as JSON snapshots in the local directory,
visualize metrics over time on the Evidently dashboard accessible as a web UI.

You can later expand and customize this reference architecture for your use case.

Note: you can also use the same batch monitoring architecture even if you deploy an ML model as a service. You can run a set of monitoring jobs over the prediction logs.

Prerequisites

We expect that you:

Have some experience working with batch ML models.
Went through the Evidently Get Started Tutorial and can generate visual and JSON Reports with Metrics.
Launched a demo dashboard to familiarize yourself with the Evidently Monitoring UI.

To follow this tutorial, you'll need the following tools installed on your local machine:

Docker and Docker Compose plugin
Python version 3.8 or above
Git
Prefect
Evidently

Note: we tested this example on macOS/Linux.

👩‍💻 Installation and inference

This section explains the instructions in the example README. Check the original README file for more technical details and notes.

1. Fork / Clone the repository

First, clone the Evidently GitHub repository containing the example code:

git clone git@github.com:evidentlyai/evidently.git
cd evidently/examples/integrations/prefect_evidently_ui

2. Launch Prefect

Launch the Prefect application by using Docker Compose. This app is responsible for running monitoring pipelines that calculate the monitoring reports.

docker compose up prefect -d

3. Train the model and prepare the “reference” dataset

This example is based on the NYC Taxi dataset. The data and the model training are out of the scope of this tutorial. Therefore, we prepared a few scripts to download data, preprocess, and train a machine-learning model.

# Enter the Prefect container
docker exec -ti prefect /bin/bash

# Train the model and prepare the "reference" data
python src/pipelines/load_data.py
python src/pipelines/process_data.py
python src/pipelines/train.py
python src/pipelines/prepare_reference_data.py

These scripts train a simple machine learning model that solves a regression problem: predicting the duration of the trip (in minutes) based on features like distance, fare price and number of passengers. We create a new forecast each hour and assume the ground truth is available with an hourly delay.

The last script also prepares a reference dataset: a representative dataset that shows expected feature behavior. It will serve as a baseline for data drift detection.

4. Run inference and monitoring pipelines

Execute the scheduler script to run the Prefect flows for inference and monitoring.

python src/pipelines/scheduler.py

The scheduler.py script runs the following pipelines:

monitor_data.py pipeline evaluates the quality of the input data and assesses input data drift by comparing it with a reference dataset.
predict.py pipeline performs inference using the trained model on the input data.
monitor_model.py pipeline requires ground truth data to assess the quality of the model's predictions. We assume that these labels come with a delay and are available for the previous period. Therefore, this pipeline runs for the prior period.

For simplicity, the scheduler.py script uses the following hardcoded parameters to schedule other pipelines.

START_DATE_TIME = '2021-02-01 02:00:00'
END_DATE_TIME = '2021-02-01 10:00:00'
BATCH_INTERVAL = 60

Fixing the parameters ensures reproducibility: when you run the tutorial, you should get the same visuals. We will further discuss how to customize this example.
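To show how three such parameters can translate into a list of batch runs, here is a small sketch using only the standard library. It is not the code from scheduler.py. The helper name batch_timestamps is hypothetical, and treating END_DATE_TIME as inclusive is an assumption.

```python
from datetime import datetime, timedelta
from typing import List

START_DATE_TIME = "2021-02-01 02:00:00"
END_DATE_TIME = "2021-02-01 10:00:00"
BATCH_INTERVAL = 60  # minutes


def batch_timestamps(start: str, end: str, interval_minutes: int) -> List[datetime]:
    """List batch timestamps from start to end (end treated as inclusive)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    current = datetime.strptime(start, fmt)
    last = datetime.strptime(end, fmt)
    step = timedelta(minutes=interval_minutes)
    timestamps = []
    while current <= last:
        timestamps.append(current)
        current += step
    return timestamps


timestamps = batch_timestamps(START_DATE_TIME, END_DATE_TIME, BATCH_INTERVAL)
for ts in timestamps:
    print(ts.isoformat(sep=" "))
```

A scheduler can then loop over these timestamps and trigger the pipelines once per batch.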

5. Explore the scheduled pipelines in the Prefect UI

Access the Prefect UI by navigating to http://localhost:4200 in a web browser. The Prefect UI shows the executed pipelines and their current status.

🚀 View the monitoring dashboard

You executed the batch model inference and monitoring pipelines in the previous step. This means you have all the relevant information about the model and data quality. Now, let’s take a look at the monitoring dashboard!

1. Launch the Evidently UI

Launch the Evidently application by using Docker Compose. This app is responsible for visualizing the metrics computed from the monitoring jobs.

docker compose up evidently-ui -d

2. Explore the model performance

Open the Evidently monitoring dashboards by visiting http://localhost:8001 in a web browser. You can now access the dashboards that show the model and data quality.

There are four different dashboards designed for this example.

Data Quality
Predictions Drift
Model Performance
Target Drift

Note: in this example, each dashboard exists as a separate Project. We did it for demonstration purposes. In practice, you can log all the data related to one model to a single project and visualize it on the same dashboard.

You may open each dashboard to get an overview of the metric change over time and access the underlying Reports that sum up the model performance for a specific period.

Here is an example of the Model Performance dashboard:

Here is an example of the Target Drift dashboard that shows the changes in the behavior of the model target.

Note: this example includes several dashboard types to demonstrate the tool's capabilities. This metric selection is not fixed: you can customize the metrics and visualizations for your specific use case.

🧑‍🎨 Design the ML monitoring

Now, let’s look at the code in more detail to understand how the backend and frontend of this monitoring setup work together, and how you can customize it to your needs.

Step 1. Design the ML monitoring jobs

There are three Prefect pipelines to monitor input data quality, model predictions, and model performance.

Prefect executes three pipelines at different time intervals (T-1, T, and T+1). You make new predictions for each period, run input data checks, and monitor model performance.

The pipelines perform the following tasks:

Predict (t=0): At time T, this task processes the current data and generates predictions for the next period T+1, using the trained machine learning model.
Monitor Data (t=0): This task profiles and compares the current data to the reference. It checks for data drift or quality issues that might impact the model quality.
Monitor Model (t=1): This task evaluates the quality of the predictions made by the model at time T by comparing them to the actual values. It checks for model performance degradation or target drift that might require retraining or updating the model. Since the actuals usually arrive with a delay, you will run the model monitoring pipelines at the next time interval (T+1).
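The timing of the three tasks can be sketched in a few lines of plain Python. This is only an illustration of the offsets described above: the function jobs_for_batch is hypothetical and does not call Prefect. It lists which period each job works on when a batch is processed at time ts.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def jobs_for_batch(ts: datetime, interval_minutes: int) -> List[Tuple[str, datetime]]:
    """Return (job name, period it works on) pairs for the batch at time ts."""
    previous = ts - timedelta(minutes=interval_minutes)
    return [
        ("monitor_data", ts),         # input data checks for the current batch
        ("predict", ts),              # scoring of the current batch
        ("monitor_model", previous),  # quality check for the prior batch, once labels exist
    ]


plan = jobs_for_batch(datetime(2021, 2, 1, 3, 0, 0), interval_minutes=60)
for job, period in plan:
    print(job, period.isoformat(sep=" "))
```

The model monitoring job always lags one interval behind, which mirrors the assumption that actuals arrive with a delay of one period.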

Inside each monitoring job, you use the Evidently Python library to calculate metrics by generating a JSON Report with the selected metrics. The resulting JSON, which contains a quality summary for a particular period, is called a snapshot.

Let’s explore this logging part in more detail!

Step 2. Compute and log the metrics

To illustrate what happens inside, let’s look at the data quality monitoring pipeline that tracks the quality and stability of the input data.

The monitor_data Prefect flow orchestrates the data monitoring process. It takes a timestamp ts and an interval (in minutes) as input arguments. On every flow run, it calculates the JSON snapshots with the data quality metrics using Evidently and logs them to a specific directory.
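One way to picture how a timestamp and an interval define a batch is the sketch below. It is an assumption for illustration, not the logic of the actual flow: the window is taken to be [ts - interval, ts), and the column name event_time and the helper select_batch are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def select_batch(rows: List[Dict], ts: datetime, interval_minutes: int) -> List[Dict]:
    """Keep rows whose 'event_time' falls in the window [ts - interval, ts)."""
    start = ts - timedelta(minutes=interval_minutes)
    return [row for row in rows if start <= row["event_time"] < ts]


# Hypothetical trip records with an event timestamp and one feature.
rows = [
    {"event_time": datetime(2021, 2, 1, 1, 30), "distance": 2.1},
    {"event_time": datetime(2021, 2, 1, 2, 15), "distance": 0.8},
    {"event_time": datetime(2021, 2, 1, 2, 59), "distance": 5.4},
    {"event_time": datetime(2021, 2, 1, 3, 0), "distance": 1.2},
]

batch = select_batch(rows, ts=datetime(2021, 2, 1, 3, 0), interval_minutes=60)
print(len(batch))
```

The selected rows play the role of the current data that is compared against the reference dataset.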

Here is how the logging backend works:

To be able to later display the metrics in the Evidently UI, you need to log related JSON snapshots to the correct directory. It will serve as a data source for the monitoring service.

You can create a Project inside an Evidently Workspace to easily group related snapshots. On every monitoring run, you generate a new Evidently Report and associate it with the same Project ID inside the Workspace. This way, you will automatically save the Report to the directory corresponding to this Project in the JSON snapshot format. Later, you can pull any metrics stored in the snapshots and visualize them over time.

Each Project has its monitoring dashboard in the Evidently UI. Typically, you can create one Project per ML model. In this scenario, you'd save all sorts of metrics (data quality, drift, model quality, etc.) together and visualize them on different panels of the same monitoring dashboard. However, it is entirely up to you. For example, in this tutorial, we decided to create separate Projects for each type of monitoring and get different dashboards.

Want to understand the Projects and Workspaces better? Check out the dedicated Evidently documentation section on Monitoring.

To sum up, to create a snapshot that will serve as a data source for the monitoring dashboard, you need to:

Create (on the first run) or “connect” to an existing Workspace (on the following runs)
Generate an Evidently Report
Add the Report to a specific Project
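The idea behind these steps (a directory that acts as the data source, with one JSON file per monitoring run) can be mimicked with the standard library alone. The sketch below is a conceptual stand-in and does not reproduce Evidently's actual on-disk layout or snapshot schema. The directory names, the snapshot fields, and the helpers log_snapshot and load_snapshots are hypothetical.

```python
import json
import tempfile
import uuid
from pathlib import Path
from typing import Dict, List


def log_snapshot(workspace: Path, project: str, snapshot: Dict) -> Path:
    """Write one snapshot as a JSON file under the project's directory."""
    project_dir = workspace / project / "snapshots"
    project_dir.mkdir(parents=True, exist_ok=True)  # created on first run, reused later
    path = project_dir / f"{uuid.uuid4()}.json"
    path.write_text(json.dumps(snapshot))
    return path


def load_snapshots(workspace: Path, project: str) -> List[Dict]:
    """Read every snapshot of a project, ordered by its timestamp."""
    project_dir = workspace / project / "snapshots"
    snapshots = [json.loads(p.read_text()) for p in project_dir.glob("*.json")]
    return sorted(snapshots, key=lambda s: s["timestamp"])


workspace = Path(tempfile.mkdtemp())  # throwaway directory for the demo
log_snapshot(workspace, "data-quality", {"timestamp": "2021-02-01T03:00:00", "drift_share": 0.2})
log_snapshot(workspace, "data-quality", {"timestamp": "2021-02-01T02:00:00", "drift_share": 0.1})

history = load_snapshots(workspace, "data-quality")
print([s["drift_share"] for s in history])
```

Because every run only appends a new file, the history of a metric can be rebuilt at any time by reading the files back in timestamp order.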

The code snippet below from the src/pipelines/monitor_data.py shows how to generate a data quality report using Evidently and log it to the “Data Quality” Project.

@flow(flow_run_name="monitor-data-on-{ts}")
def monitor_data(
    ts: pendulum.DateTime,
    interval: int = 60) -> None:
    """Build and save data validation reports."""

    ...

    # Get or Create Evidently Dashboard Workspace
    ws = Workspace.create("evidently")

    # Data Quality (drift)
    data_quality_report = generate_data_quality_report(
        current_data=current_data,
        reference_data=reference_data,
        num_features=num_features,
        cat_features=cat_features,
        prediction_col=prediction_col,
        target_col=target_col,
        timestamp=ts.timestamp()
    )

    # Add reports (snapshots) to the Project Monitoring Dashboard
    project_dq = get_evidently_project(ws, "Data Quality")
    ws.add_report(project_dq.id, data_quality_report)

A workspace may have multiple Projects, each with its own monitoring dashboard. Every Project directory contains snapshots and metadata.

Evidently directory for Evidently dashboards

Now, let’s explore the generate_data_quality_report() function.

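from typing import List, Text

import pandas as pd
import pendulum
from evidently import ColumnMapping
from evidently.metrics import DatasetDriftMetric, DatasetMissingValuesMetric
from evidently.report import Report
from prefect import task


@task
def generate_data_quality_report(
    current_data: pd.DataFrame,
    reference_data: pd.DataFrame,
    num_features: List[Text],
    cat_features: List[Text],
    prediction_col: Text,
    target_col: Text,
    timestamp: float
) -> Report:
    """Compute drift and missing-value metrics for one batch of data."""

    # Prepare column mapping
    # (target_col is accepted to match the call in monitor_data;
    # this excerpt does not map it)
    column_mapping = ColumnMapping()
    column_mapping.numerical_features = num_features
    column_mapping.categorical_features = cat_features
    column_mapping.prediction = prediction_col

    # Build Data Quality report
    data_quality_report = Report(
        metrics=[
            DatasetDriftMetric(),
            DatasetMissingValuesMetric(),
        ],
        timestamp=pendulum.from_timestamp(timestamp)
    )
    data_quality_report.run(
        reference_data=reference_data,
        current_data=current_data,
        column_mapping=column_mapping
    )

    # Return the computed Report so the flow can log it as a snapshot
    return data_quality_report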

This task computes two metrics to evaluate the missing values and share of drifted columns. It takes the current data, reference data, numerical features, categorical features, and the prediction column as input arguments and computes the Report.

Here is a visual representation of a single Report logged as a snapshot during the data monitoring flow:

Want a different set of Metrics? In this example, we picked two metrics. However, you can select other metrics or presets. For example, pick a DataDriftPreset to log individual feature drift scores. You can also log Test Suites instead of Reports and capture the pass or fail results for any checks executed in the monitoring pipeline. To refresh your knowledge of both, see the Get Started tutorial for Reports and Test Suites.

After generating the Report for each batch of data, you save it as a JSON snapshot to the Data Quality Project, as shown above.

# Add reports (snapshots) to the Project Monitoring Dashboard
project_dq = get_evidently_project(ws, "Data Quality")
ws.add_report(project_dq.id, data_quality_report)

This way, all the snapshots generated on every run are collected together. You can view the individual Reports for each batch in the UI.

The Model Monitoring pipeline follows the same logic but computes a different set of Metrics related to model quality.

Step 3. Design the Dashboard

After you log the snapshots, you must choose which metrics to display on the Dashboard. Each Dashboard might contain multiple monitoring panels: you can select which metrics to display and how.

The code snippet below from src/utils/evidently_monitoring.py demonstrates adding counters and line plots to a dashboard. To add each monitoring panel, you must choose the panel type and specify the metric values to pull from the saved snapshots:

def add_data_quality_dashboard(project: Project) -> Project:
    
    # Title (Text) Panel
    project.dashboard.add_panel(
        DashboardPanelCounter(
            filter=ReportFilter(metadata_values={}, tag_values=[]),
            agg=CounterAgg.NONE,
            title='Data Drift',
        )
    )
    
    # Counter Panel: Share of Drifted Features
    project.dashboard.add_panel(
        DashboardPanelCounter(
            title="Share of Drifted Features",
            filter=ReportFilter(metadata_values={}, tag_values=[]),
            value=PanelValue(
                metric_id="DatasetDriftMetric",
                field_path="share_of_drifted_columns",
                legend="share",
            ),
            text="share",
            agg=CounterAgg.LAST,
            size=1,
        )
    )

    # Counter Panel: Number of Columns
    project.dashboard.add_panel(
        DashboardPanelCounter(
            title="Number of Columns",
            filter=ReportFilter(metadata_values={}, tag_values=[]),
            value=PanelValue(
                metric_id="DatasetDriftMetric",
                field_path="number_of_columns",
                legend="share",
            ),
            text="share",
            agg=CounterAgg.LAST,
            size=1,
        )
    )

    # Line Plot Panel: Dataset Quality
    project.dashboard.add_panel(
        DashboardPanelPlot(
            title="Dataset Quality",
            filter=ReportFilter(metadata_values={}, tag_values=[]),
            values=[
                PanelValue(
                    metric_id="DatasetDriftMetric",
                    field_path="share_of_drifted_columns",
                    legend="Drift Share",
                ),
                PanelValue(
                    metric_id="DatasetMissingValuesMetric",
                    field_path=DatasetMissingValuesMetric.fields.current.share_of_missing_values,
                    legend="Missing Values Share",
                ),
            ],
            plot_type=PlotType.LINE,
        )
    )
    
    return project
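Conceptually, each panel pulls one value per snapshot, identified by a metric and a field path, and either plots the whole series or reduces it to a single number. The plain-Python sketch below illustrates that idea only. The snapshot structure is a hypothetical simplification (real Evidently snapshots have a different schema), and the helpers get_field, series, and last_value are not part of the Evidently API.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical, simplified snapshots: a timestamp plus one result dict per metric.
snapshots = [
    {
        "timestamp": "2021-02-01T03:00:00",
        "metrics": {
            "DatasetDriftMetric": {"share_of_drifted_columns": 0.3, "number_of_columns": 10},
            "DatasetMissingValuesMetric": {"current": {"share_of_missing_values": 0.05}},
        },
    },
    {
        "timestamp": "2021-02-01T02:00:00",
        "metrics": {
            "DatasetDriftMetric": {"share_of_drifted_columns": 0.1, "number_of_columns": 10},
            "DatasetMissingValuesMetric": {"current": {"share_of_missing_values": 0.02}},
        },
    },
]


def get_field(result: Dict[str, Any], field_path: str) -> Any:
    """Walk a dotted path such as 'current.share_of_missing_values'."""
    value: Any = result
    for key in field_path.split("."):
        value = value[key]
    return value


def series(snaps: List[Dict], metric_id: str, field_path: str) -> List[Tuple[str, Any]]:
    """Line plot data: one (timestamp, value) point per snapshot, in time order."""
    ordered = sorted(snaps, key=lambda s: s["timestamp"])
    return [(s["timestamp"], get_field(s["metrics"][metric_id], field_path)) for s in ordered]


def last_value(snaps: List[Dict], metric_id: str, field_path: str) -> Any:
    """Counter data with 'last' aggregation: the value from the newest snapshot."""
    points = series(snaps, metric_id, field_path)
    return points[-1][1] if points else None


drift_series = series(snapshots, "DatasetDriftMetric", "share_of_drifted_columns")
latest_missing = last_value(snapshots, "DatasetMissingValuesMetric", "current.share_of_missing_values")
print(drift_series, latest_missing)
```

In this picture, a counter panel with the last aggregation corresponds to last_value, and a line plot panel corresponds to series.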

Step 4. Update the Dashboard

Every time you add or update a dashboard, you must update the Project in the Evidently Workspace.

The code snippet below from the src/utils/evidently_monitoring.py demonstrates how to update the Data Quality Project:

def build_dashboards(project: Project):

    ws = Workspace.create("evidently")
    
    # Data Quality Dashboard
    project_dq = get_evidently_project(ws, "Data Quality")
    project_dq.dashboard.panels = []
    project_dq = add_data_quality_dashboard(project_dq)
    project_dq.save()

Customize it to your project

This example showed an end-to-end ML monitoring process implemented as a set of batch jobs. You logged the resulting metrics as JSON snapshots and created an ML monitoring dashboard to show values over time.

You can take this example as an inspiration and adapt it for your ML model by following these general guidelines:

Design the ML monitoring jobs. Define the frequency and type of model monitoring jobs. They will depend on the model serving cadence, ground truth delay, and your overall needs. For example, you might have two different steps at the model serving: one to validate the input data (before you generate the predictions) and another to evaluate the prediction drift after you run the scoring pipeline.
Choose the metrics to log. You can select from multiple Evidently Presets and individual Metrics. It often makes sense to log more supporting information even if you do not want to visualize each metric on the monitoring panel. It will be helpful during the model debugging process. You’ll be able to access each logged Evidently Report and the corresponding visualizations.
Design the monitoring dashboards. You may group metrics and plots in multiple dashboards or have a single dashboard to view all of them in one place.
Store and version snapshots. You may store them on your local machine or cloud storage. To reliably keep them for the long term, you may use separate tools to manage and version snapshots.

By following these guidelines, you can adapt this example to your project, so that the job cadence, logged metrics, dashboards, and snapshot storage match how your model is served.

Summing up

This tutorial demonstrated running batch ML monitoring jobs and designing an ML monitoring dashboard.

You built a solution that consists of three consecutive Prefect pipelines for data quality checks, model predictions, and model quality monitoring.
You learned how to use Evidently to calculate the monitoring Reports and store the monitoring metrics as JSON snapshots.
You learned how to visualize them in the Evidently UI and design different monitoring panels.

You can further work with this example: