In this code tutorial, you will learn how to run batch ML model inference, collect data and ML model quality monitoring metrics, and visualize them on a live dashboard.
This is a blueprint for an end-to-end batch ML monitoring workflow using open-source tools. You can copy the repository and adapt this reference architecture to your use case.
Code example: if you prefer to head straight to the code, open this example folder on GitHub.
When an ML model is in production, you need to keep tabs on the ML-related quality metrics in addition to traditional software service monitoring. This typically includes:
- Data quality and integrity metrics, such as share of nulls, schema compliance, etc.
- Distribution of the model outputs (prediction drift) and model inputs (data drift) as a proxy for model quality.
- ML model quality metrics (accuracy, mean error, etc.). You often compute them with a delay after receiving the ground truth labels or actuals.
[fs-toc-omit]ML monitoring architecture
In this tutorial, we introduce a possible implementation architecture for ML monitoring as a set of monitoring jobs.
You can adapt this approach to your batch ML pipelines. You can also use this approach to monitor online ML services: when you do not need to compute metrics every second, but instead can read freshly logged data from the database, say, every minute or once per hour.
In this tutorial, you will learn how to build a batch ML monitoring workflow using Evidently, Prefect, PostgreSQL, and Grafana.
The tutorial includes all the necessary steps to imitate the batch model inference and subsequent data joins for model quality evaluation.
By the end of this tutorial, you will learn how to implement an ML monitoring architecture using:
- Evidently to perform data quality, data drift and model quality checks.
- Prefect to orchestrate the checks and write metrics to the database.
- PostgreSQL database to store the defined monitoring metrics.
- Grafana as a dashboard to visualize the metrics in time.
You will run your monitoring solution in a Docker environment for easy deployment and scalability.
We expect that you:
- Have some experience working with batch ML models
- Went through the Evidently Get Started Tutorial and can generate visual and JSON reports with metrics.
You also need the following tools installed on your local machine:
- Docker and Docker Compose plugin
- Python version 3.9 or above
First, install the pre-built example. Check the README file for more detailed instructions.
[fs-toc-omit]1. Fork or clone the repository
Clone the Evidently GitHub repository with the example code. This repository provides the necessary files and scripts to set up the integration between Evidently, Prefect, PostgreSQL, and Grafana.
[fs-toc-omit]2. Create a virtual environment
Create a Python virtual environment to isolate the dependencies for this project. Then, install the required Python libraries from the requirements.txt file:
[fs-toc-omit]3. Launch the monitoring cluster
Set up the monitoring infrastructure using Docker Compose. It will launch a cluster with the required services such as PostgreSQL and Grafana. This cluster is responsible for storing the monitoring metrics and visualizing them.
[fs-toc-omit]4. Create tables for monitoring database
To store the ML monitoring metrics in the PostgreSQL database, you must create the necessary tables. Run the Python script below to set up the database structure for storing metrics generated by Evidently.
[fs-toc-omit]5. Download the data and train model
This example is based on the NYC Taxi dataset. The data and the model training are out of scope for this tutorial. We prepared a few scripts to download the data, pre-process it, and train a simple machine learning model.
[fs-toc-omit]6. Prepare “reference” data for monitoring
To generate monitoring reports with Evidently, we usually need two datasets:
- The reference dataset. This can be data from model validation, earlier production use, or a manually curated “golden set.” It serves as a baseline for comparison.
- The current dataset. It includes the recent production data you compare to reference.
In this example, we take data from January 2021 as a reference. We use this data as a baseline and generate monitoring reports for consecutive periods.
Do you always need the reference dataset? It depends. A reference dataset is required to compute data distribution drift. You can also choose to work with a reference dataset to quickly generate expectations about the data, such as data schema, feature min-max ranges, baseline model quality, etc. This is useful if you want to run Test Suites with auto-generated parameters. However, you can calculate most metrics (e.g., the share of nulls, feature min/max/mean, model quality, etc.) without the reference dataset.
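The reference/current split can be sketched with pandas. This is an illustrative example, not code from the repository: the DataFrame and column names here are assumptions.

```python
import pandas as pd

# Illustrative dataset with a timestamp column (names are assumptions)
df = pd.DataFrame({
    "pickup_time": pd.date_range("2021-01-01", periods=90, freq="D"),
    "trip_distance": range(90),
})

# Fix January 2021 as the reference period (the baseline for comparison)
reference = df[(df["pickup_time"] >= "2021-01-01") & (df["pickup_time"] < "2021-02-01")]

# Each consecutive period becomes the "current" dataset for a monitoring run
current = df[(df["pickup_time"] >= "2021-02-01") & (df["pickup_time"] < "2021-03-01")]

print(len(reference), len(current))
```

In the tutorial, each monitoring run compares one such "current" window against the fixed January baseline.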
Run ML monitoring example
After completing the installation, you have a working Evidently integration with Prefect, PostgreSQL, and Grafana.
Follow the steps below to launch the example. You will run a set of inference and ML monitoring pipelines to generate the metrics, store them in the database and explore them on a Grafana dashboard.
[fs-toc-omit]1. Run inference and monitoring pipelines
Set the Prefect API URL environment variable to enable communication between the Prefect server and the Python scripts. Then, execute the scheduler script to automatically run the Prefect flows for inference and monitoring:
The scheduler.py script runs the following pipelines:
- predict.py performs inference by applying the trained model to the input data.
- monitor_data.py monitors the input data quality by comparing it to the reference data.
- monitor_model.py evaluates the quality of the model predictions. This pipeline requires ground truth data. We assume the labels arrive with a delay and are available for the previous period. Therefore, this pipeline runs for the prior period.
For simplicity, the scheduler.py script uses the following hardcoded parameters to schedule the pipelines.
By fixing the parameters, we ensure reproducibility: when you run the tutorial, you should get the same visuals. We will discuss how to customize this example later on.
Instead of running the scheduler script, you can execute each pipeline individually. This way, you will have more control over the specific execution time and interval of each pipeline.
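The scheduling logic can be sketched as a simple loop over fixed periods. All names and parameter values below are illustrative stand-ins, not the actual scheduler.py code; note how model monitoring lags one interval behind, since labels arrive with a delay.

```python
from datetime import datetime, timedelta

# Hypothetical hardcoded parameters, mirroring the idea of fixed scheduling
START_TIME = datetime(2021, 2, 1, 2, 0)
INTERVAL_MINUTES = 60
NUM_RUNS = 3

def run_period(ts: datetime, interval: int) -> list:
    """Stand-ins for the predict / monitor_data / monitor_model flows."""
    return [
        f"predict @ {ts}",
        f"monitor_data @ {ts}",
        # model monitoring runs for the *previous* period, when labels are in
        f"monitor_model @ {ts - timedelta(minutes=interval)}",
    ]

schedule = []
for i in range(NUM_RUNS):
    ts = START_TIME + timedelta(minutes=i * INTERVAL_MINUTES)
    schedule.extend(run_period(ts, INTERVAL_MINUTES))

print(len(schedule))
```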
[fs-toc-omit]2. Check the pipelines in the Prefect UI
Access the Prefect UI by navigating to http://localhost:4200 in a web browser. The Prefect UI shows the executed pipelines and their current status.
[fs-toc-omit]3. Explore the monitoring metrics in Grafana
Open the Grafana monitoring dashboards by visiting http://localhost:3000 in a web browser. The example contains pre-built Grafana dashboards showing data quality, target drift, and model performance metrics.
You can navigate and see the metrics as they appear on the Grafana dashboard.
Design prediction and monitoring pipelines
Now, let’s explore each component of the ML model monitoring architecture.
In this section, we will explain the design of the three Prefect pipelines to monitor input data quality, model predictions, and model performance. You will understand how they work and how to modify them.
First, let’s visualize the pipeline order and dependencies.
You execute three pipelines at different time intervals (T-1, T, and T+1). For each period, you make new predictions, run input data checks, and monitor model performance.
The pipelines perform the following tasks:
- Predict (t=0): At time T, this task processes the current data and generates predictions for the next period T+1, using the trained machine learning model.
- Monitor Data (t=0): This task profiles the current data and compares it to the reference. It checks for data drift or quality issues that might impact the model quality.
- Monitor Model (t=0): This task evaluates the quality of the predictions made by the model at time T by comparing them to the actuals. It checks for model performance degradation or target drift that might require retraining or updating the model. Since the actuals usually arrive with a delay, you will run the model monitoring pipelines at the next time interval (T+1).
[fs-toc-omit]Prefect basics: tasks and flows
In Prefect, tasks are the fundamental building blocks of workflows. Tasks represent individual operations, such as reading data, preprocessing data, training a model, or evaluating a model.
Let’s consider a simple example below:
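Here is a sketch consistent with the description that follows; the exact original code may differ. The try/except fallback is an addition so the snippet also runs standalone without Prefect installed.

```python
try:
    from prefect import task, flow
except ImportError:
    # Fallback no-op decorators so the sketch runs without Prefect installed
    def task(fn=None, **kwargs):
        return fn if fn else (lambda f: f)
    def flow(fn=None, **kwargs):
        return fn if fn else (lambda f: f)

@task
def say_hello(name: str) -> str:
    # Build a greeting message from the given name
    return f"Hello, {name}!"

@task
def do_good_open_source(greeting: str) -> list:
    # Extend the greeting into a list of messages
    return [f"{greeting} Do good!", f"{greeting} Open source!"]

@flow
def help_world(name: str = "world"):
    greeting = say_hello(name)
    messages = do_good_open_source(greeting)
    [print(m) for m in messages]
    return messages

result = help_world()
```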
To define a task in Prefect, you use the @task decorator, which turns any Python function into a Prefect task. This example demonstrates a simple flow containing two tasks:
- The say_hello task accepts a name argument and creates a greeting message.
- The do_good_open_source task takes a greeting message and constructs a list of extended messages.
Flows are the backbone of Prefect workflows. They represent the relationships between tasks and define the order of execution. To create a flow, use the @flow decorator. The help_world flow first calls the say_hello task. The output of this task (the greeting message) is then passed as an argument to the do_good_open_source task. The resulting list of messages from do_good_open_source is printed using a list comprehension.
Running this Python module produces output like this:
Prefect can automatically log the details of the running flow and visualize them in the Prefect UI.
1. Prediction pipeline
This pipeline makes predictions using a pre-trained ML model. The predict function is a Prefect flow that generates predictions for a new batch of data within a specified interval.
The predict flow orchestrates the entire prediction process. It takes a timestamp and an interval (in minutes) as input arguments. The flow consists of the following steps:
- Compute the batch start and end time using the given timestamp and interval.
- Prepare data by loading it from the specified Parquet file using the load_data task.
- Load the trained machine learning model from a Joblib file.
- Generate predictions for the batch data using the get_predictions task.
- Save the generated predictions to a Parquet file using the save_predictions task.
By defining predict as a Prefect flow, you create a reusable and modular pipeline for generating predictions on new data batches.
💡 Note: The predict flow is decorated with the @flow decorator, which includes the flow_run_name parameter that gives a unique name for each flow run based on the timestamp (ts).
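The core of the prediction step can be sketched as follows, with a stand-in model and in-memory data in place of the Parquet and Joblib files. All names here are illustrative, not the repository's actual code.

```python
import pandas as pd

class StandInModel:
    """Placeholder for the trained model loaded from the Joblib file."""
    def predict(self, X: pd.DataFrame):
        # Dummy linear rule standing in for the real regressor
        return X["trip_distance"] * 2.5

def get_predictions(batch: pd.DataFrame, model) -> pd.DataFrame:
    # Apply the model to the batch and attach the predictions column
    preds = batch.copy()
    preds["predictions"] = model.predict(batch)
    return preds

# Select the batch for the given interval, predict, then save to Parquet
data = pd.DataFrame({"trip_distance": [1.0, 2.0, 4.0]})
predictions = get_predictions(data, StandInModel())
print(predictions["predictions"].tolist())
```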
2. Data monitoring pipeline
The data quality monitoring pipeline tracks the quality and stability of the input data. We'll use Evidently to perform data profiling and generate a data quality report.
💡 For ease of demonstration, we check for both input data drift and prediction distribution drift as part of the data quality pipeline (both are included in the Data Drift Preset). You may want to split these tasks into separate pipelines in your projects.
The code snippet below from the src/pipelines/monitor_data.py shows how to create a Prefect flow to monitor data quality and data drift in a machine learning pipeline.
The monitor_data flow orchestrates the data monitoring process. It takes a timestamp ts and an interval (in minutes) as input arguments. The flow consists of the following steps:
- Prepare the current data by merging it with the corresponding predictions using the prepare_current_data task.
- Load the reference data from the Parquet file and select the relevant columns.
- Generate data quality and data drift reports using the generate_reports task.
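The data preparation step above can be sketched as a pandas merge of the raw batch with its stored predictions. The key column name is an assumption for illustration.

```python
import pandas as pd

# Raw features for the current period and the predictions logged earlier
current_features = pd.DataFrame({
    "ride_id": [1, 2, 3],
    "trip_distance": [1.0, 2.0, 4.0],
})
predictions = pd.DataFrame({
    "ride_id": [1, 2, 3],
    "predictions": [2.5, 5.0, 10.0],
})

# Merge on a shared key so each row carries both inputs and the model output
current_data = current_features.merge(predictions, on="ride_id", how="inner")
print(current_data.shape)
```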
Let’s dive deeper into the generate_reports task!
This task generates a set of Evidently metrics related to the data quality and data drift. It takes the current data, reference data, numerical features, categorical features, and the prediction column as input arguments and computes two reports.
The Data Quality report includes the DatasetSummaryMetric. It profiles the input dataset by computing metrics like the number of observations, missing values, empty and almost empty values, etc.
🚦 Conditional data validation. In this example, we compute and log the data quality metrics. As an alternative, you can directly check if the input data complies with certain conditions (for example, if there are features out of range, schema violations, etc.) and log the pass/fail test result in addition to metric values. In this case, use Evidently Test Suites.
The Data Drift report includes the DatasetDriftPreset. It compares the distributions of the features and predictions between the current and reference dataset. We do not pass any custom parameters, so it uses the default Evidently drift detection algorithm.
In this case, we do not generate the visual reports using Evidently, but instead, get the metrics as a Python dictionary using .as_dict() Evidently output format. This output includes the metric values, relevant metadata (such as applied drift detection method and threshold), and even optional visualization information, such as histogram bins.
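To illustrate, here is a simplified dictionary in the spirit of the .as_dict() output and one way to pull metric values out of it. The exact structure depends on the Evidently version and the metrics used, so treat the keys below as an assumption.

```python
# Simplified stand-in for a report.as_dict() result (structure is illustrative)
report_dict = {
    "metrics": [
        {
            "metric": "DatasetDriftMetric",
            "result": {
                "drift_share": 0.5,
                "number_of_drifted_columns": 2,
                "dataset_drift": False,
            },
        }
    ]
}

def get_metric_result(report: dict, metric_name: str) -> dict:
    """Find a metric's result block by its name."""
    for item in report["metrics"]:
        if item["metric"] == metric_name:
            return item["result"]
    raise KeyError(metric_name)

drift = get_metric_result(report_dict, "DatasetDriftMetric")
print(drift["dataset_drift"])
```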
This task then commits the computed metrics to the database for future analysis and visualization.
💡 Customizing the Metrics. In this example, we use only a couple of metrics available in Evidently. You can browse other metrics in these sample notebooks or explore the list of 100+ Metrics and Tests to choose those suitable for your use case.
3. Model monitoring pipeline
The model performance monitoring pipeline tracks the model quality over time. It uses Evidently to generate model quality metrics and compare the distribution of the model target against the reference period (evaluate target drift).
The code snippet below from the src/pipelines/monitor_model.py demonstrates how to create a Prefect flow for model monitoring:
The monitor_model Prefect flow generates the relevant metrics and commits them to the database. It consists of the following steps:
- Prepare the current data by merging it with the corresponding predictions using the prepare_current_data task.
- Load the reference data.
- Generate the model performance, data quality, and data drift reports using the generate_reports task.
The generate_reports task generates model performance and target drift reports. It utilizes the following Evidently metrics:
- RegressionQualityMetric to create a model performance report. It computes metrics like Mean Error, Mean Absolute Error, etc.
- ColumnDriftMetric to create a target drift report. We compute this metric for the Target column to check for distribution drift.
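For intuition, the headline numbers of the model performance report can be computed by hand. This is not Evidently's implementation, just the underlying formulas on toy data.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([11.0, 10.0, 9.0, 16.0])

errors = y_pred - y_true
mean_error = errors.mean()              # signed bias of the predictions
mean_abs_error = np.abs(errors).mean()  # average magnitude of the error

print(mean_error, mean_abs_error)
```

A mean error near zero with a large mean absolute error means the model is unbiased but imprecise; tracking both over time is what the report automates.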
Store and visualize ML metrics
Now, let’s look at what happens with the computed metrics.
[fs-toc-omit]1. Store the metrics in PostgreSQL
In this example, we use SQLAlchemy, a popular Python SQL toolkit and Object-Relational Mapper (ORM), to interact with the PostgreSQL database. With SQLAlchemy, you can define the table schema using Python classes and easily manage the database tables using Python code.
We prepared a Python script named create_db.py, that creates the database tables required for storing monitoring metrics.
The table models are defined in the src/utils/models.py module.
For example, the TargetDriftTable class represents a table schema for storing target drift metrics in the PostgreSQL database.
This DB table contains the following columns for monitoring purposes:
- timestamp stores the timestamp when the target drift metrics were computed.
- stattest_name stores the name of the drift detection method.
- stattest_threshold stores the threshold value for the drift detection method.
- drift_score stores the computed drift score.
- drift_detected stores a True or False value, indicating whether drift was detected based on the drift score and the threshold value.
It works similarly for other tables in the database.
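A sketch of such a table model with SQLAlchemy follows. The column names mirror the list above, but the details (table name, id column, types) are assumptions and may differ from the actual src/utils/models.py; SQLite in-memory stands in for PostgreSQL here.

```python
from sqlalchemy import Boolean, Column, DateTime, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TargetDriftTable(Base):
    __tablename__ = "target_drift"

    id = Column(Integer, primary_key=True)
    timestamp = Column(DateTime)            # when the metrics were computed
    stattest_name = Column(String)          # drift detection method name
    stattest_threshold = Column(Float)      # threshold for the method
    drift_score = Column(Float)             # computed drift score
    drift_detected = Column(Boolean)        # True/False drift verdict

# Swap the URL for a PostgreSQL connection string in practice
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
print(sorted(c.name for c in TargetDriftTable.__table__.columns))
```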
📊 How do the data drift checks work? You can explore "Which test is the best" research blog to understand the behavior of different drift detection methods. To understand the parameters of the Evidently drift checks, check the documentation.
💡 Why not Prometheus? A popular combination is to use Grafana together with Prometheus. In this case, we opt for a SQL database. The reason is that Prometheus is well-suited to storing time series metrics in near real-time, but it is not convenient for writing delayed data. In our case, we compute metrics on a schedule (which can be as infrequent as once per day or week) and compute model quality metrics with a delay. Using Prometheus would also add one more service to run (and monitor!).
[fs-toc-omit]2. Visualize metrics in Grafana
We prepared Grafana dashboard configurations to visualize the collected metrics, and data source configurations to connect Grafana to the PostgreSQL database. You can find them in the grafana/ directory.
After you launch the monitoring cluster, Grafana will connect to the monitoring database and create dashboards based on the templates.
As an example, let's explore the Evidently Numerical Target Drift dashboard, which provides insights into the distribution drift of the model target.
The top widgets show the name of the drift detection method (in this case, Wasserstein distance) and the threshold values.
The middle and bottom widgets display the history of drift checks for each period. This helps identify specific time points when drift occurred and the corresponding drift scores. You can understand the severity of drift and decide whether you want to intervene.
You can easily customize the dashboard widgets and scripts used to build them.
Alerts. You can also use Grafana to define alert conditions to inform when metrics are outside the expected range.
Customize for your data
To adapt this example for your machine learning projects, follow these guidelines:
Data inputs. Modify the load_data task to load your dataset from the relevant data source (e.g., a CSV or Parquet file, or a database).
Model inference. Replace the existing model with your trained machine learning model. You may need to adjust the get_predictions task to ensure compatibility with your chosen algorithm and data format.
Monitoring metrics. Customize the monitoring tasks to include metrics relevant to your project. Consider including data quality, data drift, target drift, and model performance checks. Update the generate_reports task with the appropriate Evidently metrics or tests.
💡 Evidently Metrics and Tests. You can browse other metrics in these sample notebooks or explore the list of 100+ Metrics and Tests to choose those suitable for your use case.
Database setup. Modify the database configuration and the table models in src/utils/models.py to store the monitoring metrics relevant to your project.
Reference dataset. Define a representative reference dataset and period suitable for your use case. This should be a stable data snapshot that captures the typical distribution and characteristics of the features and target variable. The period should be long enough to reflect the variations in the data. Consider the specific scenario and seasonality: sometimes, you might use a moving reference, for example, by comparing each week to the previous.
Store the reference dataset or generate it on the fly. Consider the size of the data and resources available for processing. Storing the reference dataset may be preferable for larger or more complex datasets, while generating it on the fly could be more suitable for smaller or highly dynamic datasets.
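The moving-reference idea mentioned above can be sketched with pandas, comparing each week to the previous one. The column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.date_range("2021-01-01", periods=21, freq="D"),
    "value": range(21),
})

current_start = pd.Timestamp("2021-01-15")  # start of the current week
week = pd.Timedelta(days=7)

current = df[(df["ts"] >= current_start) & (df["ts"] < current_start + week)]
# Moving reference: the week immediately before the current one
reference = df[(df["ts"] >= current_start - week) & (df["ts"] < current_start)]

print(len(reference), len(current))
```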
Grafana dashboards. Customize the Grafana dashboards to visualize the specific monitoring metrics relevant to your project. This may involve updating the SQL queries and creating additional visualizations to display the results of your custom metrics.
Architecture pros and cons
Applicable for both batch and real-time. This monitoring architecture can be used for both batch and real-time ML systems. With batch inference, you can directly follow this example, adding data validation or monitoring steps to your existing pipelines. For real-time systems, you can log model inputs and predictions to the database and then read the prediction logs from the database at a defined interval.
Async metrics computation. Metrics computation is separate from model serving. This way, it does not affect the model serving latency in cases when this is relevant.
Adaptable. You can replace specific components for those you already use. For example, you can use the same database you use to store model predictions or use a different workflow orchestrator, such as Airflow. (Here is an example integration of how to use Evidently with Airflow). You can also replace Grafana with a different BI dashboard and even make use of the additional visualization information available in the Evidently JSON/Python dictionary output to recreate some of the Evidently visualizations faster.
Might be too “heavy.” We recommend using this or similar architecture when you already use one or two of the mentioned tools as part of your workflow. For example, you already use Grafana for software monitoring or Prefect to orchestrate dataflows.
However, it might be suboptimal to introduce several complex new services to monitor a few ML models, especially if this is infrequent batch inference.
You can use Evidently to compute HTML reports and store them in any object storage. In this case, you will implement a complete “offline” monitoring setup as a set of monitoring jobs. You will also make use of the rich pre-built Evidently visuals for debugging.
Here is an example tutorial of using Evidently with Streamlit to host HTML reports.
This tutorial demonstrated how to integrate Evidently into production pipelines using Prefect, PostgreSQL, and Grafana.
- You built a solution that consists of three consecutive pipelines for data quality checks, model predictions, and model quality monitoring.
- You learned how to store the monitoring metrics in PostgreSQL and visualize them in Grafana.
You can further work with this example:
- Adapt it for your data, both for batch and real-time ML systems.
- Customize the specific monitoring metrics using the Evidently library.
- Use this architecture as an example and replace individual components.