May 4, 2023 · Last Updated: November 1, 2023

Batch ML monitoring blueprint: Evidently, Prefect, PostgreSQL, and Grafana


In this code tutorial, you will learn how to run batch ML model inference, collect data quality and ML model quality monitoring metrics, and visualize them on a live dashboard.

This is a blueprint for an end-to-end batch ML monitoring workflow using open-source tools. You can copy the repository and adapt this reference architecture to your use case.

Code example: if you prefer to head straight to the code, open this example folder on GitHub.

Background

[fs-toc-omit]ML monitoring

ML monitoring metrics

When an ML model is in production, you need to keep tabs on the ML-related quality metrics in addition to traditional software service monitoring. This typically includes:

  • Data quality and integrity metrics, such as share of nulls, schema compliance, etc. 
  • Distribution of the model outputs (prediction drift) and model inputs (data drift) as a proxy for model quality.
  • ML model quality metrics (accuracy, mean error, etc.). You often compute them with a delay after receiving the ground truth labels or actuals.
[fs-toc-omit]Want to learn more about ML monitoring?
Sign up for our Open-source ML observability course. Designed for data scientists and ML engineers. Yes, it's free!

Save my seat ⟶

[fs-toc-omit]ML monitoring architecture

In this tutorial, we introduce a possible implementation architecture for ML monitoring as a set of monitoring jobs.

ML monitoring architecture

You can adapt this approach to your batch ML pipelines. You can also use this approach to monitor online ML services: when you do not need to compute metrics every second, but instead can read freshly logged data from the database, say, every minute or once per hour.  

Tutorial scope

In this tutorial, you will learn how to build a batch ML monitoring workflow using Evidently, Prefect, PostgreSQL, and Grafana. 

The tutorial includes all the necessary steps to imitate the batch model inference and subsequent data joins for model quality evaluation.   

By the end of this tutorial, you will learn how to implement an ML monitoring architecture using:

  • Evidently to perform data quality, data drift and model quality checks. 
  • Prefect to orchestrate the checks and write metrics to the database.
  • PostgreSQL database to store the defined monitoring metrics. 
  • Grafana as a dashboard to visualize the metrics in time.

You will run your monitoring solution in a Docker environment for easy deployment and scalability.

[fs-toc-omit]Pre-requisites

We expect that you:

  • Have some experience working with batch ML models 
  • Went through the Evidently Get Started Tutorial and can generate visual and JSON reports with metrics.

You also need the following tools installed on your local machine: Git, Python 3, and Docker with Docker Compose.

Note: we tested this example on macOS/Linux.

Installation

First, install the pre-built example. Check the README file for more detailed instructions. 

[fs-toc-omit]1. Fork or clone the repository

Clone the Evidently GitHub repository with the example code. This repository provides the necessary files and scripts to set up the integration between Evidently, Prefect, PostgreSQL, and Grafana.

git clone git@github.com:evidentlyai/evidently.git
cd evidently/examples/integrations/postgres_grafana_batch_monitoring

[fs-toc-omit]2. Create a virtual environment

Create a Python virtual environment to isolate the dependencies for this project. Then, install the required Python libraries from the requirements.txt file:

python3 -m venv .venv
source .venv/bin/activate
echo "export PYTHONPATH=$PWD" >> .venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

[fs-toc-omit]3. Launch the monitoring cluster

Set up the monitoring infrastructure using Docker Compose. It will launch a cluster with the required services such as PostgreSQL and Grafana. This cluster is responsible for storing the monitoring metrics and visualizing them.

docker compose up -d

[fs-toc-omit]4. Create tables for monitoring database

To store the ML monitoring metrics in the PostgreSQL database, you must create the necessary tables. Run the Python script below to set up the database structure for the metrics generated by Evidently.

python src/scripts/create_db.py

[fs-toc-omit]5. Download the data and train model

This example is based on the NYC Taxi dataset. The data and the model training are out of scope for this tutorial. We prepared a few scripts to download the data, pre-process it, and train a simple machine learning model.

python src/pipelines/load_data.py     
python src/pipelines/process_data.py 
python src/pipelines/train.py

[fs-toc-omit]6. Prepare “reference” data for monitoring

To generate monitoring reports with Evidently, we usually need two datasets:

  • The reference dataset. This can be data from model validation, earlier production use, or a manually curated “golden set.” It serves as a baseline for comparison. 
  • The current dataset. It includes the recent production data you compare to reference.
Batch ML monitoring pipeline
Source: Evidently DOCS "Data requirements"

In this example, we take data from January 2021 as a reference. We use this data as a baseline and generate monitoring reports for consecutive periods.

python src/pipelines/prepare_reference_data.py

Do you always need the reference dataset? It depends. A reference dataset is required to compute data distribution drift. You can also choose to work with a reference dataset to quickly generate expectations about the data, such as data schema, feature min-max ranges, baseline model quality, etc. This is useful if you want to run Test Suites with auto-generated parameters. However, you can calculate most metrics (e.g., the share of nulls, feature min/max/mean, model quality, etc.) without the reference dataset.
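
As an illustration, here is a minimal sketch of computing dataset summary metrics without a reference dataset. Here, current_batch is an assumed pandas DataFrame with a recent batch of production data, not a variable from this example:

from evidently.report import Report
from evidently.metrics import DatasetSummaryMetric

# Many metrics can be computed from the current data alone
report = Report(metrics=[DatasetSummaryMetric()])
report.run(reference_data=None, current_data=current_batch)
metrics = report.as_dict()  # metric values as a Python dictionary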

Run ML monitoring example

After completing the installation, you have a working Evidently integration with Prefect, PostgreSQL, and Grafana. 

Follow the steps below to launch the example. You will run a set of inference and ML monitoring pipelines to generate the metrics, store them in the database and explore them on a Grafana dashboard.

[fs-toc-omit]1. Run inference and monitoring pipelines

Set the Prefect API URL environment variable to enable communication between the Prefect server and the Python scripts. Then, execute the scheduler script to automatically run the Prefect flows for inference and monitoring:

export PREFECT_API_URL=http://localhost:4200/api
python src/pipelines/scheduler.py

The scheduler.py script runs the following pipelines:

  • predict.py performs inference by applying the trained model to the input data.
  • monitor_data.py monitors the input data quality by comparing it to the reference data.
  • monitor_model.py evaluates the quality of the model predictions. This pipeline requires ground truth data. We assume the labels arrive with a delay and are available for the previous period. Therefore, this pipeline runs for the prior period.

For simplicity, the scheduler.py script uses the following hardcoded parameters to schedule the pipelines.

START_DATE_TIME = '2021-02-01 02:00:00'
END_DATE_TIME = '2021-02-01 10:00:00'
BATCH_INTERVAL = 60

By fixing the parameters, we ensure reproducibility: when you run the tutorial, you should get the same visuals. We will discuss how to customize this example further below.
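
A rough sketch of what such a scheduling loop could look like is shown below. It is an illustration only: the actual scheduler.py may sequence and parametrize the flows differently.

import pendulum

# Flow imports follow the repository layout used in the commands below
from src.pipelines.predict import predict
from src.pipelines.monitor_data import monitor_data
from src.pipelines.monitor_model import monitor_model

START_DATE_TIME = '2021-02-01 02:00:00'
END_DATE_TIME = '2021-02-01 10:00:00'
BATCH_INTERVAL = 60  # minutes

ts = pendulum.parse(START_DATE_TIME)
end = pendulum.parse(END_DATE_TIME)

while ts <= end:
    predict(ts=ts, interval=BATCH_INTERVAL)
    monitor_data(ts=ts, interval=BATCH_INTERVAL)
    # model quality is evaluated for the prior period, once the labels are available
    monitor_model(ts=ts, interval=BATCH_INTERVAL)
    ts = ts.add(minutes=BATCH_INTERVAL)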

Instead of running the scheduler script, you can execute each pipeline individually. This way, you will have more control over the specific execution time and interval of each pipeline.

python src/pipelines/predict.py --ts '2021-02-01 01:00:00' --interval 60
python src/pipelines/monitor_data.py --ts '2021-02-01 01:00:00' --interval 60
python src/pipelines/monitor_model.py --ts '2021-02-01 02:00:00' --interval 60

[fs-toc-omit]2. Check the pipelines in the Prefect UI

Access the Prefect UI by navigating to http://localhost:4200 in a web browser. The Prefect UI shows the executed pipelines and their current status.

ML pipelines in Prefect

[fs-toc-omit]3. Explore the monitoring metrics in Grafana

Open the Grafana monitoring dashboards by visiting http://localhost:3000 in a web browser.  The example contains pre-built Grafana dashboards showing data quality, target drift, and model performance metrics.

You can navigate and see the metrics as they appear on the Grafana dashboard.

Monitoring metrics in Grafana

Design prediction and monitoring pipelines 

Now, let’s explore each component of the ML model monitoring architecture.

In this section, we will explain the design of the three Prefect pipelines to monitor input data quality, model predictions, and model performance. You will understand how they work and how to modify them.

First, let’s visualize the pipeline order and dependencies.

ML monitoring pipeline

You execute three pipelines at different time intervals (T-1, T, and T+1). For each period, you make new predictions, run input data checks, and monitor model performance.

The pipelines perform the following tasks:

  1. Predict (t=0): At time T, this task processes the current data and generates predictions for the next period T+1, using the trained machine learning model.
  2. Monitor Data (t=0): This task profiles the current data and compares it to the reference. It checks for data drift or quality issues that might impact the model quality. 
  3. Monitor Model (t=0): This task evaluates the quality of the predictions made by the model at time T by comparing them to the actuals. It checks for model performance degradation or target drift that might require retraining or updating the model. Since the actuals usually arrive with a delay, you will run the model monitoring pipelines at the next time interval (T+1).

[fs-toc-omit]Prefect basics: tasks and flows

In Prefect, tasks are the fundamental building blocks of workflows. Tasks represent individual operations, such as reading data, preprocessing data, training a model, or evaluating a model.

Let’s consider a simple example below:

from prefect import flow, task

@task
def say_hello(name):
    msg = f"Hey {name}, you are awesome!"
    return msg

@task
def do_good_open_source(greeting_msg):
    skills = ['Model Quality', 'Data Drift', 'Target Drift', 'Data Quality']
    msgs = [f'{greeting_msg} - Evidently helps with: {good}! \n' for good in skills]
    return msgs

@flow
def help_world(name):
    greeting_msg = say_hello(name)
    help_msgs = do_good_open_source(greeting_msg)
    [print(msg) for msg in help_msgs]

if __name__ == "__main__":
    help_world(name="World")

To define a task in Prefect, one can use the @task decorator. This decorator turns any Python function into a Prefect task. This example demonstrates a simple flow containing two tasks:

  • The say_hello task accepts a name argument and creates a greeting message.
  • The do_good_open_source task takes a greeting message and constructs a list of extended messages.

Flows are the backbone of Prefect workflows. They represent the relationships between tasks and define the order of execution. To create a flow, use the @flow decorator. The help_world flow first calls the say_hello task. The output of this task (the greeting message) is then passed as an argument to the do_good_open_source task. The resulting list of messages from do_good_open_source is printed using a list comprehension.

Running this Python module produces output like this:

❯ python prefect_demo_tutorial.py

16:01:56.307 | INFO    | prefect.engine - Created flow run 'acrid-angelfish' for flow 'help-world'
16:01:56.953 | INFO    | Flow run 'acrid-angelfish' - Created task run 'say_hello-0' for task 'say_hello'
16:01:56.954 | INFO    | Flow run 'acrid-angelfish' - Executing 'say_hello-0' immediately...
16:01:57.289 | INFO    | Task run 'say_hello-0' - Finished in state Completed()
16:01:57.453 | INFO    | Flow run 'acrid-angelfish' - Created task run 'do_good_open_source-0' for task 'do_good_open_source'
16:01:57.454 | INFO    | Flow run 'acrid-angelfish' - Executing 'do_good_open_source-0' immediately...
16:01:57.729 | INFO    | Task run 'do_good_open_source-0' - Finished in state Completed()

Hey World, you are awesome! - Evidently helps with Model Quality! 
Hey World, you are awesome! - Evidently helps with Data Drift! 
Hey World, you are awesome! - Evidently helps with Target Drift! 
Hey World, you are awesome! - Evidently helps with Data Quality! 

16:01:57.855 | INFO    | Flow run 'acrid-angelfish' - Finished in state Completed('All states completed.')

Prefect can automatically log the details of the running flow and visualize them in the Prefect UI.

Running flow in Prefect

1. Prediction pipeline

This pipeline makes predictions using a pre-trained ML model. The predict function is a Prefect flow that generates predictions for a new batch of data within a specified interval.

@task
def load_data(path: Path, start_time: Text, end_time: Text) -> pd.DataFrame:
    """Load data from parquet file."""
    ...

@task
def get_predictions(data: pd.DataFrame, model) -> pd.DataFrame:
    """Predictions generation."""
    ...

@task
def save_predictions(predictions: pd.DataFrame, path: Path) -> None:
    """Save predictions to parquet file."""
    ...

@flow(flow_run_name="predict-on-{ts}")
def predict(
    ts: pendulum.DateTime,
    interval: int = 60
) -> None:
    """Calculate predictions for the new batch (interval) data."""

    DATA_FEATURES_DIR = 'data/features'

    # Compute the batch start and end time
    start_time, end_time = get_batch_interval(ts, interval)

    # Prepare data
    path = Path(f'{DATA_FEATURES_DIR}/green_tripdata_2021-02.parquet')
    batch_data = load_data(path, start_time, end_time)

    # Predictions generation
    model = joblib.load('models/model.joblib')
    predictions = get_predictions(batch_data, model)

    # Save predictions
    filename = ts.to_date_string()
    path = Path(f'data/predictions/{filename}.parquet')
    save_predictions(predictions, path)

The predict flow orchestrates the entire prediction process. It takes a timestamp and an interval (in minutes) as input arguments. The flow consists of the following steps:

  • Compute the batch start and end time using the given timestamp and interval.
  • Prepare data by loading it from the specified Parquet file using the load_data task.
  • Load the trained machine learning model from a Joblib file.
  • Generate predictions for the batch data using the get_predictions task.
  • Save the generated predictions to a Parquet file using the save_predictions task.

By defining predict as a Prefect flow, you create a reusable and modular pipeline for generating predictions on new data batches.

💡 Note: The predict flow is decorated with the @flow decorator, which includes the flow_run_name parameter that gives a unique name for each flow run based on the timestamp (ts).
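
The get_batch_interval helper used above is not shown in the snippet. A hypothetical version, assuming the batch window spans the interval ending at the passed timestamp, could look like this (the actual helper in src/utils may differ):

from typing import Text, Tuple

import pendulum

def get_batch_interval(ts: pendulum.DateTime, interval: int) -> Tuple[Text, Text]:
    """Return the batch window as (start_time, end_time) timestamp strings."""
    start_time = ts.subtract(minutes=interval)
    return start_time.to_datetime_string(), ts.to_datetime_string()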

2. Data monitoring pipeline

The data quality monitoring pipeline tracks the quality and stability of the input data. We'll use Evidently to perform data profiling and generate a data quality report.

💡 For ease of demonstration, we check both input data drift and prediction distribution drift as part of the data quality pipeline (both are included in the Data Drift preset). You may want to split these checks into separate pipelines in your projects.

The code snippet below, from src/pipelines/monitor_data.py, shows how to create a Prefect flow to monitor data quality and data drift in a machine learning pipeline.

@task
def prepare_current_data(start_time: Text, end_time: Text) -> pd.DataFrame:
    """Merge the current data with the corresponding predictions."""
    ...

@task
def generate_reports(
    current_data: pd.DataFrame,
    reference_data: pd.DataFrame,
    num_features: List[Text],
    cat_features: List[Text],
    prediction_col: Text
) -> None:
    """
    Generate data quality and data drift reports and
    commit metrics to the database."""
    ...

@flow(flow_run_name="monitor-data-on-{ts}")
def monitor_data(
    ts: pendulum.DateTime,
    interval: int = 60) -> None:
    """Build and save data validation reports."""

    num_features = ['passenger_count', 'trip_distance', 'fare_amount', 'total_amount']
    cat_features = ['PULocationID', 'DOLocationID']
    prediction_col = 'predictions'

    # Prepare current data
    start_time, end_time = get_batch_interval(ts, interval)
    current_data: pd.DataFrame = prepare_current_data(start_time, end_time)

    # Prepare reference data
    DATA_REF_DIR = 'data/reference'
    ref_path = f'{DATA_REF_DIR}/reference_data_2021-01.parquet'
    ref_data = pd.read_parquet(ref_path)
    columns: List[Text] = num_features + cat_features + [prediction_col]
    reference_data = ref_data.loc[:, columns]

    generate_reports(
        current_data=current_data,
        reference_data=reference_data,
        num_features=num_features,
        cat_features=cat_features,
        prediction_col=prediction_col
    )

The monitor_data flow orchestrates the data monitoring process. It takes a timestamp ts and an interval (in minutes) as input arguments. The flow consists of the following steps:

  • Prepare the current data by merging it with the corresponding predictions using the prepare_current_data task.
  • Load the reference data from the Parquet file and select the relevant columns.
  • Generate data quality and data drift reports using the generate_reports task.

Let’s dive deeper into the generate_reports task!

@task
def generate_reports(
    current_data: pd.DataFrame,
    reference_data: pd.DataFrame,
    num_features: List[Text],
    cat_features: List[Text],
    prediction_col: Text
) -> None:
    """
    Generate data quality and data drift reports and
    commit metrics to the database."""

    # Prepare column_mapping object for Evidently reports
    column_mapping = ColumnMapping()
    column_mapping...

    # Data quality report
    data_quality_report = Report(metrics=[DatasetSummaryMetric()])
    data_quality_report.run(...)

    # Data drift report
    data_drift_report = Report(metrics=[DataDriftPreset()])
    data_drift_report.run(...)

    # Commit metrics into database
    ...
    commit_data_metrics_to_db(...)

This task generates a set of Evidently metrics related to the data quality and data drift. It takes the current data, reference data, numerical features, categorical features, and the prediction column as input arguments and computes two reports. 

The Data Quality report includes the DatasetSummaryMetric. It profiles the input dataset by computing metrics like the number of observations, missing values, empty and almost empty values, etc.

Evidently Data Summary table
An HTML version of the DatasetSummaryMetric.
🚦 Conditional data validation. In this example, we compute and log the metric values. As an alternative, you can directly check whether the input data complies with certain conditions (for example, features out of range, schema violations, etc.) and log the pass/fail test results in addition to the metric values. In this case, use Evidently Test Suites.

The Data Drift report includes the DataDriftPreset. It compares the distributions of the features and predictions between the current and reference datasets. We do not pass any custom parameters, so it uses the default Evidently drift detection algorithm.
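
If the defaults do not fit your data, you can pass parameters to the preset. A hedged sketch with illustrative values:

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Override the default drift detection method and threshold (illustrative values)
data_drift_report = Report(metrics=[
    DataDriftPreset(stattest="psi", stattest_threshold=0.2),
])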

Evidently data drift table

In this case, we do not generate visual reports with Evidently. Instead, we get the metrics as a Python dictionary using the Evidently .as_dict() output format. This output includes the metric values, relevant metadata (such as the applied drift detection method and threshold), and even optional visualization information, such as histogram bins.

This task then commits the computed metrics to the database for future analysis and visualization.
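
As an illustration, the logic behind commit_data_metrics_to_db might extract values from that dictionary roughly as follows. The dictionary keys and the record structure are assumptions and may differ from the repository code:

# Dataset-level drift results from the first metric in the Data Drift report
drift_result = data_drift_report.as_dict()["metrics"][0]["result"]

data_drift_record = {
    "timestamp": ts.timestamp(),
    "number_of_drifted_columns": drift_result.get("number_of_drifted_columns"),
    "share_of_drifted_columns": drift_result.get("share_of_drifted_columns"),
    "dataset_drift": drift_result.get("dataset_drift"),
}
# commit_data_metrics_to_db(...) would then write a record like this to PostgreSQL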

💡 Customizing the Metrics. In this example, we use only a couple of metrics available in Evidently. You can browse other metrics in these sample notebooks or explore the list of 100+ Metrics and Tests to choose those suitable for your use case.

3. Model monitoring pipeline

The model performance monitoring pipeline tracks the model quality over time. It uses Evidently to generate model quality metrics and compare the distribution of the model target against the reference period (evaluate target drift).

The code snippet below, from src/pipelines/monitor_model.py, demonstrates how to create a Prefect flow for model monitoring:

@task
def generate_reports(
    current_data: pd.DataFrame,
    reference_data: pd.DataFrame,
    num_features: List[Text],
    cat_features: List[Text],
    prediction_col: Text,
    target_col: Text
) -> None:
    """
    Generate data quality and data drift reports
    and commit metrics to the database."""

    # Prepare column_mapping object for Evidently reports
    column_mapping = ColumnMapping()
    column_mapping...

    # Create a model performance report
    model_performance_report = Report(metrics=[RegressionQualityMetric()])
    ...

    # Target drift report
    target_drift_report = Report(metrics=[ColumnDriftMetric(target_col)])
    ...

    # Save metrics to database
    commit_model_metrics_to_db()

@flow(flow_run_name="monitor-model-on-{ts}")
def monitor_model(
    ts: pendulum.DateTime,
    interval: int = 60
) -> None:
    """Build and save monitoring reports."""

    DATA_REF_DIR = 'data/reference'
    ...

    # Prepare current data
    start_time, end_time = get_batch_interval(ts, interval)
    current_data = prepare_current_data(start_time, end_time)

    # Prepare reference data
    ref_path = f'{DATA_REF_DIR}/reference_data_2021-01.parquet'
    reference_data = pd.read_parquet(ref_path)
    ...

    # Generate reports
    generate_reports(
        current_data=current_data,
        reference_data=reference_data,
        num_features=num_features,
        cat_features=cat_features,
        prediction_col=prediction_col,
        target_col=target_col)

The monitor_model Prefect flow generates the relevant metrics and commits them to the database. It consists of the following steps:

  • Prepare the current data by merging it with the corresponding predictions using the prepare_current_data task.
  • Load the reference data.
  • Generate the model performance and target drift reports using the generate_reports task.

The generate_reports task generates model performance and target drift reports. It utilizes the following Evidently metrics:

  • RegressionQualityMetric to create a model performance report. It computes metrics like Mean Error, Mean Absolute Error, etc. 
  • ColumnDriftMetric to create a target drift report. We compute this metric for the Target column to check for distribution drift.
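
Below is a minimal sketch of what the elided report code could look like. The feature lists match the data monitoring pipeline; the target column name is a hypothetical placeholder, and reference_data and current_data are the DataFrames prepared earlier in the flow:

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import RegressionQualityMetric, ColumnDriftMetric

num_features = ['passenger_count', 'trip_distance', 'fare_amount', 'total_amount']
cat_features = ['PULocationID', 'DOLocationID']

column_mapping = ColumnMapping(
    target='duration',        # hypothetical target column name
    prediction='predictions',
    numerical_features=num_features,
    categorical_features=cat_features,
)

# Model performance report
model_performance_report = Report(metrics=[RegressionQualityMetric()])
model_performance_report.run(
    reference_data=reference_data, current_data=current_data, column_mapping=column_mapping
)

# Target drift report
target_drift_report = Report(metrics=[ColumnDriftMetric(column_name='duration')])
target_drift_report.run(
    reference_data=reference_data, current_data=current_data, column_mapping=column_mapping
)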

Store and visualize ML metrics

Now, let’s look at what happens with the computed metrics.

[fs-toc-omit]1. Store the metrics in PostgreSQL

In this example, we use SQLAlchemy, a popular Python SQL toolkit and Object-Relational Mapper (ORM), to interact with the PostgreSQL database. With SQLAlchemy, you can define the table schema using Python classes and easily manage the database tables using Python code.

We prepared a Python script, create_db.py, that creates the database tables required for storing the monitoring metrics.

python src/scripts/create_db.py
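
Under the hood, such a script typically creates a SQLAlchemy engine and all tables declared on the shared declarative base. A minimal sketch, with an assumed connection string:

from sqlalchemy import create_engine

from src.utils.models import Base  # declarative Base shared by the table models

# The connection string is an assumption; use the credentials from your Docker Compose setup
engine = create_engine("postgresql://user:password@localhost:5432/monitoring_db")
Base.metadata.create_all(engine)  # create all declared tables if they do not exist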

The table models are defined in the src/utils/models.py module. 

For example, the TargetDriftTable class represents a table schema for storing target drift metrics in the PostgreSQL database.

class TargetDriftTable(Base):
    """Implement table for target drift metrics.
    Evidently metric functions:
        - ColumnDriftMetric from target column
    """

    __tablename__ = 'target_drift'
    id = Column(Integer, primary_key=True)
    timestamp = Column(Float)
    stattest_name = Column(String)
    stattest_threshold = Column(Float)
    drift_score = Column(Float)
    drift_detected = Column(Boolean)

This DB table contains the following columns for monitoring purposes:

  • timestamp stores the timestamp when the target drift metrics were computed.
  • stattest_name stores the name of the drift detection method.
  • stattest_threshold stores the threshold value for the drift detection method.
  • drift_score stores the computed drift score.
  • drift_detected stores a True or False value, indicating whether drift was detected based on the drift score and the threshold value.

It works similarly for other tables in the database.
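
To connect the table schema to the Evidently output, a ColumnDriftMetric result can be mapped onto a TargetDriftTable row roughly like this. The dictionary keys and the session setup are assumptions and may differ from the repository code:

from sqlalchemy.orm import Session

# Target drift results from the report computed in the model monitoring pipeline
result = target_drift_report.as_dict()["metrics"][0]["result"]

row = TargetDriftTable(
    timestamp=ts.timestamp(),
    stattest_name=result.get("stattest_name"),
    stattest_threshold=result.get("stattest_threshold"),
    drift_score=result.get("drift_score"),
    drift_detected=result.get("drift_detected"),
)

with Session(engine) as session:  # engine as created in create_db.py
    session.add(row)
    session.commit()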

📊 How do the data drift checks work? You can explore "Which test is the best" research blog to understand the behavior of different drift detection methods. To understand the parameters of the Evidently drift checks, check the documentation.
💡 Why not Prometheus? A popular combination is to use Grafana together with Prometheus. In this case, we opt for a SQL database. The reason is that Prometheus is well-suited to storing time series metrics in near real-time. However, it is not convenient for writing delayed data. In our case, we compute metrics on a schedule (which can be as infrequent as once per day or week) and compute model quality metrics with a delay. Using Prometheus would also add another service to run (and monitor!).

[fs-toc-omit]2. Visualize metrics in Grafana

We prepared Grafana dashboard configurations to visualize the collected metrics and data source configurations to connect it to the PostgreSQL database. You can find them in the grafana/ directory.

Grafana metrics

After you launch the monitoring cluster, Grafana will connect to the monitoring database and create dashboards based on the templates.

As an example, let's explore the Evidently Numerical Target Drift dashboard, which provides insights into the distribution drift of the model target.

Grafana metrics Target drift

The top widgets show the name of the drift detection method (in this case, Wasserstein distance) and the threshold values. 

The middle and bottom widgets display the history of drift checks for each period. This helps identify specific time points when drift occurred and the corresponding drift scores. You can understand the severity of drift and decide whether you want to intervene.

You can easily customize the dashboard widgets and scripts used to build them.

Grafana metrics customization
Alerts. You can also use Grafana to define alert conditions that notify you when metrics are outside the expected range.

Customize for your data

To adapt this example for your machine learning projects, follow these guidelines:

Data inputs. Modify the load_data task to load your dataset from the relevant data source (e.g., CSV, parquet file, or database).

Model inference. Replace the existing model with your trained machine learning model. You may need to adjust the get_predictions task to ensure compatibility with your chosen algorithm and data format.

Monitoring metrics. Customize the monitoring tasks to include metrics relevant to your project. Consider including data quality, data drift, target drift, and model performance checks. Update the generate_reports task with the appropriate Evidently metrics or tests.

💡 Evidently Metrics and Tests. You can browse other metrics in these sample notebooks or explore the list of 100+ Metrics and Tests to choose those suitable for your use case. 
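
For example, a Test Suite with explicit pass/fail conditions could look like the sketch below. The chosen tests and thresholds are illustrative, and reference_data, current_data, and column_mapping are assumed to be prepared as in the monitoring flows:

from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset
from evidently.tests import TestShareOfMissingValues

tests = TestSuite(tests=[
    DataDriftTestPreset(),
    TestShareOfMissingValues(lte=0.05),  # fail if more than 5% of values are missing
])
tests.run(reference_data=reference_data, current_data=current_data, column_mapping=column_mapping)
test_results = tests.as_dict()  # pass/fail results you can log next to the metric values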

Database setup. Modify the database configuration and the table models in src/utils/models.py to store the monitoring metrics relevant to your project.
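
For instance, a new table for dataset summary metrics could follow the same pattern as TargetDriftTable. The table and column names here are hypothetical:

from sqlalchemy import Column, Float, Integer

class DataQualityTable(Base):  # Base as defined in src/utils/models.py
    """Hypothetical table for dataset summary metrics."""

    __tablename__ = "data_quality"
    id = Column(Integer, primary_key=True)
    timestamp = Column(Float)
    number_of_rows = Column(Integer)
    share_of_missing_values = Column(Float)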

Reference dataset. Define a representative reference dataset and period suitable for your use case. This should be a stable data snapshot that captures the typical distribution and characteristics of the features and target variable. The period should be long enough to reflect the variations in the data. Consider the specific scenario and seasonality: sometimes, you might use a moving reference, for example, by comparing each week to the previous. 

Store the reference dataset or generate it on the fly. Consider the size of the data and resources available for processing. Storing the reference dataset may be preferable for larger or more complex datasets, while generating it on the fly could be more suitable for smaller or highly dynamic datasets.

Grafana dashboards. Customize the Grafana dashboards to visualize the specific monitoring metrics relevant to your project. This may involve updating the SQL queries and creating additional visualizations to display the results of your custom metrics.

Architecture pros and cons

Batch ML monitoring architecture

[fs-toc-omit]Pros

Applicable for both batch and real-time. This monitoring architecture can be used for both batch and real-time ML systems. With batch inference, you can directly follow this example, adding data validation or monitoring steps to your existing pipelines. For real-time systems, you can log model inputs and predictions to the database and then read the prediction logs from the database at a defined interval. 

Async metrics computation. Metrics computation is separate from model serving. This way, it does not affect the model serving latency in cases when this is relevant. 

Adaptable. You can replace specific components with those you already use. For example, you can use the same database where you store model predictions, or a different workflow orchestrator, such as Airflow. (Here is an example integration of Evidently with Airflow.) You can also replace Grafana with a different BI dashboard and even make use of the additional visualization information available in the Evidently JSON/Python dictionary output to recreate some of the Evidently visualizations faster.

[fs-toc-omit]Cons

Might be too “heavy.” We recommend using this or similar architecture when you already use one or two of the mentioned tools as part of your workflow. For example, you already use Grafana for software monitoring or Prefect to orchestrate dataflows. 

However, it might be suboptimal to introduce several complex new services to monitor a few ML models, especially if this is infrequent batch inference.  

[fs-toc-omit]Alternatives

Evidently and Streamlit integration

You can use Evidently to compute HTML reports and store them in any object storage. In this case, you will implement a complete “offline” monitoring setup as a set of monitoring jobs. You will also make use of the rich pre-built Evidently visuals for debugging. 

Here is an example tutorial of using Evidently with Streamlit to host HTML reports.

[fs-toc-omit]Support Evidently
Did you enjoy the blog? Star Evidently on GitHub to contribute back! This helps us continue creating free, open-source tools and content for the community.

⭐️ Star on GitHub ⟶

Summing up

This tutorial demonstrated how to integrate Evidently into production pipelines using Prefect, PostgreSQL, and Grafana. 

  • You built a solution that consists of three consecutive pipelines for data quality checks, model predictions, and model quality monitoring.
  • You learned how to store the monitoring metrics in PostgreSQL and visualize them in Grafana.

You can further work with this example:

  • Adapt it for your data, both for batch and real-time ML systems.
  • Customize the specific monitoring metrics using the Evidently library. 
  • Use this architecture as an example and replace individual components.
Mikhail Rozhkov, Guest author: https://www.linkedin.com/in/mnrozhkov/
Elena Samuylova, Co-founder and CEO, Evidently AI: https://www.linkedin.com/in/elenasamuylova/

You might also like:

May 4, 2023
Last Updated:
November 1, 2023

Batch ML monitoring blueprint: Evidently, Prefect, PostgreSQL, and Grafana

Tutorials
OPEN-SOURCE ML MONITORING
Evaluate, test and monitor your ML models with Evidently.
START ON GITHUB
Get EVIDENTLY UPDATES
New features, integrations, and code tutorials.
Thank you! Please check your email to confirm subscription!
Oops! Something went wrong while submitting the form.

In this code tutorial, you will learn how to run batch ML model inference, collect data and ML model quality monitoring metrics, and visualize them on a live dashboard.  

This is a blueprint for an end-to-end batch ML monitoring workflow using open-source tools. You can copy the repository and use this reference architecture to adapt for your use case.

Code example: if you prefer to head straight to the code, open this example folder on GitHub.

Background

[fs-toc-omit]ML monitoring

ML monitoring metrics

When an ML model is in production, you need to keep tabs on the ML-related quality metrics in addition to traditional software service monitoring. This typically includes:

  • Data quality and integrity metrics, such as share of nulls, schema compliance, etc. 
  • Distribution of the model outputs (prediction drift) and model inputs (data drift) as a proxy for model quality.
  • ML model quality metrics (accuracy, mean error, etc.). You often compute them with a delay after receiving the ground truth labels or actuals.
[fs-toc-omit]Want to learn more about ML monitoring?
Sign up for our Open-source ML observability course. Designed for data scientists and ML engineers. Yes, it's free!

Save my seat ⟶

[fs-toc-omit]ML monitoring architecture

In this tutorial, we introduce a possible implementation architecture for ML monitoring as a set of monitoring jobs.

ML monitoring architecture

You can adapt this approach to your batch ML pipelines. You can also use this approach to monitor online ML services: when you do not need to compute metrics every second, but instead can read freshly logged data from the database, say, every minute or once per hour.  

Tutorial scope

In this tutorial, you will learn how to build a batch ML monitoring workflow using Evidently, Prefect, PostgreSQL, and Grafana. 

The tutorial includes all the necessary steps to imitate the batch model inference and subsequent data joins for model quality evaluation.   

By the end of this tutorial, you will learn how to implement an ML monitoring architecture using:

  • Evidently to perform data quality, data drift and model quality checks. 
  • Prefect to orchestrate the checks and write metrics to the database.
  • PostgreSQL database to store the defined monitoring metrics. 
  • Grafana as a dashboard to visualize the metrics in time.

You will run your monitoring solution in a Docker environment for easy deployment and scalability.

[fs-toc-omit]Pre-requisites

We expect that you:

  • Have some experience working with batch ML models 
  • Went through the Evidently Get Started Tutorial and can generate visual and JSON reports with metrics.

You also need the following tools installed on your local machine:

Note: we tested this example on macOS/Linux.

Installation

First, install the pre-built example. Check the README file for more detailed instructions. 

[fs-toc-omit]1. Fork or clone the repository

Clone the Evidently GitHub repository with the example code. This repository provides the necessary files and scripts to set up the integration between Evidently, Prefect, PostgreSQL, and Grafana.

git clone git@github.com:evidentlyai/evidently.git
cd evidently/examples/integrations/postgres_grafana_batch_monitoring

[fs-toc-omit]2. Create a virtual environment

Create a Python virtual environment to isolate the dependencies for this project. Then, install the required Python libraries from the requirements.txt file:

python3 -m venv .venv
source .venv/bin/activate
echo "export PYTHONPATH=$PWD" >> .venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

[fs-toc-omit]3. Launch the monitoring cluster

Set up the monitoring infrastructure using Docker Compose. It will launch a cluster with the required services such as PostgreSQL and Grafana. This cluster is responsible for storing the monitoring metrics and visualizing them.

docker compose up -d

[fs-toc-omit]4. Create tables for monitoring database

To store the ML monitoring metrics in the PostgreSQL database, you must create the necessary tables. Run a Python script below to set up the database structure to store metrics generated by Evidently.

python src/scripts/create_db.py

[fs-toc-omit]5. Download the data and train model

This example is based on the NYC Taxi dataset. The data and the model training are out of the scope of this tutorial. We prepared a few scripts to download data, pre-process it and train a simple machine learning model.

python src/pipelines/load_data.py     
python src/pipelines/process_data.py 
python src/pipelines/train.py

[fs-toc-omit]6. Prepare “reference” data for monitoring

To generate monitoring reports with Evidently, we usually need two datasets:

  • The reference dataset. This can be data from model validation, earlier production use, or a manually curated “golden set.” It serves as a baseline for comparison. 
  • The current dataset. It includes the recent production data you compare to reference.
Batch ML monitoring pipeline
Source: Evidently DOCS "Data requirements"

In this example, we take data from January 2021 as a reference. We use this data as a baseline and generate monitoring reports for consecutive periods.

python src/pipelines/prepare_reference_data.py
Do you always need the reference dataset? It depends. A reference dataset is required to compute data distribution drift. You can also choose to work with a reference dataset to quickly generate expectations about the data, such as data schema, feature min-max ranges, baseline model quality, etc. This is useful if you want to run Test Suites with auto-generated parameters. However, you can calculate most metrics (e.g., the share of nulls, feature min/max/mean, model quality, etc.) without the reference dataset.

Run ML monitoring example

After completing the installation, you have a working Evidently integration with Prefect, PostgreSQL, and Grafana. 

Follow the steps below to launch the example. You will run a set of inference and ML monitoring pipelines to generate the metrics, store them in the database and explore them on a Grafana dashboard.

[fs-toc-omit]1. Run inference and monitoring pipelines

Set the Prefect API URL environment variable to enable communication between the Prefect server and the Python scripts. Then, execute the scheduler script to automatically run the Prefect flows for inference and monitoring:

export PREFECT_API_URL=http://localhost:4200/api
python src/pipelines/scheduler.py

The scheduler.py script runs the following pipelines:

  • predict.py performs inference by applying the trained model to the input data.
  • monitor_data.py monitors the input data quality by comparing it to the reference data.
  • monitor_model.py evaluates the quality of the model predictions. This pipeline requires ground truth data. We assume the labels arrive with a delay and are available for the previous period. Therefore, this pipeline runs for the prior period.

For simplicity, the scheduler.py script uses the following hardcoded parameters to schedule the pipelines.

START_DATE_TIME = '2021-02-01 02:00:00'
END_DATE_TIME = '2021-02-01 10:00:00'
BATCH_INTERVAL = 60
By fixing the parameters, we ensure the reproducibility. When you run the tutorial, you should get the same visuals. We will further discuss how to customize this example.

Instead of running the scheduler script, you can execute each pipeline individually. This way, you will have more control over the specific execution time and interval of each pipeline.

python src/pipelines/predict.py --ts '2021-02-01 01:00:00' --interval 60
python src/pipelines/monitor_data.py --ts '2021-02-01 01:00:00' --interval 60
python src/pipelines/monitor_model.py --ts '2021-02-01 02:00:00' --interval 60

[fs-toc-omit]2. Check the pipelines in the Prefect UI

Access the Prefect UI by navigating to http://localhost:4200 in a web browser. The Perfect UI shows the executed pipelines and their current status.

ML pipelines in Prefect

[fs-toc-omit]3. Explore the monitoring metrics in the Grafana 

Open the Grafana monitoring dashboards by visiting http://localhost:3000 in a web browser.  The example contains pre-built Grafana dashboards showing data quality, target drift, and model performance metrics.

You can navigate and see the metrics as they appear on the Grafana dashboard.

Monitoring metrics in Grafana

Design prediction and monitoring pipelines 

Now, let’s explore each component of the ML model monitoring architecture.

In this section, we will explain the design of the three Prefect pipelines to monitor input data quality, model predictions, and model performance. You will understand how they work and how to modify them.

First, let’s visualize the pipeline order and dependencies.

ML monitoring pipeline

You execute three pipelines at different time intervals (T-1, T, and T+1). For each period, you make new predictions, run input data checks, and monitor model performance.

The pipelines perform the following tasks:

  1. Predict (t=0): At time T, this task processes the current data and generates predictions for the next period T+1, using the trained machine learning model.
  2. Monitor Data (t=0): This task profiles the current data and compares it to the reference. It checks for data drift or quality issues that might impact the model quality. 
  3. Monitor Model (t=0): This task evaluates the quality of the predictions made by the model at time T by comparing them to the actuals. It checks for model performance degradation or target drift that might require retraining or updating the model. Since the actuals usually arrive with a delay, you will run the model monitoring pipelines at the next time interval (T+1).

[fs-toc-omit]Prefect basics: tasks and flows

In Prefect, tasks are the fundamental building blocks of workflows. Tasks represent individual operations, such as reading data, preprocessing data, training a model, or evaluating a model.

Let’s consider a simple example below:

from prefect import flow, task

@task
def say_hello(name):
    msg = f"Hey {name}, you are awesome!"
    return msg

@task
def do_good_open_source(greeting_msg):
    skills = ['Model Quality', 'Data Drift', 'Target Drift', 'Data Quality']
    msgs = [f'{greeting_msg} - Evidently helps with: {good}! \\n' for good in skills]
    return msgs

@flow
def help_world(name):
    greeting_msg = say_hello(name)
    help_msgs = do_good_open_source(greeting_msg)
    [print(msg) for msg in help_msgs]

if __name__ == "__main__":
    help_world(name="World")

To define a task in Prefect, one can use the @task decorator. This decorator turns any Python function into a Prefect task. This example demonstrates a simple flow containing two tasks:

  • The say_hello task accepts a name argument and creates a greeting message.
  • The do_good_open_source task takes a greeting message and constructs a list of extended messages.

Flows are the backbone of Prefect workflows. They represent the relationships between tasks and define the order of execution. To create a flow, by using the @flow decorator. The help_world first calls the say_hello. The output of this task (the greeting message) is then passed as an argument to the do_good_open_source task. The resulting list of messages from do_good_open_source is printed using a list comprehension.

Running this Python module outputs looks like:

❯ python prefect_demo_tutorial.py

16:01:56.307 | INFO    | prefect.engine - Created flow run 'acrid-angelfish' for flow 'help-world'
16:01:56.953 | INFO    | Flow run 'acrid-angelfish' - Created task run 'say_hello-0' for task 'say_hello'
16:01:56.954 | INFO    | Flow run 'acrid-angelfish' - Executing 'say_hello-0' immediately...
16:01:57.289 | INFO    | Task run 'say_hello-0' - Finished in state Completed()
16:01:57.453 | INFO    | Flow run 'acrid-angelfish' - Created task run 'do_good_open_source-0' for task 'do_good_open_source'
16:01:57.454 | INFO    | Flow run 'acrid-angelfish' - Executing 'do_good_open_source-0' immediately...
16:01:57.729 | INFO    | Task run 'do_good_open_source-0' - Finished in state Completed()

**Hey World, you are awesome! - Evidently helps with Model Quality! 
Hey World, you are awesome! - Evidently helps with Data Drift! 
Hey World, you are awesome! - Evidently helps with Target Drift! 
Hey World, you are awesome! - Evidently helps with Data Quality!** 

16:01:57.855 | INFO    | Flow run 'acrid-angelfish' - Finished in state Completed('All states completed.')

Prefect can automatically log the details of the running flow and visualize them in the Prefect UI.

Running flow in Prefect

1. Prediction pipeline

This pipeline makes predictions using a pre-trained ML model. The predict function is a Prefect flow that generates predictions for a new batch of data within a specified interval.

@task
def load_data(path: Path, start_time: Text, end_time: Text) -> pd.DataFrame:
    """Load data from parquet file."""
		...

@task
def get_predictions(data: pd.DataFrame, model) -> pd.DataFrame:
    """Predictions generation."""
		...

@task
def save_predictions(predictions: pd.DataFrame, path: Path) -> None:
    """Save predictions to parquet file."""
		...

@flow(flow_run_name="predict-on-{ts}",)
def predict(
    ts: pendulum.DateTime,
    interval: int = 60
) -> None:
    """Calculate predictions for the new batch (interval) data."""

    DATA_FEATURES_DIR = 'data/features'

    # Compute the batch start and end time
    start_time, end_time = get_batch_interval(ts, interval)

    # Prepare data
    path = Path(f'{DATA_FEATURES_DIR}/green_tripdata_2021-02.parquet')
    batch_data = load_data(path, start_time, end_time)

    # Predictions generation
    model = joblib.load('models/model.joblib')
    predictions = get_predictions(batch_data, model)

    # Save predictions
    filename = ts.to_date_string()
    path = Path(f'data/predictions/{filename}.parquet')
    save_predictions(predictions, path)

The predict flow orchestrates the entire prediction process. It takes a timestamp and an interval (in minutes) as input arguments. The flow consists of the following steps:

  • Compute the batch start and end time using the given timestamp and interval.
  • Prepare data by loading it from the specified Parquet file using the load_data task.
  • Load the trained machine learning model from a Joblib file.
  • Generate predictions for the batch data using the get_predictions task.
  • Save the generated predictions to a Parquet file using the save_predictions task.

By defining the predict as a Prefect flow, you create a reusable and modular pipeline for generating predictions on new data batches.

💡 Note: The predict flow is decorated with the @flow decorator, which includes the flow_run_name parameter that gives a unique name for each flow run based on the timestamp (ts).

2. Data monitoring pipeline

The data quality monitoring pipeline tracks the quality and stability of the input data. We'll use Evidently to perform data profiling and generate a data quality report.

💡 For ease of demonstration, we check for both input data and the prediction distribution drift as a part of the data quality pipeline. (Both included in the Data Drift Preset). You may want to split these tasks into separate pipelines in your projects.

The code snippet below from the src/pipelines/monitor_data.py shows how to create a Prefect flow to monitor data quality and data drift in a machine learning pipeline.

@task
def prepare_current_data(start_time: Text, end_time: Text) -> pd.DataFrame:
    """Merge the current data with the corresponding predictions."""
    ...

@task
def generate_reports(
    current_data: pd.DataFrame,
    reference_data: pd.DataFrame,
    num_features: List[Text],
    cat_features: List[Text],
    prediction_col: Text
) -> None:
    """
    Generate data quality and data drift reports and
    commit metrics to the database."""
    ...

@flow(flow_run_name="monitor-data-on-{ts}")
def monitor_data(
    ts: pendulum.DateTime,
    interval: int = 60) -> None:
    """Build and save data validation reports."""

    num_features = ['passenger_count', 'trip_distance', 'fare_amount', 'total_amount']
    cat_features = ['PULocationID', 'DOLocationID']
    prediction_col = 'predictions'

    # Prepare current data
    start_time, end_time = get_batch_interval(ts, interval)
    current_data: pd.DataFrame = prepare_current_data(start_time, end_time)

    # Prepare reference data
    DATA_REF_DIR = 'data/reference'
    ref_path = f'{DATA_REF_DIR}/reference_data_2021-01.parquet'
    ref_data = pd.read_parquet(ref_path)
    columns: List[Text] = num_features + cat_features + [prediction_col]
    reference_data = ref_data.loc[:, columns]

    generate_reports(
		    current_data=current_data,
		    reference_data=reference_data,
		    num_features=num_features,
		    cat_features=cat_features,
		    prediction_col=prediction_col
    )

The monitor_data flow orchestrates the data monitoring process. It takes a timestamp ts and an interval (in minutes) as input arguments. The flow consists of the following steps:

  • Prepare the current data by merging it with the corresponding predictions using the prepare_current_data task.
  • Load the reference data from the Parquet file and select the relevant columns.
  • Generate data quality and data drift reports using the generate_reports task.

Let’s dive deeper into the generate_reports task!

@task
def generate_reports(
    current_data: pd.DataFrame,
    reference_data: pd.DataFrame,
    num_features: List[Text],
    cat_features: List[Text],
    prediction_col: Text
) -> None:
    """
    Generate data quality and data drift reports and
    commit metrics to the database."""

    # Prepare column_mapping object for Evidently reports
    column_mapping = ColumnMapping()
    column_mapping...

    # Data quality report
    data_quality_report = Report(metrics=[DatasetSummaryMetric()])
    data_quality_report.run(...)

    # Data drift report
    data_drift_report = Report(metrics=[DataDriftPreset()])
    data_drift_report.run(...)

    # Commit metrics into database
    ...
    commit_data_metrics_to_db(...)

This task generates a set of Evidently metrics related to the data quality and data drift. It takes the current data, reference data, numerical features, categorical features, and the prediction column as input arguments and computes two reports. 

The Data Quality report includes the DatasetSummaryMetric. It profiles the input dataset by computing metrics like the number of observations, missing values, empty and almost empty values, etc.

Evidently Data Summary table
An HTML version of the DatasetSummaryMetric.
🚦 Conditional data validation. In this example, we compute and log the model quality metrics. As an alternative, you can directly check if the input data complies with certain conditions (for example, if there are features out of range, schema violations, etc.) and log the pass/fail test result in addition to metric values. In this case, use Evidently Test Suites

The Data Drift report includes the DatasetDriftPreset. It compares the distributions of the features and predictions between the current and reference dataset. We do not pass any custom parameters, so it uses the default Evidently drift detection algorithm.

Evidently data drift table

In this case, we do not generate the visual reports using Evidently, but instead, get the metrics as a Python dictionary using .as_dict()  Evidently output format. This output includes the metric values, relevant metadata (such as applied drift detection method and threshold), and even optional visualization information, such as histogram bins. 

This task then commits the computed metrics to the database for future analysis and visualization.

💡 Customizing the Metrics. In this example, we use only a couple of metrics available in Evidently. You can browse other metrics in these sample notebooks or explore the list of 100+ Metrics and Tests to choose those suitable for your use case.

3. Model monitoring pipeline

The model performance monitoring pipeline tracks the model quality over time. It uses Evidently to generate model quality metrics and compare the distribution of the model target against the reference period (evaluate target drift).

The code snippet below from the src/pipelines/monitor_model.py demonstrates how to create a Prefect flow for model monitoring:

@task
def generate_reports(
    current_data: pd.DataFrame,
    reference_data: pd.DataFrame,
    num_features: List[Text],
    cat_features: List[Text],
    prediction_col: Text,
    target_col: Text
) -> None:
    """
    Generate data quality and data drift reports
    and commit metrics to the database."""

    # Prepare column_mapping object for Evidently reports
    column_mapping = ColumnMapping()
    column_mapping...

    # Create a model performance report
    model_performance_report = Report(metrics=[RegressionQualityMetric()])
    ...

    # Target drift report
    target_drift_report = Report(metrics=[ColumnDriftMetric(target_col)])
    ...

    # Save metrics to database
    commit_model_metrics_to_db()

@flow(flow_run_name="monitor-model-on-{ts}")
def monitor_model(
    ts: pendulum.DateTime,
    interval: int = 60
) -> None:
    """Build and save monitoring reports."""

    DATA_REF_DIR = 'data/reference'
    ...

    # Prepare current data
    start_time, end_time = get_batch_interval(ts, interval)
    current_data = prepare_current_data(start_time, end_time)

    # Prepare reference data
    ref_path = f'{DATA_REF_DIR}/reference_data_2021-01.parquet'
    reference_data = pd.read_parquet(ref_path)
    ...

    # Generate reports
    generate_reports(
        current_data=current_data,
        reference_data=reference_data,
        num_features=num_features,
        cat_features=cat_features,
        prediction_col=prediction_col,
        target_col=target_col)

The monitor_model Prefect flow generates the relevant metrics and commits them to the database. It consists of the following steps:

  • Prepare the current data by merging it with the corresponding predictions using the prepare_current_data task.
  • Load the reference data.
  • Generate the model performance, data quality, and data drift reports using the generate_reports task.

The generate_reports task generates model performance and target drift reports. It utilizes the following Evidently metrics:

  • RegressionQualityMetric to create a model performance report. It computes metrics like Mean Error, Mean Absolute Error, etc. 
  • ColumnDriftMetric to create a target drift report. We compute this metric for the Target column to check for distribution drift.

Store and visualize ML metrics

Now, let’s look at what happens with the computed metrics.

[fs-toc-omit]1. Store the metrics in PostgreSQL

In this example, we use SQLAlchemy, a popular Python SQL toolkit and Object-Relational Mapper (ORM), to interact with the PostgreSQL database. With SQLAlchemy, you can define the table schema using Python classes and easily manage the database tables using Python code.

We prepared a Python script named create_db.py, that creates the database tables required for storing monitoring metrics.

python src/scripts/create_db.py.

The table models are defined in the src/utils/models.py module. 

For example, the TargetDriftTable class represents a table schema for storing target drift metrics in the PostgreSQL database.

class TargetDriftTable(Base):
    """Implement table for target drift metrics.
    Evidently metric functions:
        - ColumnDriftMetric from target column
    """

    __tablename__ = 'target_drift'
    id = Column(Integer, primary_key=True)
    timestamp = Column(Float)
    stattest_name = Column(String)
    stattest_threshold = Column(Float)
    drift_score = Column(Float)
    drift_detected = Column(Boolean)

This DB table contains the following columns for monitoring purposes:

  • timestamp stores the timestamp when the target drift metrics were computed.
  • stattest_name stores the name of the drift detection method.
  • stattest_threshold stores the threshold value for the drift detection method.
  • drift_score stores the computed drift score.
  • drift_detected stores a True or False value, indicating whether drift was detected based on the drift score and the threshold value.

It works similarly for other tables in the database.
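
To show how the drift results end up in this table, here is one possible version of a commit helper. The dictionary keys mirror the ColumnDriftMetric output from the sketch above; the connection string and function name are illustrative, not the repository's exact implementation:

from typing import Dict

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from src.utils.models import TargetDriftTable

# Illustrative connection string
engine = create_engine("postgresql://admin:admin@localhost:5432/monitoring_db")
Session = sessionmaker(bind=engine)


def commit_target_drift_metrics(drift_result: Dict, ts: float) -> None:
    """Write one row of target drift metrics extracted from an Evidently report."""
    row = TargetDriftTable(
        timestamp=ts,  # e.g., the flow's pendulum timestamp converted via .timestamp()
        stattest_name=drift_result["stattest_name"],
        stattest_threshold=drift_result["stattest_threshold"],
        drift_score=drift_result["drift_score"],
        drift_detected=drift_result["drift_detected"],
    )
    with Session() as session:
        session.add(row)
        session.commit()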

📊 How do the data drift checks work? You can explore the "Which test is the best" research blog post to understand the behavior of different drift detection methods. To understand the parameters of the Evidently drift checks, check the documentation.
💡 Why not Prometheus? A popular combination is to use Grafana together with Prometheus. In this case, we opt for a SQL database. The reason is that Prometheus is well-suited to storing time series metrics in near real-time. However, it is not convenient for writing delayed or backfilled data. In our case, we compute metrics on a schedule (which can be as infrequent as once per day or week) and compute model quality metrics with a delay. Using Prometheus would also add another service to run (and monitor!).

[fs-toc-omit]2. Visualize metrics in Grafana

We prepared Grafana dashboard configurations to visualize the collected metrics, as well as data source configurations to connect Grafana to the PostgreSQL database. You can find them in the grafana/ directory.

Grafana metrics

After you launch the monitoring cluster, Grafana will connect to the monitoring database and create dashboards based on the templates.

As an example, let's explore the Evidently Numerical Target Drift dashboard, which provides insights into the distribution drift of the model target.

Grafana metrics Target drift

The top widgets show the name of the drift detection method (in this case, Wasserstein distance) and the threshold values. 

The middle and bottom widgets display the history of drift checks for each period. This helps identify specific time points when drift occurred and the corresponding drift scores. You can understand the severity of drift and decide whether you want to intervene.

You can easily customize the dashboard widgets and scripts used to build them.

Grafana metrics customization
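
Each widget is backed by a SQL query against the monitoring database. If you want to prototype or debug a query before putting it into a Grafana panel, you can run the same SQL from Python. The query below is an illustrative example against the target_drift table, not necessarily the exact query shipped with the dashboards:

import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; use the same credentials as the Grafana data source
engine = create_engine("postgresql://admin:admin@localhost:5432/monitoring_db")

# History of drift checks: one row per monitoring run
query = """
    SELECT
        to_timestamp(timestamp) AS ts,
        stattest_name,
        drift_score,
        drift_detected
    FROM target_drift
    ORDER BY timestamp
"""

drift_history = pd.read_sql(query, con=engine)
print(drift_history.tail())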
Alerts. You can also use Grafana to define alert conditions and get notified when metrics fall outside the expected range.

Customize for your data

To adapt this example for your machine learning projects, follow these guidelines:

Data inputs. Modify the load_data task to load your dataset from the relevant data source (e.g., CSV, parquet file, or database).

Model inference. Replace the existing model with your trained machine learning model. You may need to adjust the get_predictions task to ensure compatibility with your chosen algorithm and data format.
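
For example, if your model is a scikit-learn estimator saved with joblib, a prediction task could look roughly like this. The model path, feature list, and output column name are assumptions for illustration, not the exact repository code:

from typing import List, Optional, Text

import joblib
import pandas as pd
from prefect import task


@task
def get_predictions(
    data: pd.DataFrame,
    model_path: Text = "models/model.joblib",  # illustrative path
    feature_cols: Optional[List[Text]] = None,
) -> pd.DataFrame:
    """Score a batch of data with a pre-trained model and attach predictions."""
    model = joblib.load(model_path)
    features = data[feature_cols] if feature_cols else data
    data = data.copy()
    data["predictions"] = model.predict(features)
    return data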

Monitoring metrics. Customize the monitoring tasks to include metrics relevant to your project. Consider including data quality, data drift, target drift, and model performance checks. Update the generate_reports task with the appropriate Evidently metrics or tests.

💡 Evidently Metrics and Tests. You can browse other metrics in these sample notebooks or explore the list of 100+ Metrics and Tests to choose those suitable for your use case. 
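
For instance, if you prefer pass/fail checks over raw metric values, you could swap a Report for an Evidently TestSuite. This sketch uses built-in test presets and toy DataFrames for illustration:

import pandas as pd
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset, DataQualityTestPreset

reference_data = pd.DataFrame({"feature_1": [1.0, 2.0, 3.0, 4.0]})
current_data = pd.DataFrame({"feature_1": [2.0, 3.0, 4.0, 5.0]})

# Run a suite of pre-built checks instead of (or in addition to) metric reports
suite = TestSuite(tests=[DataQualityTestPreset(), DataDriftTestPreset()])
suite.run(reference_data=reference_data, current_data=current_data)

# as_dict() exposes per-test statuses you can store in the database or alert on
for test in suite.as_dict()["tests"]:
    print(test["name"], "->", test["status"])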

Database setup. Modify the database configuration and the table models in src/utils/models.py to store the monitoring metrics relevant to your project.

Reference dataset. Define a representative reference dataset and period suitable for your use case. This should be a stable data snapshot that captures the typical distribution and characteristics of the features and target variable. The period should be long enough to reflect the variations in the data. Consider the specific scenario and seasonality: sometimes, you might use a moving reference, for example, by comparing each week to the previous. 

Store the reference dataset or generate it on the fly. Consider the size of the data and resources available for processing. Storing the reference dataset may be preferable for larger or more complex datasets, while generating it on the fly could be more suitable for smaller or highly dynamic datasets.
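
As an illustration, a moving reference can be as simple as shifting the selection window back by one period. The column name and interval below are assumptions:

import pandas as pd


def get_moving_reference(
    data: pd.DataFrame,
    current_start: pd.Timestamp,
    current_end: pd.Timestamp,
) -> pd.DataFrame:
    """Use the window immediately before the current one as the reference period."""
    window = current_end - current_start
    ref_start, ref_end = current_start - window, current_start
    # Assumes a "timestamp" column holding pandas datetimes
    mask = (data["timestamp"] >= ref_start) & (data["timestamp"] < ref_end)
    return data.loc[mask]


# Example: compare the current week against the previous week
data = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=21, freq="D"),
    "target": range(21),
})
reference_data = get_moving_reference(
    data, pd.Timestamp("2021-01-15"), pd.Timestamp("2021-01-22")
)
print(reference_data["timestamp"].min(), reference_data["timestamp"].max())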

Grafana dashboards. Customize the Grafana dashboards to visualize the specific monitoring metrics relevant to your project. This may involve updating the SQL queries and creating additional visualizations to display the results of your custom metrics.

Architecture pros and cons

Batch ML monitoring architecture

[fs-toc-omit]Pros

Applicable for both batch and real-time. This monitoring architecture can be used for both batch and real-time ML systems. With batch inference, you can directly follow this example, adding data validation or monitoring steps to your existing pipelines. For real-time systems, you can log model inputs and predictions to the database and then read the prediction logs from the database at a defined interval. 
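
For the real-time case, the only change is where the current data comes from: instead of running batch inference inside the flow, the monitoring job reads recent prediction logs from the database. A minimal sketch, assuming a predictions table with a prediction_time column (table and column names are illustrative):

import pandas as pd
import pendulum
from sqlalchemy import create_engine, text

# Illustrative connection string and table name
engine = create_engine("postgresql://admin:admin@localhost:5432/monitoring_db")


def load_prediction_logs(start_time, end_time) -> pd.DataFrame:
    """Read logged model inputs and predictions for the given time window."""
    query = text(
        "SELECT * FROM predictions "
        "WHERE prediction_time >= :start_time AND prediction_time < :end_time"
    )
    with engine.connect() as conn:
        return pd.read_sql(
            query, con=conn, params={"start_time": start_time, "end_time": end_time}
        )


# Example: read the last hour of predictions and monitor it as the current batch
end_time = pendulum.now()
current_data = load_prediction_logs(end_time.subtract(hours=1), end_time)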

Async metrics computation. Metrics computation is separate from model serving. This way, it does not affect the model serving latency in cases when this is relevant. 

Adaptable. You can replace specific components with those you already use. For example, you can use the same database where you store model predictions, or a different workflow orchestrator, such as Airflow. (Here is an example integration of how to use Evidently with Airflow). You can also replace Grafana with a different BI dashboard and make use of the additional visualization information available in the Evidently JSON/Python dictionary output to recreate some of the Evidently visualizations faster.

[fs-toc-omit]Cons

Might be too “heavy.” We recommend using this or a similar architecture when you already use one or two of the mentioned tools as part of your workflow. For example, you already use Grafana for software monitoring or Prefect to orchestrate data flows.

However, it might be suboptimal to introduce several complex new services to monitor a few ML models, especially if this is infrequent batch inference.  

[fs-toc-omit]Alternatives

Evidently and Streamlit integration

You can use Evidently to compute HTML reports and store them in any object storage. In this case, you will implement a complete “offline” monitoring setup as a set of monitoring jobs. You will also make use of the rich pre-built Evidently visuals for debugging. 
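
A sketch of this approach: build a report, save it as a standalone HTML file, and upload it to object storage (here S3 via boto3; the bucket name and key are illustrative):

import boto3
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference_data = pd.DataFrame({"feature_1": [1.0, 2.0, 3.0, 4.0]})
current_data = pd.DataFrame({"feature_1": [2.0, 3.0, 4.0, 5.0]})

# Build a standalone HTML report with the pre-built Evidently visuals
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_data, current_data=current_data)
report.save_html("data_drift_report.html")

# Upload to object storage so the report can be served later, e.g., by a Streamlit app
s3 = boto3.client("s3")
s3.upload_file(
    "data_drift_report.html", "my-monitoring-reports", "reports/data_drift_report.html"
)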

Here is an example tutorial on using Evidently with Streamlit to host HTML reports.

[fs-toc-omit]Support Evidently
Did you enjoy the blog? Star Evidently on GitHub to contribute back! This helps us continue creating free, open-source tools and content for the community.

⭐️ Star on GitHub ⟶

Summing up

This tutorial demonstrated how to integrate Evidently into production pipelines using Prefect, PostgreSQL, and Grafana. 

  • You built a solution that consists of three consecutive pipelines for data quality checks, model predictions, and model quality monitoring.
  • You learned how to store the monitoring metrics in PostgreSQL and visualize them in Grafana.

You can further work with this example:

  • Adapt it for your data, both for batch and real-time ML systems.
  • Customize the specific monitoring metrics using the Evidently library. 
  • Use this architecture as an example and replace individual components.
Mikhail Rozhkov, Guest author (https://www.linkedin.com/in/mnrozhkov/)
Elena Samuylova, Co-founder and CEO, Evidently AI (https://www.linkedin.com/in/elenasamuylova/)
