Our CTO, Emeli Dral, was an instructor for the ML Monitoring module of MLOps Zoomcamp 2023, a free MLOps course from DataTalks.Club. In case you missed it or fancy a refresher, we put together ML monitoring module course notes for you. Let’s recap!
About MLOps Zoomcamp
The course covers MLOps best practices and the ML model lifecycle end-to-end, from experiment tracking to building ML pipelines to monitoring machine learning models after deployment. The curriculum of MLOps Zoomcamp is practice-oriented. Students implement best MLOps practices from experiments to production, working with open-source MLOps tools like MLflow, Prefect, Grafana, and Evidently.
MLOps Zoomcamp is built for data scientists and ML engineers, as well as software engineers and data engineers interested in production machine learning.
The course includes six modules. In this blog post, we summarize the ML Monitoring module. It is self-contained, so you can go through it without taking the previous ones.
Want to learn more about ML monitoring?
Sign up for our Open-source ML observability course. Designed for data scientists and ML engineers. Yes, it's free!
ML Monitoring module overview
Machine learning model monitoring is an essential component of MLOps. Many things can go wrong once you deploy ML models to the real world. To detect and resolve issues before they affect your production ML service, you need an ML monitoring system.
The MLOps Zoomcamp ML monitoring module is all about that! It covers the basics of monitoring ML models in production and demonstrates, step by step, how to implement an ML monitoring system using open-source tools.
The module includes eight videos. The first video goes through the key ML monitoring concepts. It also introduces an ML monitoring architecture that uses Evidently for metric calculation and a Grafana dashboard for metric visualization. The following videos walk through the code implementation. They cover training a simple ML model, designing an ML monitoring dashboard, and going through the debugging process when drift is detected.
Below, we will summarize the course notes and link to all the practical videos.
Introduction to ML model monitoring
Let’s take a look at the key monitoring concepts!
Video 1. MLOps Zoomcamp: Introduction to ML monitoring, by Emeli Dral.
ML monitoring metrics
When monitoring a production service, one usually keeps tabs on service health metrics like uptime, memory, or latency. While service health monitoring is a must, an extra layer is related to the data and the ML model itself.
ML model performance
ML model performance metrics help to ensure that ML models work as expected. A particular set of metrics will depend on the use case. For example:
- You can use ranking metrics for search engines or content recommender systems.
- In regression problems, looking at metrics like Mean Absolute Error (MAE) or Mean Absolute Percentage Error (MAPE) is useful.
- In classification problems, calculate metrics like log loss, precision, and recall.
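For intuition, here is how a few of these metrics can be computed from scratch with NumPy (a minimal sketch; in practice you would typically rely on a library such as scikit-learn or Evidently):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error for regression."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error; assumes no zero targets."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss (cross-entropy) for classification."""
    p = np.clip(np.asarray(p_pred, float), eps, 1 - eps)
    y = np.asarray(y_true, float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(mae([10, 20, 30], [12, 18, 33]))          # ≈ 2.33
print(mape([10, 20, 30], [12, 18, 33]))         # ≈ 13.33
print(log_loss([1, 0, 1], [0.9, 0.2, 0.8]))     # ≈ 0.18
```
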
Data quality and integrity
Often, ground truth is not immediately available to calculate ML model performance metrics. In this case, you can use proxy metrics. In most cases, when something goes wrong with the model, it is due to data quality and integrity issues. Some metrics to track are the share of missing values, column types, and the value range of each column.
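With pandas, these checks take only a few lines (the column names and values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy batch of prediction logs (illustrative columns)
df = pd.DataFrame({
    "trip_distance": [1.2, 3.4, np.nan, 0.8, 250.0],  # 250 looks suspicious
    "passenger_count": [1, 2, 1, None, 3],
})

# Share of missing values per column
missing_share = df.isna().mean()

# Column types
dtypes = df.dtypes

# Value range per numeric column
value_ranges = df.describe().loc[["min", "max"]]

print(missing_share)
print(value_ranges)
```
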
Data drift and concept drift
Even if the data is fine, you can still face some problems. ML models work in the real world, and things change. To ensure ML models are still relevant, you can look at data and concept drift. Distribution changes between the current and reference data may signal potential problems with the model.
To sum up. Service health, ML model performance, data quality and integrity, and data and concept drift are good starting points for monitoring ML models in production. Depending on the use case and available resources, you can introduce more comprehensive monitoring metrics like model fairness, bias, outliers, explainability, etc.
Reusing the existing architecture
If there are already some production services deployed that you monitor, or you use some business intelligence tools, consider reusing existing systems to start with ML monitoring. Depending on the infrastructure and systems in place, you can:
- Add ML metrics to service health monitoring (e.g., Grafana or Prometheus),
- Build ML-focused dashboards (e.g., Tableau, Looker, or Plotly).
To sum up. Reusing the existing monitoring architecture for ML models can save time and resources as you don't need to build a new monitoring system from scratch. You can start by adding a couple of dashboards and expand to a more sophisticated system later.
Monitoring batch and non-batch ML models
The way we deploy our models influences how we implement ML monitoring.
Batch models allow calculating metrics in batch mode. For example, to calculate drift metrics, you need to compare two distributions: a reference dataset (e.g., validation data or a previous batch) and the most recent batch of data. Model quality metrics (e.g., precision and recall) can also be calculated on top of a data batch.
Non-batch models (e.g., models operating as REST API services) are trickier. While metrics like missing values or range violations can be calculated in real time, metrics like data drift or model performance are best computed over an accumulated batch of data.
Pro-tip. For non-batch ML models, you can use window functions to perform statistical tests on continuous data streams. Pick a window function (e.g., a moving window with or without a moving reference), choose the window and step size, and compare the windows.
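This pro-tip can be sketched as follows, assuming a fixed reference window and a moving current window over a synthetic stream (window and step sizes here are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated stream whose distribution shifts halfway through
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(1.5, 1, 500)])
reference = rng.normal(0, 1, 500)  # fixed reference window

window_size, step = 200, 100
results = []
for start in range(0, len(stream) - window_size + 1, step):
    window = stream[start:start + window_size]
    # Compare each moving window against the reference distribution
    _, p_value = stats.ks_2samp(reference, window)
    results.append((start, p_value))
    flag = "DRIFT" if p_value < 0.05 else "ok"
    print(f"window [{start}:{start + window_size}] p-value={p_value:.3g} {flag}")
```

Windows that fall entirely after the shift should flag drift, while early windows should not.
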
ML monitoring scheme
As a practice for the MLOps Zoomcamp ML Monitoring module, we implemented an ML monitoring scheme that can be used for batch and non-batch machine learning models. The following videos will explain step-by-step how to:
- Use a prediction service. We will use a service that predicts the duration of taxi trips in NYC. It can be implemented as a REST API service or batch prediction pipeline.
- Simulate production usage of the service. We will generate prediction logs and store them in a logging system as local files.
- Implement monitoring jobs. We will use Prefect and the open-source Evidently library to calculate metrics.
- Load metrics into a database. We will use PostgreSQL as a database.
- Build a dashboard. We will use Grafana to build a dashboard on top of the database.
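In plain Python, one run of this scheme boils down to a load → calculate → store loop. The skeleton below is a toy stand-in with illustrative names; the following videos implement the real version with Evidently, Prefect, PostgreSQL, and Grafana:

```python
def load_batch(batch_id):
    """Stand-in for reading the latest prediction logs."""
    return [{"prediction": 10.0 + batch_id, "trip_distance": 2.5}] * 50

def calculate_metrics(batch):
    """Stand-in for an Evidently-based metric calculation."""
    preds = [row["prediction"] for row in batch]
    return {"mean_prediction": sum(preds) / len(preds), "n_rows": len(batch)}

def save_metrics(db, batch_id, metrics):
    """Stand-in for inserting a row into PostgreSQL."""
    db.append({"batch_id": batch_id, **metrics})

db = []  # stand-in for the metrics database behind Grafana
for batch_id in range(3):  # the monitoring job, run once per batch
    batch = load_batch(batch_id)
    save_metrics(db, batch_id, calculate_metrics(batch))

print(db[0])
```
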
ML monitoring system step-by-step
It’s practice time! Let's now walk through an end-to-end example to connect the dots and implement the ML monitoring scheme using Evidently, Prefect, and Grafana.
You can find the complete code in the MLOps Zoomcamp GitHub repository.
Video 2. MLOps Zoomcamp: Environment setup, by Emeli Dral.
In this video, we set up the environment for the machine learning monitoring system.
00:00 Create a working environment and requirements
02:00 Create and configure Docker Compose
03:35 Configure services: PostgreSQL, Adminer, and Grafana
07:15 Create and test services with Docker Compose
That’s it! You have successfully created your working environment, installed Python packages, and created a Docker Compose file.
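For reference, a Docker Compose file along these lines wires up the three services (image versions, ports, and credentials here are illustrative; see the course repository for the exact file):

```yaml
version: '3.7'

services:
  db:
    image: postgres
    restart: always
    environment:
      POSTGRES_PASSWORD: example
    ports:
      - "5432:5432"

  adminer:
    image: adminer
    restart: always
    ports:
      - "8080:8080"

  grafana:
    image: grafana/grafana
    restart: always
    ports:
      - "3000:3000"
```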
Prepare reference data and model
Video 3. MLOps Zoomcamp: Prepare reference data and model, by Emeli Dral
In this part, we prepare a reference dataset and train a baseline model to use as a reference point in calculating ML monitoring metrics.
01:31 Import libraries
04:28 Download and load data
11:30 Preprocess data, filter out outliers, check target function distribution
13:25 Select features and train a linear regression model
17:45 Evaluate ML model quality
18:50 Create a reference dataset
Done! Now, we have a reference dataset and a baseline ML model to simulate the production use of our prediction service.
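The same steps can be sketched on synthetic data (the features below are made-up stand-ins for the NYC taxi columns used in the video):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the taxi trip data (feature names are illustrative)
rng = np.random.default_rng(7)
n = 2000
X = np.column_stack([
    rng.uniform(0.5, 20, n),   # trip distance
    rng.integers(1, 5, n),     # passenger count
])
y = 3 * X[:, 0] + rng.normal(0, 2, n)  # trip duration in minutes

# Train/validation split; the validation slice doubles as reference data
X_train, X_val = X[:1500], X[1500:]
y_train, y_val = y[:1500], y[1500:]

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_val)
print("validation MAE:", round(mean_absolute_error(y_val, preds), 2))

# Reference dataset = validation features plus predictions, for later drift checks
reference = np.column_stack([X_val, preds])
```
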
Calculate ML monitoring metrics
Video 4. MLOps Zoomcamp: ML monitoring metrics calculation, by Emeli Dral
In this video, we use the open-source Evidently library to calculate ML monitoring metrics.
00:00 Introduction to Evidently library: Reports and Metrics
04:40 Generate and display Evidently Report in HTML
06:00 How to interpret Evidently Report: data drift detection example
06:50 Display Evidently Report as a Python dictionary and derive selected values
That’s it! We calculated ML monitoring metrics and learned how to display an Evidently Report and derive values from it.
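For automation, the key step is reading the report as a dictionary. The snippet below mimics the rough shape of a drift report's `as_dict()` output (an assumption; field names vary across Evidently versions) and shows how to pull out a single value:

```python
# Rough shape of report.as_dict() for a drift report (an assumption;
# check your Evidently version for the exact field names)
report_dict = {
    "metrics": [
        {
            "metric": "ColumnDriftMetric",
            "result": {"column_name": "trip_distance",
                       "drift_score": 0.12,
                       "drift_detected": False},
        },
        {
            "metric": "DatasetDriftMetric",
            "result": {"share_of_drifted_columns": 0.25,
                       "dataset_drift": False},
        },
    ]
}

def metric_result(report, name):
    """Find the result block of a metric by its name."""
    for m in report["metrics"]:
        if m["metric"] == name:
            return m["result"]
    raise KeyError(name)

print(metric_result(report_dict, "ColumnDriftMetric")["drift_score"])  # 0.12
```
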
Set up dummy monitoring
Video 5. MLOps Zoomcamp: Dummy monitoring, by Emeli Dral
In this video, we create dummy metrics and set up a database for our Grafana dashboard.
00:00 Create a Python script for dummy metrics calculation
03:10 Prepare a database and create a table for dummy metrics
06:00 Calculate dummy metrics and load them into the table
07:40 Create a cycle and define the main function
09:00 Add sending timeout to simulate production usage of the service
10:00 Test the script: access the PostgreSQL database and create a dashboard in Grafana
Congratulations! Our configuration files are now correct: we can access our database, load the data, and build a dashboard in Grafana.
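A stripped-down version of such a dummy-metrics script, using SQLite from the standard library as a stand-in for the Dockerized PostgreSQL so the sketch stays self-contained:

```python
import random
import sqlite3
import time
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
conn.execute("""CREATE TABLE dummy_metrics (
    ts TIMESTAMP, value1 REAL, value2 INTEGER, value3 REAL)""")

ts = datetime(2023, 6, 1)
SEND_TIMEOUT = 0.01  # seconds between inserts; much longer in a real setup

for i in range(5):
    # One made-up metrics row per iteration
    row = (ts.isoformat(), random.random(), random.randint(0, 20), random.random())
    conn.execute("INSERT INTO dummy_metrics VALUES (?, ?, ?, ?)", row)
    conn.commit()
    ts += timedelta(minutes=1)
    time.sleep(SEND_TIMEOUT)  # simulate production pacing

print(conn.execute("SELECT COUNT(*) FROM dummy_metrics").fetchone()[0])  # 5
```
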
Set up data quality monitoring
Video 6. MLOps Zoomcamp: Data quality monitoring, by Emeli Dral
In this video, we create an actual dashboard for drift detection.
We will use Evidently to calculate the monitoring metrics:
- Column drift (ColumnDriftMetric) measures whether the distribution of individual features (columns) has changed.
- Dataset drift (DatasetDriftMetric) shows whether the distribution of the entire dataset has changed.
- Missing values (DatasetMissingValuesMetric) measures the proportion of missing values in the dataset.
We will use Prefect to orchestrate calculating and storing drift metrics. We will store these metrics in a PostgreSQL database and visualize them using Grafana.
00:00 Alter the script and load the reference data and the model
02:45 Create Evidently data drift report and derive values of selected metrics
07:00 Test and debug the script
08:30 Transform the script to Prefect pipelines
10:40 Build and customize the Grafana dashboard
Save Grafana dashboard
Video 7. MLOps Zoomcamp: Save Grafana dashboard, by Emeli Dral
In this video, we show how to save the Grafana dashboard so we can load it every time we rerun the Docker container without rebuilding the dashboard from scratch.
Debug with Evidently Reports and Test Suites
Video 8. MLOps Zoomcamp: Debugging with Evidently Reports and Test Suites, by Emeli Dral
Here is a quick refresher on the Evidently components we will use:
- Reports compute and visualize 100+ metrics across data quality, data drift, and model performance. You can use built-in report presets to generate visuals with just a couple of lines of code.
- Test Suites perform structured data and ML model quality checks. They verify conditions and show which pass or fail. You can start with default test conditions or design your own.
00:00 How to use Evidently to debug ML models and data
04:20 Load data and model
06:50 Run Evidently data drift Test Suite and visualize the results
09:50 How to interpret the results and analyze data drift with Test Suites
13:30 Build Evidently data drift Report and visualize the results
14:15 How to interpret the results and analyze data drift with Reports
Want a deeper dive into ML observability?
Sign up for our free Open-source ML observability course that starts on October 16, 2023! You can join at any point; all course materials will be available after the course.
📚 We will cover the key concepts of ML monitoring and observability, different types of evaluations, and how to integrate them into ML pipelines.
💻 The course is practice-oriented. We will provide end-to-end deployment blueprints and walk you through the code. You will use tools like Evidently, MLflow, Airflow, and Grafana.
👩‍🎓 Whether you are a data scientist, ML engineer, technical product manager, or analyst, we hope you will find the course useful!
🗓 Start date: October 16, 2023.
ML monitoring is a crucial component of MLOps. It helps ensure that machine learning models remain reliable and relevant to the environment in which they operate. By tracking data inputs, predictions, and outcomes, you can get visibility into how well the model is doing and resolve issues before they affect the performance of an ML service.
Start small and add complexity as you scale. Service health, ML model performance, data quality and integrity, and data and concept drift are good starting points for monitoring ML models in production. From there, you can introduce more comprehensive monitoring metrics iteratively.
You can implement the complete ML monitoring workflow using open-source tools. This tutorial demonstrated how to build an ML monitoring scheme for batch and non-batch machine learning models using Evidently, Prefect, PostgreSQL, and Grafana.