Our CTO, Emeli Dral, was an instructor for the ML Monitoring module of MLOps Zoomcamp 2023, a free MLOps course from DataTalks.Club. In case you missed it or fancy a refresher, we put together ML monitoring module course notes for you. Let’s recap!
About MLOps Zoomcamp
The course covers MLOps best practices and the ML model lifecycle end-to-end, from experiment tracking to building ML pipelines to monitoring machine learning models after deployment. The curriculum of MLOps Zoomcamp is practice-oriented. Students implement best MLOps practices from experiments to production, working with open-source MLOps tools like MLflow, Prefect, Grafana, and Evidently.
MLOps Zoomcamp is built for data scientists and ML engineers, as well as software engineers and data engineers interested in production machine learning.
The course includes six modules. In this blog post, we summarize the ML Monitoring module. It is self-contained, so you can go through it without taking the previous ones.
Want to learn more about ML monitoring?
Sign up for our Open-source ML observability course. Designed for data scientists and ML engineers. Yes, it's free!
ML Monitoring module overview
Machine learning model monitoring is an essential component of MLOps. Many things can go wrong once you deploy ML models to the real world. To detect and resolve issues before they affect your production ML service, you need an ML monitoring system.
The MLOps Zoomcamp ML monitoring module is all about that! It covers the basics of monitoring ML models in production and demonstrates, step by step, how to implement an ML monitoring system using open-source tools.
The module includes eight videos. The first video goes through the key ML monitoring concepts. It also introduces an ML monitoring architecture that uses Evidently for metric calculation and a Grafana dashboard for metric visualization. The following videos walk through the code implementation. They cover training a simple ML model, designing an ML monitoring dashboard, and going through the debugging process when drift is detected.
Below, we will summarize the course notes and link to all the practical videos.
Introduction to ML model monitoring
Let’s take a look at the key monitoring concepts!
Video 1. MLOps Zoomcamp: Introduction to ML monitoring, by Emeli Dral.
ML monitoring metrics
When monitoring a production service, one usually keeps tabs on service health metrics like uptime, memory, or latency. While service health monitoring is a must, an extra layer is related to the data and the ML model itself.
ML model performance
ML model performance metrics help to ensure that ML models work as expected. A particular set of metrics will depend on the use case. For example:
- You can use ranking metrics for search engines or content recommender systems.
- In regression problems, looking at metrics like Mean Absolute Error (MAE) or Mean Absolute Percentage Error (MAPE) is useful.
- In classification problems, calculate metrics like log loss, precision, and recall.
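For intuition, here is how a few of these metrics can be computed from scratch with NumPy (a minimal sketch; in practice you would typically rely on a library such as scikit-learn or Evidently):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error for regression."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error; assumes no zero targets."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss (cross-entropy) for classification."""
    p = np.clip(np.asarray(p_pred, float), eps, 1 - eps)
    y = np.asarray(y_true, float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(mae([10, 20, 30], [12, 18, 33]))          # ≈ 2.33
print(mape([10, 20, 30], [12, 18, 33]))         # ≈ 13.33
print(log_loss([1, 0, 1], [0.9, 0.2, 0.8]))     # ≈ 0.18
```
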
Data quality and integrity
Often, ground truth is not immediately available to calculate ML model performance metrics. In this case, you can use proxy metrics. In most cases, when something goes wrong with the model, it is due to data quality and integrity issues. Some metrics to track are the share of missing values, column types, and the value range of each column.
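With pandas, these checks take only a few lines (the column names and values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy batch of prediction logs (illustrative columns)
df = pd.DataFrame({
    "trip_distance": [1.2, 3.4, np.nan, 0.8, 250.0],  # 250 looks suspicious
    "passenger_count": [1, 2, 1, None, 3],
})

# Share of missing values per column
missing_share = df.isna().mean()

# Column types
dtypes = df.dtypes

# Value range per numeric column
value_ranges = df.describe().loc[["min", "max"]]

print(missing_share)
print(value_ranges)
```
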
Data drift and concept drift
Even if the data is fine, you can still face some problems. ML models work in the real world, and things change. To ensure ML models are still relevant, you can look at data and concept drift. Distribution changes between the current and reference data may signal potential problems with the model.
To sum up. Service health, ML model performance, data quality and integrity, and data and concept drift are good starting points for monitoring ML models in production. Depending on the use case and available resources, you can introduce more comprehensive monitoring metrics like model fairness, bias, outliers, explainability, etc.
Reusing the existing architecture
If there are already some production services deployed that you monitor, or you use some business intelligence tools, consider reusing existing systems to start with ML monitoring. Depending on the infrastructure and systems in place, you can:
- Add ML metrics to service health monitoring (e.g., Grafana or Prometheus),
- Build ML-focused dashboards (e.g., Tableau, Looker, or Plotly).
To sum up. Reusing the existing monitoring architecture for ML models can save time and resources as you don't need to build a new monitoring system from scratch. You can start by adding a couple of dashboards and expand to a more sophisticated system later.
Monitoring batch and non-batch ML models
The way we deploy our models influences how we implement ML monitoring.
Batch models allow calculating metrics in batch mode. For example, to calculate drift metrics, you need to compare two distributions: a reference dataset (e.g., validation data or a previous batch) and the most recent batch of data. Model quality metrics (e.g., precision and recall) can also be calculated on top of a data batch.
Non-batch models (e.g., models operating as REST API services) are trickier. While metrics like missing values or range violations can be calculated in real time, metrics like data drift or model performance are best computed over an accumulated batch of data.
Pro-tip. For non-batch ML models, you can use window functions to perform statistical tests on continuous data streams. Pick a window function (e.g., a moving window with or without a moving reference), choose the window and step size, and compare the windows.
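This pro-tip can be sketched as follows, assuming a fixed reference window and a moving current window over a synthetic stream (window and step sizes here are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated stream whose distribution shifts halfway through
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(1.5, 1, 500)])
reference = rng.normal(0, 1, 500)  # fixed reference window

window_size, step = 200, 100
results = []
for start in range(0, len(stream) - window_size + 1, step):
    window = stream[start:start + window_size]
    # Compare each moving window against the reference distribution
    _, p_value = stats.ks_2samp(reference, window)
    results.append((start, p_value))
    flag = "DRIFT" if p_value < 0.05 else "ok"
    print(f"window [{start}:{start + window_size}] p-value={p_value:.3g} {flag}")
```

Windows that fall entirely after the shift should flag drift, while early windows should not.
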
ML monitoring scheme
As a practice for the MLOps Zoomcamp ML Monitoring module, we implemented an ML monitoring scheme that can be used for batch and non-batch machine learning models. The following videos will explain step-by-step how to:
- Use a prediction service. We will use a service that predicts the duration of taxi trips in NYC. It can be implemented as a REST API service or batch prediction pipeline.
- Simulate production usage of the service. We will generate prediction logs and store them in a logging system as local files.
- Implement monitoring jobs. We will use Prefect and the open-source Evidently library to calculate metrics.
- Load metrics into a database. We will use PostgreSQL as a database.
- Build a dashboard. We will use Grafana to build a dashboard on top of the database.
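In plain Python, one run of this scheme boils down to a load → calculate → store loop. The skeleton below is a toy stand-in with illustrative names; the following videos implement the real version with Evidently, Prefect, PostgreSQL, and Grafana:

```python
def load_batch(batch_id):
    """Stand-in for reading the latest prediction logs."""
    return [{"prediction": 10.0 + batch_id, "trip_distance": 2.5}] * 50

def calculate_metrics(batch):
    """Stand-in for an Evidently-based metric calculation."""
    preds = [row["prediction"] for row in batch]
    return {"mean_prediction": sum(preds) / len(preds), "n_rows": len(batch)}

def save_metrics(db, batch_id, metrics):
    """Stand-in for inserting a row into PostgreSQL."""
    db.append({"batch_id": batch_id, **metrics})

db = []  # stand-in for the metrics database behind Grafana
for batch_id in range(3):  # the monitoring job, run once per batch
    batch = load_batch(batch_id)
    save_metrics(db, batch_id, calculate_metrics(batch))

print(db[0])
```
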
ML monitoring system step-by-step
It’s practice time! Let's now walk through an end-to-end example to connect the dots and implement the ML monitoring scheme using Evidently, Prefect, and Grafana.
You can find the complete code in the MLOps Zoomcamp GitHub repository.
Video 2. MLOps Zoomcamp: Environment setup, by Emeli Dral.
In this video, we set up the environment for the machine learning monitoring system.
00:00 Create a working environment and requirements
02:00 Create and configure Docker Compose
03:35 Configure services: PostgreSQL, Adminer, and Grafana
07:15 Create and test services with Docker Compose
That’s it! You have successfully created your working environment, installed Python packages, and created a Docker Compose file.
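For reference, a Docker Compose file along these lines wires up the three services (image versions, ports, and credentials here are illustrative; see the course repository for the exact file):

```yaml
version: '3.7'

services:
  db:
    image: postgres
    restart: always
    environment:
      POSTGRES_PASSWORD: example
    ports:
      - "5432:5432"

  adminer:
    image: adminer
    restart: always
    ports:
      - "8080:8080"

  grafana:
    image: grafana/grafana
    restart: always
    ports:
      - "3000:3000"
```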
Prepare reference data and model
Video 3. MLOps Zoomcamp: Prepare reference data and model, by Emeli Dral
In this part, we prepare a reference dataset and train a baseline model to use as a reference point in calculating ML monitoring metrics.
01:31 Import libraries
04:28 Download and load data
11:30 Preprocess data, filter out outliers, check target function distribution
13:25 Select features and train a linear regression model
17:45 Evaluate ML model quality
18:50 Create a reference dataset
Done! Now, we have a reference dataset and a baseline ML model to simulate the production use of our prediction service.
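The same steps can be sketched on synthetic data (the features below are made-up stand-ins for the NYC taxi columns used in the video):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the taxi trip data (feature names are illustrative)
rng = np.random.default_rng(7)
n = 2000
X = np.column_stack([
    rng.uniform(0.5, 20, n),   # trip distance
    rng.integers(1, 5, n),     # passenger count
])
y = 3 * X[:, 0] + rng.normal(0, 2, n)  # trip duration in minutes

# Train/validation split; the validation slice doubles as reference data
X_train, X_val = X[:1500], X[1500:]
y_train, y_val = y[:1500], y[1500:]

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_val)
print("validation MAE:", round(mean_absolute_error(y_val, preds), 2))

# Reference dataset = validation features plus predictions, for later drift checks
reference = np.column_stack([X_val, preds])
```
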
Calculate ML monitoring metrics
Video 4. MLOps Zoomcamp: ML monitoring metrics calculation, by Emeli Dral
In this video, we use the open-source Evidently library to calculate ML monitoring metrics.
00:00 Introduction to Evidently library: Reports and Metrics
04:40 Generate and display Evidently Report in HTML
06:00 How to interpret Evidently Report: data drift detection example
06:50 Display Evidently Report as a Python dictionary and derive selected values
That’s it! We calculated ML monitoring metrics and learned how to display an Evidently Report and derive values from it.
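For automation, the key step is reading the report as a dictionary. The snippet below mimics the rough shape of a drift report's `as_dict()` output (an assumption; field names vary across Evidently versions) and shows how to pull out a single value:

```python
# Rough shape of report.as_dict() for a drift report (an assumption;
# check your Evidently version for the exact field names)
report_dict = {
    "metrics": [
        {
            "metric": "ColumnDriftMetric",
            "result": {"column_name": "trip_distance",
                       "drift_score": 0.12,
                       "drift_detected": False},
        },
        {
            "metric": "DatasetDriftMetric",
            "result": {"share_of_drifted_columns": 0.25,
                       "dataset_drift": False},
        },
    ]
}

def metric_result(report, name):
    """Find the result block of a metric by its name."""
    for m in report["metrics"]:
        if m["metric"] == name:
            return m["result"]
    raise KeyError(name)

print(metric_result(report_dict, "ColumnDriftMetric")["drift_score"])  # 0.12
```
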
Set up dummy monitoring
Video 5. MLOps Zoomcamp: Dummy monitoring, by Emeli Dral
In this video, we create dummy metrics and set up a database for our Grafana dashboard.
00:00 Create a Python script for dummy metrics calculation
03:10 Prepare a database and create a table for dummy metrics
06:00 Calculate dummy metrics and load them into the table
07:40 Create a cycle and define the main function
09:00 Add sending timeout to simulate production usage of the service
10:00 Test the script: access the PostgreSQL database and create a dashboard in Grafana
Congratulations! Our configuration files are now correct: we can access our database, load the data, and build a dashboard in Grafana.
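A stripped-down version of such a dummy-metrics script, using SQLite from the standard library as a stand-in for the Dockerized PostgreSQL so the sketch stays self-contained:

```python
import random
import sqlite3
import time
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
conn.execute("""CREATE TABLE dummy_metrics (
    ts TIMESTAMP, value1 REAL, value2 INTEGER, value3 REAL)""")

ts = datetime(2023, 6, 1)
SEND_TIMEOUT = 0.01  # seconds between inserts; much longer in a real setup

for i in range(5):
    # One made-up metrics row per iteration
    row = (ts.isoformat(), random.random(), random.randint(0, 20), random.random())
    conn.execute("INSERT INTO dummy_metrics VALUES (?, ?, ?, ?)", row)
    conn.commit()
    ts += timedelta(minutes=1)
    time.sleep(SEND_TIMEOUT)  # simulate production pacing

print(conn.execute("SELECT COUNT(*) FROM dummy_metrics").fetchone()[0])  # 5
```
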
Set up data quality monitoring
Video 6. MLOps Zoomcamp: Data quality monitoring, by Emeli Dral
In this video, we create an actual dashboard for drift detection.
We will use Evidently to calculate the monitoring metrics:
- Column drift (ColumnDriftMetric) measures whether the distribution of individual features (columns) has changed.
- Dataset drift (DatasetDriftMetric) shows whether the distribution of the entire dataset has changed.
- Missing values (DatasetMissingValuesMetric) measures the proportion of missing values in the dataset.
We will use Prefect to orchestrate calculating and storing drift metrics. We will store these metrics in a PostgreSQL database and visualize them using Grafana.
00:00 Alter the script and load the reference data and the model
02:45 Create Evidently data drift report and derive values of selected metrics
07:00 Test and debug the script
08:30 Transform the script to Prefect pipelines
10:40 Build and customize the Grafana dashboard
Save Grafana dashboard
Video 7. MLOps Zoomcamp: Save Grafana dashboard, by Emeli Dral
In this video, we show how to save the Grafana dashboard so we can load it every time we rerun the Docker container without rebuilding the dashboard from scratch.
Debug with Evidently Reports and Test Suites
Video 8. MLOps Zoomcamp: Debugging with Evidently Reports and Test Suites, by Emeli Dral
Here is a quick refresher on the Evidently components we will use:
- Reports compute and visualize 100+ metrics across data quality, data drift, and model performance. You can use built-in report presets to generate visuals with just a couple of lines of code.
- Test Suites perform structured data and ML model quality checks. They verify conditions and show which pass or fail. You can start with default test conditions or design your own.
00:00 How to use Evidently to debug ML models and data
04:20 Load data and model
06:50 Run Evidently data drift Test Suite and visualize the results
09:50 How to interpret the results and analyze data drift with Test Suites
13:30 Build Evidently data drift Report and visualize the results
14:15 How to interpret the results and analyze data drift with Reports
Want a deeper dive into ML observability?
Sign up for our free Open-source ML observability course that starts on October 16, 2023! You can join at any point; all course materials will be available after the course.
📚 We will cover the key concepts of ML monitoring and observability, different types of evaluations, and how to integrate them into ML pipelines.
💻 The course is practice-oriented. We will provide end-to-end deployment blueprints and walk you through the code. You will use tools like Evidently, MLflow, Airflow, and Grafana.
👩‍🎓 Whether you are a data scientist, ML engineer, technical product manager, or analyst, we hope you will find the course useful!
🗓 Start date: October 16, 2023.
ML monitoring is a crucial component of MLOps. It helps ensure that machine learning models remain reliable and relevant to the environment in which they operate. By tracking data inputs, predictions, and outcomes, you can get visibility into how well the model is doing and resolve issues before they affect the performance of an ML service.
Start small and add complexity as you scale. Service health, ML model performance, data quality and integrity, and data and concept drift are good starting points for monitoring ML models in production. From there, you can introduce more comprehensive monitoring metrics iteratively.
You can implement the complete ML monitoring workflow using open-source tools. This tutorial demonstrated how to build an ML monitoring scheme for batch and non-batch machine learning models using Evidently, Prefect, PostgreSQL, and Grafana.