In this code tutorial, you will learn the following:
- What production ML monitoring is, and why you need it.
- How to start with ML monitoring by generating model performance reports.
- How to implement continuous model tracking and host an ML monitoring dashboard.
To complete the tutorial, you must have basic Python knowledge and be comfortable using the terminal. You will go through the end-to-end process of creating an ML monitoring dashboard for a toy model and view it in your browser.
We’ll use Evidently, an open-source Python library for ML model monitoring.
Want to go straight to code? Here is a Python script we will use to create a live ML monitoring dashboard.
📈 Why do you need ML monitoring?
Building an ML model is not a one-and-done process. Many things can go wrong once you deploy a model to the real world. Here are some examples:
- Data quality issues. Data pipelines can break. Pre-processing code might contain bugs. Data producers can mess things up. As a result, the model can receive erroneous data inputs – and make unreliable predictions.
- Data drift. As your models interact with the real world, they encounter new subpopulations and data slices that weren't part of their initial training. For example, if a new demographic of users emerges, a demand forecasting model might struggle to make accurate predictions for it.
- Concept drift. As the world evolves, the patterns the model learned during training might not hold anymore. Say, your model predicts shopping preferences based on historical data. As trends change and new products come to market, its suggestions can become outdated.
To address all this, you need ML monitoring – a way to oversee and evaluate ML models in production. Monitoring is an essential component of MLOps. Once you deploy the models, you must keep tabs on them!
By implementing ML monitoring, you can:
- Get visibility into ML model performance. Answering "How is my model doing?" and "Is it even working?" without ad hoc queries goes a long way. It also helps build trust with the business stakeholders.
- Quickly detect and resolve model issues. Is your model getting a lot of nulls? Predicting fraud way too often? Is there a strange spike in users coming from a particular region? You want to notice before it affects the business outcomes.
- Know when it's time to retrain. Models degrade over time. Monitoring helps notice performance dips and shifts in data distribution. This way, you will know when it's time for a model update and get some context to develop the best approach.
Whether it's incorrect inputs or shifting trends, monitoring acts as an early warning system.
🚀 How to start?
It’s easy to talk about the benefits of monitoring, but to be fair, this is often the least loved task. Building models is much more fun than babysitting them!
Building a complete monitoring system also sounds like a lot of work. You need both metric computation and a visualization layer, which might require stitching together different tools.
However, there are easier ways to start.
In this tutorial, we’ll work with Evidently – an open-source MLOps tool that helps evaluate, test, and monitor ML models. We aim to cover the core ML monitoring workflow and how to implement it in practice with the least effort.
To start, you can generate monitoring reports ad hoc. A more "manual" approach can make sense if you have only just deployed your ML model. It also helps shape expectations about the model and data quality before you automate the metric tracking.
You can query your model logs and generate a report with metrics you care about. You can explore them in the Jupyter notebook or save them as HTML files to share with others.
Let’s take a look at how it can work! First, install Evidently. If you work in Google Colab, run:
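A minimal install command, assuming a Colab notebook cell:

```shell
# In a notebook cell; in a terminal, drop the leading "!"
!pip install evidently
```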
If you work in a Jupyter notebook, you must install nbextension. Check out the detailed installation instructions.
Next, import a few libraries and components required to run an example.
We will import a toy dataset as a demo. We’ll use the "adult" dataset from OpenML.
In practice, you should use the model prediction logs. They can include input data, model predictions, and true labels or actuals, if available.
Now, let’s split our toy data into two. This way, we will create a baseline dataset to compare against: we call it "reference." The second dataset is the current production data. To make things more interesting, we also introduce some changes to the data by filtering our selection using one of the columns. This is a quick way to add some artificial drift.
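A sketch of that split, assuming the OpenML "adult" dataset loaded via scikit-learn (the education-based filter is just one convenient way to skew the data):

```python
from sklearn import datasets

# Load the "adult" dataset from OpenML as a pandas DataFrame
adult = datasets.fetch_openml(name="adult", version=2, as_frame=True).frame

# Baseline ("reference"): education levels outside our filter
adult_ref = adult[~adult.education.isin(["Some-college", "HS-grad", "Bachelors"])]
# "Current" data: only the filtered education levels – artificial drift
adult_cur = adult[adult.education.isin(["Some-college", "HS-grad", "Bachelors"])]
```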
In practice, you can pick comparison windows based on your assumptions about the data stability: for example, you can compare last month to training.
Now, let's get to the data drift report. You must create a Report object, specify the analytical preset you want to include (in this case, the "Data Drift Preset"), and pass the two datasets you compare as "reference" and "current."
Once you run the code, you will see the Report directly in your notebook. Evidently will check for distribution drift in all the individual features and sum it up in a report.
If you click on individual features, it will show additional plots to explore the specific distributions.
In this example, we detect distribution drift in many of the features – since we artificially selected only a subset of the data with a particular education level.
If you want to share the Report, you can also export it as a standalone HTML file.
Evidently has other pre-built Reports and Test Suites – for example, for data or model quality. You can also design a custom Report or a Test Suite by picking individual checks. Check out the complete introductory tutorial for more details.
This tutorial will focus on Reports, but Test Suites work similarly. The difference is that Tests allow you to verify a condition explicitly: "Is my feature within a specified min-max range?" On the other hand, reports compute and visualize metrics without expectations: "Here is the min, max, mean, and number of constant values, and this is how the distribution looks."
You can go pretty far with report-based monitoring, especially if you work with batch models. For example, you can generate weekly reports, and log them to MLflow. You can run Reports on schedule using an orchestrator tool like Airflow and even build a conditional workflow – for example, generate a notification if drift is detected in your dataset.
However, this approach has its limitations, especially as you scale. Organizing and navigating multiple reports may be inconvenient. When you compute individual Reports for specific periods, there is also no easy way to track the trends – for example, see how the share of drifting features changes over time.
Here is the next thing to add: a live dashboard!
📊 ML monitoring dashboard
Evidently has a user interface component that helps track the metrics over time.
Here is how it looks:
It conveniently sits on top of the Reports like the one we just generated, so it’s easy to progress from ad hoc checks to a complete monitoring setup.
Here is the principle behind it:
- You can compute multiple Reports over time. Each Report captures data or model quality for a specific period. For example, you can generate a drift report every week.
- You must save these reports as JSON "snapshots." You log them to a directory corresponding to a given “project” – for example, to a specific ML model.
- You can launch an ML monitoring service over these snapshots. The Evidently service will parse the data from multiple individual Reports and visualize it over time.
You can run the dashboard locally to take a quick look. Let’s do this right now!
🧪 Demo project
To start, let’s launch a demo project to see an example monitoring dashboard. We’ll now head from the notebook environment to the Terminal: Evidently will then run as a web application in a browser.
1. Create virtual environment
This is an optional, but highly recommended step. Create a virtual environment and activate it.
Run the following command in the Terminal:
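A standard way to do this with Python's built-in venv module:

```shell
# Create and activate a virtual environment (macOS/Linux)
python -m venv venv
source venv/bin/activate
# On Windows, activate with: venv\Scripts\activate
```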
2. Install Evidently
Now, install Evidently in your environment:
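```shell
pip install evidently
```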
3. Run demo project
To launch the Evidently service with the demo project, run:
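The demo ships with the Evidently CLI. The exact flag has changed between versions, so check `evidently ui --help` for your install:

```shell
evidently ui --demo-project
# In newer releases, the flag is: evidently ui --demo-projects all
```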
To view the Evidently interface, go to http://localhost:8000 in your web browser.
You'll find a ready-made project that displays the performance of a simple model across 20 days. You can switch between tabs, for example, to look at individual Reports. You can even download them! They will be the same as the Data Drift report we generated in the notebook.
Each Report or Test Suite corresponds to a single day of the model’s performance. The monitoring dashboard takes the data from these Reports and presents how metrics evolve.
💻 An end-to-end example
Do you want to run a dashboard like this for your model?
Let's now walk through an end-to-end example to connect the dots and understand how you generate multiple Reports and run a dashboard on top of them.
Here is what you’ll learn to do now:
- Create a new project as if you add a new ML model to monitor.
- Imitate daily batch model inference to compute reports daily.
- Design monitoring panels to visualize the metrics.
Code example. We wrote a Python script that implements the process end-to-end. You can access it here. You can simply run the script and get a new dashboard to look at.
To better understand what’s going on, we will go through the script step by step. You can open the file and follow the explanation.
Here is what the script does:
- Imports the required Evidently components
- Imports a toy dataset (we will again use the OpenML "adult" dataset)
- Creates a new Evidently workspace and project
- Defines the metrics to log using Evidently Reports and Test Suites
- Computes the metrics iterating over toy data
- Creates several panels to visualize the metrics
Let’s now go through each of the steps.
1. Import the components and data
First, import the required components.
Next, import the data and create a pandas.DataFrame using the OpenML adult dataset.
We separate a part of the dataset as "reference" and call it adult_ref. We will later use it as a baseline for drift detection. We use adult_cur ("current") to imitate batch inference.
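In the script, this can look like the following (the education-based filter mirrors the ad hoc example earlier in the tutorial):

```python
from sklearn import datasets

# Fetch the OpenML "adult" dataset as a pandas DataFrame
adult_data = datasets.fetch_openml(name="adult", version=2, as_frame=True)
adult = adult_data.frame

# Baseline for drift detection
adult_ref = adult[~adult.education.isin(["Some-college", "HS-grad", "Bachelors"])]
# Data used to imitate batch inference
adult_cur = adult[adult.education.isin(["Some-college", "HS-grad", "Bachelors"])]
```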
2. Name the workspace and project
Now, let’s name the workspace and project. A project will typically correspond to an ML model you monitor. You will see this name and description in the user interface.
A workspace defines the folder where Evidently will log data to. It will be created in the directory where you launch the script from. This helps organize different logs that relate to one model over time.
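For example (these particular names are placeholders; pick your own):

```python
# Folder where Evidently will log snapshots, created next to the script
WORKSPACE = "workspace"

# Shown in the user interface
YOUR_PROJECT_NAME = "New Project"
YOUR_PROJECT_DESCRIPTION = "Test project using the Adult dataset."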
3. Define monitoring metrics and log them
You can decide what to log – statistical data summaries, test results, or specific metrics. For example, you can capture the same data drift report shared above and log a new one daily. This way, you can later visualize the share of drifted features or particular drift scores over time – and browse individual Reports from the interface.
You can also capture data quality metrics, such as the share of missing values, the number of constant columns, min-max values, etc. You can also compute model quality summaries if you have true labels available.
It’s entirely up to you – you can log whatever you like! You can check the list of Evidently presets, metrics and tests.
To define the monitoring setup, you must create a Report, just like the drift report above, and include the Metrics or Presets you wish to capture.
Here is what we do in our example script:
- We came up with a custom combination of Metrics to track. They include overall dataset drift, the share of missing features, and drift scores and summaries for a couple of specific columns in the dataset – these could be our most important features, for example.
- We generate multiple Reports over time to imitate batch model inference. We repeat the computation for several simulated "days," each time taking the next 100 observations. In practice, you would work with actual prediction data and compute the snapshots as the logs come in.
- We pass the reference dataset. We pass adult_ref to use as the basis for distribution drift detection. This is not always required: you can compute metrics like nulls or feature statistics without reference.
There is one slight difference compared to the earlier ad hoc workflow. Instead of rendering a Report in the notebook, we now save it as a "snapshot." A snapshot is a JSON "version" of the Evidently Report or Test Suite. It contains all the information required to recreate the visual HTML report.
You must store these snapshots in a directory that the Evidently UI service can access. The monitoring service will parse the data from snapshots and visualize metrics over time.
When we generate the Reports inside a workspace (you will see it later in this script), Evidently will automatically generate them as snapshots. As simple as that!
4. Add monitoring panels
You must add monitoring panels to define what you will see on the dashboard for this particular model. You can choose between different panel types: for example, add a simple counter to show the number of model predictions made, a time series line plot to display the share of drifting features over time, a bar chart, and so on.
Here is how we do this in the script.
First, create a new project in the workspace:
Next, add panels to the project. Here is an example of adding a counter metric to show the share of drifted features.
You can check out the complete script to see the implementations for other metrics:
- The number of model predictions (counter).
- The share of missing values (line plot).
- Feature drift scores (bar plot).
After you define the design of your monitoring panels, you must save the project.
Note: these panels relate specifically to the monitoring dashboard. Since Evidently captures multiple metrics in Reports and Test Suites, you must pick which ones to plot. However, many visuals are available out of the box inside the Evidently Reports logged for each period.
5. Create the workspace and project
Finally, we need to create the workspace, the project and generate the snapshots. When you execute the script, Evidently will compute and write the snapshots with the selected metrics to the defined workspace folder, as if you captured data for 5 days. It will also create the dashboard panels as defined above.
6. Run the script and launch the service
Now, once we’ve walked through the complete script, let’s execute it!
Run the command to generate a new example project using the script:
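Assuming you saved the script under a name like get_started_monitoring.py (adjust to your actual filename):

```shell
python get_started_monitoring.py
```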
Then, launch the user interface to see it! Run:
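```shell
# Run from the same directory that contains the workspace folder
evidently ui
```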
To view the service, head to localhost:8000.
You will be able to see a new project in the interface:
After you click on "new project", you can see the monitoring dashboard you just created.
✅ How does this work in practice?
To go through the steps in more detail, refer to the complete Monitoring User Guide.
To start monitoring an existing ML model, you must build a workflow to collect the data from your production pipelines or services. You can also run monitoring jobs over production logs stored in a data warehouse. The exact integration scenario depends on the model deployment type and infrastructure.
Here is one possible approach. You can implement it using a workflow manager like Airflow to compute Evidently snapshots on a regular cadence.
Did you enjoy the blog? Star Evidently on GitHub to contribute back! This helps us continue creating free, open-source tools and content for the community.
⭐️ Star on GitHub ⟶
ML monitoring is a necessary component of production ML deployment. It involves tracking data inputs, predictions, and outcomes to ensure that models remain accurate and reliable. ML monitoring helps get visibility into how well the model functions and detect and resolve issues.
It is possible to implement the complete ML monitoring workflow using open-source tools. With Evidently, you can start with simple ad hoc checks with Reports or Test Suites and then add a live monitoring dashboard as you scale. This way, you can start small and introduce complexity progressively.