Meet Evidently 0.2, the open-source ML monitoring tool to continuously check on your models and data

Last updated:

January 23, 2025

Published:

December 8, 2022

We are thrilled to announce our latest and largest release: Evidently 0.2.

It is a stable release. We finished a major refactoring and put a lot of learnings into code.

In this blog post, we give an overview of the Evidently features.

If you are looking to implement ML model monitoring, it will help you decide whether Evidently is the right tool. If you already used an earlier version, it explains what exactly changed. We’ll also share some product learnings and the direction we are focusing on now.

What we are building

We started Evidently two years ago with the vision to create an open-source machine learning monitoring platform. With many ML models finally reaching production, it felt like the right moment to address the “life after deployment.” Once the models are in the wild, you need to track how well they are doing and react when things go wrong. You need to detect and resolve issues with data quality, data drift, and model performance—and you need a tool for that!

It was also clear to us that Evidently should be open-source. This is simply how technical users expect to adopt new tools. In monitoring, there is one more dimension: you do not want a black-box thing to monitor your black-box models. The code should be transparent and auditable—and incorporate the community input on what exactly to check for and how.

The vision remains. We are very humbled to know that Evidently was downloaded over 1.5 million times, and ML teams across industries from e-commerce to healthcare use it.

We have reached a point where we believe Evidently does a few things really well, namely batch ML model checks on tabular data.

This is what we want to present with this stable release.

Let’s dive in!

1. Library of ML and data metrics

Many ML teams we talk to start with the question: Which exact metrics should I monitor? How to calculate them? Is there a best practice?

Since the beginning, the “content” of ML monitoring has been our major focus.

At first, we came up with pre-built reports and hard-coded defaults.

It evolved. 39 releases and one Big Refactoring later, we believe we have a much better, scalable design.

We took all the existing Evidently reports apart and converted them into a set of individual metrics and tests, each with handy parameters to tweak.

Evidently now is a giant set of ML evaluation “lego” bricks.

The pre-built reports are still there (no worries!), but this new architecture treats the metric as a first-class citizen.

Each metric evaluates something about the data or ML model and provides an optional visualization. It can be an error metric of a regression model, a summary of descriptive feature statistics, a distance metric to measure feature distribution shift, and so on.

You can choose from dozens of metrics that evaluate data quality, drift, and model performance and combine them in a single report.

Evidently pre-built metrics to customize

Turning reports into metrics might sound like an internal technicality, but here is why it’s a big deal: the new architecture makes Evidently extensible. It allows adding new types of evaluations easily. Metrics have a consistent implementation format and share calculations when it makes sense.
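To make the idea concrete, here is a minimal plain-Python sketch of metrics as composable bricks. It does not use the Evidently API: the `Metric` and `Report` classes, the metric functions, and the column names are all hypothetical, chosen only to show how individually defined metrics can share one format and be combined into a single report.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional

# A row maps column names to values; None marks a missing value.
Row = Dict[str, Optional[float]]


@dataclass
class Metric:
    """A named calculation over a batch of rows (hypothetical, not Evidently's class)."""
    name: str
    compute: Callable[[List[Row]], float]


def mean_absolute_error(rows: List[Row]) -> float:
    # Model quality: average absolute gap between target and prediction.
    return mean(abs(row["target"] - row["prediction"]) for row in rows)


def share_of_missing_values(rows: List[Row]) -> float:
    # Data quality: fraction of all cells that are missing.
    cells = [value for row in rows for value in row.values()]
    return sum(value is None for value in cells) / len(cells)


@dataclass
class Report:
    """Runs every metric on the same batch and collects the results by name."""
    metrics: List[Metric]

    def run(self, rows: List[Row]) -> Dict[str, float]:
        return {metric.name: metric.compute(rows) for metric in self.metrics}


report = Report(metrics=[
    Metric("mae", mean_absolute_error),
    Metric("share_missing", share_of_missing_values),
])
batch = [
    {"target": 10.0, "prediction": 12.0, "age": 30.0},
    {"target": 20.0, "prediction": 19.0, "age": None},
]
print(report.run(batch))
```

Because every metric has the same shape (a name plus a function over the batch), adding a new evaluation means writing one more function rather than a new report.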

We will be adding new evaluation types soon, for example, to check the quality of unsupervised and time series models or to assess model explainability.

Our goal is to comprehensively cover different aspects of the ML model and data performance, incorporate best practices on model evaluation, and provide a reasonable set of default metrics and tests for each.

2. Test suites for continuous evaluation

Visual reports help debug or perform ad hoc analysis. But if you want to keep tabs on a production model and react when things go wrong, you need a more structured and automated approach.

Enter test suites.

This alternative interface sits on top of metrics. It is designed to run as part of an ML prediction pipeline.

Each test is a metric with a condition, so it either explicitly passes or explicitly fails.

You can build a test suite to combine different checks. There are presets and a menu of individual tests to choose from. Once you run the test suite, you get a summary of the results.
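The "metric with a condition" idea can be sketched in a few lines of plain Python. This is an illustration only and not the Evidently API: `MetricTest`, `run_suite`, and the two example checks are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Values = List[Optional[float]]


@dataclass
class MetricTest:
    """A metric plus a condition on its value (hypothetical, not Evidently's class)."""
    name: str
    metric: Callable[[Values], float]
    condition: Callable[[float], bool]

    def run(self, values: Values) -> Dict[str, object]:
        value = self.metric(values)
        status = "PASS" if self.condition(value) else "FAIL"
        return {"test": self.name, "value": value, "status": status}


def share_missing(values: Values) -> float:
    return values.count(None) / len(values)


def max_value(values: Values) -> float:
    return max(v for v in values if v is not None)


def run_suite(tests: List[MetricTest], values: Values) -> Dict[str, object]:
    # Run every test on the same data and summarize the outcome.
    results = [test.run(values) for test in tests]
    failed = sum(result["status"] == "FAIL" for result in results)
    return {"total": len(results), "failed": failed, "results": results}


suite = [
    MetricTest("missing share <= 10%", share_missing, lambda v: v <= 0.1),
    MetricTest("max value <= 1.0", max_value, lambda v: v <= 1.0),
]
summary = run_suite(suite, [0.2, 0.4, None, 0.9])
print(summary["total"], "tests,", summary["failed"], "failed")
```

The explicit status per test is what makes the result usable in a pipeline: a downstream step can branch on the number of failed tests instead of a human reading a chart.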

With this stable release, test suites work as we envisioned.

You can pass the reference data to generate test constraints. Writing all tests from scratch can be daunting. In Evidently, you can pass an example dataset instead. The tool will automatically learn the shape of your data, generate tests and run them out of the box.

Say you want to compare this week’s data against the past to check if it is good enough. You need to pass your older data as a “reference,” list the things you care about (missing data, column types, new values, etc.), and simply run:

*Passing the reference data. Head to Evidently docs to see all the details in the User Guide.*
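For intuition about what "learning the shape of your data" can mean, here is a plain-Python sketch that derives simple constraints from a reference dataset and checks a new batch against them. It is not the Evidently implementation or API: the function names, the `tolerance` parameter, and the two rules (missing-value share and unseen categories) are assumptions made for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def missing_share(rows: List[Row], column: str) -> float:
    # Fraction of rows where the column is absent or None.
    return sum(row.get(column) is None for row in rows) / len(rows)


def learn_constraints(reference: List[Row], tolerance: float = 0.1) -> Dict[str, dict]:
    """Derive per-column expectations from the reference data (illustrative rules only)."""
    constraints = {}
    for column in reference[0]:
        present = [row[column] for row in reference if row[column] is not None]
        is_categorical = all(isinstance(value, str) for value in present)
        constraints[column] = {
            # Allow the reference share of missing values plus some slack.
            "max_missing_share": missing_share(reference, column) + tolerance,
            # For string columns, remember which categories were seen.
            "known_values": set(present) if is_categorical else None,
        }
    return constraints


def check(current: List[Row], constraints: Dict[str, dict]) -> Dict[str, bool]:
    """Return a pass/fail flag for every generated check."""
    results = {}
    for column, rule in constraints.items():
        share = missing_share(current, column)
        results[f"{column}: missing share"] = share <= rule["max_missing_share"]
        if rule["known_values"] is not None:
            seen = {row.get(column) for row in current if row.get(column) is not None}
            results[f"{column}: no new values"] = seen <= rule["known_values"]
    return results


reference = [
    {"country": "DE", "age": 31}, {"country": "FR", "age": 45},
    {"country": "DE", "age": 28}, {"country": "FR", "age": 52},
]
current = [{"country": "DE", "age": None}, {"country": "US", "age": 40}]
for name, passed in check(current, learn_constraints(reference)).items():
    print(name, "PASS" if passed else "FAIL")
```

In this toy batch, the new category `"US"` and the missing `age` value break the expectations learned from the reference data, without anyone having written those thresholds by hand.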

You can, of course, customize the test parameters. You can set your own conditions and tweak how underlying metrics are calculated. We redesigned the parameters to make the API simple and declarative.

For example, Evidently has a default algorithm that selects the statistical test for data drift detection. If you want to apply a test of your choice instead, you can override it: change thresholds, choose drift methods (including community-contributed ones), or pass your own:

*Custom test suite. Head to Evidently docs to see all the details in the User Guide.*
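The pattern of "a default method that you can override, with a threshold you can change" can be sketched in plain Python. The code below is not the Evidently API, and the two scores are toy drift measures rather than the statistical tests Evidently ships: `mean_shift`, `median_shift`, and `drift_test` are hypothetical names.

```python
from statistics import mean, median, pstdev
from typing import Callable, Dict, List

DriftMethod = Callable[[List[float], List[float]], float]


def mean_shift(reference: List[float], current: List[float]) -> float:
    # Toy default score: shift of the mean, in units of the reference std deviation.
    return abs(mean(current) - mean(reference)) / pstdev(reference)


def median_shift(reference: List[float], current: List[float]) -> float:
    # Toy user-supplied score: shift of the median, relative to the reference range.
    return abs(median(current) - median(reference)) / (max(reference) - min(reference))


def drift_test(
    reference: List[float],
    current: List[float],
    method: DriftMethod = mean_shift,
    threshold: float = 0.5,
) -> Dict[str, object]:
    """Flag drift when the chosen score exceeds the chosen threshold."""
    score = method(reference, current)
    return {"method": method.__name__, "score": score, "drift_detected": score > threshold}


reference = [10, 12, 11, 13, 14]
current = [13, 15, 14, 16, 17]

print(drift_test(reference, current))                       # defaults
print(drift_test(reference, current, threshold=3.0))        # relaxed threshold
print(drift_test(reference, current, method=median_shift))  # custom method
```

Keeping the method and the threshold as plain parameters is what makes the check declarative: the caller states which score to use and where the line is, and the test logic stays the same.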

Evidently has a handful of test presets and 50+ individual tests to choose from. As we add more metrics, we’ll add more tests, too. Our goal is to assemble best practices on how to test a production ML pipeline and put it into a structured form.

At the same time, we do not plan to add a zillion metrics and every new data drift, outlier, or correlation method out there. We want to keep the tool maintainable and pragmatic: it is a tool made for production.

But if you want, you can add any method you need! Since Evidently is already a monitoring framework, you can do that on your own.

3. Customizable monitoring framework

Here is one more way to look at what Evidently now is: a framework. It is not just a “drift detection tool,” though you can use it as one. Evidently is also a customizable framework to implement your continuous evaluation approach.

The lego-like architecture makes it easy to extend and add other tests. For example, you can create an Evidently test suite and include a domain-specific, custom metric:

*Example of an Evidently test suite with a custom test*
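As a rough illustration of how a domain-specific check can plug into a shared test structure, here is a plain-Python sketch. It is not the Evidently API: `BaseTest`, `ShareBelowCost`, and the column names `predicted_price` and `cost` are hypothetical and stand in for whatever matters in your domain.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, float]


class BaseTest(ABC):
    """Shared structure for all tests: compute a value, then apply a condition."""
    name: str

    @abstractmethod
    def calculate(self, rows: List[Row]) -> float:
        ...

    @abstractmethod
    def condition(self, value: float) -> bool:
        ...

    def run(self, rows: List[Row]) -> Dict[str, object]:
        value = self.calculate(rows)
        return {"name": self.name, "value": value, "passed": self.condition(value)}


class ShareBelowCost(BaseTest):
    """Domain-specific check: how often does the model price an item below its cost?"""
    name = "share of predicted prices below cost"

    def __init__(self, max_share: float) -> None:
        self.max_share = max_share

    def calculate(self, rows: List[Row]) -> float:
        return sum(row["predicted_price"] < row["cost"] for row in rows) / len(rows)

    def condition(self, value: float) -> bool:
        return value <= self.max_share


rows = [
    {"predicted_price": 12.0, "cost": 10.0},
    {"predicted_price": 8.0, "cost": 9.0},   # priced below cost
    {"predicted_price": 15.0, "cost": 11.0},
    {"predicted_price": 20.0, "cost": 14.0},
]
print(ShareBelowCost(max_share=0.1).run(rows))
```

The custom test only supplies the calculation and the condition; running, reporting, and combining with other tests come from the shared base.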

You might also want to cover an evaluation dimension that Evidently does not cover (yet!), say, prediction bias or fairness.

Evidently provides a set of abstractions to standardize the model and data checks. It gives you flexibility (bring your own evaluation scenario) and consistency (use the unified test design and structure). You can reuse these checks across models and stages of the model lifecycle.

How to integrate it with my ML platform?

The simple answer is: Evidently is a Python library. It works wherever Python code can run.

Over the last year, we explored how it best fits into different production ML architectures.

We implemented integrations with workflow orchestration and lifecycle tools like Airflow, MLflow, and Prefect, and with monitoring tools like Prometheus and Grafana (see documentation). The community contributed more integrations with FastAPI, Kubeflow, Metaflow, and ZenML. We even saw Evidently outputs on a Tableau dashboard!

This aligns with our vision, as our goal has always been to fit into different workflows.

There are various ML deployment scenarios and tooling choices. It is impossible to become a true open-source standard if you impose a single way of doing things. We want Evidently to be a universal evaluation component: you can use it with different data sources, for different model types, with our native visualization, or with external BI.

Our current priority is batch model checks. Many ML models in production are batch and do not need real-time monitoring. You can also keep tabs on many real-time models in batch mode.

So this is something we want to perfect first.

*Example continuous ML evaluation workflow*

You can run the Evidently test suites at defined intervals or checkpoints: every time you get a new batch of data, generate the predictions, get new labels, and so on.

*Example of an Airflow DAG with Evidently test suites*
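The checkpoint idea does not depend on a particular orchestrator. The sketch below uses plain Python functions in place of pipeline tasks: it is neither Airflow code nor the Evidently API, and every name in it (`checkpoint`, `CheckFailed`, the two checks, the stand-in `score` model, the `clicks` column) is hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


class CheckFailed(Exception):
    """Raised to stop the pipeline when a checkpoint does not pass."""


def check_no_missing(batch: List[Row]) -> bool:
    return all(value is not None for row in batch for value in row.values())


def check_predictions_in_range(batch: List[Row]) -> bool:
    return all(0.0 <= row["prediction"] <= 1.0 for row in batch)


def checkpoint(name: str, checks: List[Check], batch: List[Row]) -> None:
    # Run all checks for this stage and fail loudly if any of them fails.
    failed = [check.__name__ for check in checks if not check(batch)]
    if failed:
        raise CheckFailed(f"{name}: {', '.join(failed)}")


def score(batch: List[Row]) -> List[Row]:
    # Stand-in for a real model: a capped linear score.
    return [{**row, "prediction": min(1.0, row["clicks"] / 10)} for row in batch]


def run_pipeline(batch: List[Row]) -> List[Row]:
    checkpoint("input data", [check_no_missing], batch)             # new batch arrived
    scored = score(batch)
    checkpoint("predictions", [check_predictions_in_range], scored)  # predictions generated
    return scored


print(run_pipeline([{"clicks": 3}, {"clicks": 20}]))
try:
    run_pipeline([{"clicks": None}])
except CheckFailed as error:
    print("pipeline stopped:", error)
```

In an orchestrated setup, each `checkpoint` call would typically become its own task, so that a failed check blocks the downstream steps.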

What’s next?

With this stable release, Evidently is now a production-grade open-source tool to continuously test your ML models and data.

It is a library of metrics and tests, a framework, and a visualization tool. You can integrate it with many other components of your ML workflow.

There are still loads of things to cover! As we work towards a full-scale open-source ML monitoring platform, here is where we are heading next:

- Text data support and text-specific evaluations
- More presets and metrics: from time series analysis to outlier detection
- Spark support for distributed computation
- More integration blueprints

Does this vision resonate with you? Which metrics should we add, and which integrations should we prioritize? Are you already using Evidently in production and have some suggestions? Jump on the Discord community to share or open a GitHub issue.