Evidently feature spotlight: NoTargetPerformance test preset

Last updated: January 23, 2025

Published: November 29, 2022

In this series of blogs, we are showcasing specific features of the Evidently open-source ML monitoring library.

Let’s start with the test preset called NoTargetPerformance, which helps you monitor ML models that receive delayed feedback.

Feature use case

NoTargetPerformance is a pre-built test suite that combines several checks for data quality, integrity, and drift. You can use it when you generate model predictions in batches and get the true labels or values with a delay.

Say your ML model scores the client base weekly. The goal is to recommend which marketing offers to include in a personalized newsletter. After the marketing team sends it out, you might have to wait a few extra weeks for the “ground truth” data on which offers were accepted.

Yet, you still need a way to run some sanity checks regarding your model quality even before you know the truth.

Every time you apply the model, you can call the NoTargetPerformance preset. You can compare this week’s data to the reference and make sure the model’s predictions use quality data and are reasonable.

Here is how the preset output looks on a toy dataset:

Code example

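To call the preset, install Evidently, prepare the reference and current data as pandas DataFrames, and run:

# Imports for the test suite and the preset
from evidently.test_suite import TestSuite
from evidently.test_preset import NoTargetPerformanceTestPreset

no_target_performance = TestSuite(tests=[
   NoTargetPerformanceTestPreset()
])

# Compare this week's data against the reference week
no_target_performance.run(reference_data=past_week, current_data=curr_week)
no_target_performance  # renders the results in a notebook cell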

Evidently will then apply multiple tests to your data and return a summary of the results.

Preset contents

The preset combines several checks that go well together.

Data quality checks. They verify if the input column types match the reference and whether you have features out of range (for numerical columns) or out of the list (for categorical data). The preset also checks for missing data.

If these tests fail, it can be an early indicator of potential model issues.

For example, if the data schema does not match or there are a lot of nulls, you might need to investigate the data issues with the upstream data producer or check your feature transformation pipelines.

Do some features contain values out of range or new unexpected categories? You might want to look at them manually to judge if the model will handle these new inputs well.
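To make these checks concrete, here is a minimal pure-Python sketch of the same ideas for a single column: the share of missing values, numerical values outside the reference range, and categories not seen in the reference. It is an illustration only, not Evidently's implementation; the function name `check_column` and the sample values are made up.

```python
def check_column(reference, current, categorical=False):
    """Compare one column of current data against the reference.

    Missing values are represented as None in this sketch.
    """
    ref_values = [v for v in reference if v is not None]
    present = [v for v in current if v is not None]
    missing_share = (len(current) - len(present)) / len(current)

    if categorical:
        # Out-of-list check: categories never seen in the reference
        allowed = set(ref_values)
        unexpected = [v for v in present if v not in allowed]
    else:
        # Out-of-range check: values outside the reference min/max
        lo, hi = min(ref_values), max(ref_values)
        unexpected = [v for v in present if not lo <= v <= hi]

    return {"missing_share": missing_share, "unexpected_values": unexpected}


# Hypothetical numerical and categorical columns
age_report = check_column([25, 40, 61], [30, None, 95, 44])
offer_report = check_column(["gold", "basic"], ["gold", "trial"], categorical=True)
print(age_report)
print(offer_report)
```

In practice you would run a check like this per column and decide which failures should block the batch and which only deserve a manual look.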

Prediction drift. Assuming all is well with the data, the next step is to check the model outputs. The goal is to identify if there is a distribution shift in the model prediction.

The test automatically picks an appropriate drift detection method (e.g., Kolmogorov-Smirnov, Wasserstein distance, etc.) based on the type of the model output and the number of objects. It returns a drift/no drift result for the model outputs.

In the absence of ground truth, shifts in the model predictions are often the best proxy to judge its quality.

For example, what if the model suggests a single product to half the user base? That is something you’d probably want to notice and check on before sending out the newsletter.
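To give a feel for what a drift test on predictions measures, here is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic, one of the methods mentioned above. It computes the largest gap between the empirical cumulative distributions of two samples. This is an illustration under our own assumptions, not Evidently's implementation: the function name and the sample scores are made up, and a real test would also turn the statistic into a p-value.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)

    def ecdf(sorted_values, x):
        # Share of values that are less than or equal to x
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The gap can only change at observed points, so checking them is enough
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in ref + cur)


# Hypothetical predicted probabilities for two weeks
past_scores = [0.10, 0.20, 0.35, 0.40, 0.55]
curr_scores = [0.60, 0.70, 0.80, 0.85, 0.90]
print(ks_statistic(past_scores, curr_scores))
```

The closer the statistic is to 1, the less the two prediction distributions overlap.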

Input data drift. The preset also tests for distribution shifts in the model input features.

If you detect prediction drift, understanding the feature drift can help debug what contributes to it. You can also rely on this as a standalone check to detect and explore meaningful shifts in your data.

Evidently checks each feature for distribution drift and returns the overall share of drifting features (alerting if more than ⅓ of the features drifted). It uses the same automated algorithm to detect an appropriate statistical test for each feature.
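The dataset-level decision described above boils down to simple arithmetic. Here is a small sketch, assuming you already have a per-feature drift verdict. The function name, the feature names, and the verdicts are hypothetical; the ⅓ default mirrors the threshold mentioned in the text.

```python
def dataset_drift(column_drifted, drift_share=1 / 3):
    """Return the share of drifted features and whether it exceeds the threshold."""
    share = sum(column_drifted.values()) / len(column_drifted)
    return share, share > drift_share


# Hypothetical per-feature verdicts from individual drift tests
verdicts = {"age": True, "tenure": False, "region": True, "plan": False}
share, drifted = dataset_drift(verdicts)
print(f"{share:.0%} of features drifted, dataset drift: {drifted}")
```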

Figure: NoTargetPerformance share of drifted columns

You can easily override the default drift test conditions. For example, you can declare dataset drift only when 50% of the features drift, or use a different statistical test.

Say you want to use the Population Stability Index (PSI) and treat values above 0.2 as drift for an individual feature. Here is how you can pass this condition:

no_target_performance = TestSuite(tests=[
   NoTargetPerformanceTestPreset(stattest='psi', stattest_threshold=0.2, drift_share=0.5)
])
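If you have not worked with PSI before, the sketch below shows one common way to compute it for a numerical feature: bin both samples on the reference range, then sum (current share − reference share) × ln(current share / reference share) over the bins. This is our own simplified illustration, not Evidently's code. The function name, the number of bins, and the small `eps` used to avoid empty bins are assumptions.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-4):
    """Population Stability Index between two numerical samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant column

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range go to the edge bins
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Hypothetical feature values: the current week is shifted upwards
past_week_values = list(range(100))
curr_week_values = [v + 50 for v in range(100)]
print(psi(past_week_values, curr_week_values) > 0.2)
```

With the 0.2 threshold from the example above, a strongly shifted feature like this one would be flagged as drifted, while an unchanged one yields a PSI of zero.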

You can also pass a list of specific columns to check for drift (we recommend using your most important features).

Input data stability. One more check specifically verifies the change in the mean values of numerical features. It automatically tests if the means of all numerical columns are within 2 standard deviations from the reference mean.

This is an easily interpretable check. It is also less sensitive than most default drift detection approaches: rather than flagging subtle distribution changes, it helps quickly identify columns that are far from the expected range and points to the most-changed inputs.

Even if these changes appear in not-so-important features, you might want to take a look to see what is going on.
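The rule itself is easy to reproduce. Here is a minimal sketch with the standard library, assuming the reference spread is measured with the sample standard deviation (the exact estimator Evidently uses is not stated here). The function name and the numbers are made up for illustration.

```python
from statistics import mean, stdev


def mean_is_stable(reference, current, n_sigmas=2.0):
    """True if the current mean lies within n_sigmas reference standard deviations."""
    ref_mean = mean(reference)
    ref_std = stdev(reference)  # sample standard deviation (an assumption)
    return abs(mean(current) - ref_mean) <= n_sigmas * ref_std


# Hypothetical basket sizes for the reference week and two later weeks
reference_week = [10, 12, 11, 9, 13]
print(mean_is_stable(reference_week, [11.5, 12, 10.5]))  # mean stays close to 11
print(mean_is_stable(reference_week, [20, 21, 19]))      # mean jumps to 20
```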

Figure: NoTargetPerformance mean value stability

What is cool about it?

1. This preset provides a reasonable template.

Yes, you can probably tune this approach to better suit your needs and add additional tests or drop some of the existing ones. But this is already a great starting point to perform a set of checks without writing complex rules and thinking too long about it.

2. It works out of the box.

You only need to pass the reference and current data: for example, to compare this week to the past. Evidently will automatically infer data schema and feature behavior to perform the comparison.

There is no complex setup. You do not need to manually write all the conditions or think through how to perform a distribution drift test unless you want to.

3. It helps both detect and debug.

The test suite explicitly returns which tests passed and which failed. You can then navigate and look at the supporting visuals for the failed tests to see what exactly has gone wrong.

For example, if the prediction drift test fails, you can see the distribution of the outputs to understand what exactly changed.

What else can I do?

A lot!

You can adapt this preset to your needs. For example, pass only the most important features to check for drift, choose a different statistical test, set a different threshold, etc. Each of the tests has convenient parameters you can select.
Change the contents of the preset. If the example inspires you but it does not precisely match what you want to see, you can create a custom test suite by selecting tests from those available in the Evidently library.
Automate the check as part of the prediction pipeline. You can integrate this check in your prediction pipeline using tools like Airflow.
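For the automation case, a pipeline step usually needs a simple pass/fail signal. The sketch below assumes you have exported the suite results to a dictionary (for example with `as_dict()`) that contains a `tests` list where each entry has a `name` and a `status`. Treat that layout as an assumption and check it against your Evidently version. The function name and the sample result are hypothetical.

```python
def gate_on_tests(result):
    """Raise an error if any test in the exported suite result did not succeed."""
    failed = [t["name"] for t in result["tests"] if t["status"] != "SUCCESS"]
    if failed:
        # An exception like this would fail the task in an orchestrator such as Airflow
        raise RuntimeError("Data checks failed: " + ", ".join(failed))
    return True


# Hand-written sample result, only for illustration
sample_result = {
    "tests": [
        {"name": "Share of Drifted Columns", "status": "SUCCESS"},
        {"name": "Drift per Column", "status": "FAIL"},
    ]
}

try:
    gate_on_tests(sample_result)
except RuntimeError as err:
    print(err)
```

Failing the task early keeps a questionable batch of predictions from reaching the downstream steps until someone has looked at the report.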