TL;DR: Meet the new feature in the Evidently open-source Python library! You can now use it to evaluate, test, and monitor text data. This release also includes a new Text Overview preset and new drift detection methods.
What is it?
Evidently now supports raw text data as input. Previously, Evidently only worked with tabular data, expecting categorical or numerical features. Now, you can pass raw text data as well. For example, user reviews, emails, product descriptions, or any other texts your ML models use.
What’s more, you can pass multi-modal data that combines features of different types in a single dataset.
How does it work?
If you want to see the code, you can use this example notebook that shows how different metrics work for text data. If you are new to the tool, we suggest completing the general Getting Started tutorial first.
All relevant Evidently presets, tests, and metrics will now support text data. For example, you can pass your reference and current data and call the Data Drift report to compare the distributions. You only need to specify which columns contain text data using column mapping.
Here is an example column mapping for a dataset with different feature types:
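A minimal sketch of such a mapping, assuming the `ColumnMapping` class and its `text_features` argument as available in the Evidently release this post describes (later versions may differ). The column names are hypothetical and stand in for a product-review dataset. The import is deferred into the function so it only runs when you call it.

```python
# Hypothetical column names for a product-review dataset (not from the post).
TARGET = "Rating"
NUMERICAL_FEATURES = ["Age", "Positive_Feedback_Count"]
CATEGORICAL_FEATURES = ["Division_Name", "Department_Name"]
TEXT_FEATURES = ["Review_Text", "Title"]


def build_column_mapping():
    """Return a ColumnMapping that tells Evidently which columns hold raw text."""
    # Deferred import: requires `pip install evidently`.
    from evidently import ColumnMapping

    return ColumnMapping(
        target=TARGET,
        numerical_features=NUMERICAL_FEATURES,
        categorical_features=CATEGORICAL_FEATURES,
        text_features=TEXT_FEATURES,
    )
```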
Then, call the Report or Test Suite as usual:
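A minimal sketch of that call, assuming the `Report` and `DataDriftPreset` classes from the Evidently release this post describes. The function name, its arguments, and the output file name are hypothetical; `reference` and `current` are expected to be pandas DataFrames with the same columns.

```python
def run_data_drift_report(reference, current, column_mapping,
                          html_path="data_drift_report.html"):
    """Run the Data Drift preset on two DataFrames and save the visual report."""
    # Deferred imports: require `pip install evidently`.
    from evidently.metric_preset import DataDriftPreset
    from evidently.report import Report

    report = Report(metrics=[DataDriftPreset()])
    report.run(
        reference_data=reference,
        current_data=current,
        column_mapping=column_mapping,
    )
    # Write the interactive report to disk (hypothetical default path).
    report.save_html(html_path)
    return report
```

The returned object also exposes `report.json()` and `report.as_dict()` if you prefer the raw metrics over the HTML view.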
Here is how the Data Drift report will look if you pass the text data in some columns.
You can use other presets and metrics too. For example, you can call a Data Quality preset to understand and explore your data. This report provides a dataset overview and various descriptive stats and visualizations for each feature. If your data contains text columns, they will now get a corresponding summary as well:
If you work exclusively with text data, you can run a new Text Overview preset. It combines various metrics related to text data quality and data drift in a single place. You can use it to compare two datasets or to get an overview of a single text dataset.
As always, the interactive visual report is only one of the output options. You can also use Evidently to export these metrics as JSON or a Python dictionary, or run drift checks as part of a pipeline using Test Suites. In that setup, you can generate the visual report only when something goes wrong and you want to debug it.
Drift detection on text data
You are probably wondering what is under the hood of this new text drift detection. How can you decide if two sets of texts are similar or different?
There are two answers here.
First, the default Evidently method for drift detection is model-based drift detection. It is also known as a domain classifier method. This approach applies to any text columns that appear in the Data Drift table.
In this case, Evidently tries to build a binary classification model that can discriminate between data from reference and current distributions. If the model can confidently identify which text samples belong to the “newer” data, you can consider that the two datasets are significantly different.
Text Content Drift.
You can read more about the domain classifier approach in the paper “Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift.” We created a specific implementation for text data in the Evidently library. This new drift detection method is called Text Content Drift.
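To make the idea concrete, here is a toy sketch of a domain classifier in plain Python. It is not Evidently's implementation: a simple word log-odds scorer stands in for the real classifier, and the texts, function names, and train/validation split are all made up for illustration. It trains on part of the data, scores a held-out part, and reports ROC AUC plus the words that pull hardest towards the "current" class.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_log_odds(reference_texts, current_texts):
    """Per-word log-odds of appearing in current vs. reference (add-one smoothing)."""
    ref_counts = Counter(w for t in reference_texts for w in tokenize(t))
    cur_counts = Counter(w for t in current_texts for w in tokenize(t))
    vocab = set(ref_counts) | set(cur_counts)
    ref_total = sum(ref_counts.values()) + len(vocab)
    cur_total = sum(cur_counts.values()) + len(vocab)
    return {
        w: math.log((cur_counts[w] + 1) / cur_total)
        - math.log((ref_counts[w] + 1) / ref_total)
        for w in vocab
    }


def score(text, log_odds):
    """Higher score = looks more like current data; unseen words add nothing."""
    return sum(log_odds.get(w, 0.0) for w in tokenize(text))


def roc_auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): reference = product praise, current = delivery complaints.
reference = ["great dress fits well", "lovely fabric nice colour",
             "fits well nice dress", "great colour lovely fit"]
current = ["refund late delivery", "delivery damaged refund please",
           "late parcel damaged box", "refund delivery problem"]

# Train on the first half of each set, validate on the second half.
log_odds = fit_log_odds(reference[:2], current[:2])
val_texts = reference[2:] + current[2:]
val_labels = [0] * 2 + [1] * 2
val_scores = [score(t, log_odds) for t in val_texts]

drift_score = roc_auc(val_labels, val_scores)
# Words that most strongly indicate "current" data.
top_words = sorted(log_odds, key=log_odds.get, reverse=True)[:3]
print(drift_score, top_words)
```

A score close to 0.5 would mean the classifier cannot tell the two sets apart; a score close to 1 means they are easy to separate.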
The drift score, in this case, is the ROC AUC score of the domain classifier computed on a validation dataset. For small data (<1000 objects), Evidently will compare this ROC AUC to the ROC AUC of a random classifier at a set percentile. We repeat the calculation 1000 times with randomly assigned target class probabilities to ensure the result is statistically meaningful. This produces a distribution of random-classifier scores with a mean of 0.5. We then take the 95th percentile of this distribution and compare it to the ROC AUC score of the domain classifier. If the classifier score is higher, we consider data drift detected. (You can also set a different percentile as a parameter.)
For larger data, the ROC AUC of the obtained classifier is directly compared against the set ROC AUC threshold.
This approach helps identify how well a machine learning model can distinguish between current and reference data. It also protects against false positive drift results for small datasets since we explicitly compare the classifier score against the “best random score” we could obtain.
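The decision rule above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Evidently's code: the function names are hypothetical, the percentile uses a simple nearest-rank rule, and the fixed threshold of 0.55 for larger data is a placeholder value rather than a documented default.

```python
import math
import random


def roc_auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def random_auc_threshold(labels, percentile=95, n_repeats=1000, seed=0):
    """ROC AUC reached by a random classifier at the given percentile."""
    rng = random.Random(seed)
    aucs = sorted(
        roc_auc(labels, [rng.random() for _ in labels])
        for _ in range(n_repeats)
    )
    # Nearest-rank percentile over the simulated scores.
    rank = min(n_repeats, max(1, math.ceil(percentile * n_repeats / 100)))
    return aucs[rank - 1]


def drift_detected(classifier_auc, labels, percentile=95,
                   fixed_threshold=0.55, small_data_limit=1000):
    """Small data: compare against the random baseline; large data: fixed threshold."""
    if len(labels) < small_data_limit:
        threshold = random_auc_threshold(labels, percentile=percentile)
    else:
        threshold = fixed_threshold  # placeholder value (assumption)
    return classifier_auc > threshold


# 60 validation objects: 30 from reference (0) and 30 from current (1).
validation_labels = [0] * 30 + [1] * 30
threshold = random_auc_threshold(validation_labels)
print(threshold, drift_detected(0.9, validation_labels))
```

The smaller the validation set, the wider the spread of random scores, so the bar the domain classifier has to clear moves further above 0.5.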
If the drift is detected, Evidently will also calculate the top features of the domain classifier. The resulting output contains specific characteristic words that help identify whether a given sample belongs to reference or current. They are normalized based on vocabulary, for example, to exclude non-interpretable words like articles.
Exploring specific distinctive words can help to debug the data drift. For example, it can surface words that became more popular with time or a change in the overall sentiment.
Second, Evidently can also detect drift in the text characteristics. With this approach, Evidently first calculates several features that help describe the text itself. We call them text descriptors.
Currently, there are the following features:
- Text Length
- Share of out-of-vocabulary words (OOV %)
- Share of non-letter characters
You can also use text descriptors to check if the texts contain specific words from a list (Trigger Words Presence).
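As a rough illustration, the descriptors listed above could be computed like this in plain Python. The exact definitions Evidently uses may differ: the tiny vocabulary, the word-splitting regex, and the choice to ignore spaces when counting non-letter characters are all assumptions made for this sketch.

```python
import re

# Tiny stand-in vocabulary (assumption); a real check would use a full dictionary.
VOCABULARY = {"the", "dress", "is", "great", "and", "fits", "well"}


def words_in(text):
    return re.findall(r"[a-z']+", text.lower())


def text_length(text):
    """Number of characters in the text."""
    return len(text)


def oov_share(text, vocabulary=VOCABULARY):
    """Percentage of words that are not in the vocabulary."""
    words = words_in(text)
    if not words:
        return 0.0
    return 100 * sum(w not in vocabulary for w in words) / len(words)


def non_letter_share(text):
    """Percentage of characters that are neither letters nor spaces."""
    if not text:
        return 0.0
    return 100 * sum(not ch.isalpha() and ch != " " for ch in text) / len(text)


def has_trigger_words(text, trigger_words):
    """True if any word from the list appears in the text."""
    words = set(words_in(text))
    return any(w.lower() in words for w in trigger_words)


review = "The dress is grrreat and fits well!!"
print(text_length(review), oov_share(review), non_letter_share(review))
print(has_trigger_words("Please send a refund", ["refund", "return"]))
```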
Then, Evidently evaluates the distribution drift in these descriptive features. This helps uncover specific changes, such as if the text length patterns are different between the two datasets.
Text Descriptors Drift.
In this case, Evidently compares the distribution of these text descriptors between the two datasets (reference and current) using different statistical tests or distance metrics. It works the same way as drift detection in other numerical features and follows the same algorithm.
Basically, we treat text descriptors as additional (auto-generated) features.
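For instance, once text length is just another numerical column, any two-sample test applies. The sketch below uses a two-sample Kolmogorov-Smirnov statistic purely as an illustration; which test or distance Evidently picks for a given column is decided by the library, and the texts and function name here are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


# Toy data (made up): short reference reviews vs. long current complaints.
reference_texts = ["Nice dress", "Fits well", "Lovely colour"]
current_texts = [
    "The parcel arrived two weeks late and the box was damaged",
    "I asked for a refund but nobody answered my emails at all",
    "Delivery was slow and the item did not match the description",
]

# The descriptor (text length) becomes an ordinary numerical feature.
reference_lengths = [len(t) for t in reference_texts]
current_lengths = [len(t) for t in current_texts]

length_drift = ks_statistic(reference_lengths, current_lengths)
print(length_drift)
```

A statistic of 0 means the two length distributions coincide; values near 1 mean they barely overlap.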
This drift detection method is optional. If you want to apply it, you should call a metric called TextDescriptorsDriftMetric and include it in your custom Report. Alternatively, you can call the TextOverviewPreset that contains this metric.
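A minimal sketch of such a custom Report, assuming the `TextDescriptorsDriftMetric` class and its `column_name` argument from the Evidently release this post describes. The column name and the function name are hypothetical.

```python
TEXT_COLUMN = "Review_Text"  # hypothetical text column name


def run_text_descriptors_drift(reference, current, column_mapping,
                               text_column=TEXT_COLUMN):
    """Compute descriptor drift for one text column and return it as a dict."""
    # Deferred imports: require `pip install evidently`.
    from evidently.metrics import TextDescriptorsDriftMetric
    from evidently.report import Report

    report = Report(metrics=[TextDescriptorsDriftMetric(column_name=text_column)])
    report.run(
        reference_data=reference,
        current_data=current,
        column_mapping=column_mapping,
    )
    return report.as_dict()
```

To get the full overview instead, you could swap the metric for `TextOverviewPreset(column_name=text_column)` imported from `evidently.metric_preset`.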
Text data quality and integrity
If you want to focus on descriptive statistics instead of statistical distribution shifts, you can do that too.
Evidently calculates various descriptive statistics for each text feature in the dataset. As shown above, the ColumnSummaryMetric for text includes:
- Share of Missing Values
- Text Length (min, mean, max)
- Out of Vocabulary % (min, mean, max)
- Non-letter character % (min, mean, max)
You can get these metrics visually in a table or export them in JSON or Python dictionary format. You can then log and use these metrics however you want, for example, to compare the average length of the text or set a condition on what you expect. (Note: we’ll make this even easier in the next release by giving the ability to pass the auto-generated metrics into existing Tests and Test Suites).
You can also run existing Data Integrity and Data Quality tests on text data. For example, you can test the dataset for missing values as usual.
What is cool about it?
Multi-modal data. You might often deal with mixed datasets containing text and non-text features. This implementation makes it possible to evaluate, test, and monitor these datasets in the same manner, using the same tool. And you can use it for purely text data, too!
Familiar API. We put a lot of effort into ensuring that existing Evidently metrics, tests, and presets are fully compatible with text data. You do not need to install a separate package or learn a different syntax. If you have already been using Evidently with tabular data, using it for texts requires no additional effort. We adapted existing Metrics and Tests to work with text data or auto-generated descriptor features whenever possible.
Interpretable drift detection for texts. Both new drift detection methods are interpretable. They help detect the change and understand what drives it, either by detecting specific characteristic words or understanding changes in the text descriptors.
From data evaluation to NLP model monitoring. You can use these new evaluation methods during initial data analysis or when debugging model issues: in this case, generate a visual HTML report. You can also perform the checks in a batch manner, orchestrating it using a tool like Airflow. For example, you can perform a drift detection test every time you get a new batch of data. In this case, use a Test Suite.
How can I use it?
If you want to try it out, run pip install evidently and give it a spin! If you are new to the tool, head to the docs and review the general Getting Started tutorial first. If you have already been using Evidently on tabular data, you can go straight to the example notebook.
You can run this example notebook that shows how to use different Evidently metrics, tests, and presets with text data. For a more learning-oriented experience, head to the complete tutorial that walks you through the process of building, breaking, and debugging an NLP model on a toy dataset.