TL;DR: Meet the new Data Quality report in the Evidently open-source Python library! You can use it to explore your dataset and track feature statistics and behavior changes. It is available both as an HTML dashboard and a JSON profile.
What is it?
We are happy to announce a new addition to the Evidently open-source Python library: an interactive report on Data Quality.
The Data Quality report helps explore the dataset and feature behavior and track and debug data quality when the model is in production.
You can generate the report for a single dataset: for example, during exploratory data analysis.
It will quickly answer questions like:
- How many features of each type do we have?
- How many features are (mostly) missing or constant?
- How is each of the features distributed?
- Which features are strongly correlated?
You can also generate the report for two datasets and contrast the properties of each feature and the whole dataset side by side.
It will then help you answer the comparison questions:
- Are two datasets similar?
- If something has changed, what exactly?
You can use the comparison feature to understand different segments in your data. For example, you can contrast data from one geographic region against another. You can also use it to compare older and newer data batches: for example, when evaluating different model runs.
The report is available in two formats:
- A visual HTML dashboard. You can spin it up in a Jupyter notebook or Colab or export it as a separate HTML file.
- A JSON profile. It provides a snapshot of metrics you can log or use elsewhere, e.g., to integrate a data quality check as a step in an Airflow DAG.
You are reading a blog about an early Evidently release. This functionality has since been improved and simplified. You can read more about migrating to a single Reports object instead of Dashboards and JSON profiles and check out the current documentation for details.
How is it different from the data drift report?
One might ask, how is it different from the Data Drift report in Evidently?
The data drift report performs statistical tests to detect changes in the feature distributions between the two datasets. It helps visualize distributions but does not go into further detail on feature behavior.
The data quality report looks at the descriptive statistics and helps visualize relationships in the data. Unlike the data drift report, the data quality report can also work for a single dataset.
If you are looking to evaluate the data changes for your production model, you might use both reports together as they complement each other.
How does it work?
To generate the report, Evidently needs one or two datasets. If you are working in a notebook, prepare them as pandas DataFrames. If you are using the command-line interface, prepare them as .csv files.
If you use two datasets, the first is the "Reference," and the second is the "Current." You can also prepare a single dataset and explicitly specify which rows belong to each part to perform the comparison.
Once you import Evidently and its components, you can spin up your report with just a couple of lines of code:
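Here is a minimal sketch of what that might look like, using the Dashboard API from this release. The file names are placeholders; substitute your own data.

```python
import pandas as pd

from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataQualityTab

# Load the reference and current data (file names are placeholders)
reference = pd.read_csv("reference.csv")
current = pd.read_csv("current.csv")

# Build the report with the Data Quality tab and calculate the metrics
dashboard = Dashboard(tabs=[DataQualityTab()])
dashboard.calculate(reference, current)

# Show inline in Jupyter/Colab, or export as a standalone HTML file
dashboard.show()
dashboard.save("data_quality_report.html")
```

To profile a single dataset instead, pass only the reference DataFrame to `calculate`.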
You might need to specify a column mapping to make sure all features are processed correctly. Otherwise, Evidently derives the feature types automatically from the pandas data types.
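For example, a column mapping sketch using the ColumnMapping class from this release might look like the following. All column names here are hypothetical: replace them with your own.

```python
import pandas as pd

from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataQualityTab
from evidently.pipeline.column_mapping import ColumnMapping

reference = pd.read_csv("reference.csv")  # placeholder file names
current = pd.read_csv("current.csv")

# Explicitly declare the role and type of each column
# (all column names below are hypothetical)
column_mapping = ColumnMapping(
    target="churn",
    numerical_features=["age", "monthly_charges"],
    categorical_features=["region", "plan"],
    datetime_features=["signup_date"],
)

dashboard = Dashboard(tabs=[DataQualityTab()])
dashboard.calculate(reference, current, column_mapping=column_mapping)
dashboard.save("data_quality_report.html")
```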
Pro tip: if you have a lot of data, you might want to apply some sampling strategy or generate the report only for some of the features first.
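For instance, a simple random-sampling sketch with pandas; the dataset, sample size, and column names are illustrative:

```python
import pandas as pd

# A hypothetical large dataset
big_df = pd.DataFrame({"feature_a": range(1_000_000), "feature_b": 0})

# Take a random sample of rows and keep only the columns of interest
sample_df = big_df.sample(n=10_000, random_state=42)[["feature_a"]]
```

You would then pass the sampled frame to Evidently instead of the full dataset.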
Let's have a look at what's inside!
The first table gives a quick overview of the complete dataset (or two!). You can immediately spot things like a high share of missing values or constant features.
What's cool here: note the "almost missing" and "almost constant" rows. Real-world datasets often contain such issues, and detecting them early helps sort out features that would be hard to rely on.
Next, you will see a statistical overview and a set of visualizations for each feature. They include descriptive statistics, feature distribution visuals, distribution of the feature in time, and distribution of the feature in relation to the target.
What's cool here:
- For each feature type (numerical, categorical, and datetime), the report generates a set of custom visualizations. They highlight what is most relevant for a given feature type.
- If you are performing a comparison, it also helps detect changes quickly. For example, notice the number of new categories for a categorical feature.
- The visualization of the feature's relationship with the target helps build intuition on how useful the feature is or detect a target leak.
What's more, each plot is interactive! Evidently uses Plotly on the back end, and you can zoom in and out as needed, or switch between logarithmic and linear scale for a feature distribution, for example.
For example, here is how the summary widget for a numerical feature might look:
Here is the numerical feature distribution in time that highlights the values that belong to the reference and current distribution:
The feature-by-target view helps explore the relationship between the feature and the target, and how it changes between the two datasets. Here is an example for a categorical feature:
The report also generates a table summary of pairwise feature correlations and correlation heat maps.
What's cool here:
- It explicitly lists the top-5 most correlated numerical and categorical features.
- If you perform a comparison, it lists the features where correlation has changed between the reference and current datasets.
This way, you can quickly grasp the properties of your dataset and select the features that need a closer look (or should be excluded from the modeling).
And of course, the visuals:
You can check out the complete documentation for more details and examples.
Can I modify the report?
Of course! All Evidently reports can be customized.
You can mix and match the existing widgets however you like or even add a custom widget. Here is the detailed documentation on customization options.
Pro tip: check out our recent release with the text widget. You can use it to annotate the report and highlight some of the findings: for example, if you want to store the report as documentation or share it with other team members.
What about JSON profiles?
Business as usual! You can get the report output as a JSON summary.
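Generating the profile follows the same pattern as the dashboard. Here is a sketch using the Profile API from this release; the file names are placeholders.

```python
import pandas as pd

from evidently.model_profile import Profile
from evidently.model_profile.sections import DataQualityProfileSection

reference = pd.read_csv("reference.csv")  # placeholder file names
current = pd.read_csv("current.csv")

# Build the profile with the Data Quality section and calculate the metrics
profile = Profile(sections=[DataQualityProfileSection()])
profile.calculate(reference, current)

# profile.json() returns the metrics snapshot as a JSON string
data_quality_json = profile.json()
```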
You can use it however you like. For example, you can generate and log the data quality snapshot for each model run and save it for future evaluation. You can also build a conditional workflow around it: maybe generate an alert or a visual report, for example, if you get a high number of new categorical values for a given feature.
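As a sketch of such a conditional workflow: the snippet below uses only the standard library and a hypothetical, simplified profile structure (the actual JSON layout may differ, so check the documentation), raising an alert when a categorical feature gains too many new values.

```python
import json

# A hypothetical, simplified extract of a data quality JSON profile;
# treat the field names and nesting here as assumptions, not the real schema.
profile_json = json.dumps({
    "data_quality": {
        "metrics": {
            "region": {"feature_type": "cat", "new_in_current_values_count": 7},
            "plan": {"feature_type": "cat", "new_in_current_values_count": 0},
        }
    }
})

NEW_CATEGORIES_THRESHOLD = 5

features = json.loads(profile_json)["data_quality"]["metrics"]

# Collect categorical features with too many previously unseen values
alerts = [
    name
    for name, stats in features.items()
    if stats["feature_type"] == "cat"
    and stats["new_in_current_values_count"] > NEW_CATEGORIES_THRESHOLD
]

if alerts:
    print(f"Data quality alert: new categories detected in {alerts}")
```

The same pattern extends to any metric in the profile: pick a threshold, parse the snapshot, and trigger an alert or a full visual report when the condition fires.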
When to use the report
Here are a few ideas on how to use the Data Quality report:
- Exploratory data analysis. This report provides many insights that can help even before you build a model. Use it to understand the data and decide if it is good enough to start.
- Feature selection. You can use the report to quickly sort out the empty and constant (or almost empty and constant) features. You can automate this selection process using the JSON profile, for example, to repeat whenever you retrain the model.
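A sketch of such an automated filter, assuming a hypothetical per-feature summary parsed from the JSON profile (the field names are illustrative, not the actual profile schema):

```python
# A hypothetical per-feature summary, e.g. parsed from a JSON profile
# (field names are illustrative, not the actual schema)
feature_stats = {
    "age": {"missing_share": 0.01, "most_common_value_share": 0.12},
    "coupon_code": {"missing_share": 0.97, "most_common_value_share": 0.02},
    "country": {"missing_share": 0.00, "most_common_value_share": 0.99},
}

MISSING_LIMIT = 0.95   # drop features that are (almost) empty
CONSTANT_LIMIT = 0.95  # drop features that are (almost) constant

# Keep only the features that pass both checks
selected = [
    name
    for name, stats in feature_stats.items()
    if stats["missing_share"] < MISSING_LIMIT
    and stats["most_common_value_share"] < CONSTANT_LIMIT
]
```

Running this filter before each retraining keeps near-empty and near-constant features out of the model automatically.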
- Dataset comparison. You often want to contrast datasets during modeling, testing, or production evaluation. For example, you might use it to compare the train and validation splits or consecutive data batches. Use the dashboard for visual comparison and JSON profiles for automated evaluation.
- Data profiling. You can log a JSON profile of the data used in each model run for future evaluation or analysis.
- Rule-based data quality monitoring. You can build a conditional workflow if you detect a change in your data properties, e.g., an increase in constant values. In this case, you can rely on Evidently to calculate the metrics and then define the logic around it. (Here is an example of how to use Evidently with Airflow).
- Data documentation. This report can document your data properties for future model governance. For example, you can use it to describe the data used in model training.
- Data drift debugging. If you detect data or target drift for your production model, you usually need to drill down into the feature changes. The data quality report can provide additional details for each drifting feature.
- Production model debugging. If you are directly monitoring the model quality and notice a decay, you can spin up this report to dig into the details of the data changes.
How can I try it?
Want to stay in the loop?
Sign up to the User newsletter to get updates on new features, integrations and code tutorials. No spam, just good old release notes.
For any questions, contact us via email@example.com. This is an early release, so let us know about any bugs! You can also open an issue on GitHub.