We are excited to announce our first release. You can now use the Evidently open-source Python package to estimate and explore data drift for machine learning models.
It helps you quickly understand whether your data changed and, if so, where, all as an interactive report right in your Jupyter notebook.
You need to prepare two datasets. The first is the reference: you will use it as a baseline for comparison. Pick something you consider a good example, where your model performed reliably. It can be your training or validation data, or production data from a past period.
The second dataset is the most recent (current) production data you want to evaluate.
Import your data as pandas DataFrames. You can have two DataFrames, or a single one where you explicitly select which rows belong to the reference data and which to the current production data.
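For illustration, here is a minimal sketch of that preparation step with pandas. The column names and the split logic are hypothetical; in practice you would load your own data (for example with `pd.read_csv`):

```python
import pandas as pd

# In practice you would load your own data, e.g.:
# reference = pd.read_csv("reference.csv")
# current = pd.read_csv("current.csv")

# Here, a single toy DataFrame with an explicit split column (hypothetical schema):
data = pd.DataFrame(
    {
        "feature": [0.1, 0.3, 0.2, 0.4, 1.1, 1.3, 1.2, 1.4],
        "period": ["reference"] * 4 + ["current"] * 4,
    }
)

# Select which rows belong to the reference and which to the current data
reference = data[data["period"] == "reference"].drop(columns="period")
current = data[data["period"] == "current"].drop(columns="period")
```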
Then, you can use Evidently to generate an interactive report like this:
We show the drifting features first. Using different statistical tests and metrics, Evidently makes a drift/no drift decision for each feature individually.
You might want to explore them all or look into your key drivers.
By clicking on each feature, you can explore the values mapped in a plot. The green area covers one standard deviation from the mean, as seen in the reference dataset.
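The band itself is simple to reproduce. A minimal sketch with NumPy, assuming a single numeric feature from the reference dataset:

```python
import numpy as np

# Hypothetical reference values for one numeric feature
reference = np.array([10.0, 12.0, 11.0, 13.0, 9.0, 11.5])

mean = reference.mean()
std = reference.std()

# The shaded area spans one standard deviation around the reference mean;
# current values falling outside this band are worth a closer look
lower, upper = mean - std, mean + std
```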
Or, you can zoom in on distributions to understand what has changed:
You are reading a blog about the first Evidently release. The tool has evolved since then! It supports various ML monitoring metrics and architectures. Check out the current documentation.
We wrote a whole blog about Data and Concept Drift. In short, things change, and this can break your models. Detecting this is key to maintaining good performance.
If there are data quality issues, Evidently will also pick this up. When your data goes missing or features break, this usually shows in data distributions. We will soon add more fun reports to explore features and analyze data quality. But this one can already serve as a proxy.
We implemented different statistical tests and drift detection methods, so you don't need to think them through. These are cumbersome to write and easy to get wrong. Solved.
By default, for smaller datasets, Evidently uses a two-sample Kolmogorov-Smirnov test for numerical features and the chi-squared test for categorical features, both at the 0.95 confidence level. To detect drift in larger datasets, it uses distance-based metrics like Wasserstein distance instead. You can also select any of the other methods available in the library.
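The per-feature decision can be sketched with scipy. This mirrors the default logic described above rather than Evidently's actual code, with the 0.05 significance threshold corresponding to the 0.95 confidence level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Numerical feature: two-sample Kolmogorov-Smirnov test.
# The current data is shifted, so drift should be detected.
ref_num = rng.normal(loc=0.0, scale=1.0, size=500)
cur_num = rng.normal(loc=0.7, scale=1.0, size=500)
_, p_num = stats.ks_2samp(ref_num, cur_num)
num_drift = p_num < 0.05  # reject "same distribution" at 0.95 confidence

# Categorical feature: chi-squared test on a contingency table of category counts.
# The category balance changes between datasets, so drift should be detected.
ref_cat = ["a"] * 80 + ["b"] * 20
cur_cat = ["a"] * 50 + ["b"] * 50
categories = ["a", "b"]
table = [
    [ref_cat.count(c) for c in categories],
    [cur_cat.count(c) for c in categories],
]
_, p_cat, _, _ = stats.chi2_contingency(table)
cat_drift = p_cat < 0.05

print(num_drift, cat_drift)  # True True
```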
The visuals are helpful, and would otherwise take considerable time to code in Plotly or Matplotlib. Here, each feature gets an interactive plot you can explore to understand its behavior.
What's more, you can share this report around as an HTML file. If you have ever had a back-and-forth exchange of screenshots with another department, you will like this one:
Finally, it is dead simple to install and use. No new tool to learn, no service to maintain. Just open your notebook and try it out!
You can also export the results as JSON or Python dictionary to integrate easily with your prediction pipelines.
You can use it, of course, when your model is in production. But also before.
Here are a few ideas on how you can use the data drift tool:
Go to GitHub and explore the tool in action using sample notebooks.
Head to docs for more details.
If you like it, spread the word.
Sign up for the user newsletter to get updates on new features, integrations, and code tutorials. No spam, just good old release notes.
Subscribe ⟶
If you have any questions or thoughts, write to us at [email protected]. This is an early release, so send any bugs that way, or open an issue on GitHub.