November 30, 2020
Last Updated: September 19, 2022

Introducing Evidently 0.0.1 Release: Open-Source Tool To Analyze Data Drift


We are excited to announce our first release. You can now use the Evidently open-source Python package to estimate and explore data drift for machine learning models.

It helps you quickly understand: did my data change, and if yes, where?

How does it work?

As an interactive report right in the Jupyter notebook.

You need to prepare two datasets. One is a reference: we will use it as a baseline for comparison. Pick something you consider a good example: a period when your model performed reliably. It can be your training data, or production data from some past period.

The second dataset is the most recent (current) production data you want to evaluate.

Import your data as a Pandas DataFrame. You can have two DataFrames, or a single one where you explicitly select which rows belong to the reference data and which to the production data.

Then, you can use Evidently to generate an interactive report like this:

A table with the list of features and a column that says whether each of them drifts.
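
In code, generating and saving such a report can be as short as the sketch below. It uses the Iris dataset as stand-in data and assumes the early Dashboard and DriftTab imports; exact module paths and method names may differ depending on your Evidently version.

    from sklearn import datasets

    # Stand-in data: in practice, "reference" would be your training data or a
    # trusted past period, and "current" the most recent production data.
    iris_frame = datasets.load_iris(as_frame=True).frame
    reference = iris_frame.iloc[:75]
    current = iris_frame.iloc[75:]

    # Assumed early-version Evidently API; later releases changed these imports.
    from evidently.dashboard import Dashboard
    from evidently.tabs import DriftTab

    drift_report = Dashboard(reference, current, tabs=[DriftTab])
    drift_report.show()                        # render the interactive report in the notebook
    drift_report.save("iris_data_drift.html")  # or save it as a shareable .html file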

We show the drifting features first, sorted by p-value. Using a statistical test, we make a drift/no drift decision for each feature individually.

You might want to explore them all or look into your key drivers.

By clicking on each feature, you can explore the values mapped in a plot. The green area covers one standard deviation from the mean, as seen in the reference dataset.

A plot that shows data drift for an individual feature.

Or, you can zoom in on the distributions to understand what has changed:

A plot that shows data distribution for an individual feature.

Why is it important?

We wrote a whole blog about Data and Concept Drift. In short, things change, and this can break your models. Detecting this is key to maintaining good performance.

If there are data quality issues, our tool will also pick this up. When your data goes missing or features break, this usually shows in data distributions. We will soon add more fun reports to explore features and analyze data quality. But this one can already serve as a proxy.

What is cool about it?

We implemented the statistical tests for you, so you don't need to think them through. We know these are quite cumbersome to write, and there is a good chance of messing them up. Solved.

We use a two-sample Kolmogorov-Smirnov test for numerical features and a chi-squared test for categorical features, both at a 0.95 confidence level. We will add some levers later on, but this is a good enough default approach.
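
For intuition, here is a rough sketch of that per-feature decision using scipy. It is only an illustration of the approach described above, not Evidently's actual implementation, and details such as the handling of rare categories may differ.

    import numpy as np
    from scipy import stats

    def feature_drift(reference, current, categorical=False, alpha=0.05):
        # Two-sample Kolmogorov-Smirnov test for a numerical feature,
        # chi-squared test for a categorical one, at a 0.95 confidence level.
        if categorical:
            reference, current = np.asarray(reference), np.asarray(current)
            categories = sorted(set(reference) | set(current))
            counts = [
                [np.sum(reference == c) for c in categories],
                [np.sum(current == c) for c in categories],
            ]
            _, p_value, _, _ = stats.chi2_contingency(counts)
        else:
            _, p_value = stats.ks_2samp(reference, current)
        # A small p-value means the two samples are unlikely to come from the same distribution.
        return p_value, p_value < alpha

    # Example: a shifted numerical feature gets flagged as drifting.
    rng = np.random.default_rng(0)
    p_value, drifted = feature_drift(rng.normal(0, 1, 1000), rng.normal(0.5, 1, 1000))
    print(f"p-value={p_value:.4f}, drift detected: {drifted}")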

The visuals are helpful, and would otherwise take considerable time to code in Plotly or Matplotlib. Here, each feature gets an interactive plot you can explore to understand its behavior.

What's more, you can share the report as an .html file. If you have ever had a back-and-forth exchange of screenshots with another department, you will like this one:

Text of email to a marketing department with an .html drift report attached.

Finally, it is dead simple to install and use. No new tool to learn, no service to maintain. Just open your notebook and try it out!

An update: you can also export the results as a JSON profile to integrate easily with your prediction pipelines.
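
As a rough sketch, that export could look like this. It assumes the early model profile API (Profile with a DataDriftProfileSection); module paths and class names changed between Evidently versions, so treat the imports as illustrative.

    import json

    # Assumed early-version API; check the docs for your installed Evidently version.
    from evidently.model_profile import Profile
    from evidently.profile_sections import DataDriftProfileSection

    profile = Profile(sections=[DataDriftProfileSection])
    profile.calculate(reference, current)        # the same reference and current DataFrames as above
    drift_summary = json.loads(profile.json())   # a dict you can log or check inside a prediction pipeline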

When should I use it?

Of course, when your model is in production. But also before.

Here are a few ideas on how you can use the data drift tool:

  • Support your model maintenance. Understand when it is time to retrain your model, or which features to drop when they are too volatile.
  • Before acting on model predictions. Validate that your input data is from the same distribution, and you are not feeding anything outrageously different into your model.
  • When debugging model decay. If the model quality dropped, use the tool to explore where the change comes from.
  • In A/B test or trial use. Detect training-serving skew and get better context to interpret test results.
  • Before deployment. Understand drift in the offline environment. Explore past shifts in the data to define your future retraining needs and monitoring strategy.
  • To find useful features when building a model. Get creative: you can also use the tool to compare feature distributions in your positive and negative class. This will quickly surface the best discriminants.

How can I try it?

Go to GitHub and explore the tool in action using sample notebooks. You can see how it works on the Iris dataset (Jupyter notebook, Colab) and the California housing dataset (Jupyter notebook, Colab). More examples are here.

Head to the docs for more details.

If you like it, spread the word.

________________

If you have any questions or thoughts, write to us at hello@evidentlyai.com. This is an early release, so send any bugs that way, or open an issue on GitHub.


Emeli Dral, Co-founder and CTO (https://www.linkedin.com/in/emelidral/)

Elena Samuylova, Co-founder and CEO (https://www.linkedin.com/in/elenasamuylova/)
