

An MLOps story: how DeepL monitors ML models in production

Last updated: February 18, 2025

Published: September 25, 2023



How do different companies start and scale up their MLOps practices?

We teamed up with Dayle Fernandes, an MLOps engineer at DeepL, who shared their story on implementing their MLOps platform and production ML monitoring.

Keep reading to learn more about:

ML applications at DeepL,
how they built an ML platform with open-source tools,
how they monitor ML models in production,
examples of when ML monitoring alerts count,
and lessons learned on DeepL’s MLOps journey.

Machine learning at DeepL

*DeepL Translator interface. Source:* *deepl.com*

DeepL is a translation and text editing service on a mission to eliminate language barriers worldwide using AI. It supports over 30 languages, and more than one billion people have used DeepL's services.

Deep learning is central to the success of DeepL’s core product — translation. However, there are also many operational use cases where teams use machine learning to get actionable information and optimize for revenue.

For example, they use machine learning to predict churn, personalize content, and provide product recommendations to their users. The company also looks to add anomaly detection and fraud prevention to its arsenal.

To effectively manage such a variety of ML tasks, they work in cross-functional teams. For example, some data scientists focus on experimentation, some purely on metrics, and others on machine learning — they are all scattered across teams. ML and data engineers switch from one project to another, helping data scientists with ideation, scaling resources, and production deployments.

A data platform team manages the data warehouse and data streams, while MLOps takes care of infrastructure and processes throughout the ML lifecycle.

“Being an MLOps engineer at DeepL means wearing many hats, from scheduling a pipeline to maintaining a feature store to controlling how the code is shipped to production.”

Building the MLOps platform

To make this distributed approach work, a company needs a solid ML infrastructure and streamlined processes, from experiment management to model deployment and monitoring.

DeepL's ML platform is built using open-source tools.

*The current MLOps stack at DeepL*

Here are the platform’s key components:

ML lifecycle management. MLflow is used as a core tracking server. They log everything there, from training data to the Evidently monitoring reports.

Data warehouse. All the data resides in ClickHouse, and they use Kafka to move it around.

Orchestration. They have a custom Kubernetes setup and use Argo to orchestrate workflows. With additional abstractions on top of it, data scientists can use Python to get the resources they need without having to deal with Argo files.

Feature store. They built a custom solution on top of ClickHouse as a batch feature store.

Visualization. Metabase is the shared dashboarding service, open to everyone in the company. They also use Grafana to monitor the Kubernetes pods.

ML monitoring. Evidently is their core monitoring solution. The team uses it both for EDA and to detect and monitor data drift. They currently use Reports and Test Suites but will soon deploy the Evidently monitoring dashboard to track metrics over time.

As a rule, there are no Jupyter notebooks in production.

“Data scientists do the exploration in the notebook, but as soon as it goes to training, we provide them with the resources they need. Their models with all the EDA and reports will show at MLflow. We have good processes for using our model registry and experiment tracking. Notebooks never go to production, and code gets refactored to fit a particular solution and have tests in place.”

Production ML Monitoring

DeepL relies on Evidently to monitor models in production, detect data drift, and ensure data stability and integrity. They use the tool even before the model gets to production – starting from the initial data exploration.

“I think Evidently is a very well-built and polished tool. It is like a Swiss army knife we use more often than expected.”

Exploratory data analysis (EDA). During model training, data scientists use Evidently Reports to discover patterns and check the initial assumptions about the data.

“A data scientist can share the report across teams, for example, with the product managers and other stakeholders: ‘Hey, take a look. Here is what the data looks like.’”

Data integrity checks. Before an ML model is fed with training data, they use Evidently Test Suites to ensure the data is stable: for example, that the feature types are correct, data distribution is fine, and there are no outliers.

“What I like about Test Suites is that I can quickly run a bunch of tests and configure them as I like. It takes away a lot of headache of building monitoring suites, so I can focus on how to react to monitoring results.”

*Example of the Evidently data integrity Test Suite on a toy dataset*
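The kinds of checks described above (feature types, missing values, out-of-range values) can be sketched in plain Python. This is only an illustration of the idea, not Evidently's Test Suite API and not DeepL's code; the column names `sessions_per_week` and `plan`, the thresholds, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_type(rows, column, expected_type):
    """Every non-missing value in `column` must be an instance of `expected_type`."""
    bad = [r[column] for r in rows
           if r.get(column) is not None and not isinstance(r[column], expected_type)]
    return CheckResult(f"type:{column}", not bad, f"{len(bad)} value(s) of the wrong type")


def check_missing_share(rows, column, max_share):
    """The share of missing values in `column` must not exceed `max_share`."""
    missing = sum(1 for r in rows if r.get(column) is None)
    share = missing / len(rows) if rows else 0.0
    return CheckResult(f"missing:{column}", share <= max_share, f"missing share = {share:.2f}")


def check_range(rows, column, low, high):
    """Every non-missing value in `column` must fall inside [low, high]."""
    out = [r[column] for r in rows
           if r.get(column) is not None and not (low <= r[column] <= high)]
    return CheckResult(f"range:{column}", not out, f"{len(out)} value(s) outside [{low}, {high}]")


# Hypothetical training batch: one row has an implausible value and a missing plan.
batch = [
    {"sessions_per_week": 12, "plan": "pro"},
    {"sessions_per_week": 3, "plan": "free"},
    {"sessions_per_week": 5000, "plan": None},
]

results = [
    check_type(batch, "sessions_per_week", int),
    check_missing_share(batch, "plan", max_share=0.1),
    check_range(batch, "sessions_per_week", low=0, high=1000),
]
failed = [r.name for r in results if not r.passed]
print(failed)
```

Returning a named result per check, rather than raising on the first failure, keeps the full picture available for whoever decides how to react.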

Data drift monitoring. DeepL runs ML models in batch mode. On each daily run, they use Evidently to compare the current data to a reference dataset that was stored at training time. They pull the reference data from the model registry and check for data distribution drift.

“We use Evidently to monitor data drift to ensure that the underlying assumptions we used when we trained the ML model haven’t changed. And if they have, for how much? Do we need to take action? We use Evidently Reports daily to keep an eye on such things.”
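To make the "reference versus current" comparison concrete, here is a minimal sketch of one common drift score, the Population Stability Index (PSI), computed for a single numeric feature. It is a stand-in chosen for illustration: the article does not say which drift method DeepL uses, and this is not Evidently's implementation. The bin count and the 0.2 threshold (a commonly quoted rule of thumb) are assumptions.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width and derived from the reference sample only.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the reference range into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Replace empty bins with a tiny share so the logarithm is defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


DRIFT_THRESHOLD = 0.2  # assumption: rule-of-thumb cut-off, tune per feature

reference = [float(v) for v in range(100)]  # stored at training time
shifted = [v + 50 for v in reference]       # today's batch, shifted upward

stable_score = psi(reference, reference)
shifted_score = psi(reference, shifted)
print(stable_score, shifted_score > DRIFT_THRESHOLD)
```

In a daily batch job, a score like this would be computed per feature and the per-feature verdicts collected into a single report.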

They also set up an alerting system. If there is no data drift, they simply send a link to the report. If data drift is detected, it triggers a Slack alert listing which tests have failed. The drift reports are then shared with the data science team and product managers, who decide whether the detected drift calls for a change.

“I like how Reports are structured: what went wrong and how it went wrong. It gives all the information in one glance. Even if you use it without configuration, the tool works really well and gives you actionable information.”

*Example of the Evidently data drift Report on a toy dataset*
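The alerting rule described above (a plain link when nothing drifted, a list of failed tests otherwise) could look roughly like the sketch below. It only builds the JSON body in the format Slack incoming webhooks accept (an object with a `text` field) and does not send anything. The report URL, the test names, and the `build_alert` helper are hypothetical.

```python
import json


def build_alert(report_url, failed_tests):
    """Return a JSON payload for a Slack incoming webhook."""
    if not failed_tests:
        text = f"No data drift detected. Report: {report_url}"
    else:
        lines = "\n".join(f"- {name}" for name in failed_tests)
        text = f"Data drift detected. Failed tests:\n{lines}\nReport: {report_url}"
    return json.dumps({"text": text})


# Hypothetical report location and test names.
url = "https://mlflow.example.com/reports/daily"
quiet = build_alert(url, [])
alert = build_alert(url, ["drift:plan", "drift:sessions_per_week"])
print(alert)

# Sending is left out on purpose: it would be an HTTP POST of this payload
# to the webhook URL configured for the team's channel.
```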

Retraining triggers. Understanding drift in production helps decide when it’s time to update the model. Right now, it is up to a data scientist to make the call, but they are looking to automate it.

“We currently use Evidently to make sure that the models we are shipping make sense, are not far off or stale, and that there is not so much drift. In the future, we are looking to introduce automatic retraining — say, the drift hits a particular threshold, and the automatic retraining is triggered. Currently, it comes to knowing when something is wrong and finding out what is wrong to start retraining, A/B testing the new model, and then shipping it.”
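A threshold-based retraining trigger of the kind mentioned in the quote might be sketched as follows. The team describes this as a future plan, so the rule below is purely illustrative: the "share of drifted features" criterion, the 0.3 threshold, and the feature names are assumptions.

```python
def should_retrain(drift_by_feature, max_drifted_share=0.3):
    """Return True when the share of drifted features reaches the threshold.

    `drift_by_feature` maps a feature name to True (drifted) or False.
    """
    if not drift_by_feature:
        return False
    drifted = sum(1 for is_drifted in drift_by_feature.values() if is_drifted)
    return drifted / len(drift_by_feature) >= max_drifted_share


# Hypothetical results from two daily runs.
calm_day = {"plan": True, "country": False, "sessions": False, "tenure": False}
rough_day = {"plan": True, "country": True, "sessions": False}

print(should_retrain(calm_day), should_retrain(rough_day))
```

Even with such a rule in place, the flow in the quote still applies: the retrained model would go through A/B testing before it ships.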

When ML monitoring pays off

ML monitoring keeps models from failing silently and can also help the team better understand the real-world processes behind them. Here are a few examples.

Detecting data quality bugs. Once, the team spotted a significant change in a predicted feature. They drilled down to find a root cause and discovered a bug in how the data was pulled from the data warehouse. This helped catch the issue before it affected end users and started the conversation about data cataloging, documenting changes, and improving the overall data strategy.

“The bug in the query system was a big issue for us as our recommender system was operating on completely wrong assumptions. We found out about it only once we plugged in Evidently. <…> It helped to start a conversation about improving our systems.”

Insights into user behavior. Keeping tabs on the incoming model data also helps to better understand the processes behind it. Translation services are seasonal, so DeepL's data is time-sensitive. When tax season or quarterly deadlines approach, service usage spikes. In contrast, during summer holidays or at Christmas, product usage dips.

“If you retrain the model at the wrong time, your assumptions will be wrong, and you won’t be able to catch those changes. With Evidently, we were able to raise these issues with the sales team: ‘Have you seen those changes in how people interact with our product?’”

Understanding process changes. Monitoring can also catch important changes to the business process. For example, DeepL offers PRO services in a continuously growing list of countries. Once, a monitoring report revealed that a new PRO country had launched that was not represented in the training dataset.

“That helped us catch issues before we started spamming or providing people with irrelevant information.”
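A check for categories that appear in production but were absent from the training data can be as small as a set difference. The sketch below shows the idea only; the country codes and the `unseen_categories` helper are hypothetical and say nothing about which countries were actually involved.

```python
def unseen_categories(reference_values, current_values):
    """Return categories present in the current data but absent from the reference."""
    return sorted(set(current_values) - set(reference_values))


# Hypothetical values of a "country" feature.
training_countries = ["DE", "FR", "NL", "DE"]
todays_countries = ["DE", "FR", "MX", "MX"]

new_values = unseen_categories(training_countries, todays_countries)
print(new_values)
```

A non-empty result is a useful alert in its own right, since a model has no learned behavior for a category it never saw during training.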

Lessons learned

Having designed an internal ML platform, the DeepL team has a few lessons to share.

Do not discount open-source. It is possible to design a complete ML platform using open-source tools.

“People tend to discount open-source. I would say that the current offerings of the open-source tools are very much comparable to what commercial providers offer.”

There is also no need to re-implement everything. For example, DeepL considered building an ML monitoring tool from scratch. However, the team decided it would take too much time and effort.

“I noticed how quickly Evidently evolved. So we realized that by the time we would build the very early release, Evidently would go much more forward. It didn’t make sense to reinvent the wheel. <…> It was clear that the tool would be modular and flexible: that requires a lot of design thinking.”

Think about monitoring early. Without it, ML systems risk operating under the wrong assumptions and providing erroneous results. You might even discover that many of your initial hypotheses are wrong, but only if monitoring is in place to act as a “magnifying glass” that brings visibility into your production data.

“Building monitoring infrastructure is very non-spoken about in MLOps. I know DevOps talks a lot about it, but in ML, it is more about how you serve your model, what the latency is, etc. However, when you start monitoring, you discover a completely different set of things you need to worry about. <…> To be honest, I’ll put monitoring before latency. I don't mind waiting for a response for more than five seconds. But if you have a really bad model that gives terrible predictions, you will lose far more money.”

ML monitoring should be end-to-end. It starts with logging the right things.

“As we have gone through several iterations of our infrastructure evolving, we always had this: “We wish we had done our logs in a certain way.” Had we spent more time thinking about what and how we need to monitor, I wonder if some issues wouldn’t exist.”

ML monitoring covers the ML lifecycle end-to-end: How does the data look? What is the output? Does the final outcome make sense? To be able to answer these questions, one needs to have a well-designed ML monitoring infrastructure in place.

“Before shipping, I would think about monitoring. You need to know what and why you are doing. Monitoring infrastructure is very important. We’ve learned it the hard way. <…> At the end of the day, it is called “infrastructure” for a reason: it is not just the tool. There is a lot of heavy lifting that happens behind the scenes.”

______________

To learn more about DeepL, check their website and careers page.

To learn more about Evidently open-source tools for ML model evaluation and monitoring, check out GitHub and documentation.