We invite ML practitioners to share their experiences with the Evidently Community during the Ask-Me-Anything series.
This time, we talked with Lina Weichbrodt. Lina is a pragmatic ML consultant developing and running ML applications with millions of users. As a freelance ML engineer, she works on various use cases: from predicting traffic problems from tweets to implementing underwriting risk scoring. Previously, she developed real-time personalization models at Zalando and introduced MLOps best practices at the German online bank DKB.
We chatted about ML monitoring, adopting LLMs, and the challenges of being a freelance ML engineer.
Sounds interesting? Read on for the recap of the AMA with Lina.
ML monitoring and debugging
What monitoring advice would you give to a team that has deployed a model in production for the first time? What are the metrics they need to keep tabs on?
Without knowing the type of use case, model, and domain, I would point you to my blog post on monitoring ML applications.
Does this team already have regular backend monitoring? If not, they need to add it first. The exception is when it matters less because the model only does once-a-day batch scoring, and the results are not directly returned to customers :)
Could you share an ML monitoring use case that has been the most challenging/tricky to detect in your experience?
I found recommender systems a bit hard since there is no ground truth. But on a positive note, monitoring is not that difficult once you start. It is just a new field, so there are no established best practices yet.
In one of your presentations, you spoke about monitoring heuristics-based proxy metrics that help capture when an ML product is not performing well. For example, if you have a recommender system, you can check whether it ranks the users’ most popular items high enough. Do you have some favorite examples of such heuristics you discovered when working on specific use cases?
A good example comes from Spotify: they monitored the position of a user's most used box on the homepage.
For recommendations, I used heuristics like:
- % of the first four articles being personalized,
- % of terrible responses (empty, top sellers, etc.).
These heuristics are often stable. So if you usually have 5% of bad responses and suddenly it jumps to 10%, something has happened.
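The stability of such proxy metrics makes them easy to alert on. As a minimal sketch, assuming a hypothetical `is_bad` check for "terrible" responses and a baseline rate measured earlier, the jump from 5% to 10% described above can be caught like this:

```python
# Sketch of a proxy-metric alert. `is_bad` and the baseline rate are
# illustrative assumptions, not from a specific production system.

def bad_response_rate(responses, is_bad):
    """Share of responses that fail a simple heuristic check."""
    return sum(1 for r in responses if is_bad(r)) / len(responses)

def should_alert(rate, baseline=0.05, tolerance=2.0):
    """Alert when the live rate exceeds the usual baseline by a factor."""
    return rate > baseline * tolerance

# Example heuristic: an empty result or a plain top-seller fallback is "bad".
def is_bad(response):
    return not response or response == ["top-sellers-fallback"]

responses = [["item-1", "item-2"], [], ["top-sellers-fallback"], ["item-3"]]
rate = bad_response_rate(responses, is_bad)  # 2 of 4 responses are bad
print(should_alert(rate))                    # 0.5 > 0.05 * 2.0, so alert
```

The tolerance factor is a design choice: proxy metrics fluctuate a little, so alerting only on a clear multiple of the baseline avoids noisy pages.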
How can one prioritize what to check for first when debugging the model performance drop? If the model starts to degrade, is there a general approach to follow to find a root cause?
The first goal would be to detect it fast so it is easier to find the root cause. If you have a live service and monitor its returned values, you can check the following:
- Was there a recent deployment of your code or model?
- If not, did the input data change?
Most of the bugs I observed were due to our team making changes.
Another technique is to monitor segments (iOS vs. Android, different countries, etc.). This can also give you hints about what might be going wrong.
If you monitor the outputs of a service, you get real-time detection. Monitoring the ground truth performance is also useful, but the real performance is often only available with a delay. So I recommend doing both. Here are the play-by-play steps to take.
How do you choose the right toolkit for each project, and what factors do you consider when making that decision?
First, the industry in which I operate. If you work in healthcare or finance, the rules can be strict. In that case, partial automation might be the way to go, with humans acting as "the tool." E.g., all positive decisions like approvals are made by a model, and a person decides on rejections.
If the industry is not critical, monitoring on aggregates is acceptable.
I also have to figure out how much time my team can spend on monitoring and if someone is interested in following up with findings. If no one is interested, I only do basic monitoring of the outputs to detect big issues. I have not yet had a company/use case where the team, data operations, and responsibilities required in-depth monitoring of all steps, including inputs.
In conclusion, I go with their existing monitoring framework and add useful metrics like distributions of the outputs. This can be done with any metric library the team already uses.
Then I close the loop by connecting ground truth with predictions (if available).
I usually start without specific tools and use backend monitoring tools. But I can imagine adding other tools if the team is equipped to follow up with proper data ops and the product cares.
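Monitoring output distributions does not require a dedicated ML tool. Here is a minimal sketch of the idea, comparing a live window of model outputs against a reference window using summary statistics; the scores and the drift threshold are illustrative assumptions:

```python
# Compare live model outputs against a reference window. The threshold
# (shift measured in reference standard deviations) is illustrative.
import statistics

def distribution_shifted(reference, live, max_mean_shift=0.1):
    """Flag when the live mean drifts from the reference mean by more
    than `max_mean_shift` reference standard deviations."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    shift = abs(statistics.mean(live) - ref_mean)
    return shift > max_mean_shift * ref_std

reference_scores = [0.62, 0.58, 0.60, 0.61, 0.59, 0.63]  # historical outputs
live_scores = [0.40, 0.42, 0.39, 0.41, 0.38, 0.43]       # recent outputs
print(distribution_shifted(reference_scores, live_scores))  # True: drifted
```

In practice, you would emit the mean and quantiles of the live window through whatever metric library the team already uses and set the alert in the existing monitoring stack.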
Working with LLMs
With LLMs everywhere on social media, are companies asking today for LLM-related projects? How has the LLM wave impacted the type and scope of projects requested?
Yes, I work on LLMs now.
With traditional ML, it took some time for the companies (especially from more traditional industries) to adopt it successfully and get to production use cases. Do you believe it will be different with LLMs? Are the companies moving faster now? What are the most pragmatic business use cases you expect to see?
LLMs have shaken many companies out of their sleep on AI topics. I hope they are now more open to trying AI. Though, my realistic expectation is that they will try LLMs (probably the worst place to start your ML journey), use them as a gimmick, find them lacking, and then abandon AI because it "did not work" :)
But I may be a slight cynic after seeing too many poorly set up AI projects in the past. Some companies really get it, but these companies already got it before.
However, LLMs are indeed a breakthrough in language projects. Like the breakthrough we got in computer vision in 2017 or so. So I expect great projects and products in the language space from companies who get it. The others will probably buy these projects as products or services.
What's the best way to monitor custom domain-specific LLMs? Note that these are Conversational AI models.
I am working with LLMs and find them challenging.
One idea I came up with: encode expected responses as embeddings and compare new responses to them in cosine space. New responses to the same question should be roughly similar (otherwise, it means the model gives users wildly differing responses).
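The comparison itself is just cosine similarity between embedding vectors. A minimal sketch, assuming some embedding function (e.g., a sentence-embedding model) maps text to vectors; here the embeddings are hard-coded toy vectors for illustration:

```python
# Toy cosine-similarity check between an expected response embedding and
# new responses. Real embeddings would come from a sentence-embedding model.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

expected = [0.9, 0.1, 0.2]              # embedding of an expected response
similar_response = [0.85, 0.15, 0.25]   # close in meaning
different_response = [0.1, 0.9, 0.1]    # wildly different

print(cosine_similarity(expected, similar_response) > 0.9)    # True
print(cosine_similarity(expected, different_response) > 0.9)  # False
```

The similarity threshold (0.9 here) is an assumption and would need tuning per embedding model and domain.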
For easier tasks like summarization, the models work well. For open-ended conversations, I found them very difficult to productionize and would aim to replace them with more specialized, controllable models if possible.
Other techniques for conversation are:
- Use the OpenAI moderation endpoint to detect special problems they cover.
- Use another LLM prompt that is like a "watcher." However, that gets expensive.
- I talked to someone using heuristics, e.g., the model outputs numbers or certain names.
It's a difficult subject and also depends a bit on your domain.
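The heuristic checks from the last bullet can be as simple as pattern matching on the model output. A sketch, where the blocked-name list and flag names are hypothetical examples, not from a real deployment:

```python
# Simple heuristic flags on LLM responses, e.g. numbers or certain names.
# BLOCKED_NAMES is a hypothetical domain-specific list.
import re

BLOCKED_NAMES = {"acme corp", "competitor inc"}

def flag_response(text):
    """Return a list of heuristic flags raised by a model response."""
    flags = []
    if re.search(r"\d", text):
        flags.append("contains-number")
    lowered = text.lower()
    if any(name in lowered for name in BLOCKED_NAMES):
        flags.append("blocked-name")
    return flags

print(flag_response("Call us at 555-0100"))    # ['contains-number']
print(flag_response("Try Acme Corp instead"))  # ['blocked-name']
print(flag_response("Happy to help!"))         # []
```

Unlike an LLM "watcher," these checks are cheap enough to run on every response, so they work well as a first line of defense.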
Being a freelance ML engineer
What inspired you to become a freelance ML engineer, and what do you enjoy most about your work?
My hatred of open-plan offices. Only half joking :)
Also, I like making my own decisions about who to work with and ensuring the product idea makes sense.
How/where to get started as a freelance ML engineer?
I started by making many connections and then working with them, but I also got contacted by recruiters on LinkedIn looking for freelancers for their clients. You can also list yourself on freelance websites.
Are there "Top 10 ML Hits" for freelancing projects that clients constantly come and request? Such as a recommender system or some classifier?
The classics vary by industry. In finance, it is often credit scoring. In marketing, it is often affinity prediction, and so on.
The most common ML use cases nowadays are in sales and marketing. These are also industry-agnostic (a bank has marketing too).
Then there are industry use cases and company use cases. E.g., typical banking use cases like fraud prediction are found in most banks.
The company-specific ones are based on what the company values and what they do. E.g., if they focus on investment banking, they have other use cases than if they focus on small business loans. I try to look at where the models can contribute value.
To sum up:
- industry agnostic, e.g., marketing, sales, customer service -> nearly all companies have these use cases,
- industry-specific, e.g., fraud prediction and credit scoring in banking,
- company-specific (these can be very impactful and depend on the biggest value drivers of your company).
I am curious about the ML freelancing market. A few years ago, I tried to be a freelancer in the space. The biggest challenge was that companies I spoke to considered ML a "core competence" where the know-how and expertise must be grown in-house. On top of this, getting access to the data, infrastructure, and more made it difficult to act independently as a satellite. What have been your observations on the current ML freelancing market? How easy is it today to acquire clients? And what are they asking?
Yes, I agree with these observations. My technique has been to work with startups because they need to get started fast and sometimes don't mind bringing in a senior expert to kick off their efforts. From my friends in the same space, I also see that it is more difficult to find clients than it is for other types of developers.
You solved lots of ML use cases. What is your favorite ML project you worked on?
Color-based cross-selling at Zalando. That was both fun and girly :)
Keeping up with ML trends
ML is a gigantic domain now, and it is impossible to have an overview of everything, let alone be proficient. In which areas have you specialized, and in the future, where do you want to develop expertise?
I know the daily stress of keeping up with the latest developments is real.
I specialize in pragmatic business advice. Apparently, that is a niche. My customers tell me they talk to other AI engineers, and most of them propose complex, state-of-the-art, long-running projects, while I focus on getting fast heuristics or baselines first and then figuring out where to double down. I also use ready-made services and do the whole backend integration so they get value as soon as possible.
Customers often need advice on how to navigate AI rather than someone who knows the latest state-of-the-art models.
I guess everyone in the industry is overwhelmed with so many things happening at once. Do you have any tips on staying updated without wasting too much time reading every new paper and playing with every new tool?
It's impossible to stay up to date with the latest developments. I like to try things I suspect might be useful for my job. This motivates me, and if it works, I know how to use them. Other people might prefer to drill down on an area.
I am also still looking for newsletters that summarize well. Does anyone have one? Otherwise, I randomly read my LinkedIn and the MLOps chat. The MLOps chat often shares nice articles.
And I try not to stress. It is impossible to try and read everything. You can read in-depth once you have a specific problem to solve.
Are there new technologies or ideas in the ML or data space that you recently discovered and are excited about?
Probably, like everyone, generative AI :) I am waiting for the day when I can create a frontend for my backend services.
* The discussion was lightly edited for better readability.
Want to join the next AMA session?
Join our Discord community! Connect with maintainers, ask questions, and join AMAs with ML experts.