We host regular Ask-Me-Anything series, where ML practitioners share their experiences with the Evidently Community.
Last week, we had two guests — Bozhao Yu, Co-founder and Head of Community, and Sean Sheng, Head of Engineering, at BentoML. BentoML is an open-source unified model serving framework. It aims to simplify model serving and accelerate deployment.
We chatted about why deploying a model is hard, beginner deployment mistakes and how to avoid them, the challenges of building an open-source product, and BentoML's roadmap.
Sound interesting? Read on for the recap of the AMA with Bo and Sean.
How to do ML model deployment right
What are the top three things that make deploying a model so hard?
- Reproducibility of models in production. Models may behave differently in production if the dependencies or the programming language version are inconsistent between the training and serving environments.
- In traditional software engineering, your main concern is the code. In data engineering, your main concerns are the code + the data. In ML engineering, your main concerns are the code + the data + the model.
- Ever-changing data in production. Tools like Evidently are great for detecting data drift and monitoring model performance. Frequently retraining and redeploying the model can help combat data drift.
This blog post by our colleague Tim Liu covers the challenges of model deployment in production quite well.
ML model deployment sits at the intersection of data science and engineering. How can data scientists and ML engineers work together to ship models to production faster and more efficiently? What are your tips?
Sean: Using the right tools for collaboration, delivery, and deployment is half the equation. Intrinsically, the data science and engineering teams have different areas of focus when it comes to model serving. One of the goals of BentoML is to streamline the hand-off and create a separation of concerns between the data science and engineering teams.
If I am a data scientist about to deploy an ML model for the first time, is there a set of questions I should think through to decide on the optimal deployment architecture? Are there some obvious mistakes people frequently make?
Sean: We find that deploying models to production is sometimes an afterthought for a data science team. However, when it comes to delivering value, deployment and serving are just as critical as model training and feature engineering.
Teams should consider typical metrics on the availability, security, and efficiency of the serving framework. In addition, consider whether the framework offers standardization and collaboration features that can scale with your team and future development. Last but not least, consider ease of use and time to production.
I know it is a range, but what is the typical number of models you see companies deploy? Is it 1-10 models, or do you see many companies that have 100s of models in use?
Bo: We see a healthy distribution of all of them. A new startup surprised us with hundreds of models.
Sean: It's all over the place and really depends on the use case. There are use cases where a model is trained for each user, so the number of models needs to scale with their active user growth.
Is there an ultimate set of tools (preferably open-source) you can recommend for a smooth development and model deployment to cover the complete cycle?
Sean: There are way too many tools in the ecosystem, and it will depend on each use case. What we can say is that Evidently is great for model monitoring, and BentoML is great for putting models into production :)
Bo: I will answer the second part first. BentoML is the framework that helps you put models into production. It will not compile your model out to any other ML framework like PyTorch or JAX. BentoML is ML framework agnostic: it can handle any model from any ML framework.
I definitely see a need for a unified ML framework, and I think there are a couple of projects working on that. But I see unification or standardization happening on the model serving front rather than the model front. Data scientists should use the best framework for the job. When putting those models into production, there should be a standard. That's why we created BentoML, the unified model serving framework.
There are efforts to serve Jupyter notebooks as REST microservices, with the idea of never having to refactor Notebook spaghetti code to go into production. Where do you stand on that?
Bo: I think Jupyter Notebook is a great tool for experimentation and exploration. That's the job it is meant to do. The production end brings a different set of challenges. Serving notebooks directly could be a good temporary solution that buys you time to search for and evaluate a more solid one.
What are the best practices for implementing a multi-model prediction? For example, you have a model prediction, a database lookup, and another model prediction. How do you stitch these services up if the models and database are microservices?
Bo: I would say the best practice is to use BentoML :)
I think any solution will do the job if it lets you write the service code easily and takes care of splitting those steps into microservices (if you are deploying to a cluster) or into processes and threads (if within a container).
With BentoML, it is extremely easy:
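A minimal sketch of what such a multi-model service might look like with the BentoML 1.0 API, assuming two models (the names `model_a` and `model_b` are hypothetical) have already been saved to the local model store, and with the database lookup stubbed out:

```python
import bentoml
from bentoml.io import JSON

# Hypothetical model tags; assumes both models were saved earlier
# with bentoml.sklearn.save_model(...).
runner_a = bentoml.sklearn.get("model_a:latest").to_runner()
runner_b = bentoml.sklearn.get("model_b:latest").to_runner()

svc = bentoml.Service("multi_model_pipeline", runners=[runner_a, runner_b])

@svc.api(input=JSON(), output=JSON())
async def predict(payload: dict) -> dict:
    # Step 1: first model prediction
    first = await runner_a.predict.async_run([payload["features"]])
    # Step 2: database lookup to enrich the intermediate result (stubbed here)
    enriched = [list(first[0]) + payload.get("extra_features", [])]
    # Step 3: second model prediction on the enriched features
    second = await runner_b.predict.async_run(enriched)
    return {"result": second.tolist()}
```

When deployed, BentoML can scale each runner independently, so the two models do not have to share resources with each other or with the API server.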
Sean: Async is definitely your friend here to pipeline potentially IO-intensive logic for better throughput and efficiency, along with error handling and SLO implementation when involving multiple microservices.
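The pattern Sean describes can be illustrated in plain Python with asyncio (the model calls and the database lookup below are simulated stand-ins, not BentoML APIs):

```python
import asyncio

async def model_a(x):
    await asyncio.sleep(0.05)  # stands in for an IO-bound inference call
    return x * 2

async def db_lookup(x):
    await asyncio.sleep(0.05)  # stands in for a database round trip
    return x + 1

async def model_b(x):
    await asyncio.sleep(0.05)  # stands in for a second inference call
    return x - 3

async def _chain(x):
    # The three steps of one request: model A -> DB lookup -> model B.
    a = await model_a(x)
    enriched = await db_lookup(a)
    return await model_b(enriched)

async def pipeline(x, timeout=1.0):
    # A per-request timeout is a simple way to enforce an SLO across
    # the whole chain of downstream calls.
    return await asyncio.wait_for(_chain(x), timeout)

async def main(inputs):
    # gather() overlaps the IO waits of independent requests, so total
    # latency approaches one pipeline's latency rather than the sum.
    return await asyncio.gather(*(pipeline(x) for x in inputs))

print(asyncio.run(main([1, 2, 3])))  # [0, 2, 4]
```

Because every step awaits rather than blocks, a single worker can keep many requests in flight while each one waits on the network.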
Building BentoML in the open
How did BentoML come about? Were there specific requirements or frustrations that led to its creation?
Bo: Chaoyu and I were frustrated from both perspectives — vendor and user. So we started working on an end-to-end solution for deploying and managing services in production. When we open-sourced BentoML, we received good feedback from the community and decided to focus on that problem.
As an open-source tool, you must get many (often conflicting) feature requests from your users. How do you decide on the product roadmap? Is there some process behind it?
Bo: That's something we are constantly balancing. For BentoML, we want to ensure the features are providing benefits to all users. That's always our north star. We also want to make sure the framework is extremely easy to customize. We follow the saying: keep simple things simple, complex things possible.
What advice would you give to your younger self who is about to start an open-source ML library?
Bo: Definitely ask yourself why you open source it. Listen to your users and make sure you provide value to them. Always try to build trust.
What are the most unexpected things you've learned while talking to your users (i.e., how they use the framework, perceived difficulty of various parts of the deployment process, etc.)? Can you share a couple of stories?
Bo: It is always fun to learn from our users. I am also in awe of the scale that our users are working on. We see energy companies deploying thousands of models for anomaly detection. That's very surprising from such a "traditional" company.
Do you envision anything around the hot-swap of models? If I have my service running in production and I want to replace my model, what is the best way to minimize downtime and keep it as straightforward as possible, i.e., just swapping files?
Sean: It is not currently supported in BentoML, but we envision building the capability to hot-swap models. While convenient, it is a potentially dangerous operation. In addition to the model hot swap itself, a good solution would involve resource provisioning and traffic orchestration.
What are your plans for the model store? Have you considered ideas around the discoverability of models akin to service discovery in microservices? Suppose I am a software engineer whose ML team has already stored dozens of models on BentoML. In that case, I only want to start using one model that meets specific characteristics without overthinking it (and all this programmatically).
Sean: That's a very interesting use case. We are considering capabilities like scale-to-zero to support multi-model discovery use cases like this.
Do you plan to integrate BentoML with Triton Inference Server? I'm curious about what the timeline looks like and what performance gains you might anticipate.
Sean: We are actively working on the design of the integration with the Triton Inference Server. We can't promise a timeline yet, but things are in the works. The idea is to keep BentoML's programming paradigm and ease of use while taking advantage of the Triton Inference Server's performance as a runner.
One of the things I enjoy most about BentoML is how easy it is to get started and how you don't immediately force Kubernetes down the throats of users on day one. I'm curious how you decided to stave off K8S until a later stage in the user's journey.
Bo: We want to create a standard for serving models. That's the idea behind BentoML. We build related projects that could take advantage of the new architecture and this standard. For example, we have the Yatai project. It is our production-first ML platform on Kubernetes.
Does Yatai handle concepts like A/B testing, Canary deployment, or Shadow deployment? Do you see any way to integrate these concepts with BentoML/Yatai?
Bo: Those features are on our Yatai roadmap. Stay tuned!
When is the BentoML + Evidently integration coming?
Bo: Soon I hope. I was just messaging Elena Samuylova the other day. With BentoML 1.0, we have a good foundation that we can start working on integrations.
* The discussion was lightly edited for better readability.
Want to join the next AMA session?
Join our Discord community! Connect with maintainers, ask questions, and join AMAs with ML experts.