We invite ML practitioners to share their experiences with the Evidently Community as a part of the Ask-Me-Anything series. Our recent guest was Hamza Tahir, a Co-founder of ZenML, an open-source MLOps framework for creating reproducible pipelines.
We chatted about MLOps challenges and trends, tools, the future of real-time ML, and Hamza's experience in building an open-source startup.
Sounds interesting? Read on for the recap of the AMA with Hamza.
MLOps challenges and trends
I imagine you often chat with ML teams that use ZenML or consider it. Could you share some of the curious learnings? Was there anything that surprised you? Something along the lines of "I expected more ML teams to do X but was surprised they were not!"
My biggest learning is that even the most technically gifted teams sometimes do not know how to handle the massive MLOps challenges with finite resources and an expected ROI at the end that makes sense for the business. The landscape is immense and hard to navigate. No wonder new entrants into the field find it so daunting. I hope we can carve out an easier educational path for teams in the coming years.
MLOps has been a hot topic for some time now, but the gap between building a good model and successfully launching it in the real world remains. What do you think are TOP-3 factors that make companies fail when deploying ML models into production?
Tough to pick just 3. Let's see:
- The people writing the ML code worked separately (or there was no definition of "production" in the first place), so when they got to the point of production, it just blew up in complexity, and they couldn't deal with it.
- Training-serving skew: the data differs between training and serving, or it's hard to replicate the same features in both.
- No lineage between training and production. When things fail, teams don't know why, how, or who.
- The teams lack the skills or know-how to deploy in the first place, so they throw it over the wall to other departments or hit a wall of research into the thousands of MLOps tools out there XD
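The training-serving skew in the second point is often reduced by routing both paths through a single feature function. Below is a minimal sketch of that idea. The helper `build_features` and the field names `temperature_c` and `mileage_km` are made up for illustration and do not come from ZenML or any other library.

```python
import math


def build_features(record: dict) -> list[float]:
    """Single source of truth for feature logic (hypothetical fields)."""
    temp = float(record.get("temperature_c", 0.0))
    mileage = float(record.get("mileage_km", 0.0))
    return [temp / 100.0, math.log1p(mileage)]


def make_training_rows(records: list[dict]) -> list[list[float]]:
    # Training path: a batch of historical records.
    return [build_features(r) for r in records]


def make_serving_row(request: dict) -> list[float]:
    # Serving path: one live request, transformed by the same function.
    return build_features(request)


history = [{"temperature_c": 80, "mileage_km": 1000}]
# The same input yields identical features on both paths.
assert make_training_rows(history)[0] == make_serving_row(history[0])
```

Sharing the function removes skew in the feature logic only. Differences in the underlying data between training and serving still need monitoring.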
ML monitoring is a popular topic, but it's still new for many. Which trends do you see here? Is monitoring still an afterthought for ML teams, or do you already see teams that set up ML monitoring properly from the beginning?
Monitoring is often the last thing on the minds of most teams I talk to. They usually start with research and notebooks and then slowly and painfully build their way up to production. Often when I meet these teams, they barely have the basics in place, and monitoring is something they either do manually or don't do at all.
Teams further along the MLOps maturity line have a better grasp of things. Still, I find they rarely monitor "properly" but rather do a lot of logging and force re-training to smooth out the brunt of the quality decrease that happens when things drift in production.
However, I believe this will change very soon (with tools like Evidently now making this easier!)
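For readers new to the topic, one simple way to quantify the drift mentioned above is the Population Stability Index (PSI) between a reference sample and current production data. The sketch below uses only the standard library and is an illustration, not Evidently's API. The function name `psi`, the bin count, and the `1e-6` floor are assumptions chosen for this example.

```python
import math


def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps log() defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(psi(reference, reference))  # identical samples
print(psi(reference, shifted))    # upper half of the range only
```

Identical samples give a PSI of zero, and the value grows as the distributions diverge. Which threshold should trigger an alert or a re-training run depends on the use case.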
Which trends do you see in ML deployment?
In my opinion, ML deployment itself is getting more and more trivial. Wrapping a model in an API server and pushing out an endpoint is no longer complicated. Even for the more complex models, managed services make it easier to scale these deployments.
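To illustrate how little code the basic case needs, here is a minimal sketch that wraps a stand-in model in an HTTP endpoint using only Python's standard library. The `predict` function with its fixed weights, the `handle_payload` helper, and the port are all assumptions for this example. A real deployment would load a trained model and would more likely use a dedicated serving framework or a managed service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features: list[float]) -> float:
    """Stand-in for a trained model: a fixed linear scorer (hypothetical weights)."""
    weights = [0.5, 0.25]
    return sum(w * x for w, x in zip(weights, features))


def handle_payload(body: bytes) -> tuple[int, dict]:
    """Parse a JSON request body and return (HTTP status, response dict)."""
    try:
        features = json.loads(body)["features"]
        return 200, {"prediction": predict([float(x) for x in features])}
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected JSON like {"features": [1.0, 2.0]}'}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_payload(self.rfile.read(length))
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    # Blocks until interrupted; call explicitly to start the endpoint.
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally, and a POST request with a body such as `{"features": [2, 4]}` returns a JSON prediction. Keeping the parsing logic in `handle_payload` makes it testable without starting a server.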
However, it is still hard to deploy models reliably. This, in essence, is why MLOps is a thing. It isn't about doing it once; it's about continuously deploying it in a robust manner. That's still hard because data is hard. That's why tools like Evidently are so popular and critical right now, to help us with data questions.
So here is what I see happening:
- The ML industry will conform around a canonical MLOps stack with standardized hand-over points/contracts.
- We will get a deeper understanding of how we can handle silent and non-silent model failures in production as we go from being model-centric to data-centric.
- Once that shift is over, we'll see a Cambrian explosion of model deployments of immense value to the real world.
- The MLOps stack will come from the open-source movement.
Where do you think the future of ML lies concerning batch inference, real-time inference, and real-time learning? Do you believe batch is a relic of the past, and NA is just slow at keeping up?
While it is way harder to do real-time machine learning (especially online training) than batch machine learning, it doesn't mean it is always appropriate to use real-time ML. Some use-cases are either unsuitable or not worth going the real-time route. E.g., when we were building a predictive maintenance system for my previous startup, I did not need to build a real-time component into the system because the training happened, let's say, once a week, and inference occurred once a day. There was no need to tackle that challenge because the use case did not call for it. More complicated doesn't always mean better :)
Conversely, recommender systems, the classical real-time use case, greatly benefit from real-time ML technology, and therefore those ML teams implementing it should be paying attention to the trends.
I always point to Chip Huyen's real-time ML blog whenever this topic is broached. She did a good job summarizing the challenges we face in adopting real-time ML in the industry. I'd say we are still five or so years away from mass adoption.
How many tools are too many? Do you think there is a limit to the number of tools ML teams can use in their workflow? Is it 5, 10, or 25?
The question is "What does it mean to be a tool in the workflow?" Is Git a tool? GitHub? The cloud infrastructure? It's hard to distinguish unless we have a precise definition.
Even if I had a precise definition, I don't think I could answer this question with anything other than "It depends." I've talked to teams who've had maybe 50 tools, all of them out of necessity. Others are on the lower end, and that, I'd say, isn't too bad a decision (especially if you're early). I think what's more important is not to get locked into certain decisions because teams often switch stacks entirely in 1 to 2 years on average.
We all do the same steps: 1) pre-processing steps (handling missing data, formatting, dealing with outliers, etc.), 2) train models and 3) test. Is there an easy way to build pipelines and automate this stuff?
There are many pipelining tools out there that do a good job of helping with all this. I'll be shameless and say try ZenML. It's an MLOps framework that lets you create portable and simple MLOps pipelines. It's geared toward data scientists and ML workflows in particular. If you can't make it work, let me know! :)
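Independent of any particular tool, the three steps from the question can be sketched as plain functions chained by a small runner. This is a framework-free illustration, not ZenML's API. The field names `x` and `y`, the clipping range, and the hold-out fraction are assumptions made for the example.

```python
from statistics import mean


def preprocess(rows: list[dict]) -> list[tuple[float, float]]:
    """Drop rows with missing values and clip outliers in x (hypothetical fields)."""
    clean = [
        (float(r["x"]), float(r["y"]))
        for r in rows
        if r.get("x") is not None and r.get("y") is not None
    ]
    return [(min(max(x, 0.0), 100.0), y) for x, y in clean]


def train(data: list[tuple[float, float]]) -> tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares."""
    xs, ys = zip(*data)
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def evaluate(model: tuple[float, float], data: list[tuple[float, float]]) -> float:
    """Mean absolute error on held-out data."""
    slope, intercept = model
    return mean(abs(slope * x + intercept - y) for x, y in data)


def run_pipeline(raw_rows: list[dict], test_fraction: float = 0.25):
    # Chain the steps: pre-process, split, train, then test on the hold-out part.
    data = preprocess(raw_rows)
    split = int(len(data) * (1 - test_fraction))
    model = train(data[:split])
    return model, evaluate(model, data[split:])


raw = [{"x": i, "y": 2 * i + 1} for i in range(8)] + [{"x": None, "y": 3}]
model, mae = run_pipeline(raw)
print(model, mae)
```

Pipeline frameworks add what this sketch lacks: caching, artifact tracking, and the ability to run the same steps on different infrastructure.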
I didn't know too much about MLOps when I joined ZenML. But in my first week, I remember you were already singing the praises of Evidently as a great open-source tool. What was it that got you so excited about what the team was building?
When we first met, it was not long after I met Elena Samuylova and Emeli Dral, co-founders of Evidently AI, for the first time. I was struck by how similar both our teams' backgrounds were. Both came from solving industrial ML challenges, saw the lack of simplicity in the market, and believed in open-source.
It was the focus on simplicity that made me believe in Evidently. I'd seen many other platforms do similar things, but Evidently worked the way I'd want it to work in an ML pipeline. I also felt like we started at the same time, so there was a certain camaraderie there. I hope that's still alive within both teams.
In our company, we have not yet made a move to Kubernetes (and it is pretty daunting). Is ZenML still a facilitator for us to programmatically serve ML models, maybe interacting with the Docker daemon?
This is a great question! When teams ask me this, I find it helps to answer in terms of orchestration.
An orchestrator in MLOps is the system component that is responsible for executing and managing the execution of an ML pipeline. You can run ZenML pipelines on multiple orchestration systems.
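To make that definition concrete, here is a toy in-process orchestrator that runs steps in dependency order and passes outputs downstream. It is only a sketch of the concept and not ZenML's orchestrator interface. The function `run_dag` and the step names are invented for this example.

```python
from graphlib import TopologicalSorter


def run_dag(steps: dict, deps: dict) -> dict:
    """Run each step after its dependencies, passing their outputs as arguments."""
    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, ())]
        results[name] = steps[name](*inputs)
    return results


steps = {
    "load": lambda: [3, 1, 2],
    "sort": lambda xs: sorted(xs),
    "total": lambda xs: sum(xs),
    "report": lambda s, t: {"sorted": s, "total": t},
}
# Each key maps a step to the steps whose outputs it consumes.
deps = {"load": [], "sort": ["load"], "total": ["load"], "report": ["sort", "total"]}

results = run_dag(steps, deps)
print(results["report"])
```

Production orchestrators do the same job at a larger scale: they schedule steps on remote machines, retry failures, and record what ran.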
There are standard orchestrators that ZenML supports out-of-the-box, e.g., Kubeflow, which is K8s-based, and therefore you have to use K8s to use it. However, we have more orchestrators (like Airflow, upcoming GitHub Actions, Vertex AI, and AWS Step Functions) that do not need K8s.
Also, we made it super simple to write your own orchestrator too! If you want to spin up a VM and run the pipeline, it's very easy (Hit me up, and we can maybe hack it together in a few hours!). Extensibility is key in the framework.
The above argument also applies to other stack components that function on K8s. E.g., for model deployment, you can use the standard MLflow (which is non-K8s) or Seldon (K8s-based) deployers, but we plan to add more like BentoML that are non-K8s-based too! Similarly, it is easy enough to write your own model deployer. You can interact with the Docker daemon and serve models that way if you'd like.
I hope this was clear. If not, I'd encourage you to look at the ZenML website and docs. We are updating them to make this particular point way easier to understand!
Building ZenML in the open
What were the lessons learned when trying to bring out the name for ZenML?
Oh, many, I'd say:
- Measure success and set clear goals. Remember that your metrics will often lag 2-3 weeks, so give yourself room to breathe.
- Be consistent. There will be down weeks and up weeks, but the key is to follow a schedule and keep at it.
- Produce high-quality content. It's better to produce one high-quality piece of content than three low-quality ones.
- Writing code is useless if there is no tutorial/example/video/docs to follow to execute it.
- KISS: Keep it simple, silly. Do not over-engineer.
- Nobody knows the answer. Trust your gut, and be fast. Be data-driven.
- Embrace uncertainty. Building an open-source MLOps framework in 2022 ain't easy.
What got you interested in MLOps instead of just the training part of machine learning?
In 2017 I met Adam Probst, and we started a company together. It used deep learning to predict failures of trucks and buses on the road using their sensor data. I was always more of an engineer by inclination, so I viewed this as a way to sharpen my engineering skills in the ML world.
After three years of doing that stuff, I was well aware of the challenges ML in production faced and had time to think about them at length. It was fascinating how early it was, and I think we all saw it as the "next big thing" in ML. I remember, e.g., using frameworks like TFX when they were at version 0.13.0 and then meeting the Google team behind it in Paris, when we were maybe one of the few teams in the world actively trying to use it in production. It showed us how much on the bleeding edge we were, and this was (and still is) truly exciting!
Why did you choose an open-source model to build ZenML? What do you consider the most challenging part of developing open-source projects?
One of our goals with ZenML was to standardize (yes, standardize) the MLOps pipeline. This goal is too big for just us nerds sitting here in one part of the world. We'd seen enough platforms try to do this before, so we knew what we wanted to build was a framework. Inspired by Next.js, React, and FastAPI in other parts of software development, we believe that the next big MLOps company will be open-source and will have to be extensible and modular. That is way easier to do if you're open-source.
* The discussion was lightly edited for better readability.
Want to join the next AMA session?
Join our Discord community! Connect with maintainers, ask questions, and join AMAs with ML experts.