How should you approach building an internal ML platform if you're not Google? We put together stories from 10 companies that shared their platform designs and the lessons they learned along the way.
In the past few years, top tech companies have invested in ML platforms to make training and deploying ML models at scale easier and faster. Uber's Michelangelo, Facebook's FBLearner, and Airbnb's Bighead pioneered the space. Since then, many other companies have launched internal ML platform teams.
In this blog, we'll share 10 examples of how companies set off on a journey of building ML platforms.
Want even more ML platform examples?
We went through 150+ blog posts and conference talks and compiled a list of 45 internal ML platforms from top engineering teams. We also curated links to the best articles and videos where they share their learnings.
Bookmark the list and enjoy the read ⟶
Industry: Food delivery
Company size: 6000+ people
Use cases: demand forecasting, fraud detection, search ranking, menu classification, and food recommendations.
Working on its platform, DoorDash started small. First, they focused on a single core component that would bring meaningful business results — a prediction service called Sibyl. Later, they added ML training infrastructure, observability, and feature engineering components to the platform. Now the platform supports ~200 ML models in production and handles 500+ billion predictions per week.
The ML platform team consists of three subteams: a model development and platform team responsible for feature engineering and model training, a model serving team responsible for online and offline predictions, and an ML applications team.
DoorDash’s ML Platform:
[BLOG] 3 Principles for Building an ML Platform That Will Sustain Hypergrowth (2022)
[VIDEO] MLOps at DoorDash. Data + AI Summit (2022)
[BLOG] Ship to Production, Darkly: Moving Fast, Staying Safe with ML Deployments (2022)
[BLOG] Introducing Fabricator: A Declarative Feature Engineering Framework (2022)
[BLOG] Maintaining Machine Learning Model Accuracy Through Monitoring (2021)
[BLOG] Meet Sibyl – DoorDash’s New Prediction Service – Learn about its Ideation, Implementation and Rollout (2020)
[BLOG] DoorDash’s ML Platform – The Beginning (2020)
Industry: Mobility services
Company size: 4500+ people
Use cases: price optimization, fraud detection, safety, ETAs, mapping, incentives.
Lyft has built its ML platform on top of Flyte, a tool to manage ML and data workflows that they open-sourced in 2020.
The company also shared how they implemented other components of the platform: a system for deploying and serving ML models, a model training infrastructure built on Kubernetes, a feature service that hosts several thousand features across a large number of models, and full-spectrum model monitoring.
Lyft’s ML platform:
[BLOG] Powering Millions of Real-Time Decisions with LyftLearn Serving (2023)
[BLOG] Full-Spectrum ML Model Monitoring at Lyft (2022)
[VIDEO] Distributed Machine Learning at Lyft (2022)
[BLOG] ML Feature Serving Infrastructure at Lyft (2021)
[BLOG] LyftLearn: ML Model Training Infrastructure built on Kubernetes (2021)
[BLOG] Lyft’s End-to-End ML Platform (2021)
[BLOG] Introducing Flyte: A Cloud Native Machine Learning and Data Processing Platform (2020)
Industry: Social platforms
Company size: 21,000 people
Use cases: job recommendations, serving relevant ads, content moderation, predicting lead conversion rates.
LinkedIn started to work on the platform in 2017 and introduced Pro-ML in 2019. With hundreds of ML models in production, the goal of Pro-ML was to double the effectiveness of ML engineers. Initially, Pro-ML catered to the core set of ML developers. However, as AI and ML tooling became more accessible to other teams, the platform team stepped up to support the needs of this broader audience.
As a result, LinkedIn launched Pro-ML Workspace, a major user-interface component and one-stop MLOps portal.
The company also shared how they implemented specific components of their ML platform, including model health assurance, ML fairness toolkit, and embedding feature platform.
LinkedIn’s ML platform:
[BLOG] One-stop MLOps portal at LinkedIn (2022)
[BLOG] DARWIN: Data Science and Artificial Intelligence Workbench at LinkedIn (2022)
[BLOG] Open sourcing Feathr – LinkedIn’s feature store for productive machine learning (2022)
[BLOG] Model health assurance platform at LinkedIn (2021)
[BLOG] Using the LinkedIn Fairness Toolkit in large-scale AI systems (2021)
[BLOG] Pensieve: An embedding feature platform (2020)
[BLOG] Scaling Machine Learning Productivity at LinkedIn (2019)
[BLOG] An Introduction to AI at LinkedIn (2018)
Industry: Fintech and banking
Company size: 1800+ people
Use cases: preventing financial crime, fraud detection, credit scoring, enhancing customer experience.
Monzo wants to empower data scientists to work end-to-end. This includes being able to deploy an ML model into production independently rather than handing it over to a backend engineer. At the same time, Monzo trusts data scientists to pick the right tool for the job without framework constraints. They also try to reuse the existing infrastructure and tooling instead of rebuilding it.
Monzo writes an annual blog post about how its ML stack and applications evolve: here are the posts for 2020, 2021, and 2022. Here is the latest version of the stack:
Monzo’s ML stack:
[BLOG] Monzo’s machine learning stack (2022)
[BLOG] An introduction to Monzo’s data stack (2021)
[BLOG] Machine Learning at Monzo in 2021 (2021)
[BLOG] Machine Learning at Monzo in 2020 (2020)
[BLOG] Building a feature store (2020)
Industry: Streaming service
Company size: 10,000+ people
Use cases: recommendations, personalized experience, voice commands, creating music.
Spotify started to work on the ML platform in 2017 to increase the speed of iteration for ML development. Now 50+ teams throughout the company use the platform. In 2020, the platform helped to train over 30,000 models while handling 300,000 prediction requests every second.
Currently, there are 30+ people on the ML platform team. The team’s core metric is the “number of model iterations per week,” which they call “ML Productivity.”
They also shared how they approached building the platform as a product for internal users and the lessons learned when figuring out the unique differentiator of ML Home as a gateway to all ML platform capabilities.
Spotify’s ML platform:
[BLOG] Product Lessons from ML Home: Spotify’s One-Stop Shop for Machine Learning (2022)
[VIDEO] How Spotify Does ML At Scale (2021)
[VIDEO] Jukebox: Spotify’s Feature Infrastructure (2021)
[BLOG] The Winding Road to Better Machine Learning Infrastructure Through Tensorflow Extended and Kubeflow (2019)
[SLIDES] How Kubeflow Pipelines fits into our Machine Learning ecosystem
Industry: Food delivery
Company size: 3000 people
Use cases: search ranking, recommendations, ETA, supply and demand forecasting, item replacement, fraud detection.
Instacart started developing its ML infrastructure in 2016 with its open-source machine learning framework, Lore. However, as the company and the diversity of its ML applications grew, Lore's monolithic architecture became a bottleneck.
To address this problem, Instacart built Griffin, an ML platform based on microservice architecture. It combines third-party solutions (such as Snowflake, AWS, Databricks, and Ray) to support diverse use cases and in-house abstraction layers to provide unified access to those solutions.
Instacart’s ML platform:
[VIDEO] Griffin, ML Platform at Instacart (2023)
[BLOG] Griffin: How Instacart’s ML Platform Tripled ML Applications in a year (2022)
[BLOG] Lessons Learned: The Journey to Real-Time Machine Learning at Instacart (2022)
Industry: Online styling service / E-commerce
Company size: 8000 people
Use cases: recommendations, resource management, logistics optimization, demand modeling, inventory management, algorithmic fashion design.
At Stitch Fix, data scientists own the project end-to-end: from idea to implementation to monitoring. The primary goal of the ML platform was to build an abstraction on top of which data scientists can quickly bring their models to production at scale.
After exploring the existing options (e.g., MLflow, TFX, ModelDB), the Stitch Fix team decided to leverage the data infrastructure they already had and built their own ML platform.
As they put it, the platform is operationally simple, framework-agnostic, and intuitive. It has 90 unique users and supports 50+ production ML services. Stitch Fix's ML organization comprises 145+ data scientists and platform engineers.
Stitch Fix’s ML platform:
[BLOG] Deployment for Free -- A Machine Learning Platform for Stitch Fix's Data Scientists (2022)
[BLOG] What I learned building platforms at Stitch Fix (2022)
[BLOG] Aggressively Helpful Platform Teams (2021)
[VIDEO] The Function, the Context, and the Data—Enabling MLOps at Stitch Fix (2021)
Industry: Delivery and mobility services
Company size: 3000+
Use cases: dispatching drivers, serving food recommendations, detecting fraud, dynamically setting prices.
Gojek is an Indonesian multi-service platform with 20+ products in transport, logistics, food delivery, shopping, payments, and entertainment. The core idea behind Gojek’s ML Platform is to enable new projects to compose solutions out of existing products instead of building from scratch.
The ML platform is built with the existing Gojek tech stack in mind and either abstracts away integration points or makes those integrations easy. They also shared how they implemented individual platform components, from Feast, a feature store they later open-sourced, to Merlin, their deployment service.
Gojek’s ML platform:
[VIDEO] Feature Engineering at Scale with Dagger and Feast (2022)
[BLOG] Finding Intelligence with Turing (2020)
[BLOG] Feast: Bridging ML Models and Data (2020)
[BLOG] Reliable ML Pipelines with Clockwork (2020)
[BLOG] Merlin: Making ML Model Deployments Magical (2019)
[BLOG] An Introduction to Gojek’s Machine Learning Platform (2019)
Industry: E-commerce
Company size: 20,000+
Use cases: recommendations, fraud detection, fake items moderation, stock forecasting, predicting package dimensions.
Mercado Libre developed its ML platform entirely in-house. It is a framework created to support data mining and the development and deployment of ML models for more than 500 users, including analysts, data scientists, and machine learning and data engineers.
Mercado Libre’s ML platform:
[BLOG] Why and when to build a Machine Learning Platform (part 1) (2020)
[BLOG] Why and when to build a Machine Learning Platform (part 2) (2021)
Industry: Travel marketplace
Company size: 600+ people
Use cases: ranking tourist activities, demand forecasting, inventory labeling, recommendations.
GetYourGuide runs 20+ ML-based projects in production with batch and online inference. To offer a standard for building, deploying, and managing new models, the company introduced an ML platform.
With a relatively small team of ML engineers, it would be impractical to develop the platform from scratch. Instead, GetYourGuide focused on leveraging what the community had already created and turned to open source. The platform supports CI/CD, model tracking, and batch and online inference, and is built entirely with open-source tools.
GetYourGuide’s ML platform:
[BLOG] How We Extended our Open Source ML Platform to Support Real-Time Inference (2022)
[VIDEO] How to Build a ML Platform Efficiently Using Open-Source. Data + AI Summit (2022)
[BLOG] Laying the Foundation of our Open Source ML Platform with a Modern CI/CD Pipeline (2021)
Get started with ML monitoring
Evidently is an open-source Python library that helps evaluate, test, and monitor ML models in production. You can use it to detect data drift, surface data quality issues, or track model performance for tabular and text data.
Check it out on GitHub ⟶
Get started tutorial ⟶