For the Ask-Me-Anything series, we invite ML practitioners to share their experiences with the Evidently Community.
We chatted about how orchestration tools evolved, their place within the MLOps landscape, the future of data pipelines, and building an open-source project amidst the economic crisis.
Sounds interesting? Read on for the recap of the AMA with Rick.
Data orchestration and pipelines
For the uninitiated, what is data orchestration? And how does it fit within the MLOps landscape?
Data orchestration can be thought of as the activity of executing various tasks with the aim of getting to a certain end state.
Different people hold different definitions, but the concept of a DAG is very common. DAG stands for directed acyclic graph. This refers to how these tasks are often modeled: as a graph that in its entirety defines which tasks need to be executed in which order.
You can think of tasks as scripts or functions. They generally do something or produce a result.
To make this more practical, a task in a DAG — or a task that gets orchestrated — could be anything data related. For example, it could fetch a data frame from a database by executing SQL from a Python context. It could query data from a REST API endpoint over HTTP or mathematically optimize parameters by fitting a model to training data. Really it's just code executing a task in the context of something we consider part of the subset of "data tasks."
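To make the DAG idea concrete, here is a minimal sketch using only Python's standard library. The task names and functions (`fetch_rows`, `double`, `total`) are hypothetical placeholders, and this is not how Orchest or any particular orchestrator works internally. It only illustrates tasks modeled as a graph and executed in dependency order.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each one is just a function that produces a result.
def fetch_rows():
    return [1, 2, 3, 4]

def double(rows):
    return [r * 2 for r in rows]

def total(rows):
    return sum(rows)

# The DAG: each task name maps to the set of tasks it depends on.
dag = {
    "fetch": set(),
    "double": {"fetch"},
    "total": {"double"},
}

tasks = {"fetch": fetch_rows, "double": double, "total": total}

def run(dag, tasks):
    """Run every task once, in an order that respects the dependencies."""
    results = {}
    for name in TopologicalSorter(dag).static_order():
        # Pass the outputs of upstream tasks as positional arguments.
        inputs = [results[dep] for dep in sorted(dag[name])]
        results[name] = tasks[name](*inputs)
    return results

results = run(dag, tasks)
print(results["total"])
```

A real orchestrator adds scheduling, retries, isolation, and data passing on top of this ordering logic.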
As for how it fits into MLOps: in my opinion, many MLOps tasks can still be most easily modeled as a batch process that runs a set of related tasks. For example, if you want to retrain a model on new data, you can have a set of tasks that:
- get the new data from some location,
- preprocess that data in a way consistent with how the data for the old model was preprocessed,
- train an actual model on the newly acquired data,
- evaluate the model for properties like drift (go Evidently, yay!),
- store the trained model somewhere.
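The retraining steps above can be sketched as a small batch pipeline. Everything here is an illustrative assumption: the function names, the in-memory toy data, the one-parameter "model," and the mean-shift drift check are stand-ins. A real pipeline would read from an actual data source and use a dedicated tool for the drift evaluation.

```python
import json
import statistics
import tempfile
from pathlib import Path

def get_new_data():
    # Stand-in for reading new data from a database or an API.
    return [
        {"x": 1.0, "y": 2.1},
        {"x": 2.0, "y": 3.9},
        {"x": 3.0, "y": 6.0},
        {"x": None, "y": 1.0},
    ]

def preprocess(rows):
    # Same rule the (hypothetical) old model used: drop rows with missing values.
    return [r for r in rows if r["x"] is not None and r["y"] is not None]

def train(rows):
    # Toy model: fit y = slope * x through the origin with least squares.
    slope = sum(r["x"] * r["y"] for r in rows) / sum(r["x"] ** 2 for r in rows)
    return {"slope": slope}

def evaluate_drift(rows, reference_mean, threshold=1.0):
    # Toy drift check: has the mean of x moved more than `threshold`?
    shift = abs(statistics.mean(r["x"] for r in rows) - reference_mean)
    return {"shift": shift, "drift_detected": shift > threshold}

def store(model, directory):
    # Persist the trained model as JSON.
    path = Path(directory) / "model.json"
    path.write_text(json.dumps(model))
    return path

def retrain(directory, reference_mean=2.0):
    rows = preprocess(get_new_data())
    model = train(rows)
    report = evaluate_drift(rows, reference_mean)
    path = store(model, directory)
    return model, report, path

with tempfile.TemporaryDirectory() as tmp:
    model, report, path = retrain(tmp)
    print(model, report, path.name)
```

Each function maps to one bullet in the list, which is also roughly how the steps would be split into separate tasks in an orchestrated pipeline.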
Not all MLOps is data orchestration. Disjoint areas are things like inference endpoint serving (e.g., FastAPI auto-horizontally scaled) or a streaming system using edge functions for real-time prediction using Kafka streams. These kinds of tasks don't fit the model of batch processing.
Can you please briefly explain how orchestration tools have evolved, given that there are so many today? In your opinion, how has opting for a particular tool changed?
Great question! We've come a long way from orchestration just being a "more advanced cron."
One thing I'm particularly happy with in the case of Orchest is that we're making proper use of all the incredible innovation in the container technology space. Proper isolation (which is part of why we build operating systems the way we do) is key to having a reliable and robust system.
If we want orchestration to be robust and reliable, we can't just throw all projects onto an EC2 instance with a single Python virtualenv and hope things won't fall over as we start adding pipelines.
Another area that I think is promising is the idea of creating a more seamless integration with other parts of the data stack. Standardization of data formats like Apache Arrow can help in this area. Orchestrators are often concerned with data passing (this didn't use to be the case for Airflow, which bolted that functionality on using XComs). Making that data passing more integrated, robust, and performant makes life a lot easier for people authoring data pipelines.
I also strongly believe in specialization and not reinventing the wheel. I would love to outsource container orchestration to a tool like Kubernetes as much as we'd love for users to outsource data pipelines to Orchest.
Following the "tools topic," how would you describe the state of the data pipeline tooling field right now? What is the next big thing in the industry you see?
Hot take I have: I think all cloud-specific data tools are destined for failure.
In the DevOps space, a lot of the industry has rallied behind Kubernetes. It really sucks to have a fundamental architecture piece in your stack that completely locks you in. E.g., if you build against Fargate, you can't move your app to GCP or Azure.
Tools like SageMaker Pipelines, Azure Machine Learning pipelines, Azure Data Factory, and Vertex AI Workbench are destined for failure.
Professionals like to invest in tools they can take to a new job or project. I don't see why anyone would be satisfied with investing time in something they're locked into and cannot use when their context changes.
The other day someone told me that Airflow is not a good product. Yet, the market for pipelines is already saturated. The person observed that companies are not thinking about replacing their existing pipelining toolset at the moment. What would be your counterargument?
Great question! I think replacing Airflow for projects with a large established set of pipelines makes little sense usually.
Luckily, with many data projects, there are new initiatives very frequently. For anything greenfield, people will look at their options (in workflow tools, there are quite a few, but I think the only serious contenders are Airflow, Prefect, Orchest, Dagster, Flyte, Kedro, Metaflow) and decide what's best for them based on their requirements.
Also, people should try the tools and see which product approach suits them best. Read the docs and run through the hello world in 15 minutes.
Quick question: are there new projects in the data tooling space that you are excited about?
Ha, this one is easy — Orchest!
But for real, I think if you haven't checked out Polars, I recommend it.
The pitch to me is a data frame library that, for 98% of cases, obviates the need for Spark/Dask/Ray. This is big because vertical scaling (a bigger box) is much easier than horizontal scaling (cluster).
And I think most data projects absolutely suffer from complexity that you don't realize you're taking on. That is why we've tried to keep Orchest so dead simple that it almost looks like a toy. But guess what? You ship that custom ETL dashboard ahead of schedule, and it gives you the bandwidth to do other stuff.
In your opinion, what are the main pain points data scientists have when building data pipelines?
I love this question! The biggest thing that comes to mind is idempotency. This is the notion that you can run something multiple times, and repeatedly doing so will converge to the same end state.
It's related to making sure you can run your notebook from top to bottom and have a correct result. When you schedule a data pipeline, you want to be able to re-run it even if it failed somewhere in the process of running. This can be a bit tricky to learn but using that mindset enforces a bunch of great coding practices in the process.
Is idempotency the same as reproducibility?
I think reproducibility is more about being able to get the same result as produced earlier, whereas idempotency is more about rerunning the same thing multiple times and not causing any issues.
For example, a pipeline might be considered reproducible if you can use it on a brand-new computer to produce the same result as the original author. But it might not be idempotent because it writes a file to a location once run, and if that file exists, a second run on the same machine will cause the program to crash (e.g., an unhandled exception when a file already exists).
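The file-writing example can be sketched in a few lines of Python. The function names and file names are hypothetical. The first function crashes on a second run because the file already exists. The second can be rerun any number of times and always ends in the same state.

```python
import tempfile
from pathlib import Path

def write_report_fragile(path, text):
    # Mode "x" refuses to overwrite: a second run raises FileExistsError.
    with open(path, "x") as f:
        f.write(text)

def write_report_idempotent(path, text):
    # Write to a temporary file first, then replace the target.
    # Rerunning overwrites the old file with identical content.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(text)
    tmp.replace(path)

with tempfile.TemporaryDirectory() as d:
    fragile = Path(d) / "fragile.txt"
    write_report_fragile(fragile, "result")
    try:
        write_report_fragile(fragile, "result")
        second_run_failed = False
    except FileExistsError:
        second_run_failed = True

    safe = Path(d) / "safe.txt"
    for _ in range(3):
        write_report_idempotent(safe, "result")
    final_content = safe.read_text()

print(second_run_failed, final_content)
```

Both versions are reproducible on a fresh machine, but only the second one survives being rescheduled after a partial failure.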
Challenges of being open-source
With the recent economic crisis, do you think earning money with open source has become more difficult?
I like this question! Running a startup, you have to think on your feet :)
I think it is actually pushing more companies to be healthier in their approach. Giving away free things is great, but it also postpones answering important questions about value and pricing that many startups may not be getting clarity on if they continue to give everything away.
For example, with Heroku's end of the free tier, I think Salesforce just asked people to ask themselves: do I find Heroku worth the money we're charging? And people answering with "not really, I won't be paying for Heroku" is 100% fine too.
Companies building software like ours have real expenses when building these productivity tools. Salaries, administration, computers, co-working rent, etc. At some point, you need to find out whether those things balance out (are you a healthy company?).
I think the economic reality will show what products and services people are willing to pay for because they see the value. But if the value is there, companies will have no trouble earning money with what they've created and are selling. We all have things we buy that we are just super happy to pay for. One example for me is Notion. It's not free, but I love paying for it because it gives me & our team so much value every day!
I'll add that I think open source done well means you're OK with making money on some of the work you do but not all of it. The awesome thing you get back is a great community and the opportunity to give a product to people who might not otherwise be able to use your product. Those people can provide value in different ways than money, like giving feedback or sharing the word about what you've created. Companies that balance this well, I think, are GitLab and HashiCorp.
Why did you choose an open-source model to build Orchest? What do you consider to be the most challenging part of developing open-source projects?
When starting Orchest, we asked ourselves what our goals were. One thing we absolutely hated about some data products in the market is that they just seemed very bad. In the sense that they were "bad software." Poor UX, slow performance, lots of bugs, etc. But because they were often sold "top-down" through executives, the end users never really got much of a say in the buying decision. Meaning that the quality of the experience of using the software did not take the #1 priority for the company building said software.
We thought that was absolutely bonkers. So we tried to ask how we could build the highest quality software possible. That led us to open source. Some of the best software projects today are open source, and we think it owes a lot to the direct feedback loop you can build with your end users.
Open source is attractive for more reasons. It also shows transparency to your end users (the code touching your sensitive data can be inspected and is out in the open for all to see).
Building a community around a project is key. You need people/users to keep you real. What do they want? How is the product delivering on its promises? What are they using it for? All these things are possible but much harder to do without open source.
Also, Sytse from GitLab talks about the "rake rate" of software. It's a way to think about how much value you generate vs. capture. We felt that having a lower "rake rate" meant that we could be making more impact in the world (by reaching more people) while still making enough money to build a financially healthy business.
Building a product for data people
Data people are often not especially good with frontend interfaces. We prefer the command line. How did you build the user interface for your product? What were the challenges in building a UI for a data product?
Interesting observation! Part of the inspiration for building a DAG tool with a strong emphasis on UI is the success of tools like MATLAB, Jupyter Notebook/Jupyter Lab, RStudio, SAS, SPSS, Excel, Mathematica, KNIME, RapidMiner, Dataiku, etc.
While we acknowledge that some data people prefer a CLI and/or framework SDK/API programming approach (e.g., Airflow), we felt that sometimes those "software engineering" oriented tools get in the way of the actual task: working with data and meaningfully putting that data to work. A design goal we had while building Orchest was to empower people to take their existing Data Science/Data Analysis/Data Engineering workflow (e.g., working with Jupyter notebooks locally to perform data exploration, data munging, and model training) and continue that in the cloud. But with the added functionality of creating workflows/DAGs around them to make their projects repeatable/schedulable.
We were also keen to avoid requiring people to learn tools like Kubernetes or specific cloud provider services like AWS Step Functions. That felt too "low level" and an abstraction we should steer data folks clear from instead of forcing them to become more familiar with underlying implementation details of cloud/DevOps.
I love your UI! It's very rare to see this in products for developers. What does the process for coming up with UI design look like? Do you do user research and usability testing?
I have to credit Nick Post here. He is our Product Designer who joined us from GitLab. He and Juan Luis, our dev advocate, wrote about the design process here.
In short, have great ideas (honed over many years of decomposing software products) and talk to users continuously.
How do you use the feedback of users to improve your tool?
We obsess over everything that gets posted to GitHub issues.
We try to find common patterns for prioritization. We give additional priority to what gets asked most often by multiple people.
What can help in some cases is getting additional context by reaching out to users and having a long conversation through chat or over a video call. Sometimes they ask for a faster horse, whereas they would benefit more from a car.
You did a demo of Orchest some time ago. What has changed since then?
It was nine months ago. Time flies! In the meantime, what has changed:
- K8s has been shipped and is running in PROD for many of our users. More scalable and easier to deploy in your virtual private cloud.
- We released our completely revamped UI, resulting in a much more intuitive workflow (see attached screenshots for old/new).
- We shipped a feature introducing JSONForms as a way to parameterize pipeline steps, using a proper schema to type the parameterization of a pipeline and its steps (this was an OSS contribution by Andrej Zieger!).
- We shipped many, many more product improvements.
I would summarize it as we've "matured" from the vision to a functional product implementation that more and more data teams depend on in production.
Orchest takes a different approach to building pipelines than other tools like Airflow, whose core concept is a DAG. What are the limitations of existing tools, in your opinion? What is the key difference in how you approach it at Orchest?
I'll try to make this as short as I can:
- We have a UI that allows editing (pipeline and code), not just viewing. That results in a drastically different product experience.
- Our approach is "ship the backend," as the tool allows you to productionize running the DAG/pipeline without requiring any setup/configuration on the user's end.
- We're language agnostic (not Python-specific) by focusing on containers as the primary task encapsulation. This makes a bunch of stuff simpler & more robust, like installing a system dependency for a SQL Python connector or distributing tasks over multiple nodes for scaling up (if you run multiple data pipelines).
Launching a startup
How and why did you decide to do your own business? What would you suggest to others who are thinking of taking this leap? Considering that you worked before as a Data Science intern and Software engineer.
Fun question! Starting a business is something I've aspired to do since I was a young kid. The earliest "business" idea I can think of is wanting to start a supermarket delivery service like Instacart and building a website for it when I was like 11. I served a total of about four customers, at which point my friends I was doing it with decided they would do something else :)
Another big part of this answer is that I'm a big fan of creating something. You could call me a "maker" in the sense that I enjoy creating software projects that do something cool or useful. In the beginning, I wanted to learn how something worked or was simply curious about it. Over time I've been steering projects more towards those where I have a strong suspicion (based on various kinds of evidence) that things I'm making are valuable to someone. In this vein, I enjoyed making Grid studio because it was all about learning, and it turned out that many, many people in the world were also curious about combining Python with spreadsheets.
In addition, I've always had a sense of urgency (in the sense that all our time is precious). While doing work for others, I noticed that I could absolutely not grok working on things that I thought were not logical or impactful. E.g., if my "boss" asked me to implement a feature that I thought was silly or that no one needed, I could barely muster any motivation to work on it. I'm pretty black/white in that sense. If I see the logic behind something and get excited about it because of the potential an idea has, I work obsessively on it until it's done. I'm either full throttle or not doing it at all. In most companies, that kind of attitude doesn't work well since what you end up working on is a complicated result of many stakeholders/opinions. I deeply enjoy working super duper hard on something and seeing it come alive/reaching a result. I figured that to get more of that, I need to be in a position where I can make the call to change what I'm doing if I think what I'm currently doing no longer makes sense.
This answer is already getting pretty long, but I would be remiss if I didn't mention that the potential for capturing the upside of your work is important to me too. I come from modest means, and being able to set up myself and my future family for success and freedom is also part of what attracts me to entrepreneurship.
I think a particular essay from Paul Graham is a great read for the final two points. He talks about having leverage and the associated reward.
Want to join the next AMA session?
Join our Discord community! Connect with maintainers, ask questions, and join AMAs with ML experts.