What happens when you tweak your prompt, switch model versions, or update the toolchain for your LLM agent – will the answers get better or worse?
You wouldn't merge backend code without running tests. You shouldn't ship LLM code or prompt changes without validating output quality, either.
Now you don’t have to.
We just released a GitHub Action that lets you automatically test your LLM application outputs – every time you push code. It runs as part of your CI workflow, using the Evidently open-source library and (optionally) Evidently Cloud.
Let’s walk through what it does and how to use it.
Want to try it right away? Check the example repo for CI/CD testing of an LLM agent. You can also find the action on GitHub Marketplace.
Developing LLM apps means constant iteration: you tweak prompts, switch model versions, and update the toolchain.
But even tiny changes can produce regressions: less helpful responses, shorter or longer completions, or weird tone shifts. And they’re often silent – your code checks pass, but your LLM behavior changes.
By running tests on your LLM or agent's outputs – not just your functions – you can catch these changes early.
Regression testing for LLM apps is one of the key LLM evaluation workflows. In this approach, you run evaluations on a pre-built test dataset to check if your AI system or agent still behaves as expected.
There are two common ways to do this: compare new outputs against approved reference answers, or evaluate each output directly against defined quality criteria.
Think of it as unit testing – but for your LLM system’s behavior. And because language models are non-deterministic and designed to handle diverse, open-ended inputs, they’re best evaluated using structured test datasets rather than isolated test cases.
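To make the idea concrete, here is a minimal sketch of the pattern in plain Python – not the Evidently API, just the general shape. The `run_agent` function, the test cases, and the checks are hypothetical placeholders for your own app and criteria.

```python
# A minimal sketch of dataset-based regression testing for an LLM app.
# `run_agent`, the test cases, and the checks are placeholders.

def run_agent(question: str) -> str:
    # Replace with a call to your actual LLM app or agent.
    return f"Placeholder answer to: {question}"

# A small test dataset: inputs plus expectations to check against.
test_cases = [
    {"question": "How do I reset my password?", "must_mention": "reset link"},
    {"question": "What is your refund policy?", "must_mention": "30 days"},
]

failures = []
for case in test_cases:
    answer = run_agent(case["question"])
    # The same checks run on every case, every time the suite runs.
    if not answer.strip():
        failures.append((case["question"], "empty response"))
    elif len(answer) > 2000:
        failures.append((case["question"], "response too long"))
    elif case["must_mention"].lower() not in answer.lower():
        failures.append((case["question"], f"missing '{case['must_mention']}'"))

# A non-zero exit fails the CI job.
if failures:
    raise SystemExit(f"{len(failures)} regression(s) found: {failures}")
```

In practice, simple string checks rarely go far enough – that is where evaluation libraries with text descriptors and model-based checks come in.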
Want to learn more about LLM evals?
Check this gentle introduction to LLM evals and an overview of LLM evaluation metrics and methods. You can also explore our free video courses for AI product teams and developers.
This GitHub Action runs output quality validation and regression checks on your LLM agent or system – right inside your CI workflow.
Here’s what it does on every commit or PR: it runs your configured evaluations on the test dataset, applies your pass/fail conditions, and reports the result.
You’ll see a test result in your GitHub Checks. Optionally, results are saved to Evidently Cloud for easier monitoring, debugging, and collaboration.
On the backend, this action calls the open-source Evidently Python library for LLM testing. The GitHub Action wraps the Evidently CLI and takes in a simple config file.
You define the test dataset to run on, the evaluations to apply, and the pass/fail conditions.
If any test fails, the CI job fails. You can view full reports to explore the outcomes.
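For reference, here is roughly what such a check looks like when written directly against the Evidently Python library. This is a hedged sketch based on the 0.4.x-style `TestSuite` API – import paths, class names, and the result format may differ in the Evidently version you install, so treat it as illustrative rather than the exact code the action runs.

```python
import pandas as pd

# Evidently 0.4.x-style imports; exact paths and names may differ
# in other versions of the library.
from evidently.descriptors import TextLength, Sentiment
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMin, TestColumnValueMean

# Agent outputs collected on the test dataset.
data = pd.DataFrame({
    "question": ["How do I reset my password?", "What is your refund policy?"],
    "response": ["Click the reset link we email you.", "Refunds are available within 30 days."],
})

suite = TestSuite(tests=[
    # Responses should not be empty.
    TestColumnValueMin(column_name=TextLength().on("response"), gt=0),
    # Responses should not drift negative in tone on average.
    TestColumnValueMean(column_name=Sentiment().on("response"), gte=0),
])
suite.run(reference_data=None, current_data=data)

# Exit non-zero on failure so the CI job fails too.
if not suite.as_dict()["summary"]["all_passed"]:
    raise SystemExit("LLM regression tests failed")
```

The GitHub Action wraps this kind of run behind a config file, so you describe the dataset, evaluations, and conditions declaratively instead of writing the script yourself.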
Using Evidently Cloud gives you a few extra benefits beyond just pass/fail checks in CI.
Test dataset collaboration. Storing your test datasets in the cloud makes it easier to collaborate with domain experts. They can design, review and expand the test set – for example, by adding new examples based on recent failures.
Debugging. After a test run, especially if some responses are flagged or fail, you can share results for additional manual review.
Tracking and monitoring. Evidently Cloud automatically tracks your test results over time. You get dashboards with pass/fail history across all tested quality dimensions – so you can spot regressions and trends.
We published a ready-to-use example with both local and cloud workflows.
You’ll find everything you need – a sample agent, the evaluation config, and the GitHub workflows.
👉 Run the CI/CD LLM testing example on GitHub.
You already test, lint, and type-check your code. Now you can regression test your LLM system outputs – with every PR.
With Evidently + GitHub Actions, you can automatically check that your latest prompt or model change still does what you expect – right inside CI.
📬 Have questions or want a personalized demo?
Reach out to us or sign up for Evidently Cloud. Let’s make your LLM stack CI-native.