
CI/CD for LLM apps: Run tests with Evidently + GitHub Actions

Last updated: June 23, 2025
Published: June 23, 2025

What happens when you tweak your prompt, switch model versions, or update the toolchain for your LLM agent – will the answers get better or worse?

You wouldn't merge backend code without running tests. You shouldn't ship LLM code or prompt changes without validating output quality, either.

Now you don’t have to.

We just released a GitHub Action that lets you automatically test your LLM application outputs – every time you push code. It runs as part of your CI workflow, using the Evidently open-source library and (optionally) Evidently Cloud.

Let’s walk through what it does and how to use it.

Want to try it right away? Check the example repo for CI/CD testing of an LLM agent. You can also find the action on GitHub Marketplace.

🤖 Why test LLM outputs?

Developing LLM apps means constant iteration. You:

  • Refactor the agent logic
  • Adjust system prompts
  • Swap a model or tool
  • Try a few “quick” fixes…

But even tiny changes can produce regressions: less helpful responses, shorter or longer completions, or weird tone shifts. And they’re often silent – your code checks pass, but your LLM behavior changes.

By running tests on your LLM or agent's outputs – not just your functions – you can catch these changes early. 

LLM regression testing

Regression testing for LLM apps is one of the key LLM evaluation workflows. In this approach, you run evaluations on a pre-built test dataset to check if your AI system or agent still behaves as expected.

There are two common ways to do this:

  • Reference-based evaluations: compare the generated responses against expected ground truth answers.
  • Reference-free evaluations: provide a set of test inputs, then automatically assess specific qualities of the responses – such as helpfulness, tone, correctness, or length.

Think of it as unit testing – but for your LLM system’s behavior. And because language models are non-deterministic and designed to handle diverse, open-ended inputs, they’re best evaluated using structured test datasets rather than isolated test cases.
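
To make the distinction concrete, here is a minimal plain-Python sketch of both approaches. This is not the Evidently API: the `generate_answer` stub and both checks are placeholders standing in for your LLM app and whatever metrics or LLM judges you actually use.

```python
# A minimal sketch of the two evaluation approaches, assuming a hypothetical
# `generate_answer` function that stands in for your LLM app or agent.
# This is NOT the Evidently API - just the underlying idea.

def generate_answer(prompt: str) -> str:
    # Placeholder for a call into your LLM system.
    return "Our support team replies within 24 hours."

# Reference-based: compare the generated response to a ground truth answer.
def reference_based_check(prompt: str, expected: str) -> bool:
    answer = generate_answer(prompt)
    # Naive substring match; in practice you would use an LLM judge
    # or a semantic similarity metric instead of exact matching.
    return expected.lower() in answer.lower()

# Reference-free: assess qualities of the response itself, no ground truth needed.
def reference_free_check(prompt: str) -> bool:
    answer = generate_answer(prompt)
    is_concise = len(answer.split()) <= 100              # length constraint
    stays_on_brand = "competitor" not in answer.lower()  # simple tone/safety proxy
    return is_concise and stays_on_brand

if __name__ == "__main__":
    assert reference_based_check("How fast is support?", "24 hours")
    assert reference_free_check("How fast is support?")
```

In a real test suite, the naive checks above would be replaced by LLM judges or other metrics, run over a whole test dataset rather than a single prompt.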

Reference-based evals: comparing responses against expected ones.
Want to learn more about LLM evals?
Check this gentle introduction to LLM evals and the overview of LLM evaluation metrics and methods. You can also explore our free video courses for AI product teams and developers.

🚦 What the Evidently GitHub action does

This GitHub Action runs output quality validation and regression checks on your LLM agent or system — right inside your CI workflow. 

Here’s what it does on every commit or PR:

  • Downloads a predefined dataset of test prompts.
  • Runs your agent or LLM system against those inputs.
  • Evaluates the generated responses (using LLM judges, Python functions, or other applicable metrics such as classification precision or recall).
  • Generates a Test Suite report with pass/fail results based on your defined thresholds.
  • Fails the CI workflow if any test fails.
  • Produces evaluation artifacts – a detailed report and a scored version of the dataset.

You’ll see a test result in your GitHub Checks. Optionally, results are saved to Evidently Cloud for easier monitoring, debugging, and collaboration.
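
To get a feel for what the action automates, here is a rough, hand-rolled sketch of the same loop. Everything in it is illustrative: the `test_prompts.csv` file, its `question` column, the `run_agent` stub, and the checks are placeholders, and the real action is driven by the Evidently CLI and your config rather than a script like this.

```python
# Illustrative sketch of the CI loop: run test prompts through the agent,
# score the responses, save artifacts, and fail the job on any failed check.
import csv
import json
import sys

def run_agent(prompt: str) -> str:
    # Placeholder: call your LLM agent or application here.
    return "stub response from the agent"

def score(response: str) -> dict:
    # Placeholder checks; in the real workflow these could be LLM judges,
    # Python functions, or classification metrics with thresholds.
    return {
        "non_empty": bool(response.strip()),
        "under_200_words": len(response.split()) <= 200,
    }

rows, any_failed = [], False
with open("test_prompts.csv") as f:  # hypothetical test dataset with a "question" column
    for row in csv.DictReader(f):
        response = run_agent(row["question"])
        checks = score(response)
        rows.append({**row, "response": response, **checks})
        any_failed = any_failed or not all(checks.values())

with open("results.json", "w") as f:  # evaluation artifact: the scored dataset
    json.dump(rows, f, indent=2)

sys.exit(1 if any_failed else 0)  # a non-zero exit code fails the CI job
```

The non-zero exit code is what turns the PR check red; the saved artifacts are what you open to understand why.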

Evidently GitHub Action

⚙️ Under the hood

On the backend, this action calls the open-source Evidently Python library for LLM testing. The GitHub Action wraps the Evidently CLI and takes in a simple config file.

You define:

  • Your test dataset (a local file or a dataset stored in Evidently Cloud).
  • A Python or JSON config that specifies:
    • A function that runs your test inputs against your agent.
    • An evaluation schema with the specific checks to run.
    • A pass/fail condition for each check.
  • An output path for saving results.

If any test fails, the CI job fails. You can view full reports to explore the outcomes. 
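
For intuition, the kind of information such a config captures might look like the sketch below, written as a plain Python dict. Every key, value, and the `ask_agent` helper are hypothetical, not the actual schema; the real config format is shown in the example repo.

```python
# Illustrative only: the information a test config typically captures,
# expressed as a plain Python dict. Names and values are hypothetical.

def ask_agent(question: str) -> str:
    """Maps one test input to your agent's response."""
    # Placeholder for the call into your agent.
    return "stub answer"

CONFIG = {
    # Test dataset: a local file or a dataset stored in Evidently Cloud.
    "dataset": "data/test_prompts.csv",
    # The function that runs each test input against your agent.
    "generate": ask_agent,
    # Evaluation schema: which checks to run and the pass/fail condition for each.
    "checks": [
        {"name": "response_length", "max_words": 200},
        {"name": "helpfulness_judge", "min_share_passed": 0.9},
    ],
    # Where to save the report and the scored dataset.
    "output": "reports/",
}
```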

Using Evidently Cloud gives you a few extra benefits beyond just pass/fail checks in CI.

Test dataset collaboration. Storing your test datasets in the cloud makes it easier to collaborate with domain experts. They can design, review and expand the test set – for example, by adding new examples based on recent failures.

Example: synthetic ground truth test data generated with Evidently Cloud.

Debugging. After a test run, especially if some responses are flagged or fail, you can share results for additional manual review.

Example: reviewing results of adversarial tests on brand safety.

Tracking and monitoring. Evidently Cloud automatically tracks your test results over time. You get dashboards with pass/fail history across all tested quality dimensions – so you can spot regressions and trends.

LLM evals dashboard with Evidently Cloud

🚀 Get started 

We published a ready-to-use example with both local and cloud workflows.
You’ll find everything you need – a sample agent, the evaluation config, and the GitHub workflows.

👉 Run the CI/CD LLM testing example on GitHub.

Recap

You already test, lint, and type-check your code. Now you can regression test your LLM system outputs – with every PR.

With Evidently + GitHub Actions, you can automatically check that your latest prompt or model change still does what you expect – right inside CI.

📬 Have questions or want a personalized demo? Reach out to us or sign up for Evidently Cloud. Let’s make your LLM stack CI-native.
