
CI/CD for LLM apps: Run tests with Evidently + GitHub Actions

Last updated: June 23, 2025
Published: June 23, 2025

What happens when you tweak your prompt, switch model versions, or update the toolchain for your LLM agent – will the answers get better or worse?

You wouldn't merge backend code without running tests. You shouldn't ship LLM code or prompt changes without validating output quality, either.

Now you don’t have to.

We just released a GitHub Action that lets you automatically test your LLM application outputs – every time you push code. It runs as part of your CI workflow, using the Evidently open-source library and (optionally) Evidently Cloud.

Let’s walk through what it does and how to use it.

Want to try it right away? Check the example repo for CI/CD testing of an LLM agent. You can also find the action on GitHub Marketplace.

🤖 Why test LLM outputs?

Developing LLM apps means constant iteration. You:

  • Refactor the agent logic
  • Adjust system prompts
  • Swap a model or tool
  • Try a few “quick” fixes…

But even tiny changes can produce regressions: less helpful responses, shorter or longer completions, or weird tone shifts. And they’re often silent – your code checks pass, but your LLM behavior changes.

By running tests on your LLM or agent's outputs – not just your functions – you can catch these changes early. 

LLM regression testing

Regression testing for LLM apps is one of the key LLM evaluation workflows. In this approach, you run evaluations on a pre-built test dataset to check if your AI system or agent still behaves as expected.

There are two common ways to do this:

  • Reference-based evaluations: compare the generated responses against expected ground truth answers.
  • Reference-free evaluations: provide a set of test inputs, then automatically assess specific qualities of the responses – such as helpfulness, tone, correctness, or length.

Think of it as unit testing – but for your LLM system’s behavior. And because language models are non-deterministic and designed to handle diverse, open-ended inputs, they’re best evaluated using structured test datasets rather than isolated test cases.
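
To make the distinction concrete, here is a minimal plain-Python sketch of both approaches. This is not the Evidently API: the `generate_answer` stub and both checks are placeholders standing in for your LLM app and whatever metrics or LLM judges you actually use.

```python
# A minimal sketch of the two evaluation approaches, assuming a hypothetical
# `generate_answer` function that stands in for your LLM app or agent.
# This is NOT the Evidently API - just the underlying idea.

def generate_answer(prompt: str) -> str:
    # Placeholder for a call into your LLM system.
    return "Our support team replies within 24 hours."

# Reference-based: compare the generated response to a ground truth answer.
def reference_based_check(prompt: str, expected: str) -> bool:
    answer = generate_answer(prompt)
    # Naive substring match; in practice you would use an LLM judge
    # or a semantic similarity metric instead of exact matching.
    return expected.lower() in answer.lower()

# Reference-free: assess qualities of the response itself, no ground truth needed.
def reference_free_check(prompt: str) -> bool:
    answer = generate_answer(prompt)
    is_concise = len(answer.split()) <= 100              # length constraint
    stays_on_brand = "competitor" not in answer.lower()  # simple tone/safety proxy
    return is_concise and stays_on_brand

if __name__ == "__main__":
    assert reference_based_check("How fast is support?", "24 hours")
    assert reference_free_check("How fast is support?")
```

In a real test suite, the naive checks above would be replaced by LLM judges or other metrics, run over a whole test dataset rather than a single prompt.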

Reference-based evals: comparing responses against expected ones.
Want to learn more about LLM evals?
Check this gentle introduction to LLM evals and the overview of LLM evaluation metrics and methods. You can also explore our free video courses for AI product teams and developers.

🚦 What the Evidently GitHub action does

This GitHub Action runs output quality validation and regression checks on your LLM agent or system — right inside your CI workflow. 

Here’s what it does on every commit or PR:

  • Downloads a predefined dataset of test prompts.
  • Runs your agent or LLM system against those inputs.
  • Evaluates the generated responses (using LLM judges, Python functions, or other applicable metrics such as classification precision or recall).
  • Generates a Test Suite report with pass/fail results based on your defined thresholds.
  • Fails the CI workflow if any test fails.
  • Produces evaluation artifacts – a detailed report and a scored version of the dataset.

You’ll see a test result in your GitHub Checks. Optionally, results are saved to Evidently Cloud for easier monitoring, debugging, and collaboration.
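
To get a feel for what the action automates, here is a rough, hand-rolled sketch of the same loop. Everything in it is illustrative: the `test_prompts.csv` file, its `question` column, the `run_agent` stub, and the checks are placeholders, and the real action is driven by the Evidently CLI and your config rather than a script like this.

```python
# Illustrative sketch of the CI loop: run test prompts through the agent,
# score the responses, save artifacts, and fail the job on any failed check.
import csv
import json
import sys

def run_agent(prompt: str) -> str:
    # Placeholder: call your LLM agent or application here.
    return "stub response from the agent"

def score(response: str) -> dict:
    # Placeholder checks; in the real workflow these could be LLM judges,
    # Python functions, or classification metrics with thresholds.
    return {
        "non_empty": bool(response.strip()),
        "under_200_words": len(response.split()) <= 200,
    }

rows, any_failed = [], False
with open("test_prompts.csv") as f:  # hypothetical test dataset with a "question" column
    for row in csv.DictReader(f):
        response = run_agent(row["question"])
        checks = score(response)
        rows.append({**row, "response": response, **checks})
        any_failed = any_failed or not all(checks.values())

with open("results.json", "w") as f:  # evaluation artifact: the scored dataset
    json.dump(rows, f, indent=2)

sys.exit(1 if any_failed else 0)  # a non-zero exit code fails the CI job
```

The non-zero exit code is what turns the PR check red; the saved artifacts are what you open to understand why.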

Evidently GitHub Action

⚙️ Under the hood

On the backend, this action calls the open-source Evidently Python library for LLM testing. The GitHub Action wraps the Evidently CLI and takes in a simple config file.

You define:

  • Your test dataset (a local file or a dataset stored in Evidently Cloud).
  • A Python or JSON config that specifies:
    • A function that runs your test inputs against your agent.
    • An evaluation schema with the specific checks to run.
    • A pass/fail condition for each check.
  • An output path for saving results.

If any test fails, the CI job fails. You can view full reports to explore the outcomes. 
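
For intuition, the kind of information such a config captures might look like the sketch below, written as a plain Python dict. Every key, value, and the `ask_agent` helper are hypothetical, not the actual schema; the real config format is shown in the example repo.

```python
# Illustrative only: the information a test config typically captures,
# expressed as a plain Python dict. Names and values are hypothetical.

def ask_agent(question: str) -> str:
    """Maps one test input to your agent's response."""
    # Placeholder for the call into your agent.
    return "stub answer"

CONFIG = {
    # Test dataset: a local file or a dataset stored in Evidently Cloud.
    "dataset": "data/test_prompts.csv",
    # The function that runs each test input against your agent.
    "generate": ask_agent,
    # Evaluation schema: which checks to run and the pass/fail condition for each.
    "checks": [
        {"name": "response_length", "max_words": 200},
        {"name": "helpfulness_judge", "min_share_passed": 0.9},
    ],
    # Where to save the report and the scored dataset.
    "output": "reports/",
}
```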

Using Evidently Cloud gives you a few extra benefits beyond just pass/fail checks in CI.

Test dataset collaboration. Storing your test datasets in the cloud makes it easier to collaborate with domain experts. They can design, review and expand the test set – for example, by adding new examples based on recent failures.

Example: synthetic ground truth test data generated with Evidently Cloud.

Debugging. After a test run, especially if some responses are flagged or fail, you can share results for additional manual review.

Example: reviewing results of adversarial tests on brand safety.

Tracking and monitoring. Evidently Cloud automatically tracks your test results over time. You get dashboards with pass/fail history across all tested quality dimensions – so you can spot regressions and trends.

LLM evals dashboard with Evidently Cloud

🚀 Get started 

We published a ready-to-use example with both local and cloud workflows.
You’ll find everything you need – a sample agent, the evaluation config, and the GitHub workflows.

👉 Run the CI/CD LLM testing example on GitHub.

Recap

You already test, lint, and type-check your code. Now you can regression test your LLM system outputs – with every PR.

With Evidently + GitHub Actions, you can automatically check that your latest prompt or model change still does what you expect – right inside CI.

📬 Have questions or want a personalized demo? Reach out to us or sign up for Evidently Cloud. Let’s make your LLM stack CI-native.
