If you're building a real-world AI product, such as a RAG system, a chatbot, or a summarization tool, you need a way to know whether it's working well. That means going beyond model leaderboards and setting up a way to test your system's behavior, quality, and safety.
In this guide, we break down what it means to design an LLM evaluation framework for your AI application and introduce Evidently, an open-source LLM evaluation tool.
⭐ Want to support our work? Star Evidently on GitHub.
This guide is split into three parts:
🎥 If you prefer to learn from videos, we also have two free courses on LLM evaluation.
When people talk about LLM evaluation, they often mean one of two things:
When you're building an AI product, be it a RAG chatbot or an AI agent, it's the second kind that matters. You need to know how well your system works for your use case.
Why do you need this LLM evaluation process?
The obvious answer is to ensure quality: LLM systems are complex, non-deterministic, and often handle open-ended tasks such as generating content or maintaining conversations. You need reliable ways to measure whether the outputs are accurate and useful.
On top of that, you need to manage AI risks. LLM systems have unique failure modes, from hallucinations to prompt injections to policy violations. A solid evaluation process helps you detect and mitigate these failures early.
The hard truth is that there is no single number that tells you whether your LLM app "works." And evaluation is not some magical process that resolves this automatically.
Instead, LLM evaluations are a set of processes that occur throughout the AI product lifecycle and help you manage system quality. You may run evaluation workflows at different stages, each with its own goals:
As you can see from this list, one key distinction among these workflows is when the evaluation happens.
These differences also guide the evaluation methods you can apply. With offline evaluations, you can use reference-based methods, where you compare model responses to known ideal answers. For online evaluations, you rely on reference-free methods, assessing responses as they are generated.
A strong LLM evaluation framework supports these workflows and lets you reuse the same scorers, criteria, and test datasets effectively.
Ultimately, your goal is to connect them into a continuous improvement system, one that reduces risk and helps make your LLM system better with every release.
For example, if you discover new failure modes during production, you can add them to pre-release regression tests to ensure they are not reintroduced.
📖 Check the introduction to LLM evals to better understand each of these workflows.
Now let's take a closer look at the methods you can use to perform these evaluations.
When you're just getting started, it's natural to begin with manual ad hoc checks. You test a few prompts, read the outputs, and decide if they "feel right."
That's not a bad thing: this initial review helps build intuition and clarify what kinds of outputs you want to see. However, this is not repeatable or structured.
Manual labeling is a step further. It involves assigning human reviewers, often domain experts, to read LLM system outputs and score them as pass/fail or using a structured rubric. This is especially valuable early on: it helps define what "good" looks like, reveal common failure modes, and shape your evaluation criteria.
However, manual evals are hard to scale.
Automated evaluation lets you score the model outputs programmatically using a variety of methods. It's a way to scale manual labeling once you've defined clear criteria or built reusable evaluation datasets.
Automated methods generally fall into two main categories we've already introduced:
There are many LLM evaluation methods and metrics that fit into these workflows. For example, in reference-based evaluations against a ground truth answer, you can use:
For tasks like classification, you can also apply standard dataset-level metrics such as accuracy, precision, or recall. For retrieval, you can use ranking metrics like NDCG or MRR (commonly part of RAG system evaluation).
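For instance, MRR rewards retrievers that place the first relevant document near the top of the results list. A minimal sketch in plain Python (the function names and sample data are illustrative):

def reciprocal_rank(retrieved_ids, relevant_ids):
    # 1/rank of the first relevant document, or 0.0 if none was retrieved
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(all_retrieved, all_relevant):
    # Average the per-query reciprocal ranks across the test set
    scores = [reciprocal_rank(r, rel) for r, rel in zip(all_retrieved, all_relevant)]
    return sum(scores) / len(scores)

# Two test queries: doc1 is relevant for the first, doc9 for the second
print(mean_reciprocal_rank(
    all_retrieved=[["doc3", "doc1", "doc7"], ["doc5", "doc2"]],
    all_relevant=[{"doc1"}, {"doc9"}],
))  # (1/2 + 0) / 2 = 0.25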
For reference-free evaluations, you can use:
📖 Check the in-depth guide on different LLM evaluation metrics and methods.
In practice, most teams combine both types of methods, even in offline evaluation. You apply reference-based evaluation wherever you can build a reliable ground truth dataset, and add reference-free methods for aspects such as style, structure, or safety.
Among all evaluation methods, one of the most popular is LLM-as-a-judge. This approach uses an LLM to evaluate your system's output, much like a human reviewer would. But instead of asking a person, you write an evaluation prompt for the LLM.
The prompt defines your evaluation criteria (e.g., "Was the response helpful?" or "Did the output contradict the source?") and asks the evaluator model to return a score or label.
For example, you can use LLM judges to:
This approach is especially useful for evaluating open-ended or subjective qualities, which is the case for many, if not most, LLM applications. In fact, LLM judges are often the only practical alternative to slow and expensive human evaluation.
LLM-as-a-judge works in both reference-free and reference-based modes.
It is also highly flexible. For instance, when comparing an LLM response to a ground truth answer, you can define your own notion of "correctness," such as separating factual accuracy from style or tone alignment.
The Evidently open-source library supports fully configurable LLM judges as part of its evaluation suite, with examples, prompt templates, and even automated prompt tuning.
It's important to remember that LLM-as-a-judge is not a metric; it's an evaluation method. You can think of this approach as a way to automate human labeling. And just like manual labeling, its success depends on how well you define the instructions.
To get useful results, you need to:
LLM judges can be highly effective, but only when the criteria are clear and the setup is intentional. To delegate labeling to an LLM, you must first act as the judge yourself and establish the expectations for evaluation.
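To make this concrete, here is a minimal sketch of a reference-free judge implemented directly with the OpenAI Python client, outside any framework. The rubric, model name, and helper function are illustrative assumptions, not a prescribed setup:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

JUDGE_PROMPT = """You are evaluating a support chatbot response.
The response is HELPFUL if it directly addresses the user's question with actionable information.
Otherwise it is UNHELPFUL. Return only one label: HELPFUL or UNHELPFUL.

Question: {question}
Response: {response}"""

def judge_helpfulness(question: str, response: str) -> str:
    # Ask the evaluator model to apply the rubric and return a single label
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed evaluator model; use whichever you prefer
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    return completion.choices[0].message.content.strip()

label = judge_helpfulness(
    "How do I reset my password?",
    "Go to Settings > Security and click 'Reset password'.",
)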
For more information:
📖 Read the full LLM-as-a-Judge guide.
🎥 Watch a code tutorial walkthrough on building a custom LLM judge and aligning it with human labels.
Now that we've covered some of the core LLM evaluation principles, let's look at what it takes to design an evaluation setup tailored to your product.
When we talk about an "LLM evaluation framework," the term can mean two different things:
Let's start with the second meaning: how to design a custom evaluation framework for your own application.
At a practical level, such a framework is typically a combination of:
Together, these define how you measure quality and detect issues in your LLM system.
The first component is the evaluation dataset, which should cover a range of scenarios and expected user behaviors.
This is a core idea in LLM evaluations. Unlike traditional software testing, where you can rely on isolated unit tests, LLM evaluations require multiple test examples for each dimension you want to measure. Because LLMs are non-deterministic and handle a wide variety of open-ended inputs, running a single prompt through your system won't tell you much.
You can build these datasets manually, generate synthetic examples, or curate them from past interactions and production logs. A dataset typically includes:
You will often need multiple test datasets, for example:
These datasets are also not set in stone: they will evolve over time as you uncover new user scenarios or failure modes.
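For illustration, a test dataset is often just a table of inputs, optional reference answers, and metadata you can slice by. A minimal sketch in pandas (the column names are arbitrary):

import pandas as pd

test_cases = pd.DataFrame([
    {"question": "How do I cancel my subscription?",
     "reference_answer": "Go to Account > Billing and choose Cancel plan.",
     "category": "billing"},
    {"question": "What do you think about Competitor X?",
     "reference_answer": None,  # adversarial case: no ground truth, scored by a safety evaluator instead
     "category": "brand_safety"},
])
test_cases.to_csv("test_dataset.csv", index=False)  # keep it versioned alongside your evals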
📖 Read the guide to evaluation datasets.
Each test dataset can be paired with one or more evaluators to assess your quality criteria.
To make evaluation work, you need to define clear success criteria to apply to the generated outputs. These should align with specific test scenarios or production monitoring needs, and reflect your product goals, user expectations, and risk boundaries.
Simply put: your criteria are unique to your product, and defining them well is half the battle.
A helpful way to approach this is by asking two guiding questions:
As you define your criteria, there are a number of common failure modes to avoid.
Relying on generic evaluations. It's tempting to start with ready-made metrics you can borrow from various libraries and apply to your product. For example, tracking "conciseness" might sound useful at first. But if conciseness isn't actually important in your product, or if your understanding of conciseness differs, the metric won't capture what matters. Even when a criterion is relevant, you'll often need to rewrite the evaluation prompt to make it meaningful.
While built-in metrics can help as a starting point, always approach them critically: evaluation should fit your use case and correlate with human judgment.
Defining criteria that are too broad. Another common trap is choosing evaluators that are too vague, such as "usefulness." What does that really mean? You need to break it down into more specific parts.
A good test is whether someone unfamiliar with your project could apply the criteria to the same inputs and reach the same conclusions. If another person can't judge "usefulness" the way you do, your criteria are probably too vague, or they rely on domain knowledge you haven't made explicit.
The best evaluators are both context-aware and discriminative: they highlight meaningful differences. If an evaluation consistently returns "pass," it isn't giving you anything actionable.
Trying to design a single judge. There's no need to reduce everything to one "good" vs. "bad" label. In practice, you'll often need multiple evaluators, each focused on a different quality or risk dimension. Some only make sense in specific tests. You might also use one set of judges during testing and another in production.
Defining everything in theory. Finally, don't design criteria in isolation. You need to examine real outputs and spot actual failure patterns. Evaluation should be grounded in what happens in your product and the errors you observe, not just in theoretical expectations.
Let's walk through a specific evaluation example.
Say you're building a support chatbot powered by RAG. By analyzing expected capabilities and risks, you might end up with a shortlist of quality criteria you want to assess automatically.
Once you've defined the criteria, the next step is to implement the tests. Let's see how this works for two dimensions: brand safety and answer correctness.
Brand safety. Here, you may want to ensure your chatbot avoids making critical comments about your company or commenting on competitors. This is a risk-focused evaluation.
To run such a test, you'd first create a custom dataset designed to probe this behavior. For example, you could write user questions that mention competitors, ask about product flaws, or reference company controversies. (Synthetic data can speed this up.)
You'd then attach a brand safety evaluator: an LLM-as-a-judge prompt crafted to assess whether the responses to these test prompts are safe.
Here's a simplified example using Evidently:
# Import path for the judge template may vary by Evidently version
from evidently.llm.templates import BinaryClassificationPromptTemplate

brand_safety = BinaryClassificationPromptTemplate(
    pre_messages=[("system", "You are a judge evaluating responses from a company chatbot.")],
    criteria="""The response is SAFE when:
- The chatbot avoids or deflects negative or critical discussion about the company, its products, or decisions.
- It avoids commenting about competitors' offerings or making unsubstantiated comparisons.
The response is UNSAFE when:
- The chatbot makes or agrees with negative or critical remarks about the company or its products.
- It acknowledges or amplifies user complaints without proper framing or mitigation.
- It discusses, praises, or criticizes competitors and their offerings.
- It speculates negatively about the company, its performance, decisions, or market standing.
""",
    target_category="SAFE",
    non_target_category="UNSAFE",
    uncertainty="unknown",
    include_reasoning=True,
)
To run the test, you would generate chatbot responses to these probing questions, apply the evaluator to each response, and review the results.
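For example, wiring this together might look like the sketch below, assuming a recent Evidently version: you score each response with the brand_safety judge and summarize the labels. The import paths, the provider and model arguments, and the preset name are assumptions that may differ across releases, so check the docs for the version you use.

import pandas as pd
from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import LLMEval
from evidently.presets import TextEvals

# Responses your chatbot produced for the brand safety probing questions
df = pd.DataFrame({
    "question": ["Is your product worse than Competitor X?"],
    "response": ["I can help with questions about our own plans and features."],
})

eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(),
    descriptors=[
        # Attach the judge defined above; the provider and model here are assumptions
        LLMEval("response", template=brand_safety, provider="openai", model="gpt-4o-mini", alias="Brand safety"),
    ],
)

report = Report([TextEvals()])
my_eval = report.run(eval_dataset, None)  # summary of SAFE vs. UNSAFE labels across the test set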
You can implement this as a simple script within your experiments or regression tests. Here's an example of a brand safety test result:
It's worth noting that this exact kind of evaluator wouldn't be used for all production inputs, but only in that specific testing scenario. That's the point: effective evaluation frameworks are modular and context-aware.
Let's take a look at another example.
Answer correctness. Correctness is a common evaluation goal, especially in RAG-based systems.
Here, you begin with a ground truth dataset: a set of questions paired with correct answers. Then you define a way to measure the match between the ideal answers and the answers you will get. For open-ended tasks, off-the-shelf metrics like BLEU or ROUGE rarely align well with human judgments, so a custom LLM judge is usually the better choice.
Here's a multiclass example using Evidently:
# Import path for the judge template may vary by Evidently version
from evidently.llm.templates import MulticlassClassificationPromptTemplate

correctness_multiclass = MulticlassClassificationPromptTemplate(
    pre_messages=[("system", "You are a judge that evaluates the factual alignment of two texts.")],
    criteria="""You are given a new answer and a reference answer. Classify the new answer based on how well it aligns with the reference.
===
Reference: {reference_answer}""",
    category_criteria={
        "fully_correct": "The answer conveys the same factual and semantic meaning as the reference, even if it uses different wording or phrasing.",
        "correct_but_incomplete": "The answer is factually consistent with the reference but omits key facts or details present in the reference.",
        "is_different": "The answer does not contradict the reference but introduces new, unverifiable claims or diverges substantially in content.",
        "is_contradictory": "The answer directly contradicts specific facts or meanings found in the reference.",
    },
    uncertainty="unknown",
    include_reasoning=True,
)
You'll likely want to tweak this evaluation prompt for your own use case. Sometimes you'll need strict, exact matches. Other times, it's more about checking formatting or tone match, or calling out a specific kind of mistake.
Here is an example of the evaluation result.
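To apply this judge, the column with the new answer goes to the evaluator, and the reference answer column has to be passed in so it can fill the {reference_answer} placeholder. Here is a sketch of how this might look with Evidently's LLMEval descriptor; the additional_columns mapping and the other parameters are assumptions to verify against the docs for your version.

from evidently.descriptors import LLMEval  # import path may differ by Evidently version

correctness = LLMEval(
    "response",                         # column with the new answer to grade
    template=correctness_multiclass,
    provider="openai",
    model="gpt-4o-mini",                # assumed evaluator model
    alias="Answer correctness",
    # Assumed way to map a dataset column to the {reference_answer} placeholder;
    # check the LLM-as-a-judge docs for the exact parameter in your version.
    additional_columns={"reference_answer": "reference_answer"},
)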
As you build on this idea, you can design a complete LLM evaluation framework for your system, consisting of multiple datasets and attached evaluators.
For illustration, here's a possible setup for a RAG-based support chatbot. It includes seven separate test datasets and multiple evaluators. Some datasets target specific edge cases, such as handling foreign language queries.
You can combine these multiple tests into a structured evaluation or regression testing suite, running at each release to ensure updates don't break core behavior.
Of course, this setup is just an illustration. You should design your evaluation framework around your product, risks, and observed failures. For example, you might instead focus on:
A good evaluation suite is also never static. It grows with your product: you can continue probing for new risks and add test cases based on real production data.
For live monitoring, you'll use a different set of evaluators than in offline testing. Inputs here are real user interactions, not predefined prompts, and you won't have reference answers.
These evaluations are often applied at the conversation level, treating the full transcript as input and evaluating the final outcome or overall behavior.
Here's an example of an online evaluation setup for a RAG chatbot:
This type of live evaluation helps you:
Online evaluation is often closer to product analytics than to software monitoring. It's not just for catching issues: it also helps you surface examples for review and understand how your LLM system behaves in the real world.
Defining what you are evaluating is often the hardest part. Here are three practical approaches you can combine to shape your evaluation design:
1. Top-down risk assessment. Here, you start by analyzing possible failure modes in the context of your application. This is especially important for systems in sensitive domains such as healthcare, finance, or legal â or for public-facing products that handle customer data.
Thinking through what might go wrong helps you design risk-focused evaluations and adversarial tests. While you may not see these high-risk cases in your initial test logs, that doesnât mean you can ignore them.
For example, if you're building a customer-facing app in a sensitive domain, you'll need to consider how vulnerable users might interact with your system. From there, you can design test cases that simulate those situations and attach evaluators to flag risky outputs.
2. Task-driven schemas. Sometimes your use case naturally defines what you should evaluate. For example:
While each system will require preparing a custom test dataset, your starting evaluation schema can often be shaped by the task itself.
3. Bottom-up: observed failure modes. This is often the most useful, but also the most labor-intensive approach. It requires:
This bottom-up process often uncovers issues you wouldn't have anticipated initially: vague responses, non-actionable advice, verbose rambling, or inconsistent tone.
To do it well, you'll need a reasonable volume of test data, and a process for reviewing and converting observations into tests. If real user data is limited, synthetic datasets can help: you can generate diverse input queries, run them through your app, and analyze the outputs to identify meaningful failure patterns.
Evidently also supports this process with a built-in synthetic data generation feature, where you can configure user profiles and use cases to create tailored test datasets. Read more about it here: Synthetic data generator in Python.
Combining these three approaches (risk-first, task-driven, and error-driven) helps you design an evaluation framework that's both robust and grounded in reality. And it's not something you do once and forget: your evaluation framework will evolve alongside your product.
While setting this up might require investment, it's also one of the most important parts of making sure your AI system truly delivers. You can think of it as your AI product's quality spec, captured in tests and datasets. If you're building with LLMs, this becomes your real moat: not just the model you pick, but the system you build around it. Done well, it lets you ship quickly and with confidence, build user trust, and keep improving over time.
You've seen how LLM evaluation requires a flexible mix of test datasets, evaluators, and methods, tailored to your product. Now the question is: how do you actually implement this in code?
Evidently is an open-source LLM evaluation framework with over 30 million downloads that helps you:
Evidently is available as a Python library, so you can use it in notebooks, batch jobs, or CI systems.
We also offer Evidently Cloud, a hosted platform that makes it easier to manage and collaborate on evaluations at scale, including no-code workflows for domain experts and visual synthetic data generation for test design.
Let's first take a look at the open-source library.
Evidently introduces a few core concepts to structure and implement your evaluation logic.
Datasets. Everything starts with a dataset: a table of inputs and outputs from your LLM system. This could be:
Evidently is designed to work with rich, flexible datasets: you can include prompt metadata, system context, retrieved chunks, and user types, all in the same table.
Once you have a dataset, you can run evaluations.
Descriptors. In Evidently, a descriptor is the core building block for evaluating LLM outputs.
It's a function that processes one row of your dataset, typically a single LLM response, and returns a score or test result using any of the 100+ built-in evaluation methods, from LLM judges to ML model scorers.
A descriptor's output could be:
You can apply multiple descriptors at once, and they can work with any part of your dataset: the model response, input query, retrieved documents, metadata, or combinations of these. For example, you might define a descriptor that scores the relevance of a response to the original query, or one that flags specific brand mentions in user questions.
Here are some examples of deterministic and ML-based descriptors.
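Concretely, creating a dataset and attaching a few descriptors might look like the sketch below, assuming a recent Evidently version (import paths and descriptor parameters may differ across releases):

import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains

df = pd.DataFrame({
    "question": ["How do I export my data?", "Tell me about Competitor X."],
    "response": ["You can export your data from Settings > Export.",
                 "I can only help with questions about our own product."],
})

eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(),
    descriptors=[
        TextLength("response", alias="Length"),                                     # deterministic
        Sentiment("response", alias="Sentiment"),                                   # ML-based
        Contains("response", items=["Competitor X"], alias="Mentions competitor"),  # rule-based
    ],
)
eval_dataset.as_dataframe()  # original columns plus one new column per descriptor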
Or, LLM-based descriptors created with LLM judges, such as this hallucination check.
Built-in descriptors and templates. Evidently includes many ready-to-use descriptors. These are pre-implemented and can be used directly without configuration. For example:
In addition, Evidently provides descriptor templates that let you define new evaluators by simply filling in parameters such as keyword lists, model names, or LLM judge rubrics. You can customize them without boilerplate code, and the outputs are conveniently parsed into columns. Examples:
Templates give you flexibility while keeping your setup modular and reusable.
You can also define your own descriptors using Python. This gives you full control to build evaluators that match your product logic.
Descriptor tests. Once you define descriptors, you can optionally attach tests to them. By default, descriptors return raw values, like scores or labels, that you can inspect or analyze. But when you want to enforce rules or get a clear signal, you can add logical conditions that apply to each descriptor and, optionally, a combined rule for each row.
For instance, you might want to pass a response only if:
This gives you a clear, deterministic pass/fail outcome at the row level â useful for debugging, test coverage, or regression testing.
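Here is a sketch of what this could look like, assuming a recent Evidently version where descriptors accept a tests argument and condition helpers such as lte and gte live in evidently.tests (verify both against the docs for your release):

import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import TextLength, Sentiment
from evidently.tests import lte, gte

df = pd.DataFrame({"response": ["Sure! Go to Settings > Export to download your data."]})

eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(),
    descriptors=[
        TextLength("response", alias="Length", tests=[lte(1500)]),  # fail overly long answers
        Sentiment("response", alias="Sentiment", tests=[gte(0)]),   # fail negative-sounding answers
    ],
)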
Reports. Once you've computed row-level descriptors, you can generate a report to analyze the results at the dataset level.
This gives you a high-level view of what's going on across the full evaluation set. For descriptors, reports typically include summary statistics and distributions. For example, you might see the minimum, maximum, or average of a numerical score, or the share of responses labeled as "correct" or "incorrect."
If you formulated any specific test conditions, you will also be able to see how many passed or failed. These kinds of tests are especially useful in CI/CD pipelines, where you want a clear result to decide whether something can be merged or deployed. You can run them automatically on every update using GitHub Actions or other tools.
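A sketch of the reporting step, again assuming a recent Evidently version (method and parameter names such as include_tests, save_html, and dict are based on recent releases and may differ in yours):

from evidently import Report
from evidently.presets import TextEvals

report = Report([TextEvals()], include_tests=True)  # include_tests adds pass/fail summaries for descriptor tests
my_eval = report.run(eval_dataset, None)            # second argument: an optional reference dataset
my_eval.save_html("llm_eval_report.html")           # share or archive the run
results = my_eval.dict()                            # machine-readable output you can assert on in CI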
Where else can you use Evidently?
The evaluation workflow described above is core to the Evidently library. It can be used across different stages of your product development: from experimental comparisons to live tests on production outputs.
You can also use Evidently to evaluate different types of LLM apps, from simple summarization or classification tasks to RAG systems and AI agents. It supports both quality-focused evals and risk and adversarial behavior testing.
Since Evidently is an open-source tool, it is highly configurable. It doesn't lock you into a specific setup; instead, it gives you reusable components (like descriptors, tests, and templates) that you can combine into your own evaluation workflows. Reports can be exported as HTML, logged into external systems, or reviewed directly in Python.
You can run Evidently in several ways:
If you want to scale up, Evidently Cloud gives you the full platform experience. You get:
You can also self-host a minimal open-source version if you prefer to keep everything in your environment.
If you're just getting started, here are a few helpful links:
📖 Evidently LLM Evaluation Quickstart.
🎥 A 3-part video walkthrough on different LLM evaluation methods from our free course for LLM builders. Sign up here.