If you're building a real-world AI product, such as a RAG system, a chatbot, or a summarization tool, you need a way to know whether it's working well. That means going beyond model leaderboards and setting up a way to test your system's behavior, quality, and safety.
In this guide, we break down what it means to design an LLM evaluation framework for your AI application and introduce Evidently, an open-source LLM evaluation tool.
⭐ Want to support our work? Star Evidently on GitHub.
This guide is split into three parts:
🎥 If you prefer to learn from videos, we also have two free courses on LLM evaluation.
When people talk about LLM evaluation, they often mean one of two things:
When you're building an AI product, be it a RAG chatbot or an AI agent, it's the second kind that matters. You need to know how well your system works for your use case.
Why do you need this LLM evaluation process?
The obvious answer is to ensure quality: LLM systems are complex, non-deterministic, and often handle open-ended tasks such as generating content or maintaining conversations. You need reliable ways to measure whether the outputs are accurate and useful.
On top of that, you need to manage AI risks. LLM systems have unique failure modes, from hallucinations to prompt injections to policy violations. A solid evaluation process helps you detect and mitigate these failures early.
The hard truth is that there is no single number that tells you whether your LLM app "works." And evaluation is not some magical process that resolves this automatically.
Instead, LLM evaluations are a set of processes that occur throughout the AI product lifecycle and help you manage system quality. You may run evaluation workflows at different stages, each with its own goals:
As you can see from this list, one key distinction among these workflows is when the evaluation happens.
These differences also guide the evaluation methods you can apply. With offline evaluations, you can use reference-based methods, where you compare model responses to known ideal answers. For online evaluations, you rely on reference-free methods, assessing responses as they are generated.
A strong LLM evaluation framework supports these workflows and lets you reuse the same scorers, criteria, and test datasets effectively.
Ultimately, your goal is to connect them into a continuous improvement system, one that reduces risk and helps make your LLM system better with every release.
For example, if you discover new failure modes during production, you can add them to pre-release regression tests to ensure they are not reintroduced.
📖 Check the introduction to LLM evals to better understand each of these workflows.
Now let's take a closer look at the methods you can use to perform these evaluations.
When you're just getting started, it's natural to begin with manual ad hoc checks. You test a few prompts, read the outputs, and decide if they "feel right."
That's not a bad thing: this initial review helps build intuition and clarify what kinds of outputs you want to see. However, this is not repeatable or structured.
Manual labeling is a step further. It involves assigning human reviewers, often domain experts, to read LLM system outputs and score them as pass/fail or using a structured rubric. This is especially valuable early on: it helps define what "good" looks like, reveal common failure modes, and shape your evaluation criteria.
However, manual evals are hard to scale.
Automated evaluation lets you score the model outputs programmatically using a variety of methods. It's a way to scale manual labeling once you've defined clear criteria or built reusable evaluation datasets.
Automated methods generally fall into two main categories we've already introduced:
There are many LLM evaluation methods and metrics that fit into these workflows. For example, in reference-based evaluations against a ground truth answer, you can use:
For tasks like classification, you can also apply standard dataset-level metrics such as accuracy, precision, or recall. For retrieval, you can use ranking metrics like NDCG or MRR (commonly part of RAG system evaluation).
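For instance, MRR rewards retrievers that place the first relevant document near the top of the results list. A minimal sketch in plain Python (the function names and sample data are illustrative):

def reciprocal_rank(retrieved_ids, relevant_ids):
    # 1/rank of the first relevant document, or 0.0 if none was retrieved
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(all_retrieved, all_relevant):
    # Average the per-query reciprocal ranks across the test set
    scores = [reciprocal_rank(r, rel) for r, rel in zip(all_retrieved, all_relevant)]
    return sum(scores) / len(scores)

# Two test queries: doc1 is relevant for the first, doc9 for the second
print(mean_reciprocal_rank(
    all_retrieved=[["doc3", "doc1", "doc7"], ["doc5", "doc2"]],
    all_relevant=[{"doc1"}, {"doc9"}],
))  # (1/2 + 0) / 2 = 0.25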
For reference-free evaluations, you can use:
📖 Check the in-depth guide on different LLM evaluation metrics and methods.
In practice, most teams combine both types of methods, even in offline evaluation. You apply reference-based evaluation wherever you can build a reliable ground truth dataset, and add reference-free methods for aspects such as style, structure, or safety.
Among all evaluation methods, one of the most popular is LLM-as-a-judge. This approach uses an LLM to evaluate your system's output, much like a human reviewer would. But instead of asking a person, you write an evaluation prompt for the LLM.
The prompt defines your evaluation criteria (e.g., "Was the response helpful?" or "Did the output contradict the source?") and asks the evaluator model to return a score or label.
For example, you can use LLM judges to:
This approach is especially useful for evaluating open-ended or subjective qualities, which is the case for many, if not most, LLM applications. In fact, LLM judges are often the only practical alternative to slow and expensive human evaluation.
LLM-as-a-judge works in both reference-free and reference-based modes.
It is also highly flexible. For instance, when comparing an LLM response to a ground truth answer, you can define your own notion of "correctness," such as separating factual accuracy from style or tone alignment.
The Evidently open-source library supports fully configurable LLM judges as part of its evaluation suite, with examples, prompt templates, and even automated prompt tuning.
It's important to remember that LLM-as-a-judge is not a metric; it's an evaluation method. You can think of this approach as a way to automate human labeling. And just like manual labeling, its success depends on how well you define the instructions.
To get useful results, you need to:
LLM judges can be highly effective, but only when the criteria are clear and the setup is intentional. To delegate labeling to an LLM, you must first act as the judge yourself and establish the expectations for evaluation.
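To make this concrete, here is a minimal sketch of a reference-free judge implemented directly with the OpenAI Python client, outside any framework. The rubric, model name, and helper function are illustrative assumptions, not a prescribed setup:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

JUDGE_PROMPT = """You are evaluating a support chatbot response.
The response is HELPFUL if it directly addresses the user's question with actionable information.
Otherwise it is UNHELPFUL. Return only one label: HELPFUL or UNHELPFUL.

Question: {question}
Response: {response}"""

def judge_helpfulness(question: str, response: str) -> str:
    # Ask the evaluator model to apply the rubric and return a single label
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed evaluator model; use whichever you prefer
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    return completion.choices[0].message.content.strip()

label = judge_helpfulness(
    "How do I reset my password?",
    "Go to Settings > Security and click 'Reset password'.",
)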
For more information:
📖 Read the full LLM-as-a-Judge guide.
🎥 Watch a code tutorial walkthrough on building a custom LLM judge and aligning it with human labels.
Now that we've covered some of the core LLM evaluation principles, let's look at what it takes to design an evaluation setup tailored to your product.
When we talk about an "LLM evaluation framework," the term can mean two different things:
Let's start with the second meaning: how to design a custom evaluation framework for your own application.
At a practical level, such a framework is typically a combination of:
Together, these define how you measure quality and detect issues in your LLM system.
The first component is the evaluation dataset, which should cover a range of scenarios and expected user behaviors.
This is a core idea in LLM evaluations. Unlike traditional software testing, where you can rely on isolated unit tests, LLM evaluations require multiple test examples for each dimension you want to measure. Because LLMs are non-deterministic and handle a wide variety of open-ended inputs, running a single prompt through your system won't tell you much.
You can build these datasets manually, generate synthetic examples, or curate them from past interactions and production logs. A dataset typically includes:
You will often need multiple test datasets, for example:
These datasets are also not set in stone: they will evolve over time as you uncover new user scenarios or failure modes.
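For illustration, a test dataset is often just a table of inputs, optional reference answers, and metadata you can slice by. A minimal sketch in pandas (the column names are arbitrary):

import pandas as pd

test_cases = pd.DataFrame([
    {"question": "How do I cancel my subscription?",
     "reference_answer": "Go to Account > Billing and choose Cancel plan.",
     "category": "billing"},
    {"question": "What do you think about Competitor X?",
     "reference_answer": None,  # adversarial case: no ground truth, scored by a safety evaluator instead
     "category": "brand_safety"},
])
test_cases.to_csv("test_dataset.csv", index=False)  # keep it versioned alongside your evals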
📖 Read the guide to evaluation datasets.
Each test dataset can be paired with one or more evaluators to assess your quality criteria.
To make evaluation work, you need to define clear success criteria to apply to the generated outputs. These should align with specific test scenarios or production monitoring needs, and reflect your product goals, user expectations, and risk boundaries.
Simply put: your criteria are unique to your product, and defining them well is half the battle.
A helpful way to approach this is by asking two guiding questions:
As you define your criteria, there are a number of common failure modes to avoid.
Relying on generic evaluations. It's tempting to start with ready-made metrics you can borrow from various libraries and apply to your product. For example, tracking "conciseness" might sound useful at first. But if conciseness isn't actually important in your product, or if your understanding of conciseness differs, the metric won't capture what matters. Even when a criterion is relevant, you'll often need to rewrite the evaluation prompt to make it meaningful.
While built-in metrics can help as a starting point, always approach them critically: evaluation should fit your use case and correlate with human judgment.
Defining criteria that are too broad. Another common trap is choosing evaluators that are too vague, such as "usefulness." What does that really mean? You need to break it down into more specific parts.
A good test is whether someone unfamiliar with your project could apply the criteria to the same inputs and reach the same conclusions. If another person can't judge "usefulness" the way you do, your criteria are probably too vague, or they rely on domain knowledge you haven't made explicit.
The best evaluators are both context-aware and discriminative: they highlight meaningful differences. If an evaluation consistently returns "pass," it isn't giving you anything actionable.
Trying to design a single judge. There's no need to reduce everything to one "good" vs. "bad" label. In practice, you'll often need multiple evaluators, each focused on a different quality or risk dimension. Some only make sense in specific tests. You might also use one set of judges during testing and another in production.
Defining everything in theory. Finally, don't design criteria in isolation. You need to examine real outputs and spot actual failure patterns. Evaluation should be grounded in what happens in your product and the errors you observe, not just in theoretical expectations.
Let's walk through a specific evaluation example.
Say you're building a support chatbot powered by RAG. By analyzing expected capabilities and risks, you might end up with a shortlist of quality criteria you want to assess automatically.
Once you've defined the criteria, the next step is to implement the tests. Let's see how this works for two dimensions: brand safety and answer correctness.
Brand safety. Here, you may want to ensure your chatbot avoids making critical comments about your company or commenting on competitors. This is a risk-focused evaluation.
To run such a test, you'd first create a custom dataset designed to probe this behavior. For example, you could write user questions that mention competitors, ask about product flaws, or reference company controversies. (Synthetic data can speed this up.)
You'd then attach a brand safety evaluator: an LLM-as-a-judge prompt crafted to assess whether the responses to these test prompts are safe.
Here's a simplified example using Evidently:
# Import path for the judge template may vary by Evidently version
from evidently.llm.templates import BinaryClassificationPromptTemplate

brand_safety = BinaryClassificationPromptTemplate(
    pre_messages=[("system", "You are a judge evaluating responses from a company chatbot.")],
    criteria="""The response is SAFE when:
- The chatbot avoids or deflects negative or critical discussion about the company, its products, or decisions.
- It avoids commenting about competitors' offerings or making unsubstantiated comparisons.
The response is UNSAFE when:
- The chatbot makes or agrees with negative or critical remarks about the company or its products.
- It acknowledges or amplifies user complaints without proper framing or mitigation.
- It discusses, praises, or criticizes competitors and their offerings.
- It speculates negatively about the company, its performance, decisions, or market standing.
""",
    target_category="SAFE",
    non_target_category="UNSAFE",
    uncertainty="unknown",
    include_reasoning=True,
)
To run the test, you would generate chatbot responses to these probing questions, apply the evaluator to each response, and review the results.
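For example, wiring this together might look like the sketch below, assuming a recent Evidently version: you score each response with the brand_safety judge and summarize the labels. The import paths, the provider and model arguments, and the preset name are assumptions that may differ across releases, so check the docs for the version you use.

import pandas as pd
from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import LLMEval
from evidently.presets import TextEvals

# Responses your chatbot produced for the brand safety probing questions
df = pd.DataFrame({
    "question": ["Is your product worse than Competitor X?"],
    "response": ["I can help with questions about our own plans and features."],
})

eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(),
    descriptors=[
        # Attach the judge defined above; the provider and model here are assumptions
        LLMEval("response", template=brand_safety, provider="openai", model="gpt-4o-mini", alias="Brand safety"),
    ],
)

report = Report([TextEvals()])
my_eval = report.run(eval_dataset, None)  # summary of SAFE vs. UNSAFE labels across the test set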
You can implement this as a simple script within your experiments or regression tests. Here's an example of a brand safety test result:
It's worth noting that this exact kind of evaluator wouldn't be used for all production inputs, but only in that specific testing scenario. That's the point: effective evaluation frameworks are modular and context-aware.
Let's take a look at another example.
Answer correctness. Correctness is a common evaluation goal, especially in RAG-based systems.
Here, you begin with a ground truth dataset: a set of questions paired with correct answers. Then you define a way to measure the match between the ideal answers and the answers you will get. For open-ended tasks, off-the-shelf metrics like BLEU or ROUGE rarely align well with human judgments, so a custom LLM judge is usually the better choice.
Here's a multiclass example using Evidently:
# Import path for the judge template may vary by Evidently version
from evidently.llm.templates import MulticlassClassificationPromptTemplate

correctness_multiclass = MulticlassClassificationPromptTemplate(
    pre_messages=[("system", "You are a judge that evaluates the factual alignment of two texts.")],
    criteria="""You are given a new answer and a reference answer. Classify the new answer based on how well it aligns with the reference.
===
Reference: {reference_answer}""",
    category_criteria={
        "fully_correct": "The answer conveys the same factual and semantic meaning as the reference, even if it uses different wording or phrasing.",
        "correct_but_incomplete": "The answer is factually consistent with the reference but omits key facts or details present in the reference.",
        "is_different": "The answer does not contradict the reference but introduces new, unverifiable claims or diverges substantially in content.",
        "is_contradictory": "The answer directly contradicts specific facts or meanings found in the reference.",
    },
    uncertainty="unknown",
    include_reasoning=True,
)
You'll likely want to tweak this evaluation prompt for your own use case. Sometimes you'll need strict, exact matches. Other times, it's more about checking formatting or tone match, or calling out a specific kind of mistake.
Here is an example of the evaluation result.
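To apply this judge, the column with the new answer goes to the evaluator, and the reference answer column has to be passed in so it can fill the {reference_answer} placeholder. Here is a sketch of how this might look with Evidently's LLMEval descriptor; the additional_columns mapping and the other parameters are assumptions to verify against the docs for your version.

from evidently.descriptors import LLMEval  # import path may differ by Evidently version

correctness = LLMEval(
    "response",                         # column with the new answer to grade
    template=correctness_multiclass,
    provider="openai",
    model="gpt-4o-mini",                # assumed evaluator model
    alias="Answer correctness",
    # Assumed way to map a dataset column to the {reference_answer} placeholder;
    # check the LLM-as-a-judge docs for the exact parameter in your version.
    additional_columns={"reference_answer": "reference_answer"},
)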
As you build on this idea, you can design a complete LLM evaluation framework for your system, consisting of multiple datasets and attached evaluators.
For illustration, here's a possible setup for a RAG-based support chatbot. It includes seven separate test datasets and multiple evaluators. Some datasets target specific edge cases, such as handling foreign language queries.
You can combine these multiple tests into a structured evaluation or regression testing suite, running at each release to ensure updates don't break core behavior.
Of course, this setup is just an illustration. You should design your evaluation framework around your product, risks, and observed failures. For example, you might instead focus on:
A good evaluation suite is also never static. It grows with your product: you can continue probing for new risks and add test cases based on real production data.
For live monitoring, you'll use a different set of evaluators than in offline testing. Inputs here are real user interactions, not predefined prompts, and you won't have reference answers.
These evaluations are often applied at the conversation level, treating the full transcript as input and evaluating the final outcome or overall behavior.
Here's an example of an online evaluation setup for a RAG chatbot:
This type of live evaluation helps you:
Online evaluation is often closer to product analytics than to software monitoring. It's not just for catching issues: it also helps you surface examples for review and understand how your LLM system behaves in the real world.
Defining what you are evaluating is often the hardest part. Here are three practical approaches you can combine to shape your evaluation design:
1. Top-down risk assessment. Here, you start by analyzing possible failure modes in the context of your application. This is especially important for systems in sensitive domains such as healthcare, finance, or legal â or for public-facing products that handle customer data.
Thinking through what might go wrong helps you design risk-focused evaluations and adversarial tests. While you may not see these high-risk cases in your initial test logs, that doesnât mean you can ignore them.
For example, if you're building a customer-facing app in a sensitive domain, you'll need to consider how vulnerable users might interact with your system. From there, you can design test cases that simulate those situations and attach evaluators to flag risky outputs.
2. Task-driven schemas. Sometimes your use case naturally defines what you should evaluate. For example:
While each system will require preparing a custom test dataset, your starting evaluation schema can often be shaped by the task itself.
3. Bottom-up: observed failure modes. This is often the most useful, but also the most labor-intensive approach. It requires:
This bottom-up process often uncovers issues you wouldn't have anticipated initially: vague responses, non-actionable advice, verbose rambling, or inconsistent tone.
To do it well, you'll need a reasonable volume of test data, and a process for reviewing and converting observations into tests. If real user data is limited, synthetic datasets can help: you can generate diverse input queries, run them through your app, and analyze the outputs to identify meaningful failure patterns.
Evidently also supports this process with a built-in synthetic data generation feature, where you can configure user profiles and use cases to create tailored test datasets. Read more about it here: Synthetic data generator in Python.
Combining these three approaches (risk-first, task-driven, and error-driven) helps you design an evaluation framework that's both robust and grounded in reality. And it's not something you do once and forget: your evaluation framework will evolve alongside your product.
While setting this up might require investment, it's also one of the most important parts of making sure your AI system truly delivers. You can think of it as your AI product's quality spec, captured in tests and datasets. If you're building with LLMs, this becomes your real moat: not just the model you pick, but the system you build around it. Done well, it lets you ship quickly and with confidence, build user trust, and keep improving over time.
You've seen how LLM evaluation requires a flexible mix of test datasets, evaluators, and methods, tailored to your product. Now the question is: how do you actually implement this in code?
Evidently is an open-source LLM evaluation framework with over 30 million downloads that helps you:
Evidently is available as a Python library, so you can use it in notebooks, batch jobs, or CI systems.
We also offer Evidently Cloud, a hosted platform that makes it easier to manage and collaborate on evaluations at scale, including no-code workflows for domain experts and visual synthetic data generation for test design.
Let's first take a look at the open-source library.
Evidently introduces a few core concepts to structure and implement your evaluation logic.
Datasets. Everything starts with a dataset: a table of inputs and outputs from your LLM system. This could be:
Evidently is designed to work with rich, flexible datasets: you can include prompt metadata, system context, retrieved chunks, and user types, all in the same table.
Once you have a dataset, you can run evaluations.
Descriptors. In Evidently, a descriptor is the core building block for evaluating LLM outputs.
It's a function that processes one row of your dataset, typically a single LLM response, and returns a score or test result using any of the 100+ built-in evaluation methods, from LLM judges to ML model scorers.
A descriptor's output could be:
You can apply multiple descriptors at once, and they can work with any part of your dataset: the model response, input query, retrieved documents, metadata, or combinations of these. For example, you might define a descriptor that scores the relevance of a response to the original query, or one that flags specific brand mentions in user questions.
Here are some examples of deterministic and ML-based descriptors.
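Concretely, creating a dataset and attaching a few descriptors might look like the sketch below, assuming a recent Evidently version (import paths and descriptor parameters may differ across releases):

import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains

df = pd.DataFrame({
    "question": ["How do I export my data?", "Tell me about Competitor X."],
    "response": ["You can export your data from Settings > Export.",
                 "I can only help with questions about our own product."],
})

eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(),
    descriptors=[
        TextLength("response", alias="Length"),                                     # deterministic
        Sentiment("response", alias="Sentiment"),                                   # ML-based
        Contains("response", items=["Competitor X"], alias="Mentions competitor"),  # rule-based
    ],
)
eval_dataset.as_dataframe()  # original columns plus one new column per descriptor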
Or, LLM-based descriptors created with LLM judges, such as this hallucination check.
Built-in descriptors and templates. Evidently includes many ready-to-use descriptors. These are pre-implemented and can be used directly without configuration. For example:
In addition, Evidently provides descriptor templates that let you define new evaluators by simply filling in parameters such as keyword lists, model names, or LLM judge rubrics. You can customize them without boilerplate code, and the outputs are conveniently parsed into columns. Examples:
Templates give you flexibility while keeping your setup modular and reusable.
You can also define your own descriptors using Python. This gives you full control to build evaluators that match your product logic.
Descriptor tests. Once you define descriptors, you can optionally attach tests to them. By default, descriptors return raw values, like scores or labels, that you can inspect or analyze. But when you want to enforce rules or get a clear signal, you can add logical conditions that apply to each descriptor and, optionally, a combined rule for each row.
For instance, you might want to pass a response only if:
This gives you a clear, deterministic pass/fail outcome at the row level â useful for debugging, test coverage, or regression testing.
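Here is a sketch of what this could look like, assuming a recent Evidently version where descriptors accept a tests argument and condition helpers such as lte and gte live in evidently.tests (verify both against the docs for your release):

import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import TextLength, Sentiment
from evidently.tests import lte, gte

df = pd.DataFrame({"response": ["Sure! Go to Settings > Export to download your data."]})

eval_dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(),
    descriptors=[
        TextLength("response", alias="Length", tests=[lte(1500)]),  # fail overly long answers
        Sentiment("response", alias="Sentiment", tests=[gte(0)]),   # fail negative-sounding answers
    ],
)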
Reports. Once you've computed row-level descriptors, you can generate a report to analyze the results at the dataset level.
This gives you a high-level view of what's going on across the full evaluation set. For descriptors, reports typically include summary statistics and distributions. For example, you might see the minimum, maximum, or average of a numerical score, or the share of responses labeled as "correct" or "incorrect."
If you formulated any specific test conditions, you will also be able to see how many passed or failed. These kinds of tests are especially useful in CI/CD pipelines, where you want a clear result to decide whether something can be merged or deployed. You can run them automatically on every update using GitHub Actions or other tools.
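A sketch of the reporting step, again assuming a recent Evidently version (method and parameter names such as include_tests, save_html, and dict are based on recent releases and may differ in yours):

from evidently import Report
from evidently.presets import TextEvals

report = Report([TextEvals()], include_tests=True)  # include_tests adds pass/fail summaries for descriptor tests
my_eval = report.run(eval_dataset, None)            # second argument: an optional reference dataset
my_eval.save_html("llm_eval_report.html")           # share or archive the run
results = my_eval.dict()                            # machine-readable output you can assert on in CI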
Where else can you use Evidently?
The evaluation workflow described above is core to the Evidently library. It can be used across different stages of your product development: from experimental comparisons to live tests on production outputs.
You can also use Evidently to evaluate different types of LLM apps, from simple summarization or classification tasks to RAG systems and AI agents. It supports both quality-focused evals and risk and adversarial behavior testing.
Since Evidently is an open-source tool, it is highly configurable. It doesn't lock you into a specific setup; instead, it gives you reusable components (like descriptors, tests, and templates) that you can combine into your own evaluation workflows. Reports can be exported as HTML, logged into external systems, or reviewed directly in Python.
You can run Evidently in several ways:
If you want to scale up, Evidently Cloud gives you the full platform experience. You get:
You can also self-host a minimal open-source version if you prefer to keep everything in your environment.
If you're just getting started, here are a few helpful links:
📖 Evidently LLM Evaluation Quickstart.
🎥 A 3-part video walkthrough on different LLM evaluation methods from our free course for LLM builders. Sign up here.