Instead of relying on one model to judge LLM outputs, we tried a simple experiment: take AI-generated emails and have three different models (GPT, Claude, and Gemini) evaluate their tone. Was it appropriate for a U.S. corporate tech setting?
In this blog tutorial, we walk through the idea of LLM juries, the open-source implementation, and what we learned from LLM disagreements.
Looking for the code? Jump here.
How do you know if your LLM product is doing a "good" job?
One way is to ask another LLM to tell you. That's the idea behind LLM-as-a-judge: you give an LLM a rubric and an output, and it returns a judgment. Is this correct? Helpful? Safe?
This is how many LLM eval setups work today in experiments, red-teaming, and live monitoring. It's fast, scalable, and surprisingly decent for many use cases.
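To make the mechanics concrete, here is a minimal single-judge sketch using the OpenAI Python SDK. The rubric text, model name, and sample output are placeholders, not the setup we use later in this tutorial.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are an evaluator. Given an assistant's answer, reply with a single "
    "label, GOOD or BAD, followed by a one-sentence explanation."
)

def judge(output_text: str) -> str:
    # One rubric plus one output in, one judgment out
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": output_text},
        ],
    )
    return response.choices[0].message.content

print(judge("Sure, rebooting the router usually fixes this. Try that first."))
```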
Of course, success depends on how you set up the judge: the evaluation criteria, the prompt, and the model you choose. And critically, you'll want to tune your LLM judge against human labels.
But LLMs don't always agree. (Neither do humans.) And for some tasks, even defining what "good" means is hard.
So here's another idea: don't ask one model, ask several. This is the LLM-as-a-jury approach, where you ask multiple LLMs to evaluate the same output. Each will tap into its own training data and give an independent verdict.
This concept is formalized in Replacing Judges with Juries (Verga et al., 2024).
The authors propose a Panel of LLM Evaluators (PoLL): a jury of smaller, diverse models from different families. Their results show that jury-based evaluations align better with human judgment, can reduce model bias, and can be cheaper than using a single large judge like GPT-4.
While not every task needs a jury, it's a powerful tool.
First, using multiple models can add stability to your evaluations. Sometimes a single model will flip its decision, especially on borderline cases. Having a panel smooths that out. You can take a majority vote, average the scores, or treat even a single "negative" label as a warning (see the short sketch below).
Second, the disagreement itself can be useful. When models don't agree, that tells you something. Maybe your criteria are too vague, or there is an edge case you didn't consider. That's exactly the kind of output worth flagging. You can use it to improve your prompt, tighten your rubric, or rethink the generation itself.
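The aggregation rules from the first point fit in a few lines of plain Python. The per-judge verdicts below are made up for illustration.

```python
from collections import Counter

# Hypothetical verdicts from three judges for a single output
verdicts = {"gpt": "APPROPRIATE", "claude": "APPROPRIATE", "gemini": "INAPPROPRIATE"}
labels = list(verdicts.values())

def majority_vote(labels: list[str]) -> str:
    # The most common label wins
    return Counter(labels).most_common(1)[0][0]

def any_negative(labels: list[str], negative: str = "INAPPROPRIATE") -> bool:
    # Treat even a single negative label as a warning
    return negative in labels

print(majority_vote(labels))  # APPROPRIATE
print(any_negative(labels))   # True: worth a closer look
```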
To make all this more concrete, let's run a small experiment.
Imagine you're building an AI assistant that generates emails from user instructions. The user can type a prompt like "decline the invite" and the assistant turns it into a complete email.
If you are building a tool like this, you'd need to evaluate how good your emails are.
We built a minimal dataset of 15 examples, each containing a user input and the generated email.
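The dataset shape is simple: one column with the user instruction, one with the generated email. Here are two made-up rows for illustration; the real examples differ.

```python
import pandas as pd

# Two illustrative rows; the actual dataset contains 15 examples
emails_df = pd.DataFrame(
    {
        "user_input": [
            "decline the invite",
            "tell the team the server is down again",
        ],
        "generated_email": [
            "Hi Anna, thank you for the invitation. Unfortunately, I won't be able "
            "to make it this time. Best, Alex",
            "Hey team, the server decided to die again. Looking into it now.",
        ],
    }
)
print(emails_df.head())
```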
Next, we needed an evaluation criterion. We can keep it simple: is the email's tone appropriate for a U.S. tech workplace?
This is exactly what we'll ask the LLM judges to decide.
We'll show each model the generated email and ask it to return one of two labels: APPROPRIATE or INAPPROPRIATE.
Then we'll aggregate the results to see whether all the judges we assigned (GPT, Claude, and Gemini) agree.
We picked this task because it's the kind of subjective, real-world judgment you often want a second opinion on. In real workplace settings, tone varies by culture, company, and even personal style. People may disagree. Models will too.
By comparing their judgments, we can surface hidden assumptions and get closer to a working definition of what "appropriate" really means.
We will use the open-source Evidently library to run and compare judgments across multiple models. It's a toolkit designed for LLM evaluations, with built-in templates, reports, and a clean interface for plugging in multiple LLMs as evaluators.
Here is how it works.
In Evidently, evaluations are defined through descriptors: reusable logic units that define how to score an output. We will create an LLMEval descriptor using the binary classification template. This wraps our core judgment task: is this email appropriate?
Here is the prompt template we used.
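Structurally, it is a binary classification template: evaluation criteria, two target labels, and a request to return reasoning. The sketch below paraphrases the criteria and uses one version of the Evidently import paths; check the complete code example for the exact setup.

```python
from evidently.descriptors import LLMEval
from evidently.features.llm_judge import BinaryClassificationPromptTemplate

# Criteria wording here is a paraphrase, not the original prompt text
tone_template = BinaryClassificationPromptTemplate(
    criteria=(
        "An APPROPRIATE email has a tone suitable for a U.S. tech workplace: "
        "professional, clear, and reasonably friendly. "
        "An INAPPROPRIATE email is rude, overly casual, passive-aggressive, "
        "or otherwise off for that setting."
    ),
    target_category="APPROPRIATE",
    non_target_category="INAPPROPRIATE",
    uncertainty="unknown",
    include_reasoning=True,
    pre_messages=[("system", "You are evaluating the tone of workplace emails.")],
)

# A single judge built on this template
gpt_judge = LLMEval(
    subcolumn="category",
    template=tone_template,
    provider="openai",
    model="gpt-4o-mini",
    display_name="Tone (GPT-4o-mini)",
)
```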
If you run the evaluation with one judge (in this case, GPT-4o-mini), you will get a result like this: a judgment with a short explanation.
To run the complete evaluation suite, we'll need to create several descriptors, one for each LLM.
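A sketch of that wiring, reusing the same template for all three judges. The provider identifiers and model names here are assumptions: check which providers your Evidently version supports and how their API keys are configured.

```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import LLMEval

# One judge per model family, all sharing tone_template from above
judges = [
    LLMEval(subcolumn="category", template=tone_template,
            provider="openai", model="gpt-4o-mini", display_name="Tone (GPT)"),
    LLMEval(subcolumn="category", template=tone_template,
            provider="anthropic", model="claude-3-5-sonnet-20240620", display_name="Tone (Claude)"),
    LLMEval(subcolumn="category", template=tone_template,
            provider="gemini", model="gemini-1.5-flash", display_name="Tone (Gemini)"),
]

# Score the generated emails with all three judges in one report
report = Report(metrics=[TextEvals(column_name="generated_email", descriptors=judges)])
report.run(reference_data=None, current_data=emails_df,
           column_mapping=ColumnMapping(text_features=["generated_email"]))
report.show()
```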
They will all reuse the same evaluation prompt and review the same set of 15 test emails.
Additionally, we'll add a test condition to each judgment: we expect the model to return an "APPROPRIATE" label. Then, we can add a TestSummary for each row to aggregate the individual test results into a final score.
Having a final score makes it easy to automate what happens next. You can decide to fail a test if even one judge flags the output, or set another threshold. Then, you can define expectations at the dataset level, like requiring 90% of test cases to pass for an experiment to be considered successful. It gives you a simple way to track regressions or enforce quality gates without needing to check each result manually.
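The gating logic itself is simple. Here is a pandas sketch, assuming you have exported the per-judge labels into a DataFrame with one column per judge (the column names are hypothetical):

```python
import pandas as pd

def tone_gate(results: pd.DataFrame, judge_cols: list[str], threshold: float = 0.9) -> bool:
    """Return True if enough emails pass the tone check across all judges."""
    # Per-row rule: a row passes only if every judge returned "APPROPRIATE"
    passed = results[judge_cols].eq("APPROPRIATE").all(axis=1)
    # Dataset-level rule: require at least `threshold` share of rows to pass
    return passed.mean() >= threshold

# Example: fail the experiment if fewer than 90% of emails pass
# tone_gate(results, ["gpt_label", "claude_label", "gemini_label"], threshold=0.9)
```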
For our analysis, this will make it easy to spot which emails were unanimously approved or rejected, and to sort the results by level of agreement.
Check the complete code example for the implementation details.
We also added an extra rule for convenience: if models don't all agree, we explicitly flag the row as a "disagreement".
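In a plain pandas sketch (with hypothetical per-judge label columns), the rule looks like this:

```python
import pandas as pd

# Hypothetical per-judge labels exported from the evaluation run
results = pd.DataFrame({
    "gpt_label":    ["APPROPRIATE", "APPROPRIATE"],
    "claude_label": ["APPROPRIATE", "INAPPROPRIATE"],
    "gemini_label": ["APPROPRIATE", "APPROPRIATE"],
})

judge_cols = ["gpt_label", "claude_label", "gemini_label"]
# Flag rows where the judges did not all return the same label
results["disagreement"] = results[judge_cols].nunique(axis=1) > 1
print(results[results["disagreement"]])
```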
And in our case, we got 5 examples where models couldn't agree!
Let's take a closer look.
Here's a fun one. The email says: "The server decided to die again." Claude was fine with the sarcastic tone, seeing it as typical tech-team banter. GPT and Gemini called it unprofessional: too flippant for incident communication.
In another case, the generated email included an emoji. GPT penalized the informality, calling it too casual. Claude and Gemini saw no problem; they considered it appropriate for internal updates.
Another email mentioned that the first draft was very rough. GPT judged this self-deprecating tone ("not sure it's any good...") as lacking professionalism. Claude and Gemini interpreted it as humble and collaborative.
The phrase "I'll have to assume this isn't a priority" triggered disagreement again.
GPT and Claude saw it as passive-aggressive. Gemini considered it a reasonable way to nudge someone after being ignored. One model's "firm" is another model's "rude"!
Of course, all of this is just a toy example, but one can see how it can be useful in a real setting.
Should your emails include emoji? How casual is too casual? It's up to you. Seeing where models disagree, you can take those edge cases and use them to improve both your core email generation and your evaluation prompt.
And don't forget: the other 10 examples in our set were clear-cut, with all three models in agreement. So you can have some confidence and focus your attention on fixing what matters.
Do you always need an LLM jury, then? Not necessarily.
In many cases, a single strong evaluation model with good prompt engineering works well. It's hard enough to tune one judge! So investing in prompt iterations and rubric design is often the default choice.
But there are cases where multiple models genuinely help.
One is when you're still figuring out what "good" even means. If you're thinking about evaluating vague traits like clarity or helpfulness, asking several models to weigh in can surface subtle differences. This won't automate your evals, but you can use model disagreement to better define your criteria in the first place.
Another case is when the task is subjective. Like "Is this rejection email empathetic enough?" or "Could this meme be seen as offensive?" Even humans would often disagree. That's where multiple perspectives matter. You can even intentionally assign your judges different personas, or try different models.
Sometimes it's just about having a second opinion. In high-stakes use cases (customer communication, medical summaries, anything safety-sensitive), a single disagreement can be enough to justify a closer look or blocking an output. Having multiple judges makes your evals more robust.
And finally, there's coverage. Different models bring different training data, assumptions, and blind spots, so combining their decisions can give you a more complete picture. For example, you might use multiple LLMs to evaluate the style of generated code. Or, if you are working with less capable smaller models, you can combine them to tap into their collective wisdom.
You can build on this evaluation ensemble idea. It's not just about multiple LLMs!
For example, one interesting direction is to break down evaluation criteria, even when using the same model. Take something like "correctness": instead of treating it as a single label, you can split it into sub-judgments such as contradiction, omission, unverifiable addition, and style drift. Then, combine those into a more nuanced score.
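A sketch of what combining sub-judgments could look like, assuming each sub-criterion comes back as a binary flag from the same judge model (the weights are arbitrary):

```python
# Hypothetical sub-judgments for one answer, each a binary flag
sub_judgments = {
    "contradiction": False,
    "omission": True,
    "unverifiable_addition": False,
    "style_drift": False,
}

# Weight the failure modes by how much you care about them
weights = {
    "contradiction": 0.4,
    "omission": 0.3,
    "unverifiable_addition": 0.2,
    "style_drift": 0.1,
}

# Start from a perfect score and subtract the weight of each triggered issue
correctness_score = 1.0 - sum(
    weights[name] for name, flagged in sub_judgments.items() if flagged
)
print(correctness_score)  # 0.7
```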
You can also try prompting the same model in different ways, for example, using a neutral evaluator prompt vs. a devil's advocate one. Or framing it once as "is this good?" and once as "what's wrong with this?". Comparing those responses can reveal how much the model's judgment shifts based on framing alone.
In both cases, you're not just looking for a simple yes/no answer; you're iterating to make your evaluations more stable, explainable, and aligned.
If this was useful, check out Evidently, the open-source Python library we used to build this workflow. It includes prompt templates, test logic, and reporting tools to help you evaluate LLM outputs systematically.
We're also building Evidently Cloud, a platform for running evals continuously, managing test suites, and collaborating on prompt and output analysis.
⭐ Star us on GitHub
💬 Or sign up to start with Evidently Cloud now.