Retrieval-Augmented Generation (RAG) is a popular technique for grounding the outputs of large language models (LLMs) in reliable, context-specific data. By pulling in relevant information from trusted data sources, RAG helps reduce hallucinations, improve response accuracy, and enable source-backed and personalized answers.
RAG is already powering a wide range of real-world systems, from customer support bots to fraud investigation tools. As adoption grows, so does the need to evaluate these systems effectively.
In this blog, we highlight seven RAG benchmarks that help measure and compare how well different LLMs handle core RAG challenges like large context windows, grounded reasoning, and using retrieved evidence effectively.
Want more examples of LLM benchmarks? We put together a database of 200 LLM benchmarks and datasets you can use to evaluate the performance of language models.
Bookmark the list ⟶
The Needle-in-a-Haystack (NIAH) test, first proposed by Greg Kamradt, is a simple yet powerful method for testing the in-context retrieval ability of long-context LLMs. It checks whether a model can successfully retrieve a small, planted piece of information (the “needle”) hidden in an extensive collection of irrelevant data (the “haystack”). The model is asked a question that can only be answered correctly by finding and using that specific information. Evaluation is straightforward: did the model retrieve the needle and answer accurately? You can also iterate over document depths (where the needle is placed) and context lengths to map performance.
Initially, Paul Graham's essays were used as the “haystack,” with a random statement – “The best thing to do in San Francisco is to go to Dolores Park and eat a sandwich on a sunny day” – planted inside to perform the test. However, you can use any data corpus as the long context for the test.
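To make the setup concrete, here is a minimal sketch of an NIAH harness. It assumes you supply your own `ask_llm` callable (any LLM client) and a long text corpus; the scoring here is a simple keyword check rather than the original repo's LLM-graded evaluation.

```python
# Minimal Needle-in-a-Haystack sketch. `ask_llm` is a placeholder for your own
# model call (e.g., an OpenAI or local client); scoring is a simple keyword check.

NEEDLE = ("The best thing to do in San Francisco is to go to "
          "Dolores Park and eat a sandwich on a sunny day.")
QUESTION = "What is the best thing to do in San Francisco?"

def build_context(haystack: str, needle: str, depth: float, length: int) -> str:
    """Trim the haystack to `length` characters and insert the needle
    at a relative depth (0.0 = start, 1.0 = end)."""
    trimmed = haystack[:length]
    position = int(len(trimmed) * depth)
    return trimmed[:position] + " " + needle + " " + trimmed[position:]

def run_niah(haystack: str, ask_llm,
             depths=(0.0, 0.25, 0.5, 0.75, 1.0),
             lengths=(8_000, 32_000, 128_000)) -> list[dict]:
    """Iterate over needle depths and context lengths; record whether
    the answer mentions the needle's key detail."""
    results = []
    for length in lengths:
        for depth in depths:
            context = build_context(haystack, NEEDLE, depth, length)
            answer = ask_llm(f"{context}\n\nQuestion: {QUESTION}\nAnswer:")
            results.append({"length": length, "depth": depth,
                            "correct": "dolores park" in answer.lower()})
    return results
```

Plotting the `correct` rate over depth and length gives the familiar NIAH heatmap of where in the context the model starts to lose track of the needle.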
Video explainer: Needle In A Haystack, by Greg Kamradt
Test repo: Needle In A Haystack on GitHub
BeIR (Benchmarking Information Retrieval) evaluates retrieval models across diverse datasets and tasks. It tests the ability of models to find relevant documents, focusing on zero-shot and domain-agnostic evaluation.
BeIR includes 18 datasets across 9 task types, including fact checking, duplicate question detection, question answering, argument retrieval, and retrieval from forums. The benchmark also allows testing retrieval abilities across diverse domains, from generic ones like news or Wikipedia to highly specialized ones such as biomedical scientific publications. BeIR can evaluate various retrieval systems, such as dense and sparse retrievers, hybrid models, and re-ranking systems.
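As a rough illustration, here is how a zero-shot BeIR run can look with the open-source `beir` Python package, following its documented quickstart. The dataset URL, model name, and module paths come from the library's docs and may differ across versions, so treat this as a sketch rather than a drop-in script.

```python
# Sketch of a zero-shot BeIR evaluation with a dense retriever (pip install beir).
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download one of the BeIR datasets (SciFact: scientific fact checking)
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Dense retrieval with a sentence-transformers model, scored with dot product
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # e.g., NDCG@1, NDCG@3, NDCG@10, ...
```

Swapping the dataset URL or the retriever class is enough to cover the other domains and retrieval paradigms the benchmark supports.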
Example questions:
Paper: BeIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Dataset: BeIR dataset
FRAMES (Factuality, Retrieval, And reasoning MEasurement Set) offers a unified framework for assessing LLM performance in end-to-end RAG scenarios. It tests RAG systems across three dimensions: factuality, retrieval accuracy, and reasoning.
The dataset comprises over 800 test samples with challenging multi-hop questions that require integrating information from 2-15 Wikipedia articles to answer. FRAMES questions also span the reasoning types needed to answer them, including numerical, tabular, and temporal reasoning, multiple constraints, and post-processing.
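Below is a simplified sketch of a FRAMES-style evaluation loop. The record fields and the `retrieve_articles` / `ask_llm` callables are placeholders for your own pipeline, and the lenient string match stands in for the paper's LLM-based grading.

```python
# Simplified FRAMES-style evaluation loop. Record fields and callables are
# placeholders; the official grading uses an LLM judge, so treat this
# normalized string match as a rough proxy.
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace for a lenient match."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def evaluate_frames(records, retrieve_articles, ask_llm, k: int = 5) -> float:
    correct = 0
    for record in records:  # each record: {"question": ..., "answer": ...}
        docs = retrieve_articles(record["question"], k=k)  # multi-hop: several articles may be needed
        context = "\n\n".join(docs)
        prediction = ask_llm(
            f"Answer using only the context below.\n\n{context}\n\n"
            f"Question: {record['question']}\nAnswer:"
        )
        correct += normalize(record["answer"]) in normalize(prediction)
    return correct / len(records)
```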
Example questions:
Paper: Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
Dataset: FRAMES dataset
RAGTruth is a benchmark that helps evaluate the extent of hallucination in RAG systems. It is tailored for analyzing word-level hallucinations and comprises 18,000 naturally generated responses from diverse LLMs using RAG. The benchmark distinguishes between four types of hallucinations: evident conflict, subtle conflict, evident introduction of baseless information, and subtle introduction of baseless information.
RAGTruth can be used to assess both hallucination frequencies in different models and the effectiveness of hallucination detection methodologies.
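As an illustration of the second use case, here is a minimal sketch that scores a hallucination detector at the response level on a RAGTruth-style corpus. The record structure is an assumption; the benchmark also supports finer-grained, span-level comparison of annotated character offsets.

```python
# Response-level scoring sketch for a hallucination detector. Each item carries
# a gold boolean ("does this response contain a hallucinated span?") and the
# detector's boolean prediction; span-level scoring would compare offsets instead.
def detection_scores(gold_labels: list[bool], predicted: list[bool]) -> dict:
    tp = sum(g and p for g, p in zip(gold_labels, predicted))
    fp = sum((not g) and p for g, p in zip(gold_labels, predicted))
    fn = sum(g and (not p) for g, p in zip(gold_labels, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: gold annotations vs. a detector's verdicts on four responses
print(detection_scores([True, False, True, False], [True, True, True, False]))
```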
Paper: RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models
Dataset: RAGTruth dataset
RULER extends the original Needle-in-a-Haystack (NIAH) test by varying the number and types of "needles" (target information) within large contexts. It tests LLMs across four task categories: retrieval, multi-hop tracing, aggregation, and question answering.
RULER is a synthetic benchmark. It automatically generates evaluation examples based on input configurations of sequence length and task complexity.Â
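The toy generator below shows the idea in miniature: build a synthetic multi-key needle-in-a-haystack example from a configuration of context length and needle count. RULER's real generator covers all four task categories and much longer sequences; everything here is illustrative.

```python
# Toy generator in the spirit of RULER's synthetic tasks: multi-key
# needle-in-a-haystack examples produced from a simple configuration.
import random
import string

def make_example(num_filler_words: int = 5_000, num_needles: int = 4) -> dict:
    filler = ["the grass is green and the sky is blue."] * (num_filler_words // 8)
    needles = {}
    for _ in range(num_needles):
        key = "".join(random.choices(string.ascii_lowercase, k=8))
        value = str(random.randint(100_000, 999_999))
        needles[key] = value
        filler.insert(random.randrange(len(filler)),
                      f"The special magic number for {key} is {value}.")
    target_key = random.choice(list(needles))
    return {
        "context": " ".join(filler),
        "question": f"What is the special magic number for {target_key}?",
        "answer": needles[target_key],
    }

example = make_example(num_filler_words=2_000, num_needles=2)
print(example["question"], "->", example["answer"])
```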
Paper: RULER: What's the Real Context Size of Your Long-Context Language Models?
Dataset: RULER dataset
Multimodal Needle in a Haystack (MMNeedle) evaluates the long-context capabilities of Multimodal Large Language Models (MLLMs). It tests the ability of MLLMs to locate a target sub-image (the “needle”) within a set of images (the “haystack”) based on textual descriptions.
The benchmark covers diverse settings with varying context lengths and single and multiple needles. It includes 40,000 images, 560,000 captions, and 280,000 needle-haystack pairs. MMNeedle also employs image stitching to further increase the input context length.
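Here is a rough sketch of the stitching step using Pillow: tile sub-images into a grid, drop the "needle" into one slot, and keep its (row, column) position as the ground truth the MLLM must produce. The file paths and grid size are placeholders.

```python
# Sketch of MMNeedle-style image stitching with Pillow (pip install pillow).
from PIL import Image
import random

def stitch(image_paths: list[str], needle_path: str, grid: int = 4, tile: int = 256):
    """Tile grid x grid sub-images into one stitched image; one slot holds the needle."""
    assert len(image_paths) >= grid * grid, "need enough distractor images to fill the grid"
    canvas = Image.new("RGB", (grid * tile, grid * tile))
    paths = list(image_paths[: grid * grid])
    needle_index = random.randrange(grid * grid)
    paths[needle_index] = needle_path  # replace one distractor with the needle
    for i, path in enumerate(paths):
        row, col = divmod(i, grid)
        img = Image.open(path).convert("RGB").resize((tile, tile))
        canvas.paste(img, (col * tile, row * tile))
    return canvas, divmod(needle_index, grid)  # stitched image + (row, col) of the needle
```

The evaluation then checks whether the MLLM, given the stitched image and a textual description of the needle, returns the correct (row, column) position.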
Paper: Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
Dataset: MMNeedle dataset
FEVER (Fact Extraction and VERification) is a publicly available dataset for verification against textual sources. It is designed to test a system’s ability to fact-check claims using evidence from Wikipedia. It includes over 185,000 human-generated claims based on Wikipedia articles, each labeled as Supported, Refuted, or Not Enough Info. A model is given a claim (e.g., “The Eiffel Tower is in Berlin”) and must then retrieve Wikipedia sentences relevant to this claim and determine the correct label.
The benchmark evaluates information retrieval, evidence selection, and reasoning capabilities. FEVER can be used to assess RAG pipelines and LLMs' ability to avoid hallucination and reason with evidence.
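For scoring, a simplified sketch is shown below: label accuracy counts correct verdicts, while the stricter FEVER score also requires the predicted evidence to cover at least one complete gold evidence set for Supported/Refuted claims. The record format here is an assumption, not the official scorer's.

```python
# Simplified FEVER scoring sketch. Label accuracy needs only the right verdict;
# the strict FEVER score also requires full coverage of one gold evidence set.
def fever_score(predictions: list[dict], gold: list[dict]) -> tuple[float, float]:
    label_hits, strict_hits = 0, 0
    for pred, ref in zip(predictions, gold):
        label_correct = pred["label"] == ref["label"]
        label_hits += label_correct
        if ref["label"] == "NOT ENOUGH INFO":
            evidence_ok = True  # no evidence required for this verdict
        else:
            predicted = set(map(tuple, pred["evidence"]))          # {(page, sentence_id), ...}
            evidence_ok = any(set(map(tuple, group)) <= predicted  # one full gold set covered
                              for group in ref["evidence_sets"])
        strict_hits += label_correct and evidence_ok
    n = len(gold)
    return label_hits / n, strict_hits / n  # (label accuracy, FEVER score)
```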
Example claims:
Paper: FEVER: a large-scale dataset for Fact Extraction and VERification
Dataset: FEVER dataset
While benchmarks help compare models, your RAG system also needs custom evaluations on your own data, both during development and in production.
That’s why we built Evidently. Our open-source library, with over 25 million downloads, makes it easy to test and evaluate LLM-powered applications, from chatbots to RAG. It simplifies evaluation workflows, offering 100+ built-in checks and easy configuration of custom LLM judges for every use case.
We also provide Evidently Cloud, a no-code workspace for teams to collaborate on AI quality, testing, and monitoring, and to run complex evaluation workflows.
Ready to test your RAG? Sign up for free or schedule a demo to see Evidently Cloud in action. We're here to help you build with confidence!