
200 LLM benchmarks and evaluation datasets

A database of LLM benchmarks and datasets to evaluate the performance of language models.
First published: December 10, 2024.
How can you evaluate different LLMs? LLM benchmarks are standardized tests designed to measure and compare the abilities of different language models. We put together a database of 200 LLM benchmarks and publicly available datasets you can use to evaluate LLM capabilities in various domains, including reasoning, language understanding, math, question answering, coding, tool use, and more.
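To make the mechanics concrete, here is a minimal Python sketch of what running a benchmark typically boils down to: query the model on each benchmark item and score its answers against the reference labels. `ask_model` is a placeholder for whatever LLM API you call, and the two sample questions are made up for illustration rather than drawn from any of the listed datasets.

```python
# Minimal benchmark-style evaluation: exact-match accuracy over
# (question, reference answer) pairs. Replace ask_model with a real
# LLM call and the sample items with an actual benchmark dataset.

def ask_model(question: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    return "Paris" if "France" in question else "unknown"

benchmark = [  # illustrative items, not from a real benchmark
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

correct = sum(
    ask_model(question).strip().lower() == reference.strip().lower()
    for question, reference in benchmark
)
print(f"Exact-match accuracy: {correct / len(benchmark):.2%}")
```

Real benchmarks differ mainly in the dataset and the scoring rule (exact match, multiple-choice accuracy, unit tests for code, model-graded scoring, etc.), but the loop stays the same.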

Maintained by the team behind Evidently, an open-source tool for ML and LLM evaluation.

🔥 Free course on LLM evaluations

Building an LLM-powered app? While benchmarks help compare models, your AI product needs custom evaluations. Learn how to create LLM judges (see the sketch below), evaluate RAG systems, and run adversarial tests.
Learn more and sign up
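As a rough, generic illustration of the LLM-judge idea (not the course material or Evidently's API): a second model grades each answer against a simple rubric. `call_llm` here is a placeholder for whatever chat-completion API you use, and the one-word rubric is an assumption for the sketch.

```python
# Sketch of an LLM-as-a-judge check: a grading model receives the
# question and the product's answer and returns a verdict.

JUDGE_PROMPT = """You are grading an answer to a user question.
Question: {question}
Answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real LLM call (hosted API or local model)."""
    return "CORRECT"

def judge(question: str, answer: str) -> bool:
    """Return True if the judge model labels the answer CORRECT."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("CORRECT")

print(judge("What is the capital of France?", "Paris"))
```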

LLM benchmarks and datasets

All the content belongs to the respective parties; we simply put the links together.

Did we miss a great LLM benchmark or dataset? Let us know! Our Discord community of 2,500+ ML practitioners and AI engineers is the best place to share feedback.
