LLMs are rapidly becoming more advanced at coding tasks, driving the development of real-world applications, from coding co-pilots to developer productivity tools to automated code reviewers. But as model capabilities grow, so does the need to measure their performance.
This blog highlights 10 LLM coding benchmarks designed to evaluate and compare how different models perform on various coding tasks, including code completion, snippet generation, debugging, and more.
HumanEval measures how well LLMs can generate code. It tests their ability to understand programming tasks and produce syntactically correct and functionally accurate pieces of code based on given prompts.
The dataset includes 164 programming tasks and unit tests that automatically check the model-generated code against expected results, simulating how a human developer would validate their work. A model’s solution must pass all provided test cases for a given problem to be considered correct.
Paper: Evaluating Large Language Models Trained on Code by Chen et al. (2021)
Dataset: HumanEval dataset
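To make this concrete, here is a minimal sketch of how a HumanEval-style task can be checked: a candidate completion is appended to the task prompt and executed against the unit tests. The task, tests, and helper below are simplified illustrations, not the official evaluation harness (which runs candidates in isolated processes with timeouts).

```python
# Illustrative HumanEval-style check: a candidate completion is appended to the task
# prompt and executed against the unit tests. Simplified sketch, not the official harness.

task_prompt = '''
def sum_first_k(numbers: list, k: int) -> int:
    """Return the sum of the first k elements of numbers."""
'''

# A candidate completion produced by the model under evaluation (hypothetical).
candidate_completion = "    return sum(numbers[:k])\n"

# Unit tests: the solution must pass every assertion to count as correct.
unit_tests = '''
assert sum_first_k([1, 2, 3, 4], 2) == 3
assert sum_first_k([5], 1) == 5
assert sum_first_k([0, 0, 7], 3) == 7
'''

def passes_all_tests(program: str) -> bool:
    """Execute the assembled program; any failing assertion means the task is unsolved."""
    try:
        exec(program, {})
        return True
    except Exception:
        return False

print(passes_all_tests(task_prompt + candidate_completion + unit_tests))  # True
```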
Mostly Basic Programming Problems (MBPP) evaluates how well LLMs can generate short Python programs from natural language descriptions. It includes 974 entry-level tasks covering common programming concepts like list manipulation, string operations, loops, conditionals, and basic algorithms. Each task provides a clear description, an example solution, and a set of test cases to validate the LLM's output.
Paper: Program Synthesis with Large Language Models by Austin et al. (2021)
Dataset: MBPP dataset
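The sketch below shows what an MBPP-style record looks like and how a candidate program can be validated against its asserts. The task, field names, and checker are shown for illustration only and are not the official dataset schema or harness.

```python
# A simplified MBPP-style record: natural-language task, reference solution, and asserts.
example_task = {
    "text": "Write a function to find the maximum of two numbers.",
    "code": "def maximum(a, b):\n    return a if a >= b else b",
    "test_list": [
        "assert maximum(5, 10) == 10",
        "assert maximum(-1, -2) == -1",
        "assert maximum(9, 9) == 9",
    ],
}

def is_correct(candidate_code: str, tests: list) -> bool:
    """Run the candidate program, then each assert; all must pass."""
    namespace = {}
    try:
        exec(candidate_code, namespace)
        for test in tests:
            exec(test, namespace)
        return True
    except Exception:
        return False

# Here we score the reference solution itself; in an evaluation run, this would be
# the code the LLM generated from example_task["text"].
print(is_correct(example_task["code"], example_task["test_list"]))  # True
```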
SWE-bench (Software Engineering Benchmark) assesses the ability of LLMs to tackle real-world software issues sourced from GitHub. It includes more than 2200 issues and their corresponding pull requests from 12 widely used Python repositories. The benchmark challenges models to generate patches that fix the issues based on the provided codebase and issue description. Unlike simpler code generation tasks, SWE-bench requires models to handle long contexts, perform complex reasoning, and operate within execution environments.
Paper: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? by Jimenez et al. (2023)
Dataset: SWE-bench dataset
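The following sketch illustrates the general idea of SWE-bench-style evaluation: check out the repository at the issue's base commit, apply the model-generated patch, and re-run the relevant tests. The repository path, commit, patch file, and test selection are placeholders; this is not the official SWE-bench harness.

```python
# Hedged sketch of SWE-bench-style patch evaluation. All paths and names are placeholders.
import subprocess

REPO_DIR = "/tmp/example_repo"          # hypothetical checkout of one of the benchmark repos
BASE_COMMIT = "abc1234"                 # placeholder: commit the issue was reported against
MODEL_PATCH = "/tmp/model_patch.diff"   # unified diff generated by the LLM

def patch_resolves_issue() -> bool:
    """Return True if the patch applies cleanly and the issue's tests now pass."""
    subprocess.run(["git", "checkout", BASE_COMMIT], cwd=REPO_DIR, check=True)
    applied = subprocess.run(["git", "apply", MODEL_PATCH], cwd=REPO_DIR)
    if applied.returncode != 0:
        return False  # the generated patch does not even apply to the codebase
    # SWE-bench tracks which tests must flip from failing to passing; a single
    # placeholder test file stands in for that set here.
    tests = subprocess.run(["python", "-m", "pytest", "tests/test_issue.py"], cwd=REPO_DIR)
    return tests.returncode == 0

if __name__ == "__main__":
    print("resolved" if patch_resolves_issue() else "not resolved")
```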
CodeXGLUE is a benchmark dataset for program understanding and code generation. It includes 14 datasets, 10 diversified programming tasks, and a platform for model evaluation and comparison. The tasks include clone detection, defect detection, cloze test, code completion, code translation, code search, code repair, text-to-code generation, code summarization, and documentation translation.
The benchmark's creators also provide three baseline models: a BERT-style model for program understanding problems, a GPT-style model for completion and generation problems, and an Encoder-Decoder framework that tackles sequence-to-sequence generation problems.
Paper: CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation by Lu et al. (2021)
Dataset: CodeXGLUE dataset
DS-1000 is a code generation benchmark that focuses on data science problems. It contains 1000 coding challenges sourced from 451 StackOverflow questions. The tasks span seven popular Python libraries, including NumPy, Pandas, TensorFlow, PyTorch, and scikit-learn.
Example tasks include realistic operations like data manipulation (e.g., Pandas DataFrame transforms) and machine learning tasks (e.g., training a model with scikit‑learn or PyTorch). Each completed task is run against test cases to check functional correctness and against constraints on API usage to make sure the generated code uses intended library functions.
Paper: DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation by Lai et al. (2022)
Dataset: DS-1000 dataset
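To illustrate, here is a simplified sketch of the two checks described above: functional correctness on concrete inputs and a surface-form constraint on which APIs the generated code uses. The task, constraint, and checker are invented for illustration and do not come from the actual DS-1000 evaluator.

```python
# Hedged sketch of DS-1000-style checking: run the generated snippet on a concrete
# DataFrame, compare the result to a reference, and enforce an API-usage constraint.
import pandas as pd

# Hypothetical model output for a "drop duplicate users" task.
generated_code = "result = df.drop_duplicates(subset=['user_id'], keep='first')"

def check(generated_code: str) -> bool:
    df = pd.DataFrame({"user_id": [1, 1, 2], "clicks": [10, 12, 3]})
    expected = df.drop_duplicates(subset=["user_id"], keep="first")

    # Surface-form constraint: the intended Pandas API must actually be used.
    if "drop_duplicates" not in generated_code:
        return False

    # Functional correctness: execute the snippet and compare against the reference output.
    namespace = {"pd": pd, "df": df.copy()}
    exec(generated_code, namespace)
    return namespace["result"].equals(expected)

print(check(generated_code))  # True
```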
APPS is a code generation benchmark that measures the ability of LLMs to generate “satisfactory Python code” based on an arbitrary natural language specification. The benchmark includes 10,000 problems, collected from open-access coding websites like Codeforces or Kattis. The task difficulty ranges from one-line solutions to substantial algorithmic challenges. Each problem is accompanied by test cases and ground-truth solutions to evaluate the generated code.
Paper: Measuring Coding Challenge Competence With APPS by Hendrycks et al. (2021)
Dataset: APPS dataset
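A rough sketch of APPS-style scoring follows: the generated program is run on each test case's input, and its output is compared with the expected output. The test cases and the solution file name are placeholders, not taken from the actual dataset.

```python
# Hedged sketch of APPS-style scoring: run the generated program on each test case's
# stdin and compare stdout to the expected output.
import subprocess

test_cases = [
    {"input": "3\n1 2 3\n", "output": "6\n"},   # e.g., a "sum N integers" problem
    {"input": "2\n10 -4\n", "output": "6\n"},
]

def score_solution(solution_file: str) -> float:
    """Fraction of test cases where the program's output matches the expected output."""
    passed = 0
    for case in test_cases:
        try:
            run = subprocess.run(
                ["python", solution_file],
                input=case["input"],
                capture_output=True,
                text=True,
                timeout=5,  # competitive-programming tasks need per-case time limits
            )
        except subprocess.TimeoutExpired:
            continue  # a timed-out run counts as a failed test case
        if run.stdout == case["output"]:
            passed += 1
    return passed / len(test_cases)

print(score_solution("generated_solution.py"))  # hypothetical model-generated file
```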
EvalPlus is an evaluation framework that assesses the functional correctness of LLM-synthesized code. It extends the test cases of the popular HumanEval benchmark by 80x and the MBPP benchmark by 35x.
EvalPlus augments the evaluation dataset with large amounts of test cases produced by an automatic test input generator. It uses ChatGPT to generate a set of seed inputs for later mutation. The generator randomly selects a seed from a seed pool of ChatGPT-generated inputs and mutates it to create a new input. If the new input meets the requirements, it is added to the seed pool, and the process repeats.
Paper: Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation by Liu et al. (2023)
Dataset: EvalPlus dataset
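The sketch below mimics the seed-and-mutate loop described above with toy mutations. The real EvalPlus generator uses type-aware mutations and LLM-generated seeds; the input contract and mutations here are invented for illustration.

```python
# Hedged sketch of EvalPlus-style test-input generation: pick a seed at random, mutate
# it, and keep the mutant if it is still a valid input for the function under test.
import random

def mutate(inputs: tuple) -> tuple:
    """Apply a small random perturbation to one argument (toy integer/list mutations)."""
    args = list(inputs)
    i = random.randrange(len(args))
    if isinstance(args[i], int):
        args[i] += random.choice([-1, 1, 10, -10])
    elif isinstance(args[i], list):
        args[i] = args[i] + [random.randint(-100, 100)]
    return tuple(args)

def is_valid(inputs: tuple) -> bool:
    """Input contract for the task, e.g. 'k must not exceed the list length'."""
    numbers, k = inputs
    return 0 <= k <= len(numbers)

# Seed pool: in EvalPlus these seeds come from ChatGPT; here they are hard-coded.
seed_pool = [([1, 2, 3], 2), ([5, 5], 1)]

for _ in range(1000):
    mutant = mutate(random.choice(seed_pool))
    if is_valid(mutant):
        seed_pool.append(mutant)

print(f"{len(seed_pool)} generated test inputs")  # far more tests than the original seeds
```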
CrossCodeEval is a multilingual benchmark that tests LLMs' ability to perform cross-file code completion. Unlike popular benchmarks like HumanEval or MBPP, it evaluates models on completing code not just within a single file but across a project — capturing the dependencies and modularity of real-world coding tasks.
The dataset is built on a set of GitHub repositories in four popular programming languages: Python, Java, TypeScript, and C#. The benchmark authors employ static analysis to extract code completion tasks that specifically require cross-file context to solve accurately.
Paper: CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion by Ding et al. (2023)
Dataset: CrossCodeEval dataset
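To see why cross-file context matters, here is a hypothetical completion prompt: without the snippet retrieved from another file, the model cannot know the signature it needs to complete. The file names, functions, and code are invented for illustration, not drawn from the benchmark.

```python
# Hypothetical cross-file completion prompt in the spirit of CrossCodeEval.
cross_file_context = '''# utils/metrics.py (retrieved from another file in the repo)
def weighted_f1(preds, labels, weights):
    """Return the class-weighted F1 score."""
    ...
'''

in_file_prefix = '''# train.py (the file being edited)
from utils.metrics import weighted_f1

def evaluate(model, loader, class_weights):
    preds, labels = run_inference(model, loader)
    score = '''

# Expected next tokens: weighted_f1(preds, labels, class_weights)
# Without cross_file_context, the model has to guess the function's arguments.
prompt = cross_file_context + "\n" + in_file_prefix
print(prompt)
```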
RepoBench evaluates LLMs on repository-level code auto-completion tasks. It consists of three interconnected evaluation tasks:
RepoBench-R (Retrieval): retrieving the most relevant code snippets from other files in the repository.
RepoBench-C (Code completion): predicting the next line of code given in-file context and cross-file snippets.
RepoBench-P (Pipeline): combining retrieval and completion to solve the task end to end.
The tasks are derived from GitHub repositories and reflect real-world programming challenges where understanding and integrating information across multiple files is essential. The benchmark supports both Python and Java.
Paper: RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems by Liu et al. (2023)
Dataset: RepoBench-R, RepoBench-C, RepoBench-P
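Here is a minimal sketch of how the retrieval and pipeline settings fit together, using token-overlap similarity as a stand-in retriever. The snippets, scoring, and prompt assembly are illustrative only and do not reflect RepoBench's actual implementation.

```python
# Hedged sketch of the retrieval step: score candidate cross-file snippets against the
# unfinished in-file code and prepend the best one to the completion prompt.
import re

def tokens(code: str) -> set:
    """Crude identifier tokenizer; a real retriever would be far more sophisticated."""
    return set(re.findall(r"[A-Za-z_]+", code))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Unfinished in-file code: completing the next line requires build_encoder's signature.
in_file_code = "from models import build_encoder\n\nencoder = build_encoder("

# Candidate snippets from other files in the repository (invented for illustration).
candidate_snippets = [
    "def build_encoder(hidden_size, dropout):\n    return TransformerEncoder(hidden_size, dropout)",
    "def save_checkpoint(model, path):\n    torch.save(model.state_dict(), path)",
    "def load_yaml(path):\n    return yaml.safe_load(open(path))",
]

# Retrieval: pick the cross-file snippet most relevant to the code being written.
retrieved = max(candidate_snippets, key=lambda s: jaccard(in_file_code, s))

# Pipeline: prepend the retrieved snippet and ask the model to predict the next tokens;
# the completion step on its own is what the completion task measures.
prompt = retrieved + "\n\n" + in_file_code
print(retrieved.splitlines()[0])  # def build_encoder(hidden_size, dropout):
```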
Code Lingua evaluates LLMs on programming language translation. It compares models' abilities to understand what the code implements in the source language and to translate the same semantics into the target language. For example, a model might convert a function from Java to Python or from C++ to Go. The benchmark also tracks bugs introduced or fixed during translation, assessing semantic fidelity and robustness.
Code Lingua incorporates some commonly used datasets like CodeNet and Avatar and consists of 1700 code samples in five languages – C, C++, Go, Java, and Python – with more than 10,000 tests, over 43,000 translations, 1748 bug labels, and 1365 bug-fix pairs.
Paper: Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code by Pan et al. (2023)
Dataset: Code Lingua dataset
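Below is a simplified sketch of how a translation can be checked for semantic fidelity: execute the model's target-language output against tests that encode the source program's behavior. The Java snippet, its Python translation, and the tests are invented examples, not part of the Code Lingua dataset.

```python
# Hedged sketch of translation checking: the translated code must reproduce the
# source program's input/output behavior, or the mismatch is labeled as a bug.
java_source = """
public static int gcd(int a, int b) {
    return b == 0 ? a : gcd(b, a % b);
}
"""

# Candidate Python translation produced by the LLM under evaluation (hypothetical).
translated_python = """
def gcd(a, b):
    return a if b == 0 else gcd(b, a % b)
"""

def preserves_semantics(translation: str) -> bool:
    """Run behavioral tests derived from the source program against the translation."""
    namespace = {}
    try:
        exec(translation, namespace)
        gcd = namespace["gcd"]
        return gcd(12, 18) == 6 and gcd(7, 0) == 7 and gcd(0, 5) == 5
    except Exception:
        return False  # e.g., a lost edge case or wrong operator introduced in translation

print(preserves_semantics(translated_python))  # True
```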
While benchmarks help compare models, they rarely reflect the specifics of your AI application. To better fit the scope of your use case – be it a coding copilot or a developer productivity app – you need custom evaluations.
That’s why we built Evidently. Our open-source library, with over 25 million downloads, simplifies testing and evaluating LLM-powered applications with built-in evaluation templates and metrics.
For teams working on complex, mission-critical AI systems, Evidently Cloud provides a platform to collaboratively test and monitor AI quality. You can generate synthetic data, create evaluation scenarios, run tests, and track performance — all in one place.
Ready to test your AI system? Sign up for free or schedule a demo to see Evidently Cloud in action. We're here to help you build with confidence!