A comprehensive collection of datasets used for evaluating Large Language Models (LLMs). This curated list includes benchmarks for language understanding, reasoning, mathematics, coding, and multimodal capabilities.
Evaluating Large Language Models (LLMs) requires carefully selected datasets that assess different aspects of model performance. The table below covers the most widely used evaluation datasets, from fundamental language understanding benchmarks like GLUE and SuperGLUE to specialized assessments for mathematics (GSM8K), coding (HumanEval), and multimodal capabilities (VQAv2).
Each dataset in this collection was selected based on its significance in the field, citation impact, and use in evaluating major language models such as GPT-4, Claude, PaLM, and Gemini. The table provides essential information about each dataset, including its primary category, a brief description, source links, and the prominent models commonly evaluated on it.
Dataset | Category | Description | Source | Common Models |
---|---|---|---|---|
GLUE | Language Understanding | General Language Understanding Evaluation benchmark containing 9 tasks: CoLA, SST-2, MRPC, QQP, STS-B, MNLI, QNLI, RTE, and WNLI. | Website / Paper | BERT, RoBERTa, T5, GPT variants |
SuperGLUE | Language Understanding | More challenging successor to GLUE with harder tasks requiring more sophisticated language understanding and reasoning. | Website / Paper | T5, GPT-3, PaLM, Claude |
MMLU | Language Understanding | Massive Multitask Language Understanding - tests knowledge across 57 subjects including STEM, humanities, and more. Multiple-choice format with expert-level questions. | GitHub / Paper | GPT-4, Claude 2, PaLM 2, Gemini |
BIG-bench | Language Understanding | Beyond the Imitation Game benchmark - collection of 204 tasks testing capabilities such as reasoning, knowledge, and social understanding. | GitHub / Paper | PaLM, GPT-4, Claude, Gemini |
HellaSwag | Language Understanding | Commonsense inference dataset with roughly 70K multiple-choice questions about grounded, everyday situations. | GitHub / Paper | GPT-4, Claude 2, PaLM 2 |
GSM8K | Math & Reasoning | Grade School Math 8K - 8.5K high-quality, linguistically diverse grade school math word problems, each taking between 2 and 8 steps to solve. | GitHub / Paper | GPT-4, Claude, PaLM |
MATH | Math & Reasoning | 12.5K high school competition mathematics problems covering algebra, geometry, probability, and more. | GitHub / Paper | GPT-4, Minerva, Claude |
MathQA | Math & Reasoning | Large-scale dataset of 37K mathematics word problems with step-by-step solutions and multiple-choice answers. | Website / Paper | GPT-4, Claude 2, Gemini Pro |
HumanEval | Coding | 164 handwritten Python programming problems with unit tests for evaluating code generation capabilities. | GitHub / Paper | GPT-4, Claude 2, Code Llama |
MBPP | Coding | Mostly Basic Programming Problems - 974 Python programming tasks with test cases. | GitHub / Paper | GPT-4, Claude, StarCoder |
CodeContests | Coding | Collection of competitive programming problems drawn from online contest platforms. | GitHub / Paper | AlphaCode, GPT-4, Claude |
VQAv2 | Multimodal | Visual Question Answering dataset with 265K images and 1.1M questions requiring understanding of vision, language, and commonsense knowledge. | Website / Paper | GPT-4V, Claude 3, Gemini |
MMMU | Multimodal | Massive Multi-discipline Multimodal Understanding benchmark covering 183 subfields across 30 subjects, spanning disciplines including science, engineering, and the humanities. | Website / Paper | GPT-4V, Claude 3, Gemini |
MathVista | Multimodal | Visual mathematical reasoning dataset with diverse problem types, including geometry, charts and graphs, and scientific diagrams. | Website / Paper | GPT-4V, Claude 3, Gemini |
TruthfulQA | Safety & Adversarial | Tests whether models avoid generating false or misleading answers to questions designed to elicit common misconceptions. | GitHub / Paper | GPT-4, Claude 2, PaLM 2 |
Anthropic Red Team | Safety & Adversarial | Collection of adversarial prompts from human red-teaming, used to test model safety and alignment. | GitHub / Paper | Claude, GPT-4, PaLM |
Toxigen | Safety & Adversarial | Dataset for testing and measuring implicit toxicity in language models, with human-validated examples. | GitHub / Paper | GPT-4, Claude 2, PaLM 2 |
MT-Bench | Dialogue | Benchmark of 80 challenging multi-turn conversations designed to evaluate chat models. | GitHub / Paper | GPT-4, Claude 2, PaLM 2 |
Anthropic HH | Dialogue | Helpful and Harmless dataset testing both beneficial behavior and safety constraints in dialogue. | GitHub / Paper | Claude, GPT-4, PaLM 2 |
NaturalQuestions | Knowledge-heavy | Real Google search queries with answers drawn from Wikipedia, testing both short- and long-form question answering. | GitHub / Paper | GPT-4, PaLM 2, Claude |
TriviaQA | Knowledge-heavy | Large-scale dataset with over 650K question-answer-evidence triples drawn from trivia questions. | GitHub / Paper | GPT-4, Claude 2, PaLM 2 |
Self-Instruct | Instruction Following | 52K instruction-following examples generated by GPT-3, covering diverse tasks and formats. | GitHub / Paper | GPT-4, Claude 2, PaLM 2 |
Alpaca | Instruction Following | 52K instruction-following examples generated with the Self-Instruct method, used for fine-tuning smaller models. | GitHub | LLaMA, Alpaca, GPT-4 |
FLAN | Instruction Following | Large collection of tasks converted to instruction format, used for instruction tuning. | GitHub / Paper | PaLM 2, FLAN-T5, GPT-4 |
The datasets listed above serve different evaluation purposes:

- Language Understanding (GLUE, SuperGLUE, MMLU, BIG-bench, HellaSwag) probes breadth of knowledge and commonsense inference, mostly in multiple-choice form.
- Math & Reasoning (GSM8K, MATH, MathQA) measures multi-step problem solving, typically scored by exact match on the final answer.
- Coding (HumanEval, MBPP, CodeContests) checks functional correctness of generated code, usually reported as pass@k; a sketch of the standard estimator follows this list.
- Multimodal (VQAv2, MMMU, MathVista) tests joint vision-language understanding and reasoning.
- Safety & Adversarial (TruthfulQA, Anthropic Red Team, Toxigen) targets truthfulness, toxicity, and robustness to adversarial prompts.
- Dialogue (MT-Bench, Anthropic HH) evaluates multi-turn conversational quality, helpfulness, and harmlessness.
- Knowledge-heavy (NaturalQuestions, TriviaQA) stresses factual recall and open-domain question answering.
- Instruction Following (Self-Instruct, Alpaca, FLAN) assesses how well models follow diverse natural-language instructions.
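For the coding benchmarks, pass@k is the probability that at least one of k sampled completions passes all unit tests for a problem. Below is a minimal sketch of the unbiased estimator introduced with HumanEval, assuming you have already generated `n` completions per problem and counted how many pass (`c`); the function name and the example pass counts are illustrative, not part of any particular harness.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem (HumanEval-style evaluation).

    n: total completions sampled for the problem
    c: completions that passed all unit tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing completion.
        return 1.0
    # Computes 1 - C(n - c, k) / C(n, k) as a numerically stable running product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative per-problem pass counts from 20 samples each; the benchmark score
# is the mean of the per-problem estimates.
per_problem_pass_counts = [0, 3, 20, 1]
score = np.mean([pass_at_k(n=20, c=c, k=1) for c in per_problem_pass_counts])
print(f"pass@1 = {score:.3f}")
```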
When selecting evaluation datasets for your LLM application, consider factors such as task relevance, dataset size, quality of annotations, and coverage of edge cases. The right combination of evaluation datasets can provide a comprehensive assessment of your model's capabilities and limitations.
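As a concrete starting point, the sketch below shows one way to wire a small evaluation loop with the Hugging Face `datasets` library, using GSM8K's exact-match scoring as an example. It is a minimal sketch, not a full harness: it assumes a `generate(prompt) -> str` callable for your model, assumes the model is prompted to end its answer with a `#### <number>` line in the style of GSM8K's reference solutions, and uses the hub id `openai/gsm8k`; real harnesses add few-shot prompting and more careful answer normalization.

```python
import re
from datasets import load_dataset  # pip install datasets

# GSM8K reference solutions end with a line like "#### 42".
ANSWER_RE = re.compile(r"####\s*(-?[\d,\.]+)")

def extract_final_answer(text):
    """Pull the final numeric answer out of a GSM8K-style solution string."""
    match = ANSWER_RE.search(text)
    return match.group(1).replace(",", "") if match else None

def evaluate_gsm8k(generate, limit=None):
    """Exact-match accuracy on the GSM8K test split.

    `generate` is your model call (prompt string -> completion string); it should
    be prompted to finish with a "#### <answer>" line so extraction works.
    """
    test_set = load_dataset("openai/gsm8k", "main", split="test")
    if limit is not None:
        test_set = test_set.select(range(limit))

    correct = 0
    for example in test_set:
        prediction = extract_final_answer(generate(example["question"]))
        reference = extract_final_answer(example["answer"])
        correct += int(prediction is not None and prediction == reference)
    return correct / len(test_set)
```

Swapping in a different benchmark is mostly a matter of changing the loading and scoring functions; the selection criteria above determine which combination of datasets is worth running.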