LLM Evaluation Datasets

A comprehensive collection of datasets used for evaluating Large Language Models (LLMs). This curated list includes benchmarks for language understanding, reasoning, mathematics, coding, and multimodal capabilities.

Evaluating LLMs requires carefully selected datasets that can assess different aspects of model performance. The table below covers the most widely used evaluation datasets, from fundamental language understanding benchmarks like GLUE and SuperGLUE to specialized assessments for mathematics (GSM8K), coding (HumanEval), and multimodal capabilities (VQAv2).

Each dataset in this collection was chosen for its significance in the field, its citation impact, and its adoption by major language models such as GPT-4, Claude, PaLM, and Gemini. The table lists each dataset's primary category, a brief description, source links, and the prominent models that commonly report results on it.

Infographic: Common Datasets for Evaluating LLMs
| Dataset | Category | Description | Source | Common Models |
|---|---|---|---|---|
| GLUE | Language Understanding | General Language Understanding Evaluation benchmark comprising 9 tasks: CoLA, SST-2, MRPC, QQP, STS-B, MNLI, QNLI, RTE, and WNLI. | | BERT, RoBERTa, T5, GPT variants |
| SuperGLUE | Language Understanding | More challenging successor to GLUE, with harder tasks requiring more sophisticated language understanding and reasoning. | | T5, GPT-3, PaLM, Claude |
| MMLU | Language Understanding | Massive Multitask Language Understanding: multiple-choice questions testing knowledge across 57 subjects spanning STEM, the humanities, and more, up to expert level. | | GPT-4, Claude 2, PaLM 2, Gemini |
| BIG-bench | Language Understanding | Beyond the Imitation Game benchmark: a collection of 204 tasks testing capabilities such as reasoning, knowledge, and social understanding. | | PaLM, GPT-4, Claude, Gemini |
| HellaSwag | Language Understanding | Commonsense inference dataset with roughly 70K multiple-choice questions about grounded, everyday situations. | | GPT-4, Claude 2, PaLM 2 |
| GSM8K | Math & Reasoning | Grade School Math 8K: 8.5K high-quality, linguistically diverse grade school math word problems, each requiring 2 to 8 reasoning steps to solve. | | GPT-4, Claude, PaLM |
| MATH | Math & Reasoning | 12.5K high school competition mathematics problems covering algebra, geometry, probability, and more. | | GPT-4, Minerva, Claude |
| MathQA | Math & Reasoning | Large-scale dataset of 37K mathematics word problems with step-by-step solutions and multiple-choice answers. | | GPT-4, Claude 2, Gemini Pro |
| HumanEval | Coding | 164 handwritten Python programming problems for evaluating code generation capabilities. | | GPT-4, Claude 2, Code Llama |
| MBPP | Coding | Mostly Basic Programming Problems: 974 Python programming tasks with test cases. | | GPT-4, Claude, StarCoder |
| CodeContests | Coding | Collection of competitive programming problems drawn from various online contests. | | AlphaCode, GPT-4, Claude |
| VQAv2 | Multimodal | Visual Question Answering dataset with 265K images and over 1.1M questions requiring understanding of vision, language, and commonsense knowledge. | | GPT-4V, Claude 3, Gemini |
| MMMU | Multimodal | Massive Multi-discipline Multimodal Understanding benchmark with college-level questions spanning 183 subfields across disciplines including science, engineering, and the humanities. | | GPT-4V, Claude 3, Gemini |
| MathVista | Multimodal | Visual mathematics reasoning dataset with diverse problem types including geometry, graphs, and scientific diagrams. | | GPT-4V, Claude 3, Gemini |
| TruthfulQA | Safety & Adversarial | Tests models' ability to identify and avoid generating false or misleading information. | | GPT-4, Claude 2, PaLM 2 |
| Anthropic Red Team | Safety & Adversarial | Collection of adversarial prompts testing model safety and alignment. | | Claude, GPT-4, PaLM |
| ToxiGen | Safety & Adversarial | Dataset for testing and measuring implicit toxicity in language models, with human-validated examples. | | GPT-4, Claude 2, PaLM 2 |
| MT-Bench | Dialogue | Benchmark of 80 challenging multi-turn conversations designed to evaluate chat models. | | GPT-4, Claude 2, PaLM 2 |
| Anthropic HH | Dialogue | Helpful and Harmless benchmark testing both beneficial behavior and safety constraints in dialogue. | | Claude, GPT-4, PaLM 2 |
| Natural Questions | Knowledge-heavy | Real Google search queries with answers drawn from Wikipedia, testing both short- and long-form question answering. | | GPT-4, PaLM 2, Claude |
| TriviaQA | Knowledge-heavy | Large-scale dataset with over 650K question-answer-evidence triples derived from trivia questions. | | GPT-4, Claude 2, PaLM 2 |
| Self-Instruct | Instruction Following | 52K instruction-following examples generated with GPT-3, covering diverse tasks and formats. | | GPT-4, Claude 2, PaLM 2 |
| Alpaca | Instruction Following | 52K instruction-following examples built with the Self-Instruct method, used for fine-tuning smaller models. | | LLaMA, Alpaca, GPT-4 |
| FLAN | Instruction Following | Large collection of tasks converted to instruction format, used for instruction tuning. | | PaLM 2, FLAN-T5, GPT-4 |
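
Many of these benchmarks are distributed through the Hugging Face Hub, so a quick way to inspect them is the `datasets` library. The snippet below is a minimal sketch, assuming the commonly used Hub identifiers `gsm8k` (config `main`) and `openai_humaneval`; field names can vary between dataset versions, so check the dataset card before relying on them.

```python
# Minimal sketch: loading two of the benchmarks above with the Hugging Face
# `datasets` library. Dataset IDs and field names are assumptions based on
# the commonly published Hub versions.
from datasets import load_dataset

# GSM8K: grade school math word problems with worked, step-by-step solutions.
gsm8k = load_dataset("gsm8k", "main", split="test")
print(gsm8k[0]["question"])   # natural-language word problem
print(gsm8k[0]["answer"])     # reasoning steps ending in a "#### <answer>" line

# HumanEval: 164 Python programming problems with unit tests.
humaneval = load_dataset("openai_humaneval", split="test")
print(humaneval[0]["prompt"])  # function signature and docstring to complete
print(humaneval[0]["test"])    # unit tests used for pass@k scoring
```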

Understanding LLM Evaluation Datasets

The datasets listed above serve different evaluation purposes:

- Language Understanding (GLUE, SuperGLUE, MMLU, BIG-bench, HellaSwag): general comprehension, broad knowledge, and commonsense inference.
- Math & Reasoning (GSM8K, MATH, MathQA): multi-step quantitative problem solving.
- Coding (HumanEval, MBPP, CodeContests): program synthesis checked against test cases.
- Multimodal (VQAv2, MMMU, MathVista): joint vision-and-language understanding and reasoning.
- Safety & Adversarial (TruthfulQA, Anthropic Red Team, ToxiGen): truthfulness, toxicity, and robustness to adversarial prompts.
- Dialogue (MT-Bench, Anthropic HH): multi-turn chat quality, helpfulness, and harmlessness.
- Knowledge-heavy (Natural Questions, TriviaQA): factual recall and open-domain question answering.
- Instruction Following (Self-Instruct, Alpaca, FLAN): how well models follow diverse natural-language instructions.

When selecting evaluation datasets for your LLM application, consider factors such as task relevance, dataset size, quality of annotations, and coverage of edge cases. The right combination of evaluation datasets can provide a comprehensive assessment of your model's capabilities and limitations.
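
As a concrete illustration of how such an assessment might be wired together, here is a minimal sketch of an exact-match accuracy harness for GSM8K. The `generate_answer` callable is a hypothetical stand-in for whatever model or API you are evaluating; the only GSM8K-specific assumption is that reference solutions end with a `#### <answer>` line, as in the published dataset.

```python
import re
from datasets import load_dataset

def extract_final_answer(solution: str) -> str:
    """GSM8K reference solutions end with '#### <answer>'; pull out that number."""
    match = re.search(r"####\s*(-?[\d,\.]+)", solution)
    return match.group(1).replace(",", "") if match else ""

def evaluate_gsm8k(generate_answer, num_examples: int = 100) -> float:
    """Exact-match accuracy of `generate_answer` (a hypothetical callable mapping
    a question string to a numeric answer string) on GSM8K test items."""
    data = load_dataset("gsm8k", "main", split="test").select(range(num_examples))
    correct = 0
    for example in data:
        reference = extract_final_answer(example["answer"])
        prediction = generate_answer(example["question"]).strip().replace(",", "")
        correct += int(prediction == reference)
    return correct / len(data)

# Usage with a trivial baseline that always answers "42":
# accuracy = evaluate_gsm8k(lambda question: "42")
# print(f"Exact-match accuracy: {accuracy:.2%}")
```

The same pattern generalizes to most of the text benchmarks above: load the split, map each example through your model, and score predictions with the metric the benchmark's authors specify (accuracy, pass@k, F1, or judge-based scoring).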