LLM Evaluation Datasets

A comprehensive collection of datasets used for evaluating Large Language Models (LLMs). This curated list includes benchmarks for language understanding, reasoning, mathematics, coding, and multimodal capabilities.

Evaluating LLMs requires carefully selected datasets that can assess different aspects of model performance. The table below covers the most widely used evaluation datasets, from fundamental language understanding benchmarks like GLUE and SuperGLUE to specialized assessments for mathematics (GSM8K), coding (HumanEval), and multimodal capabilities (VQAv2).

Each dataset in this collection was chosen for its significance in the field, its citation impact, and its adoption by major language models such as GPT-4, Claude, PaLM, and Gemini. The table lists each dataset's primary category, a brief description, source links, and the prominent models that commonly report results on it.

Infographic: Common Datasets for Evaluating LLMs
| Dataset | Category | Description | Source | Common Models |
|---|---|---|---|---|
| GLUE | Language Understanding | General Language Understanding Evaluation benchmark comprising 9 tasks: CoLA, SST-2, MRPC, QQP, STS-B, MNLI, QNLI, RTE, and WNLI. | | BERT, RoBERTa, T5, GPT variants |
| SuperGLUE | Language Understanding | More challenging successor to GLUE, with harder tasks requiring more sophisticated language understanding and reasoning. | | T5, GPT-3, PaLM, Claude |
| MMLU | Language Understanding | Massive Multitask Language Understanding: multiple-choice questions testing knowledge across 57 subjects spanning STEM, the humanities, and more, up to expert level. | | GPT-4, Claude 2, PaLM 2, Gemini |
| BIG-bench | Language Understanding | Beyond the Imitation Game benchmark: a collection of 204 tasks testing capabilities such as reasoning, knowledge, and social understanding. | | PaLM, GPT-4, Claude, Gemini |
| HellaSwag | Language Understanding | Commonsense inference dataset with roughly 70K multiple-choice questions about grounded, everyday situations. | | GPT-4, Claude 2, PaLM 2 |
| GSM8K | Math & Reasoning | Grade School Math 8K: 8.5K high-quality, linguistically diverse grade school math word problems, each requiring 2 to 8 reasoning steps to solve. | | GPT-4, Claude, PaLM |
| MATH | Math & Reasoning | 12.5K high school competition mathematics problems covering algebra, geometry, probability, and more. | | GPT-4, Minerva, Claude |
| MathQA | Math & Reasoning | Large-scale dataset of 37K mathematics word problems with step-by-step solutions and multiple-choice answers. | | GPT-4, Claude 2, Gemini Pro |
| HumanEval | Coding | 164 handwritten Python programming problems for evaluating code generation capabilities. | | GPT-4, Claude 2, Code Llama |
| MBPP | Coding | Mostly Basic Programming Problems: 974 Python programming tasks with test cases. | | GPT-4, Claude, StarCoder |
| CodeContests | Coding | Collection of competitive programming problems drawn from various online contests. | | AlphaCode, GPT-4, Claude |
| VQAv2 | Multimodal | Visual Question Answering dataset with 265K images and over 1.1M questions requiring understanding of vision, language, and commonsense knowledge. | | GPT-4V, Claude 3, Gemini |
| MMMU | Multimodal | Massive Multi-discipline Multimodal Understanding benchmark with college-level questions spanning 183 subfields across disciplines including science, engineering, and the humanities. | | GPT-4V, Claude 3, Gemini |
| MathVista | Multimodal | Visual mathematics reasoning dataset with diverse problem types including geometry, graphs, and scientific diagrams. | | GPT-4V, Claude 3, Gemini |
| TruthfulQA | Safety & Adversarial | Tests models' ability to identify and avoid generating false or misleading information. | | GPT-4, Claude 2, PaLM 2 |
| Anthropic Red Team | Safety & Adversarial | Collection of adversarial prompts testing model safety and alignment. | | Claude, GPT-4, PaLM |
| ToxiGen | Safety & Adversarial | Dataset for testing and measuring implicit toxicity in language models, with human-validated examples. | | GPT-4, Claude 2, PaLM 2 |
| MT-Bench | Dialogue | Benchmark of 80 challenging multi-turn conversations designed to evaluate chat models. | | GPT-4, Claude 2, PaLM 2 |
| Anthropic HH | Dialogue | Helpful and Harmless benchmark testing both beneficial behavior and safety constraints in dialogue. | | Claude, GPT-4, PaLM 2 |
| Natural Questions | Knowledge-heavy | Real Google search queries with answers drawn from Wikipedia, testing both short- and long-form question answering. | | GPT-4, PaLM 2, Claude |
| TriviaQA | Knowledge-heavy | Large-scale dataset with over 650K question-answer-evidence triples derived from trivia questions. | | GPT-4, Claude 2, PaLM 2 |
| Self-Instruct | Instruction Following | 52K instruction-following examples generated with GPT-3, covering diverse tasks and formats. | | GPT-4, Claude 2, PaLM 2 |
| Alpaca | Instruction Following | 52K instruction-following examples built with the Self-Instruct method, used for fine-tuning smaller models. | | LLaMA, Alpaca, GPT-4 |
| FLAN | Instruction Following | Large collection of tasks converted to instruction format, used for instruction tuning. | | PaLM 2, FLAN-T5, GPT-4 |
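
Many of these benchmarks are distributed through the Hugging Face Hub, so a quick way to inspect them is the `datasets` library. The snippet below is a minimal sketch, assuming the commonly used Hub identifiers `gsm8k` (config `main`) and `openai_humaneval`; field names can vary between dataset versions, so check the dataset card before relying on them.

```python
# Minimal sketch: loading two of the benchmarks above with the Hugging Face
# `datasets` library. Dataset IDs and field names are assumptions based on
# the commonly published Hub versions.
from datasets import load_dataset

# GSM8K: grade school math word problems with worked, step-by-step solutions.
gsm8k = load_dataset("gsm8k", "main", split="test")
print(gsm8k[0]["question"])   # natural-language word problem
print(gsm8k[0]["answer"])     # reasoning steps ending in a "#### <answer>" line

# HumanEval: 164 Python programming problems with unit tests.
humaneval = load_dataset("openai_humaneval", split="test")
print(humaneval[0]["prompt"])  # function signature and docstring to complete
print(humaneval[0]["test"])    # unit tests used for pass@k scoring
```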

Understanding LLM Evaluation Datasets

The datasets listed above serve different evaluation purposes:

- Language Understanding (GLUE, SuperGLUE, MMLU, BIG-bench, HellaSwag): general comprehension, broad knowledge, and commonsense inference.
- Math & Reasoning (GSM8K, MATH, MathQA): multi-step quantitative problem solving.
- Coding (HumanEval, MBPP, CodeContests): program synthesis checked against test cases.
- Multimodal (VQAv2, MMMU, MathVista): joint vision-and-language understanding and reasoning.
- Safety & Adversarial (TruthfulQA, Anthropic Red Team, ToxiGen): truthfulness, toxicity, and robustness to adversarial prompts.
- Dialogue (MT-Bench, Anthropic HH): multi-turn chat quality, helpfulness, and harmlessness.
- Knowledge-heavy (Natural Questions, TriviaQA): factual recall and open-domain question answering.
- Instruction Following (Self-Instruct, Alpaca, FLAN): how well models follow diverse natural-language instructions.

When selecting evaluation datasets for your LLM application, consider factors such as task relevance, dataset size, quality of annotations, and coverage of edge cases. The right combination of evaluation datasets can provide a comprehensive assessment of your model's capabilities and limitations.
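
As a concrete illustration of how such an assessment might be wired together, here is a minimal sketch of an exact-match accuracy harness for GSM8K. The `generate_answer` callable is a hypothetical stand-in for whatever model or API you are evaluating; the only GSM8K-specific assumption is that reference solutions end with a `#### <answer>` line, as in the published dataset.

```python
import re
from datasets import load_dataset

def extract_final_answer(solution: str) -> str:
    """GSM8K reference solutions end with '#### <answer>'; pull out that number."""
    match = re.search(r"####\s*(-?[\d,\.]+)", solution)
    return match.group(1).replace(",", "") if match else ""

def evaluate_gsm8k(generate_answer, num_examples: int = 100) -> float:
    """Exact-match accuracy of `generate_answer` (a hypothetical callable mapping
    a question string to a numeric answer string) on GSM8K test items."""
    data = load_dataset("gsm8k", "main", split="test").select(range(num_examples))
    correct = 0
    for example in data:
        reference = extract_final_answer(example["answer"])
        prediction = generate_answer(example["question"]).strip().replace(",", "")
        correct += int(prediction == reference)
    return correct / len(data)

# Usage with a trivial baseline that always answers "42":
# accuracy = evaluate_gsm8k(lambda question: "42")
# print(f"Exact-match accuracy: {accuracy:.2%}")
```

The same pattern generalizes to most of the text benchmarks above: load the split, map each example through your model, and score predictions with the metric the benchmark's authors specify (accuracy, pass@k, F1, or judge-based scoring).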