LLM benchmarks

What are LLM benchmarks?

Benchmarks are curated sets of tasks designed to measure LLM capabilities in a consistent, repeatable way. They often include:

  • Multiple-choice questions
  • Free-form answers to complex prompts
  • Code generation tasks
  • Ethical or factuality stress tests

These benchmarks are used in academic research, competitive leaderboards, and internal product evaluations.
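
As a concrete illustration, scoring a multiple-choice benchmark item boils down to formatting the question, collecting the model's answer, and comparing it against a gold label. The sketch below is a minimal, generic example rather than any benchmark's official harness; query_model is a hypothetical placeholder for whatever LLM API you call.

    # Minimal sketch of multiple-choice benchmark scoring.
    # query_model is a hypothetical placeholder for an LLM API call.
    def query_model(prompt: str) -> str:
        raise NotImplementedError  # e.g., call your provider's completions API here

    def format_item(question: str, choices: list[str]) -> str:
        letters = "ABCD"
        options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
        return f"{question}\n{options}\nAnswer with a single letter:"

    def accuracy(items: list[dict]) -> float:
        correct = 0
        for item in items:
            prediction = query_model(format_item(item["question"], item["choices"]))
            correct += prediction.strip().upper()[:1] == item["answer"]
        return correct / len(items)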

Why it matters in AI/ML

Benchmarks help:

  • Compare models from different providers (e.g., OpenAI, Anthropic, Cohere)
  • Track progress in model capabilities
  • Identify gaps in reasoning, safety, or factuality

They offer a shared language for researchers, builders, and buyers to evaluate model strengths and weaknesses.

Common LLM benchmarks

  • MMLU (Massive Multitask Language Understanding): Tests knowledge and reasoning across 57 academic and professional subjects
  • TruthfulQA: Evaluates whether models avoid reproducing common misconceptions and falsehoods
  • HellaSwag: Measures commonsense reasoning through sentence-completion plausibility
  • BIG-bench (Beyond the Imitation Game benchmark): Collaborative benchmark with 200+ tasks
  • GSM8K: Grade-school math word problems that require multi-step reasoning
  • ARC (AI2 Reasoning Challenge): Grade-school science questions from standardized tests, split into Easy and Challenge sets
  • HumanEval and related code-eval suites: Evaluate code generation by running model-written functions against unit tests, typically reported as pass@k (see the sketch after this list)
  • HELM (Holistic evaluation of language models): Framework for comparing models across multiple dimensions like fairness, robustness, and toxicity
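
For code benchmarks like HumanEval, scores are usually reported as pass@k: the probability that at least one of k sampled completions passes all unit tests. The sketch below implements the commonly used unbiased estimator, given n samples per problem of which c pass; the numbers in the usage comments are illustrative, not published results.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate: n completions sampled per problem, c of them pass."""
        if n - c < k:
            return 1.0  # every size-k subset contains at least one passing completion
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Illustrative usage (made-up numbers): 200 samples per problem, 37 pass the tests.
    print(pass_at_k(n=200, c=37, k=1))   # 0.185
    print(pass_at_k(n=200, c=37, k=10))  # higher: any one of 10 samples may pass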

Limitations

  • Benchmarks may not reflect real-world use cases
  • Many models overfit to public benchmarks, especially when test items leak into training data (contamination)
  • Results vary with prompt formatting, decoding parameters, and answer-parsing rules, as illustrated in the sketch after this list
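
As a rough illustration of the last point, the very same model output can be scored as right or wrong depending on how the answer is parsed. The snippet below uses a made-up output string, not results from any real evaluation.

    import re

    raw_output = "The answer is (B) because the other options contradict the premise."
    gold = "B"

    # Strict parsing: expect the reply to start with a single letter.
    strict = raw_output.strip().upper()[:1]        # "T" -> scored as wrong

    # Lenient parsing: accept a letter in parentheses anywhere in the reply.
    match = re.search(r"\(([A-D])\)", raw_output)
    lenient = match.group(1) if match else None    # "B" -> scored as correct

    print(strict == gold, lenient == gold)         # False True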

Related

Use benchmarks alongside task-specific evaluation to get a full picture of LLM performance.
