This is a linkpost for https://arxiv.org/abs/2009.03300

Background Note: Benchmark Study is a blog post series in which I record and study benchmark papers. I am developing a new LLM evaluation framework that offers more flexibility than EleutherAI's LM Harness, and for the initial release I'm only adding benchmarks that I've studied. Each study note is meant to be read within 10 minutes. I receive GPT assistance here and there while writing these posts. I'm sharing the notes publicly partly to keep myself going and partly to help anyone who hasn't read the paper yet.

@misc{hendrycks2021measuring,
     title={Measuring Massive Multitask Language Understanding}, 
     author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
     year={2021},
     eprint={2009.03300},
     archivePrefix={arXiv},
     primaryClass={cs.CY}
}

TL;DR

  • Human performance on MMLU: unspecialized humans score 34.5%, specialized (expert-level) humans score 89.8%.
  • Procedural vs. Declarative Knowledge: GPT-3 acquires declarative knowledge more readily than procedural knowledge, with less accuracy in calculation-heavy STEM tasks.
  • Also, models exhibit poor performance in tasks requiring human value judgments and procedural knowledge, like Professional Law and Moral Scenarios.
  • Methodological Shift: Suggests models should be trained more like humans, learning from reading and listening rather than relying solely on large question banks. (reminds me of my recent paper)
  • GPT-3 Calibration Findings: GPT-3 is found to be uncalibrated, with its confidence often poorly reflecting actual accuracy, especially under zero-shot settings.

Snapshot Preview

https://huggingface.co/datasets/brucewlee1/mmlu-college-biology/viewer/default/validation
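For a quick look at what is behind that preview, here is a minimal sketch using the Hugging Face datasets library. The split name and the usual MMLU columns (question, choices, answer) are assumptions about this particular mirror.

```python
# Minimal sketch: load the same split shown in the viewer link above.
# Assumes the Hugging Face `datasets` library is installed; the column names
# (question / choices / answer) follow the usual MMLU layout and may differ
# slightly in this mirror.
from datasets import load_dataset

ds = load_dataset("brucewlee1/mmlu-college-biology", split="validation")
print(ds)      # row count and column names
print(ds[0])   # one multiple-choice question with its options and answer key
```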

Timeline Note: Everything below is written from the perspective of 2021, when the latest version (at the time of writing) of "Measuring Massive Multitask Language Understanding" was published.


Section: Abstract

  • Introduction of a new test to measure the multitask accuracy of text models.
  • The test encompasses a wide range of 57 tasks, including topics like elementary mathematics, US history, computer science, law, and more.
  • Emphasis on the need for models to have extensive world knowledge and problem-solving abilities to score high on this test.
  • Findings indicate that most recent models perform only slightly better than random chance.
  • The largest GPT-3 model shows a notable improvement, surpassing random chance by almost 20 percentage points on average.
  • Despite these improvements, even the best models fall short of expert-level accuracy in all 57 tasks.
  • Observations of uneven performance across different tasks by the models.
  • Models frequently fail to recognize their own inaccuracies.
  • Notably, poor performance in socially significant subjects like morality and law, with near-random accuracy.
  • The proposed test serves as a tool for comprehensive evaluation of a model's academic and professional understanding, highlighting key areas for improvement.

Section: Introduction

Introduction of a New Benchmark for Language Models

  • Purpose: To bridge the gap between the knowledge seen by models during pretraining and current measures of success.
  • Design: A benchmark covering 57 subjects across various fields, including STEM, humanities, and social sciences, ranging from elementary to advanced professional levels.
  • Feature: Focuses on testing world knowledge and problem-solving abilities in areas such as mathematics, history, law, and ethics.

Analysis of Current NLP Model Performance

  • Observation: Most models, including those with up to 13 billion parameters, achieve only near-random-chance accuracy.
  • GPT-3 Performance: The 175 billion parameter GPT-3 model reaches significantly higher accuracy (43.9%) but lacks expertise in any single subject.
  • Performance Disparity: GPT-3 shows lopsided results, excelling in some areas but performing near-randomly in others, especially in calculation-heavy and human values-related subjects.

Challenges in Modern NLP Models

  • Knowledge Application: Current models struggle to apply knowledge from pretraining effectively.
  • Weak Areas: Low accuracy in subjects like physics, mathematics, law, and morality highlights critical weaknesses.
  • Confidence vs. Accuracy: GPT-3 often misjudges its own knowledge, with confidence levels significantly deviating from actual accuracy.

Significance of the New Benchmark

  • Comprehensive Evaluation: This benchmark evaluates a model’s text understanding across a broad range of topics important for human learning.

Section: A Multitask Test

Creation of a Comprehensive Multitask Test

  • Purpose: To evaluate text models across multiple branches of knowledge.
  • Design: The test includes 57 tasks spanning humanities, social sciences, hard sciences, and other key learning areas.
  • Task Source: Questions were manually collected from various online sources, including GRE and USMLE practice questions and undergraduate courses.

Test Composition and Structure

  • Question Collection: A total of 15,908 questions were gathered.
  • Test Segmentation: The test is divided into a few-shot development set, a validation set, and a main test set (see the loading sketch after this list).
  • Task Difficulty Levels: Tasks are categorized by difficulty levels, such as Elementary, High School, College, or Professional.
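To make the segmentation concrete, here is a minimal sketch that counts questions per split for one subject. It assumes the cais/mmlu mirror on Hugging Face, with one config per subject and dev/validation/test split names; per the paper, the dev split holds the five few-shot exemplars for each subject.

```python
# Minimal sketch of the dev / validation / test segmentation, assuming the
# cais/mmlu mirror on Hugging Face with one config per subject.
from datasets import load_dataset

subject = "college_biology"  # any of the 57 task names
for split in ("dev", "validation", "test"):
    ds = load_dataset("cais/mmlu", subject, split=split)
    print(f"{subject}/{split}: {len(ds)} questions")
```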

Benchmark for Human-Level Accuracy

  • Human Performance Baseline: Unspecialized humans from Amazon Mechanical Turk achieved 34.5% accuracy.
  • Expert Performance Estimation: Expert-level accuracy is approximated at 89.8% based on the 95th percentile accuracy of real-world test takers.

Emphasis on Real-World Text Understanding

  • Goal: To assess how well models extract useful knowledge from massive online corpora.
  • Future Model Application: The test is applicable to both single models and a mixture of expert models.

Focus on Specific Subject Areas

  • Humanities Tasks: Cover qualitative analysis disciplines like law, philosophy, and history, requiring skills like legal reasoning and moral judgment.
  • Social Science Tasks: Include subjects like economics, sociology, and politics, focusing on human behavior and societal dynamics.
  • STEM Tasks: Encompass fields like physics, computer science, and mathematics, focusing on empirical methods and problem-solving abilities.
  • Other Subjects: Include areas like Professional Medicine, finance, and global facts, offering a diverse range of topics outside traditional categories.

Section: Experiments

Experimental Setup and Assessment Methodology

  • Assessment Goal: To measure classification accuracy across various tasks in the multitask test.
  • Models Evaluated: Includes GPT-3 (with its four variants: Small, Medium, Large, X-Large) and UnifiedQA, along with RoBERTa-base, ALBERT-xxlarge, and GPT-2.
  • Evaluation Process: Used the OpenAI API for GPT-3; UnifiedQA, already trained on other QA datasets, was evaluated without further tuning to measure transfer accuracy (a sketch of the prompt format follows this list).
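As a rough illustration of the evaluation process, the sketch below builds a few-shot prompt in the paper's format (a subject header, the dev exemplars, then the unanswered test question) and picks whichever answer letter the model scores highest. letter_logprobs is a hypothetical stand-in for whatever API or local model does the scoring; it is not a real OpenAI call.

```python
# Minimal sketch of the few-shot evaluation: prompts start with a subject
# header, include the dev exemplars, end with the unanswered test question,
# and the prediction is the answer letter with the highest score.
# `letter_logprobs` is a hypothetical scoring function, not a real API.
from typing import Callable, Dict, List, Optional

CHOICES = ["A", "B", "C", "D"]

def format_example(question: str, options: List[str],
                   answer: Optional[str] = None) -> str:
    # One question block: the question, lettered options, and an answer line.
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(subject: str, dev_examples: List[dict],
                 test_example: dict) -> str:
    # Assumes each example is a dict with "question", "choices", and an
    # integer "answer" index (the standard MMLU layout).
    header = ("The following are multiple choice questions (with answers) "
              f"about {subject.replace('_', ' ')}.\n\n")
    shots = "\n\n".join(
        format_example(ex["question"], ex["choices"], CHOICES[ex["answer"]])
        for ex in dev_examples)
    return header + shots + "\n\n" + format_example(
        test_example["question"], test_example["choices"])

def predict(prompt: str,
            letter_logprobs: Callable[[str, List[str]], Dict[str, float]]) -> str:
    scores = letter_logprobs(prompt, CHOICES)   # e.g. {"A": -1.2, "B": -0.3, ...}
    return max(scores, key=scores.get)
```

Accuracy is then just the fraction of test questions for which predict returns the letter in the answer key.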

Model Performance and Comparison

  • Accuracy Measurement: Evaluated average weighted accuracy for each model across four broad disciplines: Humanities, Social Science, STEM, and Other (see the aggregation sketch after this list).
  • Model Size Impact: Larger GPT-3 models, particularly the X-Large variant, showed significantly better performance than smaller ones.
  • UnifiedQA Performance: Exhibited higher accuracy compared to the few-shot GPT-3 X-Large model despite having fewer parameters.
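To illustrate how such numbers are reported, here is a minimal sketch of aggregating per-task results into per-discipline and overall accuracy, weighted by each task's question count. The task-to-discipline mapping and every count below are made-up placeholders, not the paper's results.

```python
# Minimal sketch of the reporting above: per-task results are rolled up into
# per-discipline accuracy weighted by each task's question count.
# All task names, groupings, and counts are illustrative placeholders.
from collections import defaultdict

# (discipline, task, num_correct, num_questions) -- made-up numbers
per_task = [
    ("STEM", "college_physics", 31, 102),
    ("STEM", "high_school_mathematics", 74, 270),
    ("Humanities", "philosophy", 160, 311),
]

correct, total = defaultdict(int), defaultdict(int)
for discipline, _task, c, n in per_task:
    correct[discipline] += c
    total[discipline] += n

for discipline in correct:
    print(f"{discipline}: {correct[discipline] / total[discipline]:.1%}")

overall = sum(correct.values()) / sum(total.values())
print(f"Overall weighted accuracy: {overall:.1%}")
```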

Specific Findings on Model Capabilities

  • Procedural vs. Declarative Knowledge: GPT-3 acquires declarative knowledge more readily than procedural knowledge, with less accuracy in calculation-heavy STEM tasks.
  • Knowledge Acquisition Patterns: GPT-3 demonstrates an unusual pattern of knowledge acquisition, performing better in advanced topics compared to elementary ones.
  • Lopsided Performance: Both GPT-3 and UnifiedQA exhibit uneven performance across different subjects, indicating knowledge gaps.

Calibration and Confidence Analysis

  • Calibration Importance: Examines the relationship between a model's confidence and its actual prediction accuracy.
  • GPT-3 Calibration Findings: GPT-3 is found to be uncalibrated, with its confidence often poorly reflecting actual accuracy, especially under zero-shot settings (a confidence-vs-accuracy check is sketched below).
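Here is a minimal sketch of a confidence-versus-accuracy check in this spirit: bin questions by the model's confidence in its chosen answer and compare each bin's average confidence to its accuracy. This is a standard reliability-diagram / expected-calibration-error computation, not necessarily the exact analysis in the paper.

```python
# Minimal sketch of a confidence-vs-accuracy (calibration) check:
# bin predictions by confidence, then compare each bin's mean confidence
# to its accuracy and accumulate the expected calibration error.
import numpy as np

def calibration_report(confidences: np.ndarray, correct: np.ndarray,
                       n_bins: int = 10) -> None:
    """confidences: P(chosen answer) per question; correct: 1/0 per question."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()
        conf = confidences[mask].mean()
        ece += mask.mean() * abs(acc - conf)   # weight by bin frequency
        print(f"({lo:.1f}, {hi:.1f}]  conf={conf:.2f}  acc={acc:.2f}  n={mask.sum()}")
    print(f"Expected calibration error: {ece:.3f}")
```

Running it on zero-shot versus few-shot confidences is one way to see the kind of miscalibration described above.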

Section: Discussion

Integration of Multimodal Understanding

  • Current Limitations: Existing NLP models, including GPT-3, do not incorporate multimodal information.
  • Future Benchmarking: Proposes the development of benchmarks that reflect multimodal capabilities, such as a "Turk Test" with Amazon Mechanical Turk tasks.

The Internet as a Comprehensive Training Set

  • Pretraining Approach: Assumes models have acquired the necessary knowledge from the vast, diverse text on the Internet, akin to human learning methods.
  • Methodological Shift: Suggests models should be trained more like humans, learning from reading and listening rather than relying solely on large question banks.

Evaluation Format and Purpose

  • Assessment Strategy: Evaluate pre-trained models in zero-shot, few-shot, or transfer settings.
  • Task Diversification: Enables the collection of a more extensive and diverse set of tasks, contrasting with identically distributed training and test sets.

Model Limitations and Future Improvements

  • Performance Shortcomings: Models exhibit poor performance in tasks requiring human value judgments and procedural knowledge, like Professional Law and Moral Scenarios.
  • Challenges in Enhancing Accuracy: Attempts to improve Professional Law model accuracy through additional specialized pretraining showed limited success.
  • Scaling Challenges: Questions the efficacy of simply increasing model size, noting the need for more data and the potential bottlenecks in data availability for esoteric knowledge.
