Background Note: Benchmark Study is a blog post series for recording and studying benchmark papers. I am in the process of developing a new LLM evaluation framework with more flexibility than EleutherAI's LM Harness. For the initial release, I'm only adding benchmarks that I've studied. Each study note is meant to be readable within 10 minutes. I receive GPT assistance here and there while writing these blog posts. I'm sharing the notes publicly partly to keep myself going and partly to help anyone who hasn't read the papers yet.
@misc{hendrycks2021measuring,
title={Measuring Massive Multitask Language Understanding},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
year={2021},
eprint={2009.03300},
archivePrefix={arXiv},
primaryClass={cs.CY}
}
TL;DR
Performance -> Unspecialized humans: 34.5%; estimated expert-level accuracy: 89.8% on MMLU
Procedural vs. Declarative Knowledge: GPT-3 acquires declarative knowledge more readily than procedural knowledge, with less accuracy in calculation-heavy STEM tasks.
Also, models exhibit poor performance in tasks requiring human value judgments and procedural knowledge, like Professional Law and Moral Scenarios.
Methodological Shift: Suggests models should be trained more like humans, learning from reading and listening rather than relying solely on large question banks. (reminds me of my recent paper)
GPT-3 Calibration Findings: GPT-3 is found to be uncalibrated, with its confidence often poorly reflecting actual accuracy, especially under zero-shot settings.
Snapshot Preview
Timeline Note: Everything below is written from the perspective of 2022, when the latest version (at the time of writing) of "Measuring Massive Multitask Language Understanding" was published
Section: Abstract
Introduction of a new test to measure the multitask accuracy of text models.
The test encompasses a wide range of 57 tasks, including topics like elementary mathematics, US history, computer science, law, and more.
Emphasis on the need for models to have extensive world knowledge and problem-solving abilities to score high on this test.
Findings indicate that most recent models perform only slightly better than random chance.
The largest GPT-3 model shows a notable improvement, surpassing random chance by almost 20 percentage points on average.
Despite these improvements, even the best models fall short of expert-level accuracy in all 57 tasks.
Observations of uneven performance across different tasks by the models.
Models frequently fail to recognize their own inaccuracies.
Notably, poor performance in socially significant subjects like morality and law, with near-random accuracy.
The proposed test serves as a tool for comprehensive evaluation of a model's academic and professional understanding, highlighting key areas for improvement.
Section: Introduction
Introduction of a New Benchmark for Language Models
Purpose: To bridge the gap between the knowledge seen by models during pretraining and current measures of success.
Design: A benchmark covering 57 subjects across various fields, including STEM, humanities, and social sciences, ranging from elementary to advanced professional levels.
Feature: Focuses on testing world knowledge and problem-solving abilities in areas such as mathematics, history, law, and ethics.
Analysis of Current NLP Model Performance
Observation: Most models, including those with up to 13 billion parameters, achieve only near-random-chance accuracy.
GPT-3 Performance: The 175 billion parameter GPT-3 model reaches significantly higher accuracy (43.9%) but lacks expertise in any single subject.
Performance Disparity: GPT-3 shows lopsided results, excelling in some areas but performing near-randomly in others, especially in calculation-heavy and human values-related subjects.
Challenges in Modern NLP Models
Knowledge Application: Current models struggle to apply knowledge from pretraining effectively.
Weak Areas: Low accuracy in subjects like physics, mathematics, law, and morality highlights critical weaknesses.
Confidence vs. Accuracy: GPT-3 often misjudges its own knowledge, with confidence levels significantly deviating from actual accuracy.
Significance of the New Benchmark
Comprehensive Evaluation: This benchmark evaluates a model’s text understanding across a broad range of topics important for human learning.
Section: A Multitask Test
Creation of a Comprehensive Multitask Test
Purpose: To evaluate text models across multiple branches of knowledge.
Design: The test includes 57 tasks spanning humanities, social sciences, hard sciences, and other key learning areas.
Task Source: Questions were manually collected from various online sources, including GRE and USMLE practice questions and undergraduate courses.
Test Composition and Structure
Question Collection: A total of 15,908 questions were gathered.
Test Segmentation: The test is divided into a few-shot development set (five questions per subject), a validation set, and a main test set.
Task Difficulty Levels: Tasks are categorized by difficulty levels, such as Elementary, High School, College, or Professional.
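The few-shot development set exists to build prompts: a subject header, up to five answered dev-set examples, then the test question with the answer left open. A minimal sketch of this prompt format (the exact header wording follows the paper's style, but the function and variable names here are my own, not from any released codebase):

```python
# Sketch of MMLU-style few-shot prompt construction.
CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer=None):
    """Render one question; omit the answer letter for the test question."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, options)]
    lines.append(f"Answer:{' ' + answer if answer else ''}")
    return "\n".join(lines)

def build_prompt(subject, dev_examples, test_question, test_options, k=5):
    """Subject header + k answered dev examples + the open test question."""
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    shots = "\n\n".join(
        format_example(q, opts, ans) for q, opts, ans in dev_examples[:k]
    )
    return header + shots + "\n\n" + format_example(test_question, test_options)
```

The prompt ends with a bare "Answer:", so the model's next token can be read off as its answer choice.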
Benchmark for Human-Level Accuracy
Human Performance Baseline: Unspecialized humans from Amazon Mechanical Turk achieved 34.5% accuracy.
Expert Performance Estimation: Expert-level accuracy is approximated at 89.8% based on the 95th percentile accuracy of real-world test takers.
Emphasis on Real-World Text Understanding
Goal: To assess how well models extract useful knowledge from massive online corpora.
Future Model Application: The test is applicable to both single models and a mixture of expert models.
Focus on Specific Subject Areas
Humanities Tasks: Cover qualitative analysis disciplines like law, philosophy, and history, requiring skills like legal reasoning and moral judgment.
Social Science Tasks: Include subjects like economics, sociology, and politics, focusing on human behavior and societal dynamics.
STEM Tasks: Encompass fields like physics, computer science, and mathematics, focusing on empirical methods and problem-solving abilities.
Other Subjects: Include areas like Professional Medicine, finance, and global facts, offering a diverse range of topics outside traditional categories.
Section: Experiments
Experimental Setup and Assessment Methodology
Assessment Goal: To measure classification accuracy across various tasks in the multitask test.
Models Evaluated: Includes GPT-3 (with its four variants: Small, Medium, Large, X-Large) and UnifiedQA, along with RoBERTa-base, ALBERT-xxlarge, and GPT-2.
Evaluation Process: Used the OpenAI API for GPT-3; UnifiedQA, previously fine-tuned on other QA datasets, was evaluated for transfer accuracy without further tuning.
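Classification accuracy on a multiple-choice test like this is typically obtained by comparing the model's probability for each answer letter as the next token. A hedged sketch, where `logprob_of` is a stand-in for whatever API call returns log P(token | prompt) (e.g. reading token logprobs from a completion response):

```python
import math

CHOICES = ["A", "B", "C", "D"]

def score_choices(logprob_of, prompt):
    """Query log P(letter | prompt) for each answer letter.
    `logprob_of(prompt, token)` is a hypothetical stand-in for an API call."""
    return {c: logprob_of(prompt, " " + c) for c in CHOICES}

def predict(scores):
    """The prediction is simply the highest-scoring answer letter."""
    return max(scores, key=scores.get)

def confidence(scores):
    """Softmax over the four choice log-probs: a per-choice confidence
    that can later be compared against actual accuracy."""
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}
```

Accuracy for a task is then just the fraction of questions where `predict` returns the labeled answer.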
Model Performance and Comparison
Accuracy Measurement: Evaluated average weighted accuracy for each model across four broad disciplines: Humanities, Social Science, STEM, and Other.
Model Size Impact: Larger GPT-3 models, particularly the X-Large variant, showed significantly better performance than smaller ones.
UnifiedQA Performance: Exhibited higher accuracy compared to the few-shot GPT-3 X-Large model despite having fewer parameters.
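"Weighted accuracy" here means weighting each task by its question count rather than averaging per-task accuracies, which is equivalent to pooling all questions. A minimal sketch:

```python
def weighted_accuracy(task_results):
    """task_results: {task_name: (num_correct, num_questions)}.
    Weighting by question count means large tasks (e.g. Professional Law)
    count proportionally more than small ones."""
    correct = sum(c for c, _ in task_results.values())
    total = sum(n for _, n in task_results.values())
    return correct / total
```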
Specific Findings on Model Capabilities
Procedural vs. Declarative Knowledge: GPT-3 acquires declarative knowledge more readily than procedural knowledge, with less accuracy in calculation-heavy STEM tasks.
Knowledge Acquisition Patterns: GPT-3 demonstrates an unusual pattern of knowledge acquisition, performing better in advanced topics compared to elementary ones.
Lopsided Performance: Both GPT-3 and UnifiedQA exhibit uneven performance across different subjects, indicating knowledge gaps.
Calibration and Confidence Analysis
Calibration Importance: Examines the relationship between a model's confidence and its actual prediction accuracy.
GPT-3 Calibration Findings: GPT-3 is found to be uncalibrated, with its confidence often poorly reflecting actual accuracy, especially under zero-shot settings.
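Calibration can be quantified by bucketing predictions by confidence and comparing each bucket's mean confidence to its empirical accuracy; a well-calibrated model that says "60% confident" should be right about 60% of the time. A sketch of a bucketed RMS calibration error (the equal-width binning here is my choice, not necessarily the paper's exact protocol):

```python
def rms_calibration_error(confidences, corrects, num_bins=10):
    """RMS gap between mean confidence and accuracy per confidence bucket,
    weighted by bucket size. `corrects` holds 0/1 per prediction."""
    bins = [[] for _ in range(num_bins)]
    for conf, correct in zip(confidences, corrects):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, correct))
    total = len(confidences)
    err = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        acc = sum(x for _, x in bucket) / len(bucket)
        err += (len(bucket) / total) * (avg_conf - acc) ** 2
    return err ** 0.5
```

An overconfident model (high confidence, low accuracy) drives this error up; the paper's finding is that GPT-3's error is large, especially zero-shot.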
Section: Discussion
Integration of Multimodal Understanding
Current Limitations: Existing NLP models, including GPT-3, do not incorporate multimodal information.
Future Benchmarking: Proposes the development of benchmarks that reflect multimodal capabilities, such as a "Turk Test" with Amazon Mechanical Turk tasks.
The Internet as a Comprehensive Training Set
Pretraining Approach: Assumes models have acquired the necessary knowledge from the vast, diverse text on the Internet, akin to human learning methods.
Methodological Shift: Suggests models should be trained more like humans, learning from reading and listening rather than relying solely on large question banks.
Evaluation Format and Purpose
Assessment Strategy: Evaluate pre-trained models in zero-shot, few-shot, or transfer settings.
Task Diversification: Enables the collection of a more extensive and diverse set of tasks, contrasting with identically distributed training and test sets.
Model Limitations and Future Improvements
Performance Shortcomings: Models exhibit poor performance in tasks requiring human value judgments and procedural knowledge, like Professional Law and Moral Scenarios.
Challenges in Enhancing Accuracy: Attempts to improve Professional Law model accuracy through additional specialized pretraining showed limited success.
Scaling Challenges: Questions the efficacy of simply increasing model size, noting the need for more data and the potential bottlenecks in data availability for esoteric knowledge.