Humans achieve over 95% accuracy, while no model surpasses 50% accuracy (as of 2019).
A series on benchmarks does seem very interesting and useful -- but you really gotta report more recent model results than those from 2019!! GPT-4 reportedly achieves 95.3% on HellaSwag, making that initial claim in the post very misleading.
Thanks for the feedback. This is similar to the feedback I received from Owain. Since my posts are getting upvotes (which I never really expected, thank you), it is of course important not to mislead anyone.
But yes, I did have several major epistemic concerns about the reliability of current academic reporting practices for performance scores. Even if a certain group of researchers is very ethical, how can we as readers ever confirm that the numbers are correct, or even that the experiment was run at all?
I was weighing the overall benefits of reporting such (in my opinion) non-provable numbers against simply focusing on the context in which the paper was written and enjoying the a-ha moments the authors would have felt back then.
Anyway, before I post another benchmark-study blog post tomorrow, I'll work out a course of action that addresses both my concern and yours. It's always a joy to post here on LessWrong. Thanks for the comment!
If that's your belief, I think you should edit in a disclaimer to your TL;DR section, like "Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don't trust their methodology".
Also, the numbers aren't "non-provable": anyone could just replicate them with the GPT-4 API! (Modulo dataset contamination considerations.)
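For instance, a minimal replication sketch might look something like the following. This is only a rough illustration: the `gpt-4` model name, the prompt format, and the Hugging Face `hellaswag` field names (`ctx`, `endings`, `label`) are assumptions on my part, not the exact few-shot setup reported by OpenAI.

```python
# Rough sketch: score GPT-4 on a sample of HellaSwag validation by asking it to pick an ending.
# Assumes `pip install openai datasets` and an OPENAI_API_KEY in the environment.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()
# Take a small sample of the labeled validation split to keep API costs low.
val = load_dataset("hellaswag", split="validation").select(range(200))

correct = 0
for ex in val:
    options = "\n".join(f"{i}. {end}" for i, end in enumerate(ex["endings"]))
    prompt = (
        "Choose the most plausible continuation of the context.\n\n"
        f"Context: {ex['ctx']}\n\nOptions:\n{options}\n\n"
        "Answer with a single digit (0-3)."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip()
    if answer and answer[0] == str(ex["label"]):  # label is stored as a string index
        correct += 1

print(f"Accuracy on sample: {correct / len(val):.3f}")
```

Even a few hundred validation examples would be enough to sanity-check whether the reported ~95% figure is in the right ballpark.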
Thanks for the recommendation, though I'll think of a more fundamental solution that satisfies all ethical and community concerns.
"Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don't trust their methodology." Regarding this, just to sort everything out, because I'm writing under my real name, I do trust the authors and ethics of both OpenAI and DeepMind. It's just me questioning everything when I still can as a student. But I'll make sure not to cause any further confusion, as you recommended!
TL;DR
Timeline Note: Everything below is written from the perspective of 2019, when the latest version (at the time of writing) of "HellaSwag: Can a Machine Really Finish Your Sentence?" was published.
Section: Abstract
Introduction to HellaSwag and Commonsense Inference
Development of HellaSwag Dataset
Implications for Machine Learning and NLP
Section: Introduction
Exploring Commonsense Inference in AI Models
Introduction of HellaSwag Dataset
Assessing Model Limitations and Dataset Evolution
Adversarial Filtering Overview
Future of Verified Progress in NLP
Section: Investigating SWAG
Investigating SWAG's Resolution by BERT
Learning Dynamics During Finetuning
Source of Stylistic Biases in SWAG
BERT's Adaptability and Discriminatory Power
Section: HellaSwag
A. Development and Structure of HellaSwag
Creation of HellaSwag for Commonsense NLI
Incorporating WikiHow as a New Testbed
Adversarial Filtering (AF) Methodology
B. Human Interaction and Model Evaluation in HellaSwag
Achieving High Human Agreement
Zero-Shot Categories for Model Generalization
Observations on Dataset Lengths and Model Performance
Section: Results
A. Evaluation of Models on HellaSwag Dataset
Model Performance Comparison
Results Indicating Dataset Difficulty
Insights on Pretraining and Finetuning
B. Model Transferability Between SWAG and HellaSwag
Transfer Experiments
Domain-Specific Observations
C. Qualitative Analysis of Model Responses
Evaluation of BERT-Large's Predictions
Section: Discussion
HellaSwag as a Challenging Testbed
Difficulty for Future Discriminators
Scaling of Pretraining
Potential Algorithmic Improvements
Evolving Benchmarks in NLP