Humans achieve over 95% accuracy, while no model surpasses 50% accuracy (as of 2019).
A series on benchmarks does seem very interesting and useful -- but you really gotta report more recent model results than those from 2019!! GPT-4 reportedly achieves 95.3% on HellaSwag, making that initial claim in the post very misleading.
Thanks for the feedback. This is similar to the feedback I received from Owain. Since my posts are getting upvotes (which I never really expected, thank you), it is of course important not to mislead anyone.
But yes, I did have several major epistemic concerns about the reliability of current academic reporting practices for performance scores. Even if a certain group of researchers is very ethical, how can we as readers ever confirm that the numbers are correct, or even that the experiment was run at all?
I was weighing the overall benefits of reporting such (in my opinion) non-provable numbers against simply focusing on the context in which the paper was written and enjoying the a-ha moments the authors would have felt back then.
Anyway, before I post another benchmark-study blog post tomorrow, I'll work out a course of action that addresses both my concern and yours. It's always a joy to post here on LessWrong. Thanks for the comment!
If that's your belief, I think you should edit in a disclaimer to your TL;DR section, like "Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don't trust their methodology".
Also, the numbers aren't "non-provable": anyone could just replicate them with the GPT-4 API! (Modulo dataset contamination considerations.)
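For instance, a minimal replication sketch might look something like the following. This is only a rough illustration: the `gpt-4` model name, the prompt format, and the Hugging Face `hellaswag` field names (`ctx`, `endings`, `label`) are assumptions on my part, not the exact few-shot setup reported by OpenAI.

```python
# Rough sketch: score GPT-4 on a sample of HellaSwag validation by asking it to pick an ending.
# Assumes `pip install openai datasets` and an OPENAI_API_KEY in the environment.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()
# Take a small sample of the labeled validation split to keep API costs low.
val = load_dataset("hellaswag", split="validation").select(range(200))

correct = 0
for ex in val:
    options = "\n".join(f"{i}. {end}" for i, end in enumerate(ex["endings"]))
    prompt = (
        "Choose the most plausible continuation of the context.\n\n"
        f"Context: {ex['ctx']}\n\nOptions:\n{options}\n\n"
        "Answer with a single digit (0-3)."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip()
    if answer and answer[0] == str(ex["label"]):  # label is stored as a string index
        correct += 1

print(f"Accuracy on sample: {correct / len(val):.3f}")
```

Even a few hundred validation examples would be enough to sanity-check whether the reported ~95% figure is in the right ballpark.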
Thanks for the recommendation, though I'll think of a more fundamental solution that satisfies all ethical and community concerns.
"Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don't trust their methodology." Regarding this, just to sort everything out, because I'm writing under my real name, I do trust the authors and ethics of both OpenAI and DeepMind. It's just me questioning everything when I still can as a student. But I'll make sure not to cause any further confusion, as you recommended!
TL;DR
Timeline Note: Everything below is written from the perspective of 2019, when the latest version (at the time of writing) of "HellaSwag: Can a Machine Really Finish Your Sentence?" was published.
Section: Abstract
Introduction to HellaSwag and Commonsense Inference
Development of HellaSwag Dataset
Implications for Machine Learning and NLP
Section: Introduction
Exploring Commonsense Inference in AI Models
Introduction of HellaSwag Dataset
Assessing Model Limitations and Dataset Evolution
Adversarial Filtering Overview
Future of Verified Progress in NLP
Section: Investigating SWAG
Investigating SWAG's Resolution by BERT
Learning Dynamics During Finetuning
Source of Stylistic Biases in SWAG
BERT's Adaptability and Discriminatory Power
Section: HellaSwag
A. Development and Structure of HellaSwag
Creation of HellaSwag for Commonsense NLI
Incorporating WikiHow as a New Testbed
Adversarial Filtering (AF) Methodology
B. Human Interaction and Model Evaluation in HellaSwag
Achieving High Human Agreement
Zero-Shot Categories for Model Generalization
Observations on Dataset Lengths and Model Performance
Section: Results
A. Evaluation of Models on HellaSwag Dataset
Model Performance Comparison
Results Indicating Dataset Difficulty
Insights on Pretraining and Finetuning
B. Model Transferability Between SWAG and HellaSwag
Transfer Experiments
Domain-Specific Observations
C. Qualitative Analysis of Model Responses
Evaluation of BERT-Large's Predictions
Section: Discussion
HellaSwag as a Challenging Testbed
Difficulty for Future Discriminators
Scaling of Pretraining
Potential Algorithmic Improvements
Evolving Benchmarks in NLP