Broken Benchmark: MMLU

awg

1 min read · 29th Aug 2023 · 5 comments

This is a linkpost for https://www.youtube.com/watch?v=hVade_8H8mE

Phillip over at the AI Explained channel has been running experiments with his SmartGPT framework against the MMLU benchmark, and in the process he found a non-trivial number of problems with the question set.

Among them:

Crucial context missing from questions (apparently copy-paste errors?)
Ambiguous sets of answers
Wrong sets of answers

He highlights a growing need for a proper benchmarking organization that can research and create accurate, robust, sensible benchmarking suites for evaluating SOTA models.

I found the video very interesting and the findings important, so I wanted to share it here.

Dan H (8mo)

Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I've seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).
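To make rates like these concrete, here is a minimal back-of-the-envelope sketch in Python. It rests on a simplifying assumption that is not from the comment above: a correct answer to a mislabeled question is always scored as wrong, and a wrong answer never matches the bad key by luck. The function names are illustrative only.

```python
def measured_accuracy(true_acc: float, noise: float) -> float:
    """Score a model would receive if a fraction `noise` of the keys is wrong.

    Assumption: only correct answers on correctly labeled questions earn credit.
    """
    return true_acc * (1.0 - noise)


def noise_share_of_errors(true_acc: float, noise: float) -> float:
    """Fraction of the observed errors that are caused by bad labels."""
    observed_errors = 1.0 - measured_accuracy(true_acc, noise)
    if observed_errors == 0.0:
        return 0.0
    # Errors caused by noise: the model was right but the key was wrong.
    return (true_acc * noise) / observed_errors


if __name__ == "__main__":
    # Noise rates taken from the rough guesses above (about 2% and about 10%).
    for noise in (0.02, 0.10):
        for true_acc in (0.70, 0.90, 0.98):
            score = measured_accuracy(true_acc, noise)
            share = noise_share_of_errors(true_acc, noise)
            print(f"noise={noise:.2f} true={true_acc:.2f} "
                  f"measured={score:.3f} errors_from_noise={share:.1%}")
```

Under this model the best achievable score is one minus the noise rate, and the share of observed errors that comes from bad labels grows as the model's true accuracy rises.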

alenoach (8mo)

As the video says, label noise matters more as LLMs get closer to 100%. Would making a version 2 be worthwhile? I suppose an LLM could be used to automatically detect most of the problematic questions, and a human could then check each flagged question to decide whether it needs to be fixed or removed.
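One way the flag-then-review idea could be prototyped is sketched below. It swaps in a simpler heuristic than asking an LLM to judge each question directly: a question is flagged when a strong majority of model answers agree with each other but disagree with the answer key. The function name, the data layout, and the 0.75 threshold are all hypothetical, and the heuristic would not catch questions that are merely ambiguous or missing context.

```python
from collections import Counter


def flag_for_review(labels, predictions, min_agreement=0.75):
    """Return question ids whose answer key looks suspicious.

    labels:      dict mapping question id -> keyed answer, e.g. "B"
    predictions: dict mapping question id -> list of model answers
    A question is flagged when the most common model answer differs from
    the key and is given by at least `min_agreement` of the models.
    """
    flagged = []
    for qid, key in labels.items():
        votes = Counter(predictions.get(qid, []))
        total = sum(votes.values())
        if total == 0:
            continue  # no model answers recorded for this question
        top_answer, top_count = votes.most_common(1)[0]
        if top_answer != key and top_count / total >= min_agreement:
            flagged.append(qid)
    return flagged


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    labels = {"q1": "A", "q2": "B"}
    predictions = {
        "q1": ["A", "A", "C", "A"],
        "q2": ["D", "D", "D", "B"],
    }
    for qid in flag_for_review(labels, predictions):
        print(f"{qid}: send to a human reviewer")
```

The flagged list is only a queue for human review; a consensus among models can itself be wrong, so nothing here should change a label automatically.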

awg (8mo)

Your position seems to be that this is not something worth worrying about or looking into. Can you explain why?

For instance, if the goal is to train predictive systems to provide accurate information, how is 10% or even 1-2% label noise "fine" under those conditions (if, for example, we could somehow get that number down to 0%)?

Richard_Ngo (8mo)

It seems like he's mainly responding to the implication that this means MMLU is "broken". Label noise can be both suboptimal and also much less important than this post's title suggests.

O O (8mo)

I imagine researchers at big labs know this and are correcting these errors as models get good enough for this to matter.
