AI Benchmarking — LessWrong
AI Benchmarking
This page is a stub.
Posts tagged AI Benchmarking
Most Relevant
2
79
AI benchmarking has a Y-axis problem
Lizka
2mo
3
2
61
FrontierMath Score of o3-mini Much Lower Than Claimed
YafahEdelman
1y
7
2
49
Introducing BenchBench: An Industry Standard Benchmark for AI Strength
Ω
Jozdien
1y
Ω
0
2
24
Broken Benchmark: MMLU
awg
3y
5
2
15
The real reason AI benchmarks haven’t reflected economic impacts
Noosphere89
1y
0
1
95
Every Benchmark is Broken
Jonathan Gabor
2mo
0
1
71
Some lessons from the OpenAI-FrontierMath debacle
7vik
1y
9
1
46
A Guide For LLM-Assisted Web Research
nikos, dschwarz, Lawrence Phillips, FutureSearch
9mo
3
1
45
Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)
Roland Pihlakas, Sruthi Kuriakose, shrutidattagupta
1y
8
1
44
I'm confused by the change in the METR trend
Expertium
1mo
17
1
34
"Superhuman" Isn't Well Specified
JustisMills
11mo
9
1
33
Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols
Arjun Panickssery, agg
2y
0
1
30
Improving Model-Written Evals for AI Safety Benchmarking
Ω
Sunishchal Dev, Marius Hobbhahn
1y
Ω
0
1
27
Reasons to care about Canary Strings
Alice Blair
4mo
3
1
24
Maybe benchmarks should be broken?
Jonathan Gabor
1mo
2