AI Benchmarking
This page is a stub.
Posts tagged
AI Benchmarking
61
FrontierMath Score of o3-mini Much Lower Than Claimed
YafahEdelman
6mo
7
50
Introducing BenchBench: An Industry Standard Benchmark for AI Strength
Ω
Jozdien
6mo
Ω
0
24
Broken Benchmark: MMLU
awg
2y
5
15
The real reason AI benchmarks haven’t reflected economic impacts
Noosphere89
5mo
0
71
Some lessons from the OpenAI-FrontierMath debacle
7vik
8mo
9
46
A Guide For LLM-Assisted Web Research
nikos
,
dschwarz
,
Lawrence Phillips
,
FutureSearch
3mo
3
45
Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)
Roland Pihlakas
,
Sruthi Kuriakose
,
shrutidattagupta
6mo
8
34
"Superhuman" Isn't Well Specified
JustisMills
5mo
9
33
Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols
Arjun Panickssery
,
agg
2y
0
30
Improving Model-Written Evals for AI Safety Benchmarking
Ω
Sunishchal Dev
,
Marius Hobbhahn
1y
Ω
0
20
Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents
Ω
Sam F. Brown
,
BasilLabib
,
Codruta (Coco) Lugoj
,
Sai Sasank Y
1y
Ω
0
19
Edge Cases in AI Alignment
Florian_Dietz
6mo
3
18
MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures
Ω
corey morris
2y
Ω
3
18
Building AI safety benchmark environments on themes of universal human values
Roland Pihlakas
8mo
3
17
Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs
Roland Pihlakas
3mo
0