LESSWRONG

AI Benchmarking

This page is a stub.
Posts tagged AI Benchmarking
FrontierMath Score of o3-mini Much Lower Than Claimed, by YafahEdelman (2mo; karma 61; 7 comments; tag relevance 2)
Introducing BenchBench: An Industry Standard Benchmark for AI Strength, by Jozdien (Ω; 1mo; karma 50; 0 comments; tag relevance 2)
Broken Benchmark: MMLU, by awg (2y; karma 24; 5 comments; tag relevance 2)
The real reason AI benchmarks haven’t reflected economic impacts, by Noosphere89 (1mo; karma 15; 0 comments; tag relevance 2)
Some lessons from the OpenAI-FrontierMath debacle, by 7vik (4mo; karma 69; 9 comments; tag relevance 1)
Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format, by Roland Pihlakas, Sruthi Kuriakose, shrutidattagupta (2mo; karma 37; 6 comments; tag relevance 1)
Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols, by Arjun Panickssery, agg (1y; karma 33; 0 comments; tag relevance 1)
"Superhuman" Isn't Well Specified, by JustisMills (13d; karma 32; 9 comments; tag relevance 1)
Improving Model-Written Evals for AI Safety Benchmarking, by Sunishchal Dev, Marius Hobbhahn (Ω; 7mo; karma 30; 0 comments; tag relevance 1)
Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents, by Sam F. Brown, BasilLabib, Codruta (Coco) Lugoj, Sai Sasank Y (Ω; 10mo; karma 20; 0 comments; tag relevance 1)
Edge Cases in AI Alignment, by Florian_Dietz (2mo; karma 19; 3 comments; tag relevance 1)
MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures, by corey morris (Ω; 2y; karma 18; 3 comments; tag relevance 1)
Building AI safety benchmark environments on themes of universal human values, by Roland Pihlakas (4mo; karma 18; 3 comments; tag relevance 1)
Understanding Benchmarks and motivating Evaluations, by markov, Charbel-Raphaël (3mo; karma 10; 0 comments; tag relevance 1)
In-Context Scheming: A Run is Worth a Thousand Words, by noise-field (2mo; karma 10; 0 comments; tag relevance 1)
Showing 15 of 21 posts.