
AI Benchmarking

This page is a stub.
Posts tagged AI Benchmarking
61 FrontierMath Score of o3-mini Much Lower Than Claimed
YafahEdelman
8mo
7
49 Introducing BenchBench: An Industry Standard Benchmark for AI Strength
Ω
Jozdien
7mo
Ω
0
24 Broken Benchmark: MMLU
awg
2y
5
15 The real reason AI benchmarks haven’t reflected economic impacts
Noosphere89
7mo
0
71 Some lessons from the OpenAI-FrontierMath debacle
7vik
9mo
9
46 A Guide For LLM-Assisted Web Research
nikos, dschwarz, Lawrence Phillips, FutureSearch
4mo
3
45 Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)
Roland Pihlakas, Sruthi Kuriakose, shrutidattagupta
8mo
8
34 "Superhuman" Isn't Well Specified
JustisMills
6mo
9
33 Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols
Arjun Panickssery, agg
2y
0
30 Improving Model-Written Evals for AI Safety Benchmarking
Ω
Sunishchal Dev, Marius Hobbhahn
1y
Ω
0
20 Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents
Ω
Sam F. Brown, BasilLabib, Codruta (Coco) Lugoj, Sai Sasank Y
1y
Ω
0
19 Edge Cases in AI Alignment
Florian_Dietz
7mo
3
18 MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures
Ω
corey morris
2y
Ω
3
18 Building AI safety benchmark environments on themes of universal human values
Roland Pihlakas
10mo
3
17 Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs
Roland Pihlakas
4mo
0