AI Benchmarking — LessWrong

AI Benchmarking

This page is a stub.

Posts tagged AI Benchmarking

2

80

AI benchmarking has a Y-axis problem

6mo

3

2

61

FrontierMath Score of o3-mini Much Lower Than Claimed

1y

7

2

49

Introducing BenchBench: An Industry Standard Benchmark for AI Strength

1y

0

2

24

Broken Benchmark: MMLU

3y

5

2

15

The real reason AI benchmarks haven’t reflected economic impacts

1y

0

2

8

Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

Roland Pihlakas, lenz, Three Laws

17d

0

1

102

Every Benchmark is Broken

6mo

1

1

71

Some lessons from the OpenAI-FrontierMath debacle

2y

9

1

46

A Guide For LLM-Assisted Web Research

nikos, dschwarz, Lawrence Phillips, FutureSearch

1y

3

1

46

I'm confused by the change in the METR trend

5mo

17

1

45

Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

Roland Pihlakas, Sruthi Kuriakose, shrutidattagupta, Three Laws

1y

8

1

34

"Superhuman" Isn't Well Specified

1y

9

1

33

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

Arjun Panickssery, agg

3y

0

1

32

SWE-Bench Pro is even worse

5mo

2

1

30

Improving Model-Written Evals for AI Safety Benchmarking

Sunishchal Dev, Marius Hobbhahn

2y

0
