christianvuye — LessWrong

My AGI timeline updates from GPT-5 (and 2025 so far)

christianvuye8mo*30

I do wonder why SWE-Bench and the METR benchmarks are treated as the best indicators of progress. SWE-Bench captures only a small fraction of real-world software engineering, and METR themselves have published work showing that the benchmark covers very narrow algorithmic work, not software engineering as a whole. Benchmarks tell a minimal story, so extrapolating predictions from such limited benchmarks is a good example of Goodhart’s law. The real-world impact of AI on software engineering is much smaller than progress on benchmarks such as SWE-Bench would imply.