My AGI timeline updates from GPT-5 (and 2025 so far)
christianvuye, 2mo

I wonder why SWE-Bench and the METR benchmarks are treated as THE best indicators of progress. SWE-Bench is one particular benchmark, and it captures only a small fraction of real-world software engineering. METR themselves have published work showing that the benchmark captures only very narrow algorithmic work, not software engineering holistically. Benchmarks tell a minimal story, so extrapolating predictions from such limited benchmarks is a good example of Goodhart's law. The real-world impact of AI on software engineering is much smaller than progress on benchmarks such as SWE-Bench would imply.
