
piedrastrae (4020 karma)



Comments

My AI Predictions for 2027
piedrastrae · 15h

Glad to see some common sense/transparency about uncertainty. It seems to me that AGI/ASI is basically a black swan event: by definition unpredictable. Trying to predict it is a fool's errand; it makes more sense to manage its possibility instead.

It's particularly depressing when people who pride themselves on being rationalists basically ground their reasoning in "line has been going up, therefore it will keep going up", as if the mere existence of Moore's law meant it extends to any and all technology-related lines in existence[1]. It's even more depressing when those "line go up" arguments come from very flawed/contaminated benchmarks (like SWE-bench), or very skewed ones (like the 50% success threshold of the METR long-tasks benchmark, which imo is absolutely crucial for differentiating an autonomous agent vs. a supervised copilot).
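As an aside on what that 50% figure actually measures: here's a minimal sketch (not METR's code; the task lengths and outcomes are invented) of how a 50% time horizon is typically estimated, by fitting a logistic curve of success probability against log task length and reading off where it crosses 50%:

```python
# Minimal sketch of estimating a "50% time horizon" from per-task pass/fail
# data. Not METR's actual code; task lengths and outcomes are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (task length in human-minutes, 1 = agent succeeded, 0 = agent failed)
task_minutes = np.array([2, 5, 8, 15, 30, 60, 120, 240, 480])
succeeded = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit success probability against log task length.
X = np.log(task_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, succeeded)

# The 50% horizon is the length at which the fitted logit crosses zero.
horizon_50 = np.exp(-model.intercept_[0] / model.coef_[0][0])
print(f"Estimated 50% time horizon: ~{horizon_50:.0f} human-minutes")
```

The point being: the headline horizon is the task length at which the agent fails about half the time, not the length it can handle reliably, which is exactly why I think it matters for the autonomous-agent vs. supervised-copilot distinction.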

Hopefully I'll be able to mirror you, sipping eggnog and gloating, come Christmastime 2027.

[1] "Hume, I felt, was perfectly right in pointing out that induction cannot be logically justified." (Popper)

Agents lag behind AI 2027's schedule
piedrastrae · 1mo

Re: the imperfection of benchmarks, there is reason to believe SWE-bench scores have improved due to data contamination rather than pure model improvement (see "The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason"). SWE-bench was released in October 2023, well before most current frontier models' knowledge cutoff dates.
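To make the timeline point concrete, here's a hypothetical sketch of checking how much of a benchmark predates a given knowledge cutoff; the `created_at` field and the cutoff date are assumptions for illustration, not claims about any particular dataset schema or model:

```python
# Hypothetical sketch: what fraction of benchmark instances were created
# before a model's knowledge cutoff? Field name and cutoff are assumptions.
from datetime import datetime, timezone

KNOWLEDGE_CUTOFF = datetime(2024, 6, 1, tzinfo=timezone.utc)  # assumed cutoff

def fraction_before_cutoff(instances):
    """instances: iterable of dicts with an ISO-8601 'created_at' field."""
    dates = [
        datetime.fromisoformat(i["created_at"].replace("Z", "+00:00"))
        for i in instances
    ]
    return sum(d < KNOWLEDGE_CUTOFF for d in dates) / len(dates)
```

For SWE-bench, whose issues were all collected before its October 2023 release, that fraction is 1.0 against any 2024-or-later cutoff, which is the whole contamination worry; a live benchmark built from fresh issues keeps it near zero.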

Possibly better alternatives would be SWE-bench-Live (https://swe-bench-live.github.io/, based on GitHub tasks) or LiveCodeBench (https://livecodebench.github.io/leaderboard.html, based on LeetCode/AtCoder/Codeforces problems). Interestingly enough, these two benchmarks have wildly different results for the top contenders: 19.26% vs. 80.2%. This seems to be mostly because only the latter uses reasoning models, but not only that: "real" in-context SWE tasks seem to be harder to figure out than self-contained code problems.

I don't have evidence for this, but it does seem to me that AI companies are aware of this and keep using SWE-bench to sustain their marketing hype. Data contamination is a known issue, and it would be a glaring oversight to assume it would not happen with such an "old" and public benchmark.
