Re: the imperfection of benchmarks, there is reason to believe SWE-bench scores have improved due to data contamination rather than pure model improvement (see "The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason"). SWE-bench itself was released in October 2023, well before most current frontier models' knowledge cutoff dates.
Possibly better alternatives would be SWE-bench-Live (https://swe-bench-live.github.io/, based on GitHub tasks) or LiveCodeBench (https://livecodebench.github.io/leaderboard.html, based on LeetCode/AtCoder/Codeforces problems).
Glad to see some common sense/transparency about uncertainty. It seems to me that AGI/ASI is basically a black swan event — by definition unpredictable. Trying to predict it is a fool's errand; it makes more sense to manage its possibility instead.
It's particularly depressing when people who pride themselves on being rationalists basically ground their reasoning on "the line has been going up, therefore it will keep going up", as if Moore's law's mere existence means it extends to any and all technology-related lines in existence[1]. It's even more depress…