I fitted logistic functions and Gaussian CDFs with a free scaling factor to the trend of the percentage scores for the four rankings I analysed, and they all asymptote below 80%. The idea was to find some evidence of an "irreducible error".
But given that a 20+% error rate is clearly way too high, it still makes more sense to me to argue that improvement is slowing and these fits therefore asymptote too low than to argue that the time horizons and percentages are asymptoting because of a high percentage of unsolvable tasks.
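For concreteness, here is a minimal sketch of the kind of fit I mean, assuming scipy and with made-up placeholder numbers rather than my actual data:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

t = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0, 2025.5])  # release dates (placeholder)
y = np.array([0.05, 0.12, 0.25, 0.42, 0.55, 0.63])              # top scores (placeholder)

def logistic(t, a, k, t0):
    # Logistic with a free asymptote a instead of 1.
    return a / (1.0 + np.exp(-k * (t - t0)))

def scaled_gauss_cdf(t, a, mu, sigma):
    # Gaussian CDF scaled by the factor a (the asymptote).
    return a * norm.cdf(t, loc=mu, scale=sigma)

for f, p0 in [(logistic, [0.8, 2.0, 2024.5]), (scaled_gauss_cdf, [0.8, 2024.5, 1.0])]:
    popt, _ = curve_fit(f, t, y, p0=p0, maxfev=10000)
    print(f.__name__, "fitted asymptote:", round(popt[0], 3))
```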
But this gave me a more general idea for assessing changes in improvement speed: the default assumption right now should be that model improvement moves linearly through log time-horizon space. Additionally, I found that at least SWE-bench Verified seems to have task lengths that are lognormally distributed, and I suspect that holds for many benchmarks.
This means that the approach to saturation should follow a Gaussian CDF. The idea would be to use the movement through the first x percent of the benchmark to fit the Gaussian CDF (or at least sanity-check that assumption) and then see whether progress slows down over the rest of the benchmark. To put it differently: constant improvement speed -> symmetric underlying Gaussian of the CDF. Slowdown -> the right tail gets fatter.
Of course the signal would be pretty weak, but aggregated over several benchmarks it might make a good speedometer.
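A sketch of that check under the lognormal assumption (scipy assumed, placeholder numbers): fit the Gaussian CDF only on the early part of the score trajectory, then look at the sign of the residuals on the later points.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

t = np.linspace(2023.0, 2026.0, 13)                      # dates of SOTA scores (placeholder)
y = np.array([0.03, 0.06, 0.11, 0.19, 0.30, 0.42, 0.53,
              0.61, 0.67, 0.71, 0.74, 0.76, 0.77])       # placeholder scores

def gauss_cdf(t, mu, sigma):
    return norm.cdf(t, loc=mu, scale=sigma)

# Fit only on the run-up to ~50% of the benchmark.
early = y <= 0.5
(mu, sigma), _ = curve_fit(gauss_cdf, t[early], y[early], p0=[2024.5, 1.0])

# Systematically negative residuals on the later points mean the right tail is
# fatter than the symmetric Gaussian predicts, i.e. progress is slowing down.
resid = y[~early] - gauss_cdf(t[~early], mu, sigma)
print("mean late residual:", round(resid.mean(), 3))
```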
Hmm, actually none of these checks can distinguish between genuinely unsolvable tasks and tasks that are merely unsolvable for further scaled-up models of the current kind (with the framework and compute used in the evaluations).
Yeah, I am also pretty much on the fence right now. But time will tell.
It depends on how the work times of these unsolvable tasks are distributed; you could in principle get any outcome. But there are a few ways to check for the existence of unsolvable tasks; maybe I'll find the time today.
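One crude check I have in mind (my own framing, not an established test, with placeholder data): if per-task success info is available for a sequence of models, look at whether the set of never-solved tasks stops shrinking as models improve.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_tasks = 8, 500
# Placeholder success matrix: rows = models ordered by release date, cols = tasks.
success = rng.random((n_models, n_tasks)) < np.linspace(0.2, 0.7, n_models)[:, None]

# "Ever solved by any model up to and including model i".
ever_solved = np.cumsum(success, axis=0) > 0
never_solved = (~ever_solved).sum(axis=1)
print("never-solved tasks after each model generation:", never_solved)
# If this sequence flattens out well above zero while scores keep rising,
# that is at least consistent with a core of unsolvable tasks.
```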
SWE-bench Verified shouldn't have that many impossible tasks, if any, right? And the highest scores for the rankings I used are still significantly below 80%. But it's possible. Maybe a good motivation to look at SWE-bench Pro.
I computed METR time horizons for SWE-bench Verified SOTA models using both the existing difficulty estimates and work-time estimates derived from commit data.
I used a range of different methods, including the original METR methodology where task-level success info was available.
I did this for four different rankings: EpochAI's, LLMStats's, and the "verified" and "bash only" rankings of the SWE-bench website.
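Roughly, the METR-style computation for the cases with task-level success info looks like this (my paraphrase of their method, with placeholder data and sklearn assumed): fit a logistic curve of success probability against the log of the human work-time estimate and read off where it crosses 50%.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder per-task data for one model: human work-time estimates (minutes)
# and whether the model solved the task.
work_time = np.array([2, 5, 8, 15, 30, 45, 60, 120, 240, 480], dtype=float)
solved    = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Large C makes the fit effectively unregularised.
X = np.log2(work_time).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, solved)

# 50% success probability is where the linear term crosses zero:
# intercept + coef * log2(t50) = 0.
t50 = 2 ** (-clf.intercept_[0] / clf.coef_[0, 0])
print(f"50% time horizon: ~{t50:.0f} minutes")
```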
In every single case, a logistic function with an asymptote of a couple of hours fits the trend better than an exponential does. In some cases the trend only becomes visibly logistic with the last one or two data points, so it's not surprising that the METR report has an exponential fit for SWE-bench.
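The comparison itself is straightforward (sketch with placeholder numbers, not my computed horizons):

```python
import numpy as np
from scipy.optimize import curve_fit

dates   = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0, 2025.5])
horizon = np.array([4.0, 9.0, 22.0, 48.0, 80.0, 100.0])  # minutes, placeholder

def exponential(t, a, b):
    return a * np.exp(b * (t - 2023.0))

def logistic(t, L, k, t0):
    # Logistic with a free asymptote L (in minutes).
    return L / (1.0 + np.exp(-k * (t - t0)))

for f, p0 in [(exponential, [4.0, 1.0]), (logistic, [150.0, 2.0, 2024.5])]:
    popt, _ = curve_fit(f, dates, horizon, p0=p0, maxfev=20000)
    rss = np.sum((horizon - f(dates, *popt)) ** 2)
    # AIC penalises the extra parameter of the logistic.
    aic = len(dates) * np.log(rss / len(dates)) + 2 * len(popt)
    print(f.__name__, "AIC:", round(aic, 1), "params:", np.round(popt, 2))
```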
I am not sure when I'll get around to publishing this analysis, because it's a giant mess of different datasets and methods. But I thought I'd at least state the result before it becomes irrelevant, falsified, or obvious.
Thanks for doing this! My blitz rating is around 2470 right now, so I seem to have done a bit better than typical, probably by virtue of playing more games.
I played some 70 games against LeelaQueensOdds at 5+3 (I basically played until I reached a >50% score, after losing the first couple of games and then slowly figuring things out), so the most interesting graph for me is unfortunately missing. ;-)
What is uniquely interesting/valuable about METR time horizons is that the score is meaningful and interpretable. "Can do software tasks that would take an expert 2h with 50% success probability" is very specific. "Has score y on benchmark x" is only valuable for comparisons; it does not tell you what's going to happen when the models reach score z.
The recent Goodfire paper seems to me a step in that direction. Going completely synthetic with the training data might also be a way.