Anthropic researchers estimate that Opus 4.5 provides 2-3x speedup to their research, if I'm reading this correctly. This seems very important and I'm surprised I haven't seen more discussion of it.
Twitter thread: https://x.com/HjalmarWijk/status/1993752035536331113
Unrolled/without login required: https://twitter-thread.com/t/1993752035536331113
@HjalmarWijk
Nov 26
Anthropic says in their system card that *all* their AI R&D evals are close to saturation, and report a median self-reported uplift of 2X (mean over 3X!) for power users. They provide very little evidence ruling out imminent dramatic AI R&D acceleration.
I personally suspect that their self-report uplift numbers are inflated and that agent time horizons are still limited. But if taken at face value, then even the most aggressive scenarios (e.g. AI 2027 or https://blog.redwoodresearch.org/p/whats-up-with-anthropic-predicting) would have underestimated progress.
I didn't quote the whole thread, there's more if you follow the link.
I basically just don't believe it. According to the METR downlift study people tend to overestimate the uplift effect of AIs on their own work. I would be surprised if the true effect was more than 50%.
I think there is nuance about the downlift study that would be helpful to highlight:
This is not to say that Anthropic employees really are getting that high of an uplift, but it may make the claim a bit more believable.
I was aware of all of that except point 2. I think it undercuts the result "AI models aren't useful for coding" but it doesn't undercut the result "people tend to overestimate how much AI is helping them."
Re: the one guy with the 38% uplift: Did he accurately predict it in advance? I can't tell from skimming the thread.
Hmm, my thought was that devs (or at least Anthropic folks) have improved their ability to estimate how much AI is helping us since the release of the first truly agentic model? My feeling is that most top-end people should be better calibrated despite the moving target. Most people in the study had spent less than 50 hours (except for one of the folks who performed well), so I don’t think we can use the study to say much about how things change over the course of months or a year of usage and training (unless we do another study, I guess).
In terms of the accurate prediction, I’m not recalling what exactly made me believe this, though if you look at the first chart in the METR thread, the confidence intervals of the devs' predicted uplift are below the 38%. The average dev thought they were 24% faster at the beginning of the study (so, in fact, he probably underestimated his uplift a bit).
That's a reasonable point, but, going in the other direction, Anthropic people are probably biased towards overestimating the value of their models in particular.
Like, I'm at like 20% that Anthropic is currently getting 2x or more coding uplift. It's possible (for the reasons you mention) but I don't think it's the most likely scenario.
If you have Long COVID or ME/CFS, or want to learn more about them, I highly recommend https://s4me.info. The signal-to-noise ratio is much better than on the other forums I've found for those topics. The community is good at recognizing and critiquing low- vs. high-quality studies.
As an example of the quality, this factsheet created by the community is quite good: https://s4me.info/docs/WhatIsMECFS-S4ME-Factsheet.pdf
If compute is the main bottleneck to AI progress, then one goalpost to watch for is when AI is able to significantly increase the pace of chip design and manufacturing. After writing the above, I searched for work being done in this area and found this article. If these approaches can actually speed up certain steps in this process from taking weeks to just taking a few days, will that increase the pace of Moore's law? Or is Moore's law mainly bottlenecked by problems that will be particularly hard to apply AI to?
I have a potential category of questions that could fit on Metaculus and work as an "AGI fire alarm." The questions are of the format "After an AI system achieves task x, how many years will it take for world output to double?"