Your link to ChatGPT's analysis is not loading for me, but I don't really trust ChatGPT to do this without mistakes anyway. When I run this regression in Python statsmodels with METR's official β values and release dates from their GitHub (plus the recent gpt-5 models), the OpenAI instruct-model trend is stat sig (p ≈ 0.04), and it gets more stat sig once we include Claude 4.5. I don't see what dates you are using, but your β values have a fair amount of (rounding?) error relative to METR's data, and the model list isn't the same, e.g. METR didn't include non-frontier mini models, etc. (Btw, note that one-sided p-values are appropriate in this case, since only a positive slope would be treated as evidence of catching up.)
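For concreteness, here is a minimal sketch of that regression setup in statsmodels; the β values and release dates below are placeholders for illustration, not METR's actual numbers:

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data: fitted logistic slope (beta) per model vs. fractional
# release year. These are NOT METR's values; substitute the real ones.
release_year = np.array([2023.2, 2023.9, 2024.4, 2024.9, 2025.3, 2025.6])
beta = np.array([-0.85, -0.80, -0.72, -0.74, -0.60, -0.55])

X = sm.add_constant(release_year)
fit = sm.OLS(beta, X).fit()

slope = fit.params[1]
p_two_sided = fit.pvalues[1]
# One-sided test: only a positive slope counts as evidence of catching up.
p_one_sided = p_two_sided / 2 if slope > 0 else 1 - p_two_sided / 2
print(f"slope per year = {slope:.3f}, one-sided p = {p_one_sided:.4f}")
```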
But again, the main question from that particular trend/argument was whether LLMs were catching up with the human baseline slope, which Claude has now basically done for the existing baselines, so that seems pretty strongly confirmatory. It's true that Claude has a somewhat worse logistic intercept than the gpt-5 models (associated with somewhat worse short-task reliability), but it is still better than the human baseline intercept, and the net effect is that Claude beats the human baselines over a much longer range of horizons than other models (per METR's logistic fits).
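For anyone who wants the mechanics, here is a hedged sketch of that intersection logic, assuming a METR-style logistic fit of success probability against log2 task length, p(h) = sigmoid(α + β·log2 h) with β < 0; all parameter values below are made up:

```python
# Hypothetical human-baseline fit: p(h) = sigmoid(alpha + beta * log2(h)),
# with beta < 0 so success probability falls with task length h (minutes).
alpha_h, beta_h = 1.2, -0.45

def winning_range(alpha_m, beta_m):
    """Horizon below which the model's curve sits above the baseline's.

    Assumes the model's slope is steeper than the baseline's (beta_m < beta_h):
      alpha_m + beta_m*log2(h) > alpha_h + beta_h*log2(h)
      <=> (alpha_m - alpha_h) > (beta_h - beta_m) * log2(h)
    """
    return 2 ** ((alpha_m - alpha_h) / (beta_h - beta_m))

# Better intercept but much steeper slope: wins only on short tasks.
print(f"~{winning_range(2.2, -0.70):.0f} min")
# Slope nearly caught up with the baseline: the winning range explodes.
print(f"~{winning_range(1.8, -0.50):.0f} min")
```

On these made-up numbers, nearly matching the baseline's slope stretches the winning range from ~16 minutes to ~4096 minutes, which is the qualitative effect described above.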
As far as what it would mean for humans to not be able to complete really long/difficult tasks at high reliability, I wonder if that comes back to this argument from the post:
"Many people have the intuition that humans can handle tasks of arbitrary length at high reliability, but of course that depends on task difficulty, and while we can extrapolate the METR curve to weeks or months, the existing/actual tasks are short, so it's not clear how to estimate the difficulty of hypothetical long METR tasks. There is a tendency to assume these would just be typical long software engineering tasks (e.g. merely time-consuming due to many fairly straightforward subtasks), but there is not much basis for that assumption, as opposed to longer tasks on this length/difficulty trend being more like 'prove this tricky mathematical theorem', etc"
If you run a linear regression of β versus time, the regression line does have a positive slope estimate (even pre-Claude-4.5), and I used that regression line in section 2.2.1.1 to provide an alternative estimate for when LLMs will catch up with the human baseline (at which point the denominator goes to zero and the LLM's horizon blows up). That said, the trend is quite noisy; and while the OpenAI trend from that estimate was already stat sig, the overall trend across companies did not hit stat sig until the Claude Opus 4.5 datapoint. Also, as I argued in the post (partly due to the noisy trend): "I suspect it makes more sense to directly extrapolate the overall time-horizon estimate rather than linearly extrapolating the noisy logistic coefficient in isolation, even if the slope trend is a useful intuition-pump for seeing why a finite-time blowup is plausible" (btw, the specific projection in that section was based on just OpenAI instruct-tuned models, i.e. post-gpt-3 models)
Also, the underlying question in this case was whether the LLM slope would catch up to the human baseline slope, but it's somewhat moot since that has basically happened with Claude 4.5; and if METR were to collect better-incentivized human baselines (with a better slope β), it seems quite likely that there would be a similar catch-up to match this improved β, leading to another blowup in the updated (human-relative) LLM time horizons.
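To make the blowup mechanics concrete, here is a toy version with made-up numbers (same assumed parameterization as in the sketch above): linearly extrapolate the LLM slope toward a fixed human-baseline slope, and the intersection-based horizon diverges as the denominator goes to zero.

```python
# Hypothetical beta-vs-time regression fit for the LLM logistic slope,
# beta_llm(t) = a + b*t, approaching a fixed human-baseline slope.
a, b = -0.85, 0.08
beta_human = -0.45
alpha_llm, alpha_human = 1.8, 1.2  # hypothetical intercepts, held fixed

t_star = (beta_human - a) / b      # time at which the slopes meet
print(f"slopes meet at t = {t_star:.2f}")

for t in [3.0, 4.0, 4.5, 4.9]:
    beta_llm = a + b * t
    # log2 of the horizon where the two logistic curves intersect;
    # the denominator shrinks to zero as t approaches t_star.
    log2_h = (alpha_llm - alpha_human) / (beta_human - beta_llm)
    print(f"t = {t:.1f}: intersection horizon ~ 2**{log2_h:.1f} minutes")
```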
"Could you actually provide a citation for Claude already being a supercoder?"
I did not claim that Claude is a "supercoder" or even human-level at coding; rather, the Claude addendum continued with: "to be clear, we shouldn't over-interpret this specific 444 billion figure" and "Realistically, this highlights that to really make accurate projections of the time to catch up with human horizons based on METR data, we need better human baselines." In my view, the natural takeaway is that Claude has now basically caught up with METR's existing human baselines, which METR has acknowledged were not that well incentivized; that does not mean Claude is better than properly incentivized software engineers.
However, per the sensitivity analysis, if we assume well-incentivized humans could do ~2x better than METR's baselines on METR's longest benchmark tasks, "then Claude 4.5 Opus has an intersection-based time horizon of only 35.9 minutes", i.e. far from human-level. So as I said in the post, I do think this highlights the need for better human baselines for METR; but while the current horizon estimates are quite sensitive to the baselines, the estimated time to human-level doesn't actually shift that much with this stronger baseline, i.e. from early 2026 to late 2026.
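As a toy illustration of that sensitivity (one possible reading of "~2x better", using the same made-up parameterization as the sketches above): shift the human curve so every success level is reached at double the task length, i.e. replace h with h/2 inside the logistic, giving alpha_h' = alpha_h − beta_h (since log2(2) = 1).

```python
alpha_h, beta_h = 1.2, -0.45   # hypothetical human-baseline fit
alpha_m, beta_m = 1.8, -0.50   # hypothetical Claude-like fit

def cross(a_h):
    # horizon where the model's logistic curve intersects the baseline's
    return 2 ** ((alpha_m - a_h) / (beta_h - beta_m))

print(f"original baseline:   ~{cross(alpha_h):.0f} min")
print(f"2x-better baseline:  ~{cross(alpha_h - beta_h):.0f} min")
```

On these invented numbers the intersection horizon collapses from ~4096 minutes to ~8 minutes, illustrating how sensitive the horizon estimate is to the baseline even when the slope trend (and hence the catch-up date) moves much less.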
In general, the primary point of the post wasn't that the current baselines are good enough to make an accurate prediction of human-level horizons using METR data, but rather: "my main takeaway from this analysis is probably that we shouldn't over-interpret the METR trends at fixed reliability as a direct marker of progress towards human-level software horizons" (because the METR metrics are likely underestimating the progress rate, due to using fixed reliabilities at all horizons).
"What I expect is that the time horizon is exponential until the last few doublings, not hyperbolic."
I provided both theoretical and statistical arguments (e.g. AIC) in the post for why the human-relative time-horizon trend is likely hyperbolic rather than exponential, and your comment does not address or acknowledge either of those arguments. Note that the post does argue METR's own metrics are likely exponential, so the hyperbolic claim is specifically about human-relative time-horizon metrics (per the proposal in the post).
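For readers who want to see what such a model comparison looks like mechanically, here is a toy version on synthetic data (not METR's, and not the post's actual analysis): fit an exponential trend (linear in log horizon) and a hyperbolic trend h(t) = c0 / (t* − t), both on the log scale, and compare AIC.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic horizons (minutes) vs. time, invented for illustration.
t = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
h = np.array([2.1, 2.3, 2.9, 3.4, 4.8, 6.9, 14.5])

def aic(residuals, k):
    # Gaussian AIC up to an additive constant: n*ln(RSS/n) + 2k
    n = len(residuals)
    return n * np.log(np.mean(residuals ** 2)) + 2 * k

# Exponential model: log h = m*t + c (2 parameters)
m, c = np.polyfit(t, np.log(h), 1)
res_exp = np.log(h) - (m * t + c)

# Hyperbolic model: h = c0 / (t_star - t), fit on the log scale (2 parameters);
# bounds keep t_star beyond the last observation so the log stays defined.
def log_hyper(t, c0, t_star):
    return np.log(c0) - np.log(t_star - t)

p_opt, _ = curve_fit(log_hyper, t, np.log(h), p0=[7.0, 4.5],
                     bounds=([1e-6, 3.6], [np.inf, 10.0]))
res_hyp = np.log(h) - log_hyper(t, *p_opt)

print(f"AIC exponential: {aic(res_exp, 2):.1f}")
print(f"AIC hyperbolic:  {aic(res_hyp, 2):.1f}")
```

Both models have two parameters here, so the AIC comparison reduces to which one leaves smaller residuals on the log scale.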
Regarding the 3x/1.25x copy/speedup multipliers, do you have any kind of back-of-the-envelope justification for why those are plausible, e.g. based on the trend of algorithmic progress, such that you might expect roughly this much inference-efficiency gain paralleling the assumed ~10x training-efficiency gain?
@Michaël Trazzi Actually, it's the opposite: the Claude progress was dominated by slope (β) improvement, and the intercept actually got a bit worse: Is METR Underestimating LLM Time Horizons?