Consider everything in this post speculative. I intend to provide updates once I have data from more models, more robust Starburst performance data (especially for older Claude models), and generally higher confidence. This is somewhat less polished than I'd like so that I can publish it before GPT-5 is released or demoed and before METR publishes Claude 4.1 time horizon data (to later see whether the trends discussed here extrapolate to them).
I've written before about METR's time horizon benchmark. While I consider it a valuable benchmark, it doesn't measure exactly what it's trying to. To measure only a model's time horizon, a benchmark would need to vary only task length. Instead, the short tasks tend to be easy and not dependent on specialized knowledge (e.g. doing a web search), whereas the long ones tend to require far greater specialized knowledge and intelligence/problem-solving (e.g. ML coding tasks). So it winds up measuring an amalgamation of time horizon, coding ability, ML knowledge, problem-solving, etc. Very roughly speaking, it's a decent benchmark of (partly narrow) abilities useful for AI automation of AI progress.
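For readers unfamiliar with how the headline number is produced, here's a minimal sketch based on my reading of METR's methodology (not their code): fit a logistic curve of model success against log human task length, and report the task length at which that curve crosses 50%. The task lengths and pass/fail values below are illustrative placeholders, not real benchmark data.

```python
# Minimal sketch of a 50% time horizon calculation (my reading of METR's
# methodology, not their actual code). Data below are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])  # human time per task
succeeded    = np.array([1, 1, 1, 1, 1,  1,  0,  1,   0,   0])    # model pass/fail

# Fit success probability as a logistic function of log2(task length).
X = np.log2(task_minutes).reshape(-1, 1)
fit = LogisticRegression().fit(X, succeeded)

# Solve w * log2(t) + b = 0, i.e. the length where predicted success = 50%.
w, b = fit.coef_[0][0], fit.intercept_[0]
horizon_50 = 2 ** (-b / w)
print(f"50% time horizon ≈ {horizon_50:.0f} minutes")
```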
The above link also describes Starburst, an intelligence benchmark I designed ("intelligence" hereafter basically used as shorthand for novel problem-solving). For the reasons described there, it has seemed promising so far.
If you look at just ChatGPT models, there seems to be a roughly linear relationship between Starburst Intelligence (mapped intuitively to a 1-5 scale long before doing these analyses) and METR 50% time horizons:
The ChatGPT models get more modern the farther right you go, ranging from 3.5 to o3 and o4-mini-high. If both benchmarks are doing their jobs, this can't stay linear: we'd expect time horizon to eventually grow much faster than linearly as a function of intelligence. However, it appears to be linear in this regime.
Interestingly, Claude models seem to follow a roughly similar relationship, with a similar slope but a notably different y-intercept (3.7 Sonnet is on the left; the Claude 4 models are on the right). Claude models seem to be more capable (at least at coding) for their intelligence, despite being stupider than contemporary ChatGPT and Gemini models.
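To make "similar slope, different y-intercept" concrete, here's a minimal sketch of one way that claim could be formalized: a least-squares fit with a shared slope and a per-family intercept offset. The numbers are illustrative placeholders, not real Starburst or METR scores.

```python
# Sketch: fit TH ≈ slope * intelligence + intercept_family, with one shared
# slope and a separate intercept for Claude vs ChatGPT. Placeholder numbers only.
import numpy as np

intelligence = np.array([1.5, 2.5, 3.5, 4.0, 3.0, 3.2])        # hypothetical 1-5 scores
time_horizon = np.array([5.0, 30.0, 90.0, 120.0, 80.0, 110.0])  # hypothetical minutes
is_claude    = np.array([0,   0,    0,    0,     1,    1])      # 0 = ChatGPT, 1 = Claude

# Design matrix: [intelligence, 1, claude indicator] -> shared slope, two intercepts.
X = np.column_stack([intelligence, np.ones_like(intelligence), is_claude])
(slope, chatgpt_intercept, claude_offset), *_ = np.linalg.lstsq(X, time_horizon, rcond=None)

print(f"shared slope: {slope:.1f} min per intelligence point")
print(f"ChatGPT intercept: {chatgpt_intercept:.1f} min")
print(f"Claude intercept offset: {claude_offset:+.1f} min")
```

A positive Claude offset under this kind of fit is what "more capability for their intelligence" would look like quantitatively.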
Now we add Sonnet 3.5 to the plot. It's the lower red point, roughly in line with the ChatGPT trendline. Something strange happened to Claude models around the introduction of chain of thought: they broke from the ChatGPT trendline and moved to one with more capability-per-intelligence. And Claude got stupider in the process: 3.7 Sonnet was dumber (though more capable) than 3.5 Sonnet. Especially assuming that last point isn't noise (which is a nontrivial possibility), what the heck is Anthropic doing here?
I can think of some possibilities, but none of them seem convincing. Maybe Anthropic really sucks at doing CoT RL. That seems unlikely. More plausibly, maybe Anthropic is trying to win the AI race by focusing aggressively on coding, and their unexpectedly slow progress on the time horizon benchmark (Claude 4 Opus, released after o3, was weaker) suggests that this approach isn't working as well as they'd hoped. But that still doesn't explain 3.7 Sonnet getting stupider. Or maybe it's actually an epic big-brained move to get very capable, safe(r) AI by squeezing as much (safer) narrow capability out of as little (dangerous) general intelligence as possible. (Even given this evidence, I'm skeptical of that, but kudos to them if so.) Perhaps you have some idea what's going on here.
One final note: given that Claude models seem to be intelligence-hobbled, we should reserve judgment on how the METR time horizon SOTA trendline is progressing until we see GPT-5's results.