o3's time horizon is ~1.5 hours vs. Claude 3.7's 54 minutes, and o3 is statistically significantly above the long-term trend. It's been less than 2 months since Claude 3.7's release. If time horizon keeps doubling every 3.5 months, as it has over the last year, we have only about 12 more months until it hits 16 hours and we can no longer measure it with HCAST.
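A quick back-of-the-envelope check of that ~12-month figure (a minimal sketch assuming clean exponential growth; the 16-hour ceiling is just the rough upper end of HCAST task lengths):

```python
from math import log2

# Months until a 1.5-hour time horizon reaches 16 hours,
# assuming a clean exponential with a 3.5-month doubling time.
current_horizon_hours = 1.5   # o3's measured time horizon
ceiling_hours = 16            # rough upper end of HCAST task lengths
doubling_time_months = 3.5

doublings_needed = log2(ceiling_hours / current_horizon_hours)  # ~3.4
months_needed = doublings_needed * doubling_time_months         # ~12.0
print(f"{doublings_needed:.1f} doublings -> ~{months_needed:.0f} months")
```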
My guess is that future models' time horizons will double every 3-4 months on well-defined tasks (HCAST, RE-Bench, most automatically scorable tasks) that labs can RL on, while capability on more realistic tasks will follow the long-term 7-month doubling time.
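To make the divergence between those two regimes concrete, here's an illustrative sketch; the shared 1.5-hour starting point is a placeholder (realistic tasks would have their own baseline), and the doubling times are the guesses above, not measurements:

```python
# Illustrative divergence between a 3.5-month and a 7-month doubling time.
def horizon_after(months, start_hours=1.5, doubling_months=3.5):
    """Time horizon in hours after `months` of exponential growth."""
    return start_hours * 2 ** (months / doubling_months)

for months in (6, 12, 24):
    fast = horizon_after(months, doubling_months=3.5)  # well-defined tasks
    slow = horizon_after(months, doubling_months=7.0)  # realistic tasks
    print(f"{months:>2} mo: well-defined ~{fast:.0f} h, realistic ~{slow:.1f} h")
```

Under these (assumed) parameters, the two curves differ by roughly a factor of 3 after one year and more than 10x after two.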
What's your basis for expecting "well-defined tasks" and "realistic tasks" to have very different doubling times going forward? Is the idea that the recent acceleration seems to be driven specifically by RL, and that RL will be applicable to well-defined tasks but not to realistic tasks?
This seems like an extremely important question, so if you have any further thoughts / intuitions / data to share, I'd be very interested.
Yes. RL will at least be more applicable to well-defined tasks. Some intuitions:
This trend will break at some point, e.g. when labs get better at applying RL to realistic tasks, or when RL hits diminishing returns, but I have no idea when.
I thank y'all for rapidly replicating and extending this eval. This is the most important eval extant. Its units (human task-completion time) are truly comparable across models, and it's directly connected to the questions of "coding for ML/AI research" and "long-horizon agency" that seem cruxy for short timelines. I did not expect @Daniel Kokotajlo to be right about the superexponentiality so quickly.
My long-timeline probability mass is increasingly dependent on "this doesn't generalize past formally verifiable domains + formally verifiable domains are insufficient to substantially automate AI algorithmic progress" or "somehow this progress doesn't extend to the arbitrarily messy and novel real world." But it ain't looking good.
Thanks for re-running the analysis!
I agree that RE-Bench aggregate results should be interpreted with caution, given the low sample size. Let's focus on HCAST instead.
A few questions:
(source: I work at METR)
Thanks for the questions!