Also, when I change the "How much easier/harder each coding time horizon doubling gets" parameter by small amounts, the forecasted time from AC to ASI changes significantly (2.7 years at 0.90, over 4 years at 1.00), so it looks like stages 2 and 3 are affected as well.
I'd guess that this is only because compute growth (and human labor growth, though that matters less) during takeoff is slower if takeoff starts later.
Let's test this: the theory would predict that whichever time horizon growth parameter I changed, the takeoff would be the same as long as it ends up starting at the same time:
Ok, looks like I was right. I'm pretty sure that these do affect takeoff, but only by changing the starting date.
Edit: actually sorry, these can also affect takeoff via the coding automation task efficiencies when reaching AC / at the start of takeoff, because if the effective compute requirement is different then the logistic curve has a lower slope rather than just being shifted to the right. My guess is that the compute growth is having a larger impact, but we'd have to do a bit more work to check (either way, each time horizon growth parameter would have the same effect if it resulted in AC happening at the same time, because all the parameters do is set the effective compute requirement for AC).
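To make the "only via the start date" point concrete, here's a toy sketch of the kind of time-horizon extrapolation involved (not our actual model; the horizon numbers and the 2026 start date are made up purely for illustration):

```python
import math
from datetime import date, timedelta

def ac_arrival_date(current_horizon_hours=10.0,   # made-up current 80% time horizon
                    ac_horizon_hours=10_000.0,    # made-up horizon requirement for AC
                    present_doubling_months=5.5,  # roughly my median present doubling time
                    difficulty_growth=0.95,       # <1 means each doubling gets easier
                    start=date(2026, 1, 1)):
    """Each successive time-horizon doubling takes difficulty_growth times as
    long as the previous one; sum the doubling times until the AC horizon."""
    doublings = math.ceil(math.log2(ac_horizon_hours / current_horizon_hours))
    months, step = 0.0, present_doubling_months
    for _ in range(doublings):
        months += step
        step *= difficulty_growth
    return start + timedelta(days=months * 30.44)

# Changing difficulty_growth moves the AC arrival date; the later AC arrives,
# the slower compute (and labor) growth is during takeoff.
for g in (0.90, 0.95, 1.00, 1.05):
    print(g, ac_arrival_date(difficulty_growth=g))
```

Two parameter settings that put AC at the same date should then give the same takeoff, modulo the task-efficiency effect mentioned above.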
Thanks for writing this up! Excited about research taste experiments.
Is human research taste modeled correctly? Eg it seems likely to me that the 0.3% of top humans add more than 0.3%*3.7x to the “aggregate research taste” of a lab because they can set research directions. There are maybe more faithful ways to model it; all the ones Eli mentioned seemed far more complicated.
A minimal change would be to change the aggregation from mean to something else; we were going to do this but didn't get to it in time. But yeah, to do it more faithfully would be pretty complicated, I think, because you have to model experiment compute budgets for each human/AI. Note also that we aren't really modeling human/AI taste complementarity.
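For concreteness, a minimal sketch of what swapping out the mean might look like (the power-mean form and the taste numbers are just illustrative, not what we'd necessarily use, and this still ignores experiment compute budgets and complementarity):

```python
import numpy as np

def aggregate_taste(tastes, p=1.0):
    """Generalized power mean of per-researcher taste multipliers.
    p=1 is the plain arithmetic mean; larger p weights the high-taste tail
    more, crudely capturing top researchers setting research directions."""
    tastes = np.asarray(tastes, dtype=float)
    return np.mean(tastes ** p) ** (1.0 / p)

# Illustrative lab: 997 median-ish researchers plus 3 top researchers at 3.7x.
tastes = [1.0] * 997 + [3.7] * 3
print(aggregate_taste(tastes, p=1))  # plain mean: ~1.01
print(aggregate_taste(tastes, p=4))  # tail-weighted: ~1.12
```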
Or, they could coordinate better (especially with all the human ex-coders to help them), and decrease the parallelization penalties for labor and/or compute
Agree that ideally there would at least be different penalties for AIs vs. humans doing the labor.
Is modeling AI research taste as exponential in human standard deviations valid? I have no idea whether someone 9 standard deviations above the human median would be able to find 3.7^(9/3) = 50x better research ideas or not.
Note that because of limits (which weren't in your summary) the model is in practice subexponential, but exponential is generally a good approximation for the model around the human range. See here (4.2.2) for an explanation of taste limits.
Regarding whether it's a good approximation in the human range, we have some n=12 survey results on this here (obviously take them with a huge grain of salt), but extracted from these results the ratio of (taste per SD between the 90th percentile and top researchers) to (taste per SD between the 50th percentile and top) appears to be fairly close to 1: a median of 1.01 if assuming a population of 1000 researchers, and 0.95 if assuming a population of 100.
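For anyone who wants to sanity-check that kind of ratio themselves, here's roughly the calculation, with placeholder multipliers rather than the actual survey answers (the convention for where the top researcher sits in the distribution is also an assumption):

```python
import math
from statistics import NormalDist

def taste_per_sd(taste_ratio, from_percentile, population):
    """log-taste gained per SD going from from_percentile up to the top
    researcher, assuming the top researcher sits at the 1 - 1/(2N) quantile."""
    z = NormalDist().inv_cdf
    sd_gap = z(1 - 1 / (2 * population)) - z(from_percentile)
    return math.log(taste_ratio) / sd_gap

# Placeholder answers: the top researcher's ideas are X times better than the
# 90th / 50th percentile researcher's (not the real n=12 medians).
top_vs_90, top_vs_50 = 8.0, 40.0
for n in (1000, 100):
    ratio = taste_per_sd(top_vs_90, 0.90, n) / taste_per_sd(top_vs_50, 0.50, n)
    print(n, round(ratio, 2))  # near 1 supports exponential-in-SD in the human range
```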
Thanks for the thoughts! I'm not sure I exactly understand your point. I do think that we should think about the relationship between the time horizon and the AC effective compute requirement directly, which is why we chose to use this to set the effective compute requirement. If a model has achieved very high time horizons, then we think that this is direct evidence for them being an AC. Note that we also optionally have an effective compute gap as well, to be added after reaching the time horizon requirement.
I'm also not sure what you mean about the relationship being 1-1, i.e. why we should increase the effective compute requirement rather than decrease it if we had instead decided to anchor the requirement to AI R&D speedup. Why would we think that setting the effective compute requirement via AI R&D speedup would predictably give a higher requirement? I don't think the METR trend being superexponential implies anything one way or the other. They are just different metrics, so we would use a totally different method if we had instead tried to set the requirement using AI R&D speedup; I'm not immediately sure what method would be best, given that we don't have a trend for it. If we had more high-quality data on coding uplift over time, that could help, and I think it would be a reasonable alternative thing to extrapolate (Daniel discusses this a bit in the post), but I don't have a prior on whether it would lead to a lower or higher requirement than extrapolating time horizons (in fact, a quick guess would be that it would lead to a lower requirement, given Opus 4.5's reported much higher uplift than Sonnet 4.5).
I think revenue extrapolations seem like a useful exercise. But I think they provide much less evidence than our model.
Which revenues would you extrapolate? You get different results for e.g. doing OpenAI vs. Nvidia.
Also (most importantly) are you saying we should assume that log(revenue) is a straight line?
edited to add: relevant graph from https://epoch.ai/gradient-updates/openai-is-projecting-unprecedented-revenue-growth:
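As a toy version of why both of those choices matter (the revenue levels, growth rates, and the $1T threshold below are made up, not estimates of any particular company):

```python
import math

def years_to_threshold(current_revenue_b, annual_growth, threshold_b=1000.0):
    """If log(revenue) is a straight line (constant growth factor), years until
    annual revenue reaches a hypothetical 'transformative' threshold ($B)."""
    return math.log(threshold_b / current_revenue_b) / math.log(annual_growth)

# Made-up inputs: which series you extrapolate and what growth you assume
# dominate the answer.
print(years_to_threshold(current_revenue_b=20, annual_growth=3.0))   # ~3.6 years
print(years_to_threshold(current_revenue_b=130, annual_growth=1.4))  # ~6.1 years
```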
much more clear threshold for AGI
Also I disagree with this, I think time horizon is about as good as revenue on this dimension, maybe a bit better. Both are hugely uncertain though of course.
But I consider it fairly likely that mid-term (maybe 50% of the way to TAI, in years) safe AI systems are likely to outperform humans on AI safety strategy and the large majority of the research work.
The best humans, or the median humans who do that work, or something else?
Discrete capabilities progress seems slower this year than in 2024 (but 2024 was insanely fast). Kudos to this person for registering predictions and so reminding us what really above-trend would have meant concretely. The excellent forecaster Eli was also over-optimistic.
I haven't done a thorough look, but I think so far progress is somewhat below my predictions, though not by a huge amount, and there are still a few weeks left in the year (if the AI 2025 predictions are what you're referring to).
I believe the SOTA benchmark scores are higher than I predicted for Cybench, right on for OSWorld, and lower for RE-Bench, SWE-Bench Verified, and FrontierMath. RE-Bench is the one I was most wrong on though.
For non-benchmark results, I believe that the sum of annualized revenues is higher than I predicted (but the Americans' importance lower). I think that OpenAI has hit both CBRN high and Cyber medium. They've removed/renamed model autonomy and persuasion.
I was only referring to our AI timelines mode; in this case it's defined as the most likely year in which the superhuman coder arrives.
In general the concept of a mode for most of the scenario decisions seems not well defined, since e.g. for choices that aren't naturally numeric it depends on how you define the categories and what past events you condition on (for the timelines mode we're conditioning on the starting point, but in other cases one might condition on all events thus far).
I would personally describe our process as a mixture of sampling what intuitively feels most likely at each point (which might e.g. correspond to the mode of a natural categorical breakdown, or of a distribution conditional on all events thus far, though we mostly didn't explicitly calculate this), while also optimizing for things not being too degenerate and for the whole thing intuitively feeling like a plausible trajectory (because taking the mode at every step would by default look unlike what we actually expect, in the sense that in the real world there will be many surprises).
As an example of how much definitions matter here: if we just conditioned on the previous conditions for each month and sampled what big algorithmic improvements might happen, treating this as a categorical variable that enumerated many possible improvements, we might never end up with any specific algorithmic improvements, or end up with them quite late in the game. But if we instead start from the judgment that overall some will probably come before the superhuman coder, and then pick the ones we think are most likely, even though any individual one may be <50% to arrive this quickly (though that's not totally clear in this case) and <<50% in any individual month, then we end up with neuralese recurrence and a shared memory bank right before SC.
Perhaps a simpler example of how categorization matters: if we break down possible AI goals very granularly, then "very well aligned" gets more probability than any individual very specific misaligned goal. But we overall put more probability on misalignment in this scenario, so we first make that high-level choice and then choose one of the most likely specific misaligned goals.
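A toy numeric version of the categorization point (the probabilities are made up purely to illustrate):

```python
# Coarse breakdown: misalignment is more likely overall...
coarse = {"well aligned": 0.3, "misaligned (any goal)": 0.7}

# ...but split misalignment into many specific goals and "well aligned" becomes
# the single most probable category, even though it hasn't gotten any likelier.
granular = {"well aligned": 0.3}
granular.update({f"misaligned goal {i}": 0.07 for i in range(10)})

print(max(coarse, key=coarse.get))      # -> misaligned (any goal)
print(max(granular, key=granular.get))  # -> well aligned
```

So we first decide the coarse question (aligned vs. misaligned), then pick among the specific misaligned goals.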
The authors have emphasized repeatedly that AI 2027 was and is faster than their mode scenario, which makes doing this kind of evaluation annoying,
We've said that it was faster than our median, not our mode. I think it was close to most of our modes at the time of publication; mostly we were at around 2027-2028.
But the evaluation itself seems useful either way, in terms of checking in on how things are going relative to the trajectory that was our best guess conditional on the AGI timelines depicted.
Thanks for the comments! Besides the below, I'm curious what your overall views are. What does your distribution for AC look like?
I think this is basically addressed in our uncertainty over the present doubling time, at least that's how I'd think of it for myself. Note that my median present doubling time estimate of 5.5 months is slower than the potentially accelerated recent time horizon trend.
Our model reflects that: with my median parameters, the current software R&D uplift is 1.1x.
Our Automated Coder has efficiency definitions built in, so you wouldn't put it that way, you'd instead say you get an Automated Coder in 2035 and a very expensive replication of AC abilities in 2031. I personally think that a large majority of the relevant recent gains have not come from inference scaling, but if I did think that a lot of it had been, I would adjust my present doubling time to be slower.
Here are some portions of a rough Slack message I wrote recently on this topic:
[end Slack message]
Furthermore, it seems that once capabilities can be reached very expensively, they pretty reliably get cheaper very quickly. See here for my research into this, or just skip to Epoch's data, which I used as input to my parameter estimate; happy to answer questions, and sorry that my explanation is pretty rough.
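For flavor, here's roughly the kind of calculation that estimate involves, with placeholder prices rather than Epoch's actual data:

```python
import math

# Placeholder (year, $ cost to hit some fixed capability level) observations.
observations = [(2023.0, 60.0), (2023.5, 18.0), (2024.0, 5.0), (2024.5, 1.5)]

# Least-squares fit of log(cost) against time gives an annual decline factor.
n = len(observations)
xs = [t for t, _ in observations]
ys = [math.log(c) for _, c in observations]
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)

print(f"cost falls ~{math.exp(-slope):.0f}x per year at fixed capability")
```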
I think I already addressed at least part of this in my answer to (2).
I don't understand this. What exactly do you want us to address, and why should we adjust our predictions because of it? We do explicitly say we are assuming a hypothetical "METR-HRS-Extended" benchmark in our explanation. Maybe you are saying that it will be hard to create long-horizon tasks, which will slow down the trend? I would say that I adjust for this when I make my all-things-considered AC prediction longer due to the potential for data bottlenecks, and also to some extent by making the doubling difficulty growth factor higher than it otherwise would be.
Yeah, for the parts I didn't explicitly respond to, my response is mainly that this sort of inside-view reasoning seems valuable, but overall I give more weight to trend extrapolation; historically, simple trend extrapolations like "when will we have the same ops/second as the human brain" have performed pretty well, as we discuss in our blog post.