Someone referred me back to this post for comment, so I want to share a couple of updates on how we think about training run lengths at Epoch.
First, we now have better data. Across notable models, we have seen training run lengths get longer by around 30% per year over the last decade. Naively extrapolated, this implies roughly 3x longer training runs by the end of the decade. Recent large training runs often take up to 90 days (e.g. Llama 3), so this would put us at roughly nine-month training runs by then.
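As a back-of-the-envelope check (my own arithmetic, not an official Epoch projection), compounding 30% per year from a ~90-day baseline gives roughly 3x, and hence roughly nine months, after about four to five years:

```python
# Naive extrapolation: training run length growing ~30%/year from a ~90-day baseline.
baseline_days = 90
growth_rate = 1.30  # ~30% per year

for years in range(4, 7):
    factor = growth_rate ** years
    days = baseline_days * factor
    print(f"{years} years: ~{factor:.1f}x longer, ~{days / 30:.0f} months")
```

This prints roughly 2.9x / 9 months at four years, 3.7x / 11 months at five, and 4.8x / 14 months at six, so the quoted figures correspond to a horizon of about four to five years of continued growth.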
Second, I still believe the argument given in the original post is coherent and makes for a compelling upper bound, after accounting for uncertainty on the relevant trends.
This is not the only consideration that goes into deciding how long to train for. In practice, my understanding is that developers are mostly weighing the improvement they see during training against the costs of a delayed release in terms of attention and market share. But I still expect the upper bound of ~a year to be roughly binding, at least while hardware and algorithmic improvements continue progressing as fast as in recent years.
For clarity, at the moment of writing I felt that was a valid concern.
I no longer find this compelling personally, though I think at least some of our stakeholders would be concerned if we published work that significantly sped up AI capabilities and investment, which is a perspective we keep in mind when deciding what to work on.
I never thought that just because something speeds up capabilities it is automatically something we shouldn't work on. We are willing to make trade-offs here in service of our core mission of improving the public understanding of the trajectory of AI. And in general we make a strong presumption in favour of freedom of knowledge.
I'm talking from a personal perspective here as Epoch director.
I'm also going to risk sharing more internal stuff without coordinating on it, erring on the side of oversharing. There is a chance that other management at Epoch won't endorse these takes.
The ability to pay out liability is an important factor, and this illustrates it well. For the largest prosaic catastrophes it might well be the dominant consideration.
For smaller risks, I suspect that in practice mitigation, transaction and prosecution costs are what dominate the calculus of who should bear the liability, both in AI and more generally.
What's the FATE community? Fair AI and Tech Ethics?
We have conveniently just updated our database if anyone wants to investigate this further!
https://epochai.org/data/notable-ai-models
Here is a "predictable surprise" I don't see discussed often: given the advantages of scale and centralisation for training, it does not seem crazy to me that some major AI developers will pool resources in the future to jointly train large AI systems.
I've been tempted to do this sometimes, but I fear the prior is performing one very important role you are not making explicit: defining the universe of possible hypotheses you consider.
In turn, defining that universe of hypotheses determines what Bayesian updates look like. Here is a problem that arises when you ignore this: https://www.lesswrong.com/posts/R28ppqby8zftndDAM/a-bayesian-aggregation-paradox
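To illustrate with made-up numbers (a minimal sketch, not taken from the linked post): two agents can share the same prior odds on a coarse hypothesis and see the same evidence, yet reach different posteriors, because they partition the alternatives differently.

```python
# Hypothetical example: hypothesis A vs "not A", where "not A" splits into B and C.
# Both agents put 1/3 on A and 2/3 on "not A", but weight B and C differently.
likelihood = {"A": 0.9, "B": 0.1, "C": 0.5}  # P(evidence | hypothesis), made up

def posterior_of_a(prior):
    joint = {h: prior[h] * likelihood[h] for h in prior}
    return joint["A"] / sum(joint.values())

agent_1 = {"A": 1 / 3, "B": 1 / 2, "C": 1 / 6}
agent_2 = {"A": 1 / 3, "B": 1 / 6, "C": 1 / 2}

print(posterior_of_a(agent_1))  # ~0.69
print(posterior_of_a(agent_2))  # ~0.53
```

Same coarse prior, same evidence, different posteriors: the choice of hypothesis universe (and the prior over it) is doing real work in the update.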
shrug
I think this is true to an extent, but a more systematic analysis would need to back this up.
For instance, I recall quantization techniques working much better after a certain scale (though I can't seem to find the reference...). It also seems important to validate that techniques to increase performance apply at large scales. Finally, note that the frontier of scale is growing very fast, so even if these discoveries were made with relatively modest compute compared to the frontier, that is still a tremendous amount of compute!
Quick comment: this is not correct. As of this time, we have not evaluated Grok 4 on FrontierMath Tier 4 questions. Our preliminary evaluation was conducted only with Tier 1-3 questions.