A proposed method for forecasting transformative AI

[-]Daniel Kokotajlo3yΩ8158

Nice post, and nice argument! I think I agree that this is a worthy alternative to the Bio Anchors brain-and-genome-size-comparison stuff. I even tentatively agree that it's better overall, though I'd want to think about it more. (When I go on about how great Bio Anchors is, it's not because I'm in love with the brain size comparison--though I used to like it more than I do now--it's because I'm in love with the "useful core" of it, the breakdown into computing price-performance, willingness to spend, algorithmic progress, and compute requirements. Which it seems you also are doing.)

This type of reasoning personally convinced me that a reasonable hard upper bound for training TAI was about 10^40 FLOP, with something between 10^30 and 10^35 FLOP as my central estimate for the training requirements, using 2022 algorithms.
...
If we're also given estimates for growth in computing price-performance, willingness to spend, and algorithmic progress, then it is possible to provide a distribution over dates when we expect TAI to arrive.
...
Alternatively, you can incorporate this approach into Tom Davidson's takeoff model to build a more theoretically grounded timelines model but I have not done that yet.

Tom's model just uses a training requirements variable, it doesn't appeal to all the fancy bio anchors stuff that your method is a viable alternative to. Insofar as you are still using what I consider the "useful core" of the bio anchors model then I think your bottom line numbers for compute requirements can just be straightforwardly plugged into Tom's model.

So I just went and plugged in the values of 10^30 and 10^35 FLOP for the training requirements variable at takeoffspeeds.com. Playing around with it a bit (modifying their preset scenarios) it looks like this gives you somewhere between 2029 and 2044.

Though if you also do what I recommend and increase software returns from 1.25 to 2.5, to be more consistent with the data we have so far about algorithmic progress, the top end of the range cuts down considerably: Now you get 2027 - 2033 as the range.
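To make the "useful core" breakdown above concrete, here is a toy point-estimate sketch in Python. It is not Tom's model (which has feedback loops this ignores), and every starting value and growth rate below is a placeholder assumption chosen for illustration, not a number from the post or from takeoffspeeds.com. The function name and parameters are hypothetical.

```python
def year_requirement_met(
    required_flop_2022_algos,
    start_year=2022,
    start_spend_usd=1e7,       # ASSUMED budget of the largest training run
    start_flop_per_usd=1e17,   # ASSUMED computing price-performance
    spend_growth=1.3,          # ASSUMED yearly multiplier on willingness to spend
    price_perf_growth=1.3,     # ASSUMED yearly multiplier on FLOP per dollar
    algo_growth=1.6,           # ASSUMED yearly multiplier from algorithmic progress
    max_year=2200,
):
    """Return the first year in which effective compute (physical FLOP scaled
    by algorithmic progress since start_year) meets the requirement, which is
    expressed in start-year-algorithm FLOP. Returns None if never met."""
    for year in range(start_year, max_year + 1):
        t = year - start_year
        physical_flop = (
            start_spend_usd * spend_growth**t
            * start_flop_per_usd * price_perf_growth**t
        )
        effective_flop = physical_flop * algo_growth**t
        if effective_flop >= required_flop_2022_algos:
            return year
    return None


if __name__ == "__main__":
    for requirement in (1e30, 1e35):
        print(requirement, year_requirement_met(requirement))
```

The point of the sketch is only that a compute requirement is a single input: swapping in a different requirement estimate leaves the other three trends untouched.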

[-]p.b.3y121

I applaud the effort, but I confess that I didn't read the post very carefully, because so many problems with this approach jumped out at me. So it is entirely possible that some of the following points of critique are already dealt with in your text, in which case, sorry about that:

My understanding is that the irreducible part of the loss has nothing (necessarily) to do with "entropy of natural text" and even less with "roughly human-level" - it is the loss this particular architecture for this particular training regime can reach in the limit on this particular training data distribution.

Nothing more, nothing less.

Other architectures with other training regimes will have scaling laws with a different irreducible loss. Human level of token prediction is way worse than probably any GPT, so why would that loss indicate human level of reasoning?

To actually have an "anchor" you'd have to look at human performance on benchmarks and estimate the range of loss that might indicate human level of reasoning.

I also don't understand why the loss on random internet text should be the same as on scientific papers. Loss differs from domain to domain. Scaling laws likely do, too.

Also, new (significant) papers are not sampled from the distribution of papers. They are out of distribution because they go beyond all previous papers. So I'm not sure your formula doesn't just measure the ability to coherently rehash stuff that is already known.

Of course if you'd actually measure the ability to write a new scientific paper, the model would be ridiculously overpowered, because humans need 6 or 7 orders of magnitude more time to create such a paper (at least with current sampling speeds).

[-]Matthew Barnett3y*80

I think your critiques are great since you're thinking clearly about how this approach is supposed to work. At a high level my reply to your comment is something like, "I basically agree, but don't think that anything you mentioned is devastating. I'm trying to build something that is better than Bio Anchors, and I think I probably succeeded even with all these flaws."

That said, I'll address your points more directly.

My understanding is that the irreducible part of the loss has nothing (necessarily) to do with "entropy of natural text" and even less with "roughly human-level" - it is the loss this particular architecture for this particular training regime can reach in the limit on this particular training data distribution.

That's correct, but if the hypothesis space is sufficiently large, then the $E$ term in the Hoffmann et al. equation for loss should actually correspond to the entropy of natural language. The reason comes down to how entropy is defined. Roughly speaking, language entropy can be defined as the limit of a certain functional approximation of entropy as the capacity of that model goes to infinity. This is strikingly similar to what I'm doing with the Hoffmann et al. equation.

The main difference is that, in the case of Hoffmann et al., we can't be sure that a sufficiently large model trained over a sufficiently large amount of data would actually converge onto the entropy of the English language. In fact, we have reason to believe that it wouldn't, due to constraints like the limited size of the context window.

However, I don't see any fundamental issue with treating $E$ as the entropy of whatever distribution we're talking about, so long as our hypothesis space is vast enough.
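To spell this argument out in symbols (a sketch: the parametric form is the one fitted in Hoffmann et al., with $N$ the parameter count and $D$ the number of training tokens; writing $p$ for the true text distribution and $q_\theta$ for the model is my own notation here):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad \lim_{N, D \to \infty} L(N, D) = E \quad (\alpha, \beta > 0)$$

$$H(p, q_\theta) = H(p) + \mathrm{KL}(p \,\|\, q_\theta) \;\ge\; H(p)$$

The first line says the fitted loss tends to $E$ as both model size and data grow. The second says cross-entropy is bounded below by the true entropy $H(p)$. So $E = H(p)$ holds only if the model class can drive the KL term to zero in the limit, which is the "sufficiently large hypothesis space" condition. If something like a limited context window prevents that, $E$ would sit above $H(p)$.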

In this post, I'm not trying to lean into the Hoffmann et al. results per se, except to the extent that they provided the best current source of data on scaling language models. Indeed, initially I didn't even present any results at all, until someone persuaded me to put a preliminary CDF over TAI timeline in the post to make it more interesting. I'm mostly trying to explain the approach, which can certainly be improved upon if we get better data about the scaling properties of models trained on a more relevant distribution, like scientific papers.

Human level of token prediction is way worse than probably any GPT, so why would that loss indicate human level of reasoning?

I think this question might rest on a misconception that also came up a few times when I was sharing this draft with people, so it's perhaps important to put this caveat into the post more directly. This approach has almost nothing to do with human-level abilities to predict tokens. It's based on something completely different, which is more similar to whether humans can distinguish long outputs written by language models from outputs written by humans. There are good reasons to believe that these two approaches should yield very different results.
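One way to make the "distinguishing long outputs" idea concrete is the following rough sketch. It rests on assumptions I am adding for illustration: that an ideal discriminator reading human-written text gains, in expectation, a log-likelihood ratio per token equal to the KL divergence between the human distribution and the model, and that this KL divergence can be identified with the reducible loss (loss minus $E$, in nats per token). The function name and the odds threshold are hypothetical.

```python
import math


def tokens_to_distinguish(reducible_loss_nats, target_odds=9.0):
    """Expected number of tokens an ideal discriminator needs before its
    accumulated log-likelihood ratio reaches log(target_odds).

    ASSUMPTION: evidence accumulates at `reducible_loss_nats` per token,
    i.e. the per-token KL divergence between human text and model output.
    """
    if reducible_loss_nats <= 0:
        # A model that matches the distribution exactly is never distinguishable.
        return math.inf
    return math.log(target_odds) / reducible_loss_nats


if __name__ == "__main__":
    for loss in (0.1, 0.01, 0.001):
        print(loss, tokens_to_distinguish(loss))
```

Under these assumptions, halving the reducible loss doubles the length of text needed to tell model from human, which is a different quantity from how well a human can predict the next token.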

Also, new (significant) papers are not sampled from the distribution of papers. They are out of distribution because they go beyond all previous papers. So I'm not sure your formula doesn't just measure the ability to coherently rehash stuff that is already known.

This is a good point. Ultimately, my response is that I'm trying to measure something like the hardness of training a model to think reliably over long sequences, rather than something like the hardness of training a model to copy the exact distribution it's trained on. We can already see that current models, like GPT-3.5, can often produce novel results (e.g. poetry) despite the intuition that they shouldn't be able to "go beyond" their training distribution. I think this points to something important and true, which is that language models seem to be learning how to do the stuff that in practice allows them to write content on par with what humans write, rather than merely learning how to emulate their training distribution.

[-]NunoSempere3y125

I thought this was great, cheers.

Here:

Next, we estimate a sufficient horizon length, which I'll call the k-horizon, over which we expect the most complex reasoning to emerge during the transformative task. For the case of scientific research, we might reasonably take the k-horizon to roughly be the length of an average scientific paper, which is likely between 3,000 and 10,000 words. However, we can also explicitly model our uncertainty about the right choice for this parameter.

It's unclear whether the final paper would be the needed horizon length.

For analogous reasoning, consider a model trained to produce equations which faithfully describe reality. These equations tend to be quite short. But I imagine that the horizon length needed to produce them is larger, because you have to keep many things in mind when doing so. Unclear if I'm anthropomorphizing here.

[-]NunoSempere3y21

I am also curious about the extent to which you are taking the Hoffmann scaling laws as an assumption, rather than as something you can assign uncertainty over.

[-]Richard Korzekwa3y100

This is cool! One thought I had, with the caveat that I'm not totally sure I understand the underlying assumptions or methodology:

Of course, real scientific research involves more than merely writing research papers. It involves proposing hypotheses, devising experiments, and collecting data, but for now, let's imagine that we can simplify all these steps into one step that involves writing high quality research papers. This simplification may not be entirely unrealistic, since if the papers are genuinely judged to be high quality and not fraudulent or p-hacked etc., then presumably they are the end result of a process that reliably performs all the essential steps to proper scientific research.

I tentatively agree there is a real sense in which "write a good scientific paper" has as a subtask "do good scientific research", such that a process that can do the former will need to learn to do the latter. The scaling laws for LLM are based on looking at lots of text and getting very good at prediction, and presumably this involves learning patterns about how the real world works, which definitely feels like a move in the direction of "do good scientific research".

But most science requires actually looking at the world. The reason we spend so much money on scientific equipment is because we need to check if our ideas correspond to reality, and we can't do that just by reading text. I imagine the outcome of some publishable experiments could be predicted with high confidence just by looking at existing text, but I expect there to be enough of a difference between "look at text and learn patterns" and "look at the world and learn patterns" that it's unclear to me if we should expect the scaling laws to hold once the models are running a process that is actually doing science. (Like maybe the "irreducible loss" is reducible once you start looking at the world?) Or maybe it's very similar and the scaling laws will take us all the way to high-quality text-mediated real-world experiments or whatever.

[-]Matthew Barnett3y104

But most science requires actually looking at the world. The reason we spend so much money on scientific equipment is because we need to check if our ideas correspond to reality, and we can't do that just by reading text.

I agree. The primary thing I'm aiming to predict using this model is when LLMs will be capable of performing human-level reasoning/thinking reliably over long sequences. It could still be true that, even if we had models that did that, they wouldn't immediately have a large scientific/economic impact on the world, since science requires a lot more than thinking well. (There are also a number of other issues like models hallucinating, even if they're capable of reasoning coherently, but I'll set aside that for now.)

The original definition of transformative AI referred broadly to economic, social and scientific progress society-wide, rather than just AI that reasoned and understood the world broadly as well as a human does. An implicit assumption in this blog post is that TAI will follow after the development of high quality AI thinkers. That's not an assumption I defended because I felt it was separate from the purpose of the report. To be clear, I also find this assumption questionable, so I suppose it's worth clarifying that in the post.

I consider it important to explore what we should expect after the development of high quality reasoners, which I expect we can do by incorporating this framework into the context of a more general takeoff model, such as Tom Davidson's recent model. I have yet to bridge these two models, but I suspect after bridging them, we might get more insight into this question.

[-]Richard Korzekwa3y40

The primary thing I'm aiming to predict using this model is when LLMs will be capable of performing human-level reasoning/thinking reliably over long sequences.

Yeah, and I agree this model seems to be aiming at that. What I was trying to get at in the later part of my comment is that I'm not sure you can get human-level reasoning on text as it exists now (perhaps because it fails to capture certain patterns), that it might require more engagement with the real world (because maybe that's how you capture those patterns), and that training on whichever distribution does give human-level reasoning might have substantially different scaling regularities. But I don't think I made this very clear and it should be read as "Rick's wild speculation", not "Rick's critique of the model's assumptions".

[-]Matthew Barnett3y40

training on whichever distribution does give human-level reasoning might have substantially different scaling regularities.

I agree again. I talked a little bit about this at the end of my post, but overall I just don't have any data for scaling laws on better distributions than the one in the Chinchilla paper. I'd love to know the scaling properties of training on scientific tasks and incorporate that into the model, but I just don't have anything like that right now.

Also, this post is more about the method rather than any conclusions I may have drawn. I hope this model can be updated with better data some day.

[-]Wei Dai2yΩ572

I'm confused about how heterogeneity in data quality interacts with scaling. Surely training a LM on scientific papers would give different results from training it on web spam, but data quality is not an input to the scaling law... This makes me wonder whether your proposed forecasting method might have some kind of blind spot in this regard, for example failing to take into account that AI labs have probably already fed all the scientific papers they can into their training processes. If future LMs train on additional data that have little to do with science, could that keep reducing overall cross-entropy loss (as scientific papers become a smaller fraction of the overall corpus) but fail to increase scientific ability?
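A small worked example of the worry, with entirely made-up numbers: if overall loss is a token-weighted average of per-domain losses, it can keep falling while the loss on scientific text stays flat. The domain names, weights and loss values below are hypothetical.

```python
def mixture_loss(weights, losses):
    """Token-weighted average cross-entropy over domains.

    `weights` maps domain -> share of tokens (normalized here),
    `losses` maps domain -> per-token loss on that domain.
    """
    total = sum(weights.values())
    return sum(weights[d] / total * losses[d] for d in weights)


# HYPOTHETICAL before/after: science loss is unchanged, web loss improves,
# and science shrinks as a share of the corpus.
before = mixture_loss({"science": 0.2, "web": 0.8}, {"science": 2.0, "web": 3.0})
after = mixture_loss({"science": 0.1, "web": 0.9}, {"science": 2.0, "web": 2.8})

if __name__ == "__main__":
    print(before, after)
```

In this toy case the aggregate number improves even though nothing about the scientific domain has changed, which is why a forecast keyed to aggregate loss could miss it.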

[-]Jacob Pfau3y50

The assumptions of stationarity and ergodicity are natural to make, but I wonder if they hide much of the difficulty of achieving indistinguishability. Suppose we think of text sequences as the second part of a sequence where the first part is composed of whatever non-text world events preceded the text (or even more text data that was dropped from the context). I'd guess a formalization of this would violate stationarity or ergodicity. My point here is a version of the general causal confusion / hallucination points made previously e.g. here.

This is, of course, fixable by modifying the training process, but I think it is worth flagging that the stationarity and ergodicity assumptions are not arbitrary with respect to scaling. They are assumptions which likely bias the model towards shorter timelines. Adding more of my own inside view, I think this point is evidence for code and math scaling accelerating ahead of other domains. In general, any domain where the training process can be cheaply modified to allow models to take causal actions (which deconfound/de-hallucinate) should be expected to progress faster than other domains.

[-]Stephen McAleese3y30

Strong upvote. I think the methods used in this post are very promising for accurately forecasting TAI for the reasons explained below.

While writing GPT-4 Predictions I spent a lot of time playing around with the parametric scaling law L(N, D) from Hoffmann et al. 2022 (the Chinchilla paper). In the post, I showed that scaling laws can be used to calculate model losses and that these losses seem to correlate well with performance on the MMLU benchmark. My plan was to write a post extrapolating the progress further to TAI until I read this post which has already done that!
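For readers who want to play with the same law, here is a minimal sketch of the parametric form L(N, D) = E + A/N^alpha + B/D^beta. The default constants are the fitted values as I recall them from the Chinchilla paper; treat them as an assumption and check them against the paper before relying on them. The function name is my own.

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta.

    n_params: model parameter count N
    n_tokens: training tokens D
    Defaults are ASSUMED to match the fit reported in Hoffmann et al. 2022.
    """
    return E + A / n_params**alpha + B / n_tokens**beta


if __name__ == "__main__":
    # Roughly Chinchilla-sized: 70B parameters, 1.4T tokens.
    print(chinchilla_loss(70e9, 1.4e12))
    # Scaling both by 10x lowers the loss, but never below E.
    print(chinchilla_loss(700e9, 14e12))
```

The loss is strictly decreasing in both arguments and bounded below by E, so any forecast built on it is really a statement about how fast the two reducible terms shrink.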

Scaling laws for language models seem to me like possibly the most effective option we have for forecasting TAI accurately for several reasons:

It seems as though the closest ML models to TAI that currently exist are language models and therefore predictive uncertainty should be lower for forecasting TAI from language models than from other types of less capable models.
A lot of economically valuable work such as writing and programming involves text and therefore language models tend to excel at these kinds of tasks.
The simple training objective of language models makes it easier to reason about their properties and capabilities. Also, despite their simple training objective, large language models demonstrate impressive levels of generalization and even reasoning (e.g. chain-of-thought prompting).
Language model scaling laws are well-studied and highly accurate for predicting language model losses.
There are many existing examples of language models and their capabilities. Previous capabilities can be used as a baseline for predicting future capabilities.

Overall, my intuition is that language model scaling laws require far fewer assumptions and much less guesswork for forecasting TAI, and therefore should allow narrower and more confident predictions, which your post seems to show (<10 OOM vs 20 OOM for the bio anchors method).

As I mentioned in this post there are limitations to using scaling laws such as the possibility of sudden emergent capabilities and the difficulty of predicting algorithmic advances.

Exceptions include deep RL work by DeepMind such as AlphaTensor.

[-]Nathan Helm-Burger3y3-1

In Stephen Byrnes's recent post https://www.lesswrong.com/posts/PDx4ueLpvz5gxPEus/why-i-m-not-working-on-debate-rrm-elk-natural-abstractions he states:

"Two different perspectives are:

AGI is about knowing how to do lots of things
AGI is about not knowing how to do something, and then being able to figure it out.

I’m strongly in the second camp."

I very much agree with this, and I think measuring perplexity confuses these two things (ability to recite/compose memorized knowledge vs ability to generate novel insights). I think you need an entirely different benchmark, based on something like giving a model only the information available to Newton before he came up with his laws of motion and seeing if the model can come up with equivalent insights.

[-]Hailey Collet3y30

I used the first chart, the compute required for GPT3, and my personal assessment that ChatGPT clearly meets the cutoff for tweet length, very probably meets it for short blog (but not by a wide margin), and clearly does not meet it for research paper, to create my own 75th percentile estimate for human slowdown of 25-75. It moves the P(TAI<=year) = 50% from ~2041 to ~2042, and the 75% from ~2060 to ~2061. Big changes! 😂

[-]Cameron Holmes3y30

Excellent post, thank you. I look forward to playing with the notebook model.

One observation on the timelines (which I realise is somewhat tangential to the core content of the post) is that your estimates around compute cost scaling feel quite optimistic to me.

Memory and compute density scaling is already dropping off. Meanwhile, everyone seems to be struggling with yields on sub-3nm nodes despite huge spend, exacerbating the problem for cost scaling.

https://fuse.wikichip.org/news/7343/iedm-2022-did-we-just-witness-the-death-of-sram/

https://www.semianalysis.com/p/tsmcs-3nm-conundrum-does-it-even

We don't really have many more promising technologies in the pipeline to radically get cost per transistor down so I expect improvements to slow quite a bit towards the end of this decade.

[-]Hailey Collet3y10

Your assertion that we don't have many things to reduce cost per transistor may be true, but is not supported by the rest of your comment or links - reduction in transistor size and similar performance-improving measures are not the only way to improve cost performance.

[-]Cameron Holmes3y10

Sorry, I agree that comment and those links left some big inferential gaps.

I believe the link below is more holistic and doesn't leave such big leaps (admittedly it does have some 2021-specific themes that haven't aged so well, but I don't believe they undermine the core argument made).

https://www.fabricatedknowledge.com/p/the-rising-tide-of-semiconductor

This still leaves a gap between cost per transistor and overall compute cost, but that's a much smaller leap, e.g. frequency being bound by physical constraints like the speed of light.

To evidence my point about this trend getting even worse after 2030 - EUV lithography was actively being pursued for decades before it entered production use around 2019. My understanding is that we don't have anything that significant at the level of maturity that EUV was at in the 90s. Consider my epistemic status on this point fairly weak though.

[-]Liam Donovan3y3-1

It seems like something has to be going wrong if the model output has higher odds that TAI is already here (~12%) than TAI being developed between now and 2027 (~11%)? Relatedly, I'm confused by the disclaimer that "we are not updating on the fact that TAI has obviously not yet arrived" -- shouldn't that fact be baked into the distributions for each parameter (particularly the number of FLOPs to reach TAI)?

[-]Matthew Barnett3y31

In the notebook, the number of FLOP to train TAI is deduced a priori. I basically just estimated distributions over the relevant parameters by asking what I'd expect from TAI, rather than taking into consideration whether those values would imply a final distribution that predicts TAI arrived in the past. It may be worth noting that Bio Anchors also does this initially, but it performs an update by chopping off some probability from the distribution and then renormalizing. I didn't do that yet because I don't know how to best perform the update.

Personally, I don't think a 12% chance that TAI already arrived is that bad, given that the model is deduced a priori. Others could reasonably disagree though.
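For reference, the simplest version of the "chop off and renormalize" update looks like the sketch below. This is only an illustration of the arithmetic, not the notebook's code and not necessarily how Bio Anchors implements it; the bucket layout is hypothetical, with the ~12% and ~11% figures taken from the comment above and the remainder lumped into one bucket.

```python
def condition_on_not_yet_arrived(pmf, current_year):
    """Drop all probability mass on arrival years <= current_year and
    rescale what remains so it sums to 1.

    pmf: dict mapping an arrival-year bucket (labelled by its last year)
         to its probability.
    """
    future = {year: p for year, p in pmf.items() if year > current_year}
    mass = sum(future.values())
    if mass == 0:
        raise ValueError("no probability mass left after truncation")
    return {year: p / mass for year, p in future.items()}


# HYPOTHETICAL buckets: "by 2023", "2024-2027", "later" (labelled 9999).
prior = {2023: 0.12, 2027: 0.11, 9999: 0.77}
posterior = condition_on_not_yet_arrived(prior, current_year=2023)

if __name__ == "__main__":
    print(posterior)
```

This treats every way of landing in a past bucket as equally ruled out, so it rescales the future buckets uniformly. A more careful update would instead reweight the underlying parameter distributions, which is the part I don't yet know how to do well.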