Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In 2021, I proposed measuring progress in the perplexity of language models and extrapolating past results to determine when language models were expected to reach roughly "human-level" performance. Here, I build on that approach by introducing a more systematic and precise method of forecasting progress in language modeling that employs scaling laws to make predictions.

The full report for this forecasting method can be found in this document. In this blog post I'll try to explain all the essential elements of the approach without providing excessive detail regarding the technical derivations.

This approach can be contrasted with Ajeya Cotra's Bio Anchors model, providing a new method for forecasting the arrival of transformative AI (TAI). I will tentatively call it the "Direct Approach", since it makes use of scaling laws directly to make predictions about compute requirements for AI.

Naturally, the Direct Approach is a very speculative framework and might end up being useless for forecasting TAI (in fact, I consider this the most likely outcome). Nonetheless, I'm hopeful that something like it can serve as a better foundation than current TAI timelines models, which I currently think are likely even worse. Note that there may be errors in the report and Colab notebook, as they were not extensively fact-checked.

Some background

In a nutshell, this approach is simply about taking the cross-entropy loss of an autoregressive model and trying to find a way of interpreting that quantity qualitatively: that is, something we can put on a chart and extrapolate until the quantity reaches a natural threshold that we identify with something important.

In my 2021 post about predicting language model performance, I drew a trendline through a plot of language model perplexities on various benchmarks and noted when the trendline went through estimates of "human-level" perplexity. This approach felt reasonable to me at the time, but I now think it too easily hand-waved away some important details.

The error of omission I committed in my old approach becomes more apparent when you think about language model performance from the perspective of scaling laws, for example the parametric scaling law from Hoffmann et al. 2022:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Here, we see cross-entropy loss as a function of parameters $N$ and training tokens $D$ seen during training. Notably, if we take the limit as the number of parameters and training tokens goes to infinity, then we're left with $E$. Theoretically, $E$ corresponds to the "entropy of natural text" under certain assumptions, which is precisely the thing I identified with "roughly human-level" performance in my previous post. In other words, if we take this scaling law naively, it seems as though it will take infinite compute to reach human-level performance.

I believe the resolution to this apparent issue is to say that "human-level" performance will not be obtained when loss hits $E$, but rather some small level above $E$. How close to $E$ is enough? Well, that's the question we tried to answer with this report.
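To make the scaling law concrete, here is a minimal sketch that evaluates it numerically. The fitted constants are the published Chinchilla estimates and should be treated as approximate; the function names are my own.

```python
import math

# Fitted constants published in Hoffmann et al. 2022 (approximate)
E = 1.69            # irreducible loss (nats/token), identified with the entropy of text
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28

def loss(n_params, n_tokens):
    """Predicted training loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def reducible_loss(n_params, n_tokens):
    """The component of the loss above E, which shrinks toward zero with scale."""
    return loss(n_params, n_tokens) - E

# Chinchilla itself: ~70B parameters trained on ~1.4T tokens
print(f"reducible loss at Chinchilla scale: ~{reducible_loss(70e9, 1.4e12):.3f} nats/token")
```

Note that `reducible_loss` only approaches zero as both $N$ and $D$ grow without bound, which is the "infinite compute" puzzle described above.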

Summary of the Direct Approach

We begin by considering a language task, which in this post will be scientific research for illustration. For simplicity, let's imagine that this task consists of writing high-quality research papers or reports, although more nuanced specifications are possible.

Of course, real scientific research involves more than merely writing research papers. It involves proposing hypotheses, devising experiments, and collecting data, but for now, let's imagine that we can simplify all these steps into one step that involves writing high quality research papers. This simplification may not be entirely unrealistic, since if the papers are genuinely judged to be high quality and not fraudulent or p-hacked etc., then presumably they are the end result of a process that reliably performs all the essential steps to proper scientific research.

Next, we estimate a sufficient horizon length, which I'll call the k-horizon, over which we expect the most complex reasoning to emerge during the task. For the case of scientific research, we might reasonably take the k-horizon to roughly be the length of an average scientific paper, which is likely between 3,000 and 10,000 words. However, we can also explicitly model our uncertainty about the right choice for this parameter.

Our goal is to find a value for the KL-divergence of the "true" distribution of text from a model that roughly corresponds to "the model cannot be reliably distinguished from the true distribution over lengths equal to the k-horizon". Note that if KL-divergence were exactly zero, then there would be no detectable difference between the two distributions, and thus, the model could directly substitute for scientific researchers. In the more realistic case, the KL-divergence will be non-zero, allowing us to calculate the expected number of tokens over which it becomes theoretically possible to discriminate between the model and the true distribution, to some degree of confidence.

In general, the key insight of this approach is that indistinguishability implies competence. The reason is simple. If there is no way of reliably distinguishing between what the model produces and the true distribution, then there cannot be defects in the model's competence, as otherwise, we could exploit those defects to distinguish its outputs from the true distribution.

I am not saying that competence implies indistinguishability, as a model can be superhuman and quite distinguishable from human performance. I think the Direct Approach is best seen as a way of calculating an upper bound on the hardness of training a model that can think reliably over long sequences, rather than a statement about how transformative models will be trained (especially that they will be trained to copy human behavior exactly).

In the next sections, I introduce the concept of k-performance, which informally means that the model produces indistinguishable outputs according to a trained judge, up to the horizon length k. All of this can be made more formal by introducing a simple model of human abilities to discriminate between outputs.

When combined with empirically derived scaling laws, the result permits us to directly calculate the compute requirements to train a model with a KL-divergence corresponding to a target k-performance value. Then, just as in Bio Anchors, we can forecast a distribution over the arrival date of this model by forecasting future growth in price performance, willingness to spend, and algorithmic efficiency, among other variables we might wish to consider.

Interpreting the training loss

In language modeling, the training loss is generally an estimate of the cross-entropy between the model's distribution and the "true" distribution. This loss is convenient for our purposes, since there is a neat mathematical relationship between the cross-entropy of two distributions and our ability to distinguish between samples drawn from them.

First, note that the cross-entropy between $P$ and $Q$ can be decomposed as a sum of an entropy and a KL-divergence,

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q)$$

Since the Hoffmann et al. scaling law also decomposes training loss into an irreducible loss, which they associate with the intrinsic entropy of internet text, and a reducible loss component, we can use their scaling law to estimate the KL-divergence given data and parameter training inputs. The KL-divergence can then be used to calculate how "distinguishable" $P$ and $Q$ are in the following sense.
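The decomposition is easy to verify numerically. The sketch below uses arbitrary toy distributions over a 3-symbol vocabulary purely for illustration:

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy "true" distribution P and model distribution Q
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

# H(P, Q) = H(P) + D_KL(P || Q)
assert abs(cross_entropy(P, Q) - (entropy(P) + kl(P, Q))) < 1e-12
print(cross_entropy(P, Q), entropy(P), kl(P, Q))
```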

Suppose you were given i.i.d. samples from either $P$ or $Q$, but you did not initially know which distribution they were being sampled from. If you start with some prior over whether you are sampling from $P$ or $Q$, then you can use the KL-divergence to calculate the number of samples it should take, in expectation, for your prior to exceed some threshold confidence about whether you are sampling from $P$ or from $Q$.

In the full report, I derive this formula by borrowing from these lecture notes on the Sequential Probability Ratio Test.

Let $p^*$ be the stopping threshold probability, and let $H_P$ and $H_Q$ be the hypotheses that we are sampling from $P$ or $Q$ respectively. We will stop when our odds $\frac{P(H_P)}{P(H_Q)}$ either fall below $\frac{1-p^*}{p^*}$ or exceed $\frac{p^*}{1-p^*}$. We can then show that the expected stopping time $\mathbb{E}[T]$ is approximately inversely linear in the KL-divergence,

$$\mathbb{E}[T \mid H_P] \approx \frac{\log\frac{p^*}{1-p^*}}{D_{\mathrm{KL}}(P \parallel Q)}$$

and

$$\mathbb{E}[T \mid H_Q] \approx \frac{\log\frac{p^*}{1-p^*}}{D_{\mathrm{KL}}(Q \parallel P)}$$

Importantly, these are the expected times with respect to an ideal discriminator, meaning that no matter what method one uses, it can be proved that there is no way to discriminate between samples from these distributions in a shorter expected time.
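The inverse-linear approximation $\mathbb{E}[T] \approx \log\frac{p^*}{1-p^*} / D_{\mathrm{KL}}$ can be sanity-checked against a simulated Sequential Probability Ratio Test. The sketch below uses Bernoulli distributions for simplicity; the parameter values are arbitrary illustrations, and since Wald's approximation ignores boundary overshoot, the simulated mean will run slightly high:

```python
import math
import random

def kl_bernoulli(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def expected_stopping_time(kl_per_sample, p_star):
    """Wald's approximation: E[T] ~ log(p*/(1-p*)) / D_KL."""
    return math.log(p_star / (1 - p_star)) / kl_per_sample

def sprt_trial(p, q, p_star, rng):
    """Sample from Bernoulli(p) until the log-likelihood ratio crosses the threshold."""
    log_threshold = math.log(p_star / (1 - p_star))
    llr, t = 0.0, 0
    while abs(llr) < log_threshold:
        x = rng.random() < p
        llr += math.log(p / q) if x else math.log((1 - p) / (1 - q))
        t += 1
    return t

rng = random.Random(0)
p, q, p_star = 0.6, 0.4, 0.99
analytic = expected_stopping_time(kl_bernoulli(p, q), p_star)
simulated = sum(sprt_trial(p, q, p_star, rng) for _ in range(2000)) / 2000
print(f"analytic ~ {analytic:.1f} samples, simulated ~ {simulated:.1f} samples")
```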

In other words, for any desired level of confidence $p^*$ and any horizon length $k$, we can find a KL-divergence of $Q$ from $P$ such that it is impossible to reliably distinguish between $P$ and $Q$ with confidence $p^*$ using fewer than $k$ samples.

In plainer language, we can use scaling laws in data and parameters to calculate the loss of a language model with respect to its training data, which can be used to calculate the horizon over which the model's output is essentially indistinguishable from the training data. If this horizon length is longer than the horizon length of the transformative task, then our approach is to predict that the model is competent, and can directly substitute for humans in performing the task.
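Combining the scaling-law fit with the stopping-time approximation gives a rough end-to-end sketch of this calculation. The constants are the published Chinchilla fits; the k-horizon and confidence threshold below are illustrative choices, not values from the report:

```python
import math

# Published Chinchilla fits (approximate): L(N, D) = E + A/N^alpha + B/D^beta
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

def kl_per_token(n_params, n_tokens):
    """Reducible loss, interpreted as per-token KL-divergence from the true distribution."""
    return A / n_params**ALPHA + B / n_tokens**BETA

def horizon(n_params, n_tokens, p_star=0.9):
    """Expected tokens an ideal discriminator needs to reach confidence p_star."""
    return math.log(p_star / (1 - p_star)) / kl_per_token(n_params, n_tokens)

K_HORIZON = 10_000  # tokens, roughly an average scientific paper

model_horizon = horizon(70e9, 1.4e12)  # a Chinchilla-scale model
print(f"indistinguishability horizon: ~{model_horizon:.0f} tokens; "
      f"meets the {K_HORIZON}-token k-horizon: {model_horizon >= K_HORIZON}")
```

Under these illustrative numbers, a Chinchilla-scale model's ideal-discriminator horizon is only a handful of tokens, far short of a paper-length k-horizon, which is why the human slowdown factor in the next section matters so much.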

(For technical reasons, the preceding analysis is slightly incomplete, since in the case of language models, we do not obtain i.i.d. samples. In the full report, I address this concern by appealing to the Shannon–McMillan–Breiman theorem.)

Building a more realistic model

In the last section, I only showed how to calculate the horizon length over which two distributions are hard to distinguish between for an ideal discriminator. A more realistic model is to assume that there is some imperfection in real human discrimination abilities.

To make things slightly more precise, let the k-performance of a model refer to the horizon length k over which a model cannot be reliably distinguished from the true distribution relative to some expert human judge. In particular, we are interested in calculating the k-performance relative to a trained human judge, who, despite falling short of being an ideal discriminator, possesses a variety of tools at their disposal which they can use to discriminate between outputs from the model, and outputs from the true distribution.

In the full report, I show that two intuitive models of human discrimination abilities yield the same result: roughly, that the k-performance of a model with respect to a human discriminator will be some constant multiple of the k-performance with respect to an ideal discriminator, which we can calculate directly. Performing experiments to measure this factor, which I call the human slowdown factor, is a tractable way of reducing uncertainty in this approach to forecasting AI.
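Under this model, the slowdown factor enters as a simple multiplier: if a trained human judge needs roughly $s$ times more tokens than an ideal discriminator, then the per-token KL the model can tolerate while staying indistinguishable over a target k-horizon grows by the same factor $s$. A minimal sketch, with illustrative numbers:

```python
import math

def required_kl(k_horizon, slowdown, p_star=0.9):
    """Per-token KL tolerable if a human judge needs `slowdown` times
    more tokens than an ideal discriminator to reach confidence p_star."""
    return slowdown * math.log(p_star / (1 - p_star)) / k_horizon

# Illustrative numbers: 10,000-token papers, slowdown between 10x and 50x
for s in (10, 50):
    print(f"slowdown {s}x -> tolerable KL ~ {required_kl(10_000, s):.2e} nats/token")
```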

When will TAI arrive?

Given an estimate of the slowdown factor of human discrimination abilities, a horizon length k over a transformative task (in our illustrative case, scientific research), and a scaling law in compute for the relevant distribution, it is possible to calculate a distribution over the upper bound of the training compute for TAI, assuming those scaling laws hold.

In the chart below (from Tamay Besiroglu), we can see the relationship between training compute and k-performance under various possible values of the slowdown parameter, according to the Hoffmann et al. compute-optimal scaling law. However, to be clear, the Hoffmann et al. scaling law was derived from internet data and by scaling a particular class of models, which may not be a reliable guide for measuring the difficulty of learning how to do scientific reasoning. Ultimately, better data on the scaling properties of higher-quality distributions will be needed to make this analysis more conclusive.

One way to read this chart is to imagine what horizon length of text you think Chinchilla is capable of reliably (~90% of the time) performing coherent reasoning over. For example, suppose you thought that Chinchilla was capable of reliably reasoning coherently over tweet-length tasks, but not short-blog-post-length tasks. Then, your estimate of the slowdown parameter would be at least 10x but not 50x. You can then look at the corresponding lines and extrapolate until it reaches the threshold for scientific-manuscript-length tasks. This type of reasoning personally convinced me that a reasonable hard upper bound for training TAI was about 10^40 FLOP, with something between 10^30 to 10^35 FLOP as my central estimate for the training requirements, using 2022 algorithms.
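The compute numbers in this kind of reasoning can be recovered by numerically inverting the compute-optimal scaling law. The sketch below assumes the ~20 tokens-per-parameter rule of thumb and $C \approx 6ND$ training FLOP, both standard readings of the Chinchilla results; it illustrates the mechanics rather than reproducing the report's estimates:

```python
import math

# Published Chinchilla fits (approximate)
A, ALPHA, B, BETA = 406.4, 0.34, 410.7, 0.28

def kl_at_compute(c_flop):
    """Reducible loss at compute-optimal scale, assuming C ~ 6*N*D and D ~ 20*N."""
    n = math.sqrt(c_flop / 120)  # from C = 6 * N * (20 * N)
    d = 20 * n
    return A / n**ALPHA + B / d**BETA

def compute_for_kl(target_kl, lo=1e20, hi=1e50):
    """Bisect on log-compute; kl_at_compute decreases monotonically in compute."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if kl_at_compute(mid) > target_kl:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# KL needed for a 10,000-token horizon at 90% confidence with a 10x human slowdown
target = 10 * math.log(0.9 / 0.1) / 10_000
print(f"training compute upper bound ~ {compute_for_kl(target):.1e} FLOP")
```

Because the reducible loss falls off only polynomially in $N$ and $D$, small changes in the target KL or the slowdown parameter move the implied compute requirement by many orders of magnitude, which is why the bottom-line range here is so wide.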

If we're also given estimates for growth in computing price-performance, willingness to spend, and algorithmic progress, then it is possible to provide a distribution over dates when we expect TAI to arrive.
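Mechanically, that final step can be as simple as compounding growth rates until the largest affordable effective training run crosses the compute requirement. Every number below is a placeholder assumption for illustration, not an estimate from the report or notebook:

```python
def arrival_year(requirement_flop,
                 flop_2022=1e24,        # rough scale of the largest 2022 training run
                 hardware_growth=1.35,  # price-performance improvement per year
                 spend_growth=1.5,      # willingness-to-spend growth per year
                 algo_growth=1.6):      # algorithmic-efficiency gain per year
    """Year when effective training compute first crosses the requirement."""
    effective, year = flop_2022, 2022
    while effective < requirement_flop and year < 2150:
        effective *= hardware_growth * spend_growth * algo_growth
        year += 1
    return year

print(arrival_year(1e32))  # -> 2038 with these placeholder rates
```

In a fuller treatment, each growth rate would itself carry a probability distribution, yielding a distribution over arrival dates rather than a point estimate.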

You can visit this actively updated Google Colab notebook for my personal timeline estimate using this approach. Note that I did not tweak these parameters in order to produce a desired result (indeed, I was surprised by the end result). However, I have been updating it as I find better estimates for each parameter.

[Note: originally I didn't put this plot in the blog post because I didn't want it shared everywhere without context. I was convinced to put it in here with the caveat that this is highly preliminary and sensitive to the parameters in the frequently updated notebook. 

Perhaps the most misleading thing right now is that the model does not yet update on the fact that TAI has not already arrived. Thus, it puts non-trivial credence on us already having TAI. See this comment for a longer explanation. Don't take this plot too seriously.]

Alternatively, you can incorporate this approach into Tom Davidson's takeoff model to build a more theoretically grounded timelines model, but I have not done that yet.

Also, it is worth noting that I am trying to model my uncertainty over the underlying parameters, yielding a very uncertain bottom-line result, despite my opinion that this model slightly reduces true uncertainty about the arrival of TAI relative to Bio Anchors.

Comparison to Bio Anchors

A good summary of Bio Anchors can be found in this comment from Rohin Shah.

The Bio Anchors report actually considers a version of the "direct extrapolation" model, but eventually dismisses the idea since it's unclear what metric we should use to measure performance. Cotra wrote,

A very different high-level approach to estimating TAI timelines (which in our experience most people initially gravitate toward) involves more holistically assessing progress in AI systems’ capabilities, rather than leaning heavily on biological anchors. Essentially, this approach is to: 

  1. Judge how “impressive”, “capable”, “general”, or “useful” state-of-the-art (SOTA) AI systems currently are (for example by synthesizing information from various key benchmarks and AI challenges such as performance on board games, Winograd schemas, adversarial examples, etc). 
  2. Assess how quickly the impressiveness of AI systems has been improving recently.
  3. Extrapolate how many years of progress at the current pace would be required to reach the level of impressiveness required for TAI.

I’ll call this approach the subjective impressiveness extrapolation approach, which stands in contrast with the biological anchors framework used in this report. Here is a visualization of a hypothetical TAI timelines forecast using a subjective impressiveness extrapolation approach, where the x-axis is the year (from 2000 to 2100), and the red line represents the holistic “impressiveness” of AI systems in that year (which reaches the level of TAI around ~2065-2070): 

[...]

The most important disadvantage of the subjective impressiveness extrapolation is that it is extremely unclear what exactly the y-axis refers to, and different people will have different intuitions about it.

My alternative is simply to take scaling laws at face value, and try my best to calculate the training loss associated with something like "transformative" or "human-level" abilities. Since many transformative tasks can arguably be translated into language modeling tasks, and the best data we have on neural scaling laws comes from language modeling, it's natural to examine language models, though I'd be excited if someone tried this approach for other modalities too, including RL.

Since it was published, Cotra's report has been subject to a number of critiques. As I pointed out last year, Cotra's final distribution over the compute required for training TAI is extremely broad, spanning over 20 orders of magnitude, making her model relatively uninformative. In addition, her report arguably underestimates our true uncertainty over TAI timelines since it relies on point estimates for algorithmic progress and price-performance declines in hardware, rather than modeling our uncertainty over these parameters.

Furthermore, Cotra's model is a bit complex in some places, requiring considerable guesswork before producing any results. Given the absence of historical precedent for biological anchor models anticipating AI developments, the model's lack of simplicity, and the uninformative nature of its bottom-line results, among other methodological issues, it is worth seeking alternative approaches for modeling TAI timelines.

By contrast, while I think my model still requires lots of guesswork and has massive uncertainty, I think there are several advantages of the Direct Approach relative to Bio Anchors. These include,

  1. It's simpler than Bio Anchors, with fewer free parameters.
  2. The model can be generalized to many tasks, even non-transformative ones, enabling us to actually test the model in the short term, and see if it generates reliable predictions.
  3. We can plausibly greatly reduce our uncertainty over the bottom line distribution through experiment, especially by measuring the human slowdown factor.

That said, I'm also aware of several issues with the Direct Approach, including,

  1. It employs a somewhat dubious interpretation of the cross-entropy loss of language models. In general, it is unclear whether we can readily obtain much information about the abilities of a language model given only its cross-entropy loss on a given task. Ultimately, downstream benchmarking data may be required instead.
  2. It assumes that we can naively extrapolate neural scaling laws over many orders of magnitude. (Note however that Bio Anchors also assumes this).
  3. The model currently lacks reliable data for many of its key inputs, without which it is very difficult to use it to make good predictions. For example, I wouldn't put too much trust in the Hoffmann et al. scaling law.
  4. The Direct Approach only produces a soft upper bound over the compute distribution required for TAI, since a more efficient method than simply scaling language models could be employed. This upper bound may be far above the actual requirements, which could render the Direct Approach fairly useless.

Overall, I'm hopeful that the Direct Approach can supplement Bio Anchors as a means of forecasting advanced AI, even if it does not serve as an adequate replacement.


Nice post, and nice argument! I think I agree that this is a worthy alternative to the Bio Anchors brain-and-genome-size-comparison stuff. I even tentatively agree that it's better overall, though I'd want to think about it more. (When I go on about how great Bio Anchors is, it's not because I'm in love with the brain size comparison--though I used to like it more than I do now--it's because I'm in love with the "useful core" of it, the breakdown into computing price-performance, willingness to spend, algorithmic progress, and compute requirements. Which it seems you also are doing.)

This type of reasoning personally convinced me that a reasonable hard upper bound for training TAI was about 10^40 FLOP, with something between 10^30 to 10^35 FLOP as my central estimate for the training requirements, using 2022 algorithms.

...
If we're also given estimates for growth in computing price-performance, willingness to spend, and algorithmic progress, then it is possible to provide a distribution over dates when we expect TAI to arrive.
...

Alternatively, you can incorporate this approach into Tom Davidson's takeoff model to build a more theoretically grounded timelines model but I have not done that yet.

Tom's model just uses a training requirements variable, it doesn't appeal to all the fancy bio anchors stuff that your method is a viable alternative to. Insofar as you are still using what I consider the "useful core" of the bio anchors model then I think your bottom line numbers for compute requirements can just be straightforwardly plugged into Tom's model.

So I just went and plugged in the values of 10^30 and 10^35 FLOP for the training requirements variable at takeoffspeeds.com. Playing around with it a bit (modifying their preset scenarios) it looks like this gives you somewhere between 2029 and 2044.

Though if you also do what I recommend and increase software returns from 1.25 to 2.5, to be more consistent with the data we have so far about algorithmic progress, the top end of the range cuts down considerably: Now you get 2027 - 2033 as the range.

 

I applaud the effort, but I confess that I didn't read the post very carefully, because so many problems with this approach jumped out at me. So it is entirely possible that some of the following points of critique are already dealt with in your text, in which case, sorry about that: 

My understanding is that the irreducible part of the loss has nothing (necessarily) to do with "entropy of natural text" and even less with "roughly human-level" - it is the loss this particular architecture for this particular training regime can reach in the limit on this particular training data distribution. 

Nothing more, nothing less. 

Other architectures with other training regimes will have scaling laws with a different irreducible loss. Human level of token prediction is way worse than probably any GPT, so why would that loss indicate human level of reasoning?

To actually have an "anchor" you'd have to look at human performance on benchmarks and estimate the range of loss that might indicate human level of reasoning.

I also don't understand why the loss on random internet text should be the same as on scientific papers. Loss differs from domain to domain. Scaling laws likely do, too. 

Also, new (significant) papers are not sampled from the distribution of papers. They are out of distribution because they go beyond all previous papers. So I'm not sure your formula doesn't just measure the ability to coherently rehash stuff that is already known. 

Of course if you'd actually measure the ability to write a new scientific paper, the model would be ridiculously overpowered, because humans need 6 or 7 orders of magnitude more time to create such a paper (at least with current sampling speeds). 

I think your critiques are great since you're thinking clearly about how this approach is supposed to work. At a high level my reply to your comment is something like, "I basically agree, but don't think that anything you mentioned is devastating. I'm trying to build something that is better than Bio Anchors, and I think I probably succeeded even with all these flaws."

That said, I'll address your points more directly.

My understanding is that the irreducible part of the loss has nothing (necessarily) to do with "entropy of natural text" and even less with "roughly human-level" - it is the loss this particular architecture for this particular training regime can reach in the limit on this particular training data distribution. 

That's correct, but if the hypothesis space is sufficiently large, then the term $E$ in the Hoffmann et al. equation for loss should actually correspond to the entropy of natural language. The reason comes down to how entropy is defined. Roughly speaking, language entropy can be defined as the limit of a certain functional approximation of entropy as the capacity of that model goes to infinity. This is strikingly similar to what I'm doing with the Hoffmann et al. equation.

The main difference is that, in the case of Hoffmann et al., we can't be sure that a sufficiently large model trained over a sufficiently large amount of data would actually converge onto the entropy of the English language. In fact, we have reason to believe that it wouldn't, due to constraints like the limited size of the context window.

However, I don't see any fundamental issue with treating $E$ as the entropy of whatever distribution we're talking about, so long as our hypothesis space is vast enough.

In this post, I'm not trying to lean into the Hoffmann et al. results per se, except to the extent that they provided the best current source of data on scaling language models. Indeed, initially I didn't even present any results at all, until someone persuaded me to put a preliminary CDF over TAI timelines in the post to make it more interesting. I'm mostly trying to explain the approach, which can certainly be improved upon if we get better data about the scaling properties of models trained on a more relevant distribution, like scientific papers.

Human level of token prediction is way worse than probably any GPT, so why would that loss indicate human level of reasoning?

I think this question might rest on a misconception that also came up a few times when I was sharing this draft with people, so it's perhaps important to put this caveat into the post more directly. This approach has almost nothing to do with human-level abilities to predict tokens. It's based on something completely different, which is more similar to whether humans can distinguish long outputs written by language models from outputs written by humans. There are good reasons to believe that these two approaches should yield very different results.

Also, new (significant) papers are not sampled from the distribution of papers. They are out of distribution because they go beyond all previous papers. So I'm not sure your formula doesn't just measure the ability to coherently rehash stuff that is already known. 

This is a good point. Ultimately, my response is that I'm trying to measure something like the hardness of training a model to think reliably over long sequences, rather than something like the hardness of training a model to copy the exact distribution it's trained on. We can already see with current models, like GPT-3.5, that it can often produce novel results (e.g. its poetry) despite the intuition that it shouldn't be able to "go beyond" its training distribution. I think this points to something important and true, which is that language models seem to be learning how to do the stuff that in practice allows them to write content on par with what humans write, rather than merely learning how to emulate their training distribution.

I thought this was great, cheers. 

Here:

Next, we estimate a sufficient horizon length, which I'll call the k-horizon, over which we expect the most complex reasoning to emerge during the transformative task. For the case of scientific research, we might reasonably take the k-horizon to roughly be the length of an average scientific paper, which is likely between 3,000 and 10,000 words. However, we can also explicitly model our uncertainty about the right choice for this parameter.

It's unclear whether the final paper would be the needed horizon length.

For analogous reasoning, consider a model trained to produce equations which faithfully describe reality. These equations tend to be quite short. But I imagine that the horizon length needed to produce them is larger, because you have to keep many things in mind when doing so. Unclear if I'm anthropomorphizing here.

I am also curious about the extent to which you are taking the Hoffmann scaling laws as an assumption, rather than as something you can assign uncertainty over.

This is cool! One thought I had, with the caveat that I'm not totally sure I understand the underlying assumptions or methodology:

Of course, real scientific research involves more than merely writing research papers. It involves proposing hypotheses, devising experiments, and collecting data, but for now, let's imagine that we can simplify all these steps into one step that involves writing high quality research papers. This simplification may not be entirely unrealistic, since if the papers are genuinely judged to be high quality and not fraudulent or p-hacked etc., then presumably they are the end result of a process that reliably performs all the essential steps to proper scientific research.

I tentatively agree there is a real sense in which "write a good scientific paper" has as a subtask "do good scientific research", such that a process that can do the former will need to learn to do the latter. The scaling laws for LLM are based on looking at lots of text and getting very good at prediction, and presumably this involves learning patterns about how the real world works, which definitely feels like a move in the direction of "do good scientific research".

But most science requires actually looking at the world. The reason we spend so much money on scientific equipment is because we need to check if our ideas correspond to reality, and we can't do that just by reading text. I imagine the outcome of some publishable experiments could be predicted with high confidence just by looking at existing text, but I expect there to be enough of a difference between "look at text and learn patterns" and "look at the world and learn patterns" that it's unclear to me if we should expect the scaling laws to hold once the models are running a process that is actually doing science. (Like maybe the "irreducible loss" is reducible once you start looking at the world?) Or maybe it's very similar and the scaling laws will take us all the way to high-quality text-mediated real-world experiments or whatever.

But most science requires actually looking at the world. The reason we spend so much money on scientific equipment is because we need to check if our ideas correspond to reality, and we can't do that just by reading text.

I agree. The primary thing I'm aiming to predict using this model is when LLMs will be capable of performing human-level reasoning/thinking reliably over long sequences. It could still be true that, even if we had models that did that, they wouldn't immediately have a large scientific/economic impact on the world, since science requires a lot more than thinking well. (There are also a number of other issues like models hallucinating, even if they're capable of reasoning coherently, but I'll set aside that for now.)

The original definition of transformative AI referred broadly to economic, social and scientific progress society-wide, rather than just AI that reasoned and understood the world broadly as well as a human does. An implicit assumption in this blog post is that TAI will follow after the development of high quality AI thinkers. That's not an assumption I defended because I felt it was separate from the purpose of the report. To be clear, I also find this assumption questionable, so I suppose it's worth clarifying that in the post.

I consider it important to explore what we should expect after the development of high quality reasoners, which I expect we can do by incorporating this framework into the context of a more general takeoff model, such as Tom Davidson's recent model. I have yet to bridge these two models, but I suspect after bridging them, we might get more insight into this question.

The primary thing I'm aiming to predict using this model is when LLMs will be capable of performing human-level reasoning/thinking reliably over long sequences.

Yeah, and I agree this model seems to be aiming at that. What I was trying to get at in the later part of my comment is that I'm not sure you can get human-level reasoning on text as it exists now (perhaps because it fails to capture certain patterns), that it might require more engagement with the real world (because maybe that's how you capture those patterns), and that training on whichever distribution does give human-level reasoning might have substantially different scaling regularities. But I don't think I made this very clear and it should be read as "Rick's wild speculation", not "Rick's critique of the model's assumptions".

> training on whichever distribution does give human-level reasoning might have substantially different scaling regularities.

I agree again. I talked a little bit about this at the end of my post, but overall I just don't have any data for scaling laws on better distributions than the one in the Chinchilla paper. I'd love to know the scaling properties of training on scientific tasks and incorporate that into the model, but I just don't have anything like that right now.

Also, this post is more about the method rather than any conclusions I may have drawn. I hope this model can be updated with better data some day.

The assumptions of stationarity and ergodicity are natural to make, but I wonder if they hide much of the difficulty of achieving indistinguishability. If we think of text sequences as the second part of a longer sequence whose first part is composed of whatever non-text world events preceded the text (or even more text data that was dropped from the context), I'd guess a formalization of this would violate stationarity or ergodicity. My point here is a version of the general causal confusion / hallucination points made previously, e.g. here.

This is, of course, fixable by modifying the training process, but I think it is worth flagging that the stationarity and ergodicity assumptions are not arbitrary with respect to scaling: they likely bias the model towards shorter timelines. Adding more of my own inside view, I think this point is evidence for code and math scaling accelerating ahead of other domains. In general, any domain where the training process can be cheaply modified to allow models to take causal actions (which deconfound/de-hallucinate) should be expected to progress faster than other domains.
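The ergodicity worry above can be illustrated with a toy example (my own sketch, not a formalization of the comment): if each document is generated under a hidden regime — standing in for the non-text world events dropped from the context — the token process is a mixture, and statistics within any one document never converge to the ensemble average.

```python
import numpy as np

rng = np.random.default_rng(1)

# Each "document" is generated under a hidden regime (standing in for
# the world events preceding the text) that never appears in the tokens.
def sample_doc(length=1000):
    p = rng.choice([0.2, 0.8])   # hidden regime, dropped from context
    return rng.random(length) < p

# Compare the time average within one document to the ensemble average
# across documents: ergodicity would require them to agree.
doc_means = [sample_doc().mean() for _ in range(2000)]
print(np.mean(doc_means))   # ensemble average, roughly 0.5
print(doc_means[:3])        # individual documents cluster near 0.2 or 0.8
```

No single document's statistics look like the ensemble's, which is the sense in which a model trained on such a mixture can't treat the stream as one stationary, ergodic process.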

In Stephen Byrnes's recent post https://www.lesswrong.com/posts/PDx4ueLpvz5gxPEus/why-i-m-not-working-on-debate-rrm-elk-natural-abstractions he states:

> Two different perspectives are:
>
> • AGI is about knowing how to do lots of things
> • AGI is about not knowing how to do something, and then being able to figure it out.
>
> I'm strongly in the second camp.

I very much agree with this, and I think measuring perplexity conflates these two things (the ability to recite/compose memorized knowledge versus the ability to generate novel insights). I think you need an entirely different benchmark, based on something like giving a model only the information available to Newton before he came up with his laws of motion and seeing whether the model can come up with equivalent insights.

I used the first chart, the compute required for GPT-3, and my personal assessment that ChatGPT clearly meets the cutoff for tweet length, very probably meets it for a short blog post (though not by a wide margin), and clearly does not meet it for a research paper, to create my own 75th-percentile estimate for human slowdown of 25-75. It moves P(TAI <= year) = 50% from ~2041 to ~2042, and the 75% mark from ~2060 to ~2061. Big changes! 😂

It seems like something has to be going wrong if the model output has higher odds that TAI is already here (~12%) than TAI being developed between now and 2027 (~11%)? Relatedly, I'm confused by the disclaimer that "we are not updating on the fact that TAI has obviously not yet arrived" -- shouldn't that fact be baked into the distributions for each parameter (particularly the number of FLOPs to reach TAI)?

In the notebook, the number of FLOP to train TAI is deduced a priori. I basically just estimated distributions over the relevant parameters by asking what I'd expect from TAI, rather than taking into consideration whether those values would imply a final distribution that predicts TAI arrived in the past. It may be worth noting that Bio Anchors also does this initially, but it then performs an update by chopping off some probability mass from the distribution and renormalizing. I didn't do that yet because I don't know how to best perform the update.

Personally, I don't think a 12% chance that TAI already arrived is that bad, given that the model is deduced a priori. Others could reasonably disagree though.
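The "chop off and renormalize" update mentioned above is simple to sketch. Here is a minimal Monte Carlo version with purely hypothetical parameters (neither the prior nor the compute figure comes from the notebook): condition on TAI not having arrived by discarding sampled FLOP requirements that have already been met, then renormalize.

```python
import numpy as np

# Hypothetical lognormal prior over training FLOP required for TAI
# (illustrative parameters only, not the notebook's actual distribution).
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=np.log(1e30), sigma=5.0, size=100_000)

# Update on "TAI has not yet arrived": discard any sampled requirement
# already met by the largest training run to date. Renormalization is
# implicit in keeping only the surviving samples.
flop_already_spent = 1e25  # rough order of magnitude for frontier runs
posterior = samples[samples > flop_already_spent]

prior_p_arrived = np.mean(samples <= flop_already_spent)
print(f"Prior P(TAI already achievable): {prior_p_arrived:.1%}")
print(f"Posterior median FLOP requirement: {np.median(posterior):.2e}")
```

The awkwardness the author points at survives this sketch: the update mechanically removes the probability mass below today's compute, but it doesn't say whether the rest of the distribution should also shift.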

Strong upvote. I think the methods used in this post are very promising for accurately forecasting TAI for the reasons explained below. 

While writing GPT-4 Predictions I spent a lot of time playing around with the parametric scaling law L(N, D) from Hoffmann et al. 2022 (the Chinchilla paper). In the post, I showed that scaling laws can be used to calculate model losses and that these losses seem to correlate well with performance on the MMLU benchmark. My plan was to write a post extrapolating this progress further to TAI, until I read this post, which has already done that!
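For reference, the parametric law L(N, D) is easy to evaluate directly. Below is a minimal sketch using the fitted constants reported in Hoffmann et al. 2022; treat the exact values as approximate, since the paper reports several fits.

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric scaling law L(N, D) from Hoffmann et al. 2022,
    using the fitted constants reported in the paper."""
    E = 1.69                      # irreducible loss (nats/token)
    A, B = 406.4, 410.7           # scale coefficients
    alpha, beta = 0.34, 0.28      # parameter- and data-scaling exponents
    return E + A / n_params**alpha + B / n_tokens**beta

# Chinchilla itself: ~70B parameters trained on ~1.4T tokens
print(f"{chinchilla_loss(70e9, 1.4e12):.3f} nats/token")
```

Plugging in larger N or D moves the loss monotonically toward the irreducible term E, which is the quantity the Direct Approach tries to interpret.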

Scaling laws for language models seem to me like possibly the most effective option we have for forecasting TAI accurately for several reasons:

  • It seems as though the closest existing ML models to TAI are language models, and therefore predictive uncertainty should be lower when forecasting TAI from language models than from other, less capable types of models.
  • A lot of economically valuable work such as writing and programming involves text and therefore language models tend to excel at these kinds of tasks. 
  • The simple training objective of language models makes it easier to reason about their properties and capabilities. Also, despite their simple training objective, large language models demonstrate impressive levels of generalization and even reasoning (e.g. chain-of-thought prompting).
  • Language model scaling laws are well-studied and highly accurate for predicting language model losses.
  • There are many existing examples of language models and their capabilities. Previous capabilities can be used as a baseline for predicting future capabilities.

Overall, my intuition is that language model scaling laws require far fewer assumptions and less guesswork for forecasting TAI, and therefore should allow narrower and more confident predictions, which your post seems to show (<10 OOM vs 20 OOM for the bio anchors method).

As I mentioned in this post, there are limitations to using scaling laws, such as the possibility of sudden emergent capabilities and the difficulty of predicting algorithmic advances.

  1. ^

    Exceptions include deep RL work by DeepMind such as AlphaTensor.

Excellent post, thank you. I look forward to playing with the notebook model.

One observation on the timelines (which I realise is somewhat tangential to the core content of the post) is that your estimates around compute cost scaling feel quite optimistic to me.

Memory and compute density scaling is already dropping off, while everyone seems to be struggling with yields on sub-3nm nodes despite huge spend, which exacerbates the problem for cost scaling.

https://fuse.wikichip.org/news/7343/iedm-2022-did-we-just-witness-the-death-of-sram/

https://www.semianalysis.com/p/tsmcs-3nm-conundrum-does-it-even

We don't really have many more promising technologies in the pipeline to radically reduce cost per transistor, so I expect improvements to slow quite a bit towards the end of this decade.

Your assertion that we don't have many things to reduce cost per transistor may be true, but it is not supported by the rest of your comment or links - reductions in transistor size and similar performance-improving measures are not the only way to improve cost performance.

Sorry, I agree that comment and those links left some big inferential gaps.

I believe the link below is more holistic and doesn't leave such big leaps (admittedly it does have some 2021-specific themes that haven't aged so well, but I don't believe they undermine the core argument made).

https://www.fabricatedknowledge.com/p/the-rising-tide-of-semiconductor

This still leaves a gap between cost per transistor and overall compute cost, but that's a much smaller leap (e.g., frequency being bound by physical constraints like the speed of light).

To evidence my point about this trend getting even worse after 2030: EUV lithography was actively pursued for decades before entering volume production around 2019. My understanding is that we don't have anything that significant at the level of maturity that EUV was at in the 90s. Consider my epistemic status on this point fairly weak, though.