Background Context:
I'm interested in this debate mainly because my views on timelines have, over the past few years, been heavily influenced by whoever I have talked with most recently (while my timelines have definitely gotten shorter on average over that period). If I've been talking with Tsvi or with Sam Eisenstat, my median-time-to-superintelligence is measured in decades, while if I've been talking with Daniel Kokotajlo or Samuel Buteau, it's measured in years (single-digit).
More recently, I've been more inclined towards the short end of the scale. The release of o1 made me update towards frontier labs not being too locked into their specific paradigm to innovate when existing methods hit diminishing returns. The AI 2027 report solidified this short-timeline view, specifically by making the argument that LLMs don't need to show steady progress on all fronts in order to be on a trajectory for strong superintelligence; so long as LLMs continue to make improvements in the key capabilities related to an intelligence explosion, other capabilities that might seem to lag behind can catch up later.
I was talking about some of these things with Tsvi recently, and he said something like "argue or update" -- so, it seemed like a good opportunity to see whether I could defend my current views or whether they'll once again prove highly variable based on who I talk to.
A Naive Argument:
One of the arguments I made early on in the discussion was "it would seem like an odd coincidence if progress stopped right around human level."
Since Tsvi put some emphasis on trying to figure out what the carefully-spelled-out argument is, I'll unpack this further:
Argument 1
- GPT1 (June 2018) was roughly elementary-school level in its writing ability.
- GPT2 (February 2019) was roughly middle-school level.
- GPT3 (June 2020) was roughly high-school level.
- GPT4 (March 2023) was roughly undergrad-level (but in all the majors at once).
- Claude 3 Opus (March 2024) was roughly graduate-school level (but in all the majors at once).
Now, obviously, this comes with a lot of caveats. For example, while GPT4 scored very well on the math SAT, it still made elementary-school mistakes on basic arithmetic questions. Similarly, the ARC-AGI challenge highlights IQ-test-like visual analogy problems where humans perform well compared with LLMs. LLMs also lag behind in physical intuitions, as exemplified by EG the HellaSwag benchmark; although modern models basically ace this benchmark, I think performance lagged behind what the education-level heuristic would suggest.
Still, the above comparisons are far from meaningless, and a naive extrapolation suggests that if AI keeps getting better at a similar pace, it will soon surpass the best humans in every field, across a wide variety of tasks.
There's a lot to unpack here, but I worry about getting side-tracked... so, back to the discussion with Tsvi.
Tsvi's immediate reaction to my "it would seem like an odd coincidence if progress stopped right around the human level" was to point out AI's heavy reliance on data; the data we have is generally generated by humans (with the exception of data created by algorithms, such as chess AI and so on). As such, it makes a lot of sense that the progress indicated in my bullet-points above could grind to a halt at performance levels within the human range.
I think this is a good and important point. I think it invalidates Argument 1, at least as written.
Why continued progress seems probable to me anyway:
As I said near the beginning, a major point in my short-timeline intuitions is that OpenAI and others have shown the ability to pivot from "pure scaling" to more substantive training innovations. We saw the first such pivot with ChatGPT (aka GPT3.5) in November 2022; the paradigm shifted from pure generative pre-training ("GPT") to GPT + chat training (mainly, adding RLHF after the GPT training). Then, in September 2024, we saw the second such pivot with the rise of "reasoning models" via a type of training now called RL with Verifiable Feedback (RLVF).
GPT alone is clearly bottlenecked by the quality of the training data. Since it is mainly trained on human-generated data, human-level performance is a clear ceiling for this method. (Or, more accurately: its ceiling is (at best) whatever humans can generate a lot of data for, by any means.)
RLHF lifts this ceiling by training a reward model which can distinguish better and worse outputs. The new ceiling might (at best) be the human ability to discern better and worse answers. In practice, it'll be worse than this, since the reward model will only partially learn to mimic human quality-discernment (and since we still need a lot of data to train the reward model, OpenAI and others have to cut corners on data quality; in practice, the human feedback is often generated quickly and under circumstances which are not ideal for knowledge curation).
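To make the reward-model idea concrete, here is a minimal sketch of the pairwise preference objective commonly used to train reward models from human comparisons (a Bradley-Terry-style loss). The tiny scoring network and the random "features" standing in for response embeddings are my own placeholders; real reward models are typically initialized from the LLM itself.

```python
# Minimal sketch of pairwise reward-model training (illustrative placeholders).
import torch
import torch.nn as nn

# Placeholder scorer; a real reward model would be an LLM with a scalar head.
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(chosen_features, rejected_features):
    # Score both responses; push the human-preferred one to score higher.
    r_chosen = reward_model(chosen_features)
    r_rejected = reward_model(rejected_features)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Fake "features" standing in for embeddings of preferred vs. rejected responses.
chosen = torch.randn(64, 16)
rejected = torch.randn(64, 16)

optimizer.zero_grad()
loss = preference_loss(chosen, rejected)
loss.backward()
optimizer.step()
print(float(loss))
```

The point of the sketch is just that the ceiling moves from "what humans can write" to "what the learned scorer can reliably distinguish as better or worse."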
RLVF lifts this ceiling further by leveraging artificially-generated data. Roughly: there are a lot of tasks for which we can grade answers precisely, rather than relying on human judgement. For these tasks, we can let models try to answer with long chain-of-thought reasoning (rather than asking them to answer right away). We can then keep only the samples of chain-of-thought reasoning which perform well on the given tasks, and fine-tune the model to get it to reason like that in general. This focuses the model on ways of reasoning which work well empirically. Although this only directly trains the model to perform well on these well-defined tasks, we can rely on some amount of generalization; the resulting models perform better on many tasks. (This is not too surprising, since asking models to "reason step-by-step" rather than answering right away was already known to increase performance on many tasks. RLVF boosts this effect by steering the step-by-step reasoning towards reasoning steps which actually work well in practice.)
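As a heavily simplified illustration of the filter-and-fine-tune version of this idea (my own toy example, not any lab's actual pipeline, and real systems typically also use RL-style policy updates rather than pure filtering): sample several reasoning attempts per task, keep only those whose final answer passes an exact check, and collect the survivors as fine-tuning data.

```python
# Toy sketch of verifiable-feedback filtering: sample chains of thought,
# keep only the ones whose final answer checks out exactly.
import random

def sample_cot(problem):
    """Stand-in for sampling a chain of thought + final answer from an LLM."""
    a, b = problem
    # A real model would produce reasoning text; this stub just sometimes errs.
    answer = a + b if random.random() < 0.7 else a + b + random.choice([-1, 1])
    reasoning = f"To add {a} and {b}, I combine them to get {answer}."
    return reasoning, answer

def verify(problem, answer):
    """The 'verifiable' part: an exact check, no human judgement needed."""
    a, b = problem
    return answer == a + b

problems = [(random.randint(0, 99), random.randint(0, 99)) for _ in range(50)]
finetune_set = []
for problem in problems:
    for _ in range(8):                      # several attempts per task
        reasoning, answer = sample_cot(problem)
        if verify(problem, answer):         # keep only verified-correct traces
            finetune_set.append({"problem": problem, "cot": reasoning})
            break

print(f"kept {len(finetune_set)} verified reasoning traces for fine-tuning")
```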
So, as I said, that's two big pivots in LLM technology in the past four years. What might we expect in the next four years?
The Deductive Closure:
During the live debate Tsvi linked to, TJ (an attendee of the event) referred to the modern LLM paradigm as providing a way to take the deductive closure of human knowledge: LLMs can memorize all of existing human knowledge, and can leverage chain-of-thought reasoning to combine that knowledge iteratively, drawing new conclusions. RLVF might hit limits here, but more innovative techniques might push past those limits to achieve something like the "deductive closure of human knowledge": all conclusions which can be inferred by some combination of existing knowledge.
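As a toy picture of what a "deductive closure" is as a fixed point (not a claim about how LLMs would actually compute it), here is the classic forward-chaining loop: keep applying inference rules to the current stock of facts until nothing new can be derived. The facts and rules are made-up stand-ins for human knowledge.

```python
# Toy "deductive closure": apply rules to known facts until a fixed point.
facts = {"socrates_is_human", "humans_are_mortal"}
rules = [
    ({"socrates_is_human", "humans_are_mortal"}, "socrates_is_mortal"),
    ({"socrates_is_mortal"}, "socrates_will_die"),
]

changed = True
while changed:                      # iterate until nothing new can be derived
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)   # a new conclusion deduced from old ones
            changed = True

print(facts)  # the deductive closure of the starting facts under these rules
```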
What might this deductive closure look like? Certainly it would surpass the question-answering ability of all human experts, at least when it comes to expertise-centric questions which do not involve the kind of "creativity" which Tsvi ascribes to humans. Arguably this would be quite dangerous already.
The Inductive Closure:
Another point which came up in the live debate was the connect-the-dots paper by Johannes Treutlein et al, which shows that LLMs can generate new explicit knowledge which is not present in the training data, but which can be inductively inferred from existing data-points. For example, when trained only on the input-output behavior of some unspecified Python function f, LLMs can sometimes generate the Python code for f.
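To illustrate the kind of setup involved (this is my own reconstruction for illustration, not the paper's actual code or data): the fine-tuning data contains only input-output pairs of an unnamed function, never its source code, and the test is whether the model can afterwards state the function explicitly.

```python
# Toy reconstruction of a connect-the-dots-style dataset: only I/O pairs
# of a hidden function appear in the training data, never its definition.
def hidden_f(x):          # the "unspecified Python function f"
    return 3 * x + 2      # never shown to the model in source form

training_examples = [
    {"prompt": f"f({x}) = ?", "completion": str(hidden_f(x))}
    for x in range(-20, 21)
]

print(training_examples[:3])
# After fine-tuning on examples like these, one asks the model something like
# "Write Python code for f" and checks whether it can produce 3 * x + 2.
```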
The connect-the-dots result suggests an even higher ceiling than the deductive closure, which we might call the "inductive closure" of human knowledge; IE, rather than starting with just human knowledge and then deducing everything which follows from it, I think it is also reasonable to imagine a near-term LLM paradigm which takes the deductive closure and adds everything that can be surmised by induction (then takes the deductive closure of that, then induces from those further datapoints, etc).
Again, this provides further motivation for thinking that realistic innovations in training techniques could shoot past the human-performance maximum which would have been a ceiling for GPT, or the human-discernment maximum which would have been a ceiling for RLHF.
Fundamental Limits of LLMs?
I feel this reply would be quite incomplete without addressing Tsvi's argument that the things LLMs can do fundamentally fall short of specific crucial aspects of human intelligence.
As Tsvi indicated, I agree with many of his remarks about the shortcomings of LLMs.
Example 1:
- I can have a nice long discussion about category theory in which I treat an LLM like an interactive textbook. I can learn a lot, and although I double-check everything the LLM says (because I know that LLMs are prone to confabulate a lot), I find no flaw in its explanations.
- However, as soon as I ask it to apply its knowledge in a somewhat novel way, the illusion of mathematical expertise falls apart. When the question I ask isn't quite like the examples you'll find in a textbook, the LLM makes basic mistakes.
Example 2:
- Perhaps relatedly (or perhaps not), when I ask an LLM to try and prove a novel theorem, the LLM will typically come up with a proof which at first looks plausible, but upon closer examination, contains a step with a basic logical error, usually amounting to assuming what was to be proven. My experience is that these errors don't go away when the model increments version numbers; instead, they just get harder to spot!
This calls into question whether anything similar to current LLMs can reach the "deductive closure" ceiling. Notably, Example 1 and Example 2 sound a lot like the capabilities of students who have memorized everything in the textbooks but who haven't actually done any of the exercises. Such students will seem incredibly knowledgeable until you push them to apply the knowledge to new cases.
My intuition is that Example 2 is mainly an alignment problem: modern LLMs are trained with a huge bias towards doing what humans ask (EG answering the question as stated), rather than admitting that they have uncertainty or don't know how to do it, or making other conversational moves which are crucial for research-style conversations but which aren't incentivized by the training. The bias towards satisfying the user request swamps the learned patterns of valid proofs, so that the LLM becomes a "clever arguer" rather than sticking to valid proof steps (even though it has a good understanding of "valid proof step" across many areas of mathematics).
Example 1 might be a related problem: perhaps LLMs try to answer too quickly, rather than reasoning things out step-by-step, due to strong priors about what knowledgeable people answering questions should look like. On this hypothesis, Example 1 type failures would probably be resolved by the same sorts of intellectual-honesty training which could resolve Example 2 type failures.
I should note that I haven't tried the sort of category-theoretic discussion from Example 1 with reasoning LLMs. It seems possible that reasoning LLMs are significantly better at applying the patterns of mathematical reasoning correctly to not-quite-textbook examples (this is exactly the sort of thing they're supposed to be good at!). However, I am a little pessimistic about this, because in my experience, problems like Example 2 persist in reasoning models. This seems to be due to an alignment problem; reasoning models have a serious lying problem.
We should also consider the hypothesis that Example 1 and Example 2 derive from a more fundamental issue in the generalization ability of LLMs: basically, they are capable of "interpolation" (they can do things that are very similar to what they've seen in textbooks) but are very bad at "extrapolation" (applying these ideas to new cases).
The Whack-A-Mole Argument
During the live debate, Mateusz (an attendee of the event) made the following argument:
- There's a common pattern in AI doom debates where the doomer makes a specific risk argument, the techno-optimist comes up with a way of addressing that problem, the doomer describes a second risk argument, the optimist comes up with a way of handling that problem, etc. After this goes back-and-forth for a bit, the doomer calls on the optimist to generalize:
- "I can keep naming potential problems, and you can keep naming ways to avoid that specific problem, but even if you're optimistic about all of your solutions not only panning out research-wise, but also being implemented in frontier models, you should expect to be killed by yet another problem which no one has thought of yet. You're essentially playing a game of whack-a-mole where the first mole which you don't wack in time is game over. This is why we need a systematic solution to the AI safety problem, which addresses all potential problems in advance, rather than simply patching problems as we see them."
- Mateusz compares this to my debate with Tsvi. Tsvi can point out a specific shortcoming of LLMs, and I can suggest a plausible way of getting around that shortcoming -- but at some point I should generalize, and expect LLMs to have shortcomings which haven't been articulated yet. This is why "expecting strong superintelligence soon" needs to come with a systematic understanding of intelligence which addresses all potential shortcomings in advance, rather than playing whack-a-mole with potential obstacles.
I'm not sure how well this reflects Tsvi's position. Maybe Tsvi is pointing to one big shortcoming of LLMs (something like "creativity" or "originariness") rather than naming one specific shortcoming after another. Nonetheless, Mateusz' position seems like a plausible objection: maybe human intelligence relies on a lot of specific stuff, and the long-timelines intuition can be defended by arguing that it will take humans a long time to figure out all that stuff. As Tsvi said above:
2. Evolution got a bunch of algorithmic ideas by running a very rich search (along many algorithmic dimensions, across lots of serial time, in a big beam search / genetic search) with a very rich feedback signal ("how well does this architecture do at setting up the matrix out of which a strong mind grows given many serial seconds of sense data / muscle output / internal play of ideas").
3. We humans do not have many such ideas, and the ones we have aren't that impressive.
My reply is twofold.
First, I don't buy Mateusz' conclusion from the whack-a-mole analogy. AI safety is hard because, once AIs are superintelligent, the first problem you don't catch can kill you. AI capability research is relatively easy because when you fail, you can try again. If AI safety is like a game of whack-a-mole where you lose the first time you miss, AI capabilities research is like whack-a-mole with infinite retries. My argument does not need to involve AI capability researchers coming up with a fully general solution to all the problems (unlike safety). Instead, AI capability researchers can just keep playing whack-a-mole till the end.
Second, as I said near the beginning, I don't need to argue that humans can solve all the problems via whack-a-mole. Instead, I only need to argue that key capabilities required for an intelligence explosion can continue to advance at rapid pace. It is possible that LLMs will continue to have basic limitations compared to humans, but will nonetheless be capable enough to "take the wheel" (perhaps "take the mallet") with respect to the whack-a-mole game, accelerating progress greatly.
Generalization, Size, & Training
What if it isn't a game of whack-a-mole at all; instead, there's one big failure in LLMs which reflects a fundamental difference between LLMs and human intelligence? The whack-a-mole picture suggests that there are lots of individual shortcomings, but each one can be addressed within the current paradigm (IE, we can keep whacking moles). What if, instead, there's at least one fundamental difference that requires really new ideas? Something fundamentally beyond the Deep Learning paradigm?
4. The observed performance of current Architectures doesn't provide very strong evidence that they have the makings of a strong mind. E.g.:
a. poor performance on truly novel / creative tasks,
b. poor sample complexity,
c. huge mismatch on novel tasks compared to "what could a human do, if that human could also do all the performance that the gippity actually can do"--i.e. a very very different generalization profile compared to humans.
I agree with Tsvi on the following:
- Current LLMs show poor performance on novel/creative tasks.
- Current LLMs are very data-hungry in comparison to humans; they require a lot more data to learn the same thing.
- If a human knew all the things that current LLMs knew, that human would also be able to do a lot of things that current LLMs cannot do. They would not merely be a noted expert in lots of fields at once; they would have a sort of synthesis capability (something like the "deductive closure" and "inductive closure" ideas mentioned earlier).
If these properties of Deep Learning continue to hold into the future, that suggests longer timelines.
Unfortunately, I don't think these properties are so fundamental.
- First and foremost, I updated away from this view when I read about the BabyLM Challenge. The purpose of this challenge is to learn language from amounts of data comparable to what humans learn from, rather than the massive quantities of data which ChatGPT, Claude, Gemini, Grok, etc are trained on. This has been broadly successful: with some architectural tweaks and more training epochs over the given data, Transformer-based models can achieve GPT2 levels of competence on human-scale training data. (Some rough arithmetic on the size of this data gap appears after this list.)
- Thus, as frontier capability labs hit a data bottleneck, they might implement strategies similar to those seen in the BabyLM challenge to overcome that bottleneck. The resulting gains in data efficiency and generalization might eliminate the sorts of limitations we are currently seeing.
- Second, larger models are generally more data-efficient. This observation opens up the possibility that the fundamental limitations of LLMs mentioned by Tsvi are primarily due to size. Think of a modern LLM as a parrot trained on the whole internet. (I am not claiming that modern LLM sizes are exactly parrot-like; the point here is just that parrots have smaller brains than humans.) It makes sense that the parrot might be great at textbook-like examples but struggle to generalize. Thus, the limitations of LLMs might disappear as models continue to grow in size.
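To give a rough sense of the data gap being discussed, here is some back-of-the-envelope arithmetic; the specific numbers are my own ballpark assumptions (roughly the BabyLM "strict" track budget and a round figure for frontier pretraining), not official figures.

```python
# Back-of-the-envelope comparison of data regimes (ballpark assumptions only).
babylm_words = 100e6          # BabyLM "strict" track: roughly 100M words
epochs = 10                   # reuse the same data many times
tokens_per_word = 1.3         # rough conversion factor
babylm_tokens_seen = babylm_words * tokens_per_word * epochs

frontier_tokens = 10e12       # frontier pretraining: on the order of 10T tokens

print(f"BabyLM-style training: ~{babylm_tokens_seen:.1e} tokens seen")
print(f"Frontier-style training: ~{frontier_tokens:.1e} tokens seen")
print(f"ratio: roughly {frontier_tokens / babylm_tokens_seen:,.0f}x")
```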
Creativity & Originariness
The ML-centric frame of "generalization" could be accused of being overly broad. Failure to generalize is actually a huge grab-bag of specific learning failures when you squint at it. Tsvi does some work to point at a more specific sort of failure, which he sometimes calls "creativity" but here calls "originariness".
To clarify a little bit: there's two ways to get an idea.
- Originarily. If you have an idea in an originary way, you're the origin of (your apprehension of) the idea. The origin of something is something like "from whence it rises / stirs" (apparently not cognate with "-gen").
- Non-originarily. For example, you copied the idea.
Originariness is not the same as novelty. Novel implies originary, but an originary idea could be "independently reinvented".
Human children do most of their learning originarily. They do not mainly copy the concept of a chair. Rather, they learn to think of chairs largely independently--originarily--and then they learn to hook up that concept with the word "chair". (This is not to say that words don't play an important role in thinking, including in terms of transmission--they do--but still.)
Gippities and diffusers don't do that.
Tsvi anticipates my main two replies to this:
The hidden assumption I'm not sure how to state exactly, or maybe it varies from person to person (which is part of why I try to elicit this by asking questions). The assumption might be like
The tasks that Architectures have had success on have expanded to include performance that's relatively more and more novel and creative; this trend will continue.
Or it could be
Current Architectures are not very creative, but they don't need to be in order to make human AGI researchers get to creative AI in the next couple years.
In my own words:
- Current LLMs are a little bit creative, rather than zero creative. I think this is somewhat demonstrated by the connect-the-dots paper. Current LLMs mostly learn about chairs by copying from humans, rather than inventing the concept independently and then later learning the word for it, like human infants. However, they are somewhat able to learn new concepts inductively. They are not completely lacking this capability. This ability seems liable to improve over time, mostly as a simple consequence of the models getting larger, and also as a consequence of focused effort to improve capabilities.
- An intelligence explosion within the next five years does not centrally require this type of creativity. Frontier labs are focusing on programming capabilities and agency, in part because this is what they need to continue to automate more and more of what current ML researchers do. As they automate more of this type of work, they'll get better feedback loops wrt what capabilities are needed. If you automate all the 'hard work' parts of the research, ML engineers will be freed up to think more creatively themselves, which will lead to faster iteration over paradigms -- the next paradigm shifts of comparable size to RLHF or RLVF will come at an increasing pace.
If it's something like
The tasks that Architectures have had success on have expanded to include performance that's relatively more and more novel and creative; this trend will continue.
then we have an even more annoying enthymeme. WHAT JUSTIFIES THIS INDUCTION??
To sum up my argument thus far, what justifies the induction is the following:
- The abstract ceiling of "deductive closure" seems like a high ceiling, which already seems pretty dangerous in itself. This is a ceiling which current LLMs cannot hit, but which abstractly seems quite possible to hit.
- While current models often fail to generalize in seemingly simple ways, this seems like it might be an alignment issue (IE possible to solve with better ideas of how to train LLMs), or a model size issue (possible to solve by continuing to scale up), or a more basic training issue (possible to solve with techniques similar to what was employed in the BabyLM challenge), or some combination of those things.
- If these failures are more whack-a-mole like, it seems possible to solve them by continuing to play the currently-popular game of trying to train LLMs to perform well on benchmarks. (People will continue to make benchmarks like ARC-AGI which demonstrate the shortcomings of current LLMs.)
- I somewhat doubt that these issues are more fundamental to the overall Deep Learning paradigm, due to the BabyLM results and to a lesser extent because generalization ability is tied to model size, which continues to increase.