1.     Introduction

1.1     Summary of key claims

  • Even without further breakthroughs in AI, language models will have big impacts in the coming years, as people start sorting out proper applications
    • The early important applications will be automation of expert advisors, management, and perhaps software development
    • The more transformative but harder prizes are automation of research and automation of executive capacity
  • In their most straightforward form (“foundation models”), language models are a technology which naturally scales to something in the vicinity of human-level (because it’s about emulating human outputs), not one that naturally shoots way past human-level performance
    • i.e. it is a mistake-in-principle to imagine projecting out the GPT-2—GPT-3—GPT-4 capability trend into the far-superhuman range
    • Although they’re likely to be augmented by things which accelerate progress, this still increases the likelihood of a relatively slow takeoff — several years (rather than weeks or months) of transformative growth before truly wild things are happening seems plausible
    • NB a version of “speed superintelligence” could still be transformative even while performance on individual tasks is still firmly human level
  • There are two main techniques which can be used (probably in conjunction) to get language models to do more powerful things than foundation models are capable of:
    • Scaffolding: structured systems to provide appropriate prompts, including as a function of previous answers
    • Finetuning: altering model weights to select for task performance on a particular task
  • Each of these techniques has a path to potentially scale to strong superintelligence; alternatively language models might at some point be obsoleted by another form of AI
    • Timelines for any of these things seem pretty unclear
  • From a safety perspective, language model agents whose agency comes from scaffolding look greatly superior to ones whose agency comes from finetuning
    • Because you can get an extremely high degree of transparency by construction
    • Finetuning is more likely an important tool for instilling virtues (e.g. honesty) in systems
    • Sutton’s Bitter Lesson raises questions for this strategy, but needn’t mean it’s doomed to be outcompeted
  • On the likely development trajectory there are a number of distinct existential risks
    • e.g. guarding against takeover from early language model agents is pretty different from differential technological development to ensure that we automate safety-enhancing research before risk-increasing research
    • The current portfolio of work on AI risk is over-indexed on work which treats “transformative AI” as a black box and tries to plan around that. I think that we can and should be peering inside that box (and this may involve plans targeted at more specific risks).

1.2     Meta

We know that AI is likely to be a very transformative technology. But a lot of the analysis of this point treats something like “AGI” as a black box, without thinking too much about the underlying tech which gets there. I think that’s a useful mode, but it’s also helpful to look at specific forms of AI technology and ask where they’re going and what the implications are.

This doc does that for language models. It’s a guide for thinking about them from various angles with an eye to what the strategic implications might be. Basically I’ve tried to write the thing I wish I’d read a couple of years ago; I’m sharing now in case it’s helpful for others.

The epistemic status of this is “I thought pretty hard about this and these are my takes”; I’m sure there are still holes in my thinking (NB I don’t actually do direct work with language models), and I’d appreciate pushback; but I’m also pretty sure I’m capturing some important dynamics which aren’t as broadly appreciated as they should be. Many of the particular insights here are due to other people. I want to say thanks to Adam Bales, Anna Wang, Buck Shlegeris, Carl Shulman, Daniel Dewey, Eric Drexler, Max Dalton, Nate Soares, Rebecca Cotton-Barratt, Rohin Shah, Rose Hadshar, Tom Davidson, and especially Beth Barnes, David Manheim, Lukas Finnveden, and Toby Ord, for helpful comments and/or conversations.

2.     What type of thing are language models?

2.1     Emulating civilization, not individual people

The field of AI was originally about reproducing human intelligence. Humans are good at finding patterns and learning things. If we could automate the type of thinking they do, that would be a big deal. If we could build automated systems which were better general learners and thinkers than humans, it would transform the world.

Language models aren’t really trying to do the same thing. This may be a surprising claim; they’re a type of machine learning, and machine learning is doing exactly this. However, I think it’s clearer to think of language models as a specialized application of machine learning. Sure, they make use of machine learning techniques, but their game isn’t really “be better than humans at learning from a certain amount of language” (indeed they’re fed with so much data that they can be much less efficient learners than humans, and I don’t think this is a crux for how important they will be). It’s “replicate the kind of things humans say”.

This is powerful because humans, collectively, know a bunch of stuff, both implicitly and explicitly. There’s a lot of knowledge and intelligence which is crystallized in our writing. If the language models of today seem to know a lot of things, this isn’t because they’ve gone out and understood the world directly, but because they’re leveraging knowledge which is represented in human text. 

Moreover language is the medium via which we construct concepts and make explicit arguments — powerful tools for understanding and acting in the world. The ability to approximate human writing — even if not based on the same underlying learning abilities — might reproduce a lot of that intelligence. 

All of this matters for thinking about the impacts language models are likely to have, and where they might be going. In slogan form, perhaps:

Machine learning reimplements human intelligence.

Language models emulate humanity’s collective intelligence.

Note that language models could be used to emulate the written output of individual people, if a prompt was specific enough that it tightly specified the author. But this isn’t their default mode — mostly predicting text will depend on averages across a lot of different (possible) people (weighted by how likely those people were to be writing about the topic).
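One way to write down this averaging, as a sketch: treat the (unobserved) author $a$ as a latent variable and apply the law of total probability. The notation is introduced here purely for illustration; $x_{1:t}$ is the text so far and $x_{t+1}$ is the next token.

```latex
P(x_{t+1} \mid x_{1:t}) \;=\; \sum_{a} P(a \mid x_{1:t}) \, P(x_{t+1} \mid x_{1:t}, a)
```

The weight $P(a \mid x_{1:t})$ is "how likely this person was to be writing this text". A prompt that tightly specifies the author concentrates that weight on one $a$, which recovers the individual-emulation case.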

2.2     An extremely crude picture of how language models work

For the purposes of this document, what I think is important:

  • Language models are doing “next-token” generation. This amounts to doing things word-by-word — given an input string, they produce a further token continuing it, and if you want a longer text you repeat this.
  • Language models are large neural nets, with some specific architectures (notably transformers) which have been found to be efficient for language learning.
  • When a language model is run to make a token prediction (a “forward pass” through the model), it can only do a relatively limited amount of computation; however since they are large they may contain things approximating large lookup tables, and contain a lot of crystallized intelligence that way.
  • Language models are trained by gradient descent — iteratively searching through the (vast) space of possible parameters to find ones which perform better.
  • “Foundation models” are the most straightforward version of language models. They are trained on a corpus of text, to generate text from the same distribution. If they are given a prompt, they will predict what the most likely continuation of that prompt would be, if it were drawn from the same distribution as the training data.
  • Some of the more interesting and powerful applications of language models do something beyond just working with foundation models.
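To make the generation loop from the first bullet concrete, here is a minimal sketch in Python. The lookup table is only a stand-in for the neural net and is an assumption for illustration: a real model conditions on the whole context and works on sub-word tokens rather than words. The names `TOY_MODEL`, `next_token_distribution` and `generate` are hypothetical.

```python
import random

# Toy stand-in for a trained model: maps the last token to a distribution
# over next tokens. A real language model conditions on the whole context.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<end>": 1.0},
}


def next_token_distribution(context):
    """One 'forward pass': return P(next token | context)."""
    return TOY_MODEL.get(context[-1], {"<end>": 1.0})


def generate(prompt, max_new_tokens=10, seed=0):
    """Repeatedly sample one token and append it to the context."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        tokens, weights = zip(*dist.items())
        token = rng.choices(tokens, weights=weights)[0]
        if token == "<end>":
            break
        context.append(token)  # the new token becomes part of the input
    return context


print(generate(["the"]))
```

Each pass through the loop corresponds to one forward pass; longer text only ever comes from repeating that single step.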

2.3     What are foundation models approximating?

We can think of foundation models as a series of approximations. A given foundation model Wi approximates the limit WText of what we could achieve with ideal machine learning and all extant text. This in turn approximates WOmega, which is the true distribution human writing is drawn from. Foundation models can never actually achieve “the true distribution”, but understanding that this is what they’re approximating may help us to understand their scope as a technology.

Here’s a digression digging a bit deeper on these concepts:

  • WOmega is the hypothetical which foundation models approximate, predicting human language continuations in line with the true underlying distribution
    • It’s a little vague how this “the true distribution” is defined, but let’s suppose that we get writing from not just our world but a bunch of close counterfactual worlds (including ones which are slightly in the future), with probability mass falling off for worlds which are at greater counterfactual distance from our own.
    • WOmega is a black box, not structured as a neural net
      • It’s likely that it can’t be perfectly instantiated as a neural net (of non-astronomical size):
        • In order to perform optimally, it would need in some sense to contain implicit models at least as rich as all humans — if it’s predicting text, it wants to be able to make inferences about the type of person that might be writing it, and if it’s a long way into a piece of text then it will be making complex inferences about their psychology to predict the next words
        • Almost certainly there are enough cases where there’s an irreducibly large amount of computation needed to make predictions that it can’t fit all of that into a forward pass — nor store enough information to have cached answers in all of those cases
    • A black box which literally gave access to WOmega would be a very powerful artefact
      • As well as excellent predictions about individual humans, WOmega probably contains a lot of implicit knowledge about the world — perhaps more than humanity understands. It will be able to tell which hypothetical scientific theories are plausible continuations of current progress, and which are not. With the right coaxing it might be possible to extract some of this knowledge from it.
        • This probably won’t work for radically different technological regimes — these are far enough out of distribution that if a piece of text starts describing a radically new form of science or technology, it may be more likely to be some kind of fiction or mistake than some distant counterfactual world with much more advanced technology
  • WText is a theoretical limit of what’s achievable with machine learning in the actual world
    • It’s again vague how it’s defined, but roughly:
      • It’s only trained on actually-extant human writing
      • It trains a neural net, which may be larger than existing ones or make use of new architectures, but isn’t astronomically large
      • It only has access to a finite amount of compute (say all currently extant compute for a year)
    • WText is an approximation to WOmega, and likely has several of the same properties, in weaker forms:
      • Computational constraints will prevent it from considering very complex world models like simulations
        • It therefore won’t contain models “as rich as all humans”; however it likely still will be good or very good at predicting human psychology, and it will have a lot of implicit knowledge about the world
      • I think it may more often be an issue that it only has text as a window to understand the world — if there’s something that’s only been written about a handful of times, or that people have been pretty bad at understanding/describing, that thing may be severely underdetermined from the perspective of WText (whereas WOmega has direct access to the underlying distribution)
        • Though see the discussion of multimodal models in Section 4.3
    • It’s instantiated as a neural net
      • This means that if in the process of learning to do good token prediction it’s implicitly learned about important structures in the world, that knowledge is somehow encoded in its data structures, and there may be ways other than simply asking for token predictions to access/use and extract value from that
  • Actual foundation models Wi are approximations to WText
    • They are weaker in two ways, analogously to how WText is weaker than WOmega:
      • 1) They are trained on smaller corpuses
      • 2) They are weaker predictors even for these smaller corpuses
        • Their hypothesis space is smaller/worse (presumably using fewer variables, and likely also lacking some architectural benefits)
        • They are further away from being trained to convergence
    • Both of these weaknesses have real impacts
      • The net effect is that rather than something of superhuman intelligence, we get something like an “inebriated” version of WText
      • Stronger language models are less inebriated than weaker versions, although they generally seem to stay below the threshold of human expertise (for now)

3.     Techniques for getting value from language models

A major focus of research on language models has been on improving the foundation models — getting better approximations to WText. But there is important complementary research on the question: for a fixed foundation model Wi, how can you do useful things? There are a few different techniques:

3.1     Prompt engineering

The output of foundation models depends on the prompts they are given. This would be true of WOmega — the value of being able to sample from all possible human documents would be importantly dependent on the ability to steer towards the most useful parts of document space. For the weaker foundation models we have, there may be other helpful tricks in designing prompts.

Over the last couple of years, as people have played around with language models, a lot of parallelized labour has gone into finding the style of prompts that is most likely to lead to good things. To the extent that people are finding knowledge about how to get value out of WOmega, this will generalize to future language models; to the extent that they’re learning tricks peculiar to the current generation of foundation models, it may not.

3.2     Scaffolding

Scaffolding is the general category of designing environments around language models which feed them prompts and process their outputs. Scaffolding is a broad category of which the most straightforward case is just prompt engineering, but in general it allows for complex procedures where the output in response to earlier prompts is fed into other software tools, and these determine what is put into later prompts.

For example, scaffolding could allow for a model to make multi-stage plans and then call separate instances to execute each of those stages without losing track of where it is, and to make use of tools such as browsing the internet and writing and executing software.
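As a rough illustration of the plan-then-execute pattern, here is a minimal Python sketch. Everything in it is an assumption for illustration: `call_model` is a hypothetical stand-in that returns canned text so the sketch runs offline, and the prompt formats are invented. The point is only that ordinary software, not the model, holds the plan and tracks progress.

```python
def call_model(prompt):
    """Hypothetical stand-in for a language model call.

    Returns canned text so the sketch runs offline; a real scaffold would
    send `prompt` to an actual model here.
    """
    if prompt.startswith("PLAN:"):
        return "1. gather sources\n2. draft summary\n3. check draft"
    return f"done: {prompt.splitlines()[-1]}"


def run_scaffold(task):
    """Ask for a plan, then run each stage in a fresh model call."""
    plan = call_model(f"PLAN: break this task into numbered stages.\n{task}")
    stages = [line.split(". ", 1)[1] for line in plan.splitlines()]
    log = []  # the scaffold, not the model, keeps track of progress
    for i, stage in enumerate(stages, start=1):
        prompt = (
            f"Task: {task}\n"
            f"Completed so far: {log}\n"
            f"Stage {i} of {len(stages)}: {stage}"
        )
        log.append(call_model(prompt))  # earlier outputs feed later prompts
    return log


for entry in run_scaffold("write a literature review"):
    print(entry)
```

Because every prompt and every intermediate output passes through the scaffold, the whole trace can be logged and inspected.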

Limits of what might be achievable via scaffolding are discussed in Section 6.3.2.

3.3     Finetuning

Finetuning takes a foundation model and runs more machine learning to adjust just some of the weights — using the foundation model to give an inductive bias in its search for more refined models. The idea is that it’s much easier to find models which are smart in arbitrary ways if you’re restricted to a much smaller-dimensional search space. For small amounts of finetuning, we might think of the inductive bias as being roughly “only consider saying things that humans might say”. For larger amounts of finetuning the bias might be more structural, making use (in opaque ways) of implicit knowledge the language model has to restrict the search space.

Finetuning relies on having some metric, or feedback loop, to train things towards. This could be given by some body of text it’s trying to emulate, or by some other function of text output. 
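A toy sketch of the idea, under loud assumptions: the "foundation model" here is just three fixed numbers, the task is fitting y = 3x, and only a small "head" on top of the frozen features is trained. Real finetuning may adjust any subset of a network's weights; this only illustrates searching a small space against a metric while the rest stays fixed. All names are hypothetical.

```python
# Pretend "foundation model": a fixed feature map learned earlier.
FROZEN_WEIGHTS = [0.5, -1.0, 2.0]


def features(x):
    """Frozen part of the model; finetuning does not touch these weights."""
    return [w * x for w in FROZEN_WEIGHTS]


def predict(head, x):
    """Trainable head: a weighted sum of the frozen features."""
    return sum(h * f for h, f in zip(head, features(x)))


def loss(head, data):
    """Mean squared error: the metric finetuning trains towards."""
    return sum((predict(head, x) - y) ** 2 for x, y in data) / len(data)


def finetune(head, data, lr=0.01, steps=200):
    """Gradient descent over the head weights only."""
    head = list(head)
    for _ in range(steps):
        grads = [0.0] * len(head)
        for x, y in data:
            err = predict(head, x) - y
            for j, f in enumerate(features(x)):
                grads[j] += 2 * err * f / len(data)
        head = [h - lr * g for h, g in zip(head, grads)]
    return head


data = [(x, 3.0 * x) for x in (-2.0, -1.0, 1.0, 2.0)]  # toy task: y = 3x
start = [0.0, 0.0, 0.0]
tuned = finetune(start, data)
print(loss(start, data), loss(tuned, data))
```

The search runs over three numbers rather than the whole model, which is the sense in which the starting model restricts the search space.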

3.4     Combining these

Scaffolding and finetuning can be combined. Generically I think they will be. For it not to make sense to use scaffolding, it would have to be the case that the trivial scaffold performed (roughly) as well as anything else. I think this is implausible at least in the short term. And it would be even more surprising if foundation models — which were selected for their ability to emulate human outputs — happened to be optimal among nearby systems for their performance when used in an effective scaffold. I therefore think it’s very likely to be optimal to make use of finetuning as well.

We might think of finetuning as analogous to on-the-job training for the use-case at hand, and scaffolding as analogous to setting up a good management structure and organizational protocols. The analogy supports the idea that a combination of the two may be most effective.

4.     Natural limits of language models

In Section 5 we’ll start to look at the impacts language models will have in the world as they are further developed and deployed. In order to facilitate that, in this section we’ll look at some natural limits on the kind of things language models are doing. We’ll be concerned with “what kind of outputs can they produce?”; questions of how fast they can produce those, or how they are integrated into society are of central importance for how much impact they end up having, but out of scope for what I want to explore in this section.

4.1     Approximating human capabilities, not superhuman capabilities

There’s a common argument about AI that goes roughly:

There’s nothing special about human capability levels. In any given domain, if AI capabilities are advancing rapidly towards human-level, they’ll probably continue advancing rapidly way past human-level.

Foundation models have been rapidly advancing towards giving human-level responses to many different types of questions: they are rapidly approaching human-level at writing poetry, or explaining physics, or concocting recipes — in the sense that they are far closer to human level now than they were three years ago. Foundation models, however, are emulating human outputs. To the extent that they have human capabilities, they have these via emulation. So the argument doesn’t apply (at least in the straightforward way); rather, we should expect progress to slow down when the quality of their outputs is somewhere in the vicinity of (peak) human performance.

There are a couple of important caveats here:

  • There are techniques based on scaffolding and on finetuning which could help to push performance to the superhuman regime
    • In this case, human performance could still be a distinguished point in terms of what can be achieved via emulation rather than a different approach, but at some point it might rapidly be obsoleted
    • I’ll discuss possible paths for scaling language models towards superintelligence in a later section
  • We can apply the argument (that we should expect performance to blow quickly past human level) to capabilities which belong to the process that trains language models, rather than abilities the trained models have learned from their text corpus
    • e.g. “how good are they at learning language?”, or probably “how good are they at generalizing patterns?”
    • But not “how good are they at knowing the law?”, or “how good are they at providing strategic advice?”

4.2     Limited cognition per forward pass

To produce a single token, a language model makes a single forward pass over the neural net. To produce longer pieces of text, it repeatedly produces single tokens, with everything it’s produced so far added to the context.

Each forward pass amounts to something of similar complexity to multiplying together some large matrices. This gives lots of room for something like consulting an index and accessing stored knowledge, but relatively limited space for something like “thinking new thoughts”. 

By analogy, when humans learn arithmetic they do it by a mix of rote memorization — many of us see “3x7” and instinctively know that the answer is “21” without calculating anything — and processes for calculating things (e.g. long division). Language models are structured in a way that can make them good at the rote memorization part, but they cannot in a single forward pass do a large amount of following a process.

This means that we can construct tasks that even very strong foundation models will predictably be weak at. e.g. — 

The remainder of [352 digit number] when divided by [219 digit number] is …

WOmega probably gets this right most of the time. But WText probably gets it wrong almost all of the time. (Unless there are some heuristic tricks I’m unaware of. I’d be more confident in my example if it asked for prime factorizations.)
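To see why this is a "following a process" task rather than a lookup, here is a sketch of schoolbook long division in Python. It is only an illustration of the shape of the computation (a long chain of steps, each depending on the previous result), not a claim about what any particular model can or cannot fit into a forward pass. The function name and the randomly generated numbers are hypothetical.

```python
import random


def remainder_by_long_division(numerator, divisor):
    """Schoolbook long division on decimal strings.

    Returns (remainder, steps), where steps counts digit updates plus
    subtractions; each one depends on the result of the previous one.
    """
    d = int(divisor)
    remainder = 0
    steps = 0
    for digit in numerator:
        remainder = remainder * 10 + int(digit)  # bring down the next digit
        steps += 1
        while remainder >= d:  # at most 9 subtractions per digit
            remainder -= d
            steps += 1
    return remainder, steps


rng = random.Random(0)
n = "7" + "".join(rng.choice("0123456789") for _ in range(351))  # 352 digits
d = "3" + "".join(rng.choice("0123456789") for _ in range(218))  # 219 digits
rem, steps = remainder_by_long_division(n, d)
print(rem == int(n) % int(d), steps)
```

No step can start until the one before it has finished, so the work cannot simply be spread across a wider computation.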

There are three important caveats here:

  • In a single forward pass, language models are not restricted to rote memorization, but can use an arsenal of simple heuristics as well. The key constraint is that the amount of new thinking is very limited, not that there is none.
  • It’s hard to infer limits on “how smart a language model can be”, since it could have a lot of extremely smart ideas as ~cached thoughts (an analogue of rote memorization) which it could reproduce given the right prompt. 
    • This could be thought of as analogous to the “Chinese room” thought experiment
    • If we’re positing a language model which has extremely smart ideas in this way, it raises the question “by what process did these thoughts originate?”
      • In the preceding section, I give an argument that for today’s language models, we’re seeing the fruits of them having humanity’s past thinking as their cached thoughts, but we can entertain processes which produce smarter things
  • With many forward passes, language models could do a lot of cognition
    • In order to have this meaningfully address the different parts of the problem, some structure would need to ensure that they meaningfully decomposed the necessary thinking, and that different forward passes received the relevant context to let them do the respective pieces
      • By default I think that getting many forward passes just by letting language models produce long text answers won’t help in the relevant way
      • “Chain of thought”, where language models are prompted to show their thinking, could help significantly with this, as different places in the thinking could make it clear what needs to be thought about next
      • More generally, scaffolding seems like a powerful tool for helping with this

4.3     Missing cognitive moves?

Language models are capable of reproducing some types of ~atomic cognitive move that humans use. There may be others — at least at any given moment in time — that they cannot reproduce.

Reasons that they might not be able to reproduce a given cognitive move:

  • The architecture used for the model doesn’t support it as a simple structure
  • There isn’t enough space in the neural net to do the move in a single forward pass
  • The training process doesn’t give a good path to learning the move (even if it could in theory be supported on the given neural net)

It’s worth being aware that there could be constraints from these on what language models can do, but that this might change as architectures improve or models become bigger. (Furthermore, it might be that at some stage — if not already — language models can make useful cognitive moves that humans are incapable of.)

Multimodal models

One concern might be that language models are only equipped to deal with things in language. How do multimodal models affect this picture? Multimodal language models are the same basic technology as language models, but they use encodings of non-text data into a kind of text to allow the models to interface with this non-text data. They can output non-text via the encoding if that’s the thing that the language model predicts will happen.

Multimodal language models are therefore able to interface with and think about non-text data. But they may (at least for now) be more likely to lack the correct architecture to reproduce the type of cognitive moves humans do with non-text data. However, language models could be augmented with various capacities by using scaffolding to give them access to interfaces which permit them to query other kinds of objects (e.g. image processing; running physics simulations).

5.     Early major impacts of language models?

5.1     Principles for thinking about this

The main metaphor I use to think about this goes as follows:

Suppose you have a large workforce of relatively expert people — whom you can train at significant expense, and who will then work very fast for a very small fraction of minimum wage — but they’re all a bit drunk and only working from home. What can you usefully do with them?

Of course this metaphor isn’t perfect (and readers may want to think about its imperfections to critique the conclusions I draw from it), but I think it’s probably pretty good as a starting point. A major intuition that I have about that scenario — which I think is probably accurate about the actual situation with language models — is “wow, there’s a really big prize available here for whoever can figure out how to use these folks to do useful stuff”. And there will certainly be incentives to develop techniques to mitigate the obvious disadvantages of being drunk (e.g. via automated error checking).

A couple of people have mentioned to me another metaphor: a large force of interns. I think this is also good; it’s a little better in suggesting that by default they don’t know much about the task at hand, but a little worse in suggesting that they get their knowledge about the domain by looking things up rather than by half-remembering (or occasionally fabricating) facts, and in suggesting strategies like “identify the good interns” which don’t really translate over.

A quick note/aside on the economics:

 As of mid-2023, GPT-4 costs around $0.1 per 1,000 tokens (around 750 words). That’s about 10 minutes of typing at 75wpm. So we’re looking at getting this work at around $0.6/hour — equivalent to perhaps 5% of minimum wage in rich countries. I don’t know how high the markup that OpenAI charges is relative to their marginal cost of providing the service, but I wouldn’t be surprised if the production cost at the margin is much cheaper than that (they charge like 2% of that — 0.1% of minimum wage — for older models, and my guess is that that’s much closer to the marginal production cost for them, and could still be significantly above marginal production cost). Perhaps newer more sophisticated models will be more expensive, but also perhaps progress (or improving compute) drives down prices. (Of course if compute starts being super valuable for this purpose that is likely to push the price of compute up, at least until compute manufacturing can be scaled up to meet demand again.)
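The arithmetic behind those figures, spelled out as a small Python sketch. The price, words-per-token conversion, typing speed and "2%" ratio are the ones quoted above; the $12/hour minimum wage is an assumption chosen to match the "perhaps 5%" figure, not a number from a specific country.

```python
PRICE_PER_1K_TOKENS = 0.10      # USD, GPT-4 figure quoted above (mid-2023)
WORDS_PER_1K_TOKENS = 750       # rough conversion quoted above
TYPING_SPEED_WPM = 75           # human typing speed used for comparison
MINIMUM_WAGE = 12.0             # USD/hour; assumed, not stated in the text
OLDER_MODEL_PRICE_RATIO = 0.02  # "like 2% of that" for older models

# How long a human would take to type what 1,000 tokens buys.
minutes_of_typing = WORDS_PER_1K_TOKENS / TYPING_SPEED_WPM

# Cost of an hour's worth of typed output at the quoted price.
hourly_cost = PRICE_PER_1K_TOKENS * 60 / minutes_of_typing

share_of_wage = hourly_cost / MINIMUM_WAGE
older_share = share_of_wage * OLDER_MODEL_PRICE_RATIO

print(f"{minutes_of_typing:.0f} minutes of typing per 1,000 tokens")
print(f"${hourly_cost:.2f}/hour, {share_of_wage:.1%} of minimum wage")
print(f"older models: {older_share:.2%} of minimum wage")
```

Changing `MINIMUM_WAGE` or the price constants shows how sensitive the comparison is to those inputs.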

OK, so that’s the groundwork. Now to think about what this could mean for where the transformative impacts come. Some observations:

  • Likely there are some tasks for which it’s much easier to get useful work out of drunk experts than others
    • Especially ones where you don’t need them to be super reliable or to follow long/complex chains of reasoning
  • We should expect the early major applications to be ones where:
    • It’s easy to get useful work out of drunk experts, &
    • There’s a lot of work that can be done by the same experts (without retraining to niche cases, which could get expensive)
  • There may be some significant impacts on the labour market as some work is automated away — and this could have serious social/political effects — but it’s unlikely to be directly transformative in cases where the elasticity of demand isn’t high (so the volume of work doesn’t go up much even when the price drops a lot)
  • Potentially transformative: work where elasticity of demand is high (so that when the price drops a lot we just get a lot more of that work done)
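To illustrate the elasticity point in the last two bullets, here is a small Python sketch. It assumes a constant-elasticity demand curve, which is a standard simplification rather than anything claimed above, and the price drop and elasticity values are made up for illustration.

```python
def volume_multiplier(price_ratio, elasticity):
    """Constant-elasticity demand: Q1/Q0 = (P1/P0) ** (-elasticity)."""
    return price_ratio ** (-elasticity)


PRICE_RATIO = 0.05  # illustrative: price falls to 5% of its old level

for elasticity in (0.3, 1.0, 1.5):
    volume = volume_multiplier(PRICE_RATIO, elasticity)
    spending = volume * PRICE_RATIO  # total spending relative to before
    print(f"elasticity {elasticity}: volume x{volume:.1f}, spending x{spending:.2f}")
```

With low elasticity the same work just gets cheaper; with high elasticity the amount of work done grows many times over, which is the potentially transformative case.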

5.2     Important early areas for automation

There are several categories of intellectual labour that I think might be automatable with language models and really important. Three of them together I think might change the world a lot — perhaps on a comparable scale to the industrial revolution, but probably not radically beyond that. In roughly increasing order of importance, they are:

  • Expert advice (teaching, medical advice, coaching, legal advice, consultancy, …)
    • There are lots of things where people benefit from professional advice, but it’s expensive so many don’t get to access it
    • In a world where this is cheap this affects a lot of people’s lives
      • Elasticity of demand is pretty high, and this likely caps out many times higher than current demand — but it ultimately does probably cap out
    • I think it’s very likely that major applications are possible here without any fundamental advances in the foundation models
    • Of course people would prefer advice from experts who aren’t drunk and don’t hallucinate, but:
      • If this kind of error is only occasional, they may prefer getting the advice to going without it
      • People are likely to work out tricks to reduce the frequency and impact of these issues 
  • Software engineering
    • The reason this is potentially a big deal is that software builds on software — we get layers of architecture assembling into large-scale valuable things. Elasticity of demand could be high if civilization uses this to develop new digital capabilities.
    • This also probably facilitates the big two (see below).
    • We may expect to see success here from language models, since:
      • Software is naturally language-based, and we’ve already seen LLMs do well with it
      • Software is a domain where it’s easy to evaluate success, which means that it’s a relatively easy domain to train things to do well at
    • Big successes (involving planning the architecture for complex pieces of software) may or may not require significant improvements in foundation models 
  • Management
    • Overseeing established systems and processes and keeping them running well
    • Can include many different kinds and scales of management, e.g.:
      • Personnel management (keeping in touch with people working and ensuring that they’re staying on task and well-motivated, and issues they have can be raised appropriately)
      • Project management
      • Inventory management
      • Customer service (managing a good relationship with a customer)
      • Implementation of an interface with standard rules for what should happen (e.g. could potentially replace many government services)
    • NB effective automation of management may be trickier than effective automation of expert advice for a couple of reasons:
      • It’s really crucial to be able to handle the inebriation issue — you want your management structures to be highly reliable
      • Effective implementation is going to require integration into messy human structures, which will likely take a bunch of time to experiment with and for people to get used to, beyond the point where it would be technologically feasible
    • Nonetheless I think it’s highly likely that a lot of management could be automated without significant further progress in foundation models
    • Automation of management seems like a big deal in terms of enabling the construction of large classes of automated structures/systems
      • In this sense it’s a broadening of software (which is the automation of very precisely specified tasks and the assembly of these into larger structures); I don’t know what all of the applications might be, but it seems like a big deal as a possible platform technology 

5.3     The big two applications

More important than the preceding, I think there are two really important applications, which have the potential to radically reshape the world:

  • Research
    • The ability to develop and test out new ideas, adding to the body of knowledge we have accumulated
    • Automating this would be a massive deal for the usual reasons about feeding back into growth rates, facilitating something like a singularity
      • In particular the automation of further AI development is likely to be important
    • There are many types of possible research, and automation may look quite different for e.g. empirical medical research vs fundamental physics vs political philosophy
      • The sequence in which we get the ability to automate different types of research could be pretty important for determining what trajectory the world is on
  • Executive capacity
    • The ability to look at the world, form views about how it should be different, and form and enact plans to make it different
    • (People sometimes use “agency” to describe a property in this vicinity)
    • This is the central thing that leads to new things getting done in the world. If this were fully automated we might have large fully autonomous companies building more and more complex things towards effective purposes.
    • This is also the thing which, (if/)when automated, creates concerns about AI takeover risk.

I think that these two categories are likely at least somewhat harder to get high quality automated work out of than expert advice or management. Why?

  • They seem to have to keep more things in mind and deal with complex situations 
    • So that “being drunk” seems like more of a hindrance
  • The most important work is often:
    • further from “working from a script / playbook”
      • (things that language models excel at)
    • closer to “stare into the void until you have a vision that functions well with the shape of the world, and then use language to articulate the thing you’re thinking about”
      • (not a natural MO for language models, at least on the face of it; this could relate to the discussion in the previous section on missing cognitive moves)

I’m not sure how big/thorny these obstructions are. The prizes from automating them are very high, so there will be a lot of pressure to find the paths of least resistance. e.g. even if the most efficient way for humans (and hypothetical ideal AI) to do work here is more like “stare into the void and then bring it back to the domain of language” rather than just doing all the reasoning at the verbal level, if there’s a way to get comparably good results by doing everything at the explicit verbal level and it’s just 100x slower, that could still be enough to get you something transformative.

High quality software engineering has some of the same obstructions, but because it’s so easy to get a high-quality success metric, we may expect self-play to help push model performance up to human-level and beyond relatively early. Research and executive capacity face issues with epistemic grounding: how can you be confident that one angle leads to better takes than another? We may ultimately need to rely on real-world feedback loops to help learn this, but they may be slow.

We should probably expect research and executive capacity to be partially automated (and so performed by centaurs, i.e. human–AI teams) before they’re fully automated. At minimum, many people in research and executive roles spend good fractions of their time on software or management tasks, so automating the latter would increase total capacity for the former.

6.     Timelines and takeoffs

6.1     How quickly is all of this likely to happen?

My view is that for a lot of the pieces with significant societal impacts, the fundamental technology is already here. Over the next 5–10 years we might see people building and deploying systems which do a lot of stuff in the world, based on near-term-accessible language models. A lot of innovation will come from startups doing “X with AI”, for various applications X mostly providing expert advice or management services. They will often start by doing it in ways that have human oversight for quality control and training purposes, but reduce the degree of human oversight over time. By default the developers will make use of both finetuning and scaffolding — just hackily throwing stuff together to find out what works. 

The vibe I’m imagining for this is something like the Industrial Revolution or the Wild West, not a nuclear arms race. This could be enough to create significant social unease, centred in the middle classes, as many people see their livelihoods threatened, and more feel uncomfortable with how fast everything is changing.

(If I’m wrong about them having big impacts over this timescale, it’s probably because of some important missing cognitive move which restricts their usability, perhaps something about reliability. But my guess is that these kinds of issues will turn out not to be a big problem, or will be surmountable given the scale of the prizes.)

We may see something more like a race for big-2 capabilities. Because fully automated versions of these could potentially be deployed at very large scale by a single actor (rather than quickly saturating demand), the incentives for a pure race could exist. However, I think it’s most likely that for a while centaurs will significantly outperform fully automated systems. If this is right, then while there’s quite likely to be a race for full automation at some point, that would occur in a world which looks significantly transformed from the one we see today (where research has already been accelerated by centaur human-AI teams, and a lot of important planning in the world is done by humans aided by AI). The duration of this centaur period, and especially how long we have in the “late centaur” period where efficiency of research is many times what it is today, could be important for determining how different that future world is.

I’m pretty unsure how far we are from ~full automation of big-2 capabilities. When I try to visualize future world trajectories and look for the most coherent ones, I think it’s most likely that this is somewhere in the range 5–15 years away; but I’m not confident in this. At the point where that process is really taking off I expect it will overtake the kind of broad societal impacts I’ve just been discussing, if it is not otherwise constrained.

6.2     Scaling language models towards superintelligence

Foundation models get their oomph from approximating human writing. They can approximate smart or knowledgeable humans (with the right prompts, or the right training corpus). But for getting significantly superhuman performance, they would need something else. What could that be?

Two techniques which might be helpful components:

  • Finetuning for superhuman task performance
  • Scaffolding for amplification via reflection

6.2.1     Finetuning for superhuman task performance

For tasks with well-defined success metrics, simply training to do well on those tasks could produce superhuman performance. How quickly this will happen is likely to depend on the task. In the limit with a rich enough model space, enough training data, and enough training time, we might expect to end up approximating optimal performance (and hence exceeding human performance) at every task. But in practice performance on some tasks might be capped by what is achievable within the model space, and might face challenges in getting good enough data.

Still, finetuning for superhuman performance seems like an important part of the picture. At tasks like “write an argument which is persuasive to X audience”, where there is lots of data available on the reactions of that audience, we might expect language models to do pretty well pretty quickly (especially to the extent that persuasiveness is a function of local sentence choice and not larger-scale structures of how arguments fit together). At tasks like “give a winning chess move”, we can generate high quality synthetic data so that it’s likely that we can finetune model performance to exceed top human intuitive play. (Though note that within the confines of a single forward pass, the limit on cognition could prevent too much tree search through future game states, which could mean that performance still lags behind systems which are capable of tree search.)

For open-ended tasks like “build a company that will make a lot of money”, I guess that we will for the near future be unable to give enough data and train deep enough to get superhuman performance on this just with finetuning.

6.2.2     Scaffolding for amplification via reflection

Humans are able to benefit from time to reflect. Our slow answers to questions are often better than our snap judgements. But often we don’t actually get the time to reflect, and do act on the basis of our snap judgements.

Since “thinking time” can be very cheap for language models, if they could similarly benefit from extra reflection time, this could help them to boost their task performance significantly above their non-reflective performance. And if their non-reflective performance is approximating human performance, their reflective performance could naturally be superhuman. (Albeit if this were the only mechanism for getting superhuman performance, it might be capped at “what groups of humans going slowly and carefully could do”.)

Scaffolding provides a toolset to help facilitate this reflection. The language models of today already benefit from extra thinking time — they perform better when prompted to think out loud, and scaffolding techniques like running things multiple times and taking a vote can improve performance.
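
As a rough illustration of the “run it multiple times and take a vote” idea, here is a minimal sketch. The names `majority_vote`, `make_noisy_model` and `accuracy` are hypothetical, and the “model” is a random stub that answers correctly 70% of the time, independently on each call. That independence is an assumption made for the sake of the sketch; errors from a real language model are likely to be correlated, so the real gain from voting would probably be smaller.

```python
import random
from collections import Counter
from typing import Callable


def majority_vote(ask: Callable[[str], str], prompt: str, n_samples: int = 5) -> str:
    """Query the model n_samples times and return the most common answer."""
    answers = [ask(prompt) for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)] for the top-voted answer.
    return Counter(answers).most_common(1)[0][0]


def make_noisy_model(correct: str, wrong: str, p_correct: float, seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call: right with probability p_correct."""
    rng = random.Random(seed)
    return lambda prompt: correct if rng.random() < p_correct else wrong


def accuracy(n_samples: int, trials: int = 2000) -> float:
    """Fraction of trials in which the voted answer is the correct one."""
    model = make_noisy_model("42", "41", p_correct=0.7, seed=0)
    hits = sum(
        majority_vote(model, "What is 6 * 7?", n_samples) == "42"
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    print("single sample:", accuracy(1))
    print("vote over 9:  ", accuracy(9))
```

Under these assumptions a single sample is right about 70% of the time by construction, while a vote over nine samples should be right about 90% of the time (the binomial probability of at least five correct out of nine).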

6.3     Recursive improvement and takeoff

An intelligence explosion based on language models would need a mechanism for recursive improvement — something that could repeatedly ratchet towards better performance, where improved performance would help with the next round of improvements.

6.3.1     Reflection-based takeoff

If more thinking time leads to better takes in a relatively unbounded way, this could be a mechanism for takeoff. The key threshold here is not “does performance increase with extra thinking time?” (a bar that language models already clear), but “can performance scale ~arbitrarily far with extra thinking time?” (a bar that humanity as a whole probably crosses, but the language models of today probably don’t).

Even if this bar is crossed, improvement isn’t automatically recursive. But if we know how to use extra compute to produce superhuman performance, we can then use that to construct new data sets to be approximated. These could be used as part of finetuning, or even to build new text corpora, which represent (initially modestly) superhuman levels of intelligence.

This, then, could be iterated. The hope would be that reflection by systems which are approximating smarter answers will be more effective, and lead to yet smarter answers. The system could gradually bootstrap its way to strong superintelligence — essentially continuing the process whereby 21st Century humans are in many ways meaningfully smarter than 11th Century humans.

I say “gradually”, but with large enough amounts of compute this process could potentially play out quickly. Here’s some hacky first-pass analysis:

  • Suppose GPT-4 was trained on a corpus of very roughly a petabyte (the exact size has not been published). To produce that much text out of GPT-4 would, at the prices charged commercially, cost something on the order of $10B
  • Factors affecting the real price to upgrade the training corpus:
    • Probably you can get away with upgrading much less than the whole training data for GPT-4 (lowering the price, ?perhaps by several orders of magnitude?)
    • Maybe wholesale production costs significantly undercut commercial costs (lowering the price by ?maybe an order of magnitude?)
    • Maybe increased global demand for compute raises prices (increasing the price by ?maybe an order of magnitude?)
    • To upgrade the training corpus in repeatable ways, the individual sentences likely have to be not just ones generated by the current generation of model, but ones which are the output of a significant reflection process, which is more expensive to produce than just the final sentence (increasing the price, ?perhaps by a few orders of magnitude?)
  • It’s pretty unclear to me right now how this will shake out, but I guess a price of $100M – $100B for a successive upgrade to the training corpus feels likely
  • It’s also super unclear how much benefit each such “upgrade step” would create; I guess corresponding to something like years or decades of human progress, but this could be way off
    • (In practice I think it might be more continuous than this rather than consisting of discrete upgrade steps, but for the purposes of the first pass analysis it seems reasonable to treat it as discrete)
  • This is enough to look like: once it’s underway and working well, this “slow, boring” way could still be blisteringly fast by standards we’re used to — getting centuries’ worth of progress (on how to think smarter, not on exogenous tech) in a year
  • There may be a period where it’s starting to be useful but hasn’t yet achieved this efficiency, where things are moving at something closer to the pace we’re used to
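
The arithmetic behind this first-pass estimate can be sketched as follows. Every number here is an assumption: the petabyte corpus size is the figure used above, while the 4 bytes per token, the $0.06 per 1,000 generated tokens, and the low/high endpoints chosen for each “?orders of magnitude?” factor are illustrative guesses rather than anything stated in the analysis.

```python
# All inputs are illustrative assumptions, not measured values.
CORPUS_BYTES = 1e15          # "a petabyte", the corpus size used in the text
BYTES_PER_TOKEN = 4          # assumed rough average for English text
PRICE_PER_1K_TOKENS = 0.06   # assumed commercial price per 1,000 output tokens (USD)


def baseline_cost_usd(corpus_bytes=CORPUS_BYTES,
                      bytes_per_token=BYTES_PER_TOKEN,
                      price_per_1k=PRICE_PER_1K_TOKENS):
    """Cost of generating the whole corpus once at commercial prices."""
    tokens = corpus_bytes / bytes_per_token
    return tokens / 1000 * price_per_1k


# (low, high) multipliers: one guessed reading of each factor in the list above.
FACTORS = {
    "upgrade only part of the corpus": (1e-4, 1e-2),
    "wholesale vs commercial cost": (0.1, 1.0),
    "higher global demand for compute": (1.0, 10.0),
    "reflection overhead per sentence": (1e2, 1e3),
}


def adjusted_range(baseline, factors):
    """Multiply the baseline by the low and by the high end of every factor."""
    low = high = baseline
    for lo, hi in factors.values():
        low *= lo
        high *= hi
    return low, high


if __name__ == "__main__":
    base = baseline_cost_usd()
    low, high = adjusted_range(base, FACTORS)
    print(f"baseline: ${base:.2e}")
    print(f"adjusted: ${low:.2e} to ${high:.2e}")
```

With these inputs the baseline comes out at about $15B, consistent with “of the order of $10B”. Multiplying all the low ends and all the high ends together gives a much wider band (roughly $15M to $1.5T) than the $100M – $100B guess, which is what you would expect if the extremes are unlikely to all line up in the same direction.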

Still, overall I think this could be thought of as something like “the slow, boring path to superintelligence”. Perhaps it will be the first one that works. But I think there’s a good chance that some other things will help it to move faster.

6.3.2     Scaffolding-based takeoff

It’s unclear what the performance returns to better scaffolding will look like. At least right now, it seems like nobody has invested that much in building good scaffolding (compared to the investments in building good foundation models), so there might be low-hanging fruit remaining.

How good can scaffolding ever get? One thought is that perhaps a given foundation model has something like a level of “latent potential”, and ideal scaffolding unlocks that but never exceeds it. However, with the right scaffolding one could reimplement an arbitrary GOFAI; while wildly impractical, this is a thought experiment which demonstrates that there is no natural ceiling on capabilities imposed by the foundation model. 

Scaffolding is a language-based construction, so language models could plausibly learn how to contribute to better scaffolding (which can then be experimented with, and could recursively feed into further improvements to scaffolding). We are therefore interested in a question like “what is the returns curve to investment on improving scaffolding?”, which is an empirical question. For some possible shapes of the curve, improvements to scaffolding could precipitate an intelligence explosion, gathering pace faster and faster as successive generations of scaffolding are more effective than the last at further improving the scaffolding. My guess is that the parameters don’t quite shake out that way, but this feels very guesswork-y for such an important parameter.
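
To make the dependence on the shape of the returns curve concrete, here is a deliberately crude toy model (not a claim about how scaffolding actually behaves). It assumes that each generation of scaffolding yields a capability gain equal to a fixed multiple, `feedback`, of the gain from the previous generation. Both the function name and the `feedback` parameter are hypothetical.

```python
from typing import List


def scaffolding_trajectory(first_gain: float, feedback: float, generations: int) -> List[float]:
    """Cumulative capability gain after each generation of scaffolding.

    feedback is the assumed ratio between the gain from one generation
    and the gain from the generation before it.
    """
    total, gain, history = 0.0, first_gain, []
    for _ in range(generations):
        total += gain
        history.append(total)
        gain *= feedback  # the next generation builds on this one's improvement
    return history


if __name__ == "__main__":
    print("feedback 0.5:", scaffolding_trajectory(1.0, 0.5, 10))
    print("feedback 2.0:", scaffolding_trajectory(1.0, 2.0, 10))
```

In this toy model, a feedback ratio below 1 means the gains shrink each round and the total levels off (for `first_gain = 1` and `feedback = 0.5` it never exceeds 2), while a ratio above 1 means each round is bigger than the last and the total runs away. The empirical question in the text corresponds to which side of that threshold real scaffolding falls on, and whether the ratio stays constant at all.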

6.3.3     Finetuning-based takeoff

I’m hazier on the details of how this would play out (and a bit sceptical that it would enable a truly runaway feedback loop), but more sophisticated systems could help to gather the real-world data to make subsequent finetuning efforts more effective.

6.3.4     Mixed takeoffs

Perhaps most likely is that there is no single silver bullet, but takeoff contains elements of all of these processes, and others, blended together in a vortex of increasing speed. For example, as well as improved scaffolding feeding into improved reflection which can help with the next generation of scaffolding, improvements in AI performance could help to accelerate developments in chip fabrication, so that there are greater amounts of compute available to help this process run more quickly.

This should be faster than what we would get out of any single mechanism. The main reason we wouldn’t see such a mixed takeoff is if one of the components is individually so fast that it leaves everything else behind.

One possibility that arises as part of a mixed takeoff is using machine learning to optimize for the most effective scaffolding. I’ll discuss further in a later section (on the bitter lesson).

6.3.5     Systems not built on language models

I’ve been considering recursive improvement for language models. But the general arguments for an intelligence explosion don’t assume anything like the particular form of language models. Whether or not an intelligence explosion based on language models is possible, it’s likely the case that an intelligence explosion based on other forms of AI technology will eventually be possible. (& the argument about things which exceed human level rapidly blowing past human level is more likely to directly apply to such technologies.)

Could this matter? Yes, in two possible worlds:

  • Language models turn out not to effectively scale towards superintelligence (i.e. they might get there in theory, but recursive improvement doesn’t give runaway dynamics)
  • Some other technique which scales faster towards superintelligence overtakes language models in producing the most capable systems
    • This could happen early, or could happen after we already have highly transformative and superhuman language models

7.     Language model agents and transparency

7.1     Where does agency come from?

Suppose we have an agent-like system built out of language models. The foundation models themselves weren’t agent-like. So where could the agency have “come from”?

I think the answer will be one, or a combination, of three possibilities:

  1. The system could be emulating a human or other agent represented in the corpus
    • i.e. it’s implicitly predicting “what would this agent do in this circumstance?”, where the agent and the circumstance have somehow (explicitly or implicitly) been specified
  2. The agency could be selected for (presumably via finetuning)
    • If the developers have selected a system that performs well on a particular task, it is quite plausible that part of the selection pressure has gone towards agency (since this is a generically useful capability)
      • cf. humans and evolution
  3. The agency could be explicitly built in via scaffolding
    • e.g. a prompt gives a language model an explicit goal and asks it to generate plans towards that goal, and then its answers are taken and processed into new prompts to get the plans implemented

I think we should have quite different attitudes towards these, from an AI safety perspective. 

1) seems like mostly a sideshow. While we could get agency from this, unless people are trying hard I don’t think it would tend to find especially competent agents to emulate, and the emulated agent may not have a good handle on what’s going on in the world.

2) seems scary. This is the classic case of mesa-optimization. By default I’d think we should expect not to really understand the goals of agents that have been selected for this way. There may be clever work that could be done to ensure things are safe, but this is the kind of story that makes AI risk seem large and thorny.

3) seems promising. An agent built in this way would come with a massive amount of transparency-by-construction:

  • Supposing that it benefits from time reflecting — then it articulates its thoughts and plans in text in the process of thinking them through and critiquing them. All of these are written in natural human-readable language. 
    •  → We have read-access to all of the system’s conscious thoughts. (We don’t have read access to the process producing those conscious thoughts, but those processes may be much less agent-like)
  • Its goal is specified explicitly, and sub-goals also get specified explicitly (to be passed to other parts of the scaffolding). Then we also have write-access to the system’s goals.

This is probably a vast volume of thoughts to handle, but everything is in a very legible form and we can probably take steps to automate oversight. In general: all the normal reasons people are keen on transparency make it seem like a great idea to try to use architecture which is extremely transparent. (This includes both wanting transparency to facilitate long-term AI safety, and wanting transparency to enable auditing of AI applications in the shorter term.)
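
A minimal sketch of what agency-via-scaffolding with this kind of transparency could look like is below. The `ScaffoldedAgent` class, its prompt wording, and `stub_model` are all hypothetical; `stub_model` stands in for a real language model call so that the example runs on its own. The point is only that the goal is an ordinary editable string and that every prompt and reply passes through a plain-text transcript.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ScaffoldedAgent:
    """Agent whose goal and intermediate thoughts all live in plain text."""
    model: Callable[[str], str]   # hypothetical text-in, text-out model call
    goal: str                     # explicit goal; humans can overwrite it (write-access)
    transcript: List[str] = field(default_factory=list)  # every prompt and reply (read-access)

    def _ask(self, prompt: str) -> str:
        reply = self.model(prompt)
        self.transcript.append(f"PROMPT: {prompt}")
        self.transcript.append(f"REPLY: {reply}")
        return reply

    def run(self, max_steps: int = 3) -> List[str]:
        # Plan, then critique and carry out each step; all of it is logged.
        plan = self._ask(f"Goal: {self.goal}\nList the steps needed, one per line.")
        results = []
        for step in plan.splitlines()[:max_steps]:
            critique = self._ask(
                f"Goal: {self.goal}\nProposed step: {step}\nCritique this step."
            )
            results.append(self._ask(f"Carry out: {step}\nTaking into account: {critique}"))
        return results


def stub_model(prompt: str) -> str:
    """Canned replies standing in for a real language model."""
    if "List the steps" in prompt:
        return "draft outline\nwrite summary"
    if "Critique" in prompt:
        return "looks fine"
    return "done"


if __name__ == "__main__":
    agent = ScaffoldedAgent(model=stub_model, goal="summarise a report")
    agent.run()
    print("\n".join(agent.transcript))
```

An overseer (human or automated) can read `agent.transcript` to see every plan and critique, and can change `agent.goal` before the next run. What the sketch does not expose is whatever happens inside `model` itself, which matches the caveat above about not having read access to the process producing the thoughts.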

In practice things may often use a combination of these. And a combination could be concerning: if we have top-level agency coming from 2), then we’re less able to trust the transparency from 3), since the system might have incentives to misrepresent its own thoughts.

7.2     Strategy: avoid selection pressure for agency

A lot of putative safety techniques are around assuming that we have something potentially dangerous and catching it. I think these are well worth investing in (defence in depth seems valuable), but as a complementary strategy I’m pretty attracted to the idea that we should build systems where we have reason to believe that they shouldn’t have anything dangerous going on.

In the case of language model agents, this means: I think we should avoid any intensive search/selection processes towards high-level effectiveness of agents towards particular tasks. So far as possible we should aim for high-level agency to enter explicitly via scaffolding, and not via anything else.

Tentatively, I think this would mean:

  • It’s OK (and probably good) to use finetuning to encourage systems to produce helpful truthful answers in local ways, but not OK to use finetuning to aim at systems which are effective at doing holistic external things
  • We should be wary of accidentally selecting for agency that’s (opaquely) baked into our scaffolding, if we’ve chosen our scaffolding via an intensive search/selection process
    • Seems OK to do some automated testing and selection of different scaffolding architectures according to how they perform on a wide variety of tasks (since it’s probably hard for this to select for agency towards a particular end)
    • Seems concerning to do a lot of automated testing and selection of different scaffolding architectures on a fixed task, or fixed cluster of tasks (since it might be effective for it to learn how to be agentic towards those tasks in a way that is not explicitly represented in the messages it’s passing to itself)

Of course there’s a whole research agenda here. But I think that the basic point is straightforward and might be quite important to have broadly understood. I think this is somewhere where humanity by default makes systems which are selected to have agency (because we just try everything and see what works), but because the alternative of introducing agency via scaffolding is a pretty good substitute, it might be within political reach to build norms which exclude the problematic type of selection.

7.3     The bitter lesson?

Richard Sutton’s “bitter lesson” from 70 years of AI research is that building knowledge into AI agents may help in the short term but in the longer term is consistently overtaken by general-purpose methods that make use of more computation. This raises a couple of concerns about maintaining transparency:

  1. Even if the most effective agents combine scaffolding and finetuning, the scaffolding might stop being human-comprehensible as compute scales
  2. Even if the internal communication between parts of a scaffolded agent is initially in natural human language, as things are optimized they may find more efficient ways to communicate

Essentially, one might think that even if early scaffolded agents are more transparent, these will be obsoleted by more sophisticated AI which does end-to-end training for effectiveness over the entire system (including the scaffolding).

I take this concern to have some bite. I do think that a scaffolded agent which was purely optimized would be unlikely to have transparent internals. Nonetheless there are a few reasons why I don’t think the bitter lesson means that hopes for transparency are necessarily doomed:

  • The transparent versions of things can also scale to make full use of larger amounts of compute
    • e.g. rather than searching through all possible forms of scaffolding, there might be a search through essentially-transparent scaffolding setups which are proposed by language models
      • This might actually be more efficient — leveraging implicit knowledge contained in language to give an inductive bias about the types of structure that will plausibly be effective could restrict the search space and allow it to find highly effective complex structures faster
    • There is an analogue here with biological evolution, which can sometimes “lock in” foundational choices (e.g. DNA) even if they’re not optimal, in order to iterate faster on levels which build on top of them
  • If the early systems of this type are transparent, and transparency is important to us, researchers could specify that transparency is maintained while searching for more effective structures. Even if there is an efficiency hit from doing this, transparency may be seen as valuable enough that people are willing to do this
  • If there is pressure towards developing more efficient language for machine–machine communication (e.g. as Eric Drexler has written about), we could maintain translation tools to keep meaningful transparency
  • In the longer term superintelligent systems might need to think thoughts which are not human-comprehensible, but maintaining “transparency” could still allow for auditing by slightly-less-powerful systems, which in turn are audited by slightly-less-powerful systems, and so on until reaching human-comprehensibility, thereby giving humans some degree of meaningful oversight over the entire structure

8.     Risks & strategies

8.1     A rough taxonomy of risks

There are several different points which might be dangerous. Here’s one way of slicing things up:

  1. Early language model agent misalignment risk
    • Early systems which are over the autonomous replication threshold, if there isn’t a good regime in place to handle them, could get into the world and then hang around and do destructive things, e.g. —
      • Preparing things which are destructive to discourse as political moves to try to ensure that there aren’t concerted efforts to find and close them down at some later point
      • “AI-run mafia”: bribing and extorting people in ways that build up larger power bases
    • At the first points where this becomes a risk it isn’t very credible that they would be able to outstrip the rest of civilization at the AI improvement game, nor that they would be able to directly cause a global catastrophe; however they might still create an exacerbating risk factor
    • As language model agents become more competent there might be a moment where we haven’t learned how to responsibly handle such systems, and a powerful one gets free in a way that does more directly threaten a global catastrophe
  2. Many opaque language model agents 
    • If people automating executives do so in ways that aren’t naturally transparent about their goals (e.g. because of heavy selection for strong performance), we may end up with a lot of systems in positions of some influence which are at least subtly misaligned, and these could end up with a majority of power in the world
    • This could be bad because:
      • 2A: The future might be determined by processes which are less in touch with human values
      • 2B: There is the possibility with smarter systems of a coordinated treacherous turn
  3. Wrong research automated first
    • At the point where we’re starting to automate most research, if the foundations for the automation of AI research are such that it’s much easier to automate capabilities research than to automate safe capabilities research, we might see a runaway process where the cutting edge in the world doesn’t have safety as a key embodied value, and then this ends up producing some extremely powerful-but-dangerous systems
      • This could happen either because:
        • “Automating capabilities research safely” is just much harder, and we fail to work out how to do that before the key time; or
        • There’s a moratorium or significant attempt to slow down AI development/deployment, which is not globally effective, which leads to the open-source / not carefully/ethically developed systems becoming cutting edge (because the more white hat stuff has really slowed down)
  4. Vulnerable world
    • If the fruits of research are not tightly held, and if the underlying technical landscape lies a certain way, we might end up in a vulnerable-world-hypothesis type scenario, where there is some broadly available destructive technology
    • Absent strong coordination to avoid its use, this could lead to a global catastrophe
  5. Coercive singleton
    • Automation could lead to strong centralization of power (e.g. if the fruits of automated research are tightly held by a single actor). If one actor gains enough power, they could expropriate control from the rest of the world
      • This is concerning whether that actor is a human, AI system, or institution built out of humans and/or AIs
      • Some of the strategies for avoiding this will vary with the type of actor that is being guarded against; others are cross-cutting
  6. Misalignment from successor paradigms
    • If language-based AI becomes uncompetitive at some point, misalignment from the successor systems could be a serious risk
    • This is especially concerning since the possibility of making language model agents transparent-by-construction seems idiosyncratic to this technology; we might expect transparency to be much harder with the successor systems
  7. Butlerian Jihad
    • No catastrophe caused by AI, but in a knee-jerk reaction of fear-of-AI-systems, humanity locks in some things which cut us off from the most valuable futures
    • There are versions of this which involve permanently locking things in, and other versions which don’t necessarily have permanence, but leave things in the hands of humans long enough that we mess it up some other way
  8. (Flawed success)
    • No catastrophe caused by AI, but we somehow fail to build good futures anyway
    • Perhaps because we’re extrapolating values in a bad way, choosing an unhelpful starting point for that process, or choosing to use AI to lock in some properties that would better not have been locked in

I could offer views about the relative degree of existential risk posed by these, or the degree to which we should be prioritizing them (where these come apart because we may have disproportionate leverage over some). But I’m really not very confident in my relative assessment, and I’m much more confident in a meta-level take, so I’ll restrict myself to that:

I think that all of these risks (and it’s quite possible I’m missing some) are potentially grave. I wouldn’t currently feel comfortable assigning less than 1% risk of existential catastrophe to any of them — easily enough that if correct it would justify massive attention to address.

I also think that the actions people should take to understand and mitigate the various risks are likely to differ significantly. I therefore think that it should be a significant priority to better characterize the various risks, to assess how large they are in absolute terms, and to produce plans which are targeted specifically at reducing each risk. This can then feed into better prioritization of actions across the space; it’s likely that we should have a portfolio which includes work well-targeted at a number of these different risks.

8.2     Example strategies for mitigating the different risks

Here are some brainstormed thoughts on strategies for the various things here, to start things off. Take them or leave them.

  1. Early language model agent misalignment risk
    • Monitoring model capabilities
    • Conventions against deploying certain types of agents
    • Restriction of model access by major AI labs to make it harder for third parties to create such agents
    • Defensive measures which make it hard for “escaped” agents to increase their power
  2. Many opaque language model agents 
    • Developing techniques for making highly effective and highly transparent language model agents
    • Conventions against creating agents via finetuning, and otherwise restricting the amount of optimization power that can be exercised at the top level of creating agents
    • Transparency research, to make default-opaque agents less opaque
    • Work to instill virtuous behaviour and "good culture" (e.g. high levels of honesty) in language model agents, even if they're not fully transparent
  3. Wrong research automated first
    • Differential development of research that might be important to automate early
      • Perhaps via centralized Manhattan-Project–style work
    • Conventions restricting the automation of potentially-scary branches of research
    • General strategies for handling differential technological development
  4. Vulnerable world
    • Political centralization of automated research
    • AI-mediated treaties for automated arms control
  5. Coercive singleton
    • Avoiding too much centralized power by any actor
    • Automation of bargaining/negotiation/cooperation, to facilitate reaching cooperative singletons first
  6. Misalignment from successor paradigms
    • Continuation of traditional AI alignment research, and laying the groundwork for its automation
    • Political coordination to restrict research into such paradigms except in extremely careful ways
  7. Butlerian Jihad
    • Working out actually-good paths forwards vis-a-vis humanity’s relationship with AI
    • Convening discussions with thought leaders to minimize polarization of issues
    • Careful communication to build coalitions for Paretotopian futures
  8. (Flawed success)
    • Research into key things that will be needed for creating good futures, and avoiding bad ones
    • Starting to build coalitions and support for any key steps that are anticipated

8.3     Thoughts on tactical implications

I’m not at all confident what people who are concerned about navigating AI well should be doing. But I feel that the current portfolio is over-indexed on work which treats “transformative AI” as a black box and tries to plan around that. I think that we can and should be peering inside that box.

I’d like to better understand the plausibility of the kind of technological trajectory I’m outlining. I’d like to develop a better sense of how the different risks relate to this. And I’d like to see some plans which step through how we might successfully navigate the different phases of this technological development. I think that this is a kind of zoomed-in prioritization which could help us to keep our eyes on the most important balls, and which we haven’t been doing a great deal of.


This is a fantastic post. Big upvote.

I couldn't agree more with your opening and ending thesis, which you put ever so gently:

the current portfolio is over-indexed on work which treats “transformative AI” as a black box

It seems obvious to me that trying to figure out alignment without talking about AGI designs is going to be highly confusing. It also seems likely to stop short of a decent estimate of the difficulty. It's hard to judge whether a plan is likely to fail when there's no actual plan to judge.  And it seems like any actual plan for alignment would reference a way AGI might use knowledge and make decisions.


WRT the language model agent route, you've probably seen my posts, which are broadly in agreement with your take:

Capabilities and alignment of LLM cognitive architectures

Internal independent review for language model agent alignment

The second focuses more on the range of alignment techniques applicable to LMAs/LMCAs. I wind up rather optimistic, particularly when the target of alignment is corrigibility or DWIM-and-check.

It seems like even if LMAs achieve AGI, they might progress slowly beyond the human-level source of the LLM training. That could be a really good thing. I want to think about this more. 

I'm unsure how much to publish on possible routes. Right now it seems to me that advancing progress on LMAs is actually a good thing, since they're more transparent and directable than any other AGI approach I can think of. But I don't trust my own judgment when there's been so little discussion from the hardcore alignment-is-hard crowd.

It boggles my mind that posts like this, forecasting real routes to AGI and alignment, don't get more attention and discussion. What exactly are people hoping for as alignment solutions if not work like this?


Again, great post, keep it up.

Good post!

In their most straightforward form (“foundation models”), language models are a technology which naturally scales to something in the vicinity of human-level (because it’s about emulating human outputs), not one that naturally shoots way past human-level performance

You address this to some extent later on in the post, but I think it's worth emphasizing the extent to which this specifically holds in the context of language models trained on human outputs. If you take a transformer with the same architecture but train it on a bunch of tokenized output streams of a specific model of weather station, it will learn to predict the next token of the output stream of weather stations, at a level of accuracy that does not particularly have to do with how good humans are at that task.

And in fact for tasks like "produce plausible continuations of weather sensor data, or apache access logs, or stack traces, or nucleotide sequences" the performance of LLMs does not particularly resemble the performance of humans on those tasks.

I’m not at all confident what people who are concerned about navigating AI well should be doing. But I feel that the current portfolio is over-indexed on work which treats “transformative AI” as a black box and tries to plan around that. I think that we can and should be peering inside that box.

I’d like to better understand the plausibility of the kind of technological trajectory I’m outlining. I’d like to develop a better sense of how the different risks relate to this. And I’d like to see some plans which step through how we might successfully navigate the different phases of this technological development. I think that this is a kind of zoomed-in prioritization which could help us to keep our eyes on the most important balls, and which we haven’t been doing a great deal of.

Agree. I think there are pretty strong reasons to believe that with a concerted effort, we can very likely (> 90% probability) build safe scaffolded LM agents capable of automating ~all human-level alignment research while also being incapable of doing non-trivial consequentialist reasoning in a single forward pass. I'm also (still) looking for collaborators for this related research agenda on evaluating the promise of automated alignment research.

In their most straightforward form (“foundation models”), language models are a technology which naturally scales to something in the vicinity of human-level (because it’s about emulating human outputs), not one that naturally shoots way past human-level performance

For a more detailed analysis of how this problem could be overcome but why doing so is unlikely to be a fast process, see my post LLMs May Find it Hard to FOOM. (Later parts of your post have some overlap with that, but there are some specifics such as conditioning and extrapolation that you don't discuss, so readers will find some more useful content there.)

I think there are two really important applications, which have the potential to radically reshape the world:

  • Research
    • The ability to develop and test out new ideas, adding to the body of knowledge we have accumulated
    • Automating this would be a massive deal for the usual reasons about feeding back into growth rates, facilitating something like a singularity
      • In particular the automation of further AI development is likely to be important
    • There are many types of possible research, and automation may look quite different for e.g. empirical medical research vs fundamental physics vs political philosophy
      • The sequence in which we get the ability to automate different types of research could be pretty important for determining what trajectory the world is on
  • Executive capacity
    • The ability to look at the world, form views about how it should be different, and form and enact plans to make it different
    • (People sometimes use “agency” to describe a property in this vicinity)
    • This is the central thing that leads to new things getting done in the world. If this were fully automated we might have large fully autonomous companies building more and more complex things towards effective purposes.
    • This is also the thing which, (if/)when automated, creates concerns about AI takeover risk.


I agree. I tentatively think (and have been arguing in private for a while) that these are 'basically the same thing'. They're both ultimately about

  • forming good predictions on the basis of existing models
  • efficiently choosing 'experiments' to navigate around uncertainties
    • (and thereby improve models!)
  • using resources (inc. knowledge) to acquire more resources

They differ (just as research disciplines differ from other disciplines, and executing in one domain differs from other domains) in the specifics, especially on what existing models are useful and the 'research taste' required to generate experiment ideas and estimate value-of-information. But the high level loop is kinda the same.

Unclear to me what these are bottlenecked by, but I think the latent 'research taste' may be basically it (potentially explains why some orgs are far more effective than others, why talented humans take a while to transfer between domains, why mentorship is so valuable, why the scientific revolution took so long to get started...?)

In particular, the 'big two' are both characterised by driving beyond the frontier of the well-understood, which means that by necessity they're about efficiently and deliberately setting up informative/serendipitous scenarios to get novel informative data. When you're navigating beyond the well-understood, you have to bottom out your plans with heuristic guesses about VOI, and you have to make plans which (at least sometimes) have good VOI. Those have to ground out somewhere, and that's the 'research taste' at the system-1-ish level.

I think it’s most likely that for a while centaurs will significantly outperform fully automated systems

Agree, and a lot of my justification comes from this feeling that 'research taste' is quite latent, somewhat expensive to transfer, and a bottleneck for the big 2.

Very high-effort, comprehensive post. Any interest in making some of your predictions into markets on Manifold or some other prediction market website? Might help get some quantifications. 

At tasks like “give a winning chess move”, we can generate high quality synthetic data so that it’s likely that we can finetune model performance to exceed top human intuitive play.

With some more effort, this also applies to "prove this mathematical conjecture" (using automated proof checkers like Lean) and (with suitably large and well-designed automated test suites) also to "write code to solve this problem". These seem like areas broad enough that scaling them up to far superhuman levels, as well as being inherently useful, might also carry over towards other tasks requiring rational and logical thinking. Also, this would probably be an ideal forum in which to work on solutions to the 'drunkenness' issue.
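To make the generate-then-verify idea concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from the post: `propose` is a hypothetical stand-in for sampling from a language model, `CANDIDATES` is a toy pool of "model outputs", and `TESTS` plays the role of the automated test suite. The point is only that an automatic checker lets you keep verified samples as finetuning data without any human labelling.

```python
import random
from typing import List

# Toy pool standing in for model samples (assumption: a real system would
# sample these from a language model instead).
CANDIDATES = [
    "def solve(x):\n    return x if x >= 0 else -x\n",  # correct
    "def solve(x):\n    return abs(x)\n",               # correct
    "def solve(x):\n    return x\n",                    # wrong for negatives
    "def solve(x):\n    return -x\n",                   # wrong for positives
]

# The automated verifier: (input, expected output) pairs for "absolute value".
TESTS = [(3, 3), (-2, 2), (0, 0)]


def propose(rng: random.Random) -> str:
    """Hypothetical model call: return one candidate solution as source code."""
    return rng.choice(CANDIDATES)


def passes_tests(source: str) -> bool:
    """Run a candidate against the test suite.

    exec() on untrusted code is unsafe; a real pipeline would sandbox this.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        return all(namespace["solve"](x) == want for x, want in TESTS)
    except Exception:
        return False


def build_dataset(n_samples: int, seed: int = 0) -> List[str]:
    """Sample candidates and keep only those the verifier accepts."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        candidate = propose(rng)
        if passes_tests(candidate):
            kept.append(candidate)
    return kept
```

Calling `build_dataset(50)` returns only candidates that pass every test. The same shape would apply with a proof checker in place of `passes_tests`, though the quality of the resulting data is only as good as the verifier.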

1) seems like mostly a sideshow — while we could get agency from this, unless people are trying hard I don’t think it would tend to find especially competent agents to emulate, and may not have a good handle on what’s going on in the world.

I'm very puzzled by this opinion. If we can reduce the 'drunkenness' issue, this type of agency scales to at least the competence level of the most competent humans (or indeed, fictional characters) in existence, and probably at least some distance beyond by extrapolation (and can be run cheaply, faster than real time). These agents are not safe: humans are not fully aligned to human values, power corrupts, and Joseph Stalin was not well aligned with the needs of the citizenry of Russia. This seems like plenty to be concerned about, rather than a sideshow. Now, the ways in which they're not aligned are at least ones we have a good intuitive and practical understanding of, and some partial solutions for controlling (things like love, guilt, salaries, and law enforcement).

I’m hazier on the details of how this would play out (and a bit sceptical that it would enable a truly runaway feedback loop), but more sophisticated systems could help to gather the real-world data to make subsequent finetuning efforts more effective.

On the contrary, I think proactive gathering of data is very plausibly the bottleneck, and (smarts) -> (better data gathering) -> (more smarts) is high on my list of candidates for the critical feedback loop.

In a world where the 'big two' (R&D and executive capacity) are characterised by driving beyond the frontier of the well-understood it's all about data gathering and sample-efficient incorporation of the data.

FWIW I don't think vanilla 'fine tuning' necessarily achieves this, but coupled with retrieval augmented generation and similar scaffolding, incorporation of new data becomes more fluent.
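As a rough illustration of what "incorporation of new data" via retrieval looks like, here is a minimal sketch in Python. It is not a real RAG stack: `NoteStore`, `build_prompt`, and the bag-of-words cosine similarity are all assumptions chosen for brevity, where a real system would typically use learned embeddings and a vector index. What it shows is that newly gathered observations can reach the model through the prompt, with no change to the weights.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class NoteStore:
    """Hypothetical store of observations gathered by the system."""

    def __init__(self) -> None:
        self.notes: List[str] = []

    def add(self, note: str) -> None:
        self.notes.append(note)

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        """Return the k notes most similar to the query."""
        q = Counter(tokenize(query))
        ranked = sorted(
            self.notes,
            key=lambda n: cosine(q, Counter(tokenize(n))),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(store: NoteStore, question: str, k: int = 2) -> str:
    """Assemble a prompt that puts retrieved notes in front of the question."""
    context = "\n".join(f"- {n}" for n in store.retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

A note added with `store.add(...)` is available to the very next `build_prompt` call, which is the sense in which retrieval makes data incorporation more fluent than a finetuning run.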

Notable techniques for getting value out of language models that are not mentioned:

Also, I would say, retrieval-augmented generation (RAG) is not just a mundane way to industrialise language models, but an important concept whose properties should be studied separately from scaffolding or fine-tuning or other techniques that I listed in the comment above.

Thanks. At a first look at what you're saying I'm understanding these to be subcategories of using finetuning or scaffolding (in the case of leveraging semantic knowledge graphs) in order to get useful tools. But I don't understand the sense in which you think finetuning in this context has completely different properties. Do you mean different properties from the point where I discuss agency entering via finetuning? If so I agree.

(Apologies for not having thought this through in greater depth.)

I think you tied yourself too much to the strict binary classification that you invented (finetuning/scaffolding). You overgeneralise and your classification blocks the truth more than clarifies things.

All the different things that can be done by LLMs: tool use, scaffolded reasoning aka LM agents, RAG, fine-tuning, semantic knowledge graph mining, reasoning with semantic knowledge graph, finetuning for following "virtue" (persona, character, role, style, etc.), finetuning for model checking, finetuning for heuristics for theorem proving, finetuning for generating causal models, (what else?), just don't easily fit into two simple categories whose properties are consistent within each category.

But I don't understand the sense in which you think finetuning in this context has completely different properties.

In the summary (note: I actually didn't read the rest of the post, I've read only the summary), you write something that implies that finetuning is obscure or un-interpretable:

From a safety perspective, language model agents whose agency comes from scaffolding look greatly superior than ones whose agency comes from finetuning

  • Because you can get an extremely high degree of transparency by construction

But this totally doesn't apply to these other variants of finetuning that I mentioned. If the LLM is a heuristic engine that generates mathematical proofs which are later verified with Lean, it just stops making any sense to discuss how interpretable or transparent these theorem-proving or model-checking LLM-based heuristic engines are.

Strong upvote. We are definitely not talking enough about what Scaffolded Language Model Agents mean for AI alignment. They are the light of hope: interpretable-by-design systems with tractable alignment and slow-takeoff potential.

One possibility that arises as part of a mixed takeoff is using machine learning to optimize for the most effective scaffolding.

This should be forbidden. Turning scaffolding that is explicitly written in code into another black box will not only greatly damage interpretability but also pose huge risks of accidentally creating a sentient entity without noticing it. Scaffolding for LMAs serves a very similar role to consciousness for humans, so we should be very careful in this regard.
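For readers who have not seen one, this is roughly what "scaffolding explicitly written in code" means, as a minimal Python sketch. The names (`ScaffoldedAgent`, `Step`, `stub_model`) and the plan/execute/summarise structure are assumptions made for illustration, not a description of any particular agent framework. The control flow lives in ordinary readable code, and every prompt and output is recorded in a transcript that a human can inspect.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    role: str    # which stage of the scaffold issued the prompt
    prompt: str  # exact text sent to the model
    output: str  # exact text the model returned


@dataclass
class ScaffoldedAgent:
    # Hypothetical text-in, text-out model call; any callable works here.
    model: Callable[[str], str]
    transcript: List[Step] = field(default_factory=list)

    def _call(self, role: str, prompt: str) -> str:
        """Call the model and log the exchange so nothing happens off the record."""
        output = self.model(prompt)
        self.transcript.append(Step(role, prompt, output))
        return output

    def run(self, task: str) -> str:
        """Plan, execute each planned step, then summarise the results."""
        plan = self._call("plan", f"List the steps needed to: {task}")
        results = []
        for line in plan.splitlines():
            if line.strip():
                results.append(
                    self._call("execute", f"Carry out this step: {line.strip()}")
                )
        return self._call("summarise", "Combine these results:\n" + "\n".join(results))


def stub_model(prompt: str) -> str:
    """Deterministic stand-in for a language model, used only for the demo."""
    if prompt.startswith("List the steps"):
        return "gather data\nwrite report"
    if prompt.startswith("Carry out"):
        return "done: " + prompt.split(": ", 1)[1]
    return "summary of " + str(prompt.count("done:")) + " results"
```

After `agent = ScaffoldedAgent(stub_model)` and `agent.run("audit the logs")`, `agent.transcript` holds four steps in order (plan, two executions, summary). Replacing the body of `run` with a learned policy is the move the comment above objects to, since the readable control flow is what would be lost.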
