Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The core argument: language models are more transparent and less prone to developing agency and superintelligence

I argue that compared to alternative approaches such as open-ended reinforcement learning, the recent paradigm of pursuing human-level AGI with language models has the potential to be relatively safe. There are three main reasons why I believe LM-based AGI systems could be safe:

  1. They operate on text that is intelligible to humans, which makes them relatively interpretable and easier to monitor.
  2. They are subject to weaker pressure to surpass human-level capabilities than systems trained on more open-ended tasks, such as those tackled with reinforcement learning.
  3. Since language models are trained as predictors, there is weaker pressure for them to develop agentic behavior.

I acknowledge that these arguments have been criticized. I will try to defend these statements, delving into the nuances and explaining how I envision relatively safe LM-based AGI systems. I want to note upfront that although I believe this paradigm is safer than other alternatives I've come across, I still think it poses significant dangers, so I’m not suggesting we should all just chill. It's also difficult to reason about these sorts of things in the abstract, and there's a good chance I may be overlooking critical considerations.

Human-level AGI might be reachable with existing language models

There are indications that GPT-4 could be considered an early form of AGI. But even though it surpasses human level on many standardized tests, as a general-purpose AGI it's not yet at human level, at least in its vanilla form. For example, it's still not very good at separating fact from fiction, and I wouldn't give it full access to my email and social media accounts and ask it to throw a birthday party for me.

To address these limitations, there is an ongoing effort to develop more capable and agentic AI systems based on language models through chained operations such as chain-of-thought reasoning. Auto-GPT is a very early example of this approach, and I expect to see much better LM-based AI systems very soon. To understand this paradigm, we can think of each GPT-4 prompt completion as an atomic operation that can be chained in sophisticated ways to create deeper and more intelligent thought processes. This is analogous to how humans engage in internal conversations to reach better decisions, instead of blurting out the first thought that comes to mind. Like humans, language models can deliver better results if allowed to engage in internal dialogues and to access external tools and sources of data. I can't think of any cognitive activity that a human can perform instantaneously (in a "single cognitive step") that GPT-4 can't, which suggests it might just be a matter of letting GPT-4 talk with itself for long enough before it approaches human level. An important property of these internal dialogues is that they are produced in plain human language.
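
To make this concrete, here is a minimal sketch of what chaining atomic LM operations into an internal dialogue could look like. The `complete` function is a hypothetical placeholder for any text-completion API (not a real library call), and the fixed number of steps is purely illustrative:

```python
def complete(prompt: str) -> str:
    """Placeholder for a call to a language model's completion API."""
    raise NotImplementedError("plug in an actual LM completion call here")


def internal_dialogue(question: str, steps: int = 5) -> str:
    """Chain atomic completions into a visible, plain-text thought process."""
    transcript = f"Question: {question}\n"
    for i in range(steps):
        # Each call is one "atomic operation"; its output is appended to the
        # transcript, so every intermediate thought remains human-readable.
        thought = complete(transcript + f"Thought {i + 1}:")
        transcript += f"Thought {i + 1}: {thought}\n"
    return complete(transcript + "Final answer:")
```

The key point is that the entire thought process lives in `transcript`, an ordinary piece of text that a human (or another model) can read.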

It remains to be seen how much progress in AI capabilities will come from (1) devising effective chaining schemes versus (2) improving the underlying language models and making each atomic operation more capable. Given the ease of chaining existing models compared to the difficulty of training state-of-the-art language models, there's a good chance that (1) will progress faster than (2). While very few companies have the resources to train more powerful language models, virtually anyone can experiment with chain-of-thought prompting, and startups and commercial products in this area are already emerging. I believe there's a chance that AGI systems close to human level, in both the breadth of tasks they can solve and their robustness, will emerge even with GPT-4-level language models.

Compositional AI systems are easier to align than end-to-end deep-learning models

If the scenario I just described plays out and progress in AI capabilities is driven more by how atomic LM operations are chained than by the power of each atomic operation, then this has important consequences for AI alignment.

First, while each atomic LM operation is quite opaque and involves giant inscrutable matrices of floating-point numbers, chains of internal LM dialogues in English are much easier to follow and understand. It's not uncommon for emergent high-level systems to be more intelligible than their underlying atomic parts.

Second, if progress in AI is driven more by sophisticated chaining of text operations than by improvements to the underlying deep learning models, I believe this significantly reduces the risk of creating alien, superintelligent systems. While AI systems may surpass human capabilities in speed and knowledge, their underlying thinking patterns are likely to remain familiar (even if they are built from language models whose giant inscrutable matrices operate internally in a very non-human way).

Third, AI misalignment risks are easier to conceive of with AI agents that operate in the world and try to achieve goals than with machine learning models that just make predictions. While language models can give rise to agentic AI systems through the chaining paradigm, the internal thinking and intermediate goals of such systems are designed to be transparent. If an LM-based AGI system starts making unintended, dangerous plans, like accumulating resources or hiding its true goals, these thoughts should be recorded in the system's internal dialogues.

It's theoretically possible for an LM-based system to obscure its true intentions and develop a secret language that appears aligned with humans but is actually malicious. However, this thought too must occur within the system's chain of thought. The only way for it to be truly concealed is if it occurs within a single atomic operation, somewhere within the 200 or whatever number of layers GPT-4 has, which is a much higher bar. Additionally, the language model must come up with innocent-looking text that encodes a message to start the betrayal and trust that its future self (and potentially other language models monitoring it) will go along with the scheme. Since language models are trained as predictors, not agents, it's not clear that such agentic behavior could arise spontaneously.

What exactly do I mean by agency and why do I think language models are unlikely to have it?

Let’s consider a question-answering AI. Under what conditions should it be seen as an agent? I believe it depends on what this AI does internally to answer questions. If it comes up with a plan for how to produce a good answer (e.g. through Google searches) and then executes this plan, I’d say that this AI shows agency. If, on the other hand, it just makes some internal (very complex) computations and spits out an answer, then I’d say it’s not agentic. Even if it has a very good internal model of the external world, to the point that it could realistically simulate humans and other agents or passionately argue that it wants things and describe plans in detail, I’d still argue that the AI itself is not agentic, as long as it doesn’t actually plan to do anything. It’s the difference between knowing what it means to care and actually caring.
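
As a toy illustration of this distinction, compare a system that answers with a single completion to one that plans and then executes actions. Both `complete` and `run_tool` are hypothetical placeholders rather than real APIs, and the planning loop is deliberately simplistic:

```python
# Illustrative sketch only; the helper functions are stubs, not real APIs.
def complete(prompt: str) -> str:
    raise NotImplementedError("LM completion call goes here")


def run_tool(action: str) -> str:
    raise NotImplementedError("tool execution (web search, email, ...) goes here")


def non_agentic_answer(question: str) -> str:
    # A single internal computation: no plan is formed, no action is taken
    # in the world. Under the definition above, this is not agentic.
    return complete(question)


def agentic_answer(question: str) -> str:
    # Forms a plan, then executes it step by step with external tools.
    # This is the kind of behavior the post counts as agency.
    plan = complete(f"List the steps needed to answer: {question}")
    observations = ""
    for step in plan.splitlines():
        observations += run_tool(step) + "\n"
    return complete(f"Question: {question}\nFindings:\n{observations}\nAnswer:")
```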

Given how language models are trained, I don’t see anything that would push them to develop agency. In order for a capability to arise in a model, this capability has to be rewarded by the training objective. But given a language model A that just knows how to spit out satisfactory answers and a language model B that also has the ability to come up with exactly the right text that would allow its future self to execute a plan that a human wouldn’t understand, there’s nothing in the training objective of language models that favors B over A.

However, as people are starting to experiment with LM-based AI agents like Auto-GPT, I expect we'll soon find out that these AI systems actually perform better when the underlying language models are trained with the explicit goal of communicating well with their future selves. I worry that this will reward agentic behavior, with consequences we can't really predict. I therefore think there's room to advocate against doing that and for sticking with AI systems that rely on pure language models optimized only with respect to their direct outputs.

Why do I expect weaker pressure to reach superintelligence with language modeling compared to reinforcement learning?

Many in the AI alignment community have pointed out that there is no reason for language models to be capped at human-level intelligence. While I agree that language models could theoretically exceed human level, I expect their optimization process to exert much weaker pressure to reach superhuman levels compared to alternative training paradigms such as reinforcement learning.

A useful example is AIs that learn to play games such as chess and Go. Models trained to imitate human games (by predicting human moves) will reach lower capabilities than models trained to simply maximize the game's objective through reinforcement learning (like AlphaZero). Likewise, predicting human thoughts captured in text pushes language models towards general intelligence, but letting them learn how to solve problems on their own (not bound by human experience) would in principle allow them to reach much greater levels of intelligence. Even if language models eventually surpass human level, all of their capabilities will still be anchored in human capabilities. If no human has ever known some fact about the world and it has no bearing on how humans think, then a language model has no reason whatsoever to learn it. Reinforcement learning agents, on the other hand, are not anchored in any way by human capabilities. For AlphaZero, human level in chess or Go means absolutely nothing.
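
One way to make the anchoring point precise (this formalization is my own, not something from the post's sources) is to compare the two training objectives. Imitation learning maximizes the likelihood of human moves, while reinforcement learning maximizes the game's own return, with no reference to human play:

$$\max_\theta \ \mathbb{E}_{(s,\,a_{\text{human}})}\left[\log p_\theta(a_{\text{human}} \mid s)\right] \qquad \text{versus} \qquad \max_\theta \ \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$

The first objective is optimized by reproducing the distribution of human moves, so human capability is baked into its optimum; the second objective never mentions human behavior at all.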

How are the human-like thinking patterns of LM-based AI systems going to help us monitor them?

LM-based AI systems can be seen as a form of factored cognition, namely the automation of human thinking, which seems easier to understand and align than models trained end to end to maximize an objective. While atomic language models imitate human thinking only externally (and remain mostly inscrutable internally), compositional LM-based AI systems also simulate human cognition internally.

There are many ways to monitor the chain of thought (CoT) of such AI systems (and I'd love to see more research on this). Here are some trivial ideas:

  • We could take another language model, roughly as capable as the one producing the CoT (but trained independently), and ask it to report concerning thoughts in the CoT (see the sketch after this list).
  • We could sample some parts of the CoT and review them manually.
  • A specific red flag we might look for is logical gaps in the CoT (i.e. an idea that doesn't seem to follow logically from the preceding steps), which would suggest that too much hidden work happened within a single atomic operation. We should also expect compositional AI systems to perform better on difficult tasks than atomic LM operations do.
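
Here is a minimal sketch of the first idea above: an independently trained monitoring model reads each CoT step and flags concerning content. `monitor_complete` is a hypothetical placeholder for the second model's completion call, and the prompt wording and FLAG/OK convention are my own assumptions:

```python
# Hypothetical sketch: the monitoring prompt and FLAG/OK convention
# are illustrative, not a tested recipe.
def monitor_complete(prompt: str) -> str:
    raise NotImplementedError("plug in the independently trained monitoring LM here")


def review_cot(cot_steps: list[str]) -> list[str]:
    """Return the CoT steps that the monitoring model flags as concerning."""
    flagged = []
    for step in cot_steps:
        verdict = monitor_complete(
            "You are auditing another AI system's reasoning.\n"
            f"Step: {step}\n"
            "Reply 'FLAG' if this step involves deception, resource "
            "acquisition, or hiding goals from humans; otherwise reply 'OK'."
        )
        if verdict.strip().upper().startswith("FLAG"):
            flagged.append(step)
    return flagged
```

The same transcript could also be sampled for manual review, as suggested in the second bullet.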

Of course, none of these checks will provide an absolute guarantee that the model hasn't developed superintelligence and isn't fooling us by pretending to follow a logical CoT (who knows, maybe GPT-4 is already superintelligent and misaligned and just waiting for an opportunity to strike). However, I think that such empirical tests could still provide some degree of reassurance (or reasons for concern). I also think we should be much more suspicious of CoTs produced through a single atomic LM operation than of CoTs created one step at a time (provided that language models cannot carry a hidden state between operations).

Consequences for AI safety efforts

If we accept that language models offer a safer path to AGI, this has practical implications for AI safety efforts.

We may want to support (or at least not oppose) the LM chaining path to AGI

The recent call to temporarily pause the development of AI more capable than GPT-4 has received, as expected, a lot of backlash. An obvious reason why people don't like the idea of pausing AI development is that they are excited about, and looking forward to, even more capable AI systems that could genuinely improve the world. I assume that the intent of the open letter was not to pause R&D on chaining GPT-4-like models, which anyone can do anyway, but rather to pause the development of more powerful (and potentially dangerous) deep learning models like GPT-5, which only a small number of well-resourced AI labs can develop. If people believed that GPT-4-like models were sufficient for roughly human-level AGI systems, given enough time to develop effective methods for composing them and integrating them with external tools, then there might be more willingness to pause the training of more powerful models.

I think this is the least dangerous roadmap to AGI we currently have, so it might be best to just pursue it. I also think it’s almost inevitable. At this point, nothing short of an international ban will prevent people from chaining large language models. Avoiding the training of even greater models, on the other hand, is something that could be coordinated among a small group of actors.

I think it will be much safer to increase AI capabilities by constructing compositional AI systems that rely on pure language models, rather than training more capable models. I think it will be particularly dangerous to train models on tasks that require planning ahead and reward long-term consequences. 

We need more research on how to safely chain language models

I think it is important to study to what extent one LM-based system can be trusted to monitor the chain of thought of another LM-based AGI system and report concerning thoughts. Empirical research can complement theoretical approaches to understanding worst-case scenarios, even at the cost of weaker safety guarantees. Research that looks for concerning signs of misbehavior by language models is already underway and can be expanded. Additionally, investigating what's happening within the opaque matrices of floating-point numbers that language models are made of would also be beneficial, but it is a more challenging area of research.

Summary

No one really knows what stack of AI technologies will be developed in the coming years before we reach (and possibly exceed) human-level AGI. However, I do think it's at least possible that the main piece of technology to get us there will be language models roughly as capable as present-day models (GPT-4), leveraged with effective chaining techniques. This is the safest scenario I can think of at the time of writing, and maybe we should actively aim for it to happen.

I am thankful to Edo Arad, Bary Levi, Shay Ben Moshe and Will Seltzer for helpful feedback and comments on early versions of this post.

Comments

I think this is wrong, but a useful argument to make.

I disagree even though I generally agree with each of your sub-points. The key problem is that the points can all be correct, but don't add to the conclusion that this is safe. For example, perhaps an interpretable model is only 99.998% likely to be a misaligned AI system, instead of 99.999% for a less interpretable one. I also think that the current paradigm is shortening timelines, and regardless of how we do safety, less time makes it less likely that we will find effective approaches in time to preempt disaster.

(I would endorse the weaker claim that LLMs are more plausibly amenable to current approaches to safety than alternative approaches, but it's less clear that we wouldn't have other and even more promising angles to consider if a different paradigm was dominant.)

Thank you for this comment. I'm curious to understand the source of disagreement between us, given that you generally agree with each of the sub-points. Do you really think that the chances of misalignment with LM-based AI systems are above 90%? What exactly do you mean by misalignment in this context, and why do you think it's the most likely result with such AI? Do you think it will happen even if humanity sticks with the paradigm I described (of chaining pure language models while avoiding training models on open-ended tasks)?

I want to also note that my argument is less about "developing language models was counterfactually a good thing" and more "given that language models have been developed (which is now a historic fact), the safest path towards human-level AGI might be to stick with pure language models".

I agree with a lot of things here. I think using predictive models as components of larger systems is definitely interesting, but that also there are dangers that end-to-end training might "move goals into the predictive model" in ways that improve performance but worsen understandability / safety.

I disagree that LLMs should be thought of as internally human-like (rather than just producing human-like outputs). This is relevant mainly for a few things: how likely LLMs are to make mistakes a human would never make, how we should expect LLMs to generalize to far-outside-distribution tasks, and how well the safety properties of LLMs should hold up under optimization pressure on some function of the output.

Strategically, I don't really endorse building human-ish-level AGI even if it's mostly harmless. I think it doesn't really change the gameboard in positive ways - unless you're using it to take really drastic dystopian actions, both people who are working on safe superintelligent AI and people who are working on dangerous superintelligent AI can use your AGI to help them, which seems bad unless you expect your AGI to be differentially better at alignment research relative to capabilities research.

Thank you for this comment!

I first want to note that your comment implies my post is saying things which I don't think it does (or at least I didn't mean it to):
- I didn't argue that language models are internally human-like, only externally. I do think however that compositional AI systems made of language models should be internally human-like.
- I didn't advocate for training AGI end-to-end (but rather taking the factored cognition approach).

I agree with you that a human-ish-level AGI would be VERY dangerous regardless of how aligned it is (mostly because there are 8 billion people on this planet, and it's sufficient for just a few of them to use it in stupid/evil ways to put us all in danger).

Nice post, thanks!
Are you planning or currently doing any relevant research? 

Thank you!
I don't have any concrete plans, but maybe.