Language models are not inherently safe

Olli Järviniemi

Giving my take on a popular^[1] topic; writing a thousand ways to Rome. Broad lines likely right, less sure of details.

For some time I've had the thought "intelligent agent things are dangerous, so you had better try to get an intelligent thing that is not agentic, like language models". Recently I thought about this a bit more and realized that my views were off - language models are not inherently safe (and the insight generalizes to other intelligent systems).

Three sections follow: my old view, what made me change my mind, and how I currently see the issue.

Agents bad, other AI less bad?

My canonical example of an agentic system is AlphaZero: The outer behavior of the system is well described by "it maximizes its win probability", and internally there is a search procedure to pick good moves. I sure hope we don't deploy OmegaZero that takes "real-life" actions.

But not all AI systems are agentic. Language models are better modeled as simulators. Yes, they can generate agentic simulacra. But maybe as long as we are careful and don't do that, we are fine? And perhaps we can still get useful stuff out of the language models: maybe you could have some prompt like

"Problem 1: How to build computer chips?

Solution 1: [insert instructions for computer chip building here]

Problem 2: How to build [insert advanced technology here]?

Solution 2:"

and, while the resulting output might not work because the prompt is bad or we are out-of-distribution etc., surely nothing dangerous is happening? I now think this reasoning is wrong.

How I came around

Where does the belief come from that systems vaguely like AlphaZero are dangerous but systems like GPT-3 are not? A basic description of what these two kinds of models do is that AlphaZero is built on reinforcement learning and takes actions to win a game (which, again, feels quite agentic), whereas GPT-3 models the text in its training data (or the distribution it was sampled from), and predicts the next token for some input. These appear very different, but digging deeper the differences are not fundamental:

First, you could build a reinforcement learning system which plays the "guess the next token" game, the set of possible actions being the set of tokens, and reward being obtained if the prediction is correct. This gives you a text predictor. Conversely, you can treat many problems as token prediction problems and apply transformers to a wide variety of tasks.^[2]
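
To make the first direction concrete, here is a minimal sketch of a reinforcement learning agent playing the "guess the next token" game. Everything in it is an assumption made for illustration and not taken from any real system: the state is just the previous token, the agent is a tabular epsilon-greedy learner, and the names `train_token_guesser` and `predict` are hypothetical. The point is only that "action = token, reward = correct guess" yields a (very weak) text predictor.

```python
import random
from collections import defaultdict


def train_token_guesser(tokens, episodes=200, epsilon=0.1, seed=0):
    """Tabular epsilon-greedy agent for the 'guess the next token' game.

    State:  the previous token (an assumption; real models use long contexts).
    Action: a guess for the next token.
    Reward: 1.0 if the guess matches the text, else 0.0.
    """
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    q = defaultdict(float)     # q[(state, action)]: estimated reward
    counts = defaultdict(int)  # how often each (state, action) was tried

    for _ in range(episodes):
        for prev, nxt in zip(tokens, tokens[1:]):
            if rng.random() < epsilon:
                guess = rng.choice(vocab)  # explore: random token
            else:
                # exploit: token with the highest estimated reward
                guess = max(vocab, key=lambda a: q[(prev, a)])
            reward = 1.0 if guess == nxt else 0.0
            counts[(prev, guess)] += 1
            # Incremental average of the rewards seen for this pair.
            q[(prev, guess)] += (reward - q[(prev, guess)]) / counts[(prev, guess)]
    return q, vocab


def predict(q, vocab, prev):
    """Greedy next-token prediction from the learned reward estimates."""
    return max(vocab, key=lambda a: q[(prev, a)])


q, vocab = train_token_guesser("a b c a b c a b c".split())
print(predict(q, vocab, "a"))
```

The agent here was "trained with RL" and "takes actions", yet what it ends up computing is a lookup table of likely next tokens.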

Second, both RL systems and GPT-3 learn stuff about their environment. I've thought that RL systems are more dangerous because they explicitly learn to approximate future rewards, so that their world model is "pointed" toward things relevant for reward, whereas there is no such explicit world model within language models. However, GPT models definitely do learn to model the world, and their world models are not inherently "neutral" either: they are pointed toward things that result in low loss.

(And don't get me started on reward not being the optimization target, which increases uncertainty on what type of minds different architectures and training procedures actually create.)

Okay, the dangerousness of a model is not (solely) determined by whether it's an RL system or language model. So what is it that makes some models more dangerous than others?

It's the computation, stupid

To return to the earlier example prompt, consider what happens inside a language model as it predicts the next character/token/word. There are, of course, many options depending on what type of model we have at hand, such as:

1. The model always outputs a uniformly random character
2. The model uses built-in linguistic rules, together with n-gram frequencies obtained from text datasets, to generate the next word
3. The model is GPT-3 and does whatever it is that GPT-3 does
4. The model, in order to model the distribution the training data was sampled from, uses the training data given to it to model the world and then simulates (some approximation of) the whole human civilization

(For the fourth one, leave aside questions like "Isn't that, like, totally unrealistic/unlikely/unfeasible?" - these are not central here.)

It is clear that the first two example models are not dangerous. The fourth one, while maybe not dangerous, would be quite interesting. You can take that hypothetical in various directions, like "they realize they are in a simulation and want out", but the specifics are not important. The point is that models whose outer behavior is (or which are trained on the objective of) "predicts text correctly with high probability" can have vastly different inner behavior, and this can have implications beyond the quality of completed text.
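
As a small illustration of "same outer interface, different inner computation", here is a sketch of the two harmless models from the list. The class names and the `next_distribution` method are hypothetical, and the second model is simplified to character bigram counts only (no built-in linguistic rules). A caller only ever sees a next-character distribution; what happens inside differs.

```python
from collections import Counter, defaultdict


class UniformModel:
    """Ignores the context and spreads probability evenly."""

    def __init__(self, alphabet):
        self.alphabet = sorted(set(alphabet))

    def next_distribution(self, context):
        p = 1.0 / len(self.alphabet)
        return {ch: p for ch in self.alphabet}


class BigramModel:
    """Predicts the next character from counts of adjacent character pairs."""

    def __init__(self, text):
        self.alphabet = sorted(set(text))
        self.counts = defaultdict(Counter)
        for prev, nxt in zip(text, text[1:]):
            self.counts[prev][nxt] += 1

    def next_distribution(self, context):
        # Look only at the last character of the context.
        following = self.counts.get(context[-1:], Counter())
        total = sum(following.values())
        if total == 0:
            # Unseen or empty context: fall back to a uniform guess.
            p = 1.0 / len(self.alphabet)
            return {ch: p for ch in self.alphabet}
        return {ch: n / total for ch, n in following.items()}


# Both models answer the same question through the same interface.
for model in (UniformModel("ab"), BigramModel("abababab")):
    print(type(model).__name__, model.next_distribution("a"))
```

Nothing in the shared interface tells you which computation runs behind it, and that stays true as the inner computation gets arbitrarily more elaborate.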

Also note that the fourth outcome is very far from the worst possibilities - "you simulate the human civilization" is pretty good as these things go. One can certainly come up with scenarios that are more worrisome (read: scenarios where you die).

Taken together, we get the main point:

Language models are not inherently safe. Simulations are not inherently safe. There are computations that kill you.

It's precisely the computation the model performs that kills you.

Not whether you arrived at the model via reinforcement learning or training on next-token prediction. Not whether the model is "taking actions" or "just" predicting the next token. Not whether you call it an agent or a language model.

(This is not to deny that some choices of architecture and training process are more dangerous than others. But they only affect dangerousness by affecting the probability that we end up doing dangerous computations! Also, the model's action space (e.g. set-of-Go-moves or set-of-tokens) might influence the level of capability the model needs to cause harm, though I'm unsure whether the effect is non-negligible when considering outcomes causing existential catastrophes.)

^[1] In particular, Section 3.3 in "Ngo and Yudkowsky on alignment difficulty" touches on this topic.

^[2] A well-known example is Gato. Gwern has a long list of examples, such as this one.

Comment by Vladimir_Nesov (1y):

With LLMs, the alignment hope is a combination of the LLM itself not being agentic, and a simulacrum it channels being a human imitation. So eventually it's an aligned-by-default agent running on a non-agentic substrate (like humans run on physics). This framing gets into trouble when the simulacrum abstraction leaks (which RL might mitigate), or the substrate wakes up (which RL might cause).

And more potential trouble after too much self-generated training data, which might be important for capabilities. With human-generated data increasingly diluted by LLM-generated synthetic data, eventually a simulacrum might lose its initial alignment. The cognitive architecture doesn't match the original humans, and subtle errors could compound towards an equilibrium that's far from where human cognitive architecture anchors human culture.
