Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

1 minute takeaways

  • It's actually pretty easy to train and run a language model to function as an agent for a specific task, rather than as a non-agentic simulator
  • The resulting agents seem pretty powerful at least in limited domains and therefore might turn out to be quite useful. They also have some possibly concerning properties
  • More specifically, they're evidential decision theorists (EDTs), which are known for one-boxing and cooperating with copies of themselves.
    • Incidentally, they're also known for struggling with causality, which is related to why LLMs hallucinate.
    • It's also possible to make causal decision theorists (CDTs), which are maybe not so bad but still not ideal
  • Alignment-wise, this means you outer align them by giving them a utility function. Inner alignment is still a nightmare with the added challenge of making sure the model is correctly inferring the utility function.

How to make an EDT agent out of a language model

You can find a thorough explanation in this NeurIPS paper from last year, which laid out how to create such agents and showed that they were state of the art.

The gist is, you take the standard reinforcement learning loop of 'action/state/reward' and train a transformer to simulate it. So the transformer is outputting tokens corresponding to the actions of an agent, the state of the world, and the agent's reward, and the resultant string shows how these progress over time. It is effectively simulating an RL agent. Crucially, the 'reward' is both current reward and total expected future reward.
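To make the training data concrete, here is a minimal sketch of how a recorded trajectory might be turned into such a string. The token layout and the names `returns_to_go` and `serialise` are my own assumptions for illustration, not the paper's format. Note that in recorded data the "total future reward" is the realised sum of later rewards; the trained model then predicts its expectation.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """For each step, sum the reward at that step and all later steps."""
    total = 0.0
    out = []
    for r in reversed(rewards):
        total += r
        out.append(total)
    return out[::-1]


def serialise(
    states: Sequence[str],
    actions: Sequence[str],
    rewards: Sequence[float],
) -> List[Tuple[str, object]]:
    """Interleave state, action, reward and return-to-go into one sequence.

    The ordering here is an assumed layout, chosen only for illustration.
    """
    sequence: List[Tuple[str, object]] = []
    for s, a, r, g in zip(states, actions, rewards, returns_to_go(rewards)):
        sequence += [
            ("state", s),
            ("action", a),
            ("reward", r),
            ("return_to_go", g),
        ]
    return sequence


trajectory = serialise(["s0", "s1", "s2"], ["up", "up", "down"], [1.0, 0.0, 2.0])
```

A transformer trained on sequences like `trajectory` learns to predict each kind of token from what came before, including the return that tends to follow a given state and action.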

You run the transformer as follows: instead of predicting how the agent will act, you have it iterate through every possible action the agent might take and predict the total expected future reward conditional on that action. Then, you take whichever action has the highest expected future reward, and have the agent 'take that action', adding it to the prompt. It is effectively choosing the action which provides the best evidence of it maximising utility. This matches the definition of an evidential decision theorist: an agent that picks the action with the highest expected utility conditional on taking it.
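The loop itself is short. Below is a rough sketch, assuming a hypothetical `predict_return(history, action)` function that stands in for querying the trained transformer; the lookup table is a toy stand-in and not a real model.

```python
from typing import Callable, Sequence, Tuple

Token = Tuple[str, object]
History = Tuple[Token, ...]


def choose_action(
    history: History,
    actions: Sequence[str],
    predict_return: Callable[[History, str], float],
) -> str:
    """Pick the action whose conditional predicted return is highest."""
    return max(actions, key=lambda a: predict_return(history, a))


def toy_predict_return(history: History, action: str) -> float:
    """Toy stand-in for the transformer: a fixed table of predicted returns."""
    table = {"left": 1.0, "right": 3.0, "stay": 2.0}
    return table[action]


history: History = (("state", "s0"),)
best = choose_action(history, ["left", "right", "stay"], toy_predict_return)

# The agent 'takes' the action by appending it to the prompt.
history = history + (("action", best),)
```

With a real model, `predict_return` would append the candidate action to the prompt and read off the predicted return token, once per candidate action.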

How to make a CDT

As far as I can tell, that's what's happening in this DeepMind paper. Basically you have a slightly different loss function which uses "counterfactual teaching" to have the model treat agent actions as causal interventions. In this paper the simulated agent is being used to imitate an expert, and they demonstrate that it does so in a manner which avoids hallucination and standard EDT problems. To actually create a CDT you still need to implement the above loop of iterating through actions and checking conditional utility, but after that's done it should work just as well as the EDT, treating its actions as causal interventions rather than evidence.
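A toy example shows why the distinction matters. This is my own illustration with made-up numbers, not the paper's method. Suppose a hidden variable `z` drove both the expert's action and the reward in the training data, while the action itself barely matters. Conditioning on the action (EDT-style) and intervening on it (CDT-style) then recommend different actions.

```python
# Hidden variable z is 0 or 1 with equal probability (assumed numbers).
P_Z = {0: 0.5, 1: 0.5}


def p_action_given_z(a: int, z: int) -> float:
    """In the data, the expert sees z and copies it 90% of the time."""
    return 0.9 if a == z else 0.1


def reward(z: int, a: int) -> float:
    """Reward is mostly z; action 0 adds a small causal bonus."""
    return z + (0.1 if a == 0 else 0.0)


def evidential_value(a: int) -> float:
    """E[R | A=a]: treat the action as evidence about z."""
    joint = {z: P_Z[z] * p_action_given_z(a, z) for z in P_Z}
    total = sum(joint.values())
    return sum(joint[z] / total * reward(z, a) for z in P_Z)


def causal_value(a: int) -> float:
    """E[R | do(A=a)]: set the action, leaving the distribution of z alone."""
    return sum(P_Z[z] * reward(z, a) for z in P_Z)


edt_choice = max((0, 1), key=evidential_value)
cdt_choice = max((0, 1), key=causal_value)
```

Here the evidential estimate favours action 1 (about 0.9 against 0.2), because in the data action 1 usually came with z = 1. The causal estimate favours action 0 (about 0.6 against 0.5), because choosing the action does nothing to z.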

Decision Transformers

You can also speed up the whole process by sacrificing some performance. Whereas the above approach is to condition utility on each possible action, you can also simply specify a high utility, and then condition a single action on it. This is already enough to get state-of-the-art RL agents that can infer strategies better than what they see in their data. But of course it gets confused sometimes when you prompt it with a utility it can't achieve, among other things.
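In code, the shortcut replaces the search over actions with a single query. The sketch below assumes a hypothetical `predict_action(history, target_return)` function in place of the real model; the toy version just returns the action whose recorded return is closest to the target.

```python
from typing import Callable, Tuple

History = Tuple[object, ...]


def act_on_target(
    history: History,
    target_return: float,
    predict_action: Callable[[History, float], str],
) -> str:
    """Condition on a desired return and read off a single action."""
    return predict_action(history, target_return)


def toy_predict_action(history: History, target_return: float) -> str:
    """Toy stand-in: the action whose recorded return is nearest the target."""
    observed = {"left": 1.0, "right": 3.0, "stay": 2.0}
    return min(observed, key=lambda a: abs(observed[a] - target_return))


action = act_on_target((), 3.0, toy_predict_action)
```

The toy simply falls back to the nearest recorded return when the target is out of reach, so it does not reproduce the confusion a real model can show in that case.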

Alignment Consequences

At the very least, we cannot rely on the hope that LLMs simply aren't agent-like and would never become agent-like. They can, and there are good reasons people will want them to.

These agents will probably lag behind general-purpose LLMs in power because they need to be specially trained for a certain task. But, being utility maximisers, they should just straightforwardly do all the instrumentally convergent power seeking we might hope to avoid, and insofar as these models are currently very good at simple games, it seems not wildly unlikely that they'll scale in the same way LLMs do.

Fortunately, the two most natural decision theories for them to implement are the two most well-studied, including in MIRI's agent foundations work. Unfortunately they have both been noted as having deceptive properties we'd want to avoid, even without inner alignment problems.

