Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

1-minute takeaways

  • It's actually pretty easy to train and run a language model to function as an agent for a specific task, rather than as a non-agentic simulator
  • The resulting agents seem pretty powerful at least in limited domains and therefore might turn out to be quite useful. They also have some possibly concerning properties
  • More specifically, they're evidential decision theorists (EDTs), which are known for one-boxing and cooperating with copies of themselves.
    • Incidentally, they're also known for struggling with causality, which is related to why LLMs hallucinate.
    • It's also possible to make causal decision theorists (CDTs), which are maybe not so bad but still not ideal
  • Alignment-wise, this means you outer align them by giving them a utility function. Inner alignment is still a nightmare with the added challenge of making sure the model is correctly inferring the utility function.


How to make an EDT agent out of a language model:

You can find a thorough explanation in this NeurIPS paper from last year, which laid out how to create such agents and showed that they were state of the art.

The gist is, you take the standard reinforcement learning loop of 'action/state/reward' and train a transformer to simulate it. So the transformer is outputting tokens corresponding to the actions of an agent, the state of the world, and the agent's reward, and the resultant string shows how these progress over time. It is effectively simulating an RL agent. Crucially, the 'reward' is both current reward and total expected future reward.
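To make the data format concrete, here's a toy sketch (my own illustrative code, not taken from the paper) of how an episode might be serialized into a token sequence, with the reward slot holding return-to-go rather than just the immediate reward; the S:/A:/R: token format is just an assumed convention.

```python
# A minimal sketch of serializing an RL trajectory for a sequence model, where
# the "reward" slot is return-to-go (current reward plus all future reward).

from typing import List, Tuple

def serialize_trajectory(steps: List[Tuple[str, str, float]]) -> List[str]:
    """steps is a list of (state, action, reward) tuples for one episode."""
    # Compute return-to-go: at each step, the sum of this and all future rewards.
    rewards = [r for _, _, r in steps]
    returns_to_go = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        returns_to_go.append(running)
    returns_to_go.reverse()

    # Interleave tokens as <state> <action> <return-to-go>, step after step.
    tokens = []
    for (state, action, _), rtg in zip(steps, returns_to_go):
        tokens += [f"S:{state}", f"A:{action}", f"R:{rtg:.1f}"]
    return tokens

# Example: a 3-step episode.
episode = [("s0", "left", 0.0), ("s1", "right", 1.0), ("s2", "stay", 2.0)]
print(serialize_trajectory(episode))
# ['S:s0', 'A:left', 'R:3.0', 'S:s1', 'A:right', 'R:3.0', 'S:s2', 'A:stay', 'R:2.0']
```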

You run the transformer as follows: instead of predicting how the agent will act, you have it iterate through every possible action the agent might take and simulate the total expected future reward. Then, you take whichever action has the highest expected future reward, and have the agent 'take that action', adding it to the prompt. It is effectively choosing the action which provides the best evidence of it maximising utility. This coincides perfectly with the definition of an evidential decision theorist.
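Concretely, the selection loop might look like the following sketch, where predict_return is a hypothetical interface to the trained sequence model that returns its predicted total expected future reward for a given prompt; this is my own illustration of the idea, not code from the paper.

```python
# A minimal sketch of the EDT-style action selection described above.
# predict_return(tokens) is a hypothetical stand-in for querying the trained
# sequence model for its predicted total expected future reward.

def choose_action_edt(prompt_tokens, candidate_actions, predict_return):
    best_action, best_value = None, float("-inf")
    for action in candidate_actions:
        # Condition on the candidate action and read off the predicted return:
        # this is the conditional expectation E[utility | action].
        value = predict_return(prompt_tokens + [f"A:{action}"])
        if value > best_value:
            best_action, best_value = action, value
    # Commit to the argmax action by appending it to the running prompt.
    return best_action, prompt_tokens + [f"A:{best_action}"]
```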

How to make a CDT

As far as I can tell, that's what's happening in this DeepMind paper. Basically you have a slightly different loss function which uses "counterfactual teaching" to have the model treat agent actions as causal interventions. In this paper the simulated agent is being used to imitate an expert, and they demonstrate that it does so in a manner which avoids hallucination and standard EDT problems. To actually create a CDT you still need to implement the above loop of iterating through actions and checking conditional utility, but after that's done it should work just as well as the EDT, treating its actions as causal interventions rather than evidence.
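If you already have an intervention-style return predictor (however obtained), the selection loop barely changes; the sketch below is my own illustration, with predict_return_do a hypothetical stand-in for a model trained so that its prediction behaves like E[U | do(a)] rather than E[U | a].

```python
# Same argmax loop as the EDT case, but the candidate action is handed to an
# intervention-conditioned predictor (roughly E[U | do(a)]) instead of being
# treated as observed evidence in the sequence.

def choose_action_cdt(prompt_tokens, candidate_actions, predict_return_do):
    scored = [(predict_return_do(prompt_tokens, action), action)
              for action in candidate_actions]
    _, best_action = max(scored)
    return best_action, prompt_tokens + [f"A:{best_action}"]
```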

Decision Transformers

You can also speed up the whole process by sacrificing some performance. Whereas the above approach conditions the predicted utility on each possible action, you can instead simply specify a high utility up front and condition a single action on it. This is already enough to get state-of-the-art RL agents that can infer strategies better than what they see in their data. But of course it sometimes gets confused when you prompt it with a utility it can't actually achieve, among other things.
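A rough sketch of that shortcut, again with a hypothetical model interface (sample_next_action) rather than anything from the papers:

```python
# Decision-transformer-style shortcut: prompt with a high target return and let
# the model fill in a single action conditioned on it, instead of scoring every
# candidate action separately.

def choose_action_dt(prompt_tokens, target_return, sample_next_action):
    # Ask for the action the model thinks a high-return agent would take here.
    conditioned_prompt = prompt_tokens + [f"R:{target_return:.1f}"]
    action = sample_next_action(conditioned_prompt)
    return action, conditioned_prompt + [f"A:{action}"]
```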

Alignment Consequences

At the very least, we cannot rely on the hope that LLMs simply aren't agent-like and would never become agent-like. They can, and there are good reasons people will want them to.

These agents will probably lag behind general-purpose LLMs in power because they need to be specially trained for a particular task. But, being utility maximisers, they should just straightforwardly do all the instrumentally convergent power-seeking we might hope to avoid, and insofar as these models are currently very good at simple games, it seems not wildly unlikely that they'll scale in the same way LLMs do.

Fortunately, the two most natural decision theories for them to implement are also the two most well-studied, including in MIRI's agent foundations work. Unfortunately, they have both been noted as having deceptive properties we'd want to avoid, even without inner alignment problems.

Comments

Very interesting post! Unfortunately, I found this a bit hard to understand because the linked papers don’t talk about EDT versus CDT or scenarios where these two come apart and because both papers are (at least in part) about sequential decision problems, which complicates things. (CDT versus EDT can mostly be considered in the case of a single decision and there are various complications in multi-decision scenarios, like updatelessness.)

Here’s an attempt at trying to describe the relation of the two papers to CDT and EDT, including prior work on these topics. Please correct me if I’m misunderstanding anything! The writing is not very polished -- sorry!

Ignoring all the sequential stuff, my understanding is that the first paper basically does this: First, we train a model to predict utilities after observing actions, i.e., make predictions conditional on actions. So in particular, we get a function a ---> E[utility | a] that maps an observed action by the agent onto a prediction of future reward/utility. Then if we use some procedure to find the action a that maximizes E[utility | a], it seems that we have an EDT agent. I think this is essentially the case of an “EDT overseer” who rewards based on actions (rather than outcomes) in “Approval-directed agency and the decision theory of Newcomb-like problems”. Also see the discussion of Obstacle 1 in "Two Major Obstacles for Logical Inductor Decision Theory".

Now what could go wrong with this? I think in some sense the problem is generally that it's unclear how the predictive model works, or where it comes from. The second paper (the DeepMind one) basically points out one issue with this; other issues are already known to this community. I'll start with one of the known ones: the 5 and 10 problem / the problem of counterfactuals. If the agent always (reliably) chooses the action a that maximizes E[utility | a], then the predictive model's counterfactual predictions (i.e., predictions for all other actions) could be nonsensical without being strictly speaking wrong. So for example, in 5 and 10, you choose between a five dollar bill and a ten dollar bill. (There's no catch and you should clearly just take the ten dollar bill.) The model predicts that if you take the five dollar bill, you will get five dollars, and (spuriously / intuitively falsely) that if you take the ten dollar bill, you get nothing. Because you are maximizing expected utility according to this particular predictive model, you take the five dollars. So the crazy prediction for what happens if you take the ten dollars is never falsified.
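A tiny simulation of this failure mode (my own toy code, with made-up value estimates) shows how the spurious prediction survives indefinitely:

```python
# Toy 5-and-10: the value estimate for "ten" starts out spuriously at zero, the
# greedy agent therefore never takes it, and the wrong estimate is never
# corrected by experience.

def true_payoff(action):
    return {"five": 5.0, "ten": 10.0}[action]

estimates = {"five": 5.0, "ten": 0.0}  # spurious initial belief about "ten"

for _ in range(1000):
    action = max(estimates, key=estimates.get)   # greedy under the wrong model
    reward = true_payoff(action)
    estimates[action] += 0.1 * (reward - estimates[action])  # update only what was tried

print(estimates)  # {'five': 5.0, 'ten': 0.0} -- the crazy prediction is never falsified
```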

In non-Newcomb-like scenarios, a simple, extremely standard solution to this problem is to train the predictive model (the thing that gives a ---> E[utility | a]) while the agent follows some policy that randomizes over all actions (perhaps one that takes actions with probabilities in proportion to the model's predictions E[utility | a]). My understanding is that this is how the first paper avoids these issues and gives good results. Unfortunately, in Newcomb-like problems these approaches tend to lead to pretty CDT-ish behavior, as shown in "Reinforcement Learning in Newcomblike Environments".
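Continuing the same toy example, adding a bit of randomization during data collection is enough to correct the spurious estimate (again, just my own illustration of the point):

```python
# Same toy 5-and-10, but the data-collection policy is epsilon-greedy, so every
# action occasionally gets tried and its prediction gets checked against reality.

import random

estimates = {"five": 5.0, "ten": 0.0}

for _ in range(1000):
    if random.random() < 0.1:                       # explore
        action = random.choice(list(estimates))
    else:                                           # exploit
        action = max(estimates, key=estimates.get)
    reward = {"five": 5.0, "ten": 10.0}[action]
    estimates[action] += 0.1 * (reward - estimates[action])

print(estimates)  # the estimate for "ten" now converges toward 10.0
```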

Anyway, the second paper (the DeepMind one) points out another issue related to where the E[utility | action] model comes from. Roughly, the story — which I think is very well described in Section 2 — seems to be the following: the E[utility | action] model is trained on the actions of an expert who observes X and acts on that fact by choosing A=X; then the E[utility | action] model won't work for a non-expert agent, i.e., one who doesn't observe X. I view this as a distributional shift issue — you train a model (the a ---> E[utility | a] one) in a setting where A=X, and then you apply it in a setting where sometimes A and X are uncorrelated.

It’s also similar to the Smoking Lesion/medical Newcomb-like problems! Consider the following medical Newcomb-like problem: First we learn the fact that sick people go to the doctor and healthy people don’t go to the doctor. Then without looking at how healthy I am, I don’t go to the doctor so as to gain evidence that I am healthy. Arguably what goes wrong here is also that I’m using a rule for prediction out of distribution on someone who doesn’t look at whether they’re sick. I think it relates to one of the least challenging versions of medical Newcomb-like problems and it’s handled comfortably by the so-called tickle defense.

Interlude: The paper talks about how this relates to hallucination in LLMs. So what's that about? IIUC, the idea is that when generating text, LLMs incorrectly update based on the text they generate themselves. For example, imagine that you want an LLM to generate ten tokens. Then after generating the first nine tokens x1, ..., x9, it will predict the tenth token from its learned distribution P(x10 | x1, ..., x9). But this distribution was trained on fully human-written, not LLM-written, text. So (in my way of thinking), P(x10 | x1, ..., x9) might do poorly (i.e., not give a human-like continuation of x1, ..., x9), because it was trained on seeing nine tokens created by a human and having to predict a continuation by a human, rather than nine tokens by itself/an LLM and having to predict a continuation by a human. For example, we might imagine that if x1, ..., x9 are words that only a human expert confident in a particular claim C would say, then the LLM will predict continuations that confidently defend claim C, even if the LLM doesn't know anything about C. I'm not sure I really buy this explanation of hallucination. I think the claim would need more evidence than the authors provide. But it's definitely a very interesting point.

Now, back to the original toy model. Again, I would view this as a distribution shift problem. If we make some assumptions, though, we can infer/guess a model (i.e. function a ---> E[utility | a]) that predicts the utility obtained by a non-expert, i.e., an agent who doesn't observe X. Specifically, let’s assume that we are told the conditional distributions P(utility | X=1, A=0) and P(utility | X=0, A=1) (which we never see in training if the agent in training always knows and acts on X). Let’s also assume that we know that the difference between the training distribution and the new setting is that in the new setting the agent chooses A independently of X. Then in the new model we just need to make X and A independent and change nothing else. Formally you use the new distribution P’(X,U|A) = P(X)P(U|A,X), where the Ps on the right-hand side are just the old distribution, instead of P(X,U|A) = P(X|A)P(U|A,X).

It turns out that if we put the original distribution into a causal graph with edges X->A and A->U and X->U and then make a do-intervention on A (a la Pearl), then we get this exact distribution, i.e., P(X,U|do(A)) = P’(X,U|A). (Intuitively, removing the inference from A to X is exactly what the do(A) does if A's parent is X.) So in particular maximizing E[U | do(A)] gives the same result as maximizing E’[U|A]. Anyway, the paper uses the do operator to construct the new predictor, rather than the above argument. They seem to claim that the causal structure (or reasoning about causality) is necessary to construct the new predictor, with which I disagree.
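Here is a small numerical check of that equivalence, with made-up numbers for P(X) and P(U=1 | A, X) (including the two cells never seen in training, which I simply assume are given):

```python
# Toy check that P'(X,U|A) = P(X) P(U|A,X) gives the same expected utilities as
# P(X,U|do(A)) for the graph X -> A, X -> U, A -> U, and how both differ from
# naive conditioning under the training distribution where the expert sets A = X.

p_x = {0: 0.5, 1: 0.5}                      # P(X)
p_a_given_x = {0: {0: 1.0, 1: 0.0},         # training expert: A = X
               1: {0: 0.0, 1: 1.0}}
p_u1_given_ax = {(0, 0): 0.9, (0, 1): 0.2,  # P(U=1 | A, X), assumed given for
                 (1, 0): 0.1, (1, 1): 0.8}  # all four cells

def e_u_conditional(a):
    # E[U | A=a] under the training distribution: conditioning on A reveals X.
    joint = {x: p_x[x] * p_a_given_x[x][a] for x in p_x}
    z = sum(joint.values())
    return sum(joint[x] / z * p_u1_given_ax[(a, x)] for x in p_x)

def e_u_do(a):
    # E[U | do(A=a)] = E'[U | A=a] = sum_x P(x) P(U=1 | a, x): X independent of A.
    return sum(p_x[x] * p_u1_given_ax[(a, x)] for x in p_x)

for a in (0, 1):
    print(a, round(e_u_conditional(a), 3), round(e_u_do(a), 3))
# Naive conditioning says 0.9 and 0.8 (both actions look expert-like);
# the decorrelated/interventional values are 0.55 and 0.45.
```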

Is this really CDT? I’m not sure… In the above type of case, this doesn’t come apart from EDT. If we buy that their scenario is a bit like a Smoking Lesion, then one could argue that part of the point of CDT is to solve this type of scenario. (In some sense my response is as in most versions of the Smoking Lesion: Because of the tickle defense, EDT applied properly gets this right anyway, so there’s actually nothing to fix here.) In my view it’s basically just about using the do-calculus to concisely specify the scenario P’ (based on P plus a particular causal graph for P). It seems that one can do these things without being committed to using do(A) in a scenario where there’s some non-causal dependence between A and U (that doesn't disappear outside of training), perhaps via some common cause Y. In any case, the paper doesn’t tell us how to distinguish between U <- Y -> A and A -> Y -> U — all causal relationships are assumed. So while nominally they construct their predictor as E[U | do(A)], it’s a bit unclear how wedded they are to CDT.

Anyway, with a (maybe-causalist) E[U | do(A)] in hand, we can of course build a (maybe-)CDT agent by choosing a to maximize E[U | do(A)]. But I think the paper doesn’t say anything about where to get the causal model from that gives us E[U | do(A)]. They pretty much assume that the model is provided.

I think the “counterfactual teaching” stuff doesn't really say anything about CDT versus EDT, either. IIUC the basic idea is this. Imagine you want to train an LLM and you want to prevent the issue above. Then intuitively — in my distribution shift view — what we need to do is just train the LLM to make a good prediction of x10 upon observing tokens x1, ..., x9 that were generated by itself (rather than humans). The simplest, most obvious way to do this is to let the LLM generate some tokens x1, ..., x9, then get a probabilistic prediction about the next token from the LLM and then ask a human to give a next token x10. The loss of the LLM is just the, e.g., log loss of its prediction against the x10 provided by the human. One slightly tricky point here is that we only train the LLM to make good predictions on x10. We don't want to train it to output x1, ..., x9 that make x10 easier to predict. So we need to be careful to choose the right gradient. I think that's basically all they're doing, though. It doesn't seem like there's anything causalist here.
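Here is a minimal PyTorch-flavoured sketch of my reading of that training signal (my own toy code with a placeholder model, not the paper's setup): the model generates a prefix itself, a human supplies the next token, and the loss only trains the prediction of that human token.

```python
import torch
import torch.nn as nn

vocab_size, dim, prefix_len = 100, 32, 9

# Placeholder "LLM": an embedding plus a linear head over the flattened prefix.
embed = nn.Embedding(vocab_size, dim)
head = nn.Linear(dim * prefix_len, vocab_size)
opt = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()), lr=1e-3)

def counterfactual_teaching_step(human_next_token: int) -> float:
    # 1) The model generates its own prefix x1..x9. The tokens are discrete and
    #    sampled under no_grad, so no gradient trains the choice of prefix.
    with torch.no_grad():
        prefix = torch.randint(0, vocab_size, (1, prefix_len))  # stand-in for model sampling

    # 2) Predict the next token given the self-generated prefix.
    logits = head(embed(prefix).flatten(start_dim=1))

    # 3) Log loss against the token x10 that a human would write next.
    target = torch.tensor([human_next_token])
    loss = nn.functional.cross_entropy(logits, target)

    # 4) Gradients only improve the prediction of x10 given the prefix.
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(counterfactual_teaching_step(human_next_token=42))
```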

So, in conclusion: While very interesting, I don't think these papers tell us anything new about how to build an EDT or a CDT agent.