Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

1-minute takeaways

  • It's actually pretty easy to train and run a language model to function as an agent for a specific task, rather than as a non-agentic simulator
  • The resulting agents seem pretty powerful at least in limited domains and therefore might turn out to be quite useful. They also have some possibly concerning properties
  • More specifically, they're evidential decision theorists (EDTs), which are known for one-boxing and cooperating with copies of themselves.
    • Incidentally, they're also known for struggling with causality, which is related to why LLMs hallucinate.
    • It's also possible to make causal decision theorists (CDTs), which are maybe not so bad but still not ideal
  • Alignment-wise, this means you outer align them by giving them a utility function. Inner alignment is still a nightmare with the added challenge of making sure the model is correctly inferring the utility function.


How to make an EDT agent out of a language model:

You can find a thorough explanation in this NeurIPS paper from last year, which laid out how to create such agents and showed that they were state of the art.

The gist is, you take the standard reinforcement learning loop of 'action/state/reward' and train a transformer to simulate it. So the transformer is outputting tokens corresponding to the actions of an agent, the state of the world, and the agent's reward, and the resultant string shows how these progress over time. It is effectively simulating an RL agent. Crucially, the 'reward' is both current reward and total expected future reward.
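To make the data format concrete, here's a toy sketch (my own illustrative code, not taken from the paper) of how an episode might be serialized into a token sequence, with the reward slot holding return-to-go rather than just the immediate reward; the S:/A:/R: token format is just an assumed convention.

```python
# A minimal sketch of serializing an RL trajectory for a sequence model, where
# the "reward" slot is return-to-go (current reward plus all future reward).

from typing import List, Tuple

def serialize_trajectory(steps: List[Tuple[str, str, float]]) -> List[str]:
    """steps is a list of (state, action, reward) tuples for one episode."""
    # Compute return-to-go: at each step, the sum of this and all future rewards.
    rewards = [r for _, _, r in steps]
    returns_to_go = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        returns_to_go.append(running)
    returns_to_go.reverse()

    # Interleave tokens as <state> <action> <return-to-go>, step after step.
    tokens = []
    for (state, action, _), rtg in zip(steps, returns_to_go):
        tokens += [f"S:{state}", f"A:{action}", f"R:{rtg:.1f}"]
    return tokens

# Example: a 3-step episode.
episode = [("s0", "left", 0.0), ("s1", "right", 1.0), ("s2", "stay", 2.0)]
print(serialize_trajectory(episode))
# ['S:s0', 'A:left', 'R:3.0', 'S:s1', 'A:right', 'R:3.0', 'S:s2', 'A:stay', 'R:2.0']
```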

You run the transformer as follows: instead of predicting how the agent will act, you have it iterate through every possible action the agent might take and simulate the total expected future reward. Then, you take whichever action has the highest expected future reward, and have the agent 'take that action', adding it to the prompt. It is effectively choosing the action which provides the best evidence of it maximising utility. This coincides perfectly with the definition of an evidential decision theorist.
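Concretely, the selection loop might look like the following sketch, where predict_return is a hypothetical interface to the trained sequence model that returns its predicted total expected future reward for a given prompt; this is my own illustration of the idea, not code from the paper.

```python
# A minimal sketch of the EDT-style action selection described above.
# predict_return(tokens) is a hypothetical stand-in for querying the trained
# sequence model for its predicted total expected future reward.

def choose_action_edt(prompt_tokens, candidate_actions, predict_return):
    best_action, best_value = None, float("-inf")
    for action in candidate_actions:
        # Condition on the candidate action and read off the predicted return:
        # this is the conditional expectation E[utility | action].
        value = predict_return(prompt_tokens + [f"A:{action}"])
        if value > best_value:
            best_action, best_value = action, value
    # Commit to the argmax action by appending it to the running prompt.
    return best_action, prompt_tokens + [f"A:{best_action}"]
```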

How to make a CDT

As far as I can tell, that's what's happening in this DeepMind paper. Basically you have a slightly different loss function which uses "counterfactual teaching" to have the model treat agent actions as causal interventions. In this paper the simulated agent is being used to imitate an expert, and they demonstrate that it does so in a manner which avoids hallucination and standard EDT problems. To actually create a CDT you still need to implement the above loop of iterating through actions and checking conditional utility, but after that's done it should work just as well as the EDT, treating its actions as causal interventions rather than evidence.
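If you already have an intervention-style return predictor (however obtained), the selection loop barely changes; the sketch below is my own illustration, with predict_return_do a hypothetical stand-in for a model trained so that its prediction behaves like E[U | do(a)] rather than E[U | a].

```python
# Same argmax loop as the EDT case, but the candidate action is handed to an
# intervention-conditioned predictor (roughly E[U | do(a)]) instead of being
# treated as observed evidence in the sequence.

def choose_action_cdt(prompt_tokens, candidate_actions, predict_return_do):
    scored = [(predict_return_do(prompt_tokens, action), action)
              for action in candidate_actions]
    _, best_action = max(scored)
    return best_action, prompt_tokens + [f"A:{best_action}"]
```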

Decision Transformers

You can also speed up the whole process by sacrificing some performance. Whereas the above approach conditions the predicted utility on each possible action, you can instead simply specify a high utility up front and condition a single action on it. This is already enough to get state-of-the-art RL agents that can infer strategies better than what they see in their data. But of course it sometimes gets confused when you prompt it with a utility it can't actually achieve, among other things.
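A rough sketch of that shortcut, again with a hypothetical model interface (sample_next_action) rather than anything from the papers:

```python
# Decision-transformer-style shortcut: prompt with a high target return and let
# the model fill in a single action conditioned on it, instead of scoring every
# candidate action separately.

def choose_action_dt(prompt_tokens, target_return, sample_next_action):
    # Ask for the action the model thinks a high-return agent would take here.
    conditioned_prompt = prompt_tokens + [f"R:{target_return:.1f}"]
    action = sample_next_action(conditioned_prompt)
    return action, conditioned_prompt + [f"A:{action}"]
```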

Alignment Consequences

At the very least, we cannot rely on the hope that LLMs simply aren't agent-like and would never become agent-like. They can, and there are good reasons people will want them to.

These agents will probably lag behind general-purpose LLMs in power because they need to be specially trained for a particular task. But, being utility maximisers, they should just straightforwardly do all the instrumentally convergent power-seeking we might hope to avoid, and insofar as these models are currently very good at simple games, it seems not wildly unlikely that they'll scale in the same way LLMs do.

Fortunately, the two most natural decision theories for them to implement are also the two most well-studied, including in MIRI's agent foundations work. Unfortunately, they have both been noted as having deceptive properties we'd want to avoid, even without inner alignment problems.

Comments

Very interesting post! Unfortunately, I found this a bit hard to understand because the linked papers don’t talk about EDT versus CDT or scenarios where these two come apart and because both papers are (at least in part) about sequential decision problems, which complicates things. (CDT versus EDT can mostly be considered in the case of a single decision and there are various complications in multi-decision scenarios, like updatelessness.)

Here’s an attempt at trying to describe the relation of the two papers to CDT and EDT, including prior work on these topics. Please correct me if I’m misunderstanding anything! The writing is not very polished -- sorry!

Ignoring all the sequential stuff, my understanding is that the first paper basically does this: First, we train a model to predict utilities after observing actions, i.e., make predictions conditional on actions. So in particular, we get a function a ---> E[utility | a] that maps an observed action by the agent onto a prediction of future reward/utility. Then if we use some procedure to find the action a that maximizes E[utility | a], it seems that we have an EDT agent. I think this is essentially the case of an “EDT overseer” who rewards based on actions (rather than outcomes) in “Approval-directed agency and the decision theory of Newcomb-like problems”. Also see the discussion of Obstacle 1 in "Two Major Obstacles for Logical Inductor Decision Theory".

Now what could go wrong with this? I think in some sense the problem is generally that it's unclear how the predictive model works, or where it comes from. The second paper (the DeepMind one) basically points out one issue with this; other issues are already known to this community. I'll start with one of the known ones: the 5 and 10 problem / the problem of counterfactuals. If the agent always (reliably) chooses the action a that maximizes E[utility | a], then the predictive model's counterfactual predictions (i.e., predictions for all other actions) could be nonsensical without being strictly speaking wrong. So for example, in 5 and 10, you choose between a five dollar bill and a ten dollar bill. (There's no catch and you should clearly just take the ten dollar bill.) The model predicts that if you take the five dollar bill, you will get five dollars, and (spuriously / intuitively falsely) that if you take the ten dollar bill, you get nothing. Because you are maximizing expected utility according to this particular predictive model, you take the five dollars. So the crazy prediction for what happens if you take the ten dollars is never falsified.
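A tiny simulation of this failure mode (my own toy code, with made-up value estimates) shows how the spurious prediction survives indefinitely:

```python
# Toy 5-and-10: the value estimate for "ten" starts out spuriously at zero, the
# greedy agent therefore never takes it, and the wrong estimate is never
# corrected by experience.

def true_payoff(action):
    return {"five": 5.0, "ten": 10.0}[action]

estimates = {"five": 5.0, "ten": 0.0}  # spurious initial belief about "ten"

for _ in range(1000):
    action = max(estimates, key=estimates.get)   # greedy under the wrong model
    reward = true_payoff(action)
    estimates[action] += 0.1 * (reward - estimates[action])  # update only what was tried

print(estimates)  # {'five': 5.0, 'ten': 0.0} -- the crazy prediction is never falsified
```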

In non-Newcomb-like scenarios, a simple, extremely standard solution to this problem is to train the predictive model (the thing that gives a ---> E[utility | a]) while the agent follows some policy that randomizes over all actions (perhaps one that takes actions with probabilities in proportion to the model's predictions E[utility | a]). My understanding is that this is how the first paper avoids these issues and gives good results. Unfortunately, in Newcomb-like problems these approaches tend to lead to pretty CDT-ish behavior, as shown in "Reinforcement Learning in Newcomblike Environments".
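Continuing the same toy example, adding a bit of randomization during data collection is enough to correct the spurious estimate (again, just my own illustration of the point):

```python
# Same toy 5-and-10, but the data-collection policy is epsilon-greedy, so every
# action occasionally gets tried and its prediction gets checked against reality.

import random

estimates = {"five": 5.0, "ten": 0.0}

for _ in range(1000):
    if random.random() < 0.1:                       # explore
        action = random.choice(list(estimates))
    else:                                           # exploit
        action = max(estimates, key=estimates.get)
    reward = {"five": 5.0, "ten": 10.0}[action]
    estimates[action] += 0.1 * (reward - estimates[action])

print(estimates)  # the estimate for "ten" now converges toward 10.0
```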

Anyway, the second paper (the DeepMind one) points out another issue related to where the E[utility | action] model comes from. Roughly, the story — which I think is very well described in Section 2 — seems to be the following: the E[utility | action] model is trained on the actions of an expert who observes X and acts on that fact by choosing A=X; then the E[utility | action] model won't work for a non-expert agent, i.e., one who doesn't observe X. I view this as a distributional shift issue — you train a model (the a ---> E[utility | a] one) in a setting where A=X, and then you apply it in a setting where sometimes A and X are uncorrelated.

It’s also similar to the Smoking Lesion/medical Newcomb-like problems! Consider the following medical Newcomb-like problem: First we learn the fact that sick people go to the doctor and healthy people don’t go to the doctor. Then without looking at how healthy I am, I don’t go to the doctor so as to gain evidence that I am healthy. Arguably what goes wrong here is also that I’m using a rule for prediction out of distribution on someone who doesn’t look at whether they’re sick. I think it relates to one of the least challenging versions of medical Newcomb-like problems and it’s handled comfortably by the so-called tickle defense.

Interlude: The paper talks about how this relates to hallucination in LLMs. So what's that about? IIUC, the idea is that when generating text, LLMs incorrectly update based on the text they generate themselves. For example, imagine that you want an LLM to generate ten tokens. Then after generating the first nine tokens x1, ..., x9, it will predict the tenth token from its learned distribution P(x10 | x1, ..., x9). But this distribution was trained on fully human-written, not LLM-written, text. So (in my way of thinking), P(x10 | x1, ..., x9) might do poorly (i.e., not give a human-like continuation of x1, ..., x9), because it was trained on seeing nine tokens created by a human and having to predict a continuation by a human, rather than nine tokens by itself/an LLM and having to predict a continuation by a human. For example, we might imagine that if x1, ..., x9 are words that only a human expert confident in a particular claim C would say, then the LLM will predict continuations that confidently defend claim C, even if the LLM doesn't know anything about C. I'm not sure I really buy this explanation of hallucination. I think the claim would need more evidence than the authors provide. But it's definitely a very interesting point.

Now, back to the original toy model. Again, I would view this as a distribution shift problem. If we make some assumptions, though, we can infer/guess a model (i.e. function a ---> E[utility | a]) that predicts the utility obtained by a non-expert, i.e., an agent who doesn't observe X. Specifically, let’s assume that we are told the conditional distributions P(utility | X=1, A=0) and P(utility | X=0, A=1) (which we never see in training if the agent in training always knows and acts on X). Let’s also assume that we know that the difference between the training distribution and the new setting is that in the new setting the agent chooses A independently of X. Then in the new model we just need to make X and A independent and change nothing else. Formally you use the new distribution P’(X,U|A) = P(X)P(U|A,X), where the Ps on the right-hand side are just the old distribution, instead of P(X,U|A) = P(X|A)P(U|A,X).

It turns out that if we put the original distribution into a causal graph with edges X->A and A->U and X->U and then make a do-intervention on A (a la Pearl), then we get this exact distribution, i.e., P(X,U|do(A)) = P’(X,U|A). (Intuitively, removing the inference from A to X is exactly what the do(A) does if A's parent is X.) So in particular maximizing E[U | do(A)] gives the same result as maximizing E’[U|A]. Anyway, the paper uses the do operator to construct the new predictor, rather than the above argument. They seem to claim that the causal structure (or reasoning about causality) is necessary to construct the new predictor, with which I disagree.
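Here is a small numerical check of that equivalence, with made-up numbers for P(X) and P(U=1 | A, X) (including the two cells never seen in training, which I simply assume are given):

```python
# Toy check that P'(X,U|A) = P(X) P(U|A,X) gives the same expected utilities as
# P(X,U|do(A)) for the graph X -> A, X -> U, A -> U, and how both differ from
# naive conditioning under the training distribution where the expert sets A = X.

p_x = {0: 0.5, 1: 0.5}                      # P(X)
p_a_given_x = {0: {0: 1.0, 1: 0.0},         # training expert: A = X
               1: {0: 0.0, 1: 1.0}}
p_u1_given_ax = {(0, 0): 0.9, (0, 1): 0.2,  # P(U=1 | A, X), assumed given for
                 (1, 0): 0.1, (1, 1): 0.8}  # all four cells

def e_u_conditional(a):
    # E[U | A=a] under the training distribution: conditioning on A reveals X.
    joint = {x: p_x[x] * p_a_given_x[x][a] for x in p_x}
    z = sum(joint.values())
    return sum(joint[x] / z * p_u1_given_ax[(a, x)] for x in p_x)

def e_u_do(a):
    # E[U | do(A=a)] = E'[U | A=a] = sum_x P(x) P(U=1 | a, x): X independent of A.
    return sum(p_x[x] * p_u1_given_ax[(a, x)] for x in p_x)

for a in (0, 1):
    print(a, round(e_u_conditional(a), 3), round(e_u_do(a), 3))
# Naive conditioning says 0.9 and 0.8 (both actions look expert-like);
# the decorrelated/interventional values are 0.55 and 0.45.
```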

Is this really CDT? I’m not sure… In the above type of case, this doesn’t come apart from EDT. If we buy that their scenario is a bit like a Smoking Lesion, then one could argue that part of the point of CDT is to solve this type of scenario. (In some sense my response is as in most versions of the Smoking Lesion: Because of the tickle defense, EDT applied properly gets this right anyway, so there’s actually nothing to fix here.) In my view it’s basically just about using the do-calculus to concisely specify the scenario P’ (based on P plus a particular causal graph for P). It seems that one can do these things without being committed to using do(A) in a scenario where there’s some non-causal dependence between A and U (that doesn't disappear outside of training), perhaps via some common cause Y. In any case, the paper doesn’t tell us how to distinguish between U <- Y -> A and A -> Y -> U — all causal relationships are assumed. So while nominally they construct their predictor as E[U | do(A)], it’s a bit unclear how wedded they are to CDT.

Anyway, with a (maybe-causalist) E[U | do(A)] in hand, we can of course build a (maybe-)CDT agent by choosing a to maximize E[U | do(A)]. But I think the paper doesn’t say anything about where to get the causal model from that gives us E[U | do(A)]. They pretty much assume that the model is provided.

I think the “counterfactual teaching” stuff doesn't really say anything about CDT versus EDT, either. IIUC the basic idea is this. Imagine you want to train an LLM and you want to prevent the issue above. Then intuitively — in my distribution shift view — what we need to do is just train the LLM to make a good prediction of x10 upon observing tokens x1, ..., x9 that were generated by itself (rather than humans). The simplest, most obvious way to do this is to let the LLM generate some tokens x1, ..., x9, then get a probabilistic prediction about the next token from the LLM and then ask a human to give a next token x10. The loss of the LLM is just the, e.g., log loss of its prediction against the x10 provided by the human. One slightly tricky point here is that we only train the LLM to make good predictions on x10. We don't want to train it to output x1, ..., x9 that make x10 easier to predict. So we need to be careful to choose the right gradient. I think that's basically all they're doing, though. It doesn't seem like there's anything causalist here.
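Here is a minimal PyTorch-flavoured sketch of my reading of that training signal (my own toy code with a placeholder model, not the paper's setup): the model generates a prefix itself, a human supplies the next token, and the loss only trains the prediction of that human token.

```python
import torch
import torch.nn as nn

vocab_size, dim, prefix_len = 100, 32, 9

# Placeholder "LLM": an embedding plus a linear head over the flattened prefix.
embed = nn.Embedding(vocab_size, dim)
head = nn.Linear(dim * prefix_len, vocab_size)
opt = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()), lr=1e-3)

def counterfactual_teaching_step(human_next_token: int) -> float:
    # 1) The model generates its own prefix x1..x9. The tokens are discrete and
    #    sampled under no_grad, so no gradient trains the choice of prefix.
    with torch.no_grad():
        prefix = torch.randint(0, vocab_size, (1, prefix_len))  # stand-in for model sampling

    # 2) Predict the next token given the self-generated prefix.
    logits = head(embed(prefix).flatten(start_dim=1))

    # 3) Log loss against the token x10 that a human would write next.
    target = torch.tensor([human_next_token])
    loss = nn.functional.cross_entropy(logits, target)

    # 4) Gradients only improve the prediction of x10 given the prefix.
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(counterfactual_teaching_step(human_next_token=42))
```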

So, in conclusion: While very interesting, I don't think these papers tell us anything new about how to build an EDT or a CDT agent.