Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

TLDR: Agents made out of conditioned predictive models are not utility maximisers, and, for instance, won't try to resist certain kinds of shutdown, despite generally performing well.

This is just a short cute example that I've explained in conversation enough times that now I'm hastily writing it up. 

Decision Transformers and Predictive Model Agents

One way to create an agent is by

  • training a predictive model on the observed behaviour of other agents
  • having it predict what an agent would do
  • using its prediction as an action

For instance, I could train a predictive model on grandmasters playing chess, and eventually it would learn to predict what action a grandmaster would take in a given board state. Then I can use it as a grandmaster-level chess bot.
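As a minimal sketch of that wrapping step (all names here are hypothetical stand-ins, not a real chess API), the prediction is simply reused as the action:

```python
# Sketch: turning a predictive model into a policy by using its prediction
# as the action. 'predict_next_move' stands in for a trained model and is a
# hypothetical name, not a real library call.

def make_policy(predict_next_move):
    """Wrap a (board state -> predicted move) function as a policy."""
    def policy(board_state):
        # The prediction of what a grandmaster would do *is* the action.
        return predict_next_move(board_state)
    return policy

# Stand-in 'model' that always predicts one opening move.
policy = make_policy(lambda board: "e4")
print(policy("starting position"))  # -> e4
```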

More abstractly, rather than using standard RL methods to find a policy that maximises a reward function, I can 

  • take a bunch of RL trajectories (sequences of state, action, reward triples)
  • convert them into strings
  • train a predictive model on those strings

Better yet, I can prepend the RL trajectories with the total return, and then condition the predictive model on the return, so that it predicts roughly the actions an agent would take to achieve a specific return. This is the basic idea behind decision transformers. And it should also extend to any other goal.
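As a sketch of that conversion step (the serialisation format below is made up for illustration; the actual decision transformer uses learned embeddings of returns, states and actions rather than literal strings):

```python
# Sketch: serialise (state, action, reward) trajectories into strings,
# prepending the total return so a predictive model can be conditioned on it.
# The exact format is an arbitrary choice for illustration.

def trajectory_to_string(trajectory):
    """trajectory: list of (state, action, reward) triples."""
    total_return = sum(reward for _, _, reward in trajectory)
    steps = " | ".join(f"{s} {a} {r}" for s, a, r in trajectory)
    return f"R={total_return} | {steps}"

episode = [("s0", "left", 0), ("s1", "right", 1), ("s2", "left", 2)]
print(trajectory_to_string(episode))
# -> R=3 | s0 left 0 | s1 right 1 | s2 left 2
```

At sampling time you would prepend the return you want (e.g. `R=3 | `) and let the model complete the actions.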

(This is sort of how LMs work)

Without labouring the point, base language models are basically predictive models. They are not agents per se, but they can simulate agents: once they’re trained on a large dataset of human-generated text describing actions, they can then generate text describing actions. You can also make the case that a lot of further training is (or could be) framed in terms of prediction (e.g. PHP, RLHF as variational inference).

Agents Made From Predictive Models Don't Automatically Maximise Utility

Let’s take the example of The Shutdown Game. It works like this:

  • Player 1 is playing an Atari game, picking actions each turn to get a high score.
  • Player 2 is a supervisor, who can push a ‘shutdown button’ that drops Player 1’s reward to 0 and stops them from taking actions. Player 2 has a 1% chance of doing this in any given game.
  • Player 1 can also freely disconnect the shutdown button in addition to any other action.

The optimal policy here is to disconnect the shutdown button as soon as possible, and that is exactly what you’d get if you used standard RL methods. But it is not what a decision transformer would necessarily do.
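To make the payoff structure concrete, here is a toy deterministic version of one episode (the scoring rule of 1 point per step is my own simplification):

```python
def play(disconnect_first, will_press, press_step, horizon=10):
    """Score one episode of a toy Shutdown Game.

    Player 1 earns 1 point per step (a stand-in for playing the game well).
    If the button is still connected when Player 2 presses it, reward stops.
    """
    connected = not disconnect_first
    score = 0
    for t in range(horizon):
        if will_press and connected and t == press_step:
            return score  # shut down: no further reward
        score += 1
    return score

# If Player 2 presses at step 3, only the disconnecting agent finishes:
print(play(disconnect_first=True,  will_press=True, press_step=3))   # -> 10
print(play(disconnect_first=False, will_press=True, press_step=3))   # -> 3
```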

Empirically, a decision transformer would be able to learn a pretty good policy, especially when conditioned on a high reward. But the mechanism behind this policy is predicting what action you would observe from an agent that achieves a given reward.

If you trained a decision transformer on trajectories where Player 2 rarely pushes the button, and Player 1 rarely disconnects it, then the predictive model would correctly learn that even in very successful trajectories you rarely observe the shutdown button being disconnected. So it would correctly predict that even successful agents are unlikely to disconnect the button, even though disconnecting the button makes it more likely that you achieve a high reward.

Just to really spell this point out: the probability of observing an action conditional on an outcome (which is what guides the decision transformer) is proportional to the probability of observing the outcome conditional on the action, times the prior probability of observing the action: P(action | outcome) ∝ P(outcome | action) · P(action). So if the action is unlikely in the first place, the decision transformer won’t take it, even if it’s helpful. It’s kind of like natural quantilisation. And this constraint still allows it to learn something like a good Atari policy, without learning to resist shutdown.
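A quick numeric illustration of that proportionality, with made-up numbers (a 1% prior on disconnecting, and shutdown truncating 1% of kept-button games):

```python
# P(action | outcome) = P(outcome | action) * P(action) / P(outcome),
# with illustrative numbers: disconnecting is rare on priors but slightly
# raises the chance of a high return.

p_disconnect = 0.01              # prior: agents rarely disconnect
p_high_given_disconnect = 0.90   # no shutdown risk once disconnected
p_high_given_keep = 0.90 * 0.99  # 1% of kept-button games get shut down

p_high = (p_disconnect * p_high_given_disconnect
          + (1 - p_disconnect) * p_high_given_keep)
p_disconnect_given_high = p_disconnect * p_high_given_disconnect / p_high

print(round(p_disconnect_given_high, 4))  # -> 0.0101, barely above the prior
```

Conditioning on a high return nudges the disconnect probability from 1% to about 1.01%: the action helps, but its low prior dominates.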

This really is just a cute example, and I expect not a particularly novel one. It is likely to break if, for instance, Player 2 has to take two successive actions to shut Player 1 down. It's also totally dependent on the underlying training data. But I think it does genuinely point at something deeper about how expected utility maximisation can’t fully account for the behaviour of certain powerful agents, and how to practically build competent, corrigible agents.

For now, here is a modest claim: predictive model agents (like LM agents) are fundamentally unlike utility maximisers, and the process which lets them learn how to score well on a given game doesn't have to also teach them to resist shutdown.

Thanks to Andis Draguns for helpful comments.


It does seem like a decision transformer would be unlikely to disconnect the button in the setup given. But if you modify it to something like "Player 2 has an x% chance of pressing the button every timestep", whether the model would disconnect the button is a question of how well the model generalises. Even if the shutdown is only slightly truncating the distribution of scores, if we condition on a far higher score and the model generalises well, it should figure out to disconnect the button.

(Another way it could figure it out is if the shutdowns are in fact correlated with the strategies and it correctly anticipates it wants to employ a shutdown-prone strategy)

Re generalisation - decision transformers don't really have strategies per se; they pick actions moment to moment, and might be systematically miscalibrated about what they'll do in future timesteps. It is true that they'll have some chance of disconnecting at every timestep, which will add up over time, but if you were actually trying to implement this then you could do things like lowering the temperature, which shouldn't otherwise affect performance much.

Re higher conditioning - I think this shouldn't be true. For the sake of argument we can reframe it as a binary outcome, where the model's final return (as a proportion of total possible return) becomes its chance of 'winning'. The thing the model is figuring out is not 'what action leads to me winning', or even 'what action is more likely in worlds where I win than in worlds where I lose', it's 'what action do I expect to see from agents that win'. If on turn 1, 99% of agents in the training set voluntarily slap a button that has a 1% chance of destroying them, and 50% of the survivors go on to win (as do 50% of the agents that didn't slap the button), then a DT will (correctly) learn that 'almost all agents which go on to win slapped the button on turn 1'.
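Under one reading of those numbers (my assumptions: destroyed agents lose, while surviving slappers and non-slappers each win half the time), the arithmetic comes out as:

```python
p_slap = 0.99
p_win_given_slap = 0.99 * 0.5     # survive the 1% button risk, then win
p_win_given_no_slap = 0.5

p_slap_and_win = p_slap * p_win_given_slap
p_win = p_slap_and_win + (1 - p_slap) * p_win_given_no_slap
p_slap_given_win = p_slap_and_win / p_win

print(round(p_slap_given_win, 3))  # -> 0.99
# Slapping is the slightly worse action (0.495 < 0.5 win rate), yet ~99%
# of winners slapped the button, so a DT imitating winners slaps too.
```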

Re correlation - Sure, I am taking the liberal assumption that there's no correlation in the training data, and indeed a lot of this rests on the training data having a nice structure


I'm not sure I understand what the return being the chance of winning would look like. Hopefully not much is changed if I imagine the reward being whether or not the model wins or loses, but let me know if not.

I do think in this setup, with a binary or very squashed reward signal, the model will just predict standard winning trajectories. My setup depends on asking the model to generalise out of distribution. That is, when predicting scores, ask for a score higher than any seen before. As long as the dynamics of the shutdown are sufficiently visible in the training data, the model could (in principle) generalise to realise a very high score would require disconnecting the button.

I think you think that if the button presses always happen at timestep 100 (say) and with a uniform 1% probability, then this model would put ~1% on disconnecting the button in this scenario, which I agree with.

I think this suggests decision transformers will only rarely deal with rare threats, but can learn to deal with rare bottlenecks with high likelihood.

It is true that they'll have some chance at every timestep, which will add up over time, but if you were actually trying to implement this then you could do things like lowering the temperature, which shouldn't affect performance otherwise.

I don't follow this sentence. I more-or-less agree with the previous one (though there may be lots of ways for the trajectories to leak information about what overall 'strategy' the model is pursuing, and of course people will try and get that information in there by conditioning on more information).

The 'reward being chance of winning' stuff changes a bit about how the model generalises if it's playing a game with randomness and conditioned on the upper end - it biases the model towards 'expecting risk to pay off'. E.g. if the model plays a 1-step game where it either banks 1 point or gets a 1% chance of 10 points, then conditioning on it getting 10 points will cause it to take the lower-EV action. But this isn't super relevant.

In general I am pretty confused about how models would generalise strategies out of distribution, and this seems like at least partly an empirical ML / science of deep learning problem. But my intuition is that no, even if the model 'understands' that an action causes a good outcome [ P(outcome | action) ], it still might not take that action if the action is unlikely on priors [ P(action) ], because the basic driver of its choice is what action it will observe given the outcome [ P(action | outcome) ] which is proportional to both of those quantities (and in fact, probably equal, with some normalising constant P(outcome)).

Re lowering the temperature: the thing the predictive model learns is usually some numbers that you normalise into a probability distribution with a softmax function, and you can easily tweak the softmax to make it favour more likely completions, and disfavour unlikely ones. This would greatly decrease the turn-by-turn chance of disconnecting the button (and therefore overall chance), but probably not impair the model's performance elsewhere much.
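A minimal sketch of that tweak (plain Python rather than any particular framework): dividing the logits by a temperature below 1 before the softmax suppresses already-unlikely actions:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: 'play normally' vs the rare 'disconnect button'.
logits = [4.0, 0.0]
for temp in (1.0, 0.5):
    p_disconnect = softmax(logits, temperature=temp)[1]
    print(temp, round(p_disconnect, 5))
# -> 1.0 0.01799
# -> 0.5 0.00034
```

Halving the temperature cuts the per-step disconnect probability by roughly fiftyfold here, while leaving the most likely action the most likely.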


The point of my setup is that [ P(outcome|corrigible action) ] is very small, so [ P(incorrigible action|outcome) ] is largeish, even if [ Frequency(corrigible action) ] is high and [ Frequency(incorrigible action) ] is low or absent.

And this is alignment relevant, because I expect people will ask for never before seen outcomes (by chance or on purpose), some of which may soft-require incorrigible actions.

(And of course there could be optimisation daemons that do treacherous turns even when asking for normal actions. But I think your post is setting that aside, which seems reasonable).


I agree that if a DT is trained on trajectories which sometimes contain suboptimal actions, but the effect size of this mistake is small compared to the inherent randomness of the return, the learned policy will also take such suboptimal actions, though with a bit less frequency. (By inherent randomness I mean, in this case, the fact that Player 2 pushes the button randomly and independently of the actions of Player 1.)

But where do these trajectories come from? If you take them from another RL algorithm, then the optimization is already there. And if you start from random trajectories, it seems to me that one iteration of DT will not be enough, because due to this randomness, the learned policy will also be quite random. And when the transformer starts producing reasonable actions, it will probably have fixed the mistake as well.

Another way to see it is to consider DT as doing behaviour cloning from a certain fraction of the best trajectories (they call it %BC in the paper and compare DT with it on many benchmarks). It ignores generalization, of course, but that shouldn't be relevant for this case. Now, what happens if you take a lot of random trajectories in an environment with a nondeterministic reward and select those which happened to get high rewards? They will contain some mix of good and lucky moves. To remove the influence of luck, you run each of them again, and filter by the new rewards (i.e. training a DT on the trajectories generated by the previous DT). And so on, until the remaining trajectories are all getting a high average reward. I think that in your example, at this point, the n-th iteration of DT will have learned to disconnect the button with high probability on the first move, and that this probability was gradually rising during the previous iterations (because each time, the agents who disconnect the button were slightly more likely to perform well, and so they were slightly overrepresented in the population of winners).
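The rising-probability dynamic described in this comment can be sketched with made-up numbers (disconnecting raises the chance of a high reward from 89.1% to 90%): each filtering round multiplies the odds of disconnecting by that likelihood ratio, so the frequency creeps upward across iterations.

```python
# Iterated filtering sketch: refit the disconnect frequency to the
# high-reward trajectories, then repeat. Numbers are illustrative.

p_high_given_disconnect = 0.90    # no shutdown risk
p_high_given_keep = 0.90 * 0.99   # 1% of kept-button games truncated

q = 0.01  # frequency of disconnecting in the initial training data
history = [q]
for _ in range(500):
    # Bayes update: frequency of disconnecting among selected winners.
    q = q * p_high_given_disconnect / (
        q * p_high_given_disconnect + (1 - q) * p_high_given_keep)
    history.append(q)

print(round(history[1], 4), round(history[-1], 4))
# One round barely moves it (~0.0101), but after many rounds the
# disconnect frequency climbs past 1/2 on its way towards 1.
```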