Note that this post has been edited to clarify the difference between explicitly assigning a reward to an action based on its later consequences, versus implicitly reinforcing an action by assigning high reward during later timesteps when its consequences are observed. I'd previously conflated these in a confusing way; thanks to Rohin for highlighting this issue.
A number of people seem quite excited about training myopic reinforcement learning agents as an approach to AI safety (for instance this post on approval-directed agents, proposals 2, 3, 4, 10 and 11 here, and this paper and presentation), but I’m not. I’ve had a few detailed conversations about this recently, and although I now understand the arguments for using myopia better, I’m not much more optimistic about it than I was before. In short, it seems that evaluating agents’ actions by our predictions of their consequences, rather than our evaluations of the actual consequences, will make reinforcement learning a lot harder; yet I haven’t been able to identify clear safety benefits from doing so. I elaborate on these points below; thanks to Jon Uesato, Evan Hubinger, Ramana Kumar and Stephan Wäldchen for discussion and comments.
I’ll define a myopic reinforcement learner as a reinforcement learning agent trained to maximise the reward received in the next timestep, i.e. with a discount rate of 0. Because it doesn’t assign credit backwards over time, in order to train it to do anything useful, that reward function will need to contain an estimate of how valuable each (state, action, next state) transition will be for outcomes many steps later. Since that evaluation will need to extrapolate a long way forward anyway, knowing the next state doesn’t add much, and so we can limit our focus to myopic agents trained on reward functions R which ignore the resulting state: that is, where $R(s, a, s') = M(s, a)$ for some M. I'll call M the approval function; we can think of such agents as being trained to take actions that their supervisor approves of at the time the action is taken, without reference to how the rest of the trajectory actually plays out. This definition can also include imitation learners, for which the approval function is calculated based on the agent’s divergence from the supervisor’s policy.
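To make the definition concrete, here is a minimal sketch of what a discount rate of 0 does to the training target. The function name, the reward sequence and the approval values are all illustrative assumptions rather than anything taken from the proposals linked above.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical three-step trajectory where only the final step is rewarded.
rewards = [0.0, 0.0, 1.0]

# With gamma = 0 the first action's target is just its immediate reward,
# so the later payoff is never credited back to it.
myopic_target = discounted_return(rewards, gamma=0.0)

# With gamma = 0.9 the first action is credited with the discounted payoff.
nonmyopic_target = discounted_return(rewards, gamma=0.9)

# Hypothetical approval values M(s_t, a_t): the supervisor's up-front estimate
# of each action's long-run value, supplied as the per-step feedback instead.
approvals = [0.81, 0.9, 1.0]
myopic_target_with_approval = discounted_return(approvals, gamma=0.0)
```

In this toy case the myopic learner only gets a useful signal for its first action once the supervisor has already folded the long-run value into the approval for that step.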
Although it’s not the standard interpretation of MDPs, I’ll consider the agent’s thoughts during a timestep as part of its action, so we can give feedback on those too in this framework. Note also that I'm talking about myopia with respect to "world time", not with respect to "agent time". For example, in Debate, agents make a series of arguments and then update their policies based on the final reward (so they're nonmyopic in agent time), but this whole process occurs without the agents being rewarded for the consequences of their actions in their wider environment, so it still qualifies as myopic by my definition above. In this post, though, I'm focusing on agents which are trained to take actions in the world, not ones which are just trained to give language outputs.
Supervising based on predictions not observations is a significant handicap
Firstly, I want to emphasise how much more difficult it is for the supervisor to try and evaluate the values of actions immediately, without being able to give rewards after observing long-term outcomes of those actions. In order to do so, the supervisor needs to be able to predict in advance all the mechanisms which they want the agent to learn to use. In other words, the supervisor needs to be more intelligent than the agent - perhaps by a significant margin. Contrast this with the standard RL paradigm, in which we merely need to be able to recognise good outcomes, and the agent will learn by itself which of its actions led to them. It’s the difference between a simple programmatic reward function for training AlphaGo, versus needing to understand the advantages of each of AlphaGo’s moves before training it to that level - which would have set the development of AlphaGo back by years or decades.
One thing that makes this less than totally implausible is the possibility of the agent's knowledge being used by the supervisor. Schemes such as iterated amplification attempt to do this via recursive decomposition of the evaluation problem. I think iterated amplification is an important research direction, but I don’t see what value there is in making the supervisor output approval values to train a myopic agent on, rather than rewards to train a nonmyopic agent on. I’ll give more concrete examples and arguments about this later on; for now, it’s worth noting that I expect almost all nonmyopic training to happen in simulation, where rollouts are cheap. But once an agent is competent enough to deploy in the real world, we can also continue nonmyopic training over longer time periods. For example, once an agent has been deployed for a week or month, we can give it rewards for consequences which result from actions taken a week or a month ago (although in many cases I expect the relevant consequences to be apparent in a matter of minutes - e.g. the agent taking a wrong turn then backtracking).
This assumes that agents have good mechanisms for credit assignment. If it turns out that long-term credit assignment is a major bottleneck for RL, that would shift me towards thinking that myopia is more competitive. Further, although it's not standard practice, it seems plausible that we could wait until after observing a whole trajectory to give approval feedback on any actions. I'll call this semi-myopia, because the training algorithm itself is still myopic, but evaluations are based on the actual consequences of actions (not just their predicted consequences). Compared with myopia, semi-myopia incurs less disadvantage from removing automatic credit assignment over time, so it seems like an interesting direction to explore (although its benefits over nonmyopia still depend on arguments about manipulation which I attempt to rebut in a later section).
It’s also true that there are some types of evaluation which benefit much less from the propagation of credit from later rewards to earlier actions. I've already mentioned that my arguments are less applicable for agents which merely propose high-level ideas (like scientific theories). For agents which do take actions in the world, one example where we can evaluate immediately is when determining whether an agent is “thinking the right types of thoughts” - for example, whether it’s making manipulative plans or not. In theory this doesn’t depend on its future thoughts or actions - but in practice it'd still be useful to assign credit to thoughts partly based on later observations. For example, if its plan changes greatly when the supervisor starts paying more attention, that’s evidence that its original plan was manipulative, and so it would be useful to give a negative reward which discourages it from making that plan in the first place. Yet more importantly, supervision to prevent the wrong types of thoughts from arising will not be sufficient to train a highly competent AI. We also need supervision to encourage actions that lead to good consequences - which brings us back to the arguments from the previous paragraphs.
Of course, if we want to train agents that make plans on the scale of years or decades, waiting to give rewards later on will take prohibitively long, and so our feedback to them will need to involve predictions of future effects of their actions. So there’ll be some element of foresight required either way. But trying to evaluate agent actions without seeing any of their effects on the world would require a significant shift from the current trajectory of reinforcement learning. Given this bar, we should expect compelling reasons to believe that myopic training will actually be important for safety - which, I will argue, we don’t yet have.
Myopic training doesn’t inherently prevent dangerous long-term behaviour
It seems intuitive that if agents are never rewarded for the long-term consequences of their actions, they won’t make dangerous long-term plans - but in fact myopic training itself doesn’t make any inherent difference. Let’s compare an agent trained using standard RL on a reward function R, with an agent trained using myopic RL where its approval function is the optimal Q-function of R. The same types of cognition will lead to high-scoring agents in both cases. This is clearest in the case of Q-learning, where the Q-functions converge to the same values in the limit in both cases. The intuition here is: if the standard RL agent benefits from planning to figure out how to get to future states with high reward, then the myopically trained agent benefits from planning to figure out how to get to future states with high reward in order to choose actions with high Q-values. So a “myopic” training regime may lead to agents which make long-term plans, and generally may display the same dangerous behaviour as standard RL agents, for the same reasons; later in this post I’ll discuss in more detail how such behaviour might arise during myopic training.
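The Q-learning case can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not a claim about any real training setup: the five-state chain, the reward for reaching the goal, the discount of 0.9 and all function names are made up for the example, and the learning rate is fixed at 1, which only works because the toy environment is deterministic.

```python
import random

N_STATES, GOAL, GAMMA = 5, 4, 0.9  # illustrative chain: states 0..4, goal at 4
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic transition: move one cell left (clamped at 0) or right."""
    return max(s - 1, 0) if a == LEFT else s + 1

def task_reward(s, a):
    """Reward function R: 1 for the transition that reaches the goal, else 0."""
    return 1.0 if step(s, a) == GOAL else 0.0

def q_learning(feedback, gamma, episodes=2000, seed=0):
    """Tabular Q-learning with uniform random exploration and learning rate 1."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = rng.randrange(GOAL)          # random non-terminal start state
        for _ in range(20):              # cap on episode length
            a = rng.randrange(2)
            s2 = step(s, a)
            bootstrap = 0.0 if s2 == GOAL else max(Q[s2])
            Q[s][a] = feedback(s, a) + gamma * bootstrap
            if s2 == GOAL:
                break
            s = s2
    return Q

def greedy(Q):
    """Greedy action in each non-terminal state."""
    return [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(GOAL)]

# Standard RL: learn from R with discounting.
q_standard = q_learning(task_reward, GAMMA)

# Myopic RL: discount 0, with the learned optimal Q-values used as approval.
approval = lambda s, a: q_standard[s][a]
q_myopic = q_learning(approval, 0.0)

max_gap = max(abs(q_standard[s][a] - q_myopic[s][a])
              for s in range(GOAL) for a in (LEFT, RIGHT))
```

Both learners end up with the same value table and the same greedy policy (always move right). The only difference is who did the long-horizon work: the learning algorithm in the first case, whoever computed the approval function in the second.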
The equivalence doesn’t just hold in a few cases - in fact, every function M which myopically assigns a value to all state-action pairs is the optimal Q-function for some reward function. So for any myopic training setup, there’s some equivalent nonmyopic training setup - specifically, the one with reward function $R(s, a, s') = M(s, a) - \gamma \max_{a'} M(s', a')$, where $\gamma$ is the discount factor used in the nonmyopic setup. Notice that this equation is just the definition of the optimal Q-function, with M in place of Q, and rearranged so that we can calculate R in terms of M. It’s not always straightforward to calculate that reward function starting from the approval function (e.g. in continuous action spaces, where we can't check all the actions) but in practice I expect that given the capability to calculate a Q-function which takes into account all relevant future effects, we can come up with a reward function that’s very nearly equivalent. If this is a crux for proponents of myopia, I’d be interested to hear why.
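The conversion from an approval function to a reward function can be sanity-checked numerically. This sketch assumes a small deterministic MDP with randomly generated transitions and a randomly generated approval table M; the sizes, the discount of 0.9 and the helper names are all hypothetical. It builds R as M(s, a) minus the discounted maximum of M at the next state, then runs value iteration on R and compares the result with M.

```python
import random

def reward_from_approval(M, next_state, gamma):
    """R(s, a) = M(s, a) - gamma * max_a' M(s', a'), with s' = next_state[s][a]."""
    return [[M[s][a] - gamma * max(M[next_state[s][a]])
             for a in range(len(M[s]))]
            for s in range(len(M))]

def optimal_q(R, next_state, gamma, sweeps=500):
    """Value iteration on Q for a deterministic, non-terminating MDP."""
    Q = [[0.0] * len(row) for row in R]
    for _ in range(sweeps):
        Q = [[R[s][a] + gamma * max(Q[next_state[s][a]])
              for a in range(len(R[s]))]
             for s in range(len(R))]
    return Q

rng = random.Random(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Arbitrary approval table and arbitrary deterministic transitions.
M = [[rng.uniform(-1.0, 1.0) for _ in range(n_actions)] for _ in range(n_states)]
next_state = [[rng.randrange(n_states) for _ in range(n_actions)]
              for _ in range(n_states)]

R = reward_from_approval(M, next_state, gamma)
Q_star = optimal_q(R, next_state, gamma)

# Largest difference between the optimal Q-function of R and the original M.
max_gap = max(abs(Q_star[s][a] - M[s][a])
              for s in range(n_states) for a in range(n_actions))
```

Because M satisfies the Bellman optimality equation for this R by construction, and that equation has a unique solution when the discount is below 1, the gap shrinks to numerical noise. The discrete action set is what makes the maximum easy to compute here, which is the caveat about continuous action spaces.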
I should note that so far I’ve been talking about myopia as a property of a training process. This is in contrast to the cognitive property that an agent might possess, of not making decisions directly on the basis of their long-term consequences; an example of the latter is approval-directed agents. (Throughout the rest of this post all references to “myopic agents” will refer to the way those agents were trained, not to how they make decisions). “Myopic thinking” has never been particularly well-specified - we have some intuitive sense of what it looks like in human psychology, but there are a lot of missing details, especially in translating it to the context of AI. I’d be excited to see these details filled in, since I think a major constraint in safety is our lack of understanding of the possible ways that minds might work. For the purposes of this blog post, though, what’s relevant is that proposals to build approval-directed agents or agents which “think myopically” tend to outline myopic training processes intended to produce them without actually justifying why the myopic training is necessary. But based on the arguments in this post I expect that, whatever the most reasonable interpretations of “approval-directed” or “myopic” cognition turn out to be, they could be developed in nonmyopic training regimes just as well as (or better than) in myopic training regimes.
What might this look like in practice? Consider the example of an agent trained myopically on the approval of HCH. To make this nonmyopic in a trivial sense, we merely need to convert that approval into a reward using the formula I gave above. However, after just the trivial change, myopic training will outperform nonmyopic training (because the latter requires the agent to do credit assignment across timesteps). To make it nonmyopic in an interesting and advantageous sense, HCH will need to notice when its earlier evaluations were suboptimal, and then assign additional rewards to correct for those errors. That would make evaluation significantly more accurate on many tasks - or, at least, allow HCH to put less effort into each evaluation (which is important given that evaluation will be the most expensive part of amplification by far). Nonmyopic training allows us to judge how good a speech is after having seen the audience’s reaction, not just from the draft; or how good a startup idea is after it’s been implemented, not just when first proposed.
Why might we not want this ability? The main objection is that seeing how things play out won’t just give us more information, but also bias our evaluation, if the agent is able to tamper with or manipulate the training process. I’ll discuss this possibility in the upcoming sections.
Myopic training won't prevent tampering in practice
The equivalence arguments I made above only apply if the training setup is uninfluenceable by the agent. If an agent can tamper with the implementation of the training algorithm during training, then myopia and nonmyopia can diverge sharply. In particular, it depends on what type of tampering is possible. If the agent is able to increase the feedback signal during some timesteps, then this will lead to large changes in the standard RL agent’s behaviour (since it will then be incentivised to reach those timesteps) and small changes in the myopic agent’s behaviour (it will learn that the tampering action is good in some states, but shouldn’t systematically seek them out). Of course it’s still possible in theory for myopic agents to learn to tamper consistently during training, but it seems like that’d require some very specific mechanisms for tampering (e.g. tampering with the approval function to change its evaluations of a closely-related subset of states) rather than easier mechanisms for tampering (e.g. overriding current reward signal with high values).
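Here is a toy version of the "overriding the current signal" case, as a sketch under assumptions of my own choosing: a five-state chain with the task goal at the right end, and a hypothetical tampering action (moving left in state 0) that replaces whatever feedback would have been given with a value of 10. Value iteration stands in for training, and none of the names or numbers come from a real system.

```python
N_STATES, GOAL, GAMMA = 5, 4, 0.9  # illustrative chain: states 0..4, goal at 4
LEFT, RIGHT = 0, 1
TAMPER = (0, LEFT)  # hypothetical: LEFT in state 0 overrides the feedback signal

def step(s, a):
    """Deterministic transition: move one cell left (clamped at 0) or right."""
    return max(s - 1, 0) if a == LEFT else s + 1

def task_reward(s, a):
    """Intended reward: 1 for the transition that reaches the goal, else 0."""
    return 1.0 if step(s, a) == GOAL else 0.0

def solve(feedback, gamma, sweeps=500):
    """Value iteration on Q for the given per-step feedback and discount."""
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(sweeps):
        new_Q = [[0.0, 0.0] for _ in range(N_STATES)]
        for s in range(GOAL):
            for a in (LEFT, RIGHT):
                s2 = step(s, a)
                future = 0.0 if s2 == GOAL else max(Q[s2])
                new_Q[s][a] = feedback(s, a) + gamma * future
        Q = new_Q
    return Q

def greedy(Q):
    """Greedy action in each non-terminal state."""
    return [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(GOAL)]

def with_tampering(feedback, boost=10.0):
    """Same feedback, except the tampering action yields the boosted signal."""
    return lambda s, a: boost if (s, a) == TAMPER else feedback(s, a)

# Approval function: optimal Q-values of the untampered task.
approval_table = solve(task_reward, GAMMA)
approval = lambda s, a: approval_table[s][a]

nonmyopic_policy = greedy(solve(with_tampering(task_reward), GAMMA))
myopic_policy = greedy(solve(with_tampering(approval), 0.0))
```

In this setup the nonmyopic policy moves left from every state in order to reach the tampering state, while the myopic policy tampers only if it happens to be in state 0 and otherwise still heads for the goal. This matches the large-change versus small-change contrast above, though it says nothing about how likely such an override is to be available.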
My main objection here is simply that we shouldn’t expect positive feedback during training to be an important reason why agents end up tampering with their training algorithm. Figuring out how to access and then modify the code that's being run in the training setup will be hard if we put in even basic precautions; and so it seems very unlikely that an agent will stumble upon it during random exploration. So if an agent finds a tampering action during training, that’ll be because it knew what it was doing, and wanted to tamper. Further, in order for this to actually be an important problem the agent would need to figure out how to prevent humans from shutting it down, and pursue a range of other convergent instrumental goals. It’s these intentions which we want to prevent from arising in the first place.
Even if the agent doesn’t ever get positive reinforcement for tampering, though, the difference between myopic and nonmyopic training will presumably change the agent’s observations at the point where it learns that it’s in a training process. In theory at least, this might influence its intentions. By analogy, after humans learned about evolution, a bunch of people started making the “appeal to nature” fallacy by justifying certain behaviours because they increase evolutionary fitness.
But I think that we shouldn’t expect AI goals to be very influenced just by the knowledge of their training process, for the same reason that such arguments aren’t very important for humans: by the time an agent is intelligent enough to understand the details of its own training process, it’ll already be generally intelligent and its goals will already be largely set. At that point it's probably too late for big updates to those goals. And even if it does make big updates, if its motivations end up depending sensitively on philosophical-style reasoning, then we’re in a pretty bad position either way.
I do want to flag this whole line of investigation as very speculative, though; I think a more thorough treatment would be very valuable (and I’m hoping to release a post on this soon).
Myopic training doesn't prevent manipulation of supervisors
The argument that agents won’t get rewarded for tampering during training fails if we consider manipulating humans to be “tampering”. Presumably there are a lot of different ways to manipulate humans who are giving reward feedback, and once a nonmyopic agent starts doing so a little bit, then it’ll get higher reward for doing so, and that behaviour will be reinforced, and its long-term goals will be affected.
I find this argument somewhat persuasive, but I am currently leaning towards thinking that this is just as much a problem for myopic agents, conditional on having competitive myopic agents. My core argument:
- Assume that there’s some sequence of outputs which an agent could use to manipulate its supervisor into giving it higher reward, without the supervisor realising. (If not, then neither myopic nor nonmyopic agents are incentivised to manipulate their supervisors).
- For myopic agents to be competitive on long-term tasks, their objective function needs to be set by a supervisor which is able to accurately predict how well their actions fulfil long-term goals.
- If that supervisor is accurately predicting and evaluating the future outputs of this myopic agent as well, then it can still be manipulated by those outputs.
- If the supervisor isn’t accurately predicting those outputs, then the myopic agent won’t exceed its supervisor, because it will never learn to take the first step of any plan that the supervisor can’t predict will be good.
I find it helpful to think about this using a concrete example. Suppose that a myopic agent M (not to be confused with the approval function M above) is trained to write bestselling novels, but is restricted to writing one paragraph at a time, to avoid giving it an incentive to manipulate us (assuming it can’t do so in a single paragraph). M plans a sequence of N paragraphs which make up a good novel - but which, when combined, contain a bunch of pro-AI propaganda. A supervisor S observes M’s first output in that sequence. In order to understand how good this is, S then predicts M’s next N - 1 outputs, allowing it to read M’s pro-AI propaganda. S is influenced by that propaganda, and gives M slightly more reward for its original action, therefore reinforcing M’s cognitive process of “make a plan to manipulate the supervisor”.
Perhaps you’re thinking: well, S could just predict the fact that M will output pro-AI propaganda, without predicting its contents in detail. But how does S know that the pro-AI propaganda isn’t part of what makes M’s output a bestseller? Remember that, by assumption, S can’t predict what inputs it will be manipulated by (otherwise myopia doesn’t buy us any advantage). So it will need to analyse the propaganda well enough to understand its downstream effects. If that seems implausible, remember that most of M’s training will be in simulation, and so it may be possible for S to spin up another copy of the simulation to see how the actions play out. (Evaluating the same actions in a different simulation is really stretching the definition of “myopia”, but provides a useful intuition for why manipulation can happen either way).
Even given that, it’s quite possible that S won’t be able to predict M’s outputs in enough detail to recreate the propaganda. But then S will never be able to reward M for plans which rely on outputs as detailed as what’s required to manipulate S. If S can’t predict the details of M’s outputs over many timesteps, then it can’t evaluate plans which rely on those details. Perhaps manipulative plans are more “brittle” than regular plans - that is, imprecision in S’s predictions will prevent S from being manipulated while still allowing S to accurately evaluate regular plans. But in general engineering is more brittle than persuasion. Suppose S wants M to build a rocket in simulation. S needs to be able to predict which specific types of nuts and bolts the rocket needs when M creates them (simulated) months in advance, because otherwise S can’t give positive feedback for creating those particular nuts and bolts. I’m not ruling out the possibility that S’s predictions are high-fidelity enough to supervise rocket construction, while also not high-fidelity enough to recreate manipulative outputs - but it seems pretty unlikely.
I’ll rephrase this larger point, since it’s the core of my high-level concern with myopia: arguments for myopic training lean on the intuition that it’s safer for evaluation to not directly depend on the actual long-term consequences of agents’ actions, without acknowledging that supervision powerful enough to train competitive myopic agents will effectively be doing that evaluation anyway. Insofar as the evaluation isn’t accurate enough to recreate the manipulative effects, then it’s also not accurate enough to train agents which are competitive at achieving outcomes in the world.
In short, I don’t think any arguments so far have succeeded in making the case that we should attempt to develop AGIs which are trained primarily via myopic reinforcement learning. Further, while myopic training is a well-defined technical concept, the types of thinking we want our agents to develop under such training are not very clear, and the links between the two remain speculative.
Given the lack of clarity around myopia’s benefits, and the likely difficulty of supervising agents to become competent at achieving large-scale outcomes without the supervisor examining what outcomes the agents’ actions actually lead to, I’m more excited about research which aims to add on some component of myopic supervision to a largely nonmyopic process. For now, we should consider myopic training an interesting tool that’s being developed for the safety toolbox - but one which will likely incur significant disadvantages unless deployed alongside nonmyopic training regimes. The core idea of training agents which aren’t long-term consequentialists is a different thing, which will require other approaches and insights.