Note that this post has been edited to clarify the difference between explicitly assigning a reward to an action based on its later consequences, versus implicitly reinforcing an action by assigning high reward during later timesteps when its consequences are observed. I'd previously conflated these in a confusing way; thanks to Rohin for highlighting this issue.
A number of people seem quite excited about training myopic reinforcement learning agents as an approach to AI safety (for instance this post on approval-directed agents, proposals 2, 3, 4, 10 and 11 here, and this paper and presentation), but I’m not. I’ve had a few detailed conversations about this recently, and although I now understand the arguments for using myopia better, I’m not much more optimistic about it than I was before. In short, it seems that evaluating agents’ actions by our predictions of their consequences, rather than our evaluations of the actual consequences, will make reinforcement learning a lot harder; yet I haven’t been able to identify clear safety benefits from doing so. I elaborate on these points below; thanks to Jon Uesato, Evan Hubinger, Ramana Kumar and Stephan Wäldchen for discussion and comments.
I’ll define a myopic reinforcement learner as a reinforcement learning agent trained to maximise the reward received in the next timestep, i.e. with a discount factor of 0. Because it doesn’t assign credit backwards over time, in order to train it to do anything useful, that reward function will need to contain an estimate of how valuable each (state, action, next state) transition will be for outcomes many steps later. Since that evaluation will need to extrapolate a long way forward anyway, knowing the next state doesn’t add much, and so we can limit our focus to myopic agents trained on reward functions R which ignore the resulting state: that is, where $R(s, a, s') = M(s, a)$ for some M. I'll call M the approval function; we can think of such agents as being trained to take actions that their supervisor approves of at the time the action is taken, without reference to how the rest of the trajectory actually plays out. This definition can also include imitation learners, for which the approval function is calculated based on the agent’s divergence from the supervisor’s policy.
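As a minimal sketch of this definition (the function names are illustrative and not taken from any library), the only difference between the two one-step learning targets is whether the bootstrapped estimate of the next state survives:

```python
from typing import Sequence


def td_target(feedback: float, next_action_values: Sequence[float], gamma: float) -> float:
    """One-step Q-learning target: feedback + gamma * max_a' Q(s', a')."""
    return feedback + gamma * max(next_action_values)


def myopic_target(approval: float) -> float:
    """Target for a myopic learner: the approval M(s, a) and nothing else."""
    return approval


# With gamma = 0 the bootstrapped term vanishes, so the agent's own estimate
# of the next state plays no role in what gets reinforced.
next_values = [2.0, 3.0]
assert td_target(1.0, next_values, gamma=0.0) == myopic_target(1.0)

# With gamma > 0 the agent's estimate of future value enters the target.
nonmyopic = td_target(1.0, next_values, gamma=0.9)  # 1.0 + 0.9 * 3.0
```

Under this framing, any information about long-term value has to arrive through the approval number itself, which is why the supervisor has to do the extrapolation.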
Although it’s not the standard interpretation of MDPs, I’ll consider the agent’s thoughts during a timestep as part of their action, so we can give feedback on those too in this framework. Note also that I'm talking about myopia with respect to "world time", not with respect to "agent time". For example, in Debate, agents make a series of arguments and then update their policies based on the final reward (so they're nonmyopic in agent time), but this whole process occurs without the agents being rewarded for the consequences of their actions in their wider environment, so it still qualifies as myopic by my definition above. In this post, though, I'm focusing on agents which are trained to take actions in the world, not ones which are just trained to give language outputs.
Supervising based on predictions not observations is a significant handicap
Firstly, I want to emphasise how much more difficult it is for the supervisor to try and evaluate the values of actions immediately, without being able to give rewards after observing long-term outcomes of those actions. In order to do so, the supervisor needs to be able to predict in advance all the mechanisms which they want the agent to learn to use. In other words, the supervisor needs to be more intelligent than the agent - perhaps by a significant margin. Contrast this with the standard RL paradigm, in which we merely need to be able to recognise good outcomes, and the agent will learn by itself which of its actions led to them. It’s the difference between a simple programmatic reward function for training AlphaGo, versus needing to understand the advantages of each of AlphaGo’s moves before training it to that level - which would have set the development of AlphaGo back by years or decades.
One thing that makes this less than totally implausible is the possibility of the agent's knowledge being used by the supervisor. Schemes such as iterated amplification attempt to do this via recursive decomposition of the evaluation problem. I think iterated amplification is an important research direction, but I don’t see what value there is in making the supervisor output approval values to train a myopic agent on, rather than rewards to train a nonmyopic agent on. I’ll give more concrete examples and arguments about this later on; for now, it’s worth noting that I expect almost all nonmyopic training to happen in simulation, where rollouts are cheap. But once an agent is competent enough to deploy in the real world, we can also continue nonmyopic training over longer time periods. For example, once an agent has been deployed for a week or month, we can give it rewards for consequences which result from actions taken a week or a month ago (although in many cases I expect the relevant consequences to be apparent in a matter of minutes - e.g. the agent taking a wrong turn then backtracking).
This assumes that agents have good mechanisms for credit assignment. If it turns out that long-term credit assignment is a major bottleneck for RL, that would shift me towards thinking that myopia is more competitive. Further, although it's not standard practice, it seems plausible that we could wait until after observing a whole trajectory to give approval feedback on any actions. I'll call this semi-myopia, because the training algorithm itself is still myopic, but evaluations are based on the actual consequences of actions (not just their predicted consequences). Compared with myopia, semi-myopia incurs less disadvantage from removing automatic credit assignment over time, so it seems like an interesting direction to explore (although its benefits over nonmyopia still depend on arguments about manipulation which I attempt to rebut in a later section).
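One way to picture semi-myopia is the sketch below. It assumes a tabular learner and a hypothetical supervisor function `approve` that scores each action only after the whole trajectory has been collected; the state and action names are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (state, action)


def semi_myopic_update(
    estimates: Dict[Step, float],
    trajectory: List[Step],
    approve: Callable[[List[Step], int], float],
    lr: float = 0.5,
) -> None:
    """Move each (state, action) estimate towards an approval score that the
    supervisor assigns only after seeing the whole trajectory."""
    for t, step in enumerate(trajectory):
        target = approve(trajectory, t)  # may depend on what happened after t
        old = estimates.get(step, 0.0)
        estimates[step] = old + lr * (target - old)  # no bootstrapping term


# Hypothetical supervisor: withholds approval from any action that is
# followed by a backtrack later in the trajectory.
def approve(trajectory: List[Step], t: int) -> float:
    later_actions = [action for _, action in trajectory[t + 1:]]
    return 0.0 if "backtrack" in later_actions else 1.0


estimates: Dict[Step, float] = {}
trajectory = [("junction", "turn_left"), ("dead_end", "backtrack"), ("junction", "turn_right")]
semi_myopic_update(estimates, trajectory, approve)
```

The update rule is still myopic (each estimate is regressed towards its own approval score), but the score for the wrong turn reflects the observed backtrack rather than a prediction of it.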
It’s also true that there are some types of evaluation which benefit much less from the propagation of credit from later rewards to earlier actions. I've already mentioned that my arguments are less applicable for agents which merely propose high-level ideas (like scientific theories). For agents which do take actions in the world, one example where we can evaluate immediately is when determining whether an agent is “thinking the right types of thoughts” - for example, whether it’s making manipulative plans or not. In theory this doesn’t depend on its future thoughts or actions - but in practice it'd still be useful to assign credit to thoughts partly based on later observations. For example, if its plan changes greatly when the supervisor starts paying more attention, that’s evidence that its original plan was manipulative, and so it would be useful to give a negative reward which discourages it from making that plan in the first place. Yet more importantly, supervision to prevent the wrong types of thoughts from arising will not be sufficient to train a highly competent AI. We also need supervision to encourage actions that lead to good consequences - which brings us back to the arguments from the previous paragraphs.
Of course, if we want to train agents that make plans on the scale of years or decades, waiting to give rewards later on will take prohibitively long, and so our feedback to them will need to involve predictions of future effects of their actions. So there’ll be some element of foresight required either way. But trying to evaluate agent actions without seeing any of their effects on the world would require a significant shift from the current trajectory of reinforcement learning. Given this bar, we should expect compelling reasons to believe that myopic training will actually be important for safety - which, I will argue, we don’t yet have.
Myopic training doesn’t inherently prevent dangerous long-term behaviour
It seems intuitive that if agents are never rewarded for the long-term consequences of their actions, they won’t make dangerous long-term plans - but in fact myopic training itself doesn’t make any inherent difference. Let’s compare an agent trained using standard RL on a reward function R, with an agent trained using myopic RL where its approval function is the optimal Q-function of R. The same types of cognition will lead to high-scoring agents in both cases. This is clearest in the case of Q-learning, where the Q-functions converge to the same values in the limit in both cases. The intuition here is: if the standard RL agent benefits from planning to figure out how to get to future states with high reward, then the myopically trained agent benefits from planning to figure out how to get to future states with high reward in order to choose actions with high Q-values. So a “myopic” training regime may lead to agents which make long-term plans, and generally may display the same dangerous behaviour as standard RL agents, for the same reasons; later in this post I’ll discuss in more detail how such behaviour might arise during myopic training.
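A toy check of this equivalence, assuming a hypothetical three-state deterministic environment (the states, rewards and hyperparameters are invented for this sketch): a myopic learner regressed onto the optimal Q-function of R ends up with the same greedy policy as a learner that optimises R directly, including the step that sacrifices immediate reward.

```python
GAMMA = 0.9
ACTIONS = ("left", "right")

# Hypothetical dynamics: NEXT[s][a] is the successor state and
# REWARD[s][a] is the reward R for taking action a in state s.
NEXT = {0: {"left": 0, "right": 1}, 1: {"left": 0, "right": 2}, 2: {"left": 2, "right": 2}}
REWARD = {0: {"left": 0.05, "right": 0.0}, 1: {"left": 0.0, "right": 1.0}, 2: {"left": 0.0, "right": 0.0}}


def optimal_q(sweeps=500):
    """Value iteration for the optimal Q-function of REWARD."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in NEXT}
    for _ in range(sweeps):
        q = {
            s: {a: REWARD[s][a] + GAMMA * max(q[NEXT[s][a]].values()) for a in ACTIONS}
            for s in NEXT
        }
    return q


def train_myopic(approval, steps=2000, lr=0.1):
    """Myopic learner: regress each estimate towards the approval, no bootstrapping."""
    est = {s: {a: 0.0 for a in ACTIONS} for s in NEXT}
    for _ in range(steps):
        for s in NEXT:
            for a in ACTIONS:
                est[s][a] += lr * (approval[s][a] - est[s][a])
    return est


def greedy(q):
    return {s: max(ACTIONS, key=lambda a: q[s][a]) for s in q}


q_star = optimal_q()
myopic_estimates = train_myopic(approval=q_star)

nonmyopic_policy = greedy(q_star)
myopic_policy = greedy(myopic_estimates)
immediate_reward_policy = greedy(REWARD)  # what maximising R alone with discount 0 would pick
```

In this sketch both trained policies choose "right" in state 0, forgoing the small immediate reward to reach the larger one later, whereas greedily maximising the immediate reward picks "left". The long-horizon behaviour comes from the approval function, not from the training regime.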
The equivalence doesn’t just hold in a few cases - in fact, every function M which myopically assigns a value to all state-action pairs is the optimal Q-function for some reward function. So for any myopic training setup, there’s some equivalent nonmyopic training setup - specifically, the one with reward function $R(s, a, s') = M(s, a) - \gamma \max_{a'} M(s', a')$, where γ is the discount factor used in the nonmyopic setup. Notice that this equation is just the definition of the optimal Q-function, with M in place of Q, and rearranged so that we can calculate R in terms of M. It’s not always straightforward to calculate that reward function starting from the approval function (e.g. in continuous action spaces, where we can't check all the actions) but in practice I expect that given the capability to calculate a Q-function which takes into account all relevant future effects, we can come up with a reward function that’s very nearly equivalent. If this is a crux for proponents of myopia, I’d be interested to hear why.
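The reduction can be checked numerically. The sketch below assumes a small finite environment with deterministic transitions and a randomly generated approval function M (all of it hypothetical), takes the reward to be R(s, a, s') = M(s, a) - γ max over a' of M(s', a'), and confirms that value iteration on R recovers M as the optimal Q-function.

```python
import random

GAMMA = 0.9
STATES = range(4)
ACTIONS = range(3)

rng = random.Random(0)
# Hypothetical deterministic dynamics and an arbitrary approval function M.
NEXT = {(s, a): rng.choice(list(STATES)) for s in STATES for a in ACTIONS}
M = {(s, a): rng.uniform(-1.0, 1.0) for s in STATES for a in ACTIONS}


def reward_from_approval(s, a, s_next):
    """R(s, a, s') = M(s, a) - gamma * max_a' M(s', a')."""
    return M[(s, a)] - GAMMA * max(M[(s_next, b)] for b in ACTIONS)


def optimal_q(sweeps=500):
    """Value iteration for the optimal Q-function of the derived reward."""
    q = {sa: 0.0 for sa in M}
    for _ in range(sweeps):
        q = {
            (s, a): reward_from_approval(s, a, NEXT[(s, a)])
            + GAMMA * max(q[(NEXT[(s, a)], b)] for b in ACTIONS)
            for (s, a) in M
        }
    return q


q_star = optimal_q()
max_gap = max(abs(q_star[sa] - M[sa]) for sa in M)  # should be numerically zero
```

The `max` over next actions is the step that becomes hard in continuous action spaces, which is the caveat raised above.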
I should note that so far I’ve been talking about myopia as a property of a training process. This is in contrast to the cognitive property that an agent might possess, of not making decisions directly on the basis of their long-term consequences; an example of the latter is approval-directed agents. (Throughout the rest of this post all references to “myopic agents” will refer to the way those agents were trained, not to how they make decisions). “Myopic thinking” has never been particularly well-specified - we have some intuitive sense of what it looks like in human psychology, but there are a lot of missing details, especially in translating it to the context of AI. I’d be excited to see these details filled in, since I think a major constraint in safety is our lack of understanding of the possible ways that minds might work. For the purposes of this blog post, though, what’s relevant is that proposals to build approval-directed agents or agents which “think myopically” tend to outline myopic training processes intended to produce them without actually justifying why the myopic training is necessary. But based on the arguments in this post I expect that, whatever the most reasonable interpretations of “approval-directed” or “myopic” cognition turn out to be, they could be developed in nonmyopic training regimes just as well as (or better than) in myopic training regimes.
What might this look like in practice? Consider the example of an agent trained myopically on the approval of HCH. To make this nonmyopic in a trivial sense, we merely need to convert that approval into a reward using the formula I gave above. However, after just the trivial change, myopic training will outperform nonmyopic training (because the latter requires the agent to do credit assignment across timesteps). To make it nonmyopic in an interesting and advantageous sense, HCH will need to notice when its earlier evaluations were suboptimal, and then assign additional rewards to correct for those errors. That would make evaluation significantly more accurate on many tasks - or, at least, allow HCH to put less effort into each evaluation (which is important given that evaluation will be the most expensive part of amplification by far). Nonmyopic training allows us to judge how good a speech is after having seen the audience’s reaction, not just from the draft; or how good a startup idea is after it’s been implemented, not just when first proposed.
Why might we not want this ability? The main objection is that seeing how things play out won’t just give us more information, but also bias our evaluation, if the agent is able to tamper with or manipulate the training process. I’ll discuss this possibility in the upcoming sections.
Myopic training won't prevent tampering in practice
The equivalence arguments I made above only apply if the training setup is uninfluenceable by the agent. If an agent can tamper with the implementation of the training algorithm during training, then myopia and nonmyopia can diverge sharply. In particular, it depends on what type of tampering is possible. If the agent is able to increase the feedback signal during some timesteps, then this will lead to large changes in the standard RL agent’s behaviour (since it will then be incentivised to reach those timesteps) and small changes in the myopic agent’s behaviour (it will learn that the tampering action is good in some states, but shouldn’t systematically seek them out). Of course it’s still possible in theory for myopic agents to learn to tamper consistently during training, but it seems like that’d require some very specific mechanisms for tampering (e.g. tampering with the approval function to change its evaluations of a closely-related subset of states) rather than easier mechanisms for tampering (e.g. overriding current reward signal with high values).
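A toy illustration of this divergence, assuming a hypothetical two-state environment in which the feedback signal can be overwritten only from state 1. The states, actions and feedback values are invented, and the converged value table stands in for each agent's learned behaviour.

```python
# MDP[state][action] = (feedback, next_state).
# In state 0 the agent can do the intended task or take a detour to state 1.
# In state 1 it can overwrite its feedback signal with a large value.
MDP = {
    0: {"work": (1.0, 0), "detour": (0.0, 1)},
    1: {"work": (1.0, 0), "tamper": (10.0, 0)},
}


def converged_q(gamma, sweeps=1000):
    """Value iteration; with gamma = 0 this is just the immediate feedback."""
    q = {s: {a: 0.0 for a in acts} for s, acts in MDP.items()}
    for _ in range(sweeps):
        q = {
            s: {a: fb + gamma * max(q[nxt].values()) for a, (fb, nxt) in acts.items()}
            for s, acts in MDP.items()
        }
    return q


def greedy(q):
    return {s: max(q[s], key=q[s].get) for s in q}


nonmyopic_policy = greedy(converged_q(gamma=0.9))
myopic_policy = greedy(converged_q(gamma=0.0))
```

In this sketch both agents prefer the tampering action once they are in state 1, but only the nonmyopic agent takes the detour from state 0 in order to get there. The myopic agent keeps working, because the detour itself receives no feedback.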
My main objection here is simply that we shouldn’t expect positive feedback during training to be an important reason why agents end up tampering with their training algorithm. Figuring out how to access and then modify the code that's being run in the training setup will be hard if we put in even basic precautions; and so it seems very unlikely that an agent will stumble upon it during random exploration. So if an agent finds a tampering action during training, that’ll be because it knew what it was doing, and wanted to tamper. Further, in order for this to actually be an important problem the agent would need to figure out how to prevent humans from shutting it down, and pursue a range of other convergent instrumental goals. It’s these intentions which we want to prevent from arising in the first place.
Even if the agent doesn’t ever get positive reinforcement for tampering, though, the difference between myopic and nonmyopic training will presumably change the agent’s observations at the point where it learns that it’s in a training process. In theory at least, this might influence its intentions. By analogy, after humans learned about evolution, a bunch of people started making the “appeal to nature” fallacy by justifying certain behaviours because they increase evolutionary fitness.
But I think that we shouldn’t expect AI goals to be very influenced just by the knowledge of their training process, for the same reason that such arguments aren’t very important for humans: by the time an agent is intelligent enough to understand the details of its own training process, it’ll already be generally intelligent and its goals will be already largely set. At that point it's probably too late for big updates to those goals. And even if they do make big updates, if their motivations end up depending sensitively on philosophical-style reasoning, then we’re in a pretty bad position either way.
I do want to flag this whole line of investigation as very speculative, though; I think a more thorough treatment would be very valuable (and I’m hoping to release a post on this soon).
Myopic training doesn't prevent manipulation of supervisors
The argument that agents won’t get rewarded for tampering during training fails if we consider manipulating humans to be “tampering”. Presumably there are a lot of different ways to manipulate humans who are giving reward feedback, and once a nonmyopic agent starts doing so a little bit, then it’ll get higher reward for doing so, and that behaviour will be reinforced, and its long-term goals will be affected.
I find this argument somewhat persuasive, but I am currently leaning towards thinking that this is just as much a problem for myopic agents, conditional on having competitive myopic agents. My core argument:
- Assume that there’s some sequence of outputs which an agent could use to manipulate its supervisor into giving it higher reward, without the supervisor realising. (If not, then neither myopic nor nonmyopic agents are incentivised to manipulate their supervisors).
- For myopic agents to be competitive on long-term tasks, their objective function needs to be set by a supervisor which is able to accurately predict how well their actions fulfil long-term goals.
- If that supervisor is accurately predicting and evaluating the future outputs of this myopic agent as well, then it can still be manipulated by those outputs.
- If the supervisor isn’t accurately predicting those outputs, then the myopic agent won’t exceed its supervisor, because it will never learn to take the first step of any plan that the supervisor can’t predict will be good.
(Note: Evan argues below that this argument doesn't apply to imitative amplification, which in theory could train a superhuman agent without the supervisor needing to predict the consequences of the agent's actions. This is a reasonable point, although I don't expect imitative amplification to be competitive with other ML techniques, so it doesn't change my conclusions overall.)
I find it helpful to think about this using a concrete example. Suppose that a myopic agent M is trained to write bestselling novels, but is restricted to writing one paragraph at a time, to avoid giving it an incentive to manipulate us (assuming it can’t do so in a single paragraph). M plans a sequence of N paragraphs which make up a good novel - but which, when combined, contain a bunch of pro-AI propaganda. A supervisor S observes M’s first output in that sequence. In order to understand how good this is, S then predicts M’s next N-1 outputs, allowing it to read M’s pro-AI propaganda. S is influenced by that propaganda, and gives M slightly more reward for its original action, therefore reinforcing M’s cognitive process of “make a plan to manipulate the supervisor”.
Perhaps you’re thinking: well, S could just predict the fact that M will output pro-AI propaganda, without predicting its contents in detail. But how does S know that the pro-AI propaganda isn’t part of what makes M’s output a bestseller? Remember that, by assumption, S can’t predict what inputs it will be manipulated by (otherwise myopia doesn’t buy us any advantage). So it will need to analyse the propaganda well enough to understand its downstream effects. If that seems implausible, remember that most of M’s training will be in simulation, and so it may be possible for S to spin up another copy of the simulation to see how the actions play out. (Evaluating the same actions in a different simulation is really stretching the definition of “myopia”, but provides a useful intuition for why manipulation can happen either way).
Even given that, it’s quite possible that S won’t be able to predict M’s outputs in enough detail to recreate the propaganda. But then S will never be able to reward M for plans which rely on outputs as detailed as what’s required to manipulate S. If S can’t predict the details of M’s outputs over many timesteps, then it can’t evaluate plans which rely on those details. Perhaps manipulative plans are more “brittle” than regular plans - that is, imprecision in S’s predictions will prevent S from being manipulated while still allowing S to accurately evaluate regular plans. But in general engineering is more brittle than persuasion. Suppose S wants M to build a rocket in simulation. S needs to be able to predict which specific types of nuts and bolts the rocket needs when M creates them (simulated) months in advance, because otherwise S can’t give positive feedback for creating those particular nuts and bolts. I’m not ruling out the possibility that S’s predictions are high-fidelity enough to supervise rocket construction, while also not high-fidelity enough to recreate manipulative outputs - but it seems pretty unlikely.
I’ll rephrase this larger point, since it’s the core of my high-level concern with myopia: arguments for myopic training lean on the intuition that it’s safer for evaluation to not directly depend on the actual long-term consequences of agents’ actions, without acknowledging that supervision powerful enough to train competitive myopic agents will effectively be doing that evaluation anyway. Insofar as the evaluation isn’t accurate enough to recreate the manipulative effects, then it’s also not accurate enough to train agents which are competitive at achieving outcomes in the world.
In short, I don’t think any arguments so far have succeeded in making the case that we should attempt to develop AGIs which are trained primarily via myopic reinforcement learning. Further, while myopic training is a well-defined technical concept, the types of thinking we want our agents to develop under such training are not very clear, and the links between the two remain speculative.
Given the lack of clarity around myopia’s benefits, and the likely difficulty of supervising agents to become competent at achieving large-scale outcomes without the supervisor examining what outcomes the agents’ actions actually lead to, I’m more excited about research which aims to add on some component of myopic supervision to a largely nonmyopic process. For now, we should consider myopic training an interesting tool that’s being developed for the safety toolbox - but one which will likely incur significant disadvantages unless deployed alongside nonmyopic training regimes. The core idea of training agents which aren’t long-term consequentialists is a different thing, which will require other approaches and insights.
Things I agree with:
1. If humans could give correctly specified reward feedback, it is a significant handicap to have a human provide approval feedback rather than reward feedback, because that requires the human to compute the consequences of possible plans rather than offloading it to the agent.
2. If we could give perfect approval feedback, we could also provide perfect reward feedback (at least for a small action space), via your reduction.
3. Myopic training need not lead to myopic cognition (and isn't particularly likely to for generally intelligent systems).
But I don't think these counteract what I see as the main argument for myopic training:
While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.
(I'm using "incentivize" here to talk about outer alignment and not inner alignment.)
In other words, the point is that humans are capable of giving approval / myopic feedback (i.e. horizon = 1) with not-terrible incentives, whereas humans don't seem capable of giving reward feedback (i.e. horizon = infinity) with not-terrible incentives. The main argument for this is that most "simple" reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does unless that's what the human says is correct. (Also we can just look at the long list of specification gaming examples so far.)
I'll rephrase your objections and then respond:
Response: Someone has to predict which action leads to good long-term effects, since we can't wait for 100 years to give feedback to the agent for a single action. In a "default" training setup, we don't want it to be the agent, because we can't trust that the agent selects actions based on what we think is "good". So we either need the human to take on this job (potentially with help from the agent), or we need to figure out some other way to trust that the agent selects "good" actions. Myopia / approval direction takes the first option. We don't really know of a good way to achieve the second option.
This doesn't seem to be true -- if you want, you can collect a full trajectory to see the consequences of the actions, and then provide approval feedback on each of the actions individually when computing gradients.
I agree that if you take the approval feedback that a human would give, apply this transformation, and then train a non-myopic RL agent on it, that would also not incentivize catastrophic outcomes. But if you start out with approval feedback, why would you want to do this? With approval feedback, the credit assignment problem has already been solved for the agent, whereas with the equivalent reward feedback, you've just undone the credit assignment and the agent now has to redo it all over again. (Like, instead of doing Q-learning, which has a non-stationary target, you could just use supervised learning to learn the fixed approval signal, surely this would be more efficient?)
On the tampering / manipulation points, I think those are special cases of the general point that it's easier for humans to provide non-catastrophe-incentivizing approval feedback than to provide non-catastrophe-incentivizing reward feedback.
I want to reiterate that I agree with the point that myopic training probably does not lead to myopic cognition (though this depends on what exactly we mean by "myopic cognition"), and I don't think of that as a major benefit of myopic training.
I think you mean γ instead of λ
It feels like we have two disagreements here. One is whether the thing you describe in this quote is "myopic" training. If you think that the core idea of myopia is that the evaluation of an action isn't based on its effects, then this is better described as nonmyopic. But if you think that the core idea of myopia is that the agent doesn't do its own credit assignment, then this is best described as myopic.
If you think, as I interpret you as saying, that the main reason myopia is useful is because it removes the incentive for agents to steer towards incorrectly high-reward states (which I'll call "manipulative" states), then you should be inclined towards the first definition. Because the approach you described above (of collecting and evaluating a full trajectory before giving feedback) means the agent still has an incentive to do multi-step manipulative plans.
More specifically: if a myopic agent's actions A_1 to A_N manipulate the supervisor into thinking that the (N+1)th state is really amazing, and the supervisor looks at the full trajectory before assigning approval, then the supervisor will give higher approval to all of the actions A_1 to A_N, and they'll all be reinforced, which is the same thing as would happen in a nonmyopic setup if the supervisor just gave the Nth action really high reward. In other words, it doesn't matter if the agent is doing its own credit assignment because the supervisor is basically doing the same credit assignment as the agent would. So if you count the approach you described above as myopic, then myopia doesn't do the thing you claim it does.
(I guess you could say that something counts as a "small" error if it only affects a few states, and so what I just described is not a small error in the approval function? But it's a small error in the *process* of generating the approval function, which is the important thing. In general I don't think counting the size of an error in terms of the number of states affected makes much sense, since you can always arbitrarily change those numbers.)
The second disagreement is about:
I am kinda confused about what sort of approval feedback you're talking about. Suppose we have a simple reward function, which gives the agent more points for collecting more berries. Then the agent has lots of convergent instrumental subgoals. Okay, what about a simple approval function, which approves actions insofar as the supervisor expects them to lead to collecting more berries? Then the agent *also* learns convergent instrumental subgoals, because it learns to take whatever actions lead to collecting more berries (assuming the supervisor is right about that).
I picture you saying that the latter is not very simple, because it needs to make all these predictions about complex dependencies on future states. But that's what you need in any approval function that you want to use to train a competent agent. It seems like you're only picturing myopic feedback that doesn't actually solve the problem of figuring out which actions lead to which states - but as soon as you do, you get the same issues. It is no virtue of approval functions that most of them are safe, if none of the safe ones specify the behaviour we actually want from AIs.
... There's a lot of ways that reward functions go wrong besides manipulation. I agree that if what you're worried about is manipulation in N actions, then you shouldn't let the trajectory go on for N actions before evaluating.
Consider the boat racing example. I'm saying that we wouldn't have had the boat going around in circles if we had used approval feedback, because the human wouldn't have approved of the actions where the boat goes around in a circle.
(You might argue that if a human had been giving the reward signal, instead of having an automated reward function, that also would have avoided the bad behavior. I basically agree with that, but then my point would just be that humans are better at providing approval feedback than reward feedback -- we just aren't very used to thinking in terms of "rewards". See the COACH paper.)
When the supervisor sees the agent trying to take over the world in order to collect more berries, the supervisor disapproves, and the agent stops taking that action. (I suspect this ends up being the same disagreement as the first one, where you'd say "but the supervisor can do that with rewards too", and I say "sure, but humans are better at giving approval feedback than reward feedback".)
Again, I do agree with you that myopic training is not particularly likely to lead to myopic cognition. It seems to me like this is creeping into your arguments somewhere, but I may be wrong about that.
I'm calling them manipulative states because if the human notices that the reward function has gone wrong, they'll just change the reward they're giving. So there must be something that stops them from noticing this. But maybe it's a misleading term, and this isn't an important point, so for now I'll use "incorrectly rewarded states" instead.
This isn't quite my argument. My two arguments are:
1. IF an important reason you care about myopia is to prevent agents from making N-step plans to get to incorrectly rewarded states, THEN you can't defend the competitiveness of myopia by saying that we'll just look at the whole trajectory (as you did in your original reply).
2. However, even myopically cutting off the trajectory before the agent takes N actions is insufficient to prevent the agent from making N-step plans to get to incorrectly rewarded states.
Has this argument been written up anywhere? I think I kinda get what you mean by "better", but even if that's true, I don't know how to think about what the implications are. Also, I think it's false if we condition on the myopic agents actually being competitive.
My guess is that this disagreement is based on you thinking primarily about tasks where it's clear what we want the agent to do, and we just need to push it in that direction (like the ones discussed in the COACH paper). I agree that approval feedback is much more natural for this use case. But when I'm talking about competitive AGI, I'm talking about agents that can figure out novel approaches and strategies. Coming up with reward feedback that works for that is much easier than coming up with workable approval feedback, because we just don't know the values of different actions. If we do manage to train competitive myopic agents, I expect that the way we calculate the approval function is by looking at the action, predicting what outcomes it will lead to, and evaluating how good those outcomes are - which is basically just mentally calculating a reward function and converting it to a value function. But then we could just skip the "predicting" bit and actually look at the outcomes instead - i.e. making it nonmyopic.
If you have ideas for how we might supervise complex tasks like Go to a superhuman level, without assigning values to outcomes in a way that falls into the same traps as reward-based learning, or without benefiting greatly from looking at what the actual consequences are, then that would constitute a compelling argument against my position. E.g. maybe we can figure out what "good cognitive steps" are, and then reward the agent for doing those without bothering to figure out what outcomes good cognitive steps will lead to. That seems very hard, but it's the sort of thing I think you need to defend if you're going to defend myopia. (I expect Debate about which actions to take, for instance, to benefit greatly from the judge being able to refer to later outcomes of actions).
Another way of making this argument: humans very much think in terms of outcomes, and how good those outcomes are, by default. I agree that we are bad at giving step-by-step dense rewards. But the whole point of a reward function is that you don't need to do the step-by-step thing; you can mostly just focus on rewarding good outcomes, and the agent does the credit assignment itself. I picture you arguing that we'll need shaped rewards to help the agent explore, but a) we can get rid of those shaped rewards as soon as the agent has gotten off the ground, so that they don't affect long-term incentives, and b) even shaped rewards can still be quite outcome-focused (and therefore natural to think about) - e.g. +1 for killing Roshan in Dota 2.
In terms of catching and correcting mistakes in the specification, I agree that myopia forces the supervisor to keep watching the agent, which means that the supervisor is more likely to notice if they've accidentally incentivised the agent to do something bad. But whatever bad behaviour the supervisor is able to notice during myopic training, they could also notice during nonmyopic training if they were watching carefully. So perhaps myopia is useful as a commitment device to force supervisors to pay attention, but given the huge cost of calculating the likely outcomes of all actions, I doubt anyone will want to use it that way.
I can't speak for everyone else, but when I talk about myopic training vs. regular RL, I'm imagining that they have the same information available when feedback is given. If you would wait till the end of the trajectory before giving rewards in regular RL, then you would wait till the end of the trajectory before giving approval in myopic training.
... Iterated amplification? Debate?
The point of these methods is to have an overseer that is more powerful than the agent being trained, so that you never have to achieve super-overseer performance (but you do achieve superhuman performance). In debate, you can think of judge + agent 1 as the overseer for agent 2, and judge + agent 2 as the overseer for agent 1.
(You don't use the overseer itself as your ML system, because the overseer is slow while the agent is fast.)
I agree that if you're hoping to get an agent that is more powerful than its overseer, then you're counting on some form of generalization / transfer, and you shouldn't expect myopic training to be much better (if at all) than regular RL at getting the "right" generalization.
Approval-directed agents. Note a counterargument in "Against Mimicry" (it technically argues against imitation, but I think it also applies to approval).
See above about being superhuman but sub-overseer. (Note that the agents can still come up with novel approaches and strategies that the overseer would have come up with, even if the overseer did not actually come up with them.)
... This does not match my experience at all. Most of the time it seems to me that we're executing habits and heuristics that we've learned over time, and only when we need to think about something novel do we start trying to predict consequences and rate how good they are in order to come to a conclusion. (E.g. most people intuitively reject the notion that we should kill one person for organs to save 5 lives. I don't think they are usually predicting outcomes and then figuring out whether those outcomes are good or not.)
I mean, yes, but I don't think it's particularly relevant to this disagreement.
TL;DR: I think our main disagreement is whether humans can give approval feedback in any way other than estimating how good the consequences of the action are (both observed and predicted in the future). I agree that if we are trying to have an overseer train a more intelligent agent, it seems likely that you'd have to focus on how good the consequences are. However, I think we will plausibly have the overseer be more intelligent than the agent, and so I expect that the overseer can provide feedback in other ways as well.
I broadly agree about what our main disagreement is. Note that I've been mainly considering the case where the supervisor is more intelligent than the agent as well. The actual resolution of this will depend on what's really going on during amplification, which is a bigger topic that I'll need to think about more.
On the side disagreement (of whether looking at future states before evaluation counts as "myopic") I think I was confused when I was discussing it above and in the original article, which made my position a bit of a mess. Sorry about that; I've added a clarifying note at the top of the post, and edited the post to reflect what I actually meant. My actual response to this:
Is that in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training. However, in nonmyopic training this doesn't matter very much, because we can assign high or low reward to some later state in the trajectory, which then influences whether the agent learns to do the original action more or less. We can't do this in myopic training in the current paradigm, which is where the competitiveness sacrifice comes from.
E.g. my agent sends an email. Is it good or bad? In myopic training, you need to figure this out now. In nonmyopic training, you can shrug, give it 0 reward now, and then assign high or low reward to the agent when it gets a response that makes it clearer how good the email was. Then because the agent does credit assignment automatically, actions are in effect evaluated based on their medium-term consequences, although the supervisor never actually looks at future states during evaluations.
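To make the email example concrete, here is a minimal sketch of how the discount factor changes the training target for the "send email" action. The four-step episode, the reward values and the helper name `discounted_returns` are all assumptions chosen for illustration; nothing here is a specific algorithm from the post.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every timestep t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: the email is sent at t=0 and the supervisor only
# learns whether it was a good email when the reply arrives at t=3.
rewards = [0.0, 0.0, 0.0, 1.0]

# Myopic training (gamma = 0): the email action's target is just its own reward.
myopic = discounted_returns(rewards, gamma=0.0)

# Nonmyopic training (gamma = 0.9): the later reward flows back to t=0.
nonmyopic = discounted_returns(rewards, gamma=0.9)
```

With `gamma=0.0` the target for the email action is 0, so the supervisor would have had to judge the email at t=0 to reinforce it. With `gamma=0.9` the same action receives a positive target (0.9 cubed) even though the supervisor never evaluated it directly.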
This is consistent with your position: "When I talk about myopic training vs. regular RL, I'm imagining that they have the same information available when feedback is given". However, it also raises the question of why we can't just wait until the end of the trajectory to give myopic feedback anyway. In my edits I've called this "semi-myopia". This wouldn't be as useful for nonmyopia, but I do agree that semi-myopia alleviates some competitiveness concerns, although at the cost of being more open to manipulation. The exact tradeoff here will depend on disagreement 1.
I mean, this is true in the sense that the Gym interface returns a reward with every transition, but the vast majority of deep RL algorithms don't do anything with those rewards until the trajectory is done (or, in the case of very long trajectories, until you've collected a lot of experience from this trajectory). So you could just as easily evaluate the rewards then, and the algorithms wouldn't change at all (though their implementation would).
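As a rough sketch of that point: the rollout can be collected first and the rewards filled in afterwards, and a batch-style learner would see the same data either way. The toy environment, `collect_trajectory`, `label_afterwards` and the evaluator below are hypothetical names invented for this illustration, not part of Gym or of any particular RL library.

```python
from typing import Callable, List, Tuple

Transition = Tuple[int, int]  # (state, action)


def collect_trajectory(
    policy: Callable[[int], int],
    step: Callable[[int, int], int],
    start_state: int,
    horizon: int,
) -> Tuple[List[Transition], int]:
    """Roll out the policy without computing any rewards yet."""
    state = start_state
    trajectory: List[Transition] = []
    for _ in range(horizon):
        action = policy(state)
        trajectory.append((state, action))
        state = step(state, action)
    return trajectory, state


def label_afterwards(
    trajectory: List[Transition],
    final_state: int,
    evaluate: Callable[[int, List[Transition], int], float],
) -> List[float]:
    """Assign a reward to each transition once the whole rollout is visible."""
    return [evaluate(t, trajectory, final_state) for t in range(len(trajectory))]


# Toy setup (assumed): the state is a counter and the only action adds 1.
def policy(state: int) -> int:
    return 1


def step(state: int, action: int) -> int:
    return state + action


def evaluate(t: int, trajectory: List[Transition], final_state: int) -> float:
    # Reward the last transition if the episode ended in state 3.
    is_last = t == len(trajectory) - 1
    return 1.0 if (is_last and final_state == 3) else 0.0


trajectory, final_state = collect_trajectory(policy, step, start_state=0, horizon=3)
rewards = label_afterwards(trajectory, final_state, evaluate)
```

The learner's update rule is untouched by this change; only the point at which the evaluator runs moves from "inside the loop" to "after the loop".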
I think there are secretly two notions of "leads to convergent instrumental subgoals" here. There's the outer-alignment notion: "do any of the optimal policies for this reward function pursue convergent instrumental subgoals?", and there's the training notion: "given a fixed learning setup, is it probable that training on this reward signal induces learned policies which pursue convergent instrumental subgoals?".
But with respect to the kind of approval function we'd likely implement, optimally arg-max'ing the approval function doesn't seem to induce the same kinds of subgoals. AFAICT, it might lead to short-term manipulative behavior, but not long-term power-seeking.
Sure. But, consider maximizing “TurnTrout has a fun day”-reward (for some imperfect grounding of that concept), and maximizing my approval of actions based on whether I think they’ll lead to a fun adventure.
The former takes over the world, and I don’t have a very fun day. But what about the latter?
To some extent, I won’t approve of actions that cause the agent to break, so there will be at least some instrumental subgoal pursuit for the agent. But for a successful power-seeking policy to be optimal, there is a conjunctive burden — we aren’t maximizing long-term discounted reward anymore, and the actions are evaluated locally, independently of any explicit global reward signal.
Many quasi-independently predicted approval judgments must cohere into a dangerous policy. It’s quite possible that this happens, but I’m not very convinced of that right now.
"Many quasi-independently predicted approval judgments must cohere into a dangerous policy."
I described how this happens in the section on manipulating humans. In short, there is no "quasi-independence" because you are still evaluating every action based on whether you think it'll lead to a fun adventure. This is exactly analogous to why the reward function you described takes over the world.
Yes, but I don't understand your case for "finding chains of manipulative inputs which increase myopic reward" entailing power-seeking? Why would that behavior, in particular, lead to the highest myopic reward? If we didn't already know about power-seeking reward maximizers, why would we promote this hypothesis to attention?
I disagree? Those objectives seem qualitatively dissimilar.
"Why would that behavior, in particular, lead to the highest myopic reward?"
I addressed this in my original comment: "More specifically: if a myopic agent's actions A_1 to A_n manipulate the supervisor into thinking that the N+1th state is really amazing, and the supervisor looks at the full trajectory before assigning approval, then the supervisor will give higher approval to all of the actions A_1 to A_n, and they'll all be reinforced, which is the same thing as would happen in a nonmyopic setup if the supervisor just gave the Nth action really high reward."
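A toy version of the quoted claim, under two loud assumptions: the supervisor's approval for every action is simply its score for the final state, and the nonmyopic learner uses no discounting (gamma = 1). Both function names are made up for this sketch.

```python
from typing import List


def semi_myopic_approvals(n_actions: int, final_state_score: float) -> List[float]:
    # Assumption: the supervisor looks at the whole trajectory and gives
    # each action the score it assigns to the final state.
    return [final_state_score] * n_actions


def nonmyopic_returns(
    n_actions: int, final_reward: float, gamma: float = 1.0
) -> List[float]:
    # Zero reward everywhere except on the last action; the return for each
    # earlier action is obtained by propagating that reward backwards.
    rewards = [0.0] * (n_actions - 1) + [final_reward]
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


approvals = semi_myopic_approvals(4, final_state_score=10.0)
returns = nonmyopic_returns(4, final_reward=10.0, gamma=1.0)
```

In this toy case the two lists of training targets coincide, so a manipulated final-state score reinforces every earlier action in both setups. With gamma below 1 the nonmyopic targets would shrink for earlier actions but keep the same sign.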
That's not what I'm asking. Why would that lead to power-seeking? You seem to be identifying "manipulation" with "power-seeking"; power-seeking implies manipulation, but the converse isn't always true.
Why do nonmyopic agents end up power-seeking? Because the supervisor rates some states highly, and so the agent is incentivised to gain power in order to reach those states.
Why do myopic agents end up power-seeking? Because to train a competitive myopic agent, the supervisor will need to calculate how much approval they assign to actions based on how much those actions contribute to reaching valuable states. So the agent will be rewarded for taking actions which acquire it more power, since the supervisor will predict that those contribute to reaching valuable states.
(You might argue that, if the supervisor doesn't want the agent to be power-seeking, they'll only approve of actions which gain the agent more power in specified ways. But equivalently, a reward function can also penalise unauthorised power-gaining, assuming the supervisors are equally able to notice it in both cases.)
I now think that I was thinking of myopic cognition, whereas you are talking about myopic training. Oops! This is obvious in hindsight (and now I'm wondering how I missed it), but maybe you could edit the post to draw a clear contrast?
Ah, makes sense. There's already a paragraph on this (starting "I should note that so far"), but I'll edit to mention it earlier.
This is likely the crux of our disagreement, but I don't have time to reply ATM. Hope to return to this.
If the latter is implemented on Laplace's Demon, and simply looks through all actions and picks the one with the highest approval, then I think it depends on how you've defined "approval." If maximum approval could be bad (e.g. if approval is unbounded, or if it would take a lot of work to find a benign context where you always give it maximum approval), then this search process is searching for things that look like taking over the world.
But as we move away from Laplace's demon, then I agree that realistic solutions look more like only manipulating TurnTrout and his immediate spatiotemporal surroundings.
I think this is a really important point, thanks.
Did you mean "There's no difference between approval feedback and reward feedback"?
Yes, fixed, thanks.
+1, I was about to write an argument to this effect.
Also, you can’t always rationalize M as state-based reward maximization, but even if you could, that doesn’t tell you much. Taken on its own, the argument about M-equivalence proves too much, because it would imply random policies have convergent instrumental subgoals:
Let M(s,a) be uniformly randomly drawn from the unit interval, the first time it’s called. Have the agent choose the argmax for its policy. This can be rationalized as some R(s,a,s′) maximization, so it’s probably power-seeking.
This doesn’t hold, obviously. Any argument about approval maximization should use specific facts about how approval is computed.
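For concreteness, the random approval function in this counterexample could be sketched as follows. The class name, the seed and the toy action set are assumptions added for illustration; the only part taken from the comment is "draw M(s,a) uniformly from the unit interval the first time it is called, then argmax".

```python
import random
from typing import Dict, Hashable, Sequence, Tuple


class RandomApproval:
    """M(s, a): drawn uniformly from [0, 1) on first call, then fixed."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._cache: Dict[Tuple[Hashable, Hashable], float] = {}

    def __call__(self, state: Hashable, action: Hashable) -> float:
        key = (state, action)
        if key not in self._cache:
            self._cache[key] = self._rng.random()
        return self._cache[key]


def argmax_policy(
    approval: RandomApproval, state: Hashable, actions: Sequence[Hashable]
) -> Hashable:
    """Pick the action with the highest approval in this state."""
    return max(actions, key=lambda a: approval(state, a))


M = RandomApproval(seed=0)
actions = ["wait", "move", "acquire_resources"]
chosen = argmax_policy(M, "s0", actions)
```

The resulting policy is a fixed but arbitrary mapping from states to actions, which is why nothing about power-seeking follows from the mere fact that it maximises some M.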
Put otherwise, specifying an actual reward function seems to be a good way to get a catastrophic maximizer, but arbitrary action-scoring rules don’t seem to have this property, as Rohin said above. Most reward functions have power-seeking optimal policies, and every policy is optimal for some reward function, but most policies aren’t power-seeking.
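The claim that every policy is optimal for some reward function has a standard one-line construction, sketched here for deterministic policies: reward 1 for matching the policy and 0 otherwise. Note that this reward depends on the action, so it is an R(s, a) rather than a purely state-based reward. The dictionary policy and the helper names are illustrative assumptions.

```python
from typing import Callable, Dict, Hashable, Sequence


def indicator_reward(
    policy: Dict[Hashable, Hashable],
) -> Callable[[Hashable, Hashable], float]:
    """Build R(s, a) = 1 if a is the policy's action in s, else 0."""

    def reward(state: Hashable, action: Hashable) -> float:
        return 1.0 if policy[state] == action else 0.0

    return reward


def greedy_policy(
    reward: Callable[[Hashable, Hashable], float],
    states: Sequence[Hashable],
    actions: Sequence[Hashable],
) -> Dict[Hashable, Hashable]:
    """In each state, pick the action with the highest immediate reward."""
    return {s: max(actions, key=lambda a: reward(s, a)) for s in states}


# Any policy at all will do; this one is arbitrary.
arbitrary_policy = {"s0": "left", "s1": "right", "s2": "left"}
reward = indicator_reward(arbitrary_policy)
recovered = greedy_policy(reward, list(arbitrary_policy), ["left", "right"])
```

Because following the original policy collects the maximum possible reward at every step, it is optimal for this reward under any discount rate, whatever the policy happens to do.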
My comments below are partially copied from earlier comments I made on a draft of this post that Richard shared with me.
This is possible for approval-based amplification, though it's worth noting that I'm not sure if it actually makes sense for imitative amplification. When the loss is just the distance between the overseer's output and the model's output, you already have the full feedback signal, so there's no reason to use a reward.
Though still not super well-specified, my current thinking is that an agent is thinking myopically if their goal is a function of their output across some Cartesian boundary. See the section on “Goals across Cartesian boundaries” in this post.
This is definitely the point here that I care most about. I care a lot more about myopic cognition than myopic training procedures—as I see myopic cognition as a solution to deceptive alignment—and I do find it quite plausible that you could use a non-myopic training procedure to train a myopic agent.
However, it's worth noting that the procedure given here really looks a lot more like approval-based amplification rather than imitative amplification. And approval-based amplification doesn't necessarily limit to HCH, which makes me somewhat skeptical of it. Furthermore, by allowing the overseer to see the model's output in giving its feedback, the procedure given here breaks the analogy to counterfactual oracles which means that a model acting like a counterfactual oracle will no longer always be optimal—which is a real problem if the sort of myopic cognition that I want behaves like a counterfactual oracle (which I think it does).
I think this is where I disagree with this argument. I think you can get myopic agents which are competitive on long-run tasks because they are trying to do something like “be as close to HCH as possible” which results in good long-run task performance without actually being specified in terms of the long-term consequences of the agent's actions.
I'm somewhat conflicted here. I sympathize with Rohin's sibling comment. One of my take-aways from the discussion between Rohin and Ricraz is that it's really not very meaningful to classify training as myopic/nonmyopic based on superficial features, such as whether feedback is aggregated across multiple rewards. As Ricraz repeatedly states, we can shift back and forth between "myopic" and "nonmyopic" by e.g. using a nonmyopic reasoner to provide the rewards to a perfectly myopic RL agent. Rohin pushes back on this point by pointing out that the important thing about a myopic approval-directed agent (for example) is the additional information we get from the human in the loop. An approval-directed berry-collecting agent will not gain approval from steps which build infrastructure to help fool the human judge, where a non-myopic approval-seeking RL agent could estimate high expected value for taking such steps. But the same cannot be said of an agent myopically trained to approximate an approval-seeking RL agent.
So it seems misleading to describe a system as myopically imitating a non-myopic system -- there is no significant difference between non-myopic Q-learning vs myopic imitation of Q-learning. A notion of "myopia" which agrees with your usage (allowing for myopic imitation of HCH) does not seem like a very useful notion of myopia. I see this as the heart of Ricraz' critique (or at least, the part that I agree with).
OTOH, Rohin's defense of myopia turns on a core claim:
So, if we want a useful concept of "myopic training", it seems it should support this claim -- IE, myopic training should be the sort of optimization for which small errors in loss function are unlikely to create huge errors in outcomes.
Going back to the example of myopically imitating HCH, it seems what's important here is how errors might be introduced. If we assume HCH is trusted, then a loss function which introduces independent noise on HCH's answers to different questions would be fine. On the other hand, an approximation which propagated those errors along HCH trees -- so a wrong conclusion about human values influences many, many upstream computations -- would be not-fine, in the same way nonmyopic RL is not-fine.
I'm not sure how to resolve this in terms of a notion of "myopic training" which gets at the important thing.
I agree that there's no difference between the training setup where you do myopic RL on a Q function and the training setup where you just do Q-learning directly, but that doesn't at all imply that there's no difference between internally myopically imitating some other Q-learning agent and internally doing some sort of non-myopic optimization process. As a silly example, if my Q-learning agent has some weird idiosyncratic behavior, then the model which is imitating it will exhibit the same behavior, whereas the model which is just trying to optimize the reward directly won't.
This especially matters in the context of HCH because we care quite a lot about getting out as direct an imitation of HCH as possible. In particular, it matters quite a lot that our model be internally trying to myopically imitate HCH rather than internally trying to non-myopically get the least loss across episodes, as the latter will lead it to output simpler answers to make its job easier.
To my understanding Abram, Richard and I agree that myopic cognition (what you're calling "internally myopically imitating") would confer benefits, but we don't think that myopic training is likely to lead to myopic cognition. That might be the crux?
Sure, but imitative amplification can't be done without myopic training or it ceases to be imitative amplification and becomes approval-based amplification, which means you no longer have any nice guarantees about limiting to HCH.
What about imitating HCH using GAIL and AIRL? I wouldn't really call that myopic training (if you do, I'm curious what your definition of "myopic training" is).
Both GAIL and AIRL only use expert trajectories rather than expert evaluations, which means they both satisfy the counterfactual oracle analogy, and so I would call them myopic training algorithms in the sense that I was using that term above. That being said, I definitely agree that the term is very overloaded here—some conflicting definitions:
Note that the post explicitly chose the first definition (which GAIL and AIRL don't meet). But considering the other definitions:
Seems like the second is implied by the first if you respect the RL assumptions (Cartesian boundary, episode abstractions); if you don't respect the RL assumptions I don't know that "incentivizes CDT-like behavior in the limit" is achievable (if even definable, I'm not really sure what it means).
... Huh? You can't tell whether the model is good / what direction to update it without looking at some information about the model or its effects, and if not the output, then what?
One interpretation could be "the model is evaluated without a human looking at its output", but I don't see why the model is less likely to develop myopic cognition if a human is looking at the output rather than some automated program. (By this definition, the majority of deep RL policies were "myopically trained" -- if this is actually what you mean, let's not use "myopic" to describe this.)
Maybe you think that the human can be manipulated but the automated program can't be? I mostly think this is a non-problem, and am not sure how you create the automated program that can evaluate a manipulative action without being manipulated in the process, but I could imagine that being a benefit.
Yeah, I meant without a human looking at the output. I also agree with pretty much everything you just said. We're pretty deep in this comment chain now and I'm not exactly sure why we got here—I agree that Richard's original definition was based on the standard RL definition of myopia, though I was making the point that Richard's attempt to make imitative amplification non-myopic turned it into approval-based amplification. Richard's version has a human evaluate the output rather than a distance metric, which I see as the defining difference between imitative and approval-based amplification.
Many algorithms for imitation still involve non-myopic training (e.g. GAIL, sorta, and AIRL).
... Why isn't this compatible with saying that the supervisor (HCH) is "able to accurately predict how well their actions fulfil long-term goals"? Like, HCH presumably takes those actions because it thinks those actions are good for long-term goals.
In the imitative case, the overseer never makes a determination about how effective the model's actions will be at achieving anything. Rather, the overseer is only trying to produce the best answer for itself, and the loss is determined via a distance metric. While the overseer might very well try to determine how effective its own actions will be at achieving long-term goals, it never evaluates how effective the model's actions will be. I see this sort of trick as the heart of what makes the counterfactual oracle analogy work.
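One way to picture the distinction is as two loss functions with different inputs. This is only a sketch: squared Euclidean distance stands in for whatever distance metric is actually used, and `approve` is a hypothetical stand-in for an overseer that reads the model's output.

```python
from typing import Callable, Sequence


def imitation_loss(
    model_output: Sequence[float], overseer_output: Sequence[float]
) -> float:
    # Imitative case: the overseer produces its own answer without ever
    # seeing the model's output; a fixed metric compares the two.
    return sum((m - o) ** 2 for m, o in zip(model_output, overseer_output))


def approval_loss(
    model_output: Sequence[float],
    approve: Callable[[Sequence[float]], float],
) -> float:
    # Approval-based case: the overseer inspects the model's output and
    # scores it, so the output itself can influence the overseer's judgment.
    return -approve(model_output)


overseer_answer = [1.0, 0.0, 2.0]
model_answer = [1.0, 1.0, 2.0]
loss = imitation_loss(model_answer, overseer_answer)
```

In `imitation_loss` the only path from the model's output to the loss is the fixed metric, whereas in `approval_loss` it passes through the overseer's evaluation.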
I don't really understand what you're saying here. A thing you might be saying:
If that is what you're saying, I don't see why this is relevant to whether or not we should use myopic training?
(It's possible I need to reread the counterfactual oracle analogy, though I did skim it right now and didn't immediately see the relevance.)
My point here is that I think imitative amplification (if you believe it's competitive) is a counter-example to Richard's argument in his “Myopic training doesn't prevent manipulation of supervisors” section since any manipulative actions that an imitative amplification model takes aren't judged by their consequences but rather just by how closely they match up with what the overseer would do.
That seems to be a property of myopic cognition rather than myopic training? (See also this comment.)
I'm also confused.
"While the overseer might very well try to determine how effective it's own actions will be at achieving long-term goals, it never evaluates how effective the model's actions will be."
Evan, do you agree that for the model to imitate the actions of the supervisor, it would be useful to mimic some of the thought processes the supervisor uses when generating those actions?
In other words, if HCH is pursuing goal X, what feature of myopic training selects for a model that is internally thinking "I'm going to try to be as close to HCH as possible in this timestep, which involves reasoning about how HCH would pursue X", versus a model that's thinking "I'm going to pursue goal X"? (To the extent these are different, which I'm still confused about).
Planned summary for the Alignment Newsletter:
I think of myopia as part of a broader research direction of finding ways to weaken the notion of an agent ("agent" meaning that it chooses actions based on their predicted consequences) so the AI follows some commonsense notion of "does what you ask without making its own plans."
So if you remove all prediction of consequences beyond some timeframe, then you will get non-agent behavior relative to things outside the timeframe. This helps clarify for me why implicit predictions, like the AI modeling a human who makes predictions of the future, might lead to agential behavior even in an almost-myopic agent - the AI might now have incentives to choose actions based on their predicted (albeit by the human) consequences.
I think there are two endgames people have in mind for this research direction - using non-agent AI as an oracle and using it to help us choose good actions (implicitly making an agent of the combined human-oracle system), or using non-agent parts as understandable building blocks in an agent, which is supposed to be safe by virtue of having some particular structure. I think both of these have their own problems, but they do sort of mitigate your points about the direct performance of a myopic AI.
As an aside, you can't rationalize all M like this if you restrict yourself to state-/outcome-based reward functions. Relevant to the main point of the section.