This insight was made possible by many conversations with Quintin Pope, where he challenged my implicit assumptions about alignment. I’m not sure who came up with this particular idea.
In this essay, I call an agent a “reward optimizer” if it not only gets lots of reward, but also reliably makes choices like “reward but no task completion” (e.g. receiving reward without eating pizza) over “task completion but no reward” (e.g. eating pizza without receiving reward). Under this definition, an agent can be a reward optimizer even if it doesn't contain an explicit representation of reward, or implement a search process for reward.
Reinforcement learning is learning what to do—how to map situations to actions so as to maximize a numerical reward signal. — Reinforcement learning: An introduction
Many people[1] seem to expect that reward will be the optimization target of really smart learned policies—that these policies will be reward optimizers. I strongly disagree. As I argue in this essay, reward is not, in general, that-which-is-optimized by RL agents.[2]
Separately, as far as I can tell, most[3] practitioners usually view reward as encoding the relative utilities of states and actions (e.g. it’s this good to have all the trash put away), as opposed to imposing a reinforcement schedule which builds certain computational edifices inside the model (e.g. reward for picking up trash → reinforce trash-recognition and trash-seeking and trash-putting-away subroutines). I think the former view is usually inappropriate, because in many setups, reward chisels cognitive grooves into an agent.
Therefore, reward is not the optimization target in two senses:
- Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
- Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent's network. Properly understood, reward therefore does not express relative goodness, and so is not an optimization target at all.
Reward probably won’t be a deep RL agent’s primary optimization target
After work, you grab pizza with your friends. You eat a bite. The taste releases reward in your brain, which triggers credit assignment. Credit assignment identifies which thoughts and decisions were responsible for the release of that reward, and makes those decisions more likely to happen in similar situations in the future. Perhaps you had thoughts like
- “It’ll be fun to hang out with my friends” and
- “The pizza shop is nearby” and
- “Since I just ordered food at a cash register, execute motor-subroutine-#51241 to take out my wallet” and
- “If the pizza is in front of me and it’s mine and I’m hungry, raise the slice to my mouth” and
- “If the slice is near my mouth and I’m not already chewing, take a bite.”
Many of these thoughts will be judged responsible by credit assignment, and thereby become more likely to trigger in the future. This is what reinforcement learning is all about—the reward is the reinforcer of those things which came before it and the creator of new lines of cognition entirely (e.g. anglicized as "I shouldn't buy pizza when I'm mostly full"). The reward chisels cognition which increases the probability of the reward accruing next time.
Importantly, reward does not automatically spawn thoughts about reward, and reinforce those reward-focused thoughts! Just because common English endows “reward” with suggestive pleasurable connotations, that does not mean that an RL agent will terminally value reward!
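As a minimal sketch of the credit-assignment story above, assume a vanilla policy-gradient (REINFORCE) learner with a two-action softmax policy. The action labels, learning rate, and toy reward are illustrative assumptions rather than anything specified in this essay, and other algorithms (e.g. DQN) use reward differently.

```python
import math
import random


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.5):
    """One vanilla policy-gradient (REINFORCE) update for a softmax policy.

    For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - pi(i).
    Reward enters only as a scalar multiplier on that gradient.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]


def train(steps=200, seed=0):
    rng = random.Random(seed)
    # Hypothetical labels: action 0 = "skip pizza", action 1 = "eat pizza".
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1
        # Toy reward (an assumption): only eating pizza releases reward.
        reward = 1.0 if action == 1 else 0.0
        logits = reinforce_step(logits, action, reward)
    return softmax(logits)


final_probs = train()
```

In this sketch, `reinforce_step` never gives the policy a representation of reward. Reward only scales how strongly the sampled action is upweighted, and a reward of zero leaves the logits unchanged.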
What kinds of people (or non-tabular agents more generally) will become reward optimizers, such that the agent ends up terminally caring about reward (and little else)? Reconsider the pizza situation, but instead suppose you were thinking thoughts like “this pizza is going to be so rewarding” and “in this situation, eating pizza sure will activate my reward circuitry.”
You eat the pizza, triggering reward, triggering credit assignment, which correctly locates these reward-focused thoughts as contributing to the release of reward. Therefore, in the future, you will more often take actions because you think they will produce reward, and so you will become more of the kind of person who intrinsically cares about reward. This is a path[4] to reward-optimization and wireheading.
While it's possible to have activations on "pizza consumption predicted to be rewarding" and "execute motor-subroutine-#51241" and then have credit assignment hook these up into a new motivational circuit, this is only one possible direction of value formation in the agent. Seemingly, the most direct way for an agent to become more of a reward optimizer is to already make decisions motivated by reward, and then have credit assignment further generalize that decision-making.
The siren-like suggestiveness of the word “reward”
Let’s strip away the suggestive word “reward”, and replace it by its substance: cognition-updater.
Suppose a human trains an RL agent by pressing the cognition-updater button when the agent puts trash in a trash can. While putting trash away, the AI’s policy network is probably “thinking about”[5] the actual world it’s interacting with, and so the cognition-updater reinforces those heuristics which lead to the trash getting put away (e.g. “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642”).
Then suppose this AI models the true fact that the button-pressing produces the cognition-updater. Suppose this AI, which has historically had its trash-related thoughts reinforced, considers the plan of pressing this button. “If I press the button, that triggers credit assignment, which will reinforce my decision to press the button, such that in the future I will press the button even more.”
Why, exactly, would the AI seize[6] the button? To reinforce itself into a certain corner of its policy space? The AI has not had antecedent-computation-reinforcer-thoughts reinforced in the past, and so its current decision will not be made in order to acquire the cognition-updater!
RL is not, in general, about training cognition-updater optimizers.
When is reward the optimization target of the agent?
If reward is guaranteed to become your optimization target, then your learning algorithm can force you to become a drug addict. Let me explain.
Convergence theorems provide conditions under which a reinforcement learning algorithm is guaranteed to converge to an optimal policy for a reward function. For example, value iteration maintains a table of value estimates for each state s, and iteratively propagates information about that value to the neighbors of s. If a far-away state f has huge reward, then that reward ripples back through the environmental dynamics via this “backup” operation. Nearby parents of f gain value, and then after lots of backups, far-away ancestor-states gain value due to f’s high reward.
Eventually, the “value ripples” settle down. The agent picks an (optimal) policy by acting to maximize the value-estimates for its post-action states.
Suppose it would be extremely rewarding to do drugs, but those drugs are on the other side of the world. Value iteration backs up that high value to your present space-time location, such that your policy necessarily gets at least that much reward. There’s no escaping it: After enough backup steps, you’re traveling across the world to do cocaine.
But obviously these conditions aren’t true in the real world. Your learning algorithm doesn’t force you to try drugs. Any AI which e.g. tried every action at least once would quickly kill itself, and so real-world general RL agents won’t explore like that because that would be stupid. So the RL agent’s algorithm won’t make it e.g. explore wireheading either, and so the convergence theorems don’t apply even a little—even in spirit.
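The “backup” picture above can be sketched on a tiny deterministic chain. This is an illustrative toy under stated assumptions, not a model of any real agent: the six-state layout, the reward values, the discount `gamma`, and the number of backups are all made up for the example.

```python
def backup(values, rewards, gamma):
    """One synchronous Bellman backup on a deterministic chain.

    Actions: stay in state s, or step right to s + 1 (the last state only stays).
    The reward for a step is rewards[state entered].
    """
    n = len(values)
    new_values = []
    for s in range(n):
        nxt = min(s + 1, n - 1)
        stay = rewards[s] + gamma * values[s]
        move = rewards[nxt] + gamma * values[nxt]
        new_values.append(max(stay, move))
    return new_values


def value_iteration(rewards, gamma=0.9, n_backups=300):
    """Repeat backups from an all-zero table until the value ripples settle."""
    values = [0.0] * len(rewards)
    for _ in range(n_backups):
        values = backup(values, rewards, gamma)
    return values


def greedy_policy(values, rewards, gamma=0.9):
    """Return 'right' wherever stepping right beats staying put, else 'stay'."""
    n = len(values)
    policy = []
    for s in range(n):
        nxt = min(s + 1, n - 1)
        stay = rewards[s] + gamma * values[s]
        move = rewards[nxt] + gamma * values[nxt]
        policy.append("right" if move > stay else "stay")
    return policy


# State 0 is "home" (small reward); state 5 is the far-away, hugely rewarding state f.
rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 100.0]
values = value_iteration(rewards)
policy = greedy_policy(values, rewards)
```

Under these assumptions, the high reward at the far end of the chain propagates back to state 0, and the greedy policy leaves the modest reward at home to walk toward it. The table is updated for every state on every backup, which is exactly the kind of exhaustive coverage that real-world agents do not get.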
Anticipated questions
- Why won’t early-stage agents think thoughts like “If putting trash away will lead to reward, then execute motor-subroutine-#642”, and then this gets reinforced into reward-focused cognition early on?
- Suppose the agent puts away trash in a blue room. Why won’t early-stage agents think thoughts like “If putting trash away will lead to the wall being blue, then execute motor-subroutine-#642”, and then this gets reinforced into blue-wall-focused cognition early on? Why consider either scenario to begin with?
- But aren’t we implicitly selecting for agents with high cumulative reward, when we train those agents?
- Yeah. But on its own, this argument can’t possibly imply that selected agents will probably be reward optimizers. The argument would prove too much. Evolution selected for inclusive genetic fitness, and it did not get IGF optimizers.
- "We're selecting for agents on reward → we get an agent which optimizes reward" is locally invalid. "We select for agents on X → we get an agent which optimizes X" is not true for the case of evolution, and so is not true in general.
- Therefore, the argument isn't necessarily true in the AI reward-selection case. Even if RL did happen to train reward optimizers and this post were wrong, the selection argument is too weak on its own to establish that conclusion.
- Here’s the more concrete response: Selection isn’t just for agents which get lots of reward.
- For simplicity, consider the case where on the training distribution, the agent gets reward if and only if it reaches a goal state. Then any selection for reward is also selection for reaching the goal. And if the goal is the only red object, then selection for reward is also selection for reaching red objects.
- In general, selection for reward produces equally strong selection for reward’s necessary and sufficient conditions. In general, it seems like there should be a lot of those. Therefore, since selection is not only for reward but for anything which goes along with reward (e.g. reaching the goal), then selection won’t advantage reward optimizers over agents which reach goals quickly / pick up lots of trash / [do the objective].
- Another reason to not expect the selection argument to work is that, for most inner values an agent might have, it’s instrumentally convergent to avoid becoming a wireheader and to avoid hitting the reward button.
- I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward.
- We train agents which intelligently optimize for e.g. putting trash away, and this reinforces the trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about cognition-updating, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button.
- This reasoning follows for most inner goals by instrumental convergence.
- On my current best model, this is why people usually don’t wirehead. They learn their own values via deep RL, like caring about dogs, and these actual values are opposed to the person they would become if they wirehead.
- Don’t some people terminally care about reward?
- I think so! I think that generally intelligent RL agents will have secondary, relatively weaker values around reward, but that reward will not be a primary motivator. Under my current (weakly held) model, an AI will only start having computations about reward chiseled into it after other kinds of computations (e.g. putting away trash) have already been chiseled. More on this in later essays.
- But what if the AI bops the reward button early in training, while exploring? Then credit assignment would make the AI more likely to hit the button again.
- Then keep the button away from the AI until it can model the effects of hitting the cognition-updater button.[7]
- For the reasons given in the “siren” section, a sufficiently reflective AI probably won’t seek the reward button on its own.
- AIXI—
- will always kill you and then wirehead forever, unless you gave it something like a constant reward function.
- And, IMO, this fact is not practically relevant to alignment. AIXI is explicitly a reward-maximizer. As far as I know, AIXI(-tl) is not the limiting form of any kind of real-world intelligence trained via reinforcement learning.
- Does the choice of RL algorithm matter?
- For point 1 (reward is not the trained agent's optimization target), it might matter.
- I started off analyzing model-free actor-based approaches, but have also considered a few model-based setups. I think the key lessons apply to the general case, but I think the setup will substantially affect which values tend to be grown.
- If the agent's curriculum is broad, then reward-based cognition may get reinforced from a confluence of tasks (solve mazes, write sonnets), while each task-specific cognitive structure is only narrowly contextually reinforced. That said, this is also selecting equally hard for agents which do the rewarded activities, and reward-motivation is only one possible value which produces those decisions.
- Pretraining a language model and then slotting that into an RL setup also changes the initial computations in a way which I have not yet tried to analyze.
- It’s possible there’s some kind of RL algorithm which does train agents which limit to reward optimization (and, of course, thereby “solves” inner alignment in its literal form of “find a policy which optimizes the outer objective signal”).
- For point 2 (reward provides local updates to the agent's cognition via credit assignment; reward is not best understood as specifying our preferences), the choice of RL algorithm should not matter, as long as it uses reward to compute local updates.
- A similar lesson applies to the updates provided by loss signals. A loss signal provides updates which deform the agent's cognition into a new shape.
- TurnTrout, you've been talking about an AI's learning process using English, but ML gradients may not neatly be expressible in our concepts. How do we know that it's appropriate to speculate in English?
- I am not certain that my model is legit, but it sure seems more legit than (my perception of) how people usually think about RL (i.e. in terms of reward maximization, and reward-as-optimization-target instead of as feedback signal which builds cognitive structures).
- I only have access to my own concepts and words, so I am provisionally reasoning ahead anyways, while keeping in mind the potential treacheries of anglicizing imaginary gradient updates (e.g. "be more likely to eat pizza in similar situations").
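The selection point made earlier in this list (reward if and only if the agent reaches the goal, where the goal is the only red object) can be illustrated with a toy comparison. Everything here is hypothetical and chosen only for illustration: the environment dictionaries, the two hand-written policies, and the probe environment.

```python
def reward(env, position):
    """Training reward: 1 if and only if the agent ends on the goal square."""
    return 1.0 if position == env["goal"] else 0.0


def goal_seeker(env):
    """Hypothetical policy that navigates to the goal square."""
    return env["goal"]


def red_seeker(env):
    """Hypothetical policy that navigates to whatever object is red."""
    return env["red_object"]


def total_reward(policy, envs):
    return sum(reward(env, policy(env)) for env in envs)


# Training distribution: the goal is always the only red object.
train_envs = [{"goal": g, "red_object": g} for g in range(5)]
# Off-distribution probe: the red object and the goal come apart.
probe_env = {"goal": 0, "red_object": 4}

policies = (goal_seeker, red_seeker)
train_scores = {p.__name__: total_reward(p, train_envs) for p in policies}
probe_scores = {p.__name__: reward(probe_env, p(probe_env)) for p in policies}
```

On the training environments the two policies earn identical total reward, so selecting on reward cannot tell them apart. They only diverge on the probe environment, which the training signal never saw.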
Dropping the old hypothesis
At this point, I don't see a strong reason to focus on the “reward optimizer” hypothesis. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any tight mechanistic stories for that. I’d love to hear some, if there are any.
As far as I’m aware, the strongest evidence left for agents intrinsically valuing cognition-updating is that some humans do strongly (but not uniquely) value cognition-updating,[8] and many humans seem to value it weakly, and humans are probably RL agents in the appropriate ways. So we definitely can’t rule out agents which strongly (and not just weakly) value the cognition-updater. But it’s also not the overdetermined default outcome. More on that in future essays.
It’s true that reward can be an agent’s optimization target, but what reward actually does is reinforce computations which lead to it. A particular alignment proposal might argue that a reward function will reinforce the agent into a shape such that it intrinsically values reinforcement, and that the cognition-updater goal is also a human-aligned optimization target, but this is still just one particular approach of using the cognition-updating to produce desirable cognition within an agent. Even in that proposal, the primary mechanistic function of reward is reinforcement, not optimization-target.
Implications
Here are some major updates which I made:
- Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported.
- Wireheading was never a high-probability problem for RL-trained agents, absent a specific story for why cognition-updater-acquiring thoughts would be chiseled into primary decision factors.
- Stop worrying about finding “outer objectives” which are safe to maximize.[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function).
- Instead, focus on building good cognition within the agent.
- In my ontology, there's only one question: How do we grow good cognition inside of the trained agent?
- Mechanistically model RL agents as executing behaviors downstream of past reinforcement (e.g. putting trash away), in addition to thinking about policies which are selected for having high reward on the training distribution (e.g. hitting the button).
- The latter form of reasoning skips past the mechanistic substance of reinforcement learning: The chiseling of computations responsible for the acquisition of the cognition-updater. I still think it's useful to consider selection, but mostly in order to generate failure modes whose mechanistic plausibility can be evaluated.
- In my view, reward's proper role isn't to encode an objective, but a reinforcement schedule, such that the right kinds of computations get reinforced within the AI's mind.
Edit 11/15/22: The original version of this post talked about how reward reinforces antecedent computations in policy gradient approaches. This is not true in general. I edited the post to instead talk about how reward is used to upweight certain kinds of actions in certain kinds of situations, and therefore reward chisels cognitive grooves into agents.
Appendix: The field of RL thinks reward=optimization target
Let’s take a little stroll through Google Scholar’s top results for “reinforcement learning”, emphasis added:
The agent's job is to find a policy… that maximizes some long-run measure of reinforcement. ~ Reinforcement learning: A survey
In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards. ~ Reinforcement learning: The Good, The Bad and The Ugly
Steve Byrnes did, in fact, briefly point out part of the “reward is the optimization target” mistake:
I note that even experts sometimes sloppily talk as if RL agents make plans towards the goal of maximizing future reward… — Model-based RL, Desires, Brains, Wireheading
I don't think it's just sloppy talk, I think it's incorrect belief in many cases. I mean, I did my PhD on RL theory, and I still believed it. Many authorities and textbooks confidently claim—presenting little to no evidence—that reward is an optimization target (i.e. the quantity which the policy is in fact trying to optimize, or the quantity to be optimized by the policy). Check what the math actually says.
- ^
Including the authors of the quoted introductory text, Reinforcement learning: An introduction. I have, however, met several alignment researchers who already internalized that reward is not the optimization target, perhaps not in so many words.
- ^
Utility ≠ Reward points out that an RL-trained agent is optimized by original reward, but not necessarily optimizing for the original reward. This essay goes further in several ways, including when it argues that reward and utility have different type signatures—that reward shouldn’t be viewed as encoding a goal at all, but rather a reinforcement schedule. And not only do I not expect the trained agents to maximize the original “outer” reward signal, I think they probably won’t try to strongly optimize any reward signal.
- ^
Reward shaping seems like the most prominent counterexample to the “reward represents terminal preferences over state-action pairs” line of thinking.
- ^
But also, you were still probably thinking about reality as you interacted with it (“since I’m in front of the shop where I want to buy food, go inside”), and credit assignment will still locate some of those thoughts as relevant, and so you wouldn’t purely reinforce the reward-focused computations.
- ^
"Reward reinforces existing thoughts" is ultimately a claim about how updates depend on the existing weights of the network. I think that it's easier to update cognition along the lines of existing abstractions and lines of reasoning. If you're already running away from wolves, then if you see a bear and become afraid, you can be updated to run away from large furry animals. This would leverage your existing concepts.
From A shot at the diamond-alignment problem:
The local mapping from gradient directions to behaviors is given by the neural tangent kernel, and the learnability of different behaviors is given by the NTK’s eigenspectrum, which seems to adapt to the task at hand, making the network quicker to learn along behavioral dimensions similar to those it has already acquired.
- ^
Quintin Pope remarks: “The AI would probably want to establish control over the button, if only to ensure its values aren't updated in a way it wouldn't endorse. Though that's an example of convergent powerseeking, not reward seeking.”
- ^
For mechanistically similar reasons, keep cocaine out of the crib until your children can model the consequences of addiction.
- ^
I am presently ignorant of the relationship between pleasure and reward prediction error in the brain. I do not think they are the same.
However, I think people are usually weakly hedonically / experientially motivated. Consider a person about to eat pizza. If you give them the choice between "pizza but no pleasure from eating it" and "pleasure but no pizza", I think most people would choose the latter (unless they were really hungry and needed the calories). If people just navigated to futures where they had eaten pizza, that would not be true.
- ^
From correspondence with another researcher: There may yet be an interesting alignment-related puzzle to "Find an optimization process whose maxima are friendly", but I personally don't share the intuition yet.
At some level I agree with this post---policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent, and indeed this is one of the major points of discussion amongst people working on alignment in practice at ML labs.
However, I'm confused or skeptical about the particular deviations you are discussing and I suspect I disagree with or misunderstand this post.
As you suggest, in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).
If you have a system with a sophisticated understanding of the world, then cognitive policies like "select actions that I expect would lead to reward" will tend to outperform policies like "try to complete the task," and so I usually expect them to be selected by gradient descent over time. (Or we could be more precise and think about little fragments of policies, but I don't think it changes anything I say here.)
It seems to me like you are saying that you think gradient descent will fail to find such policies because...
Thanks for the detailed comment. Overall, it seems to me like my points stand, although I think a few of them are somewhat different than you seem to have interpreted.
I think I believe the first claim, which I understand to mean "early-/mid-training AGI policies consist of contextually activated heuristics of varying sophistication, instead of e.g. a globally activated line of reasoning about a crisp inner objective." But that wasn't actually a point I was trying to make in this post.
Depends. This describes vanilla PG but not DQN. I think there are lots of complications which throw serious wrenches into the "and then SGD hits a 'global reward optimum'" picture. I'm going to have a post explaining this in more detail, but I will say some abstract words right now in case it shakes something loose / clarifies my thoughts.
Critic-ba...
I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.
I do think the human (within-lifetime) reward function has an outsized impact on what goals humans end up pursuing, although I acknowledge that it’s not literally the only thing that matters.
(By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)
I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it ...
Let’s say, in the first few actually-encountered examples, reward is in fact strongly correlated with task completion. Reward is also of course 100% correlated with reward itself.
Then (at least under many plausible RL algorithms), the agent-in-training, having encountered those first few examples, might wind up wanting / liking the idea of task completion, OR wanting / liking the idea of reward, OR wanting / liking both of those things at once (perhaps to different extents). (I think it’s generally complicated and a bit fraught to predict which of these three possibilities would happen.)
But let’s consider the case where the RL agent-in-training winds up mostly or entirely wanting / liking the idea of task completion. And suppose further that the agent-in-training is by now pretty smart and self-aware and in control of its situation. Then the agent m...
I think RFLO is mostly imagining model-free RL with updates at the end of each episode, and my comment was mostly imagining model-based RL with online learning (e.g. TD learning). The former is kinda like evolution, the latter is kinda like within-lifetime learning, see e.g. §10.2.2 here.
The former would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I should maybe spend some time eating raspberries, but also more importantly I should explicitly try to maximize my inclusive genetic fitness so that I have lots of descendants, and those descendants (who will also disproportionately have the raspberry-eating gene) will then eat lots of raspberries.
The latter would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I shouldn’t go do lots of highly-addictive drugs that warp my preferences such that I no longer care about raspberries or indeed anything besides the drugs.
This feels very strongly reminiscent of an update I made a while back, and which I tried to convey in this section of AGI safety from first principles. But I think you've stated it far too strongly; and I think fewer other people were making this mistake than you expect (including people in the standard field of RL), for reasons that Paul laid out above. When you say things like "Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported", this assumes that the people doing this reasoning were using the premise in the mistaken way that you (and some other people, including past Richard) were. Before drawing these conclusions wholesale, I'd suggest trying to identify ways in which the things other people are saying are consistent with th...
I have considered the hypothesis that most alignment researchers do understand this post already, while also somehow reliably emitting statements which, to me, indicate that they do not understand it. I deem this hypothesis unlikely. I have also considered that I may be misunderstanding them, and think in some small fraction of instances I might be.
I do in fact think that few people actually already deeply internalized the points I'm making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed.
I did preface "Here are some major updates which I made:". The post is ambiguous on whether/why I believe others have been mistaken, though. I felt that if I just blurted out my true beliefs about how people had been reasoning incorrectly, people would get defensive. I did in fact consider combing through Ajeya's post for disagreements, but I thought it...
It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You've certainly phrased things differently and made some specific points that we didn't, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?
(Note I am still surprised sometimes that people still think certain wireheading scenarios make sense despite them having read RFLO, so it's plausible to me that we really didn't communicate everything that's in my head about this).
Maybe you have made a gestalt-switch I haven't made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.
Is there a difference between saying:
It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn't actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is ...
“Risks from Learned Optimization in Advanced Machine Learning Systems,” which we published three years ago and started writing four years ago, is extremely explicit that we don't know how to get an agent that is actually optimizing for a specified reward function. The alignment research community has been heavily engaging with this idea since then. Though I agree that many alignment researchers used to be making this mistake, I think it's extremely clear that by this point most serious alignment researchers understand the distinction.
This is precisely the point I make in “How do we become confident in the safety of a machine learning system?”, btw.
(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya's recent post for similar reasons. I don't think that the people I'm explaining it to literally don't understand the point at all; I think it mostly hasn't propagated into some parts of their other reasoning about alignment. I'm less on board with the "it's incorrect to call reward a base objective" point but I think it's pretty plausible that once I actually understand what TurnTrout is saying there I'll agree with it.)
The way I attempt to avoid confusion is to distinguish between the RL algorithm's optimization target and the RL policy's optimization target, and then avoid talking about the "RL agent's" optimization target, since that's ambiguous between the two meanings. I dislike the title of this post because it implies that there's only one optimization target, which exacerbates this ambiguity. I predict that if you switch to using this terminology, and then start asking a bunch of RL researchers questions, they'll tend to give broadly sensible answers (conditional on taking on the idea of "RL policy's optimization target" as a reasonable concept).
Authors' summary of the "reward is enough" paper:
...
Here's a story:
I don't think this is guaranteed to happen, but seems likely enough to elevate “inner reward optimizer” hypothesis to our attention, at least.
As a more general/tangential comment, I'm a bit confused about how "elevate hypothesis to our attention" is supposed to work. I mean it took some conscious effort to come up with a possible mechanistic story about how "inner reward optimizer" might arise, so how were we supposed to come up with such a story without paying attention to "inner reward optimizer" in the first place?
Perhaps it's not that we should literally pay no attention to "inner reward optimizer" until we have a good mechanistic story for it, but more like we are (or were) paying too much attention to it, given that we don't (didn't) yet have a good mechanistic story? (But if so, how to decide how much is too much?)
I like this post, and basically agree, but it comes across somewhat more broad and confident than I am, at least in certain places.
I’m currently thinking about RL along the lines of Nostalgebraist here:
If that’s right, then I am very reluctant to say anything whatsoever about “RL agents in general”. They’re too diverse.
Much of the post, especially the early part, reads (to me) like confident claims about all possible RL agents. For example, the excerpt “…reward is the antecedent-computation-reinforcer. Reward reinforces those computations which produced it.” sounds like a confident claim about all RL agents, maybe even by definition of “RL”. (If so, I think I disagree.)
But other parts of the post aren’t like that—for example, the “Does the choice of RL algorithm matter?” part seems more reasonable and hedged, and l...
Here is an example story I wrote (lightly edited by TurnTrout) about how an agent trained by RL could plausibly not optimize reward, forsaking actions that it knew during training would get it high reward. I found it useful as a way to understand his views, and he has signed off on it. Just to be clear, this is not his proposal for why everything is fine, nor is it necessarily an accurate representation of my views, just a plausible-to-TurnTrout story for how agents won't end up wanting to game human approval:
A similar point is (briefly) made in K. E. Drexler (2019). Reframing Superintelligence: Comprehensive AI Services as General Intelligence, §18 “Reinforcement learning systems are not equivalent to reward-seeking agents”:
And an additional point which calls into question the view of RL-produced agents as the product of one big training run (whose reward specification we better get right on the first try), as opposed to the product of an R&D feedback loop with reward as one non-static component:
...
I think this is very important, probably roughly the way to go for top level alignment strategies, and we should start hammering out the mechanistic details of it more as soon as it's at all feasible.
Do you already have any ideas for experimentally verifying parts of this, and refining/formalising it further?
For example, do you think we could look at current RL models, and trace out how a particular pattern of behaviour being reinforced in early training led to things connected to that behaviour becoming the system's target even in later stages of tr...
I think the quotes cited under "The field of RL thinks reward=optimization target" are all correct. One by one:
Yes, that is the agent's job in RL, in the sense that if the training algorithm didn't do that we'd get another training algorithm (if we thought it was feasible for another algorithm to maximize reward). Basically, the field of RL uses a separation of concerns, where they design a reward function to incentivize good behaviour, and the agent maximizes th...
I discussed this post recently with a colleague, who encouraged me to post this excerpt:
I feel like this post has some themes similar to my article on tranquilism.
For a bit of context: In the article, I distinguish between "reflection-based motivation" and "need-based motivation." The former is something like "reflectively endorsed preferences / things the rational, planning part of your brain wants to do." The latter is something like "impulsive, system-1, unreflected motivation / things you can't help but be tempted to do." (I...
The term "RL agent" means an agent with architecture from a certain class, amenable to a specific kind of training. Since you are discussing RL agents in this post, I think it could be misleading to use human examples and analogies ("travelling across the world to do cocaine") in it because humans are not RL agents, neither on the level of wetware biological architecture (i.e., neurons and synapses don't represent a policy) nor on the abstract, cognitive level. On the cognitive level, even RL-by-construction agents of sufficient intelligence, trained in s...
This vibes well with what I've been thinking about recently.
There's a post in the back of my mind called "Character alignment", which is...
Relevant quote I just found in the paper "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents":
...
I'm feeling confused.
It might just be my inexperience with reinforcement learning, but while I agree with what you say, I can't square it with my intuition of what an ML model does.
If our model uses some variant of gradient ascent, it will end up at high values of the reward function. (Not necessarily at a global or local maximum, but the attempt is to get it to some such maximum.) In that sense the model does optimize for reward.
Is that a special attribute of gradient ascent, that we shouldn't expect other models to have? Does that mean that gradient ascent models are more dangerous? Are you just noting that the model won't necessarily find the global maximum, and will only reach some local maximum?
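To make the question above concrete, here is a minimal sketch of a policy-gradient ("gradient ascent") learner on a two-armed bandit. The bandit, its reward values, and the names `softmax` and `train_bandit` are assumptions made up for illustration; nothing here comes from the post. The update does climb expected reward on average, but mechanically each step only scales the log-probability gradient of the action that was actually sampled by the reward that followed it. The sketch does not settle what a trained policy "wants"; it only shows where reward enters the update.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE without a baseline on a bandit with fixed per-arm rewards.

    `arm_rewards` is a hypothetical toy environment: pulling arm i
    always yields arm_rewards[i].
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = arm_rewards[action]
        # d/d(logit_i) of log pi(action) is one_hot(action)_i - probs_i.
        # Reward only scales how strongly the sampled action is reinforced.
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)


final_probs = train_bandit([1.0, 0.2])
```

Note that the policy parameters (`logits`) never take reward as an input; reward appears only as a coefficient on the update. Whether a much more capable policy would also come to represent and pursue reward is a separate question that this toy cannot answer.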
This seems like a great takeaway and the part I agree with most here, although probably stated less strongly. Did you see Richard Ngo's Shaping Safer Goals (2020) or my Motivations, Natural Selection, and Curriculum Engineering (2021) responding to it[1]? Both relate to this sort of picture.
...
I think there are some subtleties here regarding the distinction between RL as a type of reward signal, and RL as a specific algorithm. You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post, or you can use it to update a reward prediction model in a model-based RL agent that acts a lot more like a maximizer.
I'd also like to hear your opinion on the effect of information leakage. For example, if reward only correlates with getting to the g...
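A rough sketch of the distinction drawn above, under heavy simplification: the same scalar reward can either scale a direct update to action preferences, or serve as a regression target for a learned reward predictor that a planner then explicitly maximizes. All three function names, the tabular representation, and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
def policy_gradient_update(logits, probs, action, reward, lr=0.1):
    """Reward directly scales the update to the sampled action's preference."""
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]


def reward_model_update(predicted_rewards, action, reward, lr=0.1):
    """Reward is a regression target for a per-action reward predictor."""
    updated = list(predicted_rewards)
    updated[action] += lr * (reward - updated[action])
    return updated


def plan_with_reward_model(predicted_rewards):
    """A planner that explicitly picks the action with highest predicted reward."""
    return max(range(len(predicted_rewards)), key=lambda a: predicted_rewards[a])


# Same kind of reward signal, two different roles:
logits = policy_gradient_update([0.0, 0.0], [0.5, 0.5], action=0, reward=1.0)
model = reward_model_update([0.0, 0.0], action=1, reward=2.0)
best = plan_with_reward_model(model)
```

In the first function, nothing in the agent represents reward. In the second and third, predicted reward is an explicit quantity that action selection is computed from, which is closer to the "maximizer" picture.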
"And not only do I not expect the trained agents to not maximize the original “outer” reward signal"
Nitpick: one "not" too many?
I think that subsection has the crucial insights from your post. Basically you're saying that, if we train an agent via RL in a limited environment where the reward correlates with another goal (e.g. "pick up the trash"), there are multiple policies the agent could have, multiple meta-policies it could...
Here's my general view on this topic:
Very interesting. I would love to see this worked out in a toy example, where you can see that an RL agent in a grid world does not in general maximize reward, but is able to reason to do… something else. That’s the part I have the hardest time translating into a simulation: what does it mean that the agent is “thinking” about outcomes, if that is something different than running an RL algorithm?
But the essential point that humans choose not to wirehead — or in general to delay or avoid gratification — is a good one. Why do they do this? Is there any RL al...
To me this implies that as the AI becomes more situationally aware, it learns to avoid rewards that reinforce away its current goals (because it wants to preserve its goals). As a result, throughout the training process, the AI's goals start out malleable and "harden" once the AI gains enough situational awareness. This implies that goals have to be simple enough for the agent to be able to model them early on in its training process.
Update: Changed
to
...
The deceptive alignment worry is that there is some goal about the real world at all. Deceptive alignment breaks robustness of any properties of policy behavior, not just the property of following reward as a goal in some unfathomable sense.
So refuting this worry requires quieting the more general hypothesis that RL selects optimizers with any goals of their own, doesn't matter what goals those are. It's only the argument for why this seems plausible that needs to refer to reward as related to the goal of such an optimizer, but the way the argument goes su...
The argument above isn’t clear to me, because I’m not sure how you’re defining your terms.
I should note that, contrary to the statement “reward is _not_, in general, that-which-is-optimized by RL agents”, by definition "reward _must be_ what is optimized for by RL agents." If they do not do that, they are not RL agents. At least, that is true based on the way the term “reward” is commonly used in the field of RL. That is what RL agents are programmed by humans to do. They do that by changing their behavior over many trials, and testing the results of ...
Sorry if I have misunderstood the point of your post, but I'm surprised that Bellman's optimality equation was nowhere mentioned. From Sutton's book on the topic I understood that once the policy iteration of vanilla RL converged to the point that the BOE holds, the agent is maximizing "value", which I would define in words as something like "expectation of discounted and cumulated reward". Now before one turns off a student new to the topic by giving a precise definition of those terms right away, I can see why he might have contracted that a bit u...
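For readers who want the equation referred to above, this is the standard statement of the Bellman optimality equation for state values, in the notation of Sutton and Barto's textbook. The symbols are assumptions in the sense that the comment itself does not define them: $p(s', r \mid s, a)$ is the environment's transition dynamics, $\gamma \in [0, 1)$ is the discount factor, and $v_\pi$ is the expected discounted return under policy $\pi$.

$$
v_\pi(s) = \mathbb{E}_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right],
\qquad
v_*(s) = \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_*(s') \,\bigr].
$$

The equation characterizes the optimal value function of a given Markov decision process. On its own it says nothing about whether a function approximator trained with a particular algorithm converges to it, or about what the resulting network represents internally.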
I feel like there is some vocabulary confusion in the genesis of this post. "Reward" is hard-coded into the agents. The dinosaurs of Jurassic Park (spoiler alert) were genetically engineered to lack iodine. So, the trainers could use iodine as a reward to incentivize other behaviors because by definition the dinos valued iodine as a terminal value. In humans, serotonin and dopamine binding to appropriate brain receptors are DNA-coded terminal values that inherently train us to pursue certain behaviors (e.g. food, sex). An AI is, by definition, going to take whatever actions maximize its Reward system. That's what having a Reward system means.