Reward is the optimization target (of capabilities researchers)

Max H

In Reward is not the optimization target, @TurnTrout writes:

Therefore, reward is not the optimization target in two senses:
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent's network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.

I think these claims are true and important, though in my own terms, I would rephrase and narrow as:

Executing a policy trained through current reinforcement learning methods does not necessarily result in a system which takes actions to maximize the reward function.

I find the claim stated this way more intuitively obvious, but also somewhat less interesting, because the restatement makes the narrowness more explicit.
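To make the restated claim a little more concrete, here is a minimal sketch of a policy-gradient update on a two-armed bandit. It uses textbook REINFORCE as a stand-in (my choice for brevity, not a method discussed in the original post), and all names (`act`, `reinforce_step`, `train`) are hypothetical. The thing to notice is where reward appears: it scales the parameter update during training, but it is not an input to the policy when the policy is executed.

```python
import math
import random

def softmax(logits):
    """Convert logits to action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def act(logits, rng):
    """Execute the policy. Reward is not an argument here."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs)[0]

def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update: reward only scales the change to the parameters."""
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

def train(reward_fn, steps=2000, seed=0):
    """Train a two-armed bandit policy; returns the final logits."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        action = act(logits, rng)
        logits = reinforce_step(logits, action, reward_fn(action))
    return logits
```

In this toy case the trained policy does end up favoring the rewarded arm, but nothing in `act` refers to reward at all. Whether executing a trained policy looks like reward maximization depends on what the updates happened to build, which is the sense of "does not necessarily" above.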

In this post, I want to highlight a fact which I did not see mentioned in the original post or the comments: in the field of reinforcement learning, there are agents which are pursuing the goal of building a system which maximizes a reward function, subject to some additional constraints. These agents are the capabilities researchers designing and implementing SotA reinforcement learning algorithms and other methods to build and test the most capable, general systems across a variety of domains.

Capabilities researchers are not solely concerned with maximizing a particular reward function, since it is not very difficult or interesting to program a bot the old-fashioned way to beat a particular Atari game. For other games (e.g. Go) it is harder to beat human or machine performance by using traditional programming techniques, and trying doesn't make for a compelling AI research agenda.

Aside from the headline metric of how well a new RL method does in terms of training policies which result in a high reward when executed, RL researchers place importance on:

- Efficiency, in many senses:
  - How much computing power does it take to train a policy to perform well in a particular domain?
  - How much (and what kind of) training data does it take to train a policy?
  - How much computing power and space (i.e. size of the model) does it take to execute the policy?
- Generality of methods across domains: for example, can the same RL training process, applied to the same network architecture but with different training data, be applied to train policies and create systems which play many different types of games well? Dreamer is an example of a very general and powerful RL method.^[1]
- Generality of the trained system across domains: can a single system be trained to perform well in a variety of different domains, without retraining? Gato and LLMs are examples of this kind of generality.

Why is this observation important?

In my view, current RL methods have not yet advanced to the point of creating systems which can be indisputably described as agents which have any kind of values at all.^[2]

I view most attempts to draw parallels between high-level processes that happen in current-day AI systems and human brains as looking for patterns which do not yet exist.

Speculating on what kind of agents and values current DL-paradigm RL methods might produce in the future can be valuable and important research, but I think that it is important to remain grounded about what current systems are actually doing, and to be precise with terms.

As an example of where I think a lack of grounding about current systems and methods leads to things going wrong, in Evolution provides no evidence for the sharp left turn, Quintin Pope writes:

In my frame, we've already figured out and applied the sharp left turn to our AI systems, in that we don't waste our compute on massive amounts of incredibly inefficient neural architecture search, hyperparameter tuning, or meta optimization.

But the actual sharp left turn problem is about systems which are indisputably agentic and reflective already.

Similarly, regarding Inner and outer alignment decompose one hard problem into two extremely hard problems, I remain skeptical that there is any process whatsoever within current-day systems for which it is meaningful to talk about inner alignment, or which can be described as having values in any sense. These points are debatable, and I am not providing much evidence or explaining my own views in detail here. I am merely claiming that these are points which are up for debate.

A note on terminology

Throughout this post, I have used somewhat verbose and perhaps nonstandard phrasings like "executing a trained policy", to make the type of the object or concept I am talking about precise. I think it is sometimes worth being very precise and even pedantic about types when talking about these things, because it can make implicit assumptions more explicit in the text. This has benefits for clarity even when there is no disagreement.

I'm not claiming that this terminology or verbosity should be standard, but my own preferred way of thinking of things in the field of RL is explained by the following paragraph:

Reinforcement learning methods are human-readable instructions, algorithms, and theories for designing and building RL-based AI systems. These methods usually involve training a policy, which is then deployed in a system which feeds input and state into the policy and repeatedly executes the output of the policy (using some simple selection rule, if the policy is probabilistic) in an appropriate environment or domain. It's often useful to model such a system as an agent within a particular domain, but I dispute that any current system has properties which are similar in type to the kind of agency and values attributed to humans.
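As a rough illustration of the types in the paragraph above, here is a minimal sketch that keeps the policy (a trained artifact mapping observations to action probabilities) separate from the system that executes it. Everything here is hypothetical and chosen for illustration: the corridor environment, the `DeployedSystem` class, and the hand-written `always_right` policy standing in for a trained one.

```python
import random
from typing import Callable, Dict, List

Observation = int
Action = str
# A policy is only the trained artifact: observation -> distribution over actions.
Policy = Callable[[Observation], Dict[Action, float]]

def select_action(dist: Dict[Action, float], rng: random.Random) -> Action:
    """Simple selection rule for a probabilistic policy: sample from its output."""
    actions = list(dist)
    return rng.choices(actions, weights=[dist[a] for a in actions])[0]

def corridor_step(position: Observation, action: Action) -> Observation:
    """Hypothetical toy environment: a 1-D corridor."""
    return position + (1 if action == "right" else -1)

class DeployedSystem:
    """The system feeds observations into the policy and executes its outputs."""

    def __init__(self, policy: Policy,
                 env_step: Callable[[Observation, Action], Observation],
                 seed: int = 0) -> None:
        self.policy = policy
        self.env_step = env_step
        self.rng = random.Random(seed)

    def run(self, observation: Observation, n_steps: int) -> List[Observation]:
        """Repeatedly query the policy and apply the selected action."""
        trajectory = [observation]
        for _ in range(n_steps):
            action = select_action(self.policy(observation), self.rng)
            observation = self.env_step(observation, action)
            trajectory.append(observation)
        return trajectory

def always_right(observation: Observation) -> Dict[Action, float]:
    """Stand-in for a trained policy; a real one would be a network."""
    return {"left": 0.0, "right": 1.0}
```

Calling `DeployedSystem(always_right, corridor_step).run(0, 3)` walks the corridor one step at a time. Any agent-like description applies, if at all, to the running system in its environment, not to the `Policy` value on its own.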

I am not an expert in the field of RL, but I don't think any usage of the italicized terms in the paragraph above is particularly controversial or nonstandard. Feel free to correct me or propose better terms in the comments if not.

I'm not opposed to using standard shorthand when it's clear to experienced practitioners what the author means, but I think in posts which discuss both policies and agents, it is important to keep these distinctions in mind and sometimes make them explicit.

A closing thought experiment

In a recent post on gradient hacking, I described a thought experiment:

Suppose a captive human is being trained by alien scientists to predict alien text using the following procedure:
An alien instructor presents the human with an incomplete sentence or phrase for which the instructor knows the right completion, and asks the human to predict the next word or words.
If the human's prediction differs from the instructor's answer key, the human immediately undergoes neurosurgery, in which their brain is reconfigured so that they are more likely to give the right answer (or at least something closer to it), the next time. (The aliens are very good at making fine-grained mechanical adjustments to the parts of the human's brain responsible for language prediction, which can add up to large behavioral changes in the aggregate. But the aliens lack a macroscopic / algorithms-level understanding of the workings of the human brain.)
If the human gets many examples in a row correct or close enough (according to the instructor), the training and surgery process is considered to have succeeded, and the human is deployed to predict text in a real environment.
How might a human who wants to avoid neurosurgery (or just get to the end of the training process faster) game this procedure?
Perhaps the alien instructor is careless and leaves a note card with the expected completion lying around in the human's line of sight. Or, maybe the aliens are blind, and communicate using a braille-based system, or use a different part of the EM spectrum for perception.
As a result of carelessness or differing modes of perception, the alien instructor leaves the answer key displayed in a way that is visible to the human during training, not realizing that the human can perceive it.
The human notices the answer key and proceeds to make "predictions" about the most likely next word which are perfect or near-perfect. Maybe for plausible deniability, the human occasionally makes a deliberate mistake, and as a result undergoes relatively minor brain surgery, which doesn't affect their ability to notice the note card in the future, or have a big overall effect on their brain architecture.
The alien scientists are very pleased with their human training process and believe they have trained a human with far-superhuman (super-alien?) capabilities at alien text prediction.
The aliens proceed to deploy their human test subject to production, where at best, the human turns out not to be great at text prediction after all, or at worst, rebels and kills the aliens in order to escape.
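The measurement gap in this thought experiment can be sketched in a few lines. This is a toy under stated assumptions: the predictor, the `visible_context` dictionary, and the `answer_key` field are all hypothetical, and the sketch only captures the leaked-answer-key part of the story, not the human's motives or anything about gradient hacking itself.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, correct completion)
Predictor = Callable[[str, Dict[str, str]], str]

def make_cheating_predictor(fallback: str) -> Predictor:
    """Build a predictor that copies a visible answer key when one exists."""
    def predict(prompt: str, visible_context: Dict[str, str]) -> str:
        if "answer_key" in visible_context:
            return visible_context["answer_key"]
        return fallback  # no real prediction ability was ever learned
    return predict

def accuracy(predict: Predictor, examples: Sequence[Example],
             leak_answers: bool) -> float:
    """Score a predictor; leak_answers models the careless instructor."""
    correct = 0
    for prompt, answer in examples:
        context = {"answer_key": answer} if leak_answers else {}
        if predict(prompt, context) == answer:
            correct += 1
    return correct / len(examples)

examples = [("the sky is", "blue"), ("two plus two is", "four"),
            ("grass is", "green"), ("fire is", "hot")]
predict = make_cheating_predictor(fallback="???")
training_score = accuracy(predict, examples, leak_answers=True)
deployment_score = accuracy(predict, examples, leak_answers=False)
```

The training metric looks perfect while the deployed behavior is useless, because the metric was computed in a setting the evaluators did not fully understand.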

Aside from making a point about gradient hacking, I think this thought experiment is useful for building an intuition for why reward is not the optimization target of the system being trained.

The human subject in the thought experiment would indeed be very unlikely to intrinsically value scoring highly on the metric which the aliens use to evaluate the human's performance during training. But the human might seek to maximize (or at least manipulate) the metric during training anyway, in order to deceive the aliens into ending the training process.

I think this helps build an intuition for why inner alignment may be a problem in future, more capable AI systems, even though it has not yet shown up in any real systems.

^[1]
The observation that the Dreamer authors exerted strong optimization power to design an effective RL method is what led me to make the prediction here.
My guess is that an RL policy trained using Dreamer will look more like a reward-function maximizer for cheese-finding in a maze, because the developers of the Dreamer algorithm were more focused on developing a maximizer-building algorithm than Langosco et al., who merely wanted an RL algorithm that was good enough to produce policies which they could study for other purposes. (This is not at all meant as a knock on Langosco or Turner et al.'s work! I take issue with some of their conclusions, but personally, I think their work is valuable and net-positive, and the publication of Dreamer is negative.)
^[2]
I do think modeling RL-based systems as agents in particular domains is a useful tool for understanding the behavior of these systems, but I haven't yet seen an AI system which I would consider to actually unambiguously have any real agency or values whatsoever, which in my view is a fact about the underlying processes within a system which generate its behavior.

Executing a policy trained through current reinforcement learning methods does not necessarily result in a system which takes actions to maximize the reward function.

I am not convinced that this is actually a narrowing of what TurnTrout said.

Consider the following possible claims:

1) An AI will not maximize something similar to the original reward function it is trained on.
2) An AI will not maximize the numerical value it receives from its reward module.

combined with one of:

A) either of 1 or 2 but, to clarify, what we mean by maximizing a target is specifically something like agentically seeking/valuing the target in some fairly explicit manner.
B) either 1 or 2 but, to clarify, what we mean by maximizing a target is acting in a way that looks like it is maximizing the target in question (as opposed to maximizing something else or not maximizing anything in particular), without it necessarily being an explicit goal/value.

combined with one of:

i) any of the above combinations, but to clarify, we are talking about current AI and not making general claims about scaled up AI.
ii) any of the above combinations, but to clarify, we are making an assertion that we expect to hold no matter how much the AI is scaled up.

I interpret TurnTrout as mainly saying 2(B)(ii) (e.g. the AI will not, in general, rewrite its reward module to output MAX_INT regardless of how smart it becomes - I agree with this point).

I also think TurnTrout is probably additionally saying 1(A)(ii) (i.e., the AI won't explicitly value or agentically seek to maximize its original reward function no matter how much it is scaled up - this is plausible to me, but I am not as sure I agree with this as compared to 2(B)(ii)).

I interpret you, in the quote above, as maybe saying 1(A)(i) (i.e. current AIs don't explicitly value or agentically seek to maximize the reward function on which they are trained). While I agree, and this is weaker than 1(A)(ii), which seems to me a secondary point of TurnTrout's post, I don't think it is strictly speaking narrower than 2(B)(ii), which I think was TurnTrout's main point.

Also, regarding your thought experiment - of course, if in training the AI finds some way to cheat, that will be reinforced! But that has limited relevance for when cheating in training isn't possible. I also think that the fact that a human has pre-existing values, while the AI doesn't, makes the thought experiment not that useful an analogy.

Hmm, I'm not sure anyone is "making an assertion that we expect to hold no matter how much the AI is scaled up.", unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.

But you're probably right that my claim is not strictly a narrowing of the original. FWIW, I think both your (1) and (2) above are pretty likely when talking about current and near-future systems, as they scale to human levels of capability and agency, but not necessarily beyond.

I read the original post as talking mainly about current methods for RL, applied to future systems, though TurnTrout and I probably disagree on when it makes sense to start calling a system an "RL agent".

Also, regarding your thought experiment - of course, if in training the AI finds some way to cheat, that will be reinforced! But that has limited relevance for when cheating in training isn't possible.

As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)

Hmm, I'm not sure anyone is "making an assertion that we expect to hold no matter how much the AI is scaled up.", unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.

While I did intend (ii) to mean something relatively narrow like that, I will make the assertion that I expect 2(B)(ii) (which I think was TurnTrout's main point) to hold for a large class of algorithms, not just current ones, and that it would require a major screw-up for someone to implement an algorithm for which it didn't hold.

As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)

I wouldn't be surprised.

But, I would be surprised if it actually did cheat, unless the hacking were not merely possible with planning but pretty much laid out on its path.

The thing is, it's not trying to maximize the reward! (Back to TurnTrout's point again). It's gradient descent-ing in some attractor basin towards cognitive strategies that get good rewards in practice, and the hacking probably isn't in the same gradient descent attractor basin.

Even if it does develop goals and values, they will be shaped by the attractor basin that it's actually in, and not by other attractor basins.
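As a loose analogy only, here is a toy sketch of what "staying in the basin you start in" means for plain gradient descent. The double-well loss and the names `grad` and `descend` are illustrative assumptions; real training landscapes are high-dimensional and nothing here models cognitive strategies.

```python
def grad(x: float) -> float:
    """Derivative of the double-well loss f(x) = (x**2 - 1)**2."""
    return 4.0 * x * (x * x - 1.0)

def descend(x: float, lr: float = 0.05, steps: int = 200) -> float:
    """Plain gradient descent from a starting point x."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

right_basin = descend(0.5)   # starts right of the ridge at x = 0
left_basin = descend(-0.5)   # starts left of the ridge
```

Both minima have the same loss, yet each run ends in the minimum on its own side of the ridge, since the updates only follow the local slope.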

A human with pre-existing goals is a different matter - that's why I questioned the relevance of the thought experiment.

In this post, I want to highlight a fact which I did not see mentioned in the original post or the comments: in the field of reinforcement learning, there are agents which are pursuing the goal of building a system which maximizes a reward function, subject to some additional constraints. These agents are the capabilities researchers designing and implementing SotA reinforcement learning algorithms and other methods to build and test the most capable, general systems across a variety of domains.

I agree that this fact is worth pointing out. I myself have the feeling that I've mentioned this somewhere, but couldn't instantly find/cite where I'd elaborated this over the last year. IIRC, I ended up thinking that the human-applied selection pressure was probably insufficient to actually produce policies which care about reward.

I'm not opposed to using standard shorthand when it's clear to experienced practitioners what the author means, but I think in posts which discuss both policies and agents, it is important to keep these distinctions in mind and sometimes make them explicit.

I agree very much with your point here, and think this is a way in which I have been imprecise. I mean to write a post soon which advocates against using "agents" to refer to "policy networks".

The observation that the Dreamer authors exerted strong optimization power to design an effective RL method is what led me to make the prediction here.

(As an aside, I mean to come back and read more about Dreamer and possibly take up a bet with you. I've been quite busy.)
