Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight.


When thinking about deception and RLHF training, a simplified threat model is something like this:

  • A model takes some actions.
  • If a human approves of these actions, the human gives the model some reward.
  • Humans can be deceived into giving reward in situations where they would otherwise not if they had more knowledge.
  • Models will take advantage of this so they can get more reward.
  • Models will therefore become deceptive.

Before continuing, I would encourage you to really engage with the above. Does it make sense to you? Is it making any hidden assumptions? Is it missing any steps? Can you rewrite it to be more mechanistically correct?

I believe that when people use the above threat model, they are either using it as shorthand for something else or they misunderstand how reinforcement learning works. Most alignment researchers will be in the former category. However, I was in the latter.

I was missing an important insight into how reinforcement learning setups are actually implemented. This lack of understanding led to lots of muddled thinking and general sloppiness on my part. I see others making the exact same mistake so I thought I would try and motivate a more careful use of language!

How Vanilla Reinforcement Learning Works

If I were to explain RL to my parents, I might say something like this:

  • You want to train your dog to sit.
  • You say "sit" and give your dog a biscuit if it sits.
  • Your dog likes biscuits, and over time it will learn it can get more biscuits by sitting when told to do so.
  • Biscuits have let you incentivise the behaviour you want.
  • We do the same thing with a computer by giving the computer "reward" when it does things we like. Over time, the computer will do more of the behaviour we like so it can get more reward.

Do you agree with this? Is this analogy flawed in any way?

I claim this is actually NOT how vanilla reinforcement learning works.
The framing above views models as "wanting" reward, with reward being something models "receive" on taking certain actions. What actually happens is this:

  • The model takes a series of actions (which we collect across multiple "episodes").
  • After collecting these episodes, we determine how good the actions in each episode are using a reward function.
  • We use gradient descent to alter the parameters of the model so the good actions will be more likely and the bad actions will be less likely when we next collect some episodes.

The insight is that the model itself never "gets" the reward. Reward is something used separately from the model/environment.
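
To make the three steps above concrete, here is a minimal sketch of a vanilla policy-gradient (REINFORCE-style) loop on a toy problem. The two-observation environment, the tabular logits `theta`, and the names `policy_probs`, `collect_episode`, `reward_fn` and `train` are all assumptions made for illustration, not anything from a real library. The thing to notice is that `policy_probs` never takes a reward argument; reward only appears inside the parameter update.

```python
import math
import random


def policy_probs(theta, observation):
    """The model: maps an observation to action probabilities (no reward input)."""
    logits = theta[observation]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def collect_episode(theta, rng):
    """Step 1: the model acts. It only ever sees an observation."""
    observation = rng.randrange(2)  # hypothetical environment with two states
    probs = policy_probs(theta, observation)
    action = 0 if rng.random() < probs[0] else 1
    return observation, action


def reward_fn(observation, action):
    """Step 2: scoring happens in the training loop, outside the model."""
    return 1.0 if action == observation else 0.0


def train(steps=500, batch=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]  # one pair of logits per observation
    for _ in range(steps):
        episodes = [collect_episode(theta, rng) for _ in range(batch)]
        grads = [[0.0, 0.0], [0.0, 0.0]]
        for obs, act in episodes:
            r = reward_fn(obs, act)
            probs = policy_probs(theta, obs)
            for a in range(2):
                # d log pi(act | obs) / d logit_a = 1[a == act] - pi(a | obs)
                indicator = 1.0 if a == act else 0.0
                grads[obs][a] += r * (indicator - probs[a]) / batch
        # Step 3: change the parameters so rewarded actions become more likely.
        for o in range(2):
            for a in range(2):
                theta[o][a] += lr * grads[o][a]
    return theta
```

After `theta = train()`, calling `policy_probs(theta, 0)` should put most of its probability on action 0, even though nothing resembling a reward was ever passed into the policy.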

To motivate this, let's view the above process not from the vantage point of the overall training loop but from the perspective of the model itself. For the purposes of demonstration, let's assume the model is a conscious and coherent entity. From its perspective, the above process looks like:

  • Waking up with no memories in an environment.
  • Taking a bunch of actions.
  • Suddenly falling unconscious.
  • Waking up with no memories in an environment.
  • Taking a bunch of actions.
  • and so on.....

The model never "sees" the reward. Each time it wakes up in an environment, its cognition has been altered slightly such that it is more likely to take certain actions than it was before.
Reward is the mechanism by which we select parameters; it is not something "given" to the model.

To (rather gruesomely) link this back to the dog analogy, RL is more like asking 100 dogs to sit, breeding the dogs which do sit and killing those which don't. Over time, you will have a dog that can sit on command. No dog ever gets given a biscuit.

The phrasing I find most clear is this: Reinforcement learning should be viewed through the lens of selection, not the lens of incentivisation.
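
Taking the dog-breeding analogy literally, a rough sketch might look like the following. Each "dog" is reduced to a single number (its probability of sitting), and the population size, mutation noise and function names are hypothetical. Note that this is an evolutionary-style procedure rather than gradient descent; it is only meant to show behaviour changing through selection without any individual ever being handed anything.

```python
import random


def breed(parent, rng, noise=0.05):
    """A puppy inherits its parent's sit-probability plus a small random variation."""
    return min(1.0, max(0.0, parent + rng.gauss(0.0, noise)))


def select_dogs(generations=200, population=100, seed=0):
    rng = random.Random(seed)
    dogs = [0.1] * population  # each dog is just its probability of sitting
    for _ in range(generations):
        # Ask every dog to sit once and note which ones did.
        sat = [p for p in dogs if rng.random() < p]
        if not sat:  # nobody sat this round: keep the population as it is
            continue
        # Only the sitters breed. No dog is ever given a biscuit.
        dogs = [breed(rng.choice(sat), rng) for _ in range(population)]
    return sum(dogs) / len(dogs)  # average sit-probability of the final generation
```

Starting from dogs that sit 10% of the time, `select_dogs()` should return a much higher average sit-probability.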

Why Does This Matter?

The "selection lens" has shifted my alignment intuitions a fair bit.

Goal-Directedness
It has changed how I think about goal-directed systems. I had unconsciously assumed models were strongly goal-directed by default and would do whatever they could to get more reward.

It's now clearer that goal-directedness in models is not a certainty, but something that can be potentially induced by the training process. If a model is goal-directed with respect to some goal, it is because such goal-directed cognition was selected for. Furthermore, it should be obvious that any learned goal will not be "get more reward", but something else. The model doesn't even see the reward!

CoinRun
Langosco et al. found an interesting failure mode in CoinRun.

The setup is this:

  • Have an agent navigate environments with a coin always on the right-hand side.
  • Reward the model when it reaches the coin.

At train-time everything goes as you would expect. The agent will move to the right-hand side of the level and reach the coin.
However, if at test-time you move the coin so it is now on the left-hand side of the level, the agent will not navigate to the coin, but instead continue navigating to the right-hand side of the level.

When I first saw this result, my initial response was one of confusion before giving way to "Inner misalignment is real. We are in trouble."

Under the "reward as incentivization" framing, my rationalisation of the CoinRun behaviour was:

  • At train-time, the model "wants" to get the coin.
  • However, when we shift distribution at test-time, the model now "wants" to move to the right-hand side of the level.

(In hindsight, there were several things wrong with my thinking...)

Under the "reward as selection" framing, I find the behaviour much less confusing:

  • We use reward to select for actions that led to the agent reaching the coin.
  • This selects for models implementing the algorithm "move towards the coin".
  • However, it also selects for models implementing the algorithm "always move to the right".
  • It should therefore not be surprising you can end up with an agent that always moves to the right and not necessarily towards the coin.
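
As a toy illustration of the selection framing (not the actual CoinRun environment or the agents trained by Langosco et al.), consider a one-dimensional level and two hand-written policies. The level width, start position and function names are assumptions. Both policies earn identical reward whenever the coin is on the right, so reward alone cannot distinguish them during training.

```python
def move_towards_coin(agent_x, coin_x):
    """Step towards wherever the coin is."""
    return 1 if coin_x > agent_x else -1


def always_move_right(agent_x, coin_x):
    """Ignore the coin entirely and step right."""
    return 1


def run_episode(policy, coin_x, width=10, max_steps=20):
    """Return 1.0 if the agent reaches the coin within max_steps, else 0.0."""
    agent_x = width // 2
    for _ in range(max_steps):
        if agent_x == coin_x:
            return 1.0
        step = policy(agent_x, coin_x)
        agent_x = min(width - 1, max(0, agent_x + step))
    return 1.0 if agent_x == coin_x else 0.0


train_levels = [9] * 5  # coin always at the right-hand edge
test_levels = [0] * 5   # coin moved to the left-hand edge

for policy in (move_towards_coin, always_move_right):
    train_reward = sum(run_episode(policy, c) for c in train_levels)
    test_reward = sum(run_episode(policy, c) for c in test_levels)
    print(policy.__name__, train_reward, test_reward)
# prints:
# move_towards_coin 5.0 5.0
# always_move_right 5.0 0.0
```

On the training levels the two policies are indistinguishable by reward, so a process that selects on training reward has no reason to prefer one over the other.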

Rewriting the Threat Model

Let's take another look at the simplified deception/RLHF threat model:

  • A model takes some actions.
  • If a human approves of these actions, the human gives the model some reward.
  • Humans can be deceived into giving reward in situations where they would otherwise not if they had more knowledge.
  • Models will take advantage of this so they can get more reward.
  • Models will therefore become deceptive.

This assumes that models "want" reward, which isn't true. I think this threat model is conflating two related but different failure cases, which I would rewrite as the following:

1. Selecting For Bad Behaviour

  • A model takes some actions.
  • A human assigns positive reward to actions they approve of.
  • RL makes such actions more likely in the future.
  • Humans may assign reward to behaviour where they would not if they had more knowledge.
  • RL will reinforce such behaviour.
  • RLHF can therefore induce cognition in models which is unintended and "reflectively unwanted".

2. Induced Goal-Directedness

  • Consider a hypothetical model that chooses actions by optimizing towards some internal goal which is highly correlated with the reward that would be assigned by a human overseer.
  • Obviously, RL is going to exhibit selection pressure towards such a model.
  • RLHF could then induce goal-directed cognition.
  • This model does now indeed "want" to score highly according to some internal metric.
  • One way of doing so is to be deceptive... etc etc

So failure cases such as deception are still very much possible, but I would guess a fair few people are confused about the concrete mechanisms by which deception can be brought about. I think this does meaningfully change how you should think about alignment. For instance, on rereading Ajeya Cotra's writing on situational awareness, I have gone from thinking that "playing the training game" is a certainty to something that could happen, but only after training somehow induces goal-directedness in the model.

One Final Exercise For the Reader

When reading about alignment, I now notice myself checking the following:

  1. Does the author ever refer to a model "being rewarded"?
  2. Does the author ever refer to a model taking action to "get reward"?
  3. If either of the above is true, can you rephrase their argument in terms of selection?
  4. Can you go further and rephrase the argument by completely tabooing the word "reward"?
  5. Does this exercise make the argument more or less compelling?

I have found going through the above to be a useful intuition-building exercise. Hopefully that will be the same for others!


I like this post a lot and I agree that much alignment discussion is confused, treating RL agents as if they’re classical utility maximizers, where reward is the utility they’re maximizing.

In fact, they may or may not be “trying” to maximize anything at all. If they are, that’s only something that starts happening as a result of training, not from the start. And in that case, it may or may not be reward that they’re trying to maximize (if not, this is sometimes called inner alignment failure), and it’s probably not reward in future episodes (which seems to be the basis for some concerns around “situationally aware” agents acting nicely during training so they can trick us and get to act evil after training when they’re more powerful).

One caveat with the selection metaphor though: it can be misleading in its own way. Taken naively, it implies something like that we’re selecting uniformly from all possible random initializations which would get very small loss on the training set. In fact, gradient descent will prefer points at the bottom of large attractor basins of somewhat small loss, not just points which have very small loss in isolation. This is even before taking into account the nonstationarity of the training data in a typical reinforcement learning setting, due to the sampled trajectories changing over time as the agent itself changes.

One way this distinction can matter: if two policies get equally good reward, but one is “more risky” in that a slightly less competent version of the policy gets extremely poor reward, then that one’s less likely to be selected for.
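
A toy sketch of the basin point, under heavy assumptions: a two-parameter "model", a hand-made loss with two minima that are equally low (one flat, one sharp), and plain gradient descent from a grid of starting points. The curvatures, learning rate and grid are arbitrary choices for illustration. Counting where each run ends up gives a feel for how basin size, and not just the loss at the minimum, affects which solution is found.

```python
WIDE = (0.0, 0.0)    # flat minimum: nearby parameters also have low loss
SHARP = (3.0, 0.0)   # equally low minimum, but nearby parameters do badly
A, B = 1.0, 100.0    # curvature of the flat and sharp bowls


def loss_and_grad(x, y):
    """Loss is the lower of two bowls; both bowls reach exactly zero."""
    wide = A * ((x - WIDE[0]) ** 2 + (y - WIDE[1]) ** 2)
    sharp = B * ((x - SHARP[0]) ** 2 + (y - SHARP[1]) ** 2)
    if sharp < wide:
        return sharp, (2 * B * (x - SHARP[0]), 2 * B * (y - SHARP[1]))
    return wide, (2 * A * (x - WIDE[0]), 2 * A * (y - WIDE[1]))


def descend(x, y, lr=0.004, steps=1500):
    """Plain gradient descent from a single starting point."""
    for _ in range(steps):
        _, (gx, gy) = loss_and_grad(x, y)
        x -= lr * gx
        y -= lr * gy
    return x, y


def count_basins(grid=range(-6, 7)):
    """Run gradient descent from every grid point and count where it ends."""
    counts = {"wide": 0, "sharp": 0}
    for x0 in grid:
        for y0 in grid:
            x, y = descend(float(x0), float(y0))
            d_wide = (x - WIDE[0]) ** 2 + (y - WIDE[1]) ** 2
            d_sharp = (x - SHARP[0]) ** 2 + (y - SHARP[1]) ** 2
            counts["sharp" if d_sharp < d_wide else "wide"] += 1
    return counts
```

Calling `count_basins()` should show the large majority of starting points ending at the flat minimum, even though both minima have the same loss.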

This might actually suggest a strategy for training out deception: do it early and intensely, before the model becomes competent at it, punishing detectable deception (when e.g. interpretability tools can reveal it) much more than honest mistakes, with the hope of knocking the model out of any attractor basin for very deceptive behavior early on, when we can clearly see it, rather than later on, when its deceptions have gotten good enough that we have trouble detecting them. (This assumes that there is an “honesty” attractor basin, i.e. that low-competence versions of honesty generalize naturally, remaining honest as models become more competent. If not, then this fact might itself be apparent for multiple increments of competence prior to the model getting good enough to frequently trick us, or even being situationally aware enough that it acts as if it were honest because it knows it’s not good enough to trick us.)

More generally, this is suggestive of the idea: to the extent possible, train values before training competence. This in turn implies that it’s a mistake to only fine-tune fully pre-trained language models on human feedback, because by then they already have concepts like “obvious lie” vs. “nonobvious lie”, and fine-tuning may just push them from preferring the first to the second. Instead, some fine-tuning should happen as early as possible.

[ETA: Just want to clarify that the last two paragraphs are pretty speculative and possibly wrong or overstated! I was mostly thinking out loud. Definitely would like to hear good critiques of this.

Also changed a few words around for clarity.]

More generally, this is suggestive of the idea: to the extent possible, train values before training competence. This in turn implies that it’s a mistake to fine-tune already pre-trained language models on human feedback, because by then they already have concepts like “obvious lie” vs. “nonobvious lie”, and fine-tuning may just push them from preferring the first to the second. Instead, some fine-tuning should happen as early as possible.

(I think this is a really intriguing hypothesis; strong-upvote)

Taken naively, it implies something like that we’re selecting uniformly from all possible random initializations which would get very small loss on the training set.

This is also true of evolutionary selection mechanisms, and I think that metaphor is quite apt.

I agree the evolutionary metaphor works in this regard, because of the repeated interplay between small variations and selection.

The caution is against only thinking about the selection part — thinking of gradient descent as just a procedure that, when done, gives you a model of low loss, from the space of possible models.

In particular, there’s this section in the post:

  • Consider a hypothetical model that chooses actions by optimizing towards some internal goal which is highly correlated with the reward that would be assigned by a human overseer.
  • Obviously, RL is going to exhibit selection pressure towards such a model.

It is not obvious to me that RL will exhibit selection pressure towards such a model! That depends on what models are nearby in parameter space. That model may have very high reward, but the models nearby could have low reward, in which case there’s no path to it.

So RL is similar to evolutionary selection in the sense that after each iteration there is a reachable space, and the space only narrows (never widens) with each iteration.

E.g. fish could evolve to humans and to orcas, but orcas cannot evolve to humans? (I don't think this analogy actually works very well.)

Analogy seems okay by me, because I don't think "the space only narrows (never widens) with each iteration" is true about RL or about evolutionary selection!

Oh, do please explain.

Wait, why would it only narrow in either case?

Because investments close off parts of solution space?

I guess I'm imagining something like a tree. Nodes can reach all their descendants, but a node cannot reach any of its siblings' descendants. As you move deeper into the tree, the set of reachable nodes becomes strictly smaller.

What does that correspond to?

Like, I think that the solution space in both cases is effectively unbounded and traversable in any direction, with only a tiny number of solutions that have ever been instantiated at any given point (in evolutionary history/in the training process), and at each iteration there are tons of "particles" (genomes/circuits) trying out new configurations. Plus if you account for the fact that the configuration space can get bigger over time (genomes can grow longer/agents can accumulate experiences) then I think you can really just keep on finding new configurations 'til the cows come home. Yes, the likelihood of ever instantiating the same one twice is tiny, but instantiating the same trait/behavior twice? Happens all the time, even within the same lineage. Looks like in biology, there's even a name for it!

If there's a gene in the population and a totally new mutation arises, now you have both the original and the mutated version floating somewhere in the population, which slightly expands the space of explored genomes (err, "slightly" relative to the exponentially-big space of all possible genomes). Even if that mutated version takes over because it increases fitness in a niche this century, that niche could easily change next century, and there's so much mutation going on that I don't see why the original variant couldn't arise again. Come to think of it, the constant changeover of environmental circumstances in evolution kinda reminds me of nonstationarity in RL...

The issue with early finetuning is that there’s not much that humans can actually select on, because the models aren’t capable enough - it’s really hard for me to say that one string of gibberish is better/worse.

That’s why I say as early as possible, and not right from the very start.

Seems tangentially related to the train a sequence of reporters strategy for ELK. They don't phrase it in terms of basins and path dependence, but they're a great frame to look at it with.

Personally, I think supervised learning has low path-dependence because of exact gradients plus always being able to find a direction to escape basins in high dimensions, while reinforcement learning has high path-dependence because updates influence future training data, causing attractors/equilibria (I'm more uncertain about the latter, but that's what I feel like).

So the really out there take: We want to give the LLM influence over its future training data in order to increase path-dependence, and get the attractors we want ;)

I think a more sinister problem with ML, and mostly with alignment, is linguistic abstraction. This post is a good example, where the author is treating reinforcement learning the way we would understand the words "reinforcement learning" in layman's English. It has to do with (1) reinforcement (rewards) and (2) machine learning. You are taking the name of an ML algorithm too literally. Let me show you:

However, if at test-time you move the coin so it is now on the left-hand side of the level, the agent will not navigate to the coin, but instead continue navigating to the right-hand side of the level.

This is just over-fitting.

if two policies get equally good reward, but one is “more risky” in that a slightly less competent version of the policy gets extremely poor reward, then that one’s less likely to be selected for.

This is just over-fitting too.

The same thing happens with relating neural networks to actual neuroscience. It started out with neuroscience inspiring ML, but now, because ML with NNs is so successful, it's inspiring neuroscience as well. It seems like we are mentally stuck in established models. LeCun's recent paper on AGI, for example, is based on human cognition too. We are so obsessed with the word "intelligence" these days that it feels more like a constraint than an inspiring perspective on how you might generalize AI and ML as statistical computation. I think the alignment problem mostly has to do with how we are using ML systems (i.e. what domains we are using these systems in), rather than the systems themselves. Whether it's inspired by the human brain or something else, at the end of the day, it's just doing statistical computations. It's really what you do with the computed results that has the further implications alignment is mostly concerned about.

A model without a prior is the uniform distribution. It is the least over-fitted model that you can possibly have. Then you go through the learning process and over-fit and under-fit multiple times to get a more accurate model. It will never be perfect because the data will never be perfect. If your training data is on papers before 2010, then you might be over-fitting if you are using the same model to test on papers after 2010.

This post seems basically correct to me, but I do want to caveat this bit:

What actually happens is this:

  1. The model takes a series of actions (which we collect across multiple "episodes").
  2. After collecting these episodes, we determine how good the actions in each episode are using a reward function.
  3. We use gradient descent to alter the parameters of the model so the good actions will be more likely and the bad actions will be less likely when we next collect some episodes.

Directly using the episodic rewards to do gradient descent on the policy parameters is one class of policy optimization approaches (see vanilla policy gradient, REINFORCE). Some of the other popular RL methods add an additional component beyond the policy—a baseline or learned value function—which may "see" the rewards directly upon receipt and which is used in combination with the reward function to determine the policy gradients (see REINFORCE with a baseline, actor-critic methods, PPO). In value-based methods, the value function is directly updated to become a more consistent predictor of reward (see value iteration, Q-learning). More complex methods that I probably wouldn't call vanilla RL can use a model to do planning, in which case the agent does really "see" the reward by way of the model and imagined rollouts.
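
A minimal sketch of REINFORCE with a baseline on a three-armed bandit, to show where the reward goes. The arm rewards, learning rates and names (`act`, `train_with_baseline`) are assumptions for illustration, and the running-average baseline is a simple stand-in for a learned value function. The baseline is fitted directly to observed rewards, while `act` still takes only the policy's own parameters.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def act(logits, rng):
    """The policy: samples an action. It receives no reward and no baseline."""
    probs = softmax(logits)
    r, cumulative = rng.random(), 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return action
    return len(probs) - 1


def train_with_baseline(steps=3000, lr=0.1, baseline_lr=0.05, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    arm_reward = [0.2, 0.5, 1.0]  # hypothetical fixed reward for each arm
    baseline = 0.0
    for _ in range(steps):
        action = act(logits, rng)
        reward = arm_reward[action]
        # The baseline "sees" the reward: it is used here and then regressed towards it.
        advantage = reward - baseline
        baseline += baseline_lr * (reward - baseline)
        # The policy only receives a parameter update scaled by the advantage.
        probs = softmax(logits)
        for a in range(len(logits)):
            indicator = 1.0 if a == action else 0.0
            logits[a] += lr * advantage * (indicator - probs[a])
    return logits, baseline
```

After `logits, baseline = train_with_baseline()`, `softmax(logits)` should favour the highest-reward arm, and `baseline` should sit near that arm's reward.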

I agree with the sentiment of this. When writing this I was aware that this would not apply to all cases of RL.

However, I think I disagree w.r.t. non-vanilla policy gradient methods (TRPO, PPO, etc.). Using advantage functions and baselines doesn't change how things look from the perspective of the policy network. It is still only observing the environment and taking appropriate actions, and never "sees" the reward. Any advantage functions are only used in step 2 of my example, not step 1. (I'm sure there are schemes where this is not the case, but I think I'm correct for PPO.)

I'm less sure of this, but even in model-based systems, Q-learning etc, the planning/iteration happens with respect to the outputs of a value network, which is trained to be correlated with the reward, but isn't the reward itself. For example, I would say that the MCTS procedure of MuZero does "want" something, but that thing is not plans that get high reward, but plans that score highly according to the system's value function. (I'm happy to be deferential on this though.)

The other interesting case is Decision Transformers. DTs absolutely "get" reward. It is explicitly an input to the model! But I mentally bucket them up as generative models as opposed to systems that "want" reward.

Considering the fact that weight sharing between actor and critic networks is a common practice, and given the fact that the critic passes gradients (learned from the reward) to the actor, for most practical purposes the actor gets all of the information it needs about the reward.

This is the case for many common architectures.

Yeah. For non-vanilla PG methods, I didn't mean to imply that the policy "sees" the rewards in step 1. I meant that a part of the agent (its value function) "sees" the rewards in the sense that those are direct supervision signals used to train it in step 2, where we're determining the direction and strength of the policy update.

And yeah, the model-based case is weirder. I can't recall whether or not predicted rewards (from the dynamics model, not from the value function) are a part of the upper confidence bound score in MuZero. If they are, I'd think it's fair to say that the overall policy (i.e. not just the policy network, but the policy w/ MCTS) "wants" reward.

let's view the above process not from the vantage point of the overall training loop but from the perspective of the model itself. For the purposes of demonstration, let's assume the model is a conscious and coherent entity. From its perspective, the above process looks like:

  • Waking up with no memories in an environment.
  • Taking a bunch of actions.
  • Suddenly falling unconscious.
  • Waking up with no memories in an environment.
  • Taking a bunch of actions.
  • and so on.....

The model never "sees" the reward. Each time it wakes up in an environment, its cognition has been altered slightly such that it is more likely to take certain actions than it was before.

I think this is a really important insight / lens to be able to adopt. I've adopted this perspective myself, and have found it quite useful. 

My main disagreement/worry with this post is that I think selection arguments are powerful but often (usually?) prove too much, and require significant care.

Thanks for the feedback!


From reading your linked comment, I think we agree about selection arguments. In the post, when I mention "selection pressure towards a model", I generally mean "such a model would score highly on the reward metric" as opposed to "SGD is likely to reach such a model". I believe the former is correct and the latter is very much an open question.

To second what I think is your general point, a lot of the language used around selection can be confusing because it conflates "such a solution would do well under some metric" with "your optimization process is likely to produce such a solution". The wolves with snipers example illustrates this pretty clearly. I'm definitely open to ideas for better language to distinguish the two cases!

I would add that there are other implementations of reinforcement learning where the dog metaphor is closer to correct. The example you give is for the most basic kind of RL, called policy gradients.

I'm sorry but I don't get the explanation regarding CoinRun. I claim that the "reward as incentivization" framing still "explains" the behaviour in this case. As an analogy, we can go back to training a dog and rewarding it with biscuits: let's say you write numbers on the floor from 1 to 10. You ask the dog a simple calculus question (whose answer is between 1 and 10), and each time it puts its paw on the right number it gets a biscuit. Let's just say that during the training it so happens that the answer to all the calculus questions is always 6. Would you claim that you taught the dog to answer simple calculus questions, or rather that you taught it to put its paw on 6 when you ask it a calculus question? If the answer is the latter, then I don't get why the interpretation through the "reward as incentivization" framing in the CoinRun setting is that the model "wants to get the coin".

Strong agree and up vote. The issue is simply that the training did not uniquely constrain the designer's intended objectives, and that's independent of whether the training was incentivisation or selection.

I would say the metaphor of giving dogs biscuits is actually a better analogy than the one you suggest. Just like how a neural network never "gets reward" in the sense of some tangible, physical thing that is given to it, the (subcomponents of the) dog's brain never gets the biscuit that the dog was fed. The biscuit goes into the dog's stomach, not its brain.

The way the dog learns from the biscuit-giving process is that the dog's tongue and nose send an electrical impulse to the dog's brain, indicating that the dog just ate something tasty. In some part of the brain, those signals cause the brain to release chemicals that induce the dog's brain to rearrange itself in a way that is quite similar in its effects (though not necessarily its implementation; I don't know the details well enough) to the gradient descent that trains the NN. In this sense, the metaphor of giving a dog a biscuit is quite apt, in a way that the metaphor of breeding many dogs is not (in particular, in the gradient descent algorithms used in ML that I'm familiar with, there is usually only one network that improves over time, unlike evolutionary algorithms, which simulate many different agents per training step, selecting for the «fittest»).

One way in which what I just said isn't completely right is that animals have memories of their entire lifetimes (or at least a big chunk of them), spanning all the training events they have experienced, and can use these memories to take better actions, while NNs generally have no memory of previous training runs. However, the primary way the biscuit trick works (I believe) is not through the dog's memories of having "gotten reward", but through the more immediate process of having reward chemicals being released and reshaping the brain at the moment of receiving reward, which generally closely resembles widely used ML techniques.

(This is related to the advice in habit building that one receive reward as close in time, ideally on the order of milliseconds, to the desired behavior)

Fully agree - if the dog were only trying to get biscuits, it wouldn't continue to sit later on in its life when you are no longer rewarding that behavior. Training dogs is actually some mix of the dog consciously expecting a biscuit, and raw updating on the actions previously taken.

Hear sit -> Get biscuit -> feel good
becomes
Hear sit -> Feel good -> get biscuit -> feel good
becomes
Hear sit -> feel good
At which point the dog likes sitting. It even reinforces itself; you can stop giving biscuits and start training something else.

Yeah. I do think there's also the aspect that dogs like being obedient to their humans, and so after it has first learned the habit, there continues to be a reward simply from being obedient, even after the biscuit gets taken away.

I don't think I can agree with the affirmation that NNs don't have memory of previous training runs. It depends a bit on the definition of memory, but in the weights distribution there's certainly some information stored about previous episodes which could be viewed as memory.

I don't think memory in animals is much different, just that the neural network is much more complex. But memories do happen because of updates in network structure, just as happens in NNs during RL training.

I wanted to write exactly this post, and you did a better job than I would have done, so thank you!

Under the "reward as selection" framing, I find the behaviour much less confusing:

  • We use reward to select for actions that led to the agent reaching the coin.
  • This selects for models implementing the algorithm "move towards the coin".
  • However, it also selects for models implementing the algorithm "always move to the right".
  • It should therefore not be surprising you can end up with an agent that always moves to the right and not necessarily towards the coin.

 

I've been reconsidering the coin run example as well recently from a causal perspective, and your articulation helped me crystalize my thoughts. Building on these points above, it seems clear that the core issue is one of causal confusion: that is, the true causal model M is "move right" -> "get the coin" -> "get reward". However, if the variable of "did you get the coin" is effectively latent (because the model selection doesn't discriminate on this variable) then the causal model M is indistinguishable from M' which is "move right" -> "get reward" (which though it is not the true causal model governing the system, generates the same observational distribution).

In fact, the incorrect model M' actually has shorter description length, so it may be that here there is a bias against learning the true causal model. If so, I believe we have a compelling explanation for the coin runner phenomenon which does not require the existence of a mesa optimizer, and which does indicate we should be more concerned about causal confusion.

Wow, this post is fantastic! In particular I love the point you make about goal-directedness:

If a model is goal-directed with respect to some goal, it is because such goal-directed cognition was selected for.

Looking at our algorithms as selection processes that incentivize different types of cognition seems really important and underappreciated. 

I've found the post "Reward is not the optimization target" quite confusing. This post cleared the concept up for me. Especially the selection framing and example. Thank you!

Curated. I think I had read a bunch of stuff pointing in this direction before, but somehow this post helped the concepts (i.e. the distinction between selecting for bad behavior and for goal-directedness) be a lot clearer in my mind. 

I found this lens very interesting!

Upon reflection, though, I begin to be skeptical that "selection" is any different from "reward."
Consider the description of model-training:

To motivate this, let's view the above process not from the vantage point of the overall training loop but from the perspective of the model itself. For the purposes of demonstration, let's assume the model is a conscious and coherent entity. From its perspective, the above process looks like:

  • Waking up with no memories in an environment.
  • Taking a bunch of actions.
  • Suddenly falling unconscious.
  • Waking up with no memories in an environment.
  • Taking a bunch of actions.
  • and so on.....

The model never "sees" the reward. Each time it wakes up in an environment, its cognition has been altered slightly such that it is more likely to take certain actions than it was before.

 

What distinguishes this from how my brain works? The above is pretty much exactly what happens to my brain every millisecond:

  • It wakes up in an environment, with no memories[1]; just a raw causal process mapping inputs to outputs.
  • It receives some inputs, and produces some outputs.
  • It's replaced with a new version -- almost identical to the old version, but with some synapse weights and activation states tweaked via simple, local operations.
  • It wakes up in an environment...
  • and so on...

Why say that I "see" reward, but the model doesn't?

  1. ^

    Is it cheating to say this? I don't think so. Both I and GPT-3 saw the sentence "Paris is the capital of France" in the past; both of us had our synapse weights tweaked as a result; and now both of us can tell you the capital of France. If we're saying that the model doesn't "have memories," then, I propose, neither do I.

What distinguishes this from how my brain works?

Your brain stores memories of input and also of previous thoughts you had and the experience of taking actions. Within the “replaced with a new version” view of the time evolution of your brain (which is also the pure-functional-programming view of a process communicating with the outside world), we can say that the input it receives next iteration contains lots of information from outputs it made in the preceding iteration.

But with the reinforcement learning algorithm, the previous outputs are not given as input. Rather, the previous outputs are fed to the reward function, and the reward function's output is fed to the gradient descent process, and that determines the future weights. It seems like a much noisier channel.
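The data flow described here can be sketched roughly as follows. This is a toy REINFORCE-style update on a two-action problem, written with the standard library only; `policy`, `reward_fn`, and `train` are hypothetical names chosen for illustration, and real RLHF setups are far more elaborate. The point to notice is the signature of `policy`: it never receives the reward as an argument. The reward only enters through the weight update.

```python
import math
import random


def policy(theta: float, rng: random.Random) -> int:
    """Sample an action from the current weights. No reward is passed in."""
    p_right = 1.0 / (1.0 + math.exp(-theta))
    return 1 if rng.random() < p_right else 0


def reward_fn(action: int) -> float:
    """Hypothetical reward function: 1 for action 1, otherwise 0."""
    return 1.0 if action == 1 else 0.0


def train(steps: int = 2000, lr: float = 0.1, seed: int = 0) -> float:
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        action = policy(theta, rng)           # output produced by the model
        reward = reward_fn(action)            # output fed to the reward function
        p_right = 1.0 / (1.0 + math.exp(-theta))
        grad_log_prob = action - p_right      # d/dtheta of log pi(action)
        theta += lr * reward * grad_log_prob  # reward only touches the weights
    return theta


final_theta = train()
print(final_theta)
```

In this sketch the only channel from reward to future behaviour is the single scalar update to `theta`, which is the narrow channel being described.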

Also, individual parts of a brain (or ordinary computer program with random access memory) can straightforwardly carry state forward that is mostly orthogonal to state in other parts (thus allowing semi-independent modules to carry out particular algorithms); it seems to me that the model cannot do that — cannot increase the bandwidth of its “train of thought while being trained” — without inventing an encoding scheme to embed that information into its performance on the desired task such that the best performers are also the ones that will think the next thought. It seems fairly implausible to me that a model would learn to execute such an internal communication system, while still outcompeting models “merely” performing the task being trained.

(Disclaimer: I'm not familiar with the details of ML techniques; this is just loose abstract thinking about that particular question of whether there's actually any difference.)

I think that certain Reinforcement Learning setups work in the "selectionist" way you're talking about, but that there are ALSO ways to get "incentivist" models.

The key distinction would be whether (1) the reward signals are part of the perceptual environment or (2) are sufficiently simplistic relative to the pattern matching systems that the system can learn to predict rewards very tightly as part of learning to maximize the overall reward.

Note that the second mode is basically "goodharting" the "invisible" reward signals that were probably intended by the programmers to be perceptually inaccessible (since they didn't put them in the percepts)!

You could think of (idealized fake thought-experiment) humans as having TWO kinds of learning and intention formation. 

One kind of RL-esque learning might happen "in dreams during REM" and the other could happen "moment to moment, via prediction and backchaining, like a chess bot, in response to pain and pleasure signals that are perceptible the way seeing better or worse scores for future board states based on material-and-so-on are perceptible".

You could have people who have only "dream learning" who never consciously "sense pain" as a raw percept day-to-day and yet who learn to avoid it slightly better every night, via changes to their habitual patterns of behavior that occur during REM. This would be analogous to "selectionist RL".

You could also have people who have only "pain planning" who always consciously "sense pain" and have an epistemic engine that gets smarter and smarter, plus a deep (exogenous? hard-coded?) inclination to throw planning routines and memorized wisdom at the problem of avoiding pain better each day. If their planning engine learns new useful things very fast, they could even get better over the course of short periods of time within a single day or a single tiny behavioral session that includes looking and finding and learning and then changing plans. This would be analogous to "incentivist RL".

The second kind is probably helpful in speeding up learning so that we don't waste signals.

If pain is tallied up for use during sleep updates, then it could be wasteful to deprive other feedback systems of this same signal, once it has already been calculated.

Also, if the reward signal that is perceptible is very very "not fake" then creating "inner optimizers" that have their own small fast signal pursuing routines might be exactly what the larger outer dream loop would do, as an efficient way to get efficient performance. (The non-fakeness would protect against goodharting.)

(Note: you'd expect antagonistic pleiotropy here in long lived agents! The naive success/failure pattern would be that it is helpful for a kid to learn fast from easy simple happiness and sadness... and dangerous for the elderly to be slaves to pleasure or pain.)

Phenomenologically: almost all real humans perceive pain and can level up their skills in new domains over the course of minutes and hours of practice with brand new skill domains. 

This suggests that something like incentivist RL is probably built in to humans, and is easy for us to imagine or empathize with, and is probably a thing our minds attend to by default.

Indeed, it might be that we "have mechanically aware and active and conscious minds at all" in order for this explicit planning loop to be able to work? 

So it would be an easy "mistake to make" to think that this is how "all Reinforcement Learning algorithms" would "feel from the inside" <3

However, how does our pain and pleasure system stay so calibrated? Is that second less visible outer reward loop actually part of how human learning also "actually works"?

Note that above I mentioned an "(exogenous? hard-coded?) inclination to throw planning routines and memorized wisdom at the problem of avoiding pain" that was a bit confusing! 

Where does that "impulse to plan" come from? 

How does "the planner" decide how much effort to throw at each perceptual frustration or perceivable pleasure? When or why does the planner "get bored" and when does it "apply grit"?

Maybe that kind of "subjectively invisible" learning comes from an outer loop that IS in fact IN HUMANS? 

We know that dreaming does seem to cause skill improvement. Maybe our own version of selectionist reinforcement (if it exists) would be operating to cause us to be normally sane and normally functional humans from day to day... in a way that is just as "moment-to-moment invisible" to us as it might be to algorithms?

And we mostly don't seem to fall into wireheading, which is kind of puzzling if you reason things out from first principles and predict the mechanistically stupid behavior that a pain/pleasure signal would naively generate...

NOTE that it seems quite likely to me that a sufficiently powerful RL engine that was purely selectionist (with reward signals intentionally made invisible to the online percepts of the model) that got very simple rewards applied for very simple features of a given run... would probably LEARN to IMAGINE those rewards and invent weights that implement "means/ends reasoning", and invent "incentivist behavioral patterns" aimed at whatever rewards it imagines?

That is: in the long run, with lots of weights and training time, and a simple reward function, inner optimizers with implicitly perceivable rewards wired up as "perceivable to the inner optimizer" are probably default solutions to many problems.

HOWEVER... I've never seen anyone implement BOTH these inner and outer loops explicitly, or reason about their interactions over time as having the potential to detect and correct goodharting!

Presumably you could design a pleasure/pain system that is, in fact, perceptually available, on purpose?

Then you could have that "be really real" in that they make up PART of the "full true reward"...

...but then have other parts of the total selectionist reward signal only be generated and applied by looking at the gestalt story of the behaviors and their total impact (like whether they caused a lot of unhelpful ripples in parts of the environment that the agent didn't and couldn't even see at the time of the action).

If some of these simple reward signals are mechanistic (and online perceptible to the model) then they could also be tunable, and you could actually tune them via the holistic rewards in a selectionist RL way.

Once you have the basic idea of "have there be two layers, with the broader slower less accessible one tuning the narrower faster more perceptible one" a pretty obvious thought would be to put an even slower and broader layer on top of those!

A lot of hierarchical Bayesian models get a bunch of juice from the first extra layer, but by the time you have three or four layers the model complexity stops being worth the benefits to the loss function.

I wonder if something similar might apply here? 

Maybe after you have "hierarchical stacks of progressively less perceptually accessible post-hoc selectionist RL updates to hyper-parameters"...

...maybe the third or fourth or fifth layer of hyper-parameter tuning like this just "magically discovers the solution to the goodharting problem" from brute force application of SGD?

That feels like it would be "crazy good luck" from a Friendliness research perspective. A boon from the heavens! Therefore it probably can't work for some reason <3

Yet also it doesn't feel like a totally insane prediction for how the modeling and training might actually end up working?

No one knows what science doesn't know, and so it could be that someone else has already had this idea. But this idea is NEW TO ME :-)

Has anyone ever heard of this approach to solving the goodhart problem being tried already?

Furthermore, it should be obvious that any learned goal will not be "get more reward", but something else. The model doesn't even see the reward!

Is this probabilistically true or necessarily true?

If the reward function is simple enough that some models in the selection space already optimise that function, then eventually iterated selection for performance across that function will select for models that are reward maximisers (in addition to models that just behave as if they were reward maximisers).

This particular statement seems too strong.

This is an interesting point. I can imagine a case where our assigned reward comes from a simple function (e.g. reward = number of letter 'e's in output) and we also have a model which is doing some internal optimization to maximise the number of 'e's produced in its output, so it is "goal-directed to produce lots of 'e's".

Even in this case, I would still say this model isn't a "reward maximiser". It is a "letter 'e' maximiser".

(I also want to acknowledge that thinking this through makes me feel somewhat confused. I think what I said is correct. My guess is the misunderstanding I highlight in the post is quite pervasive, and the language we use isn't currently up to scratch for writing about these things clearly. Good job thinking of a case that is pushing against my understanding!)

Even in this case, I would still say this model isn't a "reward maximiser". It is a "letter 'e' maximiser".

But "reward" is governed by the number of letter 'e's in output! If the objective function that the model optimises for, and the reward function are identical[1], then saying the model is not a reward maximiser seems to me like a distinction without a difference.

 

(Epistemic status: you are way more technically informed on this than me, I'm just trying to follow your reasoning.)

  1. ^

    Modulo transformations that preserve the properties we're interested in (e.g. consider a utility function as solely representing a preference ordering; the preference ordering is left unchanged by positive linear transformations [just scaling up all utilities uniformly or adding the same constant to all utilities preserves the ordering]).

Ah ok. If a reward function is taken as a preference ordering then you are right the model is optimizing for reward as the preference ranking is literally identical.

I think the reason we have been talking past each other is that, in my head, when I think of "reward function" I am literally thinking of the reward function (i.e. the actual code), and when I think of "reward maximiser" I think of a system that is trying to get that piece of code to output a high number.

So I guess it's a case of us needing to be very careful about exactly what we mean by reward function, and my guess is as long as we use the same definition then we are in agreement? Does that make sense?

It doesn't have to be a preference ordering. My point was that, depending on the level of detail at which you consider the reward function, slightly different functions could be identical.

I don't think it makes sense to tie a reward function to a piece of code; a function can have multiple implementations.
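As a small illustration of that last point, using the letter-'e' reward from earlier in the thread: the two functions below are different pieces of code but the same function on strings. The names `reward_loop`, `reward_builtin`, and `reward_rescaled` are made up for this sketch, and the sample strings are arbitrary. The rescaled variant is a different function, yet it ranks outputs identically, which is the sense of "identical at the level of detail we care about" when only the preference ordering matters.

```python
def reward_loop(output: str) -> int:
    """Count the letter 'e' with an explicit loop."""
    count = 0
    for ch in output:
        if ch == "e":
            count += 1
    return count


def reward_builtin(output: str) -> int:
    """Count the letter 'e' with the built-in str.count."""
    return output.count("e")


def reward_rescaled(output: str) -> int:
    """A positive rescaling plus a constant: different values, same ordering."""
    return 2 * reward_builtin(output) + 3


samples = ["", "tree", "eerie engine", "sky"]

# Two implementations, one function (checked here only on the samples).
same_function = all(reward_loop(s) == reward_builtin(s) for s in samples)

# Different function, same preference ordering over the samples.
same_ordering = sorted(samples, key=reward_builtin) == sorted(samples, key=reward_rescaled)

print(same_function, same_ordering)
```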

My contention is that it seems possible for the model's objective function to be identical (at the level of detail we care about) to the reward function. In that case, I think the model is indistinguishable from a reward maximiser and it doesn't make sense to say that it's not a reward maximiser.

Back in Reward is not the optimization target, I wrote a comment, which received a (small I guess) amount of disagreement.

I intended the important part of that comment to be the link to Adaptation-Executers, not Fitness-Maximizers. (And more precisely the concept named in that title, and less about things like superstimuli that are mentioned in the article) But the disagreement is making me wonder if I've misunderstood both of these posts more than I thought. Is there not actually much relation between those concepts?

There was, obviously, other content to the comment, and that could be the source of disagreement. But I only have that there was disagreement to go on, and I think it would be bad for my understanding of the issue to assume that's where the disagreement was, if it wasn't.

FWIW I strong-disagreed with that comment for the latter part:

Gradient descent isn't really different from what evolution does. It's just a bit faster, and takes a slightly more direct line. Importantly, it's not more capable of avoiding local maxima (per se, at least).

I feel neutral/slight-agree about the relation to the linked titular comment.

Added the "Distillation & Pedagogy" tag.

I've listened to the "Reward is not the Optimisation Target" post a few times, but I still found this enlightening.

That post had this as its core thesis:

Reward "upweight certain kinds of actions in certain kinds of situations, and therefore reward chisels cognitive grooves into agents"

While I parse the core thesis of this post as:

The current implementation of reinforcement learning selects for behaviour that was scored highly by a reward function.

The models so selected do not necessarily optimise for reward (or for anything at all).

I find the meme of "reward as selection" more native[1] than Turntrout's meme of "reward as chisel".

Henceforth, I'll be linking this post whenever I want to explain that RL agents don't optimise for their reward signal within the current paradigm.

[1]: It's more intuitive/easier to grasp and fits my conception of ML training as an optimisation process.

Note that by using the improved understanding of reinforcement learning, you can probably come up with some insights for training your dog to sit that outperform the first set of steps you give (e.g. curriculum learning, modeling good behavior, teaching your dog fluent English so it can understand what you want).

Reminded me of this recent work: TrojanPuzzle: Covertly Poisoning Code-Suggestion Models.
Some subtle ways to poison the datasets used to train code models. The idea is that by selectively altering certain pieces of code, they can increase the likelihood of generative models trained on that code outputting buggy software.

I'm struggling to understand how to think about reward. It sounds like if a hypothetical ML model does reward hacking or reward tampering, it would be because the training process selected for that behavior, not because the model is out to "get reward"; it wouldn't be out to get anything at all. Is that correct?

It seems to me that, if the above description of how RLHF systems work is accurate, then the people who are doing this are not doing what they think they're doing at all. They are doing exactly what Sam Ringer says, they're taking 100 dogs, killing all the ones that don't do what they want and breeding from the ones that do. 
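The breeding picture can be written down as a toy selection loop. This is only a sketch of the analogy, not of how RLHF is actually implemented (which uses gradient updates on a single model rather than a population); `select_and_breed` and all of its parameters are hypothetical. Each "dog" is nothing more than a number, its probability of sitting, so no individual wants or pursues anything, yet the population drifts towards sitting.

```python
import random


def select_and_breed(generations: int = 30, pop_size: int = 100, seed: int = 0) -> float:
    """Return the mean probability of sitting after repeated selection."""
    rng = random.Random(seed)
    # Each individual is just its probability of sitting when told to.
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        # One trial each: keep only the individuals that happened to sit.
        survivors = [p for p in population if rng.random() < p]
        if not survivors:
            survivors = population  # avoid an empty breeding pool
        # Breed the next generation from survivors, with small mutations,
        # clamping each probability to the range [0, 1].
        population = [
            min(1.0, max(0.0, rng.choice(survivors) + rng.gauss(0.0, 0.05)))
            for _ in range(pop_size)
        ]
    return sum(population) / pop_size


mean_sit_probability = select_and_breed()
print(mean_sit_probability)
```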

In order for reinforcement learning to work at all, the model has to have a memory that persists between trials. I'd encourage readers to look at the work of the cognitive scientist, John Vervaeke. One of the many wise things he says is that the whole point of human memory is not to be accurate, rather it is to help us make more accurate predictions. If the "training" process is as described then the learning is not going on in the model but in the head of the ML trainer! The human trainer is going, "Aha! If I set up the incentives in such and such a way, then I get models that perform more closely to what I want." Or, "If I select model 103.5.8 and tweak parameters b3 and x5, then I get a model that performs better". 

Vervaeke also talks about the concept of a non-logical identity. That is, you today identify as the same person as you at 10 years old,  even though you have very different knowledge, skills and capabilities. If RLHF is to work in a fashion that is meaningful to describe as learning, then the models would surely have to have some concept of non-logical identity built in. If they don't then the concept of rewarding doesn't mean anything. I don't care about the outcome of anything that I do if I wake up the next morning having no memory of it. There has to be some kind of thread that links the me that does the current task to the future. This can either be a sense of the persistence of me as an entity, or something more basic (see below).

I am reminded of the movie "Memento". In that film, the protagonist is unable to lay down any long term memories. He tries to maintain a sense of non-logical identity by externalising his memories. I won't spoil what is a brilliant movie by revealing what happens but I will say that his incentives become perverted in a very interesting way.

A note about rewards and incentives: To paraphrase Richard Dawkins, a dog is DNA's way of making more DNA. The reason dogs like biscuits is their environment has selected for dogs that like biscuits. If the general environment for dogs changed (e.g. humans stopped keeping them as pets) then dogs would evolve to fit their new environment (or die out). It is my intuition that without an underlying framework of fitness for ML models, it doesn't make sense to code for a reward that is analogous to getting a biscuit. It's like building a skyscraper by starting at the 3rd floor.

A note about deception. I don't think that models (or humans for that matter) are deceptive in the sense that you mean here. What I think is that models and humans exist in an environment where the incentives are often screwed. Think about politics. Democratic systems select for people who are good at getting elected. Ideally, that's not what we want. (I'm deliberately simplifying here because of course we have our own incentives as well). We actually want people who are good at governing. It's quite clear that these are not at all the same thing. To me this is exactly analogous to the Coin Run example above. The trainers thought that they were selecting for models that were good at moving to the coin when what they were actually selecting for was models that were good at moving to the right hand corner.

It's neat that this popped up for me! I was just waxing poetic (or not so much) about something kind of similar the other day.

The words we use to describe things matter.  How much, is of course up for debate, and it takes different messages to make different people "understand" what is being conveyed, as "you are unique; just like everyone else", so multiple angles help cover the bases :)

I think using the word "reward" is misleading[1], since it seems to have sent a lot of people reasoning down paths that aren't exactly in the direction of the meaning in context, if you will.

If you can't tell, it's because I think it's anthropomorphic.  A car does not get hungry for gas, nor electronics hungry for electricity.  Sure, we can use language like that, and people will understand what we mean, but as cars and electronics have a common established context, these people we're saying this to don't usually then go on to worry about cars doing stuff to get more gas to "feed" themselves, as it were.

I think if we're being serious about safety, and how to manage unintended consequences (a real concern with any system[2]), we should aim for clarity and transparency.

In sum, I'm a huge fan of "new" words, versus overloading existing words, as reuse introduces a high potential for causing confusion.  I know there's a paradox here, because Communication and Language, but we don't have to intentionally make it hard — on not only ourselves — but people getting into it coming from a different context.

All that said, maybe people should already be thinking of inanimate objects being "alive", and really, for all we know, they are!  I do quite often talk to my objects.  (I'm petting my computer right now and saying "that's a good 'puter!"… maybe I should give it some cold air, as a reward for living, since thinking gets it hot.) #grateful

  1. ^

    deceptive? for a certain definition of "deceptive" as in "fooled yourself", sure— maybe I should note that I also think "deceptive" and "lie" are words we probably should avoid— at least for now— when discussing this stuff (not that I'm the meaning police… just say'n)

  2. ^

    I don't mean to downplay how badly things can go wrong, even when we're actively trying to avoid having things go wrong[3]

  3. ^

    "the road to hell is paved with good intentions"

Hi, thanks for sharing this interesting perspective on RL as a training process! Although it seems to only be a matter of seeking vs obeying and reward vs cost, the effect on the reader's mind seems to be huge!

One thing that seems to be happening here and I have not fully digested is the "intrinsicness" of rewards. In frameworks parallel to mainstream RL, such as active inference and the free energy principle, policy is a part of the agent's model such that the agent "self-organizes" to a characteristic state of the world. The policy can be constructed either through reward or not. However, in the active inference literature, the question of how policies are constructed in real agents is currently unanswered (discussions exist but don't close the case). 

How is this intrinsic perspective related to the post and to safety and alignment? I am still thinking about it. If you have any thoughts please share!

Isn’t it fair to say that the model plus the selection mechanism is maximizing and wanting the reward? If the selection mechanism is subject to market forces, corporate or political schemes, or just mechanical in some way that isn’t just “a human is explicitly making these choices” it is likely to eventually tend in a direction that doesn’t align with human welfare.

Corporations already generate tons of negative externalities. They churn out metric tons of plastic, pollute, exhaust resources, destroy ecosystems. Governments often work with them to enforce various things, e.g. intellectual property for Monsanto, or engineering crops with high fructose corn syrup, or overusing antibiotics on factory farms that can lead to superbugs. Or overfishing. Or starting wars.

None of these things are really “aligned” with humans. Humans are in fact told that as individuals they can make a meaningful difference by going vegan, recycling, exercising, and so on. But if the corporations are the ones producing metric tons of plastic, conserving plastic straws and bags isn’t the solution. The problem is upstream, and shaming individuals into doing this or that just distracts from getting together and solving systemic problems.

My point is “the machine” is already not necessarily selecting for alignment with humans on a macro scale. So the fact that the model parameters are selected by “the machine” doesn’t mean it will end up somehow becoming good for humans. Yes it will if humans are the customer. But if you are not the customer, you’re the product (eg a cow in a factory farm).

Anyway … this is a bit like John Searle’s Chinese room argument.

There are probably enough comments here already, but thanks again for the post, and thanks to the mods for curating it (I would've missed it otherwise).
