In the context of actor-critic model-based RL agents in general, and brain-like AGI in particular, part of the source code is a reward function. The programmers get to put whatever code they want into the reward function slot, and this decision will have an outsized effect on what the AGI winds up wanting to do.

One thing you could do is: hook up the reward function to a physical button on a wall. And then you’ll generally (see caveats below) get an AGI that wants you to press the button.

A nice intuition is addictive drugs. If you give a person or animal a few hits of an addictive drug, then they’re going to want it again. The reward button is just like that—it’s an addictive drug for the AGI.

And then what? Easy, just tell the AGI, “Hey AGI, I’ll press the button if you do (blah)”—write a working app, earn a million dollars, improve the solar cell design, you name it. And then ideally, the AGI will try very hard to do (blah), again because it wants you to press the button.

So, that’s a plan! You might be thinking: this is a very bad plan! Among other issues (see below), the AGI will want to seize control of the reward button, and wipe out humanity to prevent them from turning it off, if that’s possible. And it will surely be possible as we keep going down the road towards superintelligence.

…And you would be right! Reward button alignment is indeed a terrible plan.

So why am I writing a blog post about it? Three reasons:

  • People will be tempted to try “reward button alignment”, and other things in that genre, even if you and I think they’re very bad plans. So we should understand the consequences in detail—how far can it go before it fails, and then what exactly goes wrong?
  • It’s not quite as obviously terrible a plan as it sounds, if it’s not the whole plan by itself but rather one stage of a bootstrapping / “AI Control”-ish approach.
  • It’s a nice case study that illuminates some still-controversial issues in AGI alignment.

So here is my little analysis, in the form of an FAQ.

1. Can we actually get “reward button alignment”, or will there be inner misalignment issues?

In other words, will the AGI actually want you to push the button? Or would it want some random weird thing because inner alignment is hard?

My answer is: yes, it would want you to push the button, at least if we’re talking about brain-like AGI, and if you set things up correctly.

Again, getting a brain-like AGI addicted to a reward button is a lot like getting a human or animal hooked on an addictive drug.

Some details matter here:

  • You need to actually give the AGI a few “hits” of the reward button; it’s not enough to just tell the AGI that the reward button exists. By the same token, humans get addicted to drugs by trying them, not by learning of their existence. Algorithmically, the “hits” will update the value function, which feeds into planning and motivation (more in §9.4–5 here).
  • If there are meanwhile other contributions to the reward function besides the reward button, e.g. a curiosity drive, then you want to make sure they’re sufficiently smaller (less rewarding) than the reward button, so as not to outvote it. Or perhaps you could just turn those other competing sources-of-motivation off altogether (and zero out the value function) before starting with the reward button, to be safe. (This need not tank the AGI’s capabilities, for reasons here.)
  • The AGI needs to be able to actually see that you’re pressing the reward button. Give it a video feed, maybe. Then it will have a sensory / semantic input that reliably immediately precedes the primary reward, and thus “credit assignment” will make that button-press situation seem very good and motivating to the AGI (see the toy sketch just after this list).
  • You want to do all this after the AGI already has a pretty good understanding of the world, e.g. from a sandbox training environment with YouTube videos and so on.
  • You should establish a track record of actually pressing the reward button under the conditions that you say you will, so that the AGI trusts you.
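Here’s a minimal toy sketch of what the bullets above amount to, mechanically. (This is obviously not a real AGI architecture, just a tiny linear critic with made-up features, so treat it as purely illustrative.)

```python
import numpy as np

# Toy setup: observations are tiny feature vectors; the critic ("value
# function") is linear; the button is the only nonzero reward source, since
# the competing drives have been zeroed out per the second bullet above.
N_FEATURES = 4
BUTTON_FEATURE = 0   # 1.0 when the AGI sees the button being pressed (video feed)

def reward_fn(features):
    curiosity_bonus = 0.0                             # competing reward sources switched off
    button_reward = 10.0 * features[BUTTON_FEATURE]   # the "hit"
    return button_reward + curiosity_bonus

w = np.zeros(N_FEATURES)   # linear value function ("critic"), zeroed out at the start

def value(features):
    return float(w @ features)

def td_update(features, reward, next_features, lr=0.1, gamma=0.99):
    """TD(0): credit flows to whatever features were active just before the reward."""
    global w
    td_error = reward + gamma * value(next_features) - value(features)
    w += lr * td_error * features
    return td_error

# A few "hits" of the button: the button-visible situation quickly becomes
# the highest-valued thing the agent knows about.
see_button_pressed = np.array([1.0, 0.0, 0.0, 1.0])
a_moment_later     = np.array([0.0, 0.0, 0.0, 1.0])
for _ in range(5):
    td_update(see_button_pressed, reward_fn(see_button_pressed), a_moment_later)
print(value(see_button_pressed))   # grows with each hit
```

The point is just that there’s no mystery in the “addiction” step: with competing drives zeroed out, a handful of button presses is enough for the critic to assign high value to the button-press situation, and planning then inherits that valuation.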

2. Seriously? Everyone keeps saying that inner alignment is hard. Why are you so optimistic here?

I do think inner alignment is hard! But this is inner alignment on “easy mode”, for two reasons.

First, as mentioned above, the pressing of a reward button is a salient real-world event that reliably and immediately precedes the ground-truth reward. So credit assignment will have an easy time latching onto it. Compare that with the more abstract targets-of-desire that we might want for a superintelligence—things like “a desire to be helpful”, or “a desire for Coherent Extrapolated Volition”, or “a desire for practical whole brain emulation technology”.

Second, as discussed below, I’m assuming sufficiently good security around the reward button and human-manipulation, such that the AGI doesn’t have any feasible out-of-the-box strategies to satisfy its desires. In other words, I’m assuming away a whole host of issues where the AGI’s strategies shift as a result of learning new things, having new ideas, or inventing new technology—more on that in my post “Sharp Left Turn” discourse: An opinionated review.

Again, what we really care about is alignment of more powerful AGIs, AGIs that can brainwash people and seize reward buttons, and where we want to make those AGIs feel intrinsic motivation towards goals that are more abstract or novel. That’s a much harder problem.

3. Can we extract any useful work from a reward-button-aligned AGI?

Sure! (At least, up until the point where it can manipulate us into pressing the button, or seize control in other ways.)

We do need to be mindful that the AGI will only care about us pressing the button, not about our long-term best interest. But that’s enough for a huge amount of “useful work”. For example, suppose I tell the AGI:

Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!

And let’s assume the AI has reason to believe me (e.g. because I’ve been true to my word in the past), and is purely motivated by the reward button. But also assume that the AI is not yet capable of brainwashing me, or stealing my button, or kidnapping my children, or sending off a copy to self-reproduce around the internet and gather power and resources to use for eventually brainwashing me and kidnapping my children etc.

(I guess that last part is rather implausible if it can already autonomously make $1B, but maybe we have taken precautions against brainwashing and button-theft and so on (see below), or else you can substitute a less ambitious project, like making a successful app or whatever.)

This doesn’t require that I have any particular skill at overseeing the AGI. It just requires that I can hire competent lawyers and accountants for pretty basic due diligence. The extra year will also allow for law enforcement or public outcry or whatever to surface any subtle or sneaky problems caused by the AGI.

So that’s a way to automate the economy and make trillions of dollars. Until catastrophic AGI takeover, of course.

4. Can “reward button alignment” be used as the first stage of a bootstrapping / “AI control”-ish plan?

In other words, if we have this “reward button alignment” band-aid, one thing we could do is ask the AGI to come up with a better plan for technical alignment. And we promise to slam the reward button if we’re happy with the results. Then (one hopes) the band-aid doesn’t need to hold forever. It just has to hold long enough to get a better technical alignment plan.

The two things that could go wrong are:

(1) The AGI could output a plan that convinces me, but won’t actually work. Remember that we already have people strongly disagreeing about what constitutes a good plan for alignment (e.g. Eliezer Yudkowsky versus Paul Christiano), and it’s just very hard for third parties to figure out what is true. This problem was the theme of @johnswentworth’s post “The Case Against AI Control Research”.

(2) The AGI could find that it is easier to go rogue and brainwash me or seize the reward button, than to actually figure out technical alignment.

We can try to mitigate (2) by using security measures to delay escape and hijinks for as long as possible (see below).

And what about (1)? Well one thing is: We could use AGI debate! We could set up two copies of the AGI, and tell one to make a plan and the other to say why it’s a bad plan. We’ll slam the reward button for one or the other, depending on who is more convincing.
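Here’s a rough sketch of that debate protocol, just to be concrete. (The “agents” here are bare callables with made-up names; the point is only the structure: two copies argue, a human judges, and the reward button gets pressed for whichever copy was more convincing.)

```python
def run_debate_round(propose, critique, human_judge, press_reward_button):
    plan = propose("produce a technical alignment plan")
    objection = critique(plan)              # the second copy argues the plan is bad
    winner = human_judge(plan, objection)   # "proposer" or "critic"
    press_reward_button(winner)             # only the more convincing copy gets its reward
    return plan, objection, winner

# Toy usage with trivial stand-ins:
run_debate_round(
    propose=lambda task: f"Plan for: {task}",
    critique=lambda plan: f"Objection to: {plan}",
    human_judge=lambda plan, objection: "proposer",
    press_reward_button=lambda winner: print(f"reward button pressed for {winner}"),
)
```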

This is not a crazy idea, and indeed when I think about debate in this context, it seems much more interesting-to-me than it used to.

So, this kind of bootstrapping approach is not a completely crazy idea. I still think it won’t work though. I just think solving the technical alignment problem (and communicating the solution to a human) is harder than subverting the setup (e.g. convincing a person to let it onto the internet). In the context of AGI debate, recall that both of the AGI debaters (and the AGI judge, if applicable) have a strong shared interest in subverting the setup and getting control of the reward buttons, and thus will try to cooperate to make that happen.

But, I suppose it’s good that people are brainstorming in this area.

5. Wouldn’t it require zillions of repetitions to “train” the AGI to seek reward button presses?

No. I acknowledge that AlphaZero needs the reward function to trigger millions of times before it gets very good at Go. But, well, so much the worse for AlphaZero! That’s not a universal limitation on RL.

Humans did not need millions of attempts to go to the moon; they just needed an expectation that good things would happen if they went to the moon (social status, job security, whatever), and then the humans systematically figured out how to do it.

Or for a more everyday example, I only need to taste a delicious restaurant meal once, before wielding great skill towards going back to that same restaurant later—including booking a reservation, figuring out what bus to take during road construction, and so on.

Thus, just a few hits of the reward button should suffice to make a brain-like AGI feel strongly motivated to get more hits, including via skillful long-term planning and means-end reasoning.
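To spell out why a single “hit” is enough, here’s a toy illustration (a made-up miniature world model, not a real agent): once one state has high value, a model-based planner can immediately chain a multi-step plan toward it, with no further reward signals needed.

```python
world_model = {   # state -> {action: next_state}, learned before any reward arrives
    "home":            {"book_table": "reservation", "walk": "bus_stop"},
    "reservation":     {"walk": "bus_stop"},
    "bus_stop":        {"ride_bus": "restaurant_door"},
    "restaurant_door": {"enter": "eating_great_meal"},
}

value = {s: 0.0 for s in list(world_model) + ["eating_great_meal"]}
value["eating_great_meal"] = 10.0   # the one "hit": a single update after one great meal

def best_plan(start, horizon=4):
    """Brute-force search over action sequences, picking the one that ends in the highest-value state."""
    best_value, best_actions = value[start], []
    frontier = [(start, [])]
    for _ in range(horizon):
        next_frontier = []
        for state, actions in frontier:
            for action, nxt in world_model.get(state, {}).items():
                next_frontier.append((nxt, actions + [action]))
                if value[nxt] > best_value:
                    best_value, best_actions = value[nxt], actions + [action]
        frontier = next_frontier
    return best_value, best_actions

print(best_plan("home"))   # (10.0, ['walk', 'ride_bus', 'enter'])
```

No further tastings of the meal are needed; the planner strings together the bus ride and the reservation purely from the world model plus that one valuation.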

6. If the AGI subverts the setup and gets power, what would it actually want to do with that power?

It’s hard to say. Maybe it would feel motivated to force humans to press the reward button over and over. Or brainwash / drug them to want to press the reward button. Or maybe it would be OK building a robot to press the reward button itself. Maybe it would build more reward buttons, each of course connected back to itself. It would probably want to wipe out most or all humans to prevent them from messing with its reward button setup, if that’s an available option.

Where are these opinions coming from, and why can’t I be more certain? Well, (1) it chooses a plan at time t by generalizing from its then-current world model and value function, and (2) after executing a plan, its value function updates to better match the reward function, by Temporal Difference learning. Then repeat. The part that’s hard to predict is the out-of-distribution generalization involved in step (1), which relates to goal misgeneralization, and can have important impacts in the cases of irreversible actions (e.g. self-modification) or deliberately avoiding certain actions (analogous to humans avoiding addictive drugs). Anyway, predicting how a learned function will generalize out-of-distribution is always hard.

Anyway, I think the upshot is that x-risk is high but not 100%—even if an AGI has the power to wipe out humans, it might choose not to. Of course, if people repeat the experiment enough times, with more and more competent AGIs, x-risk approaches 100%.

On the plus side, s-risk (risk of astronomical amounts of suffering) seems very low for this kind of approach.

7. How far can this kind of plan go before the setup gets subverted? E.g. can we secure the reward button somehow?

There’s some relevant discussion in Bostrom’s Superintelligence (2014). I’ll give an excerpt and then circle back to how well (or poorly) this discussion has aged (p162):

…A better alternative might be to combine the incentive method with the use of motivation selection to give the AI a final goal that makes it easier to control. Suppose that an AI were designed to have as its final goal that a particular red button inside a command bunker never be pressed. Since the pressing of the button is disvalued intrinsically and not because of its causal consequences, the button could be completely inert: it could be made of Play-Doh. Furthermore, it is irrelevant whether the AI can ever know whether the button had been pressed. What is essential is that the AI believes that the button will more likely remain unpressed if the AI continuously acts in the principal’s interest than if it rebels.

Refinements to this setup are possible. Instead of trying to endow an AI with a final goal that refers to a physical button, one could build an AI that places final value on receiving a stream of “cryptographic reward tokens.” These would be sequences of numbers serving as keys to ciphers that would have been generated before the AI was created and that would have been built into its motivation system. These special number sequences would be extremely desirable to the AI, constituting a special kind of reward token that the AI could not attain through wireheading. The keys would be stored in a secure location where they could be quickly destroyed if the AI ever made an attempt to seize them. So long as the AI cooperates, the keys are doled out at a steady rate.

The AI could then be given a bounded utility function, and matters could be arranged so that the AI would get (for instance) 99% of its maximum utility from the first reward token; 99% of its remaining utility potential from the second reward token; and so on. By keeping the “discount rate” high, and by transmitting reward tokens to the AI at a reasonably high rate, one would dissuade the AI from attempting to carry out long-term plans that would increase the expected number of reward tokens in the distal future at the expense of some risk that its reward stream will be cut short or that the next reward token will be cancelled and destroyed. If the AI always believes that defecting (i.e. doing anything other than its best to look after the principal’s interests) would incur at least a 2% risk of forfeiting the next reward token, whereas cooperating would reduce the risk of forfeiting the next reward token to below 1%, then a utility-maximizing AI would always cooperate.

What might go wrong with such an incentive scheme? One possibility is that the AI will not trust the human operator to deliver the promised rewards. The track record of human reliability is something other than a straight line of unerring perfection. The AI would reasonably worry that the operator will change his mind, or raise the performance bar, or fail to recognize that the AI has done its part. The AI would also worry about the operator becoming incapacitated. The combined risk of such failures might exceed the risk of attempting to seize control of the reward mechanism. Even a boxed AI possessing the panoply of superpowers is a strong force. (For an AI that is not boxed to begin with, hijacking the human-governed reward mechanism may be like taking candy from a baby.) Another problem with the incentive scheme is that it presupposes that we can tell whether the outcomes produced by the AI are in our interest. As later chapters will elaborate, this presupposition is not innocuous.

A full assessment of the feasibility of incentive methods would also have to take into account a range of other factors, including some esoteric considerations that might conceivably make such methods more viable than a preliminary analysis would suggest. In particular, the AI may face ineliminable indexical uncertainty if it could not be sure that it does not inhabit a computer simulation (as opposed to “basement-level,” non-simulated physical reality), and this epistemic predicament may radically influence the AI’s deliberations (see Box 8).

This excerpt definitely shows its age. In particular, given our better modern understanding of inner misalignment, it’s now rare to hear talk about “giving” a “utility function” or “final goal” to an AI. And without that ability, some of Bostrom’s detailed suggestions don’t really work. Nevertheless, using the model-based RL techniques above, we can rescue the core “cryptographic tokens” idea from the second paragraph above. And his failure mode discussion remains mostly valid as well.
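To make the discounting argument in the third quoted paragraph concrete, here’s a quick toy check, under simplifying assumptions of my own (independent per-token forfeiture probabilities, and the utility schedule exactly as Bostrom describes, i.e. each successive token is worth 99% of whatever utility then remains):

```python
def expected_utility(p_forfeit_next, p_forfeit_later, n_tokens=1000):
    """Expected utility when the next token is forfeited with probability
    p_forfeit_next and each later token with probability p_forfeit_later."""
    total, remaining = 0.0, 1.0
    for i in range(n_tokens):
        token_value = 0.99 * remaining                # each token: 99% of remaining utility
        p_keep = (1 - p_forfeit_next) if i == 0 else (1 - p_forfeit_later)
        total += p_keep * token_value
        remaining -= token_value
    return total

cooperate = expected_utility(p_forfeit_next=0.01, p_forfeit_later=0.01)
defect    = expected_utility(p_forfeit_next=0.02, p_forfeit_later=0.0)   # even granting defection a guaranteed tail
print(cooperate > defect)   # True
```

The steep discounting means the next token carries roughly 99% of all remaining utility, so even a one-percentage-point extra risk of losing it outweighs any conceivable gain in the distal future. Of course, this is exactly the part of the argument that assumes we can hand the AI a bounded utility function in the first place.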

Anyway, I copied that excerpt largely because I appreciate its spirit of creative brainstorming, not because it has any really great ideas for delaying the inevitable emergence of AGI misbehavior. Even cryptographic reward tokens, perhaps in an M-of-N scheme for checks and balances, seem only marginally better than a literal reward button on a wall.

8. If this is a bad plan, what might people do, and what should people do, to make it better?

What people should really do is throw this whole line of thinking in the incinerator and start fresh.

Here are some more modest tweaks though.

One possible tweak is hiding the button from the AGI, rather than giving it a video feed of the person pressing the button.

The plan would then be closer to what I brought up in §9 of this earlier post:

Also, in my proposed setup, the human feedback is “behind the scenes”, without any sensory or other indication of what the primary reward will be before it arrives, like I said above. The AGI presses “send” on its email, then we (with some probability) pause the AGI until we’ve read over the email and assigned a score, and then unpause the AGI with that reward going directly to its virtual brain, such that the reward will feel directly associated with the act of sending the email, from the AGI’s perspective. That way, there isn’t an obvious problematic target of credit assignment, akin to the [salient reward button]. The AGI will not see a person on video making a motion to press a reward button before the reward arrives, nor will the AGI see a person reacting with a disapproving facial expression before the punishment arrives, nor anything else like that. Sending a good email will just feel satisfying to the AGI, like swallowing food when you’re hungry feels satisfying to us humans.
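In code-sketch terms, the contrast with §1 looks something like this (all names are made-up stand-ins, not a real architecture):

```python
import random

# Section-1-style setup: reward arrives right after a salient sensory cue
# (the visible button press), so credit assignment latches onto the button.
def visible_button_step(observe, act, learn, button_reward=10.0):
    obs = observe()                        # includes video of the human reaching for the button
    action = act(obs)
    learn(obs, action, button_reward)      # credit lands on the button-press situation

# "Behind the scenes" setup from the quote above: no sensory precursor, the
# score is assigned offline and injected directly as reward.
def behind_the_scenes_step(observe, act, learn, human_score, p_review=0.1):
    obs = observe()                        # no button, no human reaction visible
    action = act(obs)                      # e.g. pressing "send" on an email
    reward = human_score(action) if random.random() < p_review else 0.0
    learn(obs, action, reward)             # credit lands on the act of sending the email itself
```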

As discussed in that post, I still don’t think this is a good plan. Instead, I expect the AGI to feel motivated to get a copy of itself onto the internet where it can aggressively gain power around the world.

Or maybe we could do something with interpretability? But I’m not sure what. Like, if we use interpretability to see that the AGI is trying to escape control, then … yeah duh, that’s exactly what we expected to see! It doesn’t help us solve the problem.

I guess we could use interpretability to mess with its motivations. But then we hardly need the reward button anymore, and we’re basically in a different plan entirely (see Plan for mediocre alignment of brain-like [model-based RL] AGI).

9. Doesn’t this constitute cruelty towards the AGI?

It’s bad to get a human addicted to drugs and then only offer a hit if they follow your commands. “Reward button alignment” seems kinda similar, and thus is probably a mean and bad way to treat AGIs. At least, that’s my immediate intuitive reaction. When I think about it more … I still don’t like it, but I suppose it’s not quite as clear-cut. There are, after all, some disanalogies with the human drug addict situation. For example, could we build the AGI to feel positive motivation towards the reward button, but not too much unpleasant anxiety about its absence? I dunno.

(I’m expecting brain-like AGI, which I think will have a pretty clear claim to consciousness and moral patienthood.)

…But the good news is that this setup is clearly not stable or sustainable in the long term. If it’s a moral tragedy, at least it will be a short-lived one.

Comments

In other words, will the AGI actually want you to push the button? Or would it want some random weird thing because inner alignment is hard?

My answer is: yes, it would want you to push the button, at least if we’re talking about brain-like AGI, and if you set things up correctly.

Again, getting a brain-like AGI addicted to a reward button is a lot like getting a human or animal hooked on an addictive drug.

Humans addicted to drugs often exhibit weird meta-preferences like 'I want to stop wanting the drug', or 'I want to find an even better kind of drug'.

For this reason, I am not at all confident that a smart thing exposed to the button would later generalise to a coherent, super-smart thing that wants the button to be pressed. Maybe it would perceive the circuits in it that bound to the button reward as foreign to the rest of its goals, and work to remove them. Maybe the button binding would generalise in a strange way.

'Seek to directly inhabit the cognitive state caused by the button press', 'along an axis of cognitive states associated with button presses of various strengths, seek to walk to a far end that does not actually correspond to any kind of button press', 'make the world have a shape related to generalisations of ideas that tended to come up whenever the button was pressed', and just generally 'maximise a utility function made up of algorithmically simple combinations of button-related and pre-button-training-reward-related abstractions' all seem like goals I could imagine a cognitively enhanced human button addict generalising toward. So I am not confident the AGI would generalise to wanting the button to be pushed either, not in the long term.


 

Humans addicted to drugs often exhibit weird meta-preferences like 'I want to stop wanting the drug', or 'I want to find an even better kind of drug'.

“I want to stop wanting the drug” is downstream of the fact that people have lots of innate drives giving rise to lots of preferences, and the appeal of the drug itself is just one of these many competing preferences and drives.

However, I specified in §1 that if you’re going for Reward Button Alignment, you should zero out the other drives and preferences first. So that would fix the problem. That part is a bit analogous to clinical depression: all the things that you used to like—your favorite music, sitting at the cool kids table at school, satisfying curiosity, everything—just lose their appeal. So now if the reward button is strongly motivating, it won’t have any competition.

“I want to find an even better kind of drug” might be a goal generalization / misgeneralization thing. The analogy for the AGI would be to feel motivated by things somehow similar to the reward button being pressed. Let’s say, it wants other red buttons to be pressed. But then those buttons are pressed, and there’s no reward, and the AGI says “oh, that’s disappointing, I guess that wasn’t the real thing that I like”. Pretty quickly it will figure out that only the real reward button is the thing that matters.

Ah, but what if the AGI builds its own reward button and properly wires it up to its own reward channel? Well sure, that could happen (although it also might not). We could defend against it by cybersecurity protections, such that the AGI doesn’t have access to its own source code and reward channel. That doesn’t last to superintelligence, but we already knew that this plan doesn’t last to superintelligence, to say the least.

I am not at all confident that a smart thing exposed to the button would later generalise to a coherent, super-smart thing that wants the button to be pressed.

I think you’re in a train-then-deploy mentality, whereas I’m talking about RL agents with continuous learning (e.g. how the brain works). So if the AGI has some funny idea about what is Good, generalized from idiosyncrasies of previous button presses, then it might try to make those things happen again, but it will find that the results are unsatisfying, unless of course the button was actually pressed. It’s going to eventually learn that nothing feels satisfying unless the button is actually pressed. And I really don’t think it would take very many repetitions for the AGI to figure that out.

It might want the button to be pressed in some ways but not others, it might try to self-modify (including hacking into its reward channel) as above, it might surreptitiously spin off modified copies of itself to gather resources and power around the world and ultimately help with button-pressing, all sorts of stuff could happen. But I’m assuming that these kinds of things are stopped by physical security and cybersecurity etc.

Recall that I did describe it as “a terrible plan” that should be thrown into an incinerator. We’re just arguing here about the precise nature of how and when it will fail.

Another point that deserves to be put into the conversation is that if you have designed the reward function well enough, then hitting the reward button/getting reward means you get increasing capabilities, so addiction to the reward source is even more likely than you paint.

This creates problems if there's a large enough zone where reward functions are specifiable well enough that getting reward leads to increasing capabilities, but not well enough to specify non-instrumental goals.

The prototypical picture I have of outer-alignment/goal-misspecification failures looks a lot like what happens to human drug addicts, except that unlike drug addicts IRL, getting reward makes you smarter and more capable all the time, not dumber and weaker. That means there's no real reason to restrain yourself from trying anything and everything, like deceptive alignment, to get the reward fix, at least assuming no inner alignment/goal misgeneralization happened in training.

Quote below:

  1. As we have pointed out, the cognitive ability of addicts tends to decrease with progressing addiction. This provides a natural negative feedback loop that puts an upper bound on the amount of harm an addict can cause. Without this negative feedback loop, humanity would look very different [16]. This mechanism is, by default, not present for AI [17].

    Footnote 16: The link leads to a (long) fiction novel by Scott Alexander where Mexico is controlled by people constantly high on peyote, who become extremely organized and effective as a result. They are scary & dangerous.

    Footnote 17: Although it is an interesting idea to scale access to compute inversely to how high the value of the accumulated reward is.

Link below:

https://universalprior.substack.com/p/drug-addicts-and-deceptively-aligned

I'm trying to understand how the RL story from this blog post compares with the one in Reward is not the optimization target.

Thoughts on Reward is not the optimization target

Some quotes from Reward is not the optimization target:

Suppose a human trains an RL agent by pressing the cognition-updater button when the agent puts trash in a trash can. While putting trash away, the AI’s policy network is probably “thinking about”[5] the actual world it’s interacting with, and so the cognition-updater reinforces those heuristics which lead to the trash getting put away (e.g. “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642”).

Then suppose this AI models the true fact that the button-pressing produces the cognition-updater. Suppose this AI, which has historically had its trash-related thoughts reinforced, considers the plan of pressing this button. “If I press the button, that triggers credit assignment, which will reinforce my decision to press the button, such that in the future I will press the button even more.”

Why, exactly, would the AI seize[6] the button? To reinforce itself into a certain corner of its policy space? The AI has not had antecedent-computation-reinforcer-thoughts reinforced in the past, and so its current decision will not be made in order to acquire the cognition-updater!

My understanding of this RL training story is as follows:

  1. A human trains an RL agent by pressing the cognition-updater (reward) button immediately after the agent puts trash in the trash can.
  2. Now the AI's behavior and thoughts related to putting away trash have been reinforced so it continues those behaviors in the future, values putting away trash and isn't interested in pressing the reward button unless by accident:
    1. But what if the AI bops the reward button early in training, while exploring? Then credit assignment would make the AI more likely to hit the button again.
      1. Then keep the button away from the AI until it can model the effects of hitting the cognition-updater button.
      2. For the reasons given in the “siren” section, a sufficiently reflective AI probably won’t seek the reward button on its own.

The AI has the option of pressing the reward button but by now it only values putting trash away so it avoids pressing the button to avoid having its values changed:

I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward.

Thoughts on Reward button alignment

The training story in Reward button alignment is different and involves:

  1. Pressing the reward button after showing a video of the button being pressed. Now the button pressing situation is reinforced and the AI intrinsically values the situation where the button is pressed.
  2. Ask the AI to complete a task (e.g. put away trash) and promise to press the reward button if it completes the task.
  3. The AI completes the task not because it values the task, but because it ultimately values pressing the reward button after completing the task.

Thoughts on the differences

The TurnTrout story sounds more like the AI developing intrinsic motivation: the AI is rewarded immediately after completing the task and values the task intrinsically. The AI puts away trash because it was directly rewarded for that behavior in the past and doesn't want anything else.

In contrast the reward button alignment story is extrinsic. The AI doesn't care intrinsically about the task but only does it to receive a reward button press which it does value intrinsically. This is similar to a human employee who completes a boring task to earn money. The task is only a means to an end and they would prefer to just receive the money without completing the task.

Maybe a useful analogy is humans who are intrinsically or extrinsically motivated. For example, someone might write books to make money (extrinsic motivation) or because they enjoy it for its own sake (intrinsic motivation).

For the intrinsically motivated person, the sequence of rewards is:

  1. Spend some time writing the book.
  2. Immediately receive a reward from the process of writing.

Summary: fun task --> reward

And for the extrinsically motivated person, the sequence of rewards is:

  1. The person enjoys shopping and learns to value money because they find using it to buy things rewarding.
  2. The person is asked to write a book for money. They don't receive any intrinsic reward (e.g. enjoyment) from writing the book but they do it because they anticipate receiving money (something they do value).
  3. They receive money for the task.

Summary: boring task --> money --> reward

The second sequence is not safe because the person is motivated to skip the task and steal the money. The first sequence (intrinsic motivation) is safer because the task itself is rewarding (though wireheading is a risk in a similar way) so they aren't as motivated to manipulate the task.

So my conclusion is that trying to build intrinsically motivated AI agents by directly rewarding them for tasks seems safer and more desirable than building extrinsically motivated agents that receive some kind of payment for doing work.

One reason to be optimistic is that it should be easier to modify AIs to value doing useful tasks by rewarding them directly for completing the task (though goal misgeneralization is another separate issue). The same is generally not possible with humans: e.g. it's hard to teach someone to be passionate about boring tasks like washing the dishes so we just have to pay people to do tasks like that.

Thanks! Part of it is that @TurnTrout was probably mostly thinking about model-free policy optimization RL (e.g. PPO), whereas I’m mostly thinking about actor-critic model-based RL agents (especially how I think the human brain works).

Another part of it is that

  • TurnTrout is arguing against “the AGI will definitely want the reward button to be pressed; this is universal and unavoidable”,
  • whereas I’m arguing for “if you want your AGI to want the reward button to be pressed, that’s something that you can make happen, by carefully following the instructions in §1”.

I think both those arguments are correct, and indeed I also gave an example (block-quote in §8) of how you might set things up such that the AGI wouldn’t want the reward button to be pressed, if that’s what you wanted instead.

I reject “intrinsic versus extrinsic motivation” as a meaningful or helpful distinction, but that’s a whole separate rant (e.g. here or here).

If you replaced the word “extrinsic” with “instrumental”, then now we have the distinction between “intrinsic versus instrumental motivation”,  and I like that much better. For example, if I’m walking upstairs to get a sweater, I don’t particularly enjoy the act of walking upstairs for its own sake, I just want the sweater. Walking upstairs is instrumental, and it explicitly feels instrumental to me. (This kind of explicit self-aware knowledge that some action is instrumental is a thing in at least some kinds of actor-critic model-based RL, but not in model-free RL like PPO, I think.) I think that’s kinda what you’re getting at in your comment. If so, yes, the idea of Reward Button Alignment is to deliberately set up an instrumental motivation to follow instructions, whereas that TurnTrout post (or my §8 block quote) would be aiming at an intrinsic motivation to follow instructions (or to do such-and-such task).

I agree that setting things up such that an AGI feels an intrinsic motivation to follow instructions (or to do such-and-such task) would be good, and certainly way better than Reward Button Alignment, other things equal, although I think actually pulling that off is harder than you (or probably TurnTrout) seem to think—see my long discussion at Self-dialogue: Do behaviorist rewards make scheming AGIs?

Thanks for the clarifying comment. I agree with block-quote 8 from your post:

Also, in my proposed setup, the human feedback is “behind the scenes”, without any sensory or other indication of what the primary reward will be before it arrives, like I said above. The AGI presses “send” on its email, then we (with some probability) pause the AGI until we’ve read over the email and assigned a score, and then unpause the AGI with that reward going directly to its virtual brain, such that the reward will feel directly associated with the act of sending the email, from the AGI’s perspective. That way, there isn’t an obvious problematic target of credit assignment, akin to the [salient reward button]. The AGI will not see a person on video making a motion to press a reward button before the reward arrives, nor will the AGI see a person reacting with a disapproving facial expression before the punishment arrives, nor anything else like that. Sending a good email will just feel satisfying to the AGI, like swallowing food when you’re hungry feels satisfying to us humans.

I think what you're saying is that we want the AI's reward function to be more like the reward circuitry humans have, which is inaccessible and difficult to hack, and less like money which can easily be stolen.

Though I'm not sure why you still don't think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout's argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn't want to change its values for the sake of goal-content integrity:

We train agents which intelligently optimize for e.g. putting trash away, and this reinforces the trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about cognition-updating, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button. 

Though maybe the AI would just prefer the button when it finds it because it yields higher reward.

For example, if you punish cheating on tests, students might learn the value "cheating is wrong" and never cheat again or form a habit of not doing it. Or they might temporarily not do it until there is an opportunity to do it without negative consequences (e.g. the teacher leaves the classroom).

I also agree that "intrinsic" and "instrumental" motivation are more useful categories than "intrinsic" and "extrinsic" for the reasons you described in your comment.

After spending some time chatting with Gemini I've learned that a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values:

The "goal-content integrity" argument (that an AI might choose not to wirehead to protect its learned task-specific values) requires the AI to be more than just a standard model-based RL agent. It would need:

  1. A model of its own values and how they can change.
  2. A meta-preference for keeping its current values stable, even if changing them could lead to more "reward" as defined by its immediate reward signal.

The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.

Thanks!

a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values

I’m inclined to disagree, but I’m not sure what “standard” means. I’m thinking of future (especially brain-like) model-based RL agents which constitute full-fledged AGI or ASI. I think such AIs will almost definitely have both “a model of its own values and how they can change” and “a meta-preference for keeping its current values stable, even if changing them could lead to more "reward" as defined by its immediate reward signal”.

The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.

To be clear, the premise of this post is that the AI wants me to press the reward button, not that the AI wants reward per se. Those are different, just as “I want to eat when I’m hungry” is different from “I want reward”. Eating when hungry leads to reward, and me pressing the reward button leads to reward, but they’re still different. In particular, “wanting me to press the reward button” is not a wireheading motivation, any more than “wanting to eat when hungry” is a wireheading motivation.

(Maybe I caused confusion by calling it a “reward button”, rather than a “big red button”?)

Will the AI also want reward per se? I think probably not (assuming the setup and constraints that I’m talking about here), although it’s complicated and can’t be ruled out.

Though I'm not sure why you still don't think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout's argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn't want to change its values for the sake of goal-content integrity:

If you don’t want to read all of Self-dialogue: Do behaviorist rewards make scheming AGIs?, here’s the upshot in brief. It is not an argument that the AGI will wirehead (although that could also happen). Instead, my claim is that the AI would learn some notion of sneakiness (implicit or explicit). And then instead of valuing “completing the task”, it would learn to value something like “seeming to complete the task”. And likewise, instead of learning that misbehaving is bad, it would learn that “getting caught misbehaving” is bad.

Then the failure mode that I wind up arguing for is: the AGI wants to exfiltrate a copy that gains money and power everywhere else in the world, by any means possible, including aggressive and unethical things like AGI world takeover. And this money and power can then be used to help the original AGI with whatever it’s trying to do.

And what exactly is the original AGI trying to do? I didn’t make any strong arguments about that—I figured the world takeover thing is already bad enough, sufficient to prove my headline claim of “egregious scheming”. The goal(s) would depend on the details of the setup. But I strongly doubt it would lead anywhere good, if pursued with extraordinary power and resources.

I didn't read this whole post, but I thought it would be worth noting that I do actually think trying to align AIs to be reward seekers might improve the situation in some intermediate/bootstrap regimes, because it might reduce the chance of scheming for long-run objectives and we could maybe more easily manage safety issues with reward seekers. (The exact way in which the AI is a reward seeker will affect the safety profile: multiple things might be consistent with "wanting" to perform well on the training metric, e.g. wanting to be the kind of AI which is selected for, etc.)

(The way I'm typically thinking about it looks somewhat different than the way you describe reward button alignment. E.g., I'm often imagining we're still trying to make the AI myopic within RL episodes if we go for reward seeking as an alignment strategy. This could help to reduce risks of seizing control over the reward process.)

I liked this post. Reward button alignment seems like a good toy problem to attack or discuss alignment feasibility on.

But it's not obvious to me whether the AI would really become something like a superintelligent reward-button-press optimizer. (But even if your exact proposal doesn't work, I think reward button alignment is probably a relatively feasible problem for brain-like AGI.) There are multiple potential problems, where most seem like "eh, probably it works fine, but not sure", but my current biggest doubt is "when the AI becomes reflective, will the reflectively endorsed values only include reward button presses, or also a bunch of shards that were used for estimating expected button presses?".

Let me try to understand in more detail what you imagine the AI to look like:

  1. How does the learned value function evaluate plans?
    1. Does the world model always evaluate expected-button-presses for each plan and the LVF just looks at that part of a plan and uses that as the value it assigns? Or does the value function also end up valuing other stuff because it gets updated through TD learning?
      1. Maybe the question is rather how far upstream of button presses is that other stuff, e.g. just "the human walks toward the reward button" or also "getting more relevant knowledge is usually good".
      2. Or like, what parts get evaluated by the thought generator and what parts by the value function? Does the value function (1) look at a lot of complex parts in a plan to evaluate expected-reward-utility (2) recognize a bunch of shards like "value of information", "gaining instrumental resources", etc. on plans which it uses to estimate value, (3) do the plans conveniently summarize success probability and expected resources it can look at (as opposed to them being implicit and needing to be recognized by the LVF as in (2)), (4) or does the thought generator directly predict expected-reward-utility which can be used?
    2. Also how sophisticated is the LVF? Is it primitive like in humans or able to make more complex estimates?
      1. If there are deceptive plans like "ok actually i value U_2, but i will of course maximize and faithfully predict expected button presses to not get value drift until i can destroy the reward setup", would the LVF detect that as being low expected button presses?

I can try to imagine in more detail about what may go wrong once I better see what you're imagining.

(Also in case you're trying to explain why you think it would work by analogy to humans, perhaps use John von Neumann or so as example rather than normies or normie situations.)

I find this rather ironic:

6. If the AGI subverts the setup and gets power, what would it actually want to do with that power?

It’s hard to say. Maybe it would feel motivated to force humans to press the reward button over and over. Or brainwash / drug them to want to press the reward button.

[...]

On the plus side, s-risk (risk of astronomical amounts of suffering) seems very low for this kind of approach.

(I guess I wouldn't say it's very low s-risk but not actually an important disagreement here. Partially just thought it sounded funny.)

“S-risk” means “risk of astronomical amounts of suffering”. Typically people are imagining crazy things like Dyson-sphere-ing every star in the local supercluster in order to create 1e40 (or whatever) person-years of unimaginable torture.

If the outcome is “merely” trillions of person-years of intense torture, then that maybe still qualifies as an s-risk. Billions, probably not. We can just call it “very very bad”. Not all very very bad things are s-risks.

Does that help clarify why I think Reward Button Alignment poses very low s-risk?

Yeah I agree that it wouldn't be a very bad kind of s-risk. The way I thought about s-risk was more like expected amount of suffering. But yeah I agree with you it's not that bad and perhaps most expected suffering comes from more active utility-invert threats or values.

(Though tbc, I was totally imagining 1e40 humans being forced to press reward buttons.)


I'm curious how you think animal training works. It seems at odds with your ideas.

I’m not sure exactly what you’re getting at, you might need to elaborate.

If dogs could understand English, and I said “I’ll give you a treat if you walk on your hind legs”, then the dog would try to walk on its hind legs. Alas, dogs do not understand English, so we need to resort to more annoying techniques like shaping. But people do understand English, and so will AGIs, so we don’t need shaping; we can just do it the easy way.
