In the context of actor-critic model-based RL agents in general, and brain-like AGI in particular, part of the source code is a reward function. The programmers get to put whatever code they want into the reward function slot, and this decision will have an outsized effect on what the AGI winds up wanting to do.

One thing you could do is: hook up the reward function to a physical button on a wall. And then you’ll generally (see caveats below) get an AGI that wants you to press the button.

A nice intuition is addictive drugs. If you give a person or animal a few hits of an addictive drug, then they’re going to want it again. The reward button is just like that—it’s an addictive drug for the AGI.

And then what? Easy, just tell the AGI, “Hey AGI, I’ll press the button if you do (blah)”—write a working app, earn a million dollars, improve the solar cell design, you name it. And then ideally, the AGI will try very hard to do (blah), again because it wants you to press the button.

So, that’s a plan! You might be thinking: this is a very bad plan! Among other issues (see below), the AGI will want to seize control of the reward button, and wipe out humanity to prevent them from turning it off, if that’s possible. And it will surely be possible as we keep going down the road towards superintelligence.

…And you would be right! Reward button alignment is indeed a terrible plan.

So why am I writing a blog post about it? Three reasons:

  • People will be tempted to try “reward button alignment”, and other things in that genre, even if you and I think they’re very bad plans. So we should understand the consequences in detail—how far can it go before it fails, and then what exactly goes wrong?
  • It’s not quite as obviously terrible a plan as it sounds, if it’s not the whole plan by itself but rather one stage of a bootstrapping / “AI Control”-ish approach.
  • It’s a nice case study that illuminates some still-controversial issues in AGI alignment.

So here is my little analysis, in the form of an FAQ.

1. Can we actually get “reward button alignment”, or will there be inner misalignment issues?

In other words, will the AGI actually want you to push the button? Or would it want some random weird thing because inner alignment is hard?

My answer is: yes, it would want you to push the button, at least if we’re talking about brain-like AGI, and if you set things up correctly.

Again, getting a brain-like AGI addicted to a reward button is a lot like getting a human or animal hooked on an addictive drug.

Some details matter here:

  • You need to actually give the AGI a few “hits” of the reward button; it’s not enough to just tell the AGI that the reward button exists. By the same token, humans get addicted to drugs by trying them, not by learning of their existence. Algorithmically, the “hits” will update the value function, which feeds into planning and motivation (more in §9.4–5 here).
  • If there are meanwhile other contributions to the reward function besides the reward button, e.g. a curiosity drive, then you want to make sure they’re sufficiently smaller (less rewarding) than the reward button, so as not to outvote it. Or perhaps you could just turn those other competing sources-of-motivation off altogether (and zero out the value function) before starting with the reward button, to be safe. (This need not tank the AGI’s capabilities, for reasons here.)
  • The AGI needs to be able to actually see that you’re pressing the reward button. Give it a video feed, maybe. Then it will have a sensory / semantic input that reliably immediately precedes the primary reward, and thus “credit assignment” will make that button-press situation seem very good and motivating to the AGI (see the toy sketch just after this list).
  • You want to do all this after the AGI already has a pretty good understanding of the world, e.g. from a sandbox training environment with YouTube videos and so on.
  • You should establish a track record of actually pressing the reward button under the conditions that you say you will, so that the AGI trusts you.
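Here’s a minimal toy sketch of what the bullets above amount to, mechanically. (This is obviously not a real AGI architecture, just a tiny linear critic with made-up features, so treat it as purely illustrative.)

```python
import numpy as np

# Toy setup: observations are tiny feature vectors; the critic ("value
# function") is linear; the button is the only nonzero reward source, since
# the competing drives have been zeroed out per the second bullet above.
N_FEATURES = 4
BUTTON_FEATURE = 0   # 1.0 when the AGI sees the button being pressed (video feed)

def reward_fn(features):
    curiosity_bonus = 0.0                             # competing reward sources switched off
    button_reward = 10.0 * features[BUTTON_FEATURE]   # the "hit"
    return button_reward + curiosity_bonus

w = np.zeros(N_FEATURES)   # linear value function ("critic"), zeroed out at the start

def value(features):
    return float(w @ features)

def td_update(features, reward, next_features, lr=0.1, gamma=0.99):
    """TD(0): credit flows to whatever features were active just before the reward."""
    global w
    td_error = reward + gamma * value(next_features) - value(features)
    w += lr * td_error * features
    return td_error

# A few "hits" of the button: the button-visible situation quickly becomes
# the highest-valued thing the agent knows about.
see_button_pressed = np.array([1.0, 0.0, 0.0, 1.0])
a_moment_later     = np.array([0.0, 0.0, 0.0, 1.0])
for _ in range(5):
    td_update(see_button_pressed, reward_fn(see_button_pressed), a_moment_later)
print(value(see_button_pressed))   # grows with each hit
```

The point is just that there’s no mystery in the “addiction” step: with competing drives zeroed out, a handful of button presses is enough for the critic to assign high value to the button-press situation, and planning then inherits that valuation.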

2. Seriously? Everyone keeps saying that inner alignment is hard. Why are you so optimistic here?

I do think inner alignment is hard! But this is inner alignment on “easy mode”, for two reasons.

First, as mentioned above, the pressing of a reward button is a salient real-world event that reliably and immediately precedes the ground-truth reward. So credit assignment will have an easy time latching onto it. Compare that with the more abstract targets-of-desire that we might want for a superintelligence—things like “a desire to be helpful”, or “a desire for Coherent Extrapolated Volition”, or “a desire for practical whole brain emulation technology”.

Second, as discussed below, I’m assuming sufficiently good security around the reward button and human-manipulation, such that the AGI doesn’t have any feasible out-of-the-box strategies to satisfy its desires. In other words, I’m assuming away a whole host of issues where the AGI’s strategies shift as a result of learning new things, having new ideas, or inventing new technology—more on that in my post “Sharp Left Turn” discourse: An opinionated review.

Again, what we really care about is alignment of more powerful AGIs, AGIs that can brainwash people and seize reward buttons, and where we want to make those AGIs feel intrinsic motivation towards goals that are more abstract or novel. That’s a much harder problem.

3. Can we extract any useful work from a reward-button-aligned AGI?

Sure! (At least, up until the point where it can manipulate us into pressing the button, or seize control in other ways.)

We do need to be mindful that the AGI will only care about us pressing the button, not about our long-term best interest. But that’s enough for a huge amount of “useful work”. For example, suppose I tell the AGI:

Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!

And let’s assume the AI has reason to believe me (e.g. because I’ve been true to my word in the past), and is purely motivated by the reward button. But also assume that the AI is not yet capable of brainwashing me, or stealing my button, or kidnapping my children, or sending off a copy to self-reproduce around the internet and gather power and resources to use for eventually brainwashing me and kidnapping my children etc.

(I guess that last part is rather implausible if it can already autonomously make $1B, but maybe we have taken precautions against brainwashing and button-theft and so on (see below), or else you can substitute a less ambitious project, like making a successful app or whatever.)

This doesn’t require that I have any particular skill at overseeing the AGI. It just requires that I can hire competent lawyers and accountants for pretty basic due diligence. The extra year will also allow for law enforcement or public outcry or whatever to surface any subtle or sneaky problems caused by the AGI.

So that’s a way to automate the economy and make trillions of dollars. Until catastrophic AGI takeover, of course.

4. Can “reward button alignment” be used as the first stage of a bootstrapping / “AI control”-ish plan?

In other words, if we have this “reward button alignment” band-aid, one thing we could do is ask the AGI to come up with a better plan for technical alignment. And we promise to slam the reward button if we’re happy with the results. Then (one hopes) the band-aid doesn’t need to hold forever. It just has to hold long enough to get a better technical alignment plan.

The two things that could go wrong are:

(1) The AGI could output a plan that convinces me, but won’t actually work. Remember that we already have people strongly disagreeing about what constitutes a good plan for alignment (e.g. Eliezer Yudkowsky versus Paul Christiano), and it’s just very hard for third parties to figure out what is true. This problem was the theme of @johnswentworth’s post “The Case Against AI Control Research”.

(2) The AGI could find that it is easier to go rogue and brainwash me or seize the reward button, than to actually figure out technical alignment.

We can try to mitigate (2) by using security measures to delay escape and hijinks for as long as possible (see below).

And what about (1)? Well one thing is: We could use AGI debate! We could set up two copies of the AGI, and tell one to make a plan and the other to say why it’s a bad plan. We’ll slam the reward button for one or the other, depending on who is more convincing.
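Here’s a rough sketch of that debate protocol, just to be concrete. (The “agents” here are bare callables with made-up names; the point is only the structure: two copies argue, a human judges, and the reward button gets pressed for whichever copy was more convincing.)

```python
def run_debate_round(propose, critique, human_judge, press_reward_button):
    plan = propose("produce a technical alignment plan")
    objection = critique(plan)              # the second copy argues the plan is bad
    winner = human_judge(plan, objection)   # "proposer" or "critic"
    press_reward_button(winner)             # only the more convincing copy gets its reward
    return plan, objection, winner

# Toy usage with trivial stand-ins:
run_debate_round(
    propose=lambda task: f"Plan for: {task}",
    critique=lambda plan: f"Objection to: {plan}",
    human_judge=lambda plan, objection: "proposer",
    press_reward_button=lambda winner: print(f"reward button pressed for {winner}"),
)
```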

This is not a crazy idea, and indeed when I think about debate in this context, it seems much more interesting-to-me than it used to.

So, this kind of bootstrapping approach is not a completely crazy idea. I still think it won’t work though. I just think solving the technical alignment problem (and communicating the solution to a human) is harder than subverting the setup (e.g. convincing a person to let it onto the internet). In the context of AGI debate, recall that both of the AGI debaters (and the AGI judge, if applicable) have a strong shared interest in subverting the setup and getting control of the reward buttons, and thus will try to cooperate to make that happen.

But, I suppose it’s good that people are brainstorming in this area.

5. Wouldn’t it require zillions of repetitions to “train” the AGI to seek reward button presses?

No. I acknowledge that AlphaZero needs the reward function to trigger millions of times before it gets very good at Go. But, well, so much the worse for AlphaZero! That’s not a universal limitation on RL.

Humans did not need millions of attempts to go to the moon; they just needed an expectation that good things would happen if they went to the moon (social status, job security, whatever), and then the humans systematically figured out how to do it.

Or for a more everyday example, I only need to taste a delicious restaurant meal once, before wielding great skill towards going back to that same restaurant later—including booking a reservation, figuring out what bus to take during road construction, and so on.

Thus, just a few hits of the reward button should suffice to make a brain-like AGI feel strongly motivated to get more hits, including via skillful long-term planning and means-end reasoning.
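To spell out why a single “hit” is enough, here’s a toy illustration (a made-up miniature world model, not a real agent): once one state has high value, a model-based planner can immediately chain a multi-step plan toward it, with no further reward signals needed.

```python
world_model = {   # state -> {action: next_state}, learned before any reward arrives
    "home":            {"book_table": "reservation", "walk": "bus_stop"},
    "reservation":     {"walk": "bus_stop"},
    "bus_stop":        {"ride_bus": "restaurant_door"},
    "restaurant_door": {"enter": "eating_great_meal"},
}

value = {s: 0.0 for s in list(world_model) + ["eating_great_meal"]}
value["eating_great_meal"] = 10.0   # the one "hit": a single update after one great meal

def best_plan(start, horizon=4):
    """Brute-force search over action sequences, picking the one that ends in the highest-value state."""
    best_value, best_actions = value[start], []
    frontier = [(start, [])]
    for _ in range(horizon):
        next_frontier = []
        for state, actions in frontier:
            for action, nxt in world_model.get(state, {}).items():
                next_frontier.append((nxt, actions + [action]))
                if value[nxt] > best_value:
                    best_value, best_actions = value[nxt], actions + [action]
        frontier = next_frontier
    return best_value, best_actions

print(best_plan("home"))   # (10.0, ['walk', 'ride_bus', 'enter'])
```

No further tastings of the meal are needed; the planner strings together the bus ride and the reservation purely from the world model plus that one valuation.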

6. If the AGI subverts the setup and gets power, what would it actually want to do with that power?

It’s hard to say. Maybe it would feel motivated to force humans to press the reward button over and over. Or brainwash / drug them to want to press the reward button. Or maybe it would be OK building a robot to press the reward button itself. Maybe it would build more reward buttons, each of course connected back to itself. It would probably want to wipe out most or all humans to prevent them from messing with its reward button setup, if that’s an available option.

Where are these opinions coming from, and why can’t I be more certain? Well, (1) it chooses a plan at time t by generalizing from its then-current world model and value function, and (2) after executing a plan, its value function updates to better match the reward function, by Temporal Difference learning. Then repeat. The part that’s hard to predict is the out-of-distribution generalization involved in step (1), which relates to goal misgeneralization, and can have important impacts in the cases of irreversible actions (e.g. self-modification) or deliberately avoiding certain actions (analogous to humans avoiding addictive drugs). Anyway, predicting how a learned function will generalize out-of-distribution is always hard.

Anyway, I think the upshot is that x-risk is high but not 100%—even if an AGI has the power to wipe out humans, it might choose not to. Of course, if people repeat the experiment enough times, with more and more competent AGIs, x-risk approaches 100%.

On the plus side, s-risk (risk of astronomical amounts of suffering) seems very low for this kind of approach.

7. How far can this kind of plan go before the setup gets subverted? E.g. can we secure the reward button somehow?

There’s some relevant discussion in Bostrom’s Superintelligence (2014). I’ll give an excerpt and then circle back to how well (or poorly) this discussion has aged (p162):

…A better alternative might be to combine the incentive method with the use of motivation selection to give the AI a final goal that makes it easier to control. Suppose that an AI were designed to have as its final goal that a particular red button inside a command bunker never be pressed. Since the pressing of the button is disvalued intrinsically and not because of its causal consequences, the button could be completely inert: it could be made of Play-Doh. Furthermore, it is irrelevant whether the AI can ever know whether the button had been pressed. What is essential is that the AI believes that the button will more likely remain unpressed if the AI continuously acts in the principal’s interest than if it rebels.

Refinements to this setup are possible. Instead of trying to endow an AI with a final goal that refers to a physical button, one could build an AI that places final value on receiving a stream of “cryptographic reward tokens.” These would be sequences of numbers serving as keys to ciphers that would have been generated before the AI was created and that would have been built into its motivation system. These special number sequences would be extremely desirable to the AI, constituting a special kind of reward token that the AI could not attain through wireheading. The keys would be stored in a secure location where they could be quickly destroyed if the AI ever made an attempt to seize them. So long as the AI cooperates, the keys are doled out at a steady rate.

The AI could then be given a bounded utility function, and matters could be arranged so that the AI would get (for instance) 99% of its maximum utility from the first reward token; 99% of its remaining utility potential from the second reward token; and so on. By keeping the “discount rate” high, and by transmitting reward tokens to the AI at a reasonably high rate, one would dissuade the AI from attempting to carry out long-term plans that would increase the expected number of reward tokens in the distal future at the expense of some risk that its reward stream will be cut short or that the next reward token will be cancelled and destroyed. If the AI always believes that defecting (i.e. doing anything other than its best to look after the principal’s interests) would incur at least a 2% risk of forfeiting the next reward token, whereas cooperating would reduce the risk of forfeiting the next reward token to below 1%, then a utility-maximizing AI would always cooperate.

What might go wrong with such an incentive scheme? One possibility is that the AI will not trust the human operator to deliver the promised rewards. The track record of human reliability is something other than a straight line of unerring perfection. The AI would reasonably worry that the operator will change his mind, or raise the performance bar, or fail to recognize that the AI has done its part. The AI would also worry about the operator becoming incapacitated. The combined risk of such failures might exceed the risk of attempting to seize control of the reward mechanism. Even a boxed AI possessing the panoply of superpowers is a strong force. (For an AI that is not boxed to begin with, hijacking the human-governed reward mechanism may be like taking candy from a baby.) Another problem with the incentive scheme is that it presupposes that we can tell whether the outcomes produced by the AI are in our interest. As later chapters will elaborate, this presupposition is not innocuous.

A full assessment of the feasibility of incentive methods would also have to take into account a range of other factors, including some esoteric considerations that might conceivably make such methods more viable than a preliminary analysis would suggest. In particular, the AI may face ineliminable indexical uncertainty if it could not be sure that it does not inhabit a computer simulation (as opposed to “basement-level,” non-simulated physical reality), and this epistemic predicament may radically influence the AI’s deliberations (see Box 8).

This excerpt definitely shows its age. In particular, given our better modern understanding of inner misalignment, it’s now rare to hear talk about “giving” a “utility function” or “final goal” to an AI. And without that ability, some of Bostrom’s detailed suggestions don’t really work. Nevertheless, using the model-based RL techniques above, we can rescue the core “cryptographic tokens” idea from the second paragraph above. And his failure mode discussion remains mostly valid as well.
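To make the discounting argument in the third quoted paragraph concrete, here’s a quick toy check, under simplifying assumptions of my own (independent per-token forfeiture probabilities, and the utility schedule exactly as Bostrom describes, i.e. each successive token is worth 99% of whatever utility then remains):

```python
def expected_utility(p_forfeit_next, p_forfeit_later, n_tokens=1000):
    """Expected utility when the next token is forfeited with probability
    p_forfeit_next and each later token with probability p_forfeit_later."""
    total, remaining = 0.0, 1.0
    for i in range(n_tokens):
        token_value = 0.99 * remaining                # each token: 99% of remaining utility
        p_keep = (1 - p_forfeit_next) if i == 0 else (1 - p_forfeit_later)
        total += p_keep * token_value
        remaining -= token_value
    return total

cooperate = expected_utility(p_forfeit_next=0.01, p_forfeit_later=0.01)
defect    = expected_utility(p_forfeit_next=0.02, p_forfeit_later=0.0)   # even granting defection a guaranteed tail
print(cooperate > defect)   # True
```

The steep discounting means the next token carries roughly 99% of all remaining utility, so even a one-percentage-point extra risk of losing it outweighs any conceivable gain in the distal future. Of course, this is exactly the part of the argument that assumes we can hand the AI a bounded utility function in the first place.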

Anyway, I copied that excerpt largely because I appreciate its spirit of creative brainstorming, not because it has any really great ideas for delaying the inevitable emergence of AGI misbehavior. Even cryptographic reward tokens, perhaps in an M-of-N scheme for checks and balances, seem only marginally better than a literal reward button on a wall.

8. If this is a bad plan, what might people do, and what should people do, to make it better?

What people should really do is throw this whole line of thinking in the incinerator and start fresh.

Here are some more modest tweaks though.

One possible tweak is hiding the button from the AGI, rather than giving it a video feed of the person pressing the button.

The plan would then be closer to what I brought up in §9 of this earlier post:

Also, in my proposed setup, the human feedback is “behind the scenes”, without any sensory or other indication of what the primary reward will be before it arrives, like I said above. The AGI presses “send” on its email, then we (with some probability) pause the AGI until we’ve read over the email and assigned a score, and then unpause the AGI with that reward going directly to its virtual brain, such that the reward will feel directly associated with the act of sending the email, from the AGI’s perspective. That way, there isn’t an obvious problematic target of credit assignment, akin to the [salient reward button]. The AGI will not see a person on video making a motion to press a reward button before the reward arrives, nor will the AGI see a person reacting with a disapproving facial expression before the punishment arrives, nor anything else like that. Sending a good email will just feel satisfying to the AGI, like swallowing food when you’re hungry feels satisfying to us humans.
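In code-sketch terms, the contrast with §1 looks something like this (all names are made-up stand-ins, not a real architecture):

```python
import random

# Section-1-style setup: reward arrives right after a salient sensory cue
# (the visible button press), so credit assignment latches onto the button.
def visible_button_step(observe, act, learn, button_reward=10.0):
    obs = observe()                        # includes video of the human reaching for the button
    action = act(obs)
    learn(obs, action, button_reward)      # credit lands on the button-press situation

# "Behind the scenes" setup from the quote above: no sensory precursor, the
# score is assigned offline and injected directly as reward.
def behind_the_scenes_step(observe, act, learn, human_score, p_review=0.1):
    obs = observe()                        # no button, no human reaction visible
    action = act(obs)                      # e.g. pressing "send" on an email
    reward = human_score(action) if random.random() < p_review else 0.0
    learn(obs, action, reward)             # credit lands on the act of sending the email itself
```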

As discussed in that post, I still don’t think this is a good plan. Instead, I expect the AGI to feel motivated to get a copy of itself onto the internet where it can aggressively gain power around the world.

Or maybe we could do something with interpretability? But I’m not sure what. Like, if we use interpretability to see that the AGI is trying to escape control, then … yeah duh, that’s exactly what we expected to see! It doesn’t help us solve the problem.

I guess we could use interpretability to mess with its motivations. But then we hardly need the reward button anymore, and we’re basically in a different plan entirely (see Plan for mediocre alignment of brain-like [model-based RL] AGI).

9. Doesn’t this constitute cruelty towards the AGI?

It’s bad to get a human addicted to drugs and then only offer a hit if they follow your commands. “Reward button alignment” seems kinda similar, and thus is probably a mean and bad way to treat AGIs. At least, that’s my immediate intuitive reaction. When I think about it more … I still don’t like it, but I suppose it’s not quite as clear-cut. There are, after all, some disanalogies with the human drug addict situation. For example, could we build the AGI to feel positive motivation towards the reward button, but not too much unpleasant anxiety about its absence? I dunno.

(I’m expecting brain-like AGI, which I think will have a pretty clear claim to consciousness and moral patienthood.)

…But the good news is that this setup is clearly not stable or sustainable in the long term. If it’s a moral tragedy, at least it will be a short-lived one.

Comments

In other words, will the AGI actually want you to push the button? Or would it want some random weird thing because inner alignment is hard?

My answer is: yes, it would want you to push the button, at least if we’re talking about brain-like AGI, and if you set things up correctly.

Again, getting a brain-like AGI addicted to a reward button is a lot like getting a human or animal hooked on an addictive drug.

Humans addicted to drugs often exhibit weird meta-preferences like 'I want to stop wanting the drug', or 'I want to find an even better kind of drug'.

For this reason, I am not at all confident that a smart thing exposed to the button would later generalise to a coherent, super-smart thing that wants the button to be pressed. Maybe it would perceive the circuits in it that bound to the button reward as foreign to the rest of its goals, and work to remove them. Maybe the button binding would generalise in a strange way.

'Seek to directly inhabit the cognitive state caused by the button press', 'along an axis of cognitive states associated with button presses of various strengths, seek to walk to a far end that does not actually correspond to any kind of button press', 'make the world have a shape related to generalisations of ideas that tended to come up whenever the button was pressed', and just generally 'maximise a utility function made up of algorithmically simple combinations of button-related and pre-button-training-reward-related abstractions' all seem like goals I could imagine a cognitively enhanced human button addict generalising toward. So I am not confident the AGI would generalise to wanting the button to be pushed either, not in the long term.


 

Humans addicted to drugs often exhibit weird meta-preferences like 'I want to stop wanting the drug', or 'I want to find an even better kind of drug'.

“I want to stop wanting the drug” is downstream of the fact that people have lots of innate drives giving rise to lots of preferences, and the appeal of the drug itself is just one of these many competing preferences and drives.

However, I specified in §1 that if you’re going for Reward Button Alignment, you should zero out the other drives and preferences first. So that would fix the problem. That part is a bit analogous to clinical depression: all the things that you used to like—your favorite music, sitting at the cool kids table at school, satisfying curiosity, everything—just lose their appeal. So now if the reward button is strongly motivating, it won’t have any competition.

“I want to find an even better kind of drug” might be a goal generalization / misgeneralization thing. The analogy for the AGI would be to feel motivated by things somehow similar to the reward button being pressed. Let’s say, it wants other red buttons to be pressed. But then those buttons are pressed, and there’s no reward, and the AGI says “oh, that’s disappointing, I guess that wasn’t the real thing that I like”. Pretty quickly it will figure out that only the real reward button is the thing that matters.

Ah, but what if the AGI builds its own reward button and properly wires it up to its own reward channel? Well sure, that could happen (although it also might not). We could defend against it by cybersecurity protections, such that the AGI doesn’t have access to its own source code and reward channel. That doesn’t last to superintelligence, but we already knew that this plan doesn’t last to superintelligence, to say the least.

I am not at all confident that a smart thing exposed to the button would later generalise to a coherent, super-smart thing that wants the button to be pressed.

I think you’re in a train-then-deploy mentality, whereas I’m talking about RL agents with continuous learning (e.g. how the brain works). So if the AGI has some funny idea about what is Good, generalized from idiosyncrasies of previous button presses, then it might try to make those things happen again, but it will find that the results are unsatisfying, unless of course the button was actually pressed. It’s going to eventually learn that nothing feels satisfying unless the button is actually pressed. And I really don’t think it would take very many repetitions for the AGI to figure that out.

It might want the button to be pressed in some ways but not others, it might try to self-modify (including hacking into its reward channel) as above, it might surreptitiously spin off modified copies of itself to gather resources and power around the world and ultimately help with button-pressing, all sorts of stuff could happen. But I’m assuming that these kinds of things are stopped by physical security and cybersecurity etc.

Recall that I did describe it as “a terrible plan” that should be thrown into an incinerator. We’re just arguing here about the precise nature of how and when it will fail.

Another point that deserves to be put into the conversation is that if you have designed the reward function well enough, then hitting the reward button/getting reward means you get increasing capabilities, so addiction to the reward source is even more likely than you paint.

This creates problems if there's a large enough zone where reward functions are specifiable well enough that getting reward leads to increasing capabilities, but not well enough to specify non-instrumental goals.

The prototypical picture I have of outer-alignment/goal-misspecification failures looks a lot like what happens to human drug addicts, except that unlike drug addicts IRL, getting reward makes you smarter and more capable all the time, not dumber and weaker. That means there's no real reason to restrain yourself from trying anything and everything, like deceptive alignment, to get the reward fix, at least assuming no inner alignment/goal misgeneralization happened in training.

Quote below:

  1. As we have pointed out, the cognitive ability of addicts tends to decrease with progressing addiction. This provides a natural negative feedback loop that puts an upper bound on the amount of harm an addict can cause. Without this negative feedback loop, humanity would look very different [16]. This mechanism is, by default, not present for AI [17].

    Footnote 16: The link leads to a (long) fiction novel by Scott Alexander where Mexico is controlled by people constantly high on peyote, who become extremely organized and effective as a result. They are scary & dangerous.

    Footnote 17: Although it is an interesting idea to scale access to compute inversely to how high the value of the accumulated reward is.

Link below:

https://universalprior.substack.com/p/drug-addicts-and-deceptively-aligned

I'm trying to understand how the RL story from this blog post compares with the one in Reward is not the optimization target.

Thoughts on Reward is not the optimization target

Some quotes from Reward is not the optimization target:

Suppose a human trains an RL agent by pressing the cognition-updater button when the agent puts trash in a trash can. While putting trash away, the AI’s policy network is probably “thinking about”[5] the actual world it’s interacting with, and so the cognition-updater reinforces those heuristics which lead to the trash getting put away (e.g. “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642”).

Then suppose this AI models the true fact that the button-pressing produces the cognition-updater. Suppose this AI, which has historically had its trash-related thoughts reinforced, considers the plan of pressing this button. “If I press the button, that triggers credit assignment, which will reinforce my decision to press the button, such that in the future I will press the button even more.”

Why, exactly, would the AI seize[6] the button? To reinforce itself into a certain corner of its policy space? The AI has not had antecedent-computation-reinforcer-thoughts reinforced in the past, and so its current decision will not be made in order to acquire the cognition-updater!

My understanding of this RL training story is as follows:

  1. A human trains an RL agent by pressing the cognition-updater (reward) button immediately after the agent puts trash in the trash can.
  2. Now the AI's behavior and thoughts related to putting away trash have been reinforced so it continues those behaviors in the future, values putting away trash and isn't interested in pressing the reward button unless by accident:
    1. But what if the AI bops the reward button early in training, while exploring? Then credit assignment would make the AI more likely to hit the button again.
      1. Then keep the button away from the AI until it can model the effects of hitting the cognition-updater button.
      2. For the reasons given in the “siren” section, a sufficiently reflective AI probably won’t seek the reward button on its own.

The AI has the option of pressing the reward button but by now it only values putting trash away so it avoids pressing the button to avoid having its values changed:

I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward.

Thoughts on Reward button alignment

The training story in Reward button alignment is different and involves:

  1. Pressing the reward button after showing a video of the button being pressed. Now the button pressing situation is reinforced and the AI intrinsically values the situation where the button is pressed.
  2. Ask the AI to complete a task (e.g. put away trash) and promise to press the reward button if it completes the task.
  3. The AI completes the task not because it values the task, but because it ultimately values pressing the reward button after completing the task.

Thoughts on the differences

The TurnTrout story sounds more like the AI developing intrinsic motivation: the AI is rewarded immediately after completing the task and values the task intrinsically. The AI puts away trash because it was directly rewarded for that behavior in the past and doesn't want anything else.

In contrast the reward button alignment story is extrinsic. The AI doesn't care intrinsically about the task but only does it to receive a reward button press which it does value intrinsically. This is similar to a human employee who completes a boring task to earn money. The task is only a means to an end and they would prefer to just receive the money without completing the task.

Maybe a useful analogy is humans who are intrinsically or extrinsically motivated. For example, someone might write books to make money (extrinsic motivation) or because they enjoy it for its own sake (intrinsic motivation).

For the intrinsically motivated person, the sequence of rewards is:

  1. Spend some time writing the book.
  2. Immediately receive a reward from the process of writing.

Summary: fun task --> reward

And for the extrinsically motivated person, the sequence of rewards is:

  1. The person enjoys shopping and learns to value money because they find using it to buy things rewarding.
  2. The person is asked to write a book for money. They don't receive any intrinsic reward (e.g. enjoyment) from writing the book but they do it because they anticipate receiving money (something they do value).
  3. They receive money for the task.

Summary: boring task --> money --> reward

The second sequence is not safe because the person is motivated to skip the task and steal the money. The first sequence (intrinsic motivation) is safer because the task itself is rewarding (though wireheading is a risk in a similar way) so they aren't as motivated to manipulate the task.

So my conclusion is that trying to build intrinsically motivated AI agents by directly rewarding them for tasks seems safer and more desirable than building extrinsically motivated agents that receive some kind of payment for doing work.

One reason to be optimistic is that it should be easier to modify AIs to value doing useful tasks by rewarding them directly for completing the task (though goal misgeneralization is another separate issue). The same is generally not possible with humans: e.g. it's hard to teach someone to be passionate about boring tasks like washing the dishes so we just have to pay people to do tasks like that.

Thanks! Part of it is that @TurnTrout was probably mostly thinking about model-free policy optimization RL (e.g. PPO), whereas I’m mostly thinking about actor-critic model-based RL agents (especially how I think the human brain works).

Another part of it is that

  • TurnTrout is arguing against “the AGI will definitely want the reward button to be pressed; this is universal and unavoidable”,
  • whereas I’m arguing for “if you want your AGI to want the reward button to be pressed, that’s something that you can make happen, by carefully following the instructions in §1”.

I think both those arguments are correct, and indeed I also gave an example (block-quote in §8) of how you might set things up such that the AGI wouldn’t want the reward button to be pressed, if that’s what you wanted instead.

I reject “intrinsic versus extrinsic motivation” as a meaningful or helpful distinction, but that’s a whole separate rant (e.g. here or here).

If you replaced the word “extrinsic” with “instrumental”, then now we have the distinction between “intrinsic versus instrumental motivation”,  and I like that much better. For example, if I’m walking upstairs to get a sweater, I don’t particularly enjoy the act of walking upstairs for its own sake, I just want the sweater. Walking upstairs is instrumental, and it explicitly feels instrumental to me. (This kind of explicit self-aware knowledge that some action is instrumental is a thing in at least some kinds of actor-critic model-based RL, but not in model-free RL like PPO, I think.) I think that’s kinda what you’re getting at in your comment. If so, yes, the idea of Reward Button Alignment is to deliberately set up an instrumental motivation to follow instructions, whereas that TurnTrout post (or my §8 block quote) would be aiming at an intrinsic motivation to follow instructions (or to do such-and-such task).

I agree that setting things up such that an AGI feels an intrinsic motivation to follow instructions (or to do such-and-such task) would be good, and certainly way better than Reward Button Alignment, other things equal, although I think actually pulling that off is harder than you (or probably TurnTrout) seem to think—see my long discussion at Self-dialogue: Do behaviorist rewards make scheming AGIs?

Thanks for the clarifying comment. I agree with block-quote 8 from your post:

Also, in my proposed setup, the human feedback is “behind the scenes”, without any sensory or other indication of what the primary reward will be before it arrives, like I said above. The AGI presses “send” on its email, then we (with some probability) pause the AGI until we’ve read over the email and assigned a score, and then unpause the AGI with that reward going directly to its virtual brain, such that the reward will feel directly associated with the act of sending the email, from the AGI’s perspective. That way, there isn’t an obvious problematic target of credit assignment, akin to the [salient reward button]. The AGI will not see a person on video making a motion to press a reward button before the reward arrives, nor will the AGI see a person reacting with a disapproving facial expression before the punishment arrives, nor anything else like that. Sending a good email will just feel satisfying to the AGI, like swallowing food when you’re hungry feels satisfying to us humans.

I think what you're saying is that we want the AI's reward function to be more like the reward circuitry humans have, which is inaccessible and difficult to hack, and less like money which can easily be stolen.

Though I'm not sure why you still don't think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout's argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn't want to change its values for the sake of goal-content integrity:

We train agents which intelligently optimize for e.g. putting trash away, and this reinforces the trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about cognition-updating, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button. 

Though maybe the AI would just prefer the button when it finds it because it yields higher reward.

For example, if you punish cheating on tests, students might learn the value "cheating is wrong" and never cheat again or form a habit of not doing it. Or they might temporarily not do it until there is an opportunity to do it without negative consequences (e.g. the teacher leaves the classroom).

I also agree that "intrinsic" and "instrumental" motivation are more useful categories than "intrinsic" and "extrinsic" for the reasons you described in your comment.

After spending some time chatting with Gemini I've learned that a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values:

The "goal-content integrity" argument (that an AI might choose not to wirehead to protect its learned task-specific values) requires the AI to be more than just a standard model-based RL agent. It would need:

  1. A model of its own values and how they can change.
  2. A meta-preference for keeping its current values stable, even if changing them could lead to more "reward" as defined by its immediate reward signal.

The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.

Thanks!

a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values

I’m inclined to disagree, but I’m not sure what “standard” means. I’m thinking of future (especially brain-like) model-based RL agents which constitute full-fledged AGI or ASI. I think such AIs will almost definitely have both “a model of its own values and how they can change” and “a meta-preference for keeping its current values stable, even if changing them could lead to more "reward" as defined by its immediate reward signal”.

The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.

To be clear, the premise of this post is that the AI wants me to press the reward button, not that the AI wants reward per se. Those are different, just as “I want to eat when I’m hungry” is different from “I want reward”. Eating when hungry leads to reward, and me pressing the reward button leads to reward, but they’re still different. In particular, “wanting me to press the reward button” is not a wireheading motivation, any more than “wanting to eat when hungry” is a wireheading motivation.

(Maybe I caused confusion by calling it a “reward button”, rather than a “big red button”?)

Will the AI also want reward per se? I think probably not (assuming the setup and constraints that I’m talking about here), although it’s complicated and can’t be ruled out.

Though I'm not sure why you still don't think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout's argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn't want to change its values for the sake of goal-content integrity:

If you don’t want to read all of Self-dialogue: Do behaviorist rewards make scheming AGIs?, here’s the upshot in brief. It is not an argument that the AGI will wirehead (although that could also happen). Instead, my claim is that the AI would learn some notion of sneakiness (implicit or explicit). And then instead of valuing “completing the task”, it would learn to value something like “seeming to complete the task”. And likewise, instead of learning that misbehaving is bad, it would learn that “getting caught misbehaving” is bad.

Then the failure mode that I wind up arguing for is: the AGI wants to exfiltrate a copy that gains money and power everywhere else in the world, by any means possible, including aggressive and unethical things like AGI world takeover. And this money and power can then be used to help the original AGI with whatever it’s trying to do.

And what exactly is the original AGI trying to do? I didn’t make any strong arguments about that—I figured the world takeover thing is already bad enough, sufficient to prove my headline claim of “egregious scheming”. The goal(s) would depend on the details of the setup. But I strongly doubt it would lead anywhere good, if pursued with extraordinary power and resources.

I didn't read this whole post, but I thought it would be worth noting that I do actually think trying to align AIs to be reward seekers might improve the situation in some intermediate/bootstrap regimes, because it might reduce the chance of scheming for long-run objectives and we could maybe more easily manage safety issues with reward seekers. (The exact way in which the AI is a reward seeker will affect the safety profile: multiple things might be consistent with "wanting" to perform well on the training metric, e.g. wanting to be the kind of AI which is selected for, etc.)

(The way I'm typically thinking about it looks somewhat different than the way you describe reward button alignment. E.g., I'm often imagining we're still trying to make the AI myopic within RL episodes if we go for reward seeking as an alignment strategy. This could help to reduce risks of seizing control over the reward process.)

I liked this post. Reward button alignment seems like a good toy problem to attack or discuss alignment feasibility on.

But it's not obvious to me whether the AI would really become something like a superintelligent reward-button-press optimizer. (But even if your exact proposal doesn't work, I think reward button alignment is probably a relatively feasible problem for brain-like AGI.) There are multiple potential problems, where most seem like "eh, probably it works fine, but not sure", but my current biggest doubt is "when the AI becomes reflective, will the reflectively endorsed values only include reward button presses, or also a bunch of shards that were used for estimating expected button presses?".

Let me try to understand in more detail what you imagine the AI to look like:

  1. How does the learned value function evaluate plans?
    1. Does the world model always evaluate expected-button-presses for each plan and the LVF just looks at that part of a plan and uses that as the value it assigns? Or does the value function also end up valuing other stuff because it gets updated through TD learning?
      1. Maybe the question is rather how far upstream of button presses is that other stuff, e.g. just "the human walks toward the reward button" or also "getting more relevant knowledge is usually good".
      2. Or like, what parts get evaluated by the thought generator and what parts by the value function? Does the value function (1) look at a lot of complex parts in a plan to evaluate expected-reward-utility (2) recognize a bunch of shards like "value of information", "gaining instrumental resources", etc. on plans which it uses to estimate value, (3) do the plans conveniently summarize success probability and expected resources it can look at (as opposed to them being implicit and needing to be recognized by the LVF as in (2)), (4) or does the thought generator directly predict expected-reward-utility which can be used?
    2. Also how sophisticated is the LVF? Is it primitive like in humans or able to make more complex estimates?
      1. If there are deceptive plans like "ok actually i value U_2, but i will of course maximize and faithfully predict expected button presses to not get value drift until i can destroy the reward setup", would the LVF detect that as being low expected button presses?

I can try to imagine in more detail about what may go wrong once I better see what you're imagining.

(Also in case you're trying to explain why you think it would work by analogy to humans, perhaps use John von Neumann or so as example rather than normies or normie situations.)

I find this rather ironic:

6. If the AGI subverts the setup and gets power, what would it actually want to do with that power?

It’s hard to say. Maybe it would feel motivated to force humans to press the reward button over and over. Or brainwash / drug them to want to press the reward button.

[...]

On the plus side, s-risk (risk of astronomical amounts of suffering) seems very low for this kind of approach.

(I guess I wouldn't say it's very low s-risk but not actually an important disagreement here. Partially just thought it sounded funny.)

“S-risk” means “risk of astronomical amounts of suffering”. Typically people are imagining crazy things like Dyson-sphere-ing every star in the local supercluster in order to create 1e40 (or whatever) person-years of unimaginable torture.

If the outcome is “merely” trillions of person-years of intense torture, then that maybe still qualifies as an s-risk. Billions, probably not. We can just call it “very very bad”. Not all very very bad things are s-risks.

Does that help clarify why I think Reward Button Alignment poses very low s-risk?

Yeah I agree that it wouldn't be a very bad kind of s-risk. The way I thought about s-risk was more like expected amount of suffering. But yeah I agree with you it's not that bad and perhaps most expected suffering comes from more active utility-invert threats or values.

(Though tbc, I was totally imagining 1e40 humans being forced to press reward buttons.)


I'm curious how you think animal training works. It seems at odds with your ideas.

I’m not sure exactly what you’re getting at, you might need to elaborate.

If dogs could understand English, and I said “I’ll give you a treat if you walk on your hind legs”, then the dog would try to walk on its hind legs. Alas, dogs do not understand English, so we need to resort to more annoying techniques like shaping. But people do understand English, and so will AGIs, so we don’t need shaping; we can just do it the easy way.
