Three case studies
1. Incentive landscapes that can’t feasibly be induced by a reward function
You’re a deity, tasked with designing a bird brain. You want the bird to get good at singing, as judged by a black-box hardcoded song-assessing algorithm that you already built into the brain last week. The bird chooses actions based in part on within-lifetime reinforcement learning involving dopamine. What reward signal do you use?
Well, we want to train the bird to sing the song correctly. So it’s easy: the bird practices singing, and it listens to its own song using the song-assessing black box, and it does RL using the rule:
The better the song sounds, the higher the reward.
Oh wait. The bird is also deciding how much time to spend practicing singing, versus foraging or whatever. And the worse it sings, the more important it is to practice! So you really want the rule:
The worse the song sounds, the more rewarding it is to practice singing.
How do you resolve this conflict?
- Maybe “Reward = Time derivative of how good the song sounds”? Nope, under this reward, if the bird is bad at singing, and improving very slowly, then practice would not feel very rewarding. But here, the optimal action is to continue spending lots of time practicing. (Singing well is really important.)
- Maybe “Reward is connected to the abstract concept of ‘I want to be able to sing well’?” Sure—I mean, that is ultimately what evolution is going for, and that’s what it would look like for an adult human to “want to get out of debt” or whatever. But how do you implement that? "I want to be able to sing well" is an awfully complicated thought; I doubt most birds are even able to think it—and if they could, we still have to solve a vexing symbol-grounding problem if we want to build a genetic mechanism that points to that particular concept and flags it as desirable. No way. I think this is just one of those situations where “the exact thing you want” is not a feasible option for the within-lifetime RL reward signal, or else doesn’t produce the desired result. (Another example in this category is “Don’t die”.)
- Maybe you went awry at the start, when you decided to choose actions using a within-lifetime RL algorithm? In other words, maybe “choosing actions based on anticipated future rewards, as learned through within-lifetime experience” is not a good idea? Well, if we throw out that idea, it would avoid this problem, and a lot of reasonable people do go down that route (example), but I disagree (discussion here, here); I think RL algorithms (and more specifically model-based RL algorithms) are really effective and powerful ways to skillfully navigate a complex and dynamic world, and I think there's a very good reason that these algorithms are a key component of within-lifetime learning in animal brains. There’s gotta be a better solution than scrapping that whole approach, right?
- Maybe after each singing practice, you could rewrite those memories, to make the experience seem more rewarding in retrospect than it was at the time? I mean, OK, maybe in principle, but can you actually build a mechanism like that which doesn't have unintended side-effects? Anyway, this is getting ridiculous.
…“Aha”, you say. “I have an idea!” One part of the bird brain is “deciding” which low-level motor commands to execute during the song, and another part of the bird brain is “deciding” whether to spend time practicing singing, versus foraging or whatever else. These two areas don’t need the same reward signal! So for the former area, you send a signal: “the better the song sounds, the higher the reward”. For the latter area, you send a signal: “the worse the song sounds, the more rewarding it feels to spend time practicing”.
2. Wishful thinking
You’re the same deity, onto your next assignment: redesigning a human brain to work better. You’ve been reading all the ML internet forums and you’ve become enamored with the idea of backprop and differentiable programming. Using your godlike powers, you redesign the whole human brain to be differentiable, and apply the following within-lifetime learning rule:
When something really bad happens, do backpropagation-through-time, editing the brain’s synapses to make that bad thing less likely to happen in similar situations in the future.
OK, to test your new design, you upgrade a random human, Ned. Ned then goes on a camping trip to the outback, goes to sleep, wakes up to a scuttling sound, opens his eyes and sees a huge spider running towards him. Aaaah!!!
The backpropagation kicks into gear, editing the synapses throughout Ned's brain so as to make that bad signal less likely in similar situations the future. What are the consequences of these changes? A bunch of things! For example:
- In the future, the decision to go camping in the outback will be viewed as less appealing. Yes! Excellent! That's what you wanted!
- In the future, when hearing a scuttling sound, Ned will be less likely to open his eyes. Whoa, hang on, that’s not what you meant!
- In the future, when seeing a certain moving black shape, Ned’s visual systems will be less likely to classify it as a spider. Oh jeez, this isn’t right at all!!
In The Credit Assignment Problem, Abram Demski describes actor-critic RL as a two-tiered system: an “instrumental” subsystem which is trained by RL to maximize rewards, and an “epistemic” subsystem which is absolutely not trained to maximize rewards, in order to avoid wishful thinking / wireheading.
Brains indeed do an awful lot of processing which is not trained by the main reward signal, for precisely this reason:
- Low-level sensory processing seems to run on pure predictive (a.k.a. self-supervised) learning, with no direct involvement of RL at all.
- Some higher-level sensory-processing systems seem to have a separate reward signal that reward it for discovering and attending to “important things” both good and bad—see discussion of inferotemporal cortex here.
- The brainstem and hypothalamus seem to be more-or-less locked down, doing no learning whatsoever—which makes sense since they’re the ones calculating the reward signals. (If the brainstem and hypothalamus were being trained to maximize a signal that they themselves calculate … well, it's easy enough to guess what would happen, and it sure wouldn't be “evolutionarily adaptive behavior”.)
- Other systems that help the brainstem and hypothalamus calculate rewards and other assessments—amygdala, ventral striatum, agranular prefrontal cortex, etc.—likewise seem to have their own supervisory training signals that are different from the main reward signal.
So we get these funny within-brain battles involving subsystems that do not share our goals and that we cannot directly intentionally control. I know intellectually that it's safe to cross the narrow footbridge over the ravine, but my brainstem begs to differ, and I wind up turning around and missing out on the rest of the walk. “Grrr, stupid brainstem,” I say to myself.
3. Deceptive AGIs
You're a human. You have designed an AGI which has (you believe) a good corrigible motivation, and it is now trying to invent a better solar panel.
- There's some part of the AGI's network that is imagining different ways to build a solar panel, and trying to find a good design;
- There's another part of the AGI's network that is choosing what words to say, when the AGI is talking to you and telling you what it’s working on.
(In the human case, we could point to different parts of the cortex. The parts are interconnected, of course, but they can still get different reward signals, just as in the bird example above.)
The obvious approach is to have one reward signal, widely broadcast, influencing both parts of the network. And if we get to a point where we can design reward signals that sculpt an AGI's motivation with surgical precision, that's fine! We would sculpt the motivation so that the AGI is trying to invent a better solar panel as a means to an end, with the ultimate goal of helping you. Then when it chooses words, it would also be maximally helpful.
I think it's more likely that we won't get to that point—that AGIs' motivations will always be a bit off from what we intended for various reasons, but we'll (hopefully) muddle through anyway thanks to conservativism, transparency, and so on. Anyway, the motivation will certainly be off during training, when it's still learning basic things about the world, the existence of people, itself, etc. So either way, we should be deeply concerned that the AGI may sometimes be overly motivated to invent a better solar panel. And then, when it chooses the words to say to you, it may choose its words partly to manipulate you so as to advance its goal of inventing a better solar panel—rather than choosing words that are purely maximally helpful and honest.
So it seems to me that the word-choosing part of the network should have a separate reward signal (or be built in a different way altogether), a reward signal which doesn't care a whit about inventing a better solar panel, a reward signal geared purely towards clear and honest communication.
I don’t have a specific plan here. I just feel like I'd be surprised if the best possible safe AGI design that anyone can dream up will involve using the exact same reward signal for “making plans” and “choosing what words to emit when taking to a human”. I think we'll want different reward signals. Just like the bird example above.
This kind of non-unified agent sounds kinda weird, right? Like an AGI with Tourette’s or something?!? But it shouldn’t sound weird! It’s similar to (or even isomorphic to) a robot taking actions while being inspected by an interpretability / oversight module:
Does an agent need to be "unified" to be reflectively stable?
“Need”? No. It’s clearly possible for a non-unified system—with different supervisory signals training different subsystems—to be reflectively stable. For example, take the system “me + AlphaZero”. I think it would be pretty neat to have access to a chess-playing AlphaZero. I would have fun playing around with it. I would not feel frustrated in the slightest that AlphaZero’s has “goals” that are not my goals (world peace, human flourishing, etc.), and I wouldn’t want to change that.
By the same token, if I had easy root access to my brain, I would not change my low-level sensory processing systems to maximize the same dopamine-based reward signal that my executive functioning gets. I don't want the wishful thinking failure mode! I want to have an accurate understanding of the world! (Y’know, having Read The Sequences and all…) Sure, I might make a few tweaks to my brain here and there, but I certainly wouldn’t want to switch every one of my brain subsystems to maximize the same reward signal.
(If AlphaZero were an arbitrarily powerful goal-seeking agent, well then, yeah, I would want it to share my goals. But it’s possible to make a subsystem that is not an arbitrarily powerful goal-seeking agent. For example, take AlphaZero itself—not scaled up, just literally exactly as coded in the original paper. Or a pocket calculator, for that matter.)
So it seems to me that a “non-unified” agent is not inevitably reflectively unstable. However, they certainly can be. Just like I have a few bones to pick with my brainstem, as mentioned above, it's likewise very possible for different parts of the agent to start trying to trick each other, or hack into each other, or whatever. This is an obvious potential failure mode that we’d be nuts to ignore.
It's not a new problem though. Remember the figure above: it's arbitrary where we draw the line between "the AGI" and "other algorithms interacting with and trying to influence the AGI". So it's not a fundamentally different type of problem from gradient hacking. Or, for that matter, deception in general. (After all, humans are algorithms too.)
The Fraught Valley
Still, while it's not a new problem, I'll still take this as an excuse to talk about solving it.
The way I’m thinking about it is:
- Early in training, we have The Path Of Incompetence, where the “executive / planning submodule” of the AGI is too stupid / insufficiently self-aware / whatever to formulate and execute a plan to undermine other submodules.
- Late in training, we can hopefully get to The Trail of Corrigibility. That’s where we have succeeded at making a corrigible AGI that understands and endorses the way that it’s built—just like how, as discussed above, my low-level sensory processing systems don’t share my goals, but I like them that way.
- If there’s a gap between those, we’re in, let’s call it, The Fraught Valley.
For example, go back to that figure above, and imagine using those interpretability / oversight tools to install and verify good motivations in the executive / planning submodule. The goal is to do this successfully before the AGI is sophisticated enough to undermine the interpretability tools themselves.
Or imagine trying to do value learning (IRL etc.) in an AGI that builds a world-model from scratch, as I believe humans do. Here we literally can’t install the right motivations from the get-go, because “the right motivations” are inevitably defined in terms of concepts like objective reality, people, self, etc., that are (as of yet) nowhere to be found in the world-model. So maybe we let it do some learning, with some carefully-thought-through curriculum of data and rewards, and spin up the IRL subsystem as soon as the world-model is developed enough to support it.
Anyway, the goal is to make the width of the Fraught Valley as small as possible, or better yet, eliminate it altogether. This involves:
- Making it hard and complicated to corrupt the various motivation-installation and interpretability systems. I don’t think it’s realistic to harden these systems against a superintelligent adversary, but every little roadblock we can think of is good—it helps stretch out the Path Of Incompetence.
- Meanwhile, we push from the other side by designing the AGI in such a way that we can install good motivations, and especially root out the most dangerous ones, early. This might involve things like directing its attention to learn corrigibility-relevant concepts early, and self-awareness late, or whatever. Maybe we should even try to hardcode some key aspects of the world-model, rather than learning the world-model from scratch as discussed above. (I’m personally very intrigued by this category and planning to think more along these lines.)
Success here doesn't seem necessarily impossible. It just seems like a terrifying awful mess. (And potentially hard to reason about a priori.) But it seems kinda inevitable that we have to solve this, unless of course AGI has a wildly different development approach than the one I'm thinking of.
Finally we get to the paper “Reward Is Enough” by Silver, Sutton, et al.
The title of this post here is a reference to the recent paper by David Silver, Satinder Singh, Doina Precup, and Rich Sutton at DeepMind.
I guess the point of this post is that I’m disagreeing with them. But I don’t really know. The paper left me kinda confused.
Starting with their biological examples, my main complaint is that they didn’t clearly distinguish “within-lifetime RL (involving dopamine)” from “evolution treated as an RL process maximizing inclusive genetic fitness”.
With the latter (intergenerational) definition, their discussion is entirely trivial. Oh, maximizing inclusive genetic fitness “is enough” to develop perception, language, etc.? DUUUHHHH!!!
With the former (within-lifetime) definition, their claims are mostly false when applied to biology, as discussed above. Brains do lots of things that are not “within-lifetime RL with one reward signal”, including self-supervised (predictive) learning, supervised learning, auxiliary reward signals, genetically-hardcoded brainstem circuits, etc. etc.
Switching to the AI case, they gloss over the same interesting split—whether the running code, the code controlling the AI’s actions and thoughts in real time, looks like an RL algorithm (analogous to within-lifetime dopamine-based learning), or whether they are imagining reward-maximization as purely an outer loop (analogous to evolution). If it’s the latter, then, well, then what they’re saying is trivially obvious (humans being an existence proof). If it’s the former, then the claim is nontrivial, but it’s also I think wrong.
As a matter of fact, I personally expect an AGI much closer to the former (the real-time running code—the code that chooses what thoughts to think etc.—involves an RL algorithm) than the latter (RL purely as an outer-loop process), for reasons discussed in Against Evolution as an analogy for how humans will create AGI. If that’s what they were talking about, then the point of my post here is that “reward is not enough”. The algorithm would need other components too, components which are not directly trained to maximize reward, like self-supervised learning.
Then maybe their response would be: “Such a multi-component system would still be RL; it’s just a more sophisticated RL algorithm.” If that’s what they meant, well, fine, but that’s definitely not the impression I got when I was reading the text of the paper. Or the paper's title, for that matter.