Fictionalized/Paraphrased version of a real dialog between me and John Wentworth.
Fictionalized Me: So, in the Eliezer/Richard dialogs, Eliezer is trying to get across this idea that consequentialism deeply permeates optimization, and this is important, and that's one[1] reason why Alignment is Hard. But something about it is confusing and slippery, and he keeps trying to explain it and it keeps not-quite-landing.
I think I get it, but I'm not sure I could explain it. Or, I'm not sure who to explain it to. I don't think I could tell who was making a mistake, where "consequentialism is secretly everywhere" is a useful concept for realizing-the-mistake.
Fictionalized John: [stares at me]
Me: Okay, I guess I'm probably supposed to try and explain this and see what happens.
...
Me: Okay, so the part that's confusing here is that this is supposed to be something that Eliezer thinks thoughtful, attentive people like Richard (and Paul?) aren't getting, despite them having read lots of relevant material and paying attention and being generally on board with "alignment is hard."
...so, what is a sort of mistake I could imagine a smart, thoughtful person who read the sequences making here?
My Eliezer-model imagines someone building what they think is an aligned ML system. They've trained it carefully to do things they reflectively approve of, they've put a lot of work into making it interpretable and honest. This Smart Thoughtful Researcher has read the sequences and believes that alignment is hard and whatnot. Nonetheless, they'll have failed to really grok this "consequentialism-is-more-pervasive-and-important-than-you-think" concept. And this will cause doom when they try to scale up their project to accomplish something actually hard.
I... guess what I think Eliezer thinks is that Thoughtful Researcher isn't respecting inner optimizers enough. They'll have built their system to be carefully aligned, but to do anything hard, it'll end up generating inner-optimizers that aren't aligned, and the inner-optimizers will kill everyone.
...
John: Nod. But not quite. I think you're still missing something.
You're familiar with the arguments about convergent instrumental goals?
Me: i.e. most agents will end up wanting power/resources/self-preservation/etc?
John: Yeah.
But not only is "wanting power and self-preservation" convergently instrumental. Consequentialism is convergently instrumental. Consequentialism is a (relatively) simple, effective process for accomplishing goals, so things that efficiently optimize for goals tend to approximate it.
Now, say there's something hard you want to do, like build a moon base, or cure cancer, or whatever. If there were a list of all the possible plans that cure cancer, ranked by "likely to work", most of the plans that might work would route through "consequentialism" and "acquire resources."
Not only that, most of the plans route through "acquire resources in a way that is unfriendly to human values." Because in the space of all possible plans, while consequentialism doesn't take that many bits to specify, human values are highly complex and take a lot of bits to specify.
Notice that I just said "in the space of all possible plans, here are the most common plans." I didn't say anything about agents choosing plans or acting in the world. Just listing the plans. And this is important, because the hard part lives in the raw difficulty of picking a good plan out of that space, not in any agent doing the picking.
Now, say you build an oracle AI. You've done all the things to try and make it interpretable and honest and such. If you ask it for a plan to cure cancer, what happens?
Me: I guess it gives you a plan, and... the plan probably routes through consequentialist agents acquiring power in an unfriendly way.
Okay, but if I imagine a researcher who is thoughtful but a bit too optimistic, what they might counterargue with is: "Sure, but I'll just inspect the plans for whether they're unfriendly, and not do those plans."
And what I might then counterargue their counterargument with is:
1) Are you sure you can actually tell which plans are unfriendly and which are not?
and,
2) If you're reading very carefully, and paying lots of attention to each plan... you'll still have to read through a lot of plans before you get to one that's actually good.
John: Bingo. I think a lot of people imagine asking an oracle to generate 100 plans, and they think that maybe half the plans will be pretty reasonable. But, the space of plans is huge. Exponentially huge. Most plans just don't work. Most plans that work route through consequentialist optimizers who convergently seek power, because you need power to do stuff. But then the space of consequentialist power-seeking plans is still exponentially huge, and most ways of seeking power are unfriendly to human values. The hard part is locating a plan that cures cancer and isn't hostile to human values in the first place.
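John's nested-fractions picture can be made concrete with a toy counting argument. A minimal sketch in Python; the bit counts below (1000-bit plans, a 900-bit "works" constraint, a 50-bit "friendly" constraint) are made-up assumptions purely for illustration, not estimates of anything real:

```python
# Toy illustration of the nested-fractions argument (all numbers made up).
# Model a plan as a string of B bits; each constraint on the plan
# ("it actually works", "it's friendly") costs some bits to satisfy.

B = 1000                 # bits to specify one plan -> 2**B possible plans
bits_to_work = 900       # assumed cost of "actually cures cancer"
bits_for_friendly = 50   # assumed *extra* cost of "doesn't trample human values"

total_plans = 2 ** B
working_plans = 2 ** (B - bits_to_work)
friendly_working_plans = 2 ** (B - bits_to_work - bits_for_friendly)

# Even a modest 50-bit "friendliness" constraint leaves only
# 1 in 2**50 of the *working* plans acceptable.
print("working plans:", working_plans)
print("friendly working plans:", friendly_working_plans)
print("ratio:", working_plans // friendly_working_plans)
```

The point is structural, not numerical: each independent constraint multiplies the surviving fraction of plan space by 2^-k, so "friendly and working" plans are exponentially rarer than "working" plans, however the raw bit counts shake out.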
Me: And it's not obvious to me whether this problem gets better or worse if you've tried to train the oracle to only output "reasonable-seeming plans", since it might then output plans that merely seem reasonable while being deceptively unaligned.
John: Do you understand why I brought up this plan/oracle example, when you originally were talking about inner optimizers?
Me: Hmm. Um, kinda. I guess it's important that there was a second example.
John: ...and?
Me: Okay, so partly you're pointing out that hardness of the problem isn't just about getting the AI to do what I want, it's that doing what I want is actually just really hard. Or rather, the part where alignment is hard is precisely when the thing I'm trying to accomplish is hard. Because then I need a powerful plan, and it's hard to specify a search for powerful plans that don't kill everyone.
John: Yeah. One mistake I think people end up making here is that they think the problem lives in the AI-who's-deciding/doing things, as opposed to in the actual raw difficulty of the search.
Me: Gotcha. And it's important that this comes up in at least two places – inner optimizers with an agenty AI, and an oracle that just outputs plans that would work. And the fact that it shows up in two fairly different places, one of which I hadn't thought of just now, is suggestive that it could show up in even more places I haven't thought of at all.
And this is confusing enough that it wasn't initially obvious to Richard Ngo, who's thought a ton about alignment. Which bodes ill for the majority of alignment researchers who probably are less on-the-ball.
[1] I'm tempted to say "the main reason" why Alignment Is Hard, but then remembered Eliezer specifically reminded everyone not to summarize him as saying things like "the key reason for X" when he didn't actually say that, and often is tailoring his arguments to a particular confusion with his interlocutor.
I'm having some trouble phrasing this comment clearly, and I'm also not sure how relevant it is to the post except that the post inspired the thoughts, so bear with me...
It seems important to distinguish between several things that could vary with time, over the course of a plan or policy:
(1) the observations the system makes as the plan unfolds;
(2) the objective the system is optimizing (which might itself update on those observations); and
(3) what the policy actually does as a result.
In principle, this is equivalent to a static objective function with terms for "how it would respond" to each possible sequence of observations (ignoring subtleties about orders over world-states vs. world-histories). But this has exactly the same structure as the previous point: it's more feasible to say "make an observation, then run this function to update the objective" than to unroll the same thing into a lookup table known entirely at the start.
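The "unrolling" point can be sketched directly: a per-observation update rule is a small program, while the equivalent static lookup table must enumerate every possible observation sequence, and so grows as |O|^T in the horizon T. The update rule below is an arbitrary stand-in, not a model of any real objective:

```python
from itertools import product

# A "dynamic" objective: small state plus an update rule applied per observation.
def update(objective_state, obs):
    # Stand-in update rule: the objective state just accumulates observations.
    return objective_state + (obs,)

# The equivalent "static" objective, unrolled: a lookup table mapping every
# possible length-T observation sequence to the resulting objective state.
def unroll(observations, T):
    table = {}
    for seq in product(observations, repeat=T):
        state = ()
        for obs in seq:
            state = update(state, obs)
        table[seq] = state
    return table

observations = ["a", "b", "c"]
for T in range(1, 6):
    print(T, len(unroll(observations, T)))  # table size grows as |O|**T
```

Both representations encode the same input-output behavior, which is the "in principle equivalent" claim above; the asymmetry is purely in the cost of writing them down.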
The recent discussions about consequentialism seem to be about the case where we have a task that takes a significant amount of real-world time, over which many observations (1) will be made with implications for subsequent decisions -- but over which the objective (2) is approximately unchanging. This setup leads to various scary properties of what the policies actually do (3).
But, I don't understand the rationale for focusing on this case where the objective (2) doesn't change. (In the sense of "doesn't change" specified above -- that we can specify it simply over long time horizons, rather than incurring an exp(T) cost for unrolling its updates on observations sequences.)
One reason to care about this case is the hope for oracle AI, since an oracle AI is something that receives "questions" (objectives simple enough for us to feel we understand them) and returns "answers" (plans that may take a long time to execute). This might produce a good argument that oracle AI is unsafe, but it doesn't apply to systems with changing objectives.
In the case of human intelligence, it seems to me that (2) evolves not too much more slowly than (1), and becomes importantly non-constant for longer-horizon cases of human planning.
If I set myself a brief and trivial goal like "make the kitchen cleaner over the next five minutes," I will spend those five minutes acting much like a clean-kitchen-at-all-costs optimizer, with all my subgoals pointing coherently in that direction ("wash this dish," "pick up the sponge"). If I set myself a longer-term goal like "get a new job," I may well find my preferences about the outcome have evolved substantially well before the task is complete.
This fact seems orthogonal to the fact that I am "good at search" relative to all known things that aren't humans. Relative to all non-humans, I'm very good at finding policies that are high-EV for the targets I'm trying to hit. But my targets evolve over time.
Indeed, I imagine this is why the complexity of human value doesn't create more of a problem for human action than it does. I don't have a simply-specifiable constant objective with a term for "make people happy" (or whatever); I have an objective with an update rule that reacts to human feedback over time. The update rule may have been optimized for something on an evolutionary timescale, but it's not obvious its application in an individual human can be modeled as optimizing anything.
(For a case that has the intelligence gap of humans/AGI, consider human treatment of animals. I've heard this brought up as an analogy for misaligned AI, and it's an interesting one. But the shape of the problem is not "humans are good at search, and have an objective which omits 'animal values,' or includes them in the wrong way." Sometimes people just decide to become vegan for ethical reasons! Sometimes whole cultures do.
This looks like a real case of individual values being updated, i.e. I don't think the right model of someone who goes vegan at age 31 is "this person is maximizing an objective which gives them points for eating animals, but only until age 31, and negative points thereafter.")
If we think of humans as a prototype case of an "inner optimizer," with evolution the outer optimizer, we have to note that the inner optimizer doesn't have a constant objective, even though the outer one does. The inner optimizer is very powerful, has the lasing property, and all of that, but it gets applied to a changing objective, which seems to produce qualitatively different results in terms of corrigibility, Goodhart, etc. The same thing could be true of an AGI, if it's the product of something like gradient descent rather than a system with an internal objective we explicitly wrote. This is not strong evidence that it will be true, but it at least motivates asking the question.
(It seems noteworthy, here, that when people talk about the causes of human misery / "non-satisfaction of human values," they typically point to things like scarcity, coordination problems, and society-level optimization systems with constant objectives. If we're good at search, and human value is complex, why aren't we constantly harming each other by executing incorrigibly on misaligned plans at an individual level? Something fitting this description no doubt happens, but it causes less damage than a naive application of AI safety theory would lead one to expect.)