The AI alignment problem as a consequence of the recursive nature of plans


Ruby's recent post, "Plans are Recursive & Why This is Important," has strongly shaped my thinking. I think the recursive nature of plans is a fact that deserves more attention, both when thinking about our own lives and from an AI alignment standpoint. The AI alignment problem seems to be inherent to pursuing any sufficiently complex goal.

Any sufficiently complex goal, in order to be achieved, needs to be broken down recursively into sub-goals until you reach executable actions. E.g. “becoming rich” is not an executable action, and neither is “leaving the maximum copies of your genes.” No one ever *does* those things. The same goes for values: what does it mean to value art, to value meaning, to value life? Whatever it is that you want, you need to translate it into muscle movements and/or lines of code in order to get anywhere close to achieving it. This is an image Ruby used in that post:

Now, imagine that you have a really complex goal in that upper node. Say you want to "create the best civilization possible" or "maximize happiness in the universe." Or you have a less complex goal, such as "having a good and meaningful life," but you're in a really complex environment, like the modern capitalist economy. A tree representing that goal and the sub-goals and actions it needs to be decomposed into would be very tall. I can't make my own image because I am on a borrowed computer, but imagine the tree above with something like 20 levels of nodes.
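To make the decomposition concrete, here is a minimal Python sketch of a goal tree. It is only an illustration under my own assumptions: the `Goal` class, the helper functions, and the toy "become rich" tree are all hypothetical and do not come from Ruby's post.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Goal:
    """A node in a plan tree. A node with no sub-goals is treated as an executable action."""
    name: str
    subgoals: list[Goal] = field(default_factory=list)

    def is_action(self) -> bool:
        return not self.subgoals


def executable_actions(goal: Goal) -> list[str]:
    """Recursively collect the leaf actions that a goal bottoms out in."""
    if goal.is_action():
        return [goal.name]
    actions: list[str] = []
    for sub in goal.subgoals:
        actions.extend(executable_actions(sub))
    return actions


def depth(goal: Goal) -> int:
    """Number of levels between this goal and its deepest executable action."""
    if goal.is_action():
        return 0
    return 1 + max(depth(sub) for sub in goal.subgoals)


# Hypothetical toy tree, invented purely for illustration.
become_rich = Goal("become rich", [
    Goal("get a well-paid job", [
        Goal("write a CV"),
        Goal("send applications"),
    ]),
    Goal("spend less than you earn", [
        Goal("cancel unused subscriptions"),
    ]),
])

print(executable_actions(become_rich))
# ['write a CV', 'send applications', 'cancel unused subscriptions']
print(depth(become_rich))
# 2
```

Only the leaves are things anyone can actually *do*; the root never appears in the list of actions. A tree for a goal like "maximize happiness in the universe" would have the same shape, just with far more levels between the root and the leaves.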

Now, our cognitive resources are not infinite, and they don't scale with the complexity of the environment. You wouldn't be able to store that entire tree in your mind at all times and always keep track of how your every action is serving your one ultimate terminal goal. So in order to act at all, in order to get anywhere, you need to forget a considerable part of your super-goals, and focus on the more achievable sub-goals -- you need to "zoom in."

Any agent that seeks X as an instrumental goal, with, say, Y as a terminal goal, can easily be outcompeted by an agent that seeks X as a terminal goal. If you’re thinking of Y all the time, you’re simply not going to do the best you can to get X. Someone who sees becoming a lawyer as their terminal goal, someone who is intrinsically motivated by it, will probably do much better at becoming a lawyer than someone who sees it as merely a step towards something else. That is analogous to how an AI trained only to do X will outcompete an AI trained to do X *and also* to value human lives, meaning, and everything else.

Importantly, this happens in humans at a very visceral level. Motivation, desire, wanting, are not infinite resources. If there is something that is, theoretically, your one true value/desire/goal, you're not necessarily going to feel any motivation at all to pursue it if you have lower-down sub-goals to achieve first, even if what originally made you select those sub-goals was that they brought you closer to that super-goal.

That may be why sometimes we find ourselves unsure about what we want in life. That also may be why we disagree on what values should guide society. Our motivation needs to be directed at something concrete and actionable in order for us to get anywhere at all.

So the crux of the issue is that we need to manage to 1) determine the sequence of sub-goals and executable actions that will lead to our terminal goal being achieved, and 2) make sure that those sub-goals and executable actions remain aligned with the terminal goal.

There are many examples of that going wrong. Evolution “wants” us to “leave the maximum copies of our genes.” The executable actions that this comes down to are things like “being attracted to specific features in the opposite sex that in the environment of evolutionary adaptedness correlated with leaving the maximum copies of your genes” and “having sex.” Nowadays, of course, having sex no longer reliably leads to spreading genes, so evolution is kind of failing at the human alignment problem.

Another example would be people who work their entire lives to succeed at a specific prestigious profession and get a lot of money, but when they do, they end up not being entirely sure of why they wanted that in the first place, and find themselves unhappy.

You can see humans maximizing for, e.g., pictures of feminine curves on Instagram as kind of like an AI maximizing paperclips. Some people think of the paperclip maximizer thought experiment as weird or arbitrary, but it really isn't any different from what we already do as humans. From what we have to do.

What AI does, in my view, is massively scale and exacerbate that already-existing issue. AI maximizes efficiency, maximizes our capacity to get what we want, and because of that, specifying what we want becomes the trickiest issue.

Goodhart’s law says that “When a measure becomes a target, it ceases to be a good measure.” But every action we take is a movement towards a target! And complex goals cannot serve as targets to be aimed at directly; pursuing them requires extensive use of proxies. The role of human language is largely to impress other humans. So when we say that we value happiness, meaning, a great civilization, or whatever, that sounds impressive and cool to other humans, but it says very little about what muscle movements need to be made or what lines of code need to be written.
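A toy illustration of how aiming at a proxy can come apart from the goal it stands in for. Everything here is an assumption made up for the example: the `Option` class, the candidate actions, and all the numbers are hypothetical, and the sketch only shows the shape of the problem, not a model of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """A candidate action with a measurable proxy score and an unmeasured true value."""
    name: str
    proxy_score: int  # what we can measure and optimize, e.g. clicks (invented numbers)
    true_value: int   # what we actually care about (invented numbers)


# Hypothetical candidates, invented purely for illustration.
options = [
    Option("write a thoughtful essay", proxy_score=40, true_value=9),
    Option("post a useful tutorial", proxy_score=55, true_value=7),
    Option("post clickbait", proxy_score=95, true_value=1),
]

# An optimizer that can only see the proxy picks whatever maximizes it.
best_by_proxy = max(options, key=lambda o: o.proxy_score)

# An idealized optimizer that could see the true value would pick differently.
best_by_true_value = max(options, key=lambda o: o.true_value)

print(best_by_proxy.name)       # post clickbait
print(best_by_true_value.name)  # write a thoughtful essay
```

The stronger the optimizer, the more reliably it finds the option where the proxy is highest, which in this toy setup is exactly the option where the true value is lowest.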