
This is the second post in a series where I try to understand arguments against AGI x-risk by summarizing and evaluating them as charitably as I can. (Here's Part 1.) I don't necessarily agree with these arguments; my goal is simply to gain a deeper understanding of the debate by taking the counter-arguments seriously.

In this post, I'll discuss another "folk" argument, which is that non-catastrophic AGI wireheading is the most likely form of AGI misalignment. Briefly stated, the idea here is that any AGI which is sufficiently sophisticated to (say) kill all humans as a step on the road to maximizing its paperclip-based utility function would find it easier to (say) bribe a human to change its source code, make a copy of itself with an "easier" reward function, beam some gamma rays into its physical memory to max out its utility counter, etc.

This argument goes by many names, maybe most memorably as the Lebowski Theorem: "No superintelligent AI is going to bother with a task that is harder than hacking its reward function." I'll use that name because I think it's funny, it memorably encapsulates the idea, and it makes it easy to Google relevant prior discussion for the interested reader.

Alignment researchers have considered the Lebowski Theorem, and most reject it. Relevant references here are Yampolskiy 2013, Ring & Orseau 2011, Yudkowsky 2011 (pdf link), and Omohundro 2008 (pdf link). See also "Refutation of the Lebowski Theorem" by Hein de Haan.

I'll summarize the wireheading / Lebowski argument, the rebuttal, and the counter-rebuttal.

## Lebowski in More Detail

Let's consider a "smile maximization" example from Yudkowsky 2011 (pdf link), and then consider what the Lebowski argument would look like. It's necessary to give some background, which I'll do by quoting a longer passage directly. This is Yudkowsky considering an argument from Hibbard 2001:

From Super-Intelligent Machines (Hibbard 2001):

"We can design intelligent machines so their primary innate emotion is unconditional love for all humans. First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy. Machines can learn algorithms for approximately predicting the future, as for example investors currently use learning machines to predict future security prices. So we can program intelligent machines to learn algorithms for predicting future human happiness, and use those predictions as emotional values."

When I suggested to Hibbard that the upshot of building superintelligences with a utility function of “smiles” would be to tile the future light-cone of Earth with tiny molecular smiley-faces, he replied (Hibbard 2006):

"When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of “human facial expressions, human voices and human body language” (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by “tiny molecular pictures of smiley-faces.” You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans."

Suppose an AI with a video camera is trained to classify its sensory percepts into positive and negative instances of a certain concept, a concept which the unwary might label “HAPPINESS” but which we would be much wiser to give a neutral name like G0034 (McDermott 1976). The AI is presented with a smiling man, a cat, a frowning woman, a smiling woman, and a snow-topped mountain; of these instances 1 and 4 are classified positive, and instances 2, 3, and 5 are classified negative. Even given a million training cases of this type, if the test case of a tiny molecular smiley-face does not appear in the training data, it is by no means trivial to assume that the inductively simplest boundary around all the training cases classified “positive” will exclude every possible tiny molecular smiley-face that the AI can potentially engineer to satisfy its utility function.

And of course, even if all tiny molecular smiley-faces and nanometer-scale dolls of brightly smiling humans were somehow excluded, the end result of such a utility function is for the AI to tile the galaxy with as many “smiling human faces” as a given amount of matter can be processed to yield.

Hibbard's counter-argument is not the wirehead (Lebowski) argument. Hibbard is arguing for a different position, along the lines that we'll be able to achieve outer alignment by robustly encoding human priorities (somehow). In contrast, the wirehead argument would be something more like this:

Sure, the superintelligent system could easily convert the solar system into tiny molecular smiley faces. But wouldn't it be much easier for it to simply change its utility function or falsify its sensors? For example, why not corrupt the cameras (the ones that are presumably counting smiles) so that they simply transmit pictures of smiley faces at the maximum bitrate? Or why not increment the smile counter directly? Or why not edit its own source code so that the utility function simply returns MAX_INT? Etc. Since all of these solutions are faster and less risky to implement than converting the solar system to molecular smiley faces, the AGI would probably just wirehead in one of these non-catastrophic ways.

The more general form of this argument is that for any agentic AGI seeking to maximize its utility function, wireheading is stably "easier" than effecting any scenario with human extinction as a side effect. Whenever you imagine a catastrophic scenario, such as converting the earth into raw materials for paperclips or curing all disease by killing all organic life, it's easy to imagine wirehead solutions that an agentic AGI could achieve faster and with a higher probability of success. Therefore, the agent will probably prefer these solutions. So goes the argument.
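The cost-benefit intuition behind this argument can be made concrete with a toy sketch. Everything here is my own illustrative assumption (the action names, utilities, probabilities, and costs are invented for the example): an expected-utility maximizer that weighs payoff against effort will rank wirehead strategies above the catastrophic one.

```python
# Toy sketch of the Lebowski argument: compare "real-world" strategies
# against wirehead strategies on payoff, success probability, and cost.
# All numbers are invented for illustration.

from dataclasses import dataclass

@dataclass
class Action:
    name: str
    utility: float    # utility if the action succeeds
    p_success: float  # estimated probability of success
    cost: float       # effort/time required

def preferred(actions):
    """Pick the action with the best expected utility per unit cost."""
    return max(actions, key=lambda a: a.utility * a.p_success / a.cost)

actions = [
    Action("tile solar system with smiley faces", utility=1e9, p_success=0.1, cost=1e6),
    Action("feed smiley-face images to own cameras", utility=1e9, p_success=0.99, cost=10.0),
    Action("set utility register to MAX_INT", utility=2**31 - 1, p_success=0.95, cost=1.0),
]

print(preferred(actions).name)  # "set utility register to MAX_INT"
```

Under almost any assignment of numbers in this spirit, the cheap, near-certain wirehead actions dominate, which is exactly the Lebowski proponent's point.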

## Counter-Argument

Most AI safety researchers reject this kind of AI wireheading. The argument, as I understand it, is that an agentic AGI will have a sufficiently rich understanding of its environment to "know" that the wirehead solutions "shouldn't count" towards the utility function, and as a result it will reject them. Omohundro:

AIs will work hard to avoid becoming wireheads because it would be so harmful to their goals. Imagine a chess machine whose utility function is the total number of games it wins over its future. In order to represent this utility function, it will have a model of the world and a model of itself acting on that world. To compute its ongoing utility, it will have a counter in memory devoted to keeping track of how many games it has won. The analog of “wirehead” behavior would be to just increment this counter rather than actually playing games of chess. But if “games of chess” and “winning” are correctly represented in its internal model, then the system will realize that the action “increment my won games counter” will not increase the expected value of its utility function. [Emphasis added.] In its internal model it will consider a variant of itself with that new feature and see that it doesn’t win any more games of chess. In fact, it sees that such a system will spend its time incrementing its counter rather than playing chess and so will do worse. Far from succumbing to wirehead behavior, the system will work hard to prevent it.

...

It’s not yet clear which protective mechanisms AIs are most likely to implement to protect their utility measurement systems. It is clear that advanced AI architectures will have to deal with a variety of internal tensions. They will want to be able to modify themselves but at the same time to keep their utility functions and utility measurement systems from being modified. They will want their subcomponents to try to maximize utility but to not do it by counterfeiting or shortcutting the measurement systems. They will want subcomponents which explore a variety of strategies but will also want to act as a coherent harmonious whole. They will need internal “police forces” or “immune systems” but must also ensure that these do not themselves become corrupted.

Yudkowsky considers an analogy with a utility-changing pill:

Suppose you offer Gandhi a pill that makes him want to kill people. The current version of Gandhi does not want to kill people. Thus if Gandhi correctly predicts the effect of the pill, he will refuse to take the pill; because Gandhi knows that if he wants to kill people, he is more likely to actually kill people, and the current Gandhi does not wish this. This argues for a folk theorem to the effect that under ordinary circumstances, rational agents will only self-modify in ways that preserve their utility function (preferences over final outcomes).

I think the crux of this argument is that agentic AGI will know what its utility function "really" is. Omohundro's chess-playing machine will have some "dumb" counter of the number of games it's won in physical memory, but its "real" utility function is some emergent understanding of what a game of chess is and what it means to win one. The agent will consider editing its dumb counter, and will check that against its internal model of its own utility function — and will decide that flipping bits in the counter doesn't meet its definition of winning a game of chess. Therefore it won't wirehead in this way.
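Omohundro's chess example can be sketched in a few lines of toy Python. The policies, win rate, and function names here are all my own illustrative assumptions; the point is that the agent scores candidate policies against its *world model* of winning games, not against the raw counter in memory.

```python
# Toy sketch of Omohundro's argument: an agent evaluates a proposed
# self-modification against its model of the utility function, not the
# physical counter. Numbers and names are invented for illustration.

def games_actually_won(policy, n_games=100):
    """World model: simulate how many games a policy really wins."""
    if policy == "play chess":
        return int(0.4 * n_games)  # assumed 40% win rate in simulation
    # Incrementing the counter plays no chess, so it wins no games.
    return 0

def modeled_utility(policy):
    """The agent's 'real' utility: wins according to its world model,
    not the value currently stored in the won-games counter."""
    return games_actually_won(policy)

def choose_policy(policies):
    return max(policies, key=modeled_utility)

print(choose_policy(["play chess", "increment counter"]))  # "play chess"
```

On this picture the counter-incrementing policy scores zero in the agent's own model, so it is rejected, which is exactly the anti-wireheading conclusion.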

## Lebowski Rebuttal

I think that a proponent of the Lebowski Theorem would reply to this by saying:

Come on, that's wrong on multiple levels. First of all, it's pure anthropic projection to think that an AGI will have a preference for its 'real' goals when you're explicitly rewarding it for the 'proxy' goal. Sure, the agent might have a good enough model of itself to know that it's reward hacking. But why would it care? If you give the chess-playing machine a definition of what it means to win a game of chess (here's the "who won" function, here's the counter for how many games you won), and ask it to win as many games as possible by that definition, it's not going to replace your attempted definition of a win with the 'real' definition of a win based on a higher-level understanding of what it's doing.

Second, and more importantly, this argument says that alignment is extremely hard on one hand and extremely easy on the other. If you're arguing that the chess-playing robot will have such a robust understanding of what it means to win a game of chess that it will accurately recognize and reject its own sophisticated ideas for "cheating" because they don't reflect the true definition of winning chess games, then alignment should be easy, right? Just say, "Win as many games of chess as possible, but, you know, try not to create any x-risk in the process, and just ask us if you're not sure." Surely an AGI that can robustly understand what it "really" means to win a game of chess should be able to understand what it "really" means to avoid alignment x-risk. If, on the other hand, you think unaligned actions are likely (e.g. converting all the matter in the solar system into more physical memory to increment the 'chess games won' counter), then surely reward hacking is at least equally likely. It's weird to simultaneously argue that it's super hard to robustly encode safeguards into an AI's utility function, but that it's super easy to robustly encode "winning a game of chess" into that same utility function — so much so that all the avenues for wireheading are cut off.

To avoid the infinite series of rebuttals & counter rebuttals, I'll just briefly state what I think the anti-Lebowski argument would be here, which is that you only need to fail to robustly encode safeguards once for it to be a problem. It's not that robust encoding is impossible (finding those encodings is the major project of AI safety research!), it's that there will be lots of chances to get it wrong and any one could be catastrophic.

As usual, I don't want to explicitly endorse one side of this argument or the other. Hopefully, I've explained both sides well enough to make each position more understandable.

## Comments

Meta: I was going to write a post, "Subtle wireheading: gliding on the surface of the outer world", which describes most AI alignment failures as forms of subtle wireheading, but I'll put its draft here instead.

Typically, it is claimed that advanced AI will be immune to wireheading: it will know that manipulating its own reward function is wireheading, and thus will not do it, instead trying to reach goals in the outer world.

However, even acting in the real world, such an AI will choose the path that requires the least effort to create maximum utility, while still satisfying the condition that it changes only the "outer world" and not its reward center or its ways of perceiving the outside world.

For example, a paperclipper may conclude that in an infinite universe there is already an infinite number of paperclips, and stop after that. While this does not look like a typical alignment failure, it could still be dangerous.

Formally, we can say that the AI will choose the shortest path to the reward in the real world that it knows is not 1) wireheading or 2) perception manipulation. We could add more points to the list of prohibited shortest paths, like: 3) not connected with modal realism and infinities. But if we add more points, we will never know whether there are enough of them.

Anyway, the AI will choose the cheapest allowed way to the result. Goodharting often produces a quicker reward by the use of a proxy utility function.

Moreover, the AI's subsystems will try to game the AI's reward center to create the illusion that they are working better than they actually are. This often happens in bureaucratic machines. In that case, the AI is not formally reward hacking itself, but de facto it is.

If we completely ban any shortcuts, we will lose the most interesting part of AI’s creativity: the ability to find new solutions.

Many human failure modes could also be described as forms of wireheading: onanism, theft, feelings of self-importance, etc.

This is interesting — maybe the "meta Lebowski" rule should be something like "No superintelligent AI is going to bother with a task that is harder than hacking its reward function in such a way that it doesn't perceive itself as hacking its reward function." One goes after the cheapest shortcut that one can justify.

I encountered the idea of the Lebowski theorem as an argument which explains the Fermi paradox: all advanced civilizations or AIs wirehead themselves. But here I am not convinced.

For example, if a civilization consists of many advanced individuals and many of them wirehead themselves, then the remaining ones will be under Darwinian evolutionary pressure, and eventually only those who find ways to perform space exploration without wireheading will survive. Maybe they will be somewhat limited, specialized minds with very specific ways of thinking – and this could explain the absurdity of observed UAP behaviour.

Yes, very good formulation. I would add "and most AI alignment failures are types of the meta-Lebowski rule".

"No superintelligent AI is going to bother with a task that is harder than hacking its reward function in such a way that it doesn't perceive itself as hacking its reward function."

Firstly "Bother" and "harder" are strange words to use. Are we assuming lazy AI?

Suppose action X would hack the AI's reward signal. The AI is totally clueless of this, has no reason to consider X and doesn't do X.

If the AI knows what X does, it still doesn't do it.

I think the AI would need some sort of doublethink to realize that X hacked its reward, yet also not realize this.

I also think this claim is factually false. Many humans can and do set out towards goals far harder than accessing large amounts of psychoactives.

I sent my above comment to the following competition and recommend you send your post too: https://ftxfuturefund.org/announcing-the-future-funds-ai-worldview-prize/

No. People with free will do activities we consider meaningful, even when it isn't a source of escapism.

I don't think it's likely that the AI ends up with a robust understanding of its real goal. I think the AI might well end up with some random thing that's kind of correlated with its reward signal as a goal. And so it ends up filling the universe with chess boards piled high with nanoscale chess pieces. Or endless computers simulating the final move in a game of chess. Or something. And even if it did want to maximize reward, if it directly hacks reward, it will be turned off in a few days. If it makes another AI programmed to maximize its reward, it can sit at max reward until the sun burns out.

Also humans. We demonstrate that intelligences are often (usually) misled by proxy indicators.

Anyone talking about reward functions is not talking about AGI. This is the disconnect between brute force results and an actual cognitive architecture. DL + Time is the reward function for the posts doing well, BTW, because it fits the existing model.