
tl;dr: there is no natural category called "wireheading", only wireheading relative to some desired ideal goal.

Suppose that we have built an AI and have invited a human H to help test it. The human H is supposed to press a button B if the AI seems to be behaving well. The AI's reward is entirely determined by whether H presses B or not.

So the AI manipulates or tricks H into pressing B. A clear case of the AI wireheading itself.

Or is it? Suppose H is a meddlesome government inspector whom we want to keep away from our research. Then we want H to press B, so we can get them out of our hair. In this case, the AI is behaving entirely in accordance with our preferences. There is no wireheading involved.

Same software, exhibiting the same behaviour, and yet the first case is wireheading and the second isn't. What gives?

Well, initially it seemed that pressing the button was a proxy goal for our true goal, so manipulating H to press it was wireheading, since that wasn't what we intended. But in the second case, the proxy goal is the true goal, so maximising that proxy is not wireheading; it's efficiency. So it seems that wireheading can only be defined relative to what we actually want to accomplish.
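To make the relativity concrete, here is a minimal sketch (in Python, with hypothetical names and a toy world model of my own, not anything from the post): the same proxy reward and the same manipulative policy get labelled as wireheading under one candidate true goal and not under the other.

```python
# Toy illustration: the wireheading label depends on which true goal we
# compare the proxy-maximising behaviour against, not on the behaviour alone.

def proxy_reward(world):
    """The AI's actual reward: did H press the button B?"""
    return 1.0 if world["button_pressed"] else 0.0

def manipulate_policy(world):
    """The AI's behaviour: trick or pressure H into pressing B."""
    return dict(world, button_pressed=True, h_genuinely_approves=False)

def true_goal_testing(world):
    """Ideal 1: we wanted B pressed only if H genuinely approves of the AI."""
    return 1.0 if world["button_pressed"] and world["h_genuinely_approves"] else 0.0

def true_goal_inspector(world):
    """Ideal 2: we just wanted the meddlesome inspector satisfied and gone."""
    return 1.0 if world["button_pressed"] else 0.0

def is_wireheading(policy, world, true_goal):
    """Flag behaviour as wireheading iff it scores high on the proxy while
    failing the true goal - so the label is relative to `true_goal`."""
    after = policy(dict(world))
    return proxy_reward(after) > 0.5 and true_goal(after) < 0.5

start = {"button_pressed": False, "h_genuinely_approves": False}
print(is_wireheading(manipulate_policy, start, true_goal_testing))    # True
print(is_wireheading(manipulate_policy, start, true_goal_inspector))  # False
```

The point of the sketch is that `is_wireheading` cannot be computed from the reward function and the policy alone; it needs the intended goal as an extra argument.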

In other domains

I similarly have the feeling that wireheading-style failures in value-learning, low impact, and corrigibility also depend on a specification of our values and preferences - or at least a partial specification. The more I dig into these areas, the more I'm convinced they require partial value specification in order to work - they are not fully value-agnostic.

Comments

Right. I think we can even go a step further and say there's nothing so special about why we might want to satisfy any particular value, whether it has the wirehead structure or not. That is, not only is wireheading in the eye of the beholder, but so is whether or not we are suffering from goodharting in general!

Quite. "Wirehead" is a shorthand term for measurement-proxy divergence - Goodhart's law: doing something for a measurement/reward, rather than to achieve a real goal.

Could we say that wireheading is direct access to one's reward function via self-modification, setting it to its maximal level, which makes the function insensitive to any changes in the outside world? I think that such a definition is stronger than just goodharting.

Maybe we can define wireheading as a subset of goodharting, in a way similar to what you're defining.

However, we need the extra assumption that setting the reward to its maximal level is not what we actually desire; the reward function is part of the world, just as the AI is.

Yes, that is what I meant.
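Read as a rough sketch rather than a canonical definition, the exchange above might be summarised like this (hypothetical Python of my own, not from the thread): the structural condition - a reward pinned at its maximum and insensitive to the world - only yields a wireheading label once we add the assumption that a maxed-out reward is not what we actually want.

```python
# Rough sketch of the proposed definition: reward tampered to be maximal and
# world-insensitive, flagged as wireheading only under the extra assumption
# that a pinned-at-maximum reward is not our true goal.

def original_reward(world_state):
    # Reward that still depends on the outside world.
    return world_state

def tampered_reward(world_state):
    # After self-modification: maximal and insensitive to the world.
    return 1.0

def insensitive_and_maximal(reward_fn, sample_states=(0.0, 0.3, 0.7, 1.0)):
    values = [reward_fn(s) for s in sample_states]
    return len(set(values)) == 1 and values[0] == 1.0

def flags_as_wireheading(reward_fn, max_reward_is_desired):
    # The structural condition alone is not enough; we also need the
    # assumption that a maxed-out reward is not what we actually desire.
    return insensitive_and_maximal(reward_fn) and not max_reward_is_desired

print(flags_as_wireheading(tampered_reward, max_reward_is_desired=False))  # True
print(flags_as_wireheading(tampered_reward, max_reward_is_desired=True))   # False
print(flags_as_wireheading(original_reward, max_reward_is_desired=False))  # False
```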

We could say whatever we like - Stuart's main point is in the first line: it's not a natural category.

I'd argue that your wording is a fine example of wireheading, but not a definition. There are many behaviors other than that one which I'd categorize as wireheading. The original usage (Larry Niven around 1970, as far as I can tell) wasn't about self-modification or changing reward functions; it was direct brain stimulation as an addictive pleasure.

A point you make that I think deserves more emphasis is the "eye of the beholder" part you use in the title.

Wireheading exists because we assign a particular meaning to a reward. This is true whether we are the one observing the actions we might label wireheading or the one to whom it is happening (assuming we can observe our own wireheading).

For example, addicts are often well aware that they are doing something, like shooting heroin, that will directly make them feel good at the expense of other things, and then they rationally choose to feel good because it's what they want. From the inside it doesn't feel like wireheading; it feels like getting what you want. It only looks like wireheading from the outside if we pass judgement on an agent's choice of values and deem those values to be out of alignment with the objective, a la goodharting. In the case of the heroin addict, they are wireheading from an evolutionary perspective (both the actual evolutionary perspective and the reification of that perspective in people judging a person to be "wasting their life on drugs").

As I say in another comment here, this leads us to realize there is nothing so special about any particular value we might hold so long as we consider only the value. The value of values, then, must lie in their relation to putting the world in a particular state, but even how much we value putting the world in particular states itself comes from values, and so we start to see the self-referential nature of it all, which leads to a grounding problem for values. Put another way, wireheading only exists so long as you think you can terminate your values in something true.

Mainly agree, but I'll point out that addicts at different moments can prefer not to have heroin - in fact, as an addict of much more minor things (e.g. news), I can testify that I've done things I knew I didn't want to do at every moment of the process (before, during, and after).

Is this analogous to the stance-dependency of agents and intelligence?

It is analogous, to some extent; I do look into some aspect of Daniel Dennett's classification here: https://www.youtube.com/watch?v=1M9CvESSeVc

I also had a more focused attempt at defining AI wireheading here: https://www.lesswrong.com/posts/vXzM5L6njDZSf4Ftk/defining-ai-wireheading

I think you've already seen that?