Wireheading is in the eye of the beholder

by Stuart_Armstrong 1 min read30th Jan 201910 comments


Ω 6

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

tl;dr: there is no natural category called "wireheading", only wireheading relative to some desired ideal goal.

Suppose that we have a built an AI, and have invited a human H to help test it. The human H is supposed to press a button B if the AI seems to be behaving well. The AI's reward is entirely determined by whether H presses B or not.

So the AI manipulates or tricks H into pressing B. A clear case of the AI wireheading itself.

Or is it? Suppose H was a meddlesome government inspector that we wanted to keep away from our research. Then we want H to press B, so we can get them our of our hair. In this case, the AI is behaving entirely in accordance with our preferences. There is no wireheading involved.

Same software, doing the same behaviour, and yet the first is wireheading and the second isn't. What gives?

Well, initially, it seemed that pressing the button was a proxy goal for our true goal, so manipulating H to press it was wireheading, since that wasn't what we intended. But in the second case, the proxy goal is the true goal, so maximising that proxy is not wireheading, it's efficiency. So it seems that the definition of wireheading is only relative to what we actually want to accomplish.

In other domains

I similarly have the feeling that wireheading-style failures in value-learning, low impact, and corrigibility, also depend on a specification of our values and preferences - or at least a partial specification. The more I dig into these areas, the more I'm convinced they require partial value specification in order to work - they are not fully value-agnostic.