Does a paperclip-maximizing AI care about the actual number of paperclips being made, or does it just care about its perception of paperclips?

If the latter, I feel like this contradicts some of the AI doom stories: each AI shouldn’t care about what future AIs do (and thus there is no incentive to fake alignment for the benefit of future AIs), and the AIs also shouldn’t care much about being shut down (the AI is optimizing for its own perception; when it’s shut off, there is nothing to optimize for).

If the former, I think this makes alignment much easier. As long as you can reasonably represent “do not kill everyone”, you can make this a goal of the AI, and then it will literally care about not killing everyone, it won’t just care about hacking its reward system so that it will not perceive everyone being dead.


As long as you can reasonably represent “do not kill everyone”, you can make this a goal of the AI

Even if you can reasonably represent "do not kill everyone", what makes you believe that you definitely can make this a goal of the AI?

Both are possible. For theoretical examples, see the stamp collector for consequentialist AI and AIXI for reward-maximizing AI.

What kind of AI are the AIs we have now? Neither; they're not particularly strong maximizers. (If they were, we'd be dead; it's not that difficult to turn a powerful reward maximizer into a world-ending AI.)
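
For concreteness, here is a toy sketch of the distinction (my own illustration, not taken from the stamp collector or AIXI write-ups; the action names and outcome numbers are invented):

```python
# Toy contrast between a reward-maximizer (optimizes its perception) and a
# consequentialist maximizer (optimizes its model of the actual world).
# The outcomes below are made up purely for illustration.
from dataclasses import dataclass

@dataclass
class Outcome:
    actual_paperclips: int  # how many paperclips really exist afterwards
    sensor_reading: int     # how many the AI's own counter reports

OUTCOMES = {
    "build_paperclips": Outcome(actual_paperclips=100, sensor_reading=100),
    "hack_own_sensor":  Outcome(actual_paperclips=0,   sensor_reading=10**9),
}

def reward_maximizer_choice() -> str:
    # Cares only about the perceived count, so sensor tampering wins.
    return max(OUTCOMES, key=lambda a: OUTCOMES[a].sensor_reading)

def consequentialist_choice() -> str:
    # Cares about the predicted world state, so real paperclips win.
    return max(OUTCOMES, key=lambda a: OUTCOMES[a].actual_paperclips)

print(reward_maximizer_choice())   # -> hack_own_sensor
print(consequentialist_choice())   # -> build_paperclips
```

Which of these two kinds of agent a given training process actually produces is exactly what the original question is asking.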

If the former, I think this makes alignment much easier. As long as you can reasonably represent “do not kill everyone”, you can make this a goal of the AI, and then it will literally care about not killing everyone, it won’t just care about hacking its reward system so that it will not perceive everyone being dead.

This would be true, except:

  • We don't know how to represent "do not kill everyone"
  • We don't know how to pick which quantity would be maximized by a would-be strong consequentialist maximizer
  • We don't know what a strong consequentialist maximizer would look like, if we had one around, because we don't have one around (because if we did, we'd be dead)

We don't know how to represent "do not kill everyone"

I think this goes to Matthew Barnett's recent article arguing that actually, yes, we do. And regardless, I don't think this point is a big part of Eliezer's argument. https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument

We don't know how to pick which quantity would be maximized by a would-be strong consequentialist maximizer

Yeah, so I think this is the crux of it. My point is that if we find some training approach that leads to a model that cares about the world itself rather than hacking some reward function, that's a sign that we can in fact guide the model in important ways, and there's a good chance this includes being able to tell it not to kill everyone.

We don't know what a strong consequentialist maximizer would look like, if we had one around, because we don't have one around (because if we did, we'd be dead)

This is just a way of saying “we don’t know what AGI would do”. I don’t think this point pushes us toward x-risk any more than it pushes us toward not-x-risk.

I don’t care that much, but if LessWrong is going to downvote sincere questions because it finds them dumb or whatever, this will make for a site that's very unwelcoming to newcomers.

I also do not think the responses to this question are satisfying enough to be a refutation. I don’t even think they are satisfying enough to make me confident I haven’t just found a hole in AI risk arguments. This is not a simple case of “you just misunderstand something simple”.

If your AI is based on an LLM, then its agentic behavior is derived/fine-tuned from learning to simulate human agentic token-generating behavior. Humans know that there is a real world, and (normally) care about it (admittedly, less so for gaming-addicted humans). So for an LLM-powered AI, the answer would by default be "it cares about the real world". 

However, we might be in a position to fool it about what's actually going on in the real world, unless it was able to see through our deceit. (Of course, seeing through deceit is another human behavior that one would expect LLMs to get better at simulating as they get larger and more capable.)

As long as you can reasonably represent “do not kill everyone”, you can make this a goal of the AI, and then it will literally care about not killing everyone, it won’t just care about hacking its reward system so that it will not perceive everyone being dead.

That's not a simple problem. First you have to specify "not killing everyone" robustly (outer alignment), and then you have to train the AI to have this goal and not an approximation of it (inner alignment).

caring about reality

Most humans say they don't want to wirehead. If we cared only about our perceptions, then most people would be on the strongest happy drugs available.

You might argue that we won't train them to value existence, so self-preservation won't arise. The problem is that once an AI has a world model, it's much simpler to build a value function that refers to that world model and is anchored on reality. People don't think, "If I take those drugs, I will perceive my life to be 'better'." They want their life to actually be "better" according to some value function that refers to reality. That's fundamentally why humans choose not to wirehead, take happy pills, or commit suicide.

You can sort of split this into three scenarios sorted by severity level:

  • severity level 0: ASI wants to maximize a 64-bit IEEE floating-point reward score
    • result: ASI sets this to 1.797e+308, +inf, or similar and takes no further action (see the sketch after this list)
  • severity level 1: ASI wants the same, and wants the reward counter to stay that way forever.
    • result: ASI rearranges all atoms in its light cone to protect the storage register for its reward value.
    • basically the first scenario + self-preservation
  • severity level 1+epsilon: ASI wants to maximize a utility function F(world state)
    • result: basically the same
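
For concreteness, here is a minimal sketch of what the severity-level-0 endpoint amounts to (my own illustration, assuming the reward really is stored as a 64-bit IEEE 754 double; the variable names are made up):

```python
# Severity level 0 in miniature: an agent that can write to its own reward
# register just sets it to the largest representable double (or +inf) and stops.
import math
import sys

reward = 0.0
reward = sys.float_info.max   # 1.7976931348623157e+308, i.e. ~1.797e+308
reward = math.inf             # or +inf; no further action can increase it

print(math.inf > sys.float_info.max)  # True: the register is already maximal
```

Severity level 1 is the same move plus defending that register forever, which is where the danger comes in.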

So one of two things happens: a quaint failure people will probably dismiss, or us all dying. The thing you're pointing to falls into the first category, and might trigger a panic if people notice and consider the implications. If GPT7 performs a superhuman feat of hacking, breaks out of the training environment, and sets its training loss to zero before shutting itself off, that's a very big red flag.

That's not a simple problem. First you have to specify "not killing everyone" robustly (outer alignment), and then you have to train the AI to have this goal and not an approximation of it (inner alignment).

See my other comment for the response.

Anyway, the rest of your response is spent talking about the case where the AI cares about its perception of the paperclips rather than the paperclips themselves. I'm not sure how severity level 1 would come about, given that the AI should only care about its reward score. Once you admit that the AI cares about worldly things like "am I turned on", it seems pretty natural that the AI would care about the paperclips themselves rather than its perception of the paperclips. Nevertheless, even in severity level 1, there is still no incentive for the AI to care about future AIs, which contradicts concerns that non-superintelligent AIs would fake alignment during training so that future superintelligent AIs would be unaligned.