The main problem with wireheading, manipulation, and the like seems to stem from a confusion between the goal in the world and its representation inside the agent. Perhaps one way to address this is to exploit the fact that the agent may be aware that it is an embedded agent: it could then be aware that the goal refers to an external fact about the world, and we could penalize the divergence between the goal and its internal representation during training.
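
As a very rough sketch (not part of the original idea's specification), assuming the external goal can be written down as a fixed embedding and the agent keeps an explicit, learnable internal representation of its goal, the penalty could simply be an extra term in the training loss. All names below (`EmbeddedAgent`, `loss_with_goal_anchor`, the toy dimensions) are hypothetical illustrations, not a worked-out proposal:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: an agent that, besides its policy, maintains an explicit internal
# representation of the goal. Training adds a penalty on the divergence between
# that internal representation and an externally specified goal embedding, so
# the learned goal stays anchored to the world-side specification.

class EmbeddedAgent(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, goal_dim: int):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 64), nn.ReLU(), nn.Linear(64, act_dim)
        )
        # The agent's own (learnable) representation of the goal.
        self.internal_goal = nn.Parameter(torch.randn(goal_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        goal = self.internal_goal.expand(obs.shape[0], -1)
        return self.policy(torch.cat([obs, goal], dim=-1))

def loss_with_goal_anchor(policy_loss, agent, external_goal, weight=0.1):
    # Penalize divergence between the internal goal representation and the
    # externally specified goal.
    divergence = F.mse_loss(agent.internal_goal, external_goal)
    return policy_loss + weight * divergence

# Toy usage
agent = EmbeddedAgent(obs_dim=8, act_dim=2, goal_dim=4)
external_goal = torch.zeros(4)           # stand-in for the world-side goal spec
obs = torch.randn(16, 8)
policy_loss = agent(obs).pow(2).mean()   # placeholder task loss
total = loss_with_goal_anchor(policy_loss, agent, external_goal)
total.backward()
```

The design choice here is just to make the internal goal representation an explicit, inspectable object that can be compared against the external specification, rather than leaving it implicit in the policy's weights.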