akeshet

In EY's talk AI Alignment: Why It's Hard, and Where to Start, he describes alignment problems using the toy example of a utility function that is {1 if cauldron full, 0 otherwise} and its vulnerabilities, along with attempts at making it safer by adding so-called impact penalties. He talks through (timestamp 18:10) one such possible penalty, the Euclidean distance penalty, and the various flaws it leaves open.
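For concreteness, here is a rough sketch of that kind of penalized objective, in my own notation rather than anything from the talk: let $s$ be the resulting world state and $s_0$ a baseline state (say, the world had the agent done nothing), and have the agent maximize

$$U(s) = \mathbb{1}[\text{cauldron full}](s) \;-\; \lambda \,\lVert s - s_0 \rVert_2$$

where $\lambda$ trades off filling the cauldron against moving the world away from the baseline.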

That penalty function does seem quite vulnerable to unwanted behaviors. But what about a more physical one, such as a penalty on additional-energy-consumed-due-to-the-agent's-actions, or additional-entropy-created-due-to-the-agent's-actions? These don't seem to have precisely the same vulnerabilities, and intuitively they also seem more robust against an agent attempting to do highly destructive things, which typically consume a lot of energy.
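As a minimal sketch of the alternative I have in mind (again my own notation, and assuming we can define a do-nothing baseline trajectory to compare against):

$$U(s) = \mathbb{1}[\text{cauldron full}](s) \;-\; \lambda \,\big(E_{\text{agent}}(s) - E_{\text{baseline}}\big)$$

where $E_{\text{agent}}$ is the energy consumed along the agent's actual trajectory and $E_{\text{baseline}}$ is the energy that would have been consumed anyway; the entropy version would replace the energy difference with the corresponding $\Delta S$.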