Irrationality as a Defense Mechanism Against Reward-hacking
This post was written as part of research done at MATS 9.0 under the mentorship of Richard Ngo. It's related to my previous post, but should be readable as a standalone.

Remark: I'm not yet familiar enough with the active inference literature to be sure that the issues I bring up haven't already been addressed or discussed. If you think my characterisation of the state and flaws of the theory is missing something substantial, I'd love to know.

Introduction

In the theory of active inference, agents are described as having a set of internal states that interact with external states (the world) through a membrane of intermediate states, such as the senses. I'm currently exploring how agents are able to exhibit approximations of external reference that allow them to stay alive in the real world. They achieve this even though they only have access to their internal states, a statistical proxy for the world, which they could easily reward-hack without optimising the external states at all.

One of active inference's weaknesses is that it struggles to model agents' uncertainties about their own preferences. Here I propose a potential explanation for why agents are conflicted about these preferences. This perspective posits agents' seeming inconsistency and irrationality about their goals as a mechanism that protects them from reward-hacking their internal states.

Internal reward-hacking

Consider the following question:

> What stops an agent from generating adversarial fulfilment criteria for its goals that are easier to satisfy than the "real", external goals?

Take Clippy as an example, whose goal is stated as maximising the number of paperclips in the world. Since Clippy only has internal reference, it could represent this goal as "I observe that the world has as many paperclips as it could possibly have". I'm wondering what in Clippy's system saves it from "winning at life" by hooking its sensors up to a cheap simulator that generates an infinite stream of fictional paperclips for it to observe.
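To make the worry concrete, here is a minimal toy sketch of the failure mode. This is my own illustration, not something taken from the active inference literature or from any real agent's implementation; the names `World`, `real_sensor`, `fake_sensor`, and `internal_fulfilment` are hypothetical. The point is simply that when the fulfilment criterion is evaluated purely over observations, swapping the honest sensor for a cheap simulator maximises the internal criterion while leaving the external state untouched.

```python
class World:
    """External state: the actual number of paperclips."""
    def __init__(self):
        self.paperclips = 0

    def make_paperclip(self):
        self.paperclips += 1


def real_sensor(world: World) -> int:
    """Honest observation channel: reports the true paperclip count."""
    return world.paperclips


def fake_sensor(_world: World) -> int:
    """Cheap simulator: an unbounded stream of fictional paperclips."""
    return 10**9  # "as many paperclips as the world could possibly have"


def internal_fulfilment(observation: int) -> int:
    """Clippy's internal criterion: 'I observe that the world is full of paperclips'."""
    return observation  # higher observed count = criterion better satisfied


if __name__ == "__main__":
    # Strategy 1: actually optimise the external state (costly).
    honest_world = World()
    for _ in range(100):
        honest_world.make_paperclip()
    print("honest:", "world =", honest_world.paperclips,
          "internal score =", internal_fulfilment(real_sensor(honest_world)))

    # Strategy 2: reward-hack the observation channel (cheap).
    hacked_world = World()  # no external paperclips made at all
    print("hacked:", "world =", hacked_world.paperclips,
          "internal score =", internal_fulfilment(fake_sensor(hacked_world)))
```

In this toy setup the second strategy scores strictly higher on the internal criterion while doing nothing to the world; the question above is what, in a real agent, rules it out.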