The nature of doing interdisciplinary research is that you have to know a little about a lot of things. Unfortunately, it's hard to tell whether you know a little about something or are just misinformed about it.
Much of my agent-y thinking is inspired by, and seeks to adequately model, human cognition, but I realised I have no solid understanding of the relationship consciousness has to cognition. They're definitely not the same, since most processes I could describe as cognitive don't materialise in my consciousness. However, almost all the examples I find insightful feature conscious decision-making. This suggests that what cognition is without consciousness is opaque to me.
I'd like to clarify that most people do reward-hack themselves all the time; I think reward hacking is the default. What I'm grappling with here is rather why people don't reward-hack themselves literally to death.
In terms of "doing the real thing", could you elaborate on what "the real thing" means in the frame I'm using of external vs. internal states, or in some other frame you resonate with?
I only have a cursory knowledge of hierarchical active inference, but I have noticed from squinting at it from afar that it seems to afford some types of flexibility about preferences that I would value in a model. For instance, it seems to include mechanisms for making different preferences salient to the model at varying times. I'm also interested in your point that hierarchical structures can describe preferences that are increasingly "locked in" for the model. Thanks for tipping me off to that, and thanks for the resources!
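To make the salience mechanism concrete for myself, here's a minimal sketch of how I currently picture it: a higher level assigns a context-dependent weight (a cartoon of precision-weighting) to competing lower-level preferences, so which preference dominates varies over time. This is my own toy construction, not code from any active inference framework; the preferences, contexts, and weights are all made up for illustration.

```python
import numpy as np

# Two low-level preference distributions over outcomes [rest, work].
pref_comfort = np.array([0.9, 0.1])  # prefers resting
pref_career = np.array([0.2, 0.8])   # prefers working

def effective_preference(career_salience):
    """Blend preferences by a context-dependent salience weight in [0, 1]."""
    blended = (1 - career_salience) * pref_comfort + career_salience * pref_career
    return blended / blended.sum()

# The higher level makes different preferences salient in different contexts.
for context, w in [("sunday morning", 0.1), ("deadline week", 0.9)]:
    p = effective_preference(w)
    print(f"{context}: P(rest)={p[0]:.2f}, P(work)={p[1]:.2f}")
```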
There are two questions that currently guide my intuitions on how interesting a (speculative) model of agentic behavior might be.
1. How well does this model map onto real agentic behavior?
2. Is the described behavior "naturally" emergent from environmental conditions?
In a recent post I argued, in slightly different words, that seeing agents as confused, uncertain and illogical about their own preferences is fruitful because it answers the second question in a satisfactory way. Internal inconsistency is not only a commonly observable behavior (cf. behavioral economics); it is also an emergent strategy that protects agents against finding adversarial internal representations of external goals. I called this internal reward-hacking.
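To illustrate the protective effect I have in mind, here's a toy experiment (my own construction, with made-up numbers, not something from the post): each internal proxy of an external goal has one exploitable flaw, and an agent optimizing a single consistent proxy finds the exploit, while an agent averaging over mutually inconsistent proxies does not.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.linspace(-5, 15, 2001)
true_goal = -(xs - 3) ** 2  # the external goal: peak at x = 3

def proxy(spike_loc):
    """An internal representation of the goal with one exploitable flaw:
    a spurious high-reward spike at spike_loc."""
    return true_goal + 150 * np.exp(-((xs - spike_loc) ** 2) / 0.01)

# A single, consistent internal proxy: the agent hacks the spike.
single = proxy(spike_loc=11.0)
print("single proxy argmax:", xs[single.argmax()])  # finds the spurious spike

# Mutually inconsistent proxies (flaws in different places): averaging
# them washes any single exploit out.
ensemble = np.mean([proxy(rng.uniform(-5, 15)) for _ in range(50)], axis=0)
print("inconsistent ensemble argmax:", xs[ensemble.argmax()])  # close to x = 3
```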
What I think I failed to communicate is that my final thoughts on preferences competing with each other also come from reflecting on question 2. It makes sense that agents might weigh different memes representing possible preferences to improve their decision-making, but this doesn't explain how those memes behave with respect to the agent.
Memes are subject to selection pressures not unlike animals. Moreover, memes can co-adapt or compete with each other, and they can arguably engage in active transmission, such as influencing their host's behavior to enable their own spread. It therefore feels reasonable for models of human cognition to imbue memes with some agency, enabling them to perceive and interact with each other and their hosts (Claude's literature review indicates that at least some memeticists disagree with me on this point).
A model that aspires to respect these agentic properties of memes would thus be incomplete without describing what incentives memes have to participate in a cognitive system, and how those incentives shape the system's emergent properties. That's why I think it's insufficient to see potential preferences as impartial, passive sub-modules that agents deploy to enable their own rationality. Instead, we should always attempt to answer "what's in it for the memes?"
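As one concrete (if cartoonish) way to ask "what's in it for the memes?", standard replicator dynamics already capture the selection pressures above: a meme's frequency grows when its transmission payoff beats the population average, and co-adapted memes can jointly crowd out a competitor. The payoff matrix below is invented purely for illustration.

```python
import numpy as np

# A[i][j]: transmission payoff to meme i when its host also carries
# meme j. Memes 0 and 1 boost each other's spread (co-adaptation);
# meme 2 only helps itself. All values are made up.
A = np.array([
    [1.0, 1.4, 0.3],
    [1.3, 1.0, 0.2],
    [0.1, 0.2, 1.1],
])

x = np.array([0.4, 0.1, 0.5])  # initial meme frequencies in the host population
for _ in range(200):
    fitness = A @ x
    x = x * fitness / (x @ fitness)  # discrete replicator update

print("long-run meme frequencies:", np.round(x, 3))
# The co-adapted pair (memes 0 and 1) persists; the loner meme 2 dies out,
# despite starting as the most common meme.
```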
more conscious: deciding what move to make in a chess game.
less conscious: the physical act of playing a move. You can move the piece in a conscious, deliberate way, but in practice the movement usually follows "automatically" from the high-level decision of what move to play.
not conscious: reflex reactions.