This post was written as part of research done at MATS 9.0 under the mentorship of Richard Ngo. My previous writing on how agents may approximate having external reference is optional but useful context.
The insufficiencies of idealised agents
Logically omniscient agents face notorious difficulties in reasoning about basic situations. The statement B => C is vacuously true, for any C, whenever B is false (or, in probabilistic terms, assigned probability 0). A machine using this logic can thus derive arbitrarily absurd consequences from any statement to which it assigns zero probability. This can pose serious problems; the Embedded Agency piece by Demski and Garrabrant explains some of the consequences, which include (but are not limited to) a money-maximiser taking $5 over $10 if its formal decision procedure considers the $5 first.
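To make the failure mode concrete, here is a minimal toy sketch of my own (not the formal setup from Embedded Agency): an agent that evaluates "if I took action a, I'd get reward r" as a material implication over a knowledge base in which it has already ruled out taking the $10.

```python
# Toy sketch (my own construction, not the formalism in Embedded Agency):
# a maximiser that evaluates counterfactuals via material implication
# over its current beliefs.

def entails(kb: dict, antecedent: str, consequent: str) -> bool:
    """B => C under material implication: true whenever B is false in the KB."""
    if not kb.get(antecedent, False):
        return True  # vacuously true; this is the root of the problem
    return kb.get(consequent, False)

kb = {
    "take_5": True,    # the agent has already settled on the $5 ...
    "take_10": False,  # ... so "take_10" is false (probability 0) in its model
}

# Both implications come out True, because the antecedent is false:
print(entails(kb, "take_10", "get_10"))  # True
print(entails(kb, "take_10", "get_0"))   # also True: an "arbitrarily absurd consequence"

# A decision procedure that trusts whichever implication it derives first can
# therefore conclude that the $10 branch pays $0, and stick with the $5.
```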
This issue seems to boil down to idealised agents struggling to run simulations of events they already know their actions will render hypothetical. Humans have no problem doing this; simulation in fact forms a key part of our cognition[1]. If I'm cooking in my kitchen, I'll assign the event of me putting my hand on the hot stove a probability of (very close to) zero[2]. This doesn't mean I cannot reason about it: I have a well-developed model of what putting my hand on the stove would entail, and that understanding is precisely what motivates me to keep the event's probability negligible. Actions such as maintaining spatial awareness in my kitchen, wearing gloves, or turning off the stove after cooking are all informed by my goal of making a hypothetical event "even more" unlikely.
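As a hedged illustration of the contrast (my own toy, with made-up numbers): naive conditioning on a probability-zero event tells you nothing, but an explicit model of the mechanism still answers the hypothetical.

```python
# Toy contrast (my own illustration, made-up numbers): conditioning on a
# probability-zero event vs. simulating the hypothetical with a mechanism model.

p_hand_on_stove = 0.0                  # I've arranged my life so this won't happen
p_burned_and_hand_on_stove = 0.0

def p_burned_given_hand_on_stove() -> float:
    # Bayesian conditioning breaks down here: 0 / 0 is undefined.
    return p_burned_and_hand_on_stove / p_hand_on_stove  # raises ZeroDivisionError

def simulate_hand_on_stove(stove_is_hot: bool) -> bool:
    # The mechanism is still represented, so the hypothetical is easy to answer,
    # no matter how unlikely I have made the antecedent.
    return stove_is_hot

print(simulate_hand_on_stove(stove_is_hot=True))  # True, which is exactly why
                                                  # I keep p_hand_on_stove near 0
```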
At first glance, this sort of agentic behaviour fits active inference better than decision theory. In active inference, agents act to bring about the observations they assign high probability to; this behaviour is called self-evidencing. Agents that self-evidence still need to consider counterfactuals, if only to successfully avoid them, the way I avoid getting burned.
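A crude sketch of self-evidencing (my own toy construction, not a faithful free-energy formulation): the agent strongly expects to observe itself unburned, and picks whichever action makes its predicted observation match that expectation.

```python
import math

# Crude sketch of self-evidencing (not a full free-energy treatment): the agent
# selects the action whose predicted observation it assigns the highest prior
# probability to, i.e. it acts to minimise surprise about what it observes.

prior = {"not_burned": 0.999, "burned": 0.001}  # observations the agent expects

# Hypothetical generative model: which observation each action is predicted to yield.
predicted_obs = {
    "keep_distance": "not_burned",
    "touch_stove": "burned",
}

def surprise(action: str) -> float:
    """Surprisal (-log probability) of the observation this action leads to."""
    return -math.log(prior[predicted_obs[action]])

best_action = min(predicted_obs, key=surprise)
print(best_action)  # "keep_distance": acting to bring about the high-prior observation
```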
Unfortunately, as I have written before, kicking the can down the road to active inference doesn't buy us much understanding. An active inference agent might favour $10 over $5 because it already has high priors on the $10 being more valuable. However, the idealised agent will also choose the $10 if it happens to consider that option before the $5; we might as well say (vacuously) that idealised agents make correct choices because they consistently have the instinct to check the right option first. A more enlightening question is therefore what exactly about our decision procedure lets us make such choices reliably.
Simulations as external reference
Embedded agents already have to deal with the issues of living in a world that includes them, and is therefore strictly larger than they are. From that point of view, it might seem unreasonable to expect them to spin up simulations, as this requires generating even more copies of the complex world in their model. However, I think this makes sense if we reframe simulation as enabled by agents' ability to approximate external reference.
Decision theory tends to run into serious conundrums like the aforementioned five-and-ten problem whenever agents try to model themselves internally. One possible solution is for agents to abstract themselves, or copies of themselves[3], as external agents. Externalisation allows the same machinery "intended for" multi-agent strategic interactions to additionally simulate the agent's own counterfactual behaviour.
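A hedged sketch of what I have in mind (names and structure are mine, purely illustrative): instead of reasoning about its own internals, the agent packages "a copy of me in this situation" as just another external player and feeds it to the same best-response machinery it would use for anyone else.

```python
# Hedged sketch (names and structure are my own, purely illustrative): reusing
# the machinery for predicting other agents to simulate an externalised copy
# of yourself.

def best_response(payoffs: dict) -> str:
    """Generic machinery for predicting any external agent: pick their best action."""
    return max(payoffs, key=payoffs.get)

def simulate_external_agent(agent_payoffs: dict) -> str:
    # Works for opponents, allies, or an externalised copy of "me".
    return best_response(agent_payoffs)

five_and_ten_payoffs = {"take_5": 5, "take_10": 10}

# "What would a copy of me do here?" is now an ordinary query about an external
# player, with no self-referential conditioning on my own probability-0 actions.
print(simulate_external_agent(five_and_ten_payoffs))  # "take_10"
```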
One reason I'm excited about this view of simulation is that it could be compatible with theories of agents coordinating with themselves across time. Temporal versions of an agent can be seen as modelling each other as distinct, uncertain, and not-totally-aligned entities. For instance, the time I set my alarm clock for is based on my (imperfect) model of how sleepy "future me" will be; staying up later than I originally planned in order to finish a LessWrong post may elicit feelings of guilt about disappointing my past self. And so on.
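To put toy numbers on the alarm-clock example (entirely made up, just to show the shape of the reasoning):

```python
# Toy numbers (entirely made up) for the alarm-clock example: tonight-me sets an
# alarm by modelling future-me as a separate agent with an imperfectly known
# snooze habit.

p_snooze = 0.6            # my model of how sleepy future-me will be
snooze_minutes = 20
must_be_up_by = 8 * 60    # 08:00, in minutes after midnight

# Budget for the expected amount of snoozing, so future-me is still up on time.
expected_delay = p_snooze * snooze_minutes
alarm_time = must_be_up_by - expected_delay
print(f"{int(alarm_time // 60):02d}:{int(alarm_time % 60):02d}")  # 07:48
```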
Of course, how we exhibit external reference is not trivial at all. I am, in fact, merely kicking the can even further down the road. I nevertheless find it encouraging that a theory of simulation could "fall out" as a consequence of understanding what it really means for agents to think of something as being external. Moreover, suppose agents really do reason about their past, future, and counterfactual selves with the same kinds of simulations they use for "other" agents. Then maybe multi-agent coordination across space and time could be described by a general theory that combines notions of dynamic inconsistency with game theory. There's at least one recent paper that points in this direction. The work frames agents' temporal variants as players in a Bayesian reputation game, and shows[4] that these agents can coordinate with each other.
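I won't try to reproduce that paper's setup here, but a cartoon of the reputation-game flavour (my own toy, with made-up numbers) looks something like this: later selves do a Bayesian update on whether "I" am the kind of agent who keeps commitments, based on what earlier selves actually did.

```python
# Cartoon sketch, NOT the cited paper's model: temporal selves as players in a
# tiny reputation game. Later selves update their credence that "I" am a
# commitment-keeping type, based on what earlier selves did.

prior_reliable = 0.5          # today's credence that "I" keep commitments
p_keep_if_reliable = 0.9      # assumed behaviour of the "reliable" type
p_keep_if_unreliable = 0.3    # assumed behaviour of the "unreliable" type

def update(prior: float, kept_commitment: bool) -> float:
    """Bayesian update on the 'reliable type' hypothesis after one observation."""
    l_r = p_keep_if_reliable if kept_commitment else 1 - p_keep_if_reliable
    l_u = p_keep_if_unreliable if kept_commitment else 1 - p_keep_if_unreliable
    return (l_r * prior) / (l_r * prior + l_u * (1 - prior))

# Keeping tonight's "go to bed on time" commitment raises tomorrow-me's trust;
# breaking it lowers it. That trust is what makes coordination across time pay off.
print(update(prior_reliable, kept_commitment=True))   # ~0.75
print(update(prior_reliable, kept_commitment=False))  # ~0.125
```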
[1] I find it aesthetically pleasing that the following argument itself uses a hypothetical as a reasoning aide.
[2] Section 2.1 of Embedded Agency explains briefly why giving "impossible" events a low-but-not-zero probability doesn't adequately address the issue.
[3] In cases such as the dilemmas presented by Demski's work on trust.
[4] Under some assumptions I haven't properly scrutinised yet.