Or: Identities as Schelling Fences for Embedded Agents
This post was written as part of research done at MATS 9.0 under the mentorship of Richard Ngo. He contributed significantly to the ideas discussed within.
Introduction
This post questions the sanctity of the "agent" and discusses how Temporal Instances (TIs) of an agent can enter conflict due to distrust. These dynamics are describable mathematically as an intrapersonal cooperative game. I define a time-version of Nash equilibria and show an example of a self-punishing pattern between TIs that is nevertheless stable.
This leads us to ask what conditions allow disparate parts of an agent to cooperate harmoniously. I conjecture that agents showing a degree of consistency in their actions over time can be seen as adhering to an identity that replaces Common Knowledge of Rationality (CKR) between the game's players. In subscribing to a common identity, TIs declare trust in each other akin to that which an updateless[1] agent would embody.
I next consider the shape that a formal statement and proof of this conjecture are likely to take. This will involve a translation of universal type spaces to intrapersonal games for a complete treatment of CKR. I also cover which other notions of equilibria and solution concepts would be helpful to adapt into a framework of self-coordination games.
At the end, I briefly discuss the relevance of this work to AI, along with plans to make it more directly applicable.
The incoherent "self"
Much hay has been made over whether individuals have consistent preferences over worlds, and what properties their induced utility function might have. A primary motivation for this line of work is that individuals who fail to order worlds coherently could be money-pumped for arbitrarily high value. However, money pumping requires a sequence of trades, which makes it a fundamentally temporal phenomenon. It is thus equally important for unexploitable agents to have consistent preferences across time.
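To make the money-pump argument concrete, here is a minimal sketch (not from this post; the good names and fee are illustrative) of an agent with cyclic preferences A ≺ B ≺ C ≺ A. Because the agent accepts any trade "up" its preference cycle, a trader charging a small fee per swap can walk it around the cycle indefinitely, extracting arbitrary value over a sequence of trades:

```python
# Hypothetical cyclic preference relation: does the agent strictly
# prefer `offered` to `current`? (B over A, C over B, A over C.)
PREFERS = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}

def prefers(current, offered):
    return PREFERS.get((current, offered), False)

def money_pump(start_good="A", fee=1.0, rounds=9):
    """Trader repeatedly offers the swap the agent prefers, charging `fee` each time."""
    good, money = start_good, 0.0
    cycle = {"A": "B", "B": "C", "C": "A"}
    for _ in range(rounds):
        offered = cycle[good]
        if prefers(good, offered):          # the agent accepts every offer...
            good, money = offered, money - fee  # ...and pays the fee each trade
    return good, money

final_good, final_money = money_pump()
print(final_good, final_money)  # after 9 trades: back at "A", down 9.0 units
```

Each individual trade looks like an improvement to the agent, yet the sequence returns it to its starting good strictly poorer, which is exactly why the exploit is inherently temporal.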