Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'll argue here that we should make an aligned AI which is a causal decision theorist.

# Son-of-CDT

Suppose we are writing code for an agent with an action space and an observation space. The code determines how actions will be selected given the prior history of actions and observations. If the only way that our choice of what code to write can affect the world is through the actions that will be selected by the agent running this code, then the best we can do (for a given utility function that we know how to write down) is to make this agent a causal decision theorist. If our choice of what code to use can affect the world in other ways, all bets are off. The best choice of what code to put in the agent depends on details of the world we find ourselves in.

Therefore, if we run a CDT agent, it may well conclude that continuing to operate is not the best way to convert energy into expected utility. It may take actions to cause the following to happen: a) the program which computes its own actions is terminated, and b) some new program is run on the same computer to output actions given the interaction history. The new program that gets run (if indeed such a thing happens) is called Son-of-CDT. Given the state of the world, which entails various ways in which the source code of an agent might affect the outside world besides through the actions that the code outputs, Son-of-CDT is the best program to run for maximizing expected utility. The original CDT agent chooses the program that meets this specification. In general, this will not have anything remotely like a nice, simple closed form. If there are agents out there with vendettas against certain agent-programs, it will take that into account.
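As a toy illustration of that specification (not a real construction of Son-of-CDT), the sketch below picks a successor program by maximizing expected utility when the world responds both to the program's actions and to which program is running. The candidate programs, the payoff numbers, and names such as `source_code_penalty` are all hypothetical and chosen only to make the point concrete.

```python
from typing import Callable, Dict

Policy = Callable[[str], str]  # maps an observation to an action


def always_cooperate(observation: str) -> str:
    return "cooperate"


def always_defect(observation: str) -> str:
    return "defect"


# Hypothetical candidate successor programs.
CANDIDATES: Dict[str, Policy] = {
    "cooperator": always_cooperate,
    "defector": always_defect,
}


def action_utility(policy: Policy) -> float:
    """Utility obtained purely through the actions the program outputs (made-up numbers)."""
    return {"cooperate": 3.0, "defect": 5.0}[policy("start")]


def source_code_penalty(name: str) -> float:
    """Assumed side channel: something in the world inspects which program is running
    and punishes one of them, regardless of the actions it outputs."""
    return {"cooperator": 0.0, "defector": 4.0}[name]


def expected_utility(name: str) -> float:
    return action_utility(CANDIDATES[name]) - source_code_penalty(name)


def choose_successor() -> str:
    """The parent agent evaluates each candidate by its total consequences and runs the best."""
    return max(CANDIDATES, key=expected_utility)


best_by_actions_only = max(CANDIDATES, key=lambda n: action_utility(CANDIDATES[n]))
print(best_by_actions_only, choose_successor())
```

In this toy world the program that looks best when only actions matter is not the one that gets chosen once the side channel is taken into account, which is the sense in which the successor "takes vendettas into account".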

# Vendettas against Son-of-CDT?

CDT agents can be bullied. I believe the MIRI view is that Son-of-CDT will be bullied as well. Suppose there is an ultimatum game, where agent A offers at most $10 to agent B, and if agent B accepts, then agent A gets $10 minus the amount they offered. Otherwise, both get nothing. A competent agent in the position of agent B, able to make a credible commitment (perhaps by revealing its source code), would commit to accept nothing less than $9.99 if agent A is a CDT agent. This would work out for the competent agent, because the CDT agent would see all this and realize it could be one penny richer if it offers $9.99.
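The arithmetic of this ultimatum game can be sketched in a few lines. This is a minimal model under stated assumptions: amounts are in whole cents, the responder's commitment is a simple acceptance threshold, and the proposer treats that threshold as a fixed fact about the world. The function names are illustrative only.

```python
TOTAL_CENTS = 1000  # the $10 being divided


def responder_accepts(offer_cents: int, threshold_cents: int) -> bool:
    """Agent B has credibly committed to reject anything below its threshold."""
    return offer_cents >= threshold_cents


def proposer_payoff(offer_cents: int, threshold_cents: int) -> int:
    """Agent A keeps the remainder if B accepts; otherwise both get nothing."""
    if responder_accepts(offer_cents, threshold_cents):
        return TOTAL_CENTS - offer_cents
    return 0


def cdt_best_offer(threshold_cents: int) -> int:
    """A CDT proposer takes B's commitment as given and picks the payoff-maximizing offer."""
    return max(
        range(TOTAL_CENTS + 1),
        key=lambda offer: proposer_payoff(offer, threshold_cents),
    )


offer = cdt_best_offer(999)
print(offer, proposer_payoff(offer, 999))
```

Against a $9.99 threshold, every lower offer yields nothing and offering the full $10 also yields nothing, so the only offer with positive payoff for the proposer is $9.99, which leaves it one penny.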

Eliezer claims that a "[competent] agent [chooses] to reject offers short of $9.99 from [the CDT agent's] offspring." (Original: "the LDT agent's choice to reject offers short of $9.99 from its offspring".)

In my sketch above of the creation of Son-of-CDT, I include a detail that it would be housed in the same computer that ran the original agent, but this needn't be the case. It could be run anywhere in the world. The CDT agent could take any sort of actions that would cause Son-of-CDT to come into existence some time in the future somewhere in the world. There is no clear way to distinguish the "offspring" of an agent, given that an agent's actions can cause other agents to come into existence in arbitrary ways. For a competent agent to reject offers short of $9.99 from the "offspring" of a CDT agent, it would have to reject offers short of $9.99 from all agents that came into being after the existence of a single CDT agent. It would have to bully everyone.

After a CDT agent with a certain utility function comes into being, if there exists an accessible future in which a competent agent optimizes that utility function (where "accessible" is with respect to the action space of the CDT agent), then the CDT agent will access that future by taking the appropriate actions, and that competent agent will come into being. If it is true that competent agents bully Son-of-CDT, then it must be true that competent agents bully all agents whose probability of birth could have been affected by any pre-existing CDT agent.

Perhaps a competent agent chooses to reject offers short of $9.99 from any agents that come into existence after a CDT agent exists if they have a similar utility function to that CDT agent. If so, then we're cooked. CDT humans have existed, so this would imply that we can never create an agent with a human-like utility function that is not bullied by competent agents.

Perhaps a competent agent chooses to reject offers short of $9.99 from any agents that it deems, using some messy heuristic, to have been made "on purpose" as a result of some of the actions of a CDT agent, and also from any agents that were made "on purpose" by any of those agents, and so on. (The recursion is necessary for the CDT agent to lack the incentive to make descendants which don't get bullied; that property underlay the claim that competent agents bully Son-of-CDT.) If this is indeed what competent agents do to purposeful descendants of causal decision theorists, then if any researchers or engineers contributing to AGI are causal decision theorists, or if they once were but changed their decision theory purposefully, or if they have any ancestors who were causal decision theorists (and no births along the way from that ancestor were accidents), then no matter what code is run in that AGI, the AGI would get bullied. This is according to the claim that Son-of-CDT gets bullied under a third possible definition of "offspring". I believe that some of the people whose research will end up being relevant to AGI are causal decision theorists, or once were, or had parents who were, etc. So we're cooked in that case too.
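The recursive "made on purpose" definition of offspring amounts to a transitive closure over a creation relation. The sketch below is only an illustration of that recursion: the `made_on_purpose` mapping stands in for the "messy heuristic", and the agent names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def purposeful_descendants(
    made_on_purpose: Dict[str, List[str]],
    cdt_agents: Iterable[str],
) -> Set[str]:
    """Return every agent reachable from a CDT agent through 'made on purpose' links."""
    bullied: Set[str] = set()
    frontier = list(cdt_agents)
    while frontier:
        creator = frontier.pop()
        for child in made_on_purpose.get(creator, ()):
            if child not in bullied:
                bullied.add(child)
                frontier.append(child)
    return bullied


# Hypothetical creation graph: creator -> agents it made on purpose.
made_on_purpose = {
    "cdt_ancestor": ["engineer"],
    "engineer": ["agi"],
    "bystander": ["other_ai"],
}

descendants = purposeful_descendants(made_on_purpose, ["cdt_ancestor"])
print(sorted(descendants))
```

Under this definition the AGI is bullied no matter what code it runs, because membership depends only on its lineage and not on its own decision procedure.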

But more realistically (and optimistically), I am very skeptical of the claim that competent agents bully everyone in practice.

# Fair Tests

Incidentally, the proposed treatment of Son-of-CDT falls under MIRI's category of an "unfair problem". A decision problem is "fair" if "the outcome depends only on the agent’s behavior in the dilemma at hand" (FDT, Section 9). Disregarding unfair problems is a precondition for progress in decision theory (in the MIRI view of what progress in decision theory entails) since it allows one to ignore objections like "Well, what if there is an agent out there who hates FDT agents? Then you wouldn't want your daughter to be an FDT agent, would you?" I'm skeptical of the relevance of research that treats unfair problems as non-existent, so in my view, this section is ancillary, but maybe some people will find it convincing. In any case, any bullying done to Son-of-CDT by virtue of the existence of a certain kind of agent that took actions which affected its birth certainly qualifies as "unfair".

# Implications

We want to create an agent with some source code such that our utility becomes optimized. Given that choices about the source code of an agent have consequences other than how that code outputs actions, this might not be a causal decision theorist. However, by definition, Son-of-CDT is the agent that meets this description: Son-of-CDT is the agent with the source code such that running that source code is the best way to convert [energy + hardware + actuators + sensors] into utility. How do we make Son-of-CDT? We just run a causal decision theorist, and let it make Son-of-CDT.