# Ω 14

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The purpose of this post is to discuss the relationship between the concepts of Updatelessness and the "Son of" operator.

Making a decision theory that is reflectively stable is hard. Most agents would self-modify into a different agent if given the chance. For example, if a CDT agent knows that it is going to be put in a Newcomb's problem, it would precommit to one-box, "causing" Omega to predict that it one-boxes. We say "son of CDT" to refer to the agent that a CDT agent would self-modify into, and more generally "son of X" to refer to the agent that agent X would self-modify into.
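As a rough illustration of the Newcomb example above, here is a minimal Python sketch. The dollar amounts, the assumption that Omega predicts perfectly, and all function names are illustrative choices, not part of the original argument.

```python
# Illustrative payoffs (assumed; the post does not specify amounts).
BIG, SMALL = 1_000_000, 1_000
OPTIONS = ["one-box", "two-box"]


def payoff(action: str, prediction: str) -> int:
    """Money received given the agent's action and Omega's prediction."""
    opaque = BIG if prediction == "one-box" else 0
    return opaque + (SMALL if action == "two-box" else 0)


def cdt_choice_at_box_time(prediction: str) -> str:
    """CDT holds the prediction fixed and picks the causally best action."""
    return max(OPTIONS, key=lambda action: payoff(action, prediction))


def cdt_choice_of_successor() -> str:
    """Before the prediction is made, CDT picks a committed policy to become.

    Assumes a perfect predictor, so the prediction equals the committed policy.
    """
    return max(OPTIONS, key=lambda policy: payoff(policy, policy))


# Unmodified CDT two-boxes whatever was predicted, so Omega predicts two-boxing.
assert all(cdt_choice_at_box_time(p) == "two-box" for p in OPTIONS)
plain_cdt_payoff = payoff("two-box", "two-box")

# Given the chance to self-modify beforehand, CDT commits to one-boxing.
son_policy = cdt_choice_of_successor()
son_of_cdt_payoff = payoff(son_policy, son_policy)

print(son_policy, plain_cdt_payoff, son_of_cdt_payoff)
```

The only difference between the two functions is when the choice is made relative to the prediction, which is what makes the committed successor a different agent from plain CDT.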

Thinking about these "son of" agents is unsatisfying for a couple of reasons. First, it is very opaque. There is an extra level of indirection, where you can't just directly reason about what agent X will do. Instead, you have to reason about what agent X will modify into, which gives you a new agent, which you probably understand much less than you understand agent X, and then you have to reason about what that new agent will do. Second, it is unmotivated. If you had a good reason to like Son of X, you would probably not be calling it Son of X. Important concepts get short names, and you probably don't have as many philosophical reasons to like Son of X as you have to like X.

Wei Dai's Updateless Decision Theory is perhaps our current best decision theory proposal. A UDT agent chooses a function from its possible observations to its actions, without taking into account its observations, and then applies that function.
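To make "choosing a function from observations to actions" concrete, here is a small Python sketch. It uses counterfactual mugging as the test case, which is a standard example from the wider literature and not one this post discusses; the payoffs, the prior, and every name below are assumptions made for illustration.

```python
from itertools import product

OBSERVATIONS = ["heads", "tails"]
ACTIONS = ["pay", "refuse"]
PRIOR = {"heads": 0.5, "tails": 0.5}  # assumed fair coin


def utility(world: str, policy: dict) -> int:
    """Assumed payoffs: on heads the agent is asked to pay 100; on tails it is
    rewarded with 10,000 only if its policy would have paid on heads."""
    pays_when_asked = policy["heads"] == "pay"
    if world == "heads":
        return -100 if pays_when_asked else 0
    return 10_000 if pays_when_asked else 0


def all_policies():
    """Enumerate every function from observations to actions."""
    for actions in product(ACTIONS, repeat=len(OBSERVATIONS)):
        yield dict(zip(OBSERVATIONS, actions))


def expected_utility(policy: dict) -> float:
    """Score a whole policy against the prior, ignoring any observation."""
    return sum(PRIOR[world] * utility(world, policy) for world in PRIOR)


def udt_policy() -> dict:
    """Pick the best policy first; the agent applies it to its observation later."""
    return max(all_policies(), key=expected_utility)


def updated_choice(observation: str) -> str:
    """Contrast case: condition on the observation, then pick an action."""
    def value(action: str) -> int:
        # After updating, only the observed world counts.
        return utility(observation, {obs: action for obs in OBSERVATIONS})
    return max(ACTIONS, key=value)


policy = udt_policy()
print(policy["heads"], updated_choice("heads"))
```

In this toy setup the policy chosen in advance pays when asked, while the agent that updates first refuses, which is the gap that updatelessness is meant to close.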

The main problem with this proposal is in formalizing it in a logical uncertainty framework. Some of the observations that an agent makes are going to be logical observations; for example, an agent may observe the millionth digit of π. Then it is not clear how an agent can avoid taking the digit of π into account in its calculation of the best policy. If we do not tell it the digit through the standard channel, it might still compute the digit while computing the best policy. As I said here, it is important to note that logical updatelessness is about computations and complexity, not about what logical system you are in.

So what would true logical updatelessness look like? Well, the agent would have to not update on computations. Since it is a computation itself, we cannot keep it independent from all computations, but we can restrict it to some small class of computations. The way we do this is by giving the updateless part of the decision theory finite computational bounds. Computational facts not computable within those bounds are still observed, but we do not take them into account when choosing a policy. Instead, we use our limited computation to choose a policy in the form of a function from how the more difficult computations turn out to actions.

The standard way to express a policy is a bunch of input/output pairs. However, since the inputs here are results of computations, this can equivalently be expressed by a single computation that gives an output. (To see the equivalence, note that we can write down a single computation which computes all the inputs and produces the corresponding output. Conversely, given a single computation, we can just apply the identity function to the output of that computation.) Thus, logical updatelessness consists of a severely resource-bounded agent choosing what policy (in the form of a computation) it wants to run given more resources.
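The equivalence in the parenthetical can be sketched in a few lines of Python. The "hard" computation here is only a cheap stand-in, and the function names and action labels are hypothetical.

```python
from typing import Callable, Dict, Tuple


def hard_computation() -> int:
    """Stand-in for a logical fact the bounded agent cannot afford to compute.

    Here: the last decimal digit of 7 ** (10 ** 6).
    """
    return pow(7, 10**6, 10)


def table_to_computation(table: Dict[int, str]) -> Callable[[], str]:
    """Turn input/output pairs into one computation that yields the output."""
    def run() -> str:
        # The single program computes the input itself, then looks up the output.
        return table[hard_computation()]
    return run


def computation_to_policy(
    program: Callable[[], str]
) -> Tuple[Callable[[], str], Callable[[str], str]]:
    """Conversely: treat the program's result as the only input and use the
    identity function as the policy on that input."""
    return program, (lambda result: result)


# A policy chosen without knowing how the hard computation turns out.
table = {digit: ("odd-action" if digit % 2 else "even-action") for digit in range(10)}

run = table_to_computation(table)
compute_input, identity_policy = computation_to_policy(run)
print(run(), identity_policy(compute_input()))
```

The bounded agent only has to write down `table` (or equivalently `run`); evaluating `hard_computation` is left to the later, better-resourced version of itself.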

Under this model, it seems that whenever you have an agent collecting more computational resources over time, with the ability to rewrite itself, you get an updateless agent. The limited agent is choosing using its bounded resources what algorithm it wants to run to choose its output when it has collected more computational resources. The future version of the agent with more resources is the updateless version of the original agent, in that it is following the policy specified by the original agent before updating on all the computational facts. However, this is also exactly what we mean when we say that the later agent is the son of the bounded agent.

There is still a free parameter in Logical Updatelessness, which is what decision procedure the limited version uses to select its policy. This is also underspecified in standard UDT, but I believe it is often taken to be EDT. Thus, we have logically updateless versions of many decision policies, which I claim is actually pointing at the same thing as Son of those various policies (in an environment where computational resources are collected over time).


Wei Dai

This does seem to be the "obvious" next step in the UDT approach. I proposed something similar as "UDT2" in a 2011 post to the "decision theory workshop" mailing list, and others have made similar proposals.

But there is a problem with having to choose how much time/computing resources to give to the initial decision process. If you give it too little then its logical probabilities might be very noisy and you could end up with a terrible decision, but if you give it too much then it could update on too many logical facts and lose on acausal bargaining problems. With multiple AI builders, UDT2 seems to imply a costly arms-race situation where each has an incentive to give their initial decision process less time (than would otherwise be optimal) so that their AI could commit faster (and hopefully be logically updated upon by other AIs) and also avoid updating on other AIs' commitments.

I'd like to avoid this but don't know how. I'm also sympathetic to Nesov's (and others such as Gary Drescher's) sentiment that maybe there is a better approach to the problems that UDT is trying to solve, but I don't know what that is either.

So my plan is to "solve" the problem of choosing how much time to give it by having a parameter (which stage of a logical inductor to use), and trying to get results saying that if we set the parameter sufficiently high, and we only consider the output on sufficiently far out problems, then we can prove that it does well.

It does not solve the problem, but it might let us analyze what we would get if we did solve the problem.

Prior to working more on decision theory or thin priors, it seems worth clearly fleshing out the desiderata that a naive account (say a task/act-based AI that uses CDT) fails to satisfy.

You say: "[Son of X] is very opaque. There is an extra level of indirection, where you can't just directly reason about what agent X will do. Instead have to reason about what agent X will modify into, which gives you a new agent, which you probably understand much less than you understand agent X, and then you have to reason about what that new agent will do. Second, it is unmotivated. If you had a good reason to like Son of X, you would probably not be calling it Son of X"

But if we trust X's decisions, then shouldn't we trust its decision to replace itself with Son of X? And if we don't trust X's decisions, aren't we in trouble anyway? Why do we need to reason about Son of X, any more than we need to reason about the other ways in which the agent will self-modify, or even other decisions the agent will make?

I agree this makes thinking about Son of X unsatisfying; we should just be thinking about X. But I don't see why this makes building X problematic. I'm not sure if you are claiming it does, but other people seem to believe something like that, and I thought I'd respond here.

I agree that there is a certain perspective on which this situation is unsatisfying. And using a suboptimal decision theory / too thick a logical prior in the short term will certainly involve some cost (this is a special case of my ongoing debate with Wei Dai about the urgency of philosophical problems). But it seems to me that it is way less pressing than other possible problems, e.g. how to do aligned search or aligned induction in the regime with limited unlabelled data.

These other problems: (a) seem like they kill us by default in a much stronger sense, (b) seem necessary on both MIRI agendas as well as my agenda, and I suspect are generally going to be necessary, (c) seem pretty approachable, (d) don't really seem to be made easier by progress on decision theory, logical induction, the construction of a thin prior, etc.

(Actually I can see how a good thin prior would help with the induction problem, but it leads to a quite different perspective and a different set of desiderata.)

On the flip side, I think there is a reasonably good chance that problems like decision theory will be obsoleted by a better understanding of how to build task/act-based AI (and I feel like for the most part people haven't convincingly engaged with those arguments).

UDT, in its global policy form, is trying to solve two problems: (1) coordination between the instances of an agent faced with alternative environments; and (2) not losing interest in counterfactuals as soon as observations contradict them. I think that in practice, UDT is a wrong approach to problem (1), and the way in which it solves problem (2) obscures the nature of that problem.

Coordination, achieved with UDT, is like using identical agents to get cooperation in PD. Already in simple use cases we have different amounts of computational resources for instances of the UDT agent that could make the decision processes different, hence workarounds with keeping track of how much computation to give the decision processes, so that coordination doesn't break, or hierarchies of decision processes that can access more and more resources. Even worse, the instances could be working on different problems and don't need to coordinate at the level of computational resources needed to work on these problems. But we know that cooperation is possible in much greater generality, even between unrelated agents, and I think this is the right way of handling the differences between the instances of an agent.

It's useful to restate the problem of not ignoring counterfactuals as a problem of preserving values. It's not quite reflective stability, as it's stability under external observations rather than reflection, but when an agent plans for future observations it can change itself to preserve its values when the observations happen (hence "Son of CDT" that one-boxes). One issue is that the resulting values are still not right: they ignore counterfactuals that are not in the future of where the self-modification took place, and it's even less clear how self-modification addresses computational uncertainty. So the problem is not just preserving values, but formulating them in the first place so that they can already talk about counterfactuals and computational resources.

I think that in the first approximation, the thing in common between instances of an agent (within a world, between alternative worlds, and at different times) should be a fixed definition of values, while the decision algorithms should be allowed to be different and to coordinate with each other as unrelated agents would. This requires an explanation of what kind of thing values are, their semantics, so that the same values (1) can be interpreted in unrelated situations to guide decisions, including worlds that don't have our physical laws, and by agents that don't know the physical laws of the situations they inhabit, but (2) retain valuation of all the other situations, which should in particular motivate acausal coordination as an instrumental drive. Each one of these points is relatively straightforward to address, but not together. I'm utterly confused about this problem, and I think it deserves more attention.

Wei Dai

> But we know that cooperation is possible in much greater generality, even between unrelated agents

It seems to me like cooperation might be possible in much greater generality. I don't see how we know that it is possible. Please explain?

> Each one of these points is relatively straightforward to address, but not together.

I'm having trouble following you here. Can you explain more about each point, and how they can be addressed separately?

This is more or less what I was talking about here (see last paragraph). This should also give us superrationality, provided that instead of allowing an arbitrary "future version", we constrain the future version to be a limited agent with access to a powerful "oracle" for queries of the form for all possible policies (which might involve constructing another, even more powerful, agent). If we don't impose this constraint, we run into the problem of "self-stabilizing mutually detrimental blackmail" in multi-agent scenarios.

Wei Dai

I may be misunderstanding what you're proposing, but assuming that each decision process has the option to output "I've thought enough, no need for another version of me, it's time to take action X" and have X be "construct this other agent and transfer my resources to it", the constraint on future versions doesn't seem to actually do much.

Well, the time to take a decision is limited. I guess that for this to work in full generality we would need the total computing time of the future agents over a time discount horizon to be insufficient to simulate the "oracle" of even the first agent, which might be too harsh a restriction. Perhaps restricting space will help, since space aggregates as max rather than as sum.

I don't have a detailed understanding of this, but IMO any decision theory that yields robust superrationality (i.e. not only for symmetric games and perfectly identical agents) needs to have some aspect that behaves like this.