Consider the following world: a set $P$ of past nodes, a set $E$ of environment nodes, a set $A$ of actor/agent nodes, and a set $F$ of future nodes. Consider our previously defined function $\mathrm{Op}(A; p, f)$ for $p \in P$, $f \in F$, which represents how much $A$ is optimizing the value of $f$ with respect to $p$. For some choices of $A$ and $f$, we might find that the values $\mathrm{Op}(A; p, f)$ allow us to split the members of $P$ into three categories: those for which $\mathrm{Op}(A; p, f) \gg 0$, those with $\mathrm{Op}(A; p, f) \approx 0$, and those with $\mathrm{Op}(A; p, f) \ll 0$.

Picking P Apart

We can define $V_f = \{p \in P : \mathrm{Op}(A; p, f) = \infty\}$. In other words, $V_f$ is the subset of $P$ which has no local influence on the future node $f$ whatsoever. If you want to be cute you can think of $V_f$ as the victim of $A$ with respect to $f$, because it is causally erased from existence.

Remember that $\mathrm{Op}(A; p, f)$ is high when "freezing" the information going out of $A$ means that the value of $f$ depends much more on the value of $p$. What does it mean when "freezing" the information of $A$ does the opposite?

Now let's define $U_f = \{p \in P : \mathrm{Op}(A; p, f) \ll 0\}$. This means that when we freeze $A$, these nodes have much less influence on the future. Their influence flows through $A$. We can call $U_f$ the utility-function-like region of $P$ with respect to $f$. We might also think of the actions of $A$ as amplifying the influence of $U_f$ on $f$.

For completeness we'll define $X_f = P \setminus (U_f \cup V_f)$.
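To make the three-way split concrete, here is a minimal sketch. It assumes we already have the optimization score of each past node with respect to one fixed future node, stored in a plain dict. The function name `partition_past` and the cutoffs `hi` and `lo` are hypothetical choices for illustration, not part of the definitions above.

```python
import math


def partition_past(op_values, hi=math.inf, lo=0.0):
    """Split past nodes into victim-like, utility-like and leftover sets.

    op_values: dict mapping a past-node name to its optimization score for
               one fixed future node (hypothetical input, computed elsewhere).
    hi:        scores >= hi count as victim-like (assumed cutoff).
    lo:        scores <  lo count as utility-like (assumed cutoff, lo < hi).
    """
    victims = {p for p, v in op_values.items() if v >= hi}
    utility = {p for p, v in op_values.items() if v < lo}
    leftover = set(op_values) - victims - utility
    return victims, utility, leftover


# Toy scores for three past nodes (made-up numbers).
scores = {"p1": math.inf, "p2": -2.5, "p3": 0.3}
V_f, U_f, X_f = partition_past(scores)
```

A stricter notion of "much less than zero" just means passing a more negative `lo`; the three sets stay disjoint as long as `lo < hi`.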

Let's now define $U = \bigcup_{f \in F} U_f$, i.e. the set of points which are utility-like for any future point. $V = \bigcup_{f \in F} V_f$ will be the set of points which are victim-like for any future points. We might want to define $V \setminus U$ as the set of "total" victims and $U \setminus V$ similarly. $X = P \setminus (U \cup V)$, meaning $X$ only contains nodes which really don't interact with $A$ at all in an optimizing/amplifying capacity.
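The aggregation over future points can be sketched with ordinary set operations. This assumes the per-future-node classification has already been done and is handed in as a dict of (victim-like, utility-like, leftover) triples; the function name `aggregate` and the dictionary keys are hypothetical.

```python
def aggregate(per_future):
    """Combine per-future-node splits into global sets.

    per_future: dict mapping a future-node name to a triple of sets
                (victim-like, utility-like, leftover) of past nodes.
    """
    triples = list(per_future.values())
    V = set().union(*(v for v, _, _ in triples))
    U = set().union(*(u for _, u, _ in triples))
    leftovers = [x for _, _, x in triples]
    # A node is "uninvolved" only if it is leftover for every future node.
    X = set.intersection(*leftovers) if leftovers else set()
    return {
        "U": U,
        "V": V,
        "total_victims": V - U,  # victim-like somewhere, never utility-like
        "total_utility": U - V,  # utility-like somewhere, never victim-like
        "X": X,
    }


# Toy example with two future nodes and three past nodes.
per_future = {
    "f1": ({"p1"}, {"p2"}, {"p3"}),
    "f2": ({"p2"}, set(), {"p1", "p3"}),
}
sets = aggregate(per_future)
```

In the toy example `p2` is utility-like for `f1` but victim-like for `f2`, so it lands in both unions and in neither of the "total" sets.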

We can also think of $U$ and $V$ as the sets of past nodes which have an outsizedly large and small influence on the future as a result of the actions of $A$, respectively.

Describing D

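$D = \{f \in F : V_f \neq \emptyset\}$, in other words $D$ is the region of $F$ in which $A$ is removing the influence of $V$. We can even define a smaller set $D_{\text{total}} = \{f \in F : V \setminus U \subseteq V_f\}$, in other words the region in which $A$ is totally removing the (local) influence of the total victim nodes $V \setminus U$. We can call $D$ the domain of $A$.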
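As a rough sketch of how the domain could be computed, suppose we have, for every future node, the set of past nodes whose influence on it is removed. The reading used here is an assumption: a future node is in the domain if that set is non-empty, and in the smaller set if it contains every "total" victim. The name `domain` and its arguments are hypothetical.

```python
def domain(victims_by_future, total_victims):
    """Return (D, D_total) under the assumed reading described above.

    victims_by_future: dict mapping a future-node name to the set of past
                       nodes whose influence on it is removed.
    total_victims:     set of past nodes that are victim-like somewhere and
                       utility-like nowhere.
    """
    D = {f for f, victims in victims_by_future.items() if victims}
    D_total = {
        f
        for f, victims in victims_by_future.items()
        # Require at least one total victim so that D_total stays inside D.
        if total_victims and total_victims <= victims
    }
    return D, D_total


# Toy example: f3 has no victims at all, f2 misses the total victim p1.
victims_by_future = {"f1": {"p1"}, "f2": {"p2"}, "f3": set()}
D, D_total = domain(victims_by_future, total_victims={"p1"})
```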

Approaching A

One issue which we haven't dealt with is how to actually pick the set $A$! We sort of arbitrarily declared it to be relevant at the start with no indication as to how we would do it.

There are lots of ways to pick out sets of nodes in a causal network. One involves thinking about minimum information-flow in a very John Wentworth-ish way. Another might be to just start with a very large choice of $A$ and iteratively try moving nodes from $A$ to $E$. If this shrinks $D$ by a lot then this node might be important for $A$, otherwise it might not be. This might not always be true though!
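The iterative idea can be sketched as a greedy loop. Everything specific here is an assumption for illustration: `domain_size` is a hypothetical callable standing in for "recompute the domain for this choice of agent nodes and return its size", and `tolerance` is an arbitrary cutoff for what counts as shrinking "by a lot".

```python
def prune_agent(candidate, domain_size, tolerance=0.1):
    """Greedily move unimportant nodes out of a large candidate agent set.

    candidate:   collection of node names initially assigned to the agent.
    domain_size: callable taking a frozenset of agent nodes and returning a
                 non-negative number (hypothetical; stands in for the size
                 of the agent's domain).
    tolerance:   a node is dropped if removing it shrinks the domain by at
                 most this fraction.
    """
    agent = set(candidate)
    for node in sorted(agent):  # sorted() copies, so mutating agent is safe
        baseline = domain_size(frozenset(agent))
        if baseline == 0:
            break  # nothing left to compare against
        trial = agent - {node}
        if domain_size(frozenset(trial)) >= (1 - tolerance) * baseline:
            agent = trial  # barely any shrinkage: treat node as environment
    return agent


# Toy stand-in: only a1 and a2 contribute to the domain.
def toy_domain_size(nodes):
    return 5 * len(nodes & {"a1", "a2"})


kept = prune_agent({"a1", "a2", "a3", "a4"}, toy_domain_size)
```

Because the loop tests one node at a time, it inherits the caveat above: a node that only matters in combination with others can be dropped by mistake.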

Perhaps sensible choices of $A$ are ones which make $U$ and $U \setminus V$ more similar, and $V$ and $V \setminus U$ more similar, i.e. those for which nodes in $P$ are cleanly split into utility-like and victim-like nodes.

Either way, it seems like we can probably find ways to find agents in a system.


I conjecture that if $A$ is a powerful optimizer, this will be expressed by having a large set $D$. It will also be expressed by having the values of $\mathrm{Op}(A; p, f)$ be large.

I conjecture that $V$, and especially $V \setminus U$, nodes get a pretty rough deal, and that it is bad to be a person living in $V$.

I conjecture that the only nodes which get a good deal out of the actions of $A$ are in $U$, and that for an AI to be aligned, $U$ needs to contain the AI's creator(s).

How does this help us think about optimizers?

The utility-function framework has not always been a great way to think about things. The differential optimization framework might be better, or failing that, different. The phrase "utility function" often implies a function which explicitly maps world states to real numbers and is explicitly represented in code. This framework defines $U$ in a different way: as the set of regions of the past which have an oversized influence on the future via an optimizer $A$.

Thinking about the number of nodes in $V$ and $D$ (and how well the latter are optimized with respect to the former) also provides a link to Eliezer's definition of optimization power found here. The more nodes in $V$, the more information we are at liberty to discard; the more nodes in $D$, the more of the world we can still predict; and the more heavily optimized the nodes of $D$, the smaller our loss of predictive power.

2 comments

Neat. Why is it worth calling U "utility," or even utility-like, though? If I tell you the set of things that I observe that significantly change my behavior, this tells you a lot about me but it doesn't tell you which function of these observations I'm using to make decisions.

E.g. both teams in a soccer game will respond to the position of the ball (the ball is in U - or some relaxed notion of it, since I think your full notion might be too strong), but want to do different things with it.

I think the position of the ball is in V, since the players are responding to the position of the ball by forcing it towards the goal. It's difficult to predict the long-term position of the ball based on where it is now. The position of the opponent's goal would be an example of something in U for both teams. In this case both teams' utility functions contain a robust pointer to the goal's position.