Alright, it's been a long hiatus. I may not post again for another year or two if no inspiration strikes me, so I will summarize my work so far.
Optimization is written as $\mathrm{Opt}(P, F, S)$ for a "past" node $P$, a "future" node $F$, and a section $S$ of the causal network, evaluated at a particular world history $w$. It measures the (negative) log of the ratio of two derivatives of $F(w)$ (i.e. the value of $F$ in world $w$) with respect to $P$. The first is taken in the "normal" world, where $S$ varies; the second in an imaginary world where $S$ is "frozen" at its value in $w$, unable to respond to infinitesimal changes in $P$.
We can overall write the following:

$$\mathrm{Opt}(P, F, S) = -\log\left(\frac{\left.\dfrac{\partial F(w)}{\partial P}\right|_{S\ \text{free}}}{\left.\dfrac{\partial F(w)}{\partial P}\right|_{S\ \text{frozen at}\ S(w)}}\right)$$
If $S$ is optimizing $F$ with respect to $P$, we would expect that some (or all) of the changes in $F$ which are caused by changes in $P$ will be removed; so the derivative when $S$ is allowed to vary will be smaller than the derivative when $S$ is fixed. This means $\mathrm{Opt}$ will be positive.
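To make this concrete, here is a minimal numerical sketch, assuming a toy linear thermostat model: the outside temperature plays the role of the past node $P$, the heater output is the section $S$, and the room temperature is the future node $F$. The names (`thermostat`, `future`, `optimization`) and the specific dynamics are illustrative choices of mine, not part of the formalism; the two finite-difference derivatives simply mirror the "free" and "frozen" worlds described above.

```python
import numpy as np

SETPOINT = 20.0  # the thermostat's target room temperature (degrees C)

def thermostat(past):
    """The section S: a proportional controller that observes the outside
    temperature (the past node P) and outputs a heating/cooling correction."""
    return 0.8 * (SETPOINT - past)

def future(past, section):
    """The future node F: room temperature, driven by the outside
    temperature and the heater output."""
    return past + section

def optimization(past_value, section_fn, eps=1e-6):
    """Negative log of the ratio of dF/dP with S free vs. S frozen at its
    value in the unperturbed world w (both estimated by finite differences)."""
    # "Normal" world: S gets to respond to the perturbation of P.
    free = (future(past_value + eps, section_fn(past_value + eps))
            - future(past_value - eps, section_fn(past_value - eps))) / (2 * eps)
    # Imaginary world: S is frozen at the value it took in w.
    frozen_output = section_fn(past_value)
    frozen = (future(past_value + eps, frozen_output)
              - future(past_value - eps, frozen_output)) / (2 * eps)
    return -np.log(free / frozen)

print(optimization(5.0, thermostat))  # ~1.61, positive: S cancels most of P's effect on F
```

With a gain of 0.8 the free derivative is 0.2 while the frozen one is 1, so the measure comes out to roughly $\log 5 \approx 1.6$.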
I have a few observations about what this might mean and why it might be important.
This is a simple calculation: if $S$ doesn't depend on $P$ in any way, then the two derivative terms will be equal, because $S$ won't vary in either of them. The ability of $S$ to optimize $F$ with respect to $P$ is related to its ability to gather information about $P$.
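As a quick check of that claim, the same sketch can be rerun with a "blind" controller that never looks at $P$ (this reuses the illustrative `optimization` helper from above): the two derivatives coincide and the measure is zero.

```python
# Reusing the illustrative optimization() helper above: a "blind" section
# whose output ignores the outside temperature entirely.
blind = lambda past: 12.0          # fixed heater output, whatever P is
print(optimization(5.0, blind))    # ~0: freezing S changes nothing, so Opt vanishes
```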
For simple systems like the thermostat, it seems like $\mathrm{Opt}$ has high values when the thermostat "gets what it wants". It kind of looks like $\mathrm{Opt}$ across one axis is the second derivative of our utility function, at least within regions of world-space where $S$ has roughly equal power and knowledge.
This seems pretty intuitively obvious. The more "powerful" the thermostat was in terms of having a stronger heating and cooling unit, the more it was able to optimize the world.
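Again reusing the illustrative helper from the earlier sketch, sweeping the thermostat's gain (my stand-in for the strength of its heating and cooling unit) shows the measure rising as the unit gets stronger:

```python
# Reusing optimization() from the sketch above: a stronger heating/cooling
# unit (higher gain) cancels more of P's influence on F, so Opt rises.
for gain in (0.2, 0.5, 0.8, 0.95):
    opt = optimization(5.0, lambda past, g=gain: g * (SETPOINT - past))
    print(f"gain={gain}: Opt={opt:.2f}")  # 0.22, 0.69, 1.61, 3.00
```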
We already have mathematical proofs that the "knowledge" and "values" of an agent-like thing cannot be disentangled exactly. So if we want a mathematically well-defined measure of agent behaviour, we must take them both at once.
Secondly, the sorts of world histories that $\mathrm{Opt}$ works on are deliberately chosen to be both very general and free of any notion of absolute time and space, in the style of much of John Wentworth's work. A specific case of these networks is the activations of a neural network, so these tools could in theory be applied directly to AI interpretability work.
Worlds where $\mathrm{Opt}$ is large tend to be "good" for the optimizing region in question. They seem to correspond to local minima (or at least local Pareto frontiers) of a utility function. They also correspond to worlds where $S$ is both knowledgeable and powerful. They correspond to worlds where $S$ is "in control". Here are a few potential thoughts on making safer AI designs using this concept:
"World history" here means a given set of numeric values which describe the state of a causal network. For example, if we have the network [Temperature in Celsius] → [State of Water], then the following are examples of world histories: [-10] → [0], [45] → [1], and [120] → [2], where we've represented the state of water as a number: {0: Solid, 1: Liquid, 2: Gas}.
So if we consider our previous world history examples, the value of our optimization metric at a temperature of 310 Kelvin doesn't depend on the behaviour of the system at 315 Kelvin.
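For what it's worth, here is one minimal way such world histories could be written down in code; the field names are my own illustrative choices rather than notation from the post:

```python
# One possible encoding: each world history assigns a number to every node of
# the network [Temperature in Celsius] -> [State of Water].
STATE_NAMES = {0: "Solid", 1: "Liquid", 2: "Gas"}

world_histories = [
    {"temperature_celsius": -10, "state_of_water": 0},
    {"temperature_celsius": 45,  "state_of_water": 1},
    {"temperature_celsius": 120, "state_of_water": 2},
]

for w in world_histories:
    print(f'{w["temperature_celsius"]} C -> {STATE_NAMES[w["state_of_water"]]}')
```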