Alright it's been a long hiatus. I may not post again for another year or two if no inspiration strikes me. I will summarize my work so far.

  1. We can define a function over causal networks which describes how hard a given portion of the network is optimizing some node around some particular world history[1].
  2. This function depends on some partial derivatives of the network state, making it a local function over world histories, i.e. it does not depend on world histories which are "far away" in world-space.[2]
  3. Across one dimension, we can use path-integrals to find the "power" of the optimizer, i.e. how much variation it removes from the system.

Math summary

Optimization is written as $\mathrm{Opt}_S(a \to b \mid W)$ for "past" node $a$, "future" node $b$, and a section $S$ of the causal network. It measures the (negative) log of the ratio of two derivatives of $b_W$ (i.e. the value of $b$ in world $W$) with respect to $a$. The first is taken in the "normal" world where $S$ varies, and the second in an imaginary world where $S$ is "frozen" at its value in $W$, unable to respond to infinitesimal changes in $a$.

We can overall write the following:

$$\mathrm{Opt}_S(a \to b \mid W) \;=\; -\log\!\left(\frac{\left.\partial b_W / \partial a\right|_{S\ \mathrm{varies}}}{\left.\partial b_W / \partial a\right|_{S\ \mathrm{frozen}}}\right)$$

If $S$ is optimizing $b$ with respect to $a$, we would expect that some (or all) of the changes in $b$ which are caused by changes in $a$ will be removed; so the derivative when $S$ is allowed to vary will be smaller than the derivative when $S$ is fixed. This means $\mathrm{Opt}_S(a \to b \mid W)$ will be positive.
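As a concrete sanity check, here is a minimal numerical sketch of this calculation for a proportional thermostat. Everything in it is an illustrative assumption on my part (the causal chain, the function names, the gain, and the finite-difference approximation); it is not a general implementation of $\mathrm{Opt}$ over arbitrary causal networks.

```python
import math

T_SET = 20.0  # thermostat set-point (degrees C, illustrative)
GAIN = 0.9    # fraction of the temperature error the controller cancels

def room_before_heating(a):
    # "Past" node a: an outside disturbance which the room passively tracks.
    return a

def controller(observed_temp):
    # The section S: a proportional thermostat reacting to what it observes.
    return GAIN * (T_SET - observed_temp)

def future_temp(a, frozen_control=None):
    # "Future" node b: room temperature after the controller has acted.
    # If frozen_control is given, S is "frozen" and cannot respond to changes in a.
    pre = room_before_heating(a)
    control = controller(pre) if frozen_control is None else frozen_control
    return pre + control

def optimization_measure(a0, eps=1e-6):
    # Derivative of b with respect to a in the "normal" world, where S varies.
    d_free = (future_temp(a0 + eps) - future_temp(a0 - eps)) / (2 * eps)
    # Derivative in the imaginary world where S is frozen at its value in W.
    frozen = controller(room_before_heating(a0))
    d_frozen = (future_temp(a0 + eps, frozen) - future_temp(a0 - eps, frozen)) / (2 * eps)
    # Negative log of the ratio of the two derivatives.
    return -math.log(abs(d_free) / abs(d_frozen))

print(optimization_measure(a0=15.0))  # ≈ 2.30, i.e. -log(0.1): positive, so S optimizes b
```

With a gain of 0.9 the controller cancels 90% of any infinitesimal disturbance, so the free-world derivative is 0.1 while the frozen-world derivative is 1, giving $\mathrm{Opt} \approx 2.3 > 0$.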

I have a few observations about what this might mean and why it might be important.

Optimization relates to Information

This is a simple calculation: if $S$ doesn't depend on $a$ in any way, then the two derivative terms will be equal, because $S$ won't vary in either of them. The ability of $S$ to optimize $b$ with respect to $a$ is related to its ability to gather information about $a$.
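A toy version of this check, under the same sort of assumed thermostat setup as above (the names and numbers are again mine, purely for illustration): a controller that ignores its observation gives equal derivatives in the free and frozen worlds, so the measure is exactly zero.

```python
import math

def future_temp(a, frozen_control=None):
    # A "blind" controller: it always adds 2.0 regardless of what it observes,
    # so the section S carries no information about the past node a.
    control = 2.0 if frozen_control is None else frozen_control
    return a + control

eps = 1e-6
d_free = (future_temp(1.0 + eps) - future_temp(1.0 - eps)) / (2 * eps)
d_frozen = (future_temp(1.0 + eps, 2.0) - future_temp(1.0 - eps, 2.0)) / (2 * eps)
opt = -math.log(abs(d_free) / abs(d_frozen))
print(opt)  # -0.0, i.e. zero: equal derivatives, no optimization of b w.r.t. a
```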

Optimization relates to a Utility-Like Function

For simple systems like the thermostat, it seems like $\mathrm{Opt}$ has high values when the thermostat "gets what it wants". It kind of looks like $\mathrm{Opt}$ across one axis is the second derivative of our utility function, at least within regions of world-space where $S$ has roughly equal power and knowledge.

Optimization relates to Power

This seems pretty intuitively obvious. The more "powerful" the thermostat was in terms of having a stronger heating and cooling unit, the more it was able to optimize the world.

Why is this important?

We already have mathematical proofs that the "knowledge" and "values" of an agent-like thing cannot be disentangled exactly. So if we want a mathematically well-defined measure of agent behaviour, we must take them both at once. 

Secondly, the sorts of histories that $\mathrm{Opt}$ works on are deliberately chosen to be very general, requiring no notion of absolute time and space, in the style of much of John Wentworth's work. A specific case of these networks is the activations of a neural network, so these tools could in theory be applied directly to AI interpretability work.

Worlds of High Optimization Are "Good" Worlds for The Optimizer

Worlds where $\mathrm{Opt}$ is large tend to be "good" for the optimizing region $S$ in question. They seem to correspond to local minima (or at least local Pareto frontiers) of a utility function. They also correspond to worlds where $S$ is both knowledgeable and powerful. They correspond to worlds where $S$ is "in control". Here are a few potential thoughts on making safer AI designs using this concept:

  • Having a mathematically well-defined measure of optimization means it can be hard-coded rather than relying on machine learning.
  • Lots of thought has gone into trying to make AI "low impact", and using $\mathrm{Opt}$ in this way might let us specify this better.
  • If worlds of high $\mathrm{Opt}$ tend to be good for the optimizer in question, then this might provide a route towards a way to encode things that are good for humans.
  • $\mathrm{Opt}$ can be defined for regions $S$ in the AI's past, which makes it harder to reward-hack, or to modify the preferences of the humans in question in order to hack $\mathrm{Opt}$.
  • $\mathrm{Opt}$ is defined locally, but extending it to distributions over worlds is probably trivial (see the sketch after this list).
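One way to make that last bullet concrete (this is my guess at the obvious extension, not something I have worked out carefully): average the local measure under a distribution $p$ over world histories,

$$\mathrm{Opt}_S(a \to b \mid p) \;=\; \mathbb{E}_{W \sim p}\!\left[\mathrm{Opt}_S(a \to b \mid W)\right].$$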

 

  1. ^

    World history here meaning a given set of numeric values which describe the state of a causal network. For example, if we have the network [Temperature in Celsius] → [State of Water], then the following are examples of world histories: [-10] → [0], [45] → [1], and [120] → [2], where we've represented the state of water as a number {0: Solid, 1: Liquid, 2: Gas}.

  2. ^

    So if we consider our previous world history examples, the value of our optimization metric at a temperature of 310 Kelvin doesn't depend on the behaviour of the system at 315 Kelvin.
