Defining Optimization in a Deeper Way Part 2

J Bostock

We have successfully eliminated the concepts of null actions and nonexistence from our definition of optimization. We have also eliminated the concept of repeated action. We are halfway there, and now have to eliminate uncertainty and absolute time. Then we will have achieved the goal of being able to wrap a 3D hyperplane boundary around a 4D chunk of relativistic spacetime and ask ourselves "Is this an optimizer?" in a meaningful way.

I'm going to tackle uncertainty next.

TL;DR I have allowed for a mor

We've already defined, for a deterministic system, a numerical optimizing-ness for a joint probability distribution, in terms of entropy. Now I want to extend that to a non-joint (factorized) probability distribution of the form $P^{A} (s^{A}) P^{B} (s^{B})$. We can do this by defining $P_{t - 1}^{A} (s^{A})$ and $P_{t - 1}^{B} (s^{B})$ for the previous timestep.

We can then define $P_{t}^{A B} (s^{A}, s^{B})$ by stepping forwards from $t - 1$ to $t$ as before, according to the dynamics of the system.

A question we might want to ask is, for a given $P_{t - 1}^{A} (s^{A})$ and $P_{t - 1}^{B} (s^{B})$ , how "optimizing" is the distribution $P_{t}^{A B} (s^{A}, s^{B})$ ?

The Dumb Thermostat

Let's apply our new idea to the previous models, the two thermostats. Let's begin with uncorrelated, maximum entropy distributions.

For thermostat 1 we have the dynamic matrix:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$(h o t, o f f)$	$(h o t, l o w)$	$(w a r m, h i g h)$
$l o w$	$(h o t, o f f)$	$(w a r m, l o w)$	$(c o l d, h i g h)$
$o f f$	$(w a r m, o f f)$	$(c o l d, l o w)$	$(c o l d, h i g h)$

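(In this matrix, the entry in a cell represents the state at time $t + 1$, given that the coordinates of that cell represent the state at time $t$. Columns give the room temperature and rows give the thermostat setting; each entry is a (room, thermostat) pair.)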

With the $P_{t - 1}^{R T}$ distribution:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$1 / 9$	$1 / 9$	$1 / 9$
$l o w$	$1 / 9$	$1 / 9$	$1 / 9$
$o f f$	$1 / 9$	$1 / 9$	$1 / 9$

As an aside, this has $\log_{2} 9 \approx 3.2$ bits of entropy.

Leading to the $P_{t}^{R T}$ distribution:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$0$	$1 / 9$	$2 / 9$
$l o w$	$1 / 9$	$1 / 9$	$1 / 9$
$o f f$	$2 / 9$	$1 / 9$	$0$

This gives us the "standard" $P_{t + 1}^{R T}$ distribution of:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$0$	$2 / 9$	$1 / 9$
$l o w$	$1 / 9$	$1 / 9$	$1 / 9$
$o f f$	$1 / 9$	$2 / 9$	$0$

And the "decorrelated" $P_{t + 1}^{' R T}$ distribution is actually just the same as $P_{t}^{R T}$! Both marginals of $P_{t}^{R T}$ are uniform, so when we decorrelate the probabilities for $s^{R}$ and $s^{T}$ we just get back to the maximum entropy distribution $P_{t - 1}^{R T}$. Stepping that forwards gives $P_{t}^{R T}$ again, and so $P_{t + 1}^{' R T} = P_{t}^{R T}$.

It's clear by inspection that the distributions $P_{t + 1}^{R T}$ and $P_{t + 1}^{' R T}$ have the same entropy (both contain the same probabilities, just assigned to different cells), so the decorrelated maximum entropy $P_{t - 1}^{R} P_{t - 1}^{T}$ does not produce an "optimizing" distribution at $P_{t}^{R T}$.

If we actually consider the dynamics of this system, we can see that this makes sense! The system actually either stays at $(w a r m, l o w)$ or falls into the cycle:

$(h o t, l o w) \to (h o t, o f f) \to (w a r m, o f f) \to (c o l d, l o w) \to (c o l d, h i g h) \to (w a r m, h i g h) \to (h o t, l o w)$

So there's no compression of futures into a smaller number of trajectories.
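The check above can be reproduced mechanically. The following is a minimal sketch, assuming states are written as (room, thermostat) pairs and using exact fractions for the probabilities; the helper names `step`, `decorrelate`, `entropy` and `orbit` are illustrative and do not come from the original post.

```python
from collections import defaultdict
from fractions import Fraction
from math import log2

ROOM = ("hot", "warm", "cold")
THERMOSTAT = ("high", "low", "off")

# Thermostat 1, transcribed from the dynamic matrix above.
# Keys and values are (room, thermostat) pairs: state at t -> state at t + 1.
DUMB = {
    ("hot", "high"): ("hot", "off"),
    ("warm", "high"): ("hot", "low"),
    ("cold", "high"): ("warm", "high"),
    ("hot", "low"): ("hot", "off"),
    ("warm", "low"): ("warm", "low"),
    ("cold", "low"): ("cold", "high"),
    ("hot", "off"): ("warm", "off"),
    ("warm", "off"): ("cold", "low"),
    ("cold", "off"): ("cold", "high"),
}


def step(dist, dynamics):
    """Push a distribution one timestep forwards through deterministic dynamics."""
    out = defaultdict(Fraction)
    for state, p in dist.items():
        out[dynamics[state]] += p
    return dict(out)


def decorrelate(dist):
    """Replace a joint distribution by the product of its two marginals."""
    room, thermo = defaultdict(Fraction), defaultdict(Fraction)
    for (r, t), p in dist.items():
        room[r] += p
        thermo[t] += p
    return {(r, t): room[r] * thermo[t] for r in room for t in thermo}


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(float(p) * log2(float(p)) for p in dist.values() if p > 0)


def orbit(state, dynamics, n):
    """The first n + 1 states of the trajectory starting at `state`."""
    path = [state]
    for _ in range(n):
        path.append(dynamics[path[-1]])
    return path


uniform = {(r, t): Fraction(1, 9) for r in ROOM for t in THERMOSTAT}
p_t = step(uniform, DUMB)
p_next = step(p_t, DUMB)                            # "standard" P_{t+1}
p_next_decorrelated = step(decorrelate(p_t), DUMB)  # "decorrelated" P'_{t+1}

# Entropy gap between the decorrelated and standard futures.
print(entropy(p_next_decorrelated) - entropy(p_next))
# Trajectory starting from (hot, low), followed for six steps.
print(orbit(("hot", "low"), DUMB, 6))
```

Working with `Fraction` keeps the comparison between the decorrelated distribution and the uniform one exact, so any mismatch would come from the dynamics rather than from rounding.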

The Smart Thermostat

What about our "smarter" thermostat? This one has the dynamic matrix:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$(h o t, o f f)$	$(h o t, l o w)$	$(w a r m, l o w)$
$l o w$	$(h o t, o f f)$	$(w a r m, l o w)$	$(c o l d, h i g h)$
$o f f$	$(w a r m, l o w)$	$(c o l d, l o w)$	$(c o l d, h i g h)$

Well now our $P_{t}^{R T}$ distribution looks like this:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$0$	$0$	$2 / 9$
$l o w$	$1 / 9$	$1 / 3$	$1 / 9$
$o f f$	$2 / 9$	$0$	$0$

Giving a "standard" $P_{t + 1}^{R T}$ of this:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$0$	$0$	$1 / 9$
$l o w$	$0$	$7 / 9$	$0$
$o f f$	$1 / 9$	$0$	$0$

And a "decorrelated" $P_{t}^{' R T}$ of:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$2 / 27$	$2 / 27$	$2 / 27$
$l o w$	$5 / 27$	$5 / 27$	$5 / 27$
$o f f$	$2 / 27$	$2 / 27$	$2 / 27$

Giving the decorrelated $P_{t + 1}^{' R T}$ :

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$0$	$0$	$7 / 27$
$l o w$	$2 / 27$	$1 / 3$	$2 / 27$
$o f f$	$7 / 27$	$0$	$0$

Now in this case, these two do have different entropies. $P_{t + 1}^{R T}$ has an entropy of 1.0 bits, and $P_{t + 1}^{' R T}$ has an entropy of 2.1 bits. This gives us a difference of 1.1 bits of entropy. This is the Optimizing-ness we defined in the last post, but I think it's actually somewhat incomplete.

Let's also consider the initial difference between $P_{t}^{R T}$ and $P_{t}^{' R T}$. Decorrelating $P_{t}^{R T}$ takes it from 2.2 to 3.0 bits of entropy. So the entropy difference started off at 0.8 bits. Therefore the difference of the differences in entropy is $1.1 - 0.8 = 0.3$ bits.

The value associated with $P_{t - 1}^{R T}$ is equal to $(S [P_{t + 1}^{' R T}] - S [P_{t + 1}^{R T}]) - (S [P_{t}^{' R T}] - S [P_{t}^{R T}])$, which can also be expressed as $S [P_{t + 1}^{' R T}] + S [P_{t}^{R T}] - S [P_{t + 1}^{R T}] - S [P_{t}^{' R T}]$. We might call this quantity the adjusted optimizing-ness.
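As a sanity check on the numbers in this section, here is a small sketch that evaluates the adjusted optimizing-ness for the smart thermostat, starting from the uniform $P_{t - 1}^{R T}$. It assumes states are (room, thermostat) pairs; the function name `adjusted_optimizingness` and the other helpers are illustrative, not from the original post.

```python
from collections import defaultdict
from math import log2

ROOM = ("hot", "warm", "cold")
THERMOSTAT = ("high", "low", "off")

# Smart thermostat, transcribed from its dynamic matrix.
# Keys and values are (room, thermostat) pairs: state at t -> state at t + 1.
SMART = {
    ("hot", "high"): ("hot", "off"),
    ("warm", "high"): ("hot", "low"),
    ("cold", "high"): ("warm", "low"),
    ("hot", "low"): ("hot", "off"),
    ("warm", "low"): ("warm", "low"),
    ("cold", "low"): ("cold", "high"),
    ("hot", "off"): ("warm", "low"),
    ("warm", "off"): ("cold", "low"),
    ("cold", "off"): ("cold", "high"),
}


def step(dist, dynamics):
    """Push a distribution one timestep forwards through deterministic dynamics."""
    out = defaultdict(float)
    for state, p in dist.items():
        out[dynamics[state]] += p
    return dict(out)


def decorrelate(dist):
    """Replace a joint distribution by the product of its two marginals."""
    room, thermo = defaultdict(float), defaultdict(float)
    for (r, t), p in dist.items():
        room[r] += p
        thermo[t] += p
    return {(r, t): room[r] * thermo[t] for r in room for t in thermo}


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def adjusted_optimizingness(p_prev, dynamics):
    """Return (unadjusted, initial_gap, adjusted), all in bits."""
    p_t = step(p_prev, dynamics)
    p_next = step(p_t, dynamics)          # "standard" P_{t+1}
    q_t = decorrelate(p_t)                # P'_t
    q_next = step(q_t, dynamics)          # "decorrelated" P'_{t+1}
    unadjusted = entropy(q_next) - entropy(p_next)
    initial_gap = entropy(q_t) - entropy(p_t)
    return unadjusted, initial_gap, unadjusted - initial_gap


uniform = {(r, t): 1 / 9 for r in ROOM for t in THERMOSTAT}
unadjusted, initial_gap, adjusted = adjusted_optimizingness(uniform, SMART)
print(f"unadjusted={unadjusted:.2f} initial_gap={initial_gap:.2f} adjusted={adjusted:.2f}")
```

The three printed values should agree, after rounding, with the 1.1, 0.8 and 0.3 bits quoted above.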

Quantitative Data

The motivation for this was that a maximum entropy distribution is "natural" in some sense. This moves us towards not needing uncertainty. If we have a given state of a system, we might be able to "naturally" define a probability distribution around that state. Then we can measure the optimizing-ness of the next step's distribution.

What happens with a different $P_{t - 1}^{R T}$ condition? What if we have a distribution like this:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$ϵ^{2} / 4$	$ϵ (1 - ϵ) / 2$	$ϵ^{2} / 4$
$l o w$	$ϵ (1 - ϵ) / 2$	$(1 - ϵ)^{2}$	$ϵ (1 - ϵ) / 2$
$o f f$	$ϵ^{2} / 4$	$ϵ (1 - ϵ) / 2$	$ϵ^{2} / 4$

This is for some small $ϵ$, in the second situation (the smart thermostat).

Now $P_{t}^{R T}$ is like this:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$0$	$0$	$ϵ (1 - ϵ) / 2 + ϵ^{2} / 4$
$l o w$	$ϵ (1 - ϵ) / 2$	$(1 - ϵ)^{2} + ϵ^{2} / 2$	$ϵ (1 - ϵ) / 2$
$o f f$	$ϵ (1 - ϵ) / 2 + ϵ^{2} / 4$	$0$	$0$

So $P_{t + 1}^{R T}$ is:

	$h o t$	$w a r m$	$c o l d$
$h i g h$	$0$	$0$	$ϵ (1 - ϵ) / 2$
$l o w$	$0$	$1 - ϵ (1 - ϵ)$	$0$
$o f f$	$ϵ (1 - ϵ) / 2$	$0$	$0$

While it is theoretically possible to decorrelate everything, calculate the next set of distributions, and keep going, it's a huge mess. Using values of $ϵ$ between 0.1 and $10^{- 10}$, we can plot our previously defined adjusted optimizing-ness against the entropy of $P_{t - 1}^{R T}$.

It looks linear on a log-log scale, particularly in the region where $ϵ$ is very small. By fitting to the leftmost five points we get a simple linear relation: the adjusted optimizing-ness approaches half of the entropy of $P_{t - 1}^{R T}$.
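The sweep over $ϵ$ can be sketched numerically as follows. This is an illustrative reconstruction, not the code used for the original plot: it assumes the smart thermostat dynamics, (room, thermostat) state pairs, and a product $P_{t - 1}^{R T}$ with marginals $(ϵ / 2, 1 - ϵ, ϵ / 2)$ as in the table above. The helper names are hypothetical.

```python
from collections import defaultdict
from math import log2

ROOM = ("hot", "warm", "cold")
THERMOSTAT = ("high", "low", "off")

# Smart thermostat: (room, thermostat) at t -> (room, thermostat) at t + 1.
SMART = {
    ("hot", "high"): ("hot", "off"),
    ("warm", "high"): ("hot", "low"),
    ("cold", "high"): ("warm", "low"),
    ("hot", "low"): ("hot", "off"),
    ("warm", "low"): ("warm", "low"),
    ("cold", "low"): ("cold", "high"),
    ("hot", "off"): ("warm", "low"),
    ("warm", "off"): ("cold", "low"),
    ("cold", "off"): ("cold", "high"),
}


def step(dist, dynamics):
    """Push a distribution one timestep forwards through deterministic dynamics."""
    out = defaultdict(float)
    for state, p in dist.items():
        out[dynamics[state]] += p
    return dict(out)


def decorrelate(dist):
    """Replace a joint distribution by the product of its two marginals."""
    room, thermo = defaultdict(float), defaultdict(float)
    for (r, t), p in dist.items():
        room[r] += p
        thermo[t] += p
    return {(r, t): room[r] * thermo[t] for r in room for t in thermo}


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def adjusted_optimizingness(p_prev, dynamics):
    """S[P'_{t+1}] + S[P_t] - S[P_{t+1}] - S[P'_t], in bits."""
    p_t = step(p_prev, dynamics)
    q_t = decorrelate(p_t)
    return (entropy(step(q_t, dynamics)) + entropy(p_t)
            - entropy(step(p_t, dynamics)) - entropy(q_t))


def initial_distribution(eps):
    """Product distribution concentrated on (warm, low), with leakage eps."""
    room = {"hot": eps / 2, "warm": 1 - eps, "cold": eps / 2}
    thermo = {"high": eps / 2, "low": 1 - eps, "off": eps / 2}
    return {(r, t): room[r] * thermo[t] for r in ROOM for t in THERMOSTAT}


def ratio(eps):
    """Adjusted optimizing-ness divided by the entropy of P_{t-1}."""
    p_prev = initial_distribution(eps)
    return adjusted_optimizingness(p_prev, SMART) / entropy(p_prev)


for exponent in range(1, 11):
    eps = 10.0 ** -exponent
    print(f"eps=1e-{exponent:02d}  S[P_t-1]={entropy(initial_distribution(eps)):.3e}"
          f"  ratio={ratio(eps):.4f}")
```

If the fitted relation holds, the printed ratio should settle near one half as $ϵ$ shrinks. Plotting the two columns on logarithmic axes would give a picture comparable to the one described above.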

This is kind of weird. This might not be an optimal system to study, so let's look at another toy example. A more realistic model of a thermostat:

The Continuous Thermostat

The temperature of the room is considered as $S^{R} \in \mathbb{R}$. The activity of the thermostat is considered as $S^{T} \in \mathbb{R}$. Each timestep, we have the following updates:

$S_{t + 1}^{T} = S_{t}^{R}$

$S_{t + 1}^{R} = S_{t}^{R} - k S_{t}^{T}$

Consider the following distributions:

$P_{t - 1}^{R} (s^{R}) \sim U (10 - ϵ / 2, 10 + ϵ / 2)$

$P_{t - 1}^{T} (s^{T}) \sim U (10 - ϵ / 2, 10 + ϵ / 2)$

Where $U (a, b)$ refers to a uniform distribution between $a$ and $b$. $P_{t - 1}^{R T}$ can be thought of as a square of side length $ϵ$ centered on the point $(10, 10)$. $P_{t}^{R T}$ turns out to be a parallelogram. The corners transform like this:

Time = $t - 1$	Time = $t$
$(10 + ϵ / 2, 10 + ϵ / 2)$	$(10 (1 - k) + (ϵ / 2) (1 - k), 10 + ϵ / 2)$
$(10 + ϵ / 2, 10 - ϵ / 2)$	$(10 (1 - k) + (ϵ / 2) (1 + k), 10 + ϵ / 2)$
$(10 - ϵ / 2, 10 + ϵ / 2)$	$(10 (1 - k) - (ϵ / 2) (1 + k), 10 - ϵ / 2)$
$(10 - ϵ / 2, 10 - ϵ / 2)$	$(10 (1 - k) - (ϵ / 2) (1 - k), 10 - ϵ / 2)$
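The corner mapping can be checked directly from the update rule. The sketch below assumes points are written as $(s^{R}, s^{T})$, that the initial square has side $ϵ$ centered on $(10, 10)$, and that both updates read the time-$t$ values. The names `update` and `polygon_area` are illustrative.

```python
def update(state, k):
    """One timestep: s_R <- s_R - k * s_T and s_T <- s_R, both read at time t."""
    s_r, s_t = state
    return (s_r - k * s_t, s_r)


def polygon_area(points):
    """Shoelace formula for a simple polygon given by its vertices in order."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2


eps, k = 0.1, 0.3   # values used in the text
h = eps / 2         # half the side length of the square

# Corners listed in order around the boundary of the square.
square = [(10 + h, 10 + h), (10 - h, 10 + h), (10 - h, 10 - h), (10 + h, 10 - h)]
image = [update(corner, k) for corner in square]

for before, after in zip(square, image):
    print(before, "->", after)

print("area before:", polygon_area(square))
print("area after: ", polygon_area(image))
```

Because the update is linear with determinant of magnitude $k$, the area of the region should shrink by a factor of $k$ in a single step. For a uniform distribution on the region, that corresponds to a change of $\log k$ in differential entropy, before any decorrelation is applied.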

For $ϵ = 0.1, k = 0.3$ the whole sequence looks like the following:

So we clearly have some sort of optimization going on here. Estimating or calculating the entropy of these distributions is not easy. And when we use the entropy of a continuous distribution, we get results which depend on the choice of coordinates (or alternatively the choice of some weighting function). Entropies of continuous distributions may also be negative, which is quite annoying.

Perhaps calculating the variance will leave us better off? Sadly not. I tried it for Gaussians of decreasing variance and didn't get much. The equivalent of our adjusted optimizing-ness, which we might define as $\log (V [P_{t + 1}^{' A B}]) + \log (V [P_{t}^{A B}]) - \log (V [P_{t}^{' A B}]) - \log (V [P_{t + 1}^{A B}])$, is always zero for this system. The non-adjusted version, $\log (V [P_{t + 1}^{' A B}]) - \log (V [P_{t + 1}^{A B}])$, fluctuates a lot.
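One way to see why the adjusted quantity vanishes is to track covariance matrices through the linear update. The sketch below makes an assumption the post does not state: it takes $V$ to be the generalized variance, meaning the determinant of the $2 \times 2$ covariance matrix of $(s^{R}, s^{T})$. The starting variances are hypothetical values chosen for illustration.

```python
from math import log


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][m] * b[m][j] for m in range(2)) for j in range(2))
        for i in range(2)
    )


def transpose(a):
    return ((a[0][0], a[1][0]), (a[0][1], a[1][1]))


def det(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def propagate(cov, k):
    """Covariance after one update of (s_R, s_T) -> (s_R - k * s_T, s_R)."""
    m = ((1.0, -k), (1.0, 0.0))
    return matmul(matmul(m, cov), transpose(m))


def decorrelate(cov):
    """Keep the marginal variances and drop the correlation."""
    return ((cov[0][0], 0.0), (0.0, cov[1][1]))


def variance_scores(cov_prev, k):
    """Return (non-adjusted, adjusted), with V assumed to be det(covariance)."""
    cov_t = propagate(cov_prev, k)
    cov_next = propagate(cov_t, k)
    dec_t = decorrelate(cov_t)
    dec_next = propagate(dec_t, k)
    unadjusted = log(det(dec_next)) - log(det(cov_next))
    adjusted = unadjusted - (log(det(dec_t)) - log(det(cov_t)))
    return unadjusted, adjusted


k = 0.3
for sigma2 in (1.0, 0.1, 0.01):  # hypothetical starting variances
    start = ((sigma2, 0.0), (0.0, sigma2))  # independent Gaussians
    unadjusted, adjusted = variance_scores(start, k)
    print(f"sigma^2={sigma2}: non-adjusted={unadjusted:.4f}, adjusted={adjusted:.2e}")
```

Under this choice of $V$, every step multiplies the determinant by $k^{2}$ whether or not the distribution was decorrelated first, so the four log terms cancel for any invertible linear update. This is consistent with the zero result reported above, though it does not attempt to reproduce the fluctuations seen in the non-adjusted version.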

Where does this leave us?

We can define whether something is an optimizer based on a probability distribution which need not be joint over $A$ and $B$ . This means we can define whether something is an optimizer for an arbitrarily narrow probability distribution, meaning we can take the limit as the probability distribution approaches a delta. We found an interesting relation between quantities in our simplified system but failed to extend it to a continuous system.

LESSWRONG