I've done some work on a definition of optimization which applies to "trajectories" in deterministic, differentiable models. What happens when we try to introduce uncertainty?

Suppose we have the following system consisting of three variables: the past $P$, future $F$, and some agent $A$. The agent "acts" on the system to push the value of $F$ 80% of the way towards being zero. We can think of this as follows: $A = -0.8P$, so $F = P + A = 0.2P$. Under these circumstances, $\frac{dF}{dP} = 0.2$, which means our optimization function gives: $\mathrm{Opt} = -\log_2\frac{dF}{dP} = -\log_2(0.2) \approx 2.32$ bits.
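For concreteness, here's that deterministic setup as a few lines of Python (a minimal sketch; the function names are mine, not part of the original definition):

```python
import math

def agent(p):
    # The agent pushes F 80% of the way towards zero.
    return -0.8 * p

def future(p):
    # F = P + A = 0.2 * P
    return p + agent(p)

# F is linear in P, so dF/dP is an exact finite difference:
dF_dP = future(1.0) - future(0.0)   # 0.2
print(-math.log2(dF_dP))            # Opt = log2(5) ≈ 2.32 bits
```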

What if we instead consider a normal distribution over $P$? This must be parameterized by a mean $\mu$ and a standard deviation $\sigma$. Our formulae now look like this:

$$P \sim \mathcal{N}(\mu, \sigma^2)$$

$$A = -0.8P \sim \mathcal{N}(-0.8\mu,\ 0.64\sigma^2)$$

$$F = P + A = 0.2P \sim \mathcal{N}(0.2\mu,\ 0.04\sigma^2)$$
So what does it look like for $A$ to "not depend" on $P$? We could just "pick" some value for $A$, but this seems like cheating. What if we set up a new model, in which $F'$ depends on $P''$ and $A'$, but $A'$ depends on $P'$ instead of $P''$? We can allow $P'$ and $P''$ to have the same distributions as before:

$$P' \sim \mathcal{N}(\mu, \sigma^2) \qquad P'' \sim \mathcal{N}(\mu, \sigma^2)$$

$$A' = -0.8P' \qquad F' = P'' + A'$$
Calculating the distribution of $F'$ is a bit more difficult. We can think of it as adding two uncorrelated normal distributions together; for normal distributions this just means adding the means and variances together. Our distributions have means $\mu$ and $-0.8\mu$, and variances $\sigma^2$ and $0.64\sigma^2$. Therefore we get a new distribution with mean $0.2\mu$ and variance $1.64\sigma^2$. This gives a standard deviation of $\sqrt{1.64}\,\sigma \approx 1.28\sigma$.
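As a quick sanity check, here's a Monte Carlo sketch of the new model (assuming numpy; $\mu$, $\sigma$, and the sample size are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 1_000_000   # arbitrary example parameters

p1 = rng.normal(mu, sigma, n)        # P'
p2 = rng.normal(mu, sigma, n)        # P''
f = p2 - 0.8 * p1                    # F' = P'' + A', with A' = -0.8 P'

print(f.mean())   # ≈ 0.2 * mu
print(f.std())    # ≈ sqrt(1.64) * sigma ≈ 1.28
```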

What's the entropy of a normal distribution? Well, it's difficult to say properly, since entropy is poorly-defined on continuous variables. If one takes the limiting density of discrete points, one gets $H(X) = \log_2\left(\sigma\sqrt{2\pi e}\right) + \log_2 N$, where $N$ goes to infinity. This is a problem unless we happen to be subtracting one entropy from another. So let's do that.

$$H(F) - H(F') = \left[\log_2\left(0.2\sigma\sqrt{2\pi e}\right) + \log_2 N\right] - \left[\log_2\left(\sqrt{1.64}\,\sigma\sqrt{2\pi e}\right) + \log_2 N\right]$$

$$= \log_2\left(\frac{0.2}{\sqrt{1.64}}\right) \approx -2.68 \text{ bits}$$
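Numerically (a sketch; `normal_entropy_bits` is my own helper, with the divergent $\log_2 N$ term dropped since it cancels in any difference):

```python
import math

def normal_entropy_bits(sd):
    # Differential entropy of N(mu, sd^2) in bits, omitting the
    # divergent log2(N) term, which cancels when we subtract.
    return 0.5 * math.log2(2 * math.pi * math.e * sd**2)

sigma = 1.0
h_f = normal_entropy_bits(0.2 * sigma)                    # H(F), F = 0.2 P
h_f_prime = normal_entropy_bits(math.sqrt(1.64) * sigma)  # H(F')
print(h_f - h_f_prime)   # ≈ -2.68 bits: note the sign
```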
Ok, so we got the sign wrong the first time; the quantity we want is $H(F') - H(F) \approx 2.68$ bits. Never mind. But there is another issue: this is higher than our previous value of $2.32$ bits. This is because we're double-counting the variance from the past: $F'$ gets variance from both $P'$ and $P''$. We can correct this by changing the object of study from $H(F')$ to $H(F'|P')$. This works exactly like you'd expect: it gives a weighted average of the value of $H(F')$ over all possible values of $P'$. In this case it is trivial: for any fixed value of $P'$, the only randomness left in $F'$ comes from $P''$, so $\sigma_{F'|P'} = \sigma$. So let's take a look:

$$H(F'|P') - H(F) = \left[\log_2\left(\sigma\sqrt{2\pi e}\right) + \log_2 N\right] - \left[\log_2\left(0.2\sigma\sqrt{2\pi e}\right) + \log_2 N\right]$$

$$= \log_2\left(\frac{1}{0.2}\right) = \log_2 5 \approx 2.32 \text{ bits}$$

This matches the value from the deterministic model.
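And the same numeric check for the conditional version (same hedges as before: `normal_entropy_bits` is an illustrative helper):

```python
import math

def normal_entropy_bits(sd):
    # Differential entropy in bits; the log2(N) term cancels in differences.
    return 0.5 * math.log2(2 * math.pi * math.e * sd**2)

sigma = 1.0
h_f_given_p = normal_entropy_bits(sigma)    # H(F'|P'): only P'' still varies
h_f = normal_entropy_bits(0.2 * sigma)      # H(F): F = 0.2 P
print(h_f_given_p - h_f)                    # log2(5) ≈ 2.32 bits
```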
In any Bayes-ish net-ish model, if we can get an agent's behaviour in the following form:

*[Figure: a network with nodes $P$, $A$, and $F$. There are arrows from $P$ to $F$, $P$ to $A$, and $A$ to $F$.]*

We can make the following transformation, and get $\mathrm{Opt} = H(F'|P') - H(F)$:

*[Figure: the network from above, with an arrow pointing from it to a new network with nodes $P'$, $P''$, $A'$, and $F'$. There are arrows from $P'$ to $A'$, $P''$ to $F'$, and $A'$ to $F'$.]*
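In the linear-Gaussian special case ($A = aP$, $F = P + A$) this transformation has a simple closed form; a sketch, with `opt_bits` as a hypothetical helper name:

```python
import math

def opt_bits(a, sigma=1.0):
    # Linear-Gaussian case: A = a * P, so F = (1 + a) * P.
    # After the transformation, fixing P' leaves F' = P'' + a * P',
    # whose standard deviation is just sigma.
    sd_f_given_p = sigma
    sd_f = abs(1 + a) * sigma              # standard deviation of F
    return math.log2(sd_f_given_p / sd_f)  # H(F'|P') - H(F)

print(opt_bits(-0.8))   # log2(5) ≈ 2.32 bits
```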

I will think more about whether this extension is properly valid. One limitation is that we cannot have multiple sets of arrows into and out of $A$, since this would mess with the splitting of $P$ into $P'$ and $P''$.
