Hypothesis Space Entropy

lsusr

In "Why quantitative finance is so hard" I explained why the entropy of your dataset must exceed the entropy of your hypothesis space. I used a simple hypothesis space with $n$ equally likely hypotheses, each with $m$ tunable parameters. Real life is not usually so homogeneous.

No Tunable Parameters

Consider an inhomogeneous hypothesis space with zero tunable parameters. Instead of $H = \ln n$, which works for homogeneous hypothesis spaces, we must use a more complicated entropy equation.

$H = - \sum_{i=1}^{n} ρ_{i} \ln ρ_{i}$

This equation makes intuitive sense. It vanishes when one $ρ_{j}$ equals 1 and all other $ρ_{i \neq j}$ equal 0. It is maximized when all $ρ_{i}$ are equal to $\frac{1}{n}$. $H = \ln n$ is the maximal case, attained when $ρ_{i} = \frac{1}{n} \; \forall i \in \{1, \dots, n\}$ ^[1].
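The two limiting cases above are easy to check numerically. Here is a minimal sketch, assuming entropy is measured in nats; the function name `branching_entropy` and the example distributions are made up for illustration.

```python
import math

def branching_entropy(probs):
    """Shannon entropy in nats: -sum(p * ln p), treating 0 * ln 0 as 0."""
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0)

# One hypothesis is certain: the entropy vanishes.
certain = branching_entropy([1.0, 0.0, 0.0, 0.0])

# All four hypotheses equally likely: the entropy equals ln(4).
uniform = branching_entropy([0.25] * 4)

# Anything in between lands strictly between the two extremes.
skewed = branching_entropy([0.7, 0.1, 0.1, 0.1])

print(certain, uniform, skewed, math.log(4))
```

The skewed distribution is only one example; it illustrates the bound rather than proving it.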

With Tunable Parameters

Suppose each hypothesis $i$ has $m_{i}$ tunable parameters. We can plug $m_{i}$ into our entropy equation.

$H = \sum_{i=1}^{n} ρ_{i} (m_{i} - \ln ρ_{i})$

Our old equation $H = m + \ln n$ is just the special case where all $ρ_{i}$ are homogeneous and all $m_{i}$ are homogeneous too.
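As a sanity check, the general formula can be evaluated directly and compared against the homogeneous special case. This is a sketch only: `hypothesis_space_entropy` is a hypothetical helper, and each $m_{i}$ is treated as an entropy in nats, exactly as the formula above does.

```python
import math

def hypothesis_space_entropy(probs, params):
    """H = sum_i rho_i * (m_i - ln rho_i), in nats."""
    return sum(p * (m - math.log(p)) for p, m in zip(probs, params) if p > 0)

# Homogeneous case: n equally likely hypotheses with m parameters each.
n, m = 8, 3
homogeneous = hypothesis_space_entropy([1 / n] * n, [m] * n)

# Inhomogeneous case: two equally likely hypotheses, one far more complex.
mixed = hypothesis_space_entropy([0.5, 0.5], [1, 100])

print(homogeneous, m + math.log(n))
print(mixed)
```

In the mixed case the single complex hypothesis accounts for almost all of the total.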

We have so far treated $m_{i}$ as representative of each hypothesis's tunable parameters. More generally, $m_{i}$ represents each hypothesis's internal entropy. If we think of hypotheses as a weighted tree, $m_{i}$ is what you get when you iterate one level down the tree. Our variable $H$ identifies the root of the tree. Call the entropy of the $i$th branch one level down $H_{i}$.

$H = \sum_{i=1}^{n} ρ_{i} (H_{i} - \ln ρ_{i})$

We can define the entropy of the rest of the tree with a recursive equation.

$\begin{matrix} H_{μ} & = & \sum_{i=1}^{n} ρ_{i} (H_{μ, i} - \ln ρ_{i}) \\ & = & \sum_{i=1}^{n} (ρ_{i} H_{μ, i} - ρ_{i} \ln ρ_{i}) \end{matrix}$

There are two parts to this equation: the recursive component $ρ_{i} H_{μ, i}$ and the branching component $- ρ_{i} \ln ρ_{i}$ .
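The recursion translates almost directly into code. Below is a minimal sketch under an assumed representation that is not part of the original post: a node is either a number (a leaf's internal entropy in nats) or a list of `(rho, child)` pairs whose probabilities sum to 1.

```python
import math

def tree_entropy(node):
    """Entropy of a weighted hypothesis tree, in nats.

    A leaf is a plain number (its internal entropy).
    An internal node is a list of (rho, child) pairs.
    """
    if isinstance(node, (int, float)):
        return float(node)
    total = 0.0
    for rho, child in node:
        if rho > 0:
            recursive = rho * tree_entropy(child)   # rho_i * H_{mu,i}
            branching = -rho * math.log(rho)        # -rho_i * ln(rho_i)
            total += recursive + branching
    return total

# Hypothetical two-level tree: one trivial leaf and one subtree with two leaves.
tree = [
    (0.5, 0.0),
    (0.5, [(0.5, 2.0), (0.5, 4.0)]),
]

print(tree_entropy(tree))
```

For this example the inner subtree has entropy $3 + \ln 2$, so the root evaluates to $1.5 + 1.5 \ln 2$.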

Branching component $- ρ_{i} \ln ρ_{i}$

The $- ρ_{i} \ln ρ_{i}$ component is maximized when $ρ_{i} = \frac{1}{e}$, because its derivative with respect to $ρ_{i}$, namely $- \ln ρ_{i} - 1$, vanishes there.

$\begin{matrix} - ρ_{i} \ln ρ_{i} & = & - \frac{1}{e} \ln \frac{1}{e} \\ & = & \frac{1}{e} \ln e \\ & = & \frac{1}{e} \end{matrix}$

The branching component of any single branch tops out at $\frac{1}{e}$ . Summed over all $n$ branches of a level it can reach at most $\ln n$, which grows only logarithmically with the number of branches. It can never contribute a massive quantity of entropy to your hypothesis space.

$0 \leq - ρ_{i} \ln ρ_{i} \leq \frac{1}{e}$

The branching component is mostly unimportant. The bulk of our entropy comes from the recursive component.
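The bound on a single branching term can be confirmed with a coarse grid search. This is only a numerical sketch; the grid resolution and the name `branch_term` are arbitrary choices.

```python
import math

def branch_term(rho):
    """Single-branch contribution -rho * ln(rho), with the rho = 0 limit taken as 0."""
    return -rho * math.log(rho) if rho > 0 else 0.0

# Scan rho over (0, 1) in steps of 1e-4 and find where the term peaks.
grid = [k / 10_000 for k in range(1, 10_000)]
best_rho = max(grid, key=branch_term)

print(best_rho, 1 / math.e)
print(branch_term(best_rho), 1 / math.e)
```

Both the location of the peak and its height should land close to $\frac{1}{e}$, to within the grid resolution.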

Recursive component $ρ_{i} H_{μ, i}$

Fix $ρ_{i}$ at a positive value. There is no limit to how big $H_{μ, i}$ can become. You can make it arbitrarily large just by adding parameters. Consequently $ρ_{i} H_{μ, i}$ can become arbitrarily large too. In real-world situations we should expect the recursive components of our hypothesis space to dominate the branching components.
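To see the two components side by side, hold $ρ_{i}$ fixed and let the subtree entropy grow. The values below are made up for illustration, and `component_sizes` is a hypothetical helper.

```python
import math

def component_sizes(rho, child_entropy):
    """Return (recursive, branching) contributions of one branch, in nats."""
    recursive = rho * child_entropy
    branching = -rho * math.log(rho)
    return recursive, branching

rho = 0.01  # fixed, small but positive
for child_entropy in (1, 10, 100, 1000):
    recursive, branching = component_sizes(rho, child_entropy)
    print(f"H_child={child_entropy:>5}  recursive={recursive:.3f}  branching={branching:.3f}")
```

The branching column stays constant while the recursive column grows linearly in the subtree entropy.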

If $ρ_{i}$ vanishes then the recursive component disappears. This might explain why human minds like to round "extremely unlikely" $ϵ > ρ_{i} > 0$ to "impossible" $ρ_{i} = 0$ when $H_{μ, i}$ is large. Doing so removes lots of entropy from our hypothesis space while still being right almost all of the time. This may be related to synaptic pruning.
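A toy calculation shows how much entropy such rounding can remove. This sketch assumes that pruned probability mass is redistributed by renormalizing the surviving branches, which the post does not specify; the threshold `epsilon` and the example numbers are likewise hypothetical.

```python
import math

def space_entropy(branches):
    """H = sum_i rho_i * (H_i - ln rho_i) over (rho_i, H_i) pairs, in nats."""
    return sum(rho * (h - math.log(rho)) for rho, h in branches if rho > 0)

def prune(branches, epsilon):
    """Drop branches with rho < epsilon, then renormalize (an assumption)."""
    kept = [(rho, h) for rho, h in branches if rho >= epsilon]
    if not kept:
        raise ValueError("epsilon is too large: every branch was pruned")
    total = sum(rho for rho, _ in kept)
    return [(rho / total, h) for rho, h in kept]

# Two ordinary hypotheses plus one very unlikely, very complex one.
branches = [(0.6, 2.0), (0.399, 3.0), (0.001, 10_000.0)]

before = space_entropy(branches)
after = space_entropy(prune(branches, epsilon=0.01))

print(before, after)
```

Here the pruned hypothesis had probability 0.001, so the pruned space is wrong about it only one time in a thousand, yet most of the entropy is gone.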

Lessons for Hypothesis Space Design

Once again, we have confirmed that having hypotheses with lots of parameters is a worse problem than having lots of hypotheses to choose between. More generally, one or more hypotheses with exceptionally high entropy dominate the total entropy of your hypothesis space. If you want better priors then the first step of your optimization should be to eliminate these complex subtrees from your hypothesis space.

[1] Proof: $\begin{matrix} H & = & - \sum_{i=1}^{n} ρ_{i} \ln ρ_{i} \\ & = & - \sum_{i=1}^{n} \frac{1}{n} \ln \frac{1}{n} \\ & = & - \ln \frac{1}{n} \\ & = & - \ln (n^{- 1}) \\ & = & \ln n \end{matrix}$

Alexander Gietelink Oldenziel (4y)

I am confused what is meant by a 'hypothesis'.

Is this a probability distribution? What is the mathematical object that you denote by hypothesis?

lsusr (4y)

It's a probability distribution. A hypothesis space is a probability distribution of probability distributions.
