Here’s a conceptual problem David and I have been lightly tossing around the past couple days.
“A is a subset of B” we might visualize like this:
If we want a fuzzy/probabilistic version of the same diagram, we might draw something like this:
And we can easily come up with some ad-hoc operationalization of that “fuzzy subset” visual. But we’d like a principled operationalization.
Here’s one that I kinda like, based on maxent machinery.
First, a background concept. Consider this maxent problem:

$$\max_P H(P) \text{ s.t. } E_{X \sim P}[-\log Q(X)] \le E_{X \sim Q}[-\log Q(X)]$$

Or, more compactly:

$$\max_P H(P) \text{ s.t. } E_P[-\log Q] \le H(Q)$$
In English: what is the maximum entropy distribution $P$ for which (the average number of bits used to encode a sample from $P$ using a code optimized for distribution $Q$) is at most (the average number of bits used to encode a sample from $Q$ using a code optimized for $Q$)?
The solution to this problem is just $P = Q$.
Proof
First, the constraint must bind, except in the trivial case where $Q$ is uniform. If the constraint did not bind, the solution would be the uniform distribution $U$. In that case, the constraint would say

$$E_{X \sim U}[-\log Q(X)] \le H(Q) \le H(U)$$

(because the uniform distribution has maximal entropy)

… but then adding Gibbs’ inequality $H(U) \le E_{X \sim U}[-\log Q(X)]$ yields

$$H(U) \le E_{X \sim U}[-\log Q(X)] \le H(Q) \le H(U)$$

which forces equality everywhere in the chain, and equality in Gibbs’ inequality can be satisfied iff the two distributions are equal. So unless $Q$ is uniform, we have a contradiction, therefore the constraint must bind.
Since the constraint binds, the usual first-order condition for a maxent problem tells us that the solution has the form $P(X) = \frac{1}{Z} e^{\lambda \log Q(X)} = \frac{1}{Z} Q(X)^\lambda$, where $Z$ is a normalizer and the scalar $\lambda$ is chosen to satisfy the constraint. We can trivially satisfy the constraint by choosing $\lambda = 1$, in which case $Z = 1$ normalizes the distribution and we get $P = Q$. Uniqueness of maxent distributions then finishes the proof.
So conceptually, leveraging the zen of maxent distributions, the constraint $E_{X \sim P}[-\log Q(X)] \le H(Q)$ encodes the same information about $X$ as the distribution $Q$ itself.
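For the numerically inclined, here’s a quick sanity check of the $P = Q$ result. This is a minimal sketch, assuming a small finite outcome space and using a generic constrained optimizer (nothing here depends on the particular solver):

```python
# Sanity check: maximize H(P) subject to E_{X~P}[-log Q(X)] <= H(Q)
# on a small finite outcome space, and confirm the solution is P = Q.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
Q = rng.dirichlet(np.ones(5))            # an arbitrary distribution Q
log_Q = np.log(Q)
H_Q = -np.sum(Q * log_Q)                 # H(Q) = E_{X~Q}[-log Q(X)]

def neg_entropy(P):
    P = np.clip(P, 1e-12, 1.0)
    return np.sum(P * np.log(P))         # minimizing -H(P) maximizes H(P)

constraints = [
    {"type": "eq",   "fun": lambda P: np.sum(P) - 1.0},   # P is a distribution
    {"type": "ineq", "fun": lambda P: H_Q + P @ log_Q},   # H(Q) - E_P[-log Q] >= 0
]
res = minimize(neg_entropy, np.ones(5) / 5, method="SLSQP",
               bounds=[(0.0, 1.0)] * 5, constraints=constraints)

print(np.round(res.x, 4))                # matches Q up to solver tolerance
print(np.round(Q, 4))
```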
Conceptually, if the constraint $E_P[-\log Q_1] \le H(Q_1)$ encodes all the information from $Q_1$ into a maxent problem, and the constraint $E_P[-\log Q_2] \le H(Q_2)$ encodes all the information from $Q_2$ into a maxent problem, then solving the maxent problem with both of those constraints integrates “all the information from both $Q_1$ and $Q_2$” in some sense.
Qualitatively, here’s what that looks like in an example:
$Q_1$ says that $X$ is probably in the red oval. $Q_2$ says that $X$ is probably in the blue oval. So together, they conceptually say that $X$ is probably somewhere in the middle, roughly where the two ovals intersect.
Mathematically, the first order maxent condition says $P(X) = \frac{1}{Z} Q_1(X)^{\lambda_1} Q_2(X)^{\lambda_2}$, for some $\lambda_1, \lambda_2$ (which we will assume are both positive, because I don’t want to dive into the details of that right now). For any specific $X$ value, $Q_1(X)$ and $Q_2(X)$ can be no larger than 1, but they can be arbitrarily close to 0 (they could even be 0 exactly). And since they’re multiplied, when either one is very close to 0, we intuitively expect the product to be very close to 0. Most of the probability mass will therefore end up in places where neither distribution is very close to 0, i.e. the spot where the ovals roughly intersect, as we’d intuitively hoped.
Notably, in the case where $Q_1$ and $Q_2$ are uniform over their ovals (so they basically just represent sets), the resulting distribution $P$ is exactly the uniform distribution over the intersection of the two sets. So conceptually, $Q_1$ says something like “$X$ is in set $S_1$”, $Q_2$ says something like “$X$ is in set $S_2$”, and then throwing both of those into a maxent problem says something like “$X$ is in $S_1$ and $X$ is in $S_2$, i.e. $X$ is in the intersection $S_1 \cap S_2$”.
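Here’s a minimal sketch of that uniform case on a 1D grid, taking $\lambda_1 = \lambda_2 = 1$ for simplicity (for uniform $Q$’s, any positive exponents give the same support and hence the same uniform answer):

```python
# Two uniform "ovals" on a 1D grid: the normalized product Q1 * Q2
# is exactly the uniform distribution on their intersection.
import numpy as np

grid = np.arange(100)
Q1 = ((grid >= 10) & (grid < 60)).astype(float)   # uniform on {10, ..., 59}
Q2 = ((grid >= 40) & (grid < 90)).astype(float)   # uniform on {40, ..., 89}
Q1 /= Q1.sum()
Q2 /= Q2.sum()

P = Q1 * Q2
P /= P.sum()                                      # P(X) ∝ Q1(X) * Q2(X)

support = grid[P > 0]
print(support.min(), support.max())               # 40 59: the intersection
print(np.unique(np.round(P[P > 0], 12)))          # one value: uniform on it
```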
So that hopefully gives a little intuition for how and why maxent can be used to combine the information “assumed in” two different distributions $Q_1, Q_2$.
What if we throw $Q_1$ and $Q_2$ into a maxent problem, but it turns out that $Q_2$’s constraint is nonbinding? Conceptually, that would mean that $Q_1$ already tells us everything about $X$ which $Q_2$ tells us (and possibly more). Or, in hand-wavy set terms, it would say that $S_1$ is a subset of $S_2$, and therefore $Q_1$ puts a strictly stronger bound on $X$.
In principle, we can check whether $Q_2$’s constraint is binding without actually running the maxent problem. We know that if $Q_2$’s constraint doesn’t bind, the maxent solution is $P = Q_1$, so we can just evaluate $Q_2$’s constraint at $Q_1$ and see if it’s satisfied. The key condition is therefore:

$$E_{X \sim Q_1}[-\log Q_2(X)] \le H(Q_2)$$
$Q_2$’s constraint is nonbinding iff that condition holds, so we can view $E_{X \sim Q_1}[-\log Q_2(X)] \le H(Q_2)$ as saying something conceptually like “The information about $X$ implicitly encoded in $Q_2$ is implied by the information about $X$ implicitly encoded in $Q_1$” or, in the uniform case, “$S_1$ is a subset of $S_2$”.
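The uniform case is easy to see by hand: if $Q_2$ is uniform on $S_2$, then $H(Q_2) = \log |S_2|$, while $E_{X \sim Q_1}[-\log Q_2(X)]$ equals $\log |S_2|$ whenever $Q_1$’s support lies inside $S_2$, and is infinite otherwise. Here’s a minimal sketch of the check (the function name is mine, just for illustration):

```python
# The proposed "fuzzy subset" check: E_{X~Q1}[-log Q2(X)] <= H(Q2).
import numpy as np

def is_fuzzy_subset(Q1, Q2, tol=1e-9):
    # If Q2 is zero somewhere Q1 isn't, the cross-entropy is infinite
    # and the check fails: Q1's "set" sticks out of Q2's.
    with np.errstate(divide="ignore", invalid="ignore"):
        log_Q2 = np.log(Q2)
        cross = -np.sum(np.where(Q1 > 0, Q1 * log_Q2, 0.0))   # E_{Q1}[-log Q2]
        H_Q2 = -np.sum(np.where(Q2 > 0, Q2 * log_Q2, 0.0))    # H(Q2)
    return cross <= H_Q2 + tol

grid = np.arange(100)
small = ((grid >= 30) & (grid < 50)).astype(float); small /= small.sum()
big   = ((grid >= 10) & (grid < 90)).astype(float); big   /= big.sum()

print(is_fuzzy_subset(small, big))   # True: the small set sits inside the big one
print(is_fuzzy_subset(big, small))   # False: infinite cross-entropy
```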
Now for an interesting check. If we’re going to think of this formula as analogous to a subset relationship, then we’d like to have transitivity: $A \subseteq B$ and $B \subseteq C$ implies $A \subseteq C$. So, do we have

$$\left( E_{X \sim Q_1}[-\log Q_2(X)] \le H(Q_2) \text{ and } E_{X \sim Q_2}[-\log Q_3(X)] \le H(Q_3) \right) \implies E_{X \sim Q_1}[-\log Q_3(X)] \le H(Q_3)$$

?
Based on David’s quick computational check, the answer is “no”, which makes this look a lot less promising, though I’m not yet fully convinced.
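For anyone who wants to poke at this themselves, here’s a sketch of the kind of check described; this is my own reconstruction, not David’s actual code:

```python
# Search for transitivity violations: does E_{Q1}[-log Q2] <= H(Q2)
# and E_{Q2}[-log Q3] <= H(Q3) imply E_{Q1}[-log Q3] <= H(Q3)?
import numpy as np

def fuzzy_subset(P, Q):
    return -np.sum(P * np.log(Q)) <= -np.sum(Q * np.log(Q))

rng = np.random.default_rng(0)
premises, violations = 0, 0
for _ in range(100_000):
    Q1, Q2, Q3 = rng.dirichlet(np.ones(3), size=3)   # random triple on 3 outcomes
    if fuzzy_subset(Q1, Q2) and fuzzy_subset(Q2, Q3):
        premises += 1
        if not fuzzy_subset(Q1, Q3):
            violations += 1                           # transitivity fails here

# Per the reported result, the violation count comes out nonzero.
print(violations, "violations out of", premises, "cases where both premises hold")
```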