The Additive Summary Equation

johnswentworth

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post contains some theorems and proofs needed for a hopefully-upcoming post on some powerful generalizations of the Koopman-Pitman-Darmois (KPD) Theorem. Unless you find functional equations interesting in their own right, and want to read some pretty dense math, you should probably skip this post. The theorems are pretty self-contained, and will be summarized in any future posts which need them.

The Summary Equation

We can represent the idea of a $D$-dimensional “summary” of $x$ for a function $f$ via a functional equation:

$F (G (x)) = f (x)$

Given the function $f$ , we try to find some $D$ -dimensional “summary” $G (x)$ such that $f$ can be computed from $G$ - i.e. we want some $F, G$ such that $F (G (x)) = f (x)$ for all $x$ .

In order for this to be meaningful, we need some mild assumptions on $f$ , $F$ , and $G$ ; at the very least, we certainly need to exclude space-filling curves, which would defeat the point of a “ $D$ -dimensional summary”. Throughout this post, we’ll assume differentiability, although this should be easy to relax somewhat by taking limits of differentiable functions.

Easy theorem: The $D$ -dimensional summary equation for $f$ is solvable only if the rank of the matrix $\frac{\partial f}{\partial x}$ is at most $D$ for all values of $x$ . I’ll call this the “Summarizability Theorem”. (If you want a more official-sounding name, it’s the global converse of the constant-rank theorem.)

Proof: differentiate both sides of the equation to get $\frac{\partial F}{\partial G} \frac{\partial G}{\partial x} = \frac{\partial f}{\partial x}$ . Since $G$ is $D$ -dimensional, this is itself a rank-at-most- $D$ decomposition of $\frac{\partial f}{\partial x}$ .
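A minimal numerical sketch of the Summarizability Theorem, assuming a made-up pair of functions: `G` (a 2-dimensional summary) and `F` (a map from the summary to a 4-dimensional output) are hypothetical choices for illustration, not anything from the theorems themselves. The Jacobian of $f = F \circ G$ is estimated by central differences, and its rank should never exceed $D = 2$.

```python
import numpy as np

def G(x):
    # Hypothetical 2-dimensional summary: sum and sum of squares.
    return np.array([x.sum(), (x ** 2).sum()])

def F(g):
    # Hypothetical map from the summary to a 4-dimensional output.
    s, q = g
    return np.array([s, q, s * q, np.sin(s)])

def f(x):
    return F(G(x))

def jacobian(func, x, eps=1e-6):
    """Central-difference Jacobian; column j corresponds to x_j."""
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        cols.append((func(x + step) - func(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
D = 2
# Rank of df/dx at 20 random points; tol absorbs finite-difference noise.
ranks = [
    np.linalg.matrix_rank(jacobian(f, rng.normal(size=6)), tol=1e-5)
    for _ in range(20)
]
max_rank = max(ranks)
print(max_rank)
```

The output dimension (4) and the number of variables (6) both exceed $D$, so the rank bound here comes entirely from the factorization through `G`.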

In practice, the converse will also usually hold: if the rank of $\frac{\partial f}{\partial x}$ is at most $D$ for all values of $x$ , then we can usually find a $D$ -dimensional summary $G (x)$ . Indeed, if the rank is constant near some point $x^{0}$ , then we can always find a local $D$ -dimensional summary near $x^{0}$ ; that’s what the constant rank theorem says. However, Weird Stuff can sometimes prevent stitching these local summaries together into a global summary. (Thank you to Vanessa for pointing me to an example of such “Weird Stuff”, as well as the name of the constant rank theorem.)

Minor notation point: each variable $x_{j}$ corresponds to a column of $\frac{\partial f}{\partial x}$ . This convention will be used throughout the post. We will also assume that each $x_{j}$ is one-dimensional; higher-dimensional variables are represented by their components.

The Additive Summary Equation

The heart of the generalized KPD theorems is a family of special cases of the Summary Equation in which $f (x)$ is a sum of terms, each of which depends on only a few variables. I’ll call this the Additive Summary Equation. The most general version looks like this:

$F (G (x)) = f (x) = \sum_{i} f_{i} (x_{N_{f} (i)})$

… where $f_{i}$ are (known) smooth functions of output dimension $m > D$, and $N_{f} (i)$ specify (known) indices of $x$. Notation example: if we have a term $f_{2} (x_{5}, x_{7})$, then $N_{f} (2) = \{5, 7\}$ and $x_{N_{f} (2)} = (x_{5}, x_{7})$.

The notation $N_{f} (i)$ here stands for a “neighborhood” induced by $f_{i}$ , specifying the indices of $x$ -variables on which $f_{i}$ depends. In the following sections, we’ll talk about the neighborhood of a variable $x_{j}$ , denoted $N (j)$ . This consists of all the variables which are neighbors of $x_{j}$ in any of the $f$ -induced neighborhoods, i.e. $N (j) = \cup_{i : j \in N_{f} (i)} N_{f} (i)$ . In other words, $N (j)$ contains the indices of variables $x_{j^{'}}$ for which some $f_{i}$ depends on both $x_{j}$ and $x_{j^{'}}$ .
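The definition of $N (j)$ can be sketched in a few lines. The helper name `variable_neighborhoods` and the terms other than $f_{2} (x_{5}, x_{7})$ are hypothetical, chosen only to make the union visible.

```python
def variable_neighborhoods(term_neighborhoods):
    """Map each variable index j to N(j): the union of every N_f(i) containing j."""
    N = {}
    for indices in term_neighborhoods.values():
        for j in indices:
            N.setdefault(j, set()).update(indices)
    return N

# N_f(2) = {5, 7} is the notation example from the text;
# terms 1 and 3 are made up for illustration.
N_f = {1: {5}, 2: {5, 7}, 3: {7, 8}}
N = variable_neighborhoods(N_f)

print(sorted(N[5]))  # [5, 7]
print(sorted(N[7]))  # [5, 7, 8]
print(sorted(N[8]))  # [7, 8]
```

Note that under this definition $N (j)$ always contains $j$ itself whenever some term depends on $x_{j}$.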

In the generalized KPD theorems, the neighborhoods $N_{f} (i)$ reflect the graphical structure of the distribution. If $P [X | Θ]$ factors according to a DAG:

$P [X | Θ] = \prod_{i} P [X_{i} | X_{\mathrm{pa} (i)}, Θ]$

… then the corresponding functional equation looks like

$F (G (x)) = \sum_{i} f_{i} (X_{i}, X_{\mathrm{pa} (i)})$

… i.e. $N_{f} (i) = \mathrm{pa} (i) \cup \{i\}$, with $f_{i}$ derived from $P [X_{i} | X_{\mathrm{pa} (i)}, Θ]$. (Here the “parents” $\mathrm{pa} (i)$ are nodes with arrows into node $i$ in the DAG.) For instance, if the $X_{i}$ are all conditionally independent (as in the original KPD), then the equation is simply

$F (G (x)) = \sum_{i} f_{i} (X_{i})$

… i.e. $N_{f} (i) = \{i\}$. Another example: if the variables form a Markov Chain with $P [X | Θ] = \prod_{i} P [X_{i} | X_{i - 1}, Θ]$, then the corresponding equation is

$F (G (x)) = \sum_{i} f_{i} (X_{i}, X_{i - 1})$

… i.e. $N_{f} (i) = \{i, i - 1\}$.

Main Theorem

Let $f (x) := \sum_{i} f_{i} (x)$ . Then the additive summary equation $F (G (x)) = f (x)$ is solvable for $F$ and $D$ -dimensional $G$ only if $f$ can be expressed as

$f (x) = \sum_{i : \frac{\partial f_{i}}{\partial x_{N (B)}} \not\equiv 0} f_{i} (x) + U \sum_{i^{'}} g_{i^{'}} (x) + C$

… for some at-most $D$-dimensional functions $\{g_{i^{'}}\}$, constant matrix $U$ of column dimension at most $D$, constant vector $C$, and a set $B$ of at most $D$ $x$-indices. The notation $N (B)$ denotes “neighbors” of $B$, meaning $x$-indices $j$ for which some $f_{i}$ depends on both $x_{j}$ and a variable in $x_{B}$. (In particular, this means that all $f_{i}$ which depend on $x_{B}$ are constant when $x_{N (B)}$ is held constant.) Furthermore, the sparsity structure of each $g_{i^{'}}$ (i.e. the set of variables $\{x_{j}\}$ on which $g_{i^{'}}$ depends) matches one of the $f_{i}$. (See the end of the “Rest of the Proof” section for the exact correspondence.)

The theorem is interesting mainly when:

The number of variables $n$ is much larger than the summary dimension $D$, i.e. $n \gg D$, and...
The number of variables $x_{j}$ on which each $f_{i}$ depends is small, so each $x_{j}$ has few neighbors $N (j)$

When these conditions hold, $\frac{\partial f_{i}}{\partial x_{N (B)}}$ will be nonzero only for a very small fraction of terms, so the impact of the vast majority of terms/variables on $f (x)$ is mediated by the at-most $D$-dimensional $\sum_{i^{'}} g_{i^{'}} (x)$; this sum serves as a summary for the $x_{\bar{N} (B)}$ (i.e. the variables which are not neighbors of the $D$ variables in $x_{B}$).

Intuitively, the simple result we’d “really like” is $f (x) = U \sum_{i^{'}} g_{i^{'}} (x) + C$, with $\sum_{i^{'}} g_{i^{'}} (x)$ at-most $D$-dimensional. This is not true in general for functions $f$ with $D$-dimensional summaries, but it is “almost true”: it holds for all but a few “exceptions”, i.e. a few extra terms/variables which can influence $f$ in more general ways. The number of exceptional variables is $| N (B) |$ - i.e. the $D$ variables $x_{B}$ plus their neighbors.

Note that the theorem claims “only if”, but not “if”. In the other direction, we can make a slightly weaker statement: any $f$ satisfying the above form has a summary-function $G (x) = (x_{N (B)}, \sum_{i^{'}} g_{i^{'}} (x))$, with dimension at most $D + | N (B) |$. The summary is just the at-most $D$-dimensional summary of $x_{\bar{N} (B)}$, i.e. $\sum_{i^{'}} g_{i^{'}} (x)$, plus the “exception” variables.

Main Trick of the Proof

Pick some point $x^{0}$ at which the rank of $\frac{\partial f}{\partial x}$ takes its maximum value (which is at most $D$ by the Summarizability Theorem). Then we can pick a set $B$ of $x$ -indices, of size at most $D$ (i.e. $| B | \leq D$ ), such that $\frac{\partial f}{\partial x_{B}} |_{x^{0}}$ is a basis for the (at-most $D$ -dimensional) column span of $\frac{\partial f}{\partial x} |_{x^{0}}$ . If the system is very sparse, then $\frac{\partial f}{\partial x_{B}}$ will only depend on a few of the $x$ -variables, namely $x_{N (B)}$ .

Since $\frac{\partial f}{\partial x_{B}} |_{x^{0}}$ spans the maximum number of dimensions, all columns of $\frac{\partial f}{\partial x} |_{x^{0}}$ must fall within that span - otherwise the rank of $\frac{\partial f}{\partial x}$ would be greater. And since $\frac{\partial f}{\partial x_{B}}$ depends only on $x_{N (B)}$, this must hold for any values of the other variables $x_{\bar{N} (B)}$. So, we can change the other variables any way we please, holding $x_{N (B)}$ constant, and $\frac{\partial f}{\partial x}$ will remain in the span of $\frac{\partial f}{\partial x_{B}} |_{x^{0}}$.

Let $U$ be any basis for the span of $\frac{\partial f}{\partial x_{B}} |_{x^{0}}$ . (Of course we could choose $U = \frac{\partial f}{\partial x_{B}} |_{x^{0}}$ itself, but often there’s some cleaner basis, depending on the application.) Then $U U^{†}$ is a projection matrix, projecting into the span. For any $x$ with $x_{N (B)} = x_{N (B)}^{0}$ , $\frac{\partial f}{\partial x}$ must fall within the span, which is equivalent to

$\frac{\partial f}{\partial x} = U U^{†} \frac{\partial f}{\partial x}$ (for $x_{N (B)} = x_{N (B)}^{0}$ )
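A quick numerical sketch of why $U U^{†}$ works as a projection even when the basis $U$ is not orthonormal, taking $U^{†}$ to be the Moore-Penrose pseudoinverse. The dimensions and the random matrix here are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, D = 5, 2
U = rng.normal(size=(m, D))    # non-orthonormal basis for a D-dim subspace of R^m
P = U @ np.linalg.pinv(U)      # U U^dagger

v_in = U @ rng.normal(size=D)  # a vector inside the span of U
v_out = rng.normal(size=m)     # a generic vector, almost surely outside the span

print(np.allclose(P @ P, P))          # idempotent
print(np.allclose(P, P.T))            # symmetric, so the projection is orthogonal
print(np.allclose(P @ v_in, v_in))    # vectors in the span are left unchanged
print(np.allclose(P @ v_out, v_out))  # generic vectors are not

# With an orthonormal basis the pseudoinverse reduces to the transpose.
Q, _ = np.linalg.qr(U)
print(np.allclose(np.linalg.pinv(Q), Q.T))
```

The condition $\frac{\partial f}{\partial x} = U U^{†} \frac{\partial f}{\partial x}$ is then exactly the third check applied to each column of the Jacobian.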

Rest of the Proof

Next, we integrate. We’ll start at $x^{0}$, then take any path from $x_{\bar{N} (B)}^{0}$ to $x_{\bar{N} (B)}$ holding $x_{N (B)}$ constant. Then, we’ll go from $x_{N (B)}^{0}$ to $x_{N (B)}$ holding $x_{\bar{N} (B)}$ constant. So:

$f (x) = f (x^{0}) + \int \frac{\partial f}{\partial x_{\bar{N} (B)}} |_{x_{N (B)}^{0}} d x_{\bar{N} (B)} + \int \frac{\partial f}{\partial x_{N (B)}} |_{x_{\bar{N} (B)}} d x_{N (B)}$

For the first integral, $x_{N (B)}$ is held constant at $x_{N (B)}^{0}$ , so by the previous section $\frac{\partial f}{\partial x} = U U^{†} \frac{\partial f}{\partial x}$ :

$= f (x^{0}) + U \int U^{†} \frac{\partial f}{\partial x_{\bar{N} (B)}} |_{x_{N (B)}^{0}} d x_{\bar{N} (B)} + \int \frac{\partial f}{\partial x_{N (B)}} |_{x_{\bar{N} (B)}} d x_{N (B)}$

… and we’ll expand $f = \sum_{i} f_{i}$ :

$= \sum_{i} f_{i} (x^{0}) + U \sum_{i} \int U^{†} \frac{\partial f_{i}}{\partial x_{\bar{N} (B)}} |_{x_{N (B)}^{0}} d x_{\bar{N} (B)} + \sum_{i} \int \frac{\partial f_{i}}{\partial x_{N (B)}} |_{x_{\bar{N} (B)}} d x_{N (B)}$

Now, we break the sum up into terms which do not depend on $x_{N (B)}$, i.e. $f_{i}$ for which $\frac{\partial f_{i}}{\partial x_{N (B)}} \equiv 0$ (for which the second integral contributes zero), and terms which do depend on $x_{N (B)}$, i.e. $f_{i}$ for which $\frac{\partial f_{i}}{\partial x_{N (B)}} \not\equiv 0$ (for which we can’t say anything nontrivial):

$= \sum_{i} f_{i} (x^{0}) + U \sum_{i : \frac{\partial f_{i}}{\partial x_{N (B)}} \equiv 0} \int U^{†} \frac{\partial f_{i}}{\partial x_{\bar{N} (B)}} |_{x_{N (B)}^{0}} d x_{\bar{N} (B)} + \sum_{i : \frac{\partial f_{i}}{\partial x_{N (B)}} \not\equiv 0} (f_{i} (x) - f_{i} (x^{0}))$

… and simplify a bit:

$= \sum_{i : \frac{\partial f_{i}}{\partial x_{N (B)}} \not\equiv 0} f_{i} (x) + \sum_{i : \frac{\partial f_{i}}{\partial x_{N (B)}} \equiv 0} f_{i} (x^{0}) + U \sum_{i : \frac{\partial f_{i}}{\partial x_{N (B)}} \equiv 0} U^{†} (f_{i} (x) - f_{i} (x^{0}))$

Since $U$ has column dimension at most $D$, that proves the theorem; we can choose $g_{i^{'}} (x) = U^{†} (f_{i^{'}} (x) - f_{i^{'}} (x^{0}))$ with $i^{'}$ ranging over $\{i : \frac{\partial f_{i}}{\partial x_{N (B)}} \equiv 0\}$, and $C = \sum_{i : \frac{\partial f_{i}}{\partial x_{N (B)}} \equiv 0} f_{i} (x^{0})$.
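As a sanity check on the final form, here is a minimal numerical sketch for the simplest neighborhood structure, where each $f_{i}$ depends only on $x_{i}$ (so $N (B) = B$). Everything specific in it is an assumption for illustration: the mixing matrix `W`, the coefficients `c`, the particular form of the terms, and the choice `B = [0, 1]`. The terms are built to lie in a 2-dimensional subspace of a 3-dimensional output space, so $m = 3 > D = 2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, D = 8, 3, 2
W = rng.normal(size=(m, D))          # hypothetical mixing matrix
c = rng.uniform(0.5, 2.0, size=n)    # hypothetical per-term coefficients

def f_i(i, xi):
    """Term i depends only on x_i and stays in the column span of W."""
    return W @ np.array([c[i] * xi, xi ** 2])

def df_i(i, xi):
    """Derivative of term i with respect to x_i (column i of df/dx)."""
    return W @ np.array([c[i], 2 * xi])

def f(x):
    return sum(f_i(i, x[i]) for i in range(n))

x0 = rng.normal(size=n)
B = [0, 1]                           # basis indices; here N(B) = B
U = np.stack([df_i(j, x0[j]) for j in B], axis=1)   # df/dx_B at x0, shape (m, D)
assert np.linalg.matrix_rank(U) == D
U_pinv = np.linalg.pinv(U)

rest = [i for i in range(n) if i not in B]   # terms with no dependence on x_{N(B)}

def decomposition(x):
    exceptions = sum(f_i(i, x[i]) for i in B)
    # g_i(x) = U^dagger (f_i(x) - f_i(x0)), summed into a D-dimensional summary
    summary = sum(U_pinv @ (f_i(i, x[i]) - f_i(i, x0[i])) for i in rest)
    C = sum(f_i(i, x0[i]) for i in rest)
    return exceptions + U @ summary + C

x = rng.normal(size=n)
print(np.allclose(f(x), decomposition(x)))
```

This only illustrates the formula on one constructed family of functions; it is not a substitute for the proof.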

Loose Threads

This theorem is strong enough for my immediate needs, but still a little weaker than I’d ideally like.

First, there’s the converse of the Summarizability Theorem. In practice, when $\frac{\partial f}{\partial x}$ is rank at-most $D$ everywhere, I generally expect there to be a $D$ -dimensional summary. But there are exceptions, and I haven’t found a simple, convenient condition which is sufficient to ensure the existence of the summary and easily applies to most of our day-to-day functions. On the other hand, I haven’t spent that much effort looking for such a condition, so maybe someone can point me to it. It’s definitely the sort of thing I’d expect somebody else to have already solved to death.

Second, there’s probably room to reduce the freedom available to $B$ and the $x_{N (B)}$-dependent terms. In particular, I believe we can impose $| B | + \dim (U) \leq D$, rather than just $| B | \leq D$ and $\dim (U) \leq D$ separately. This requires first reducing $U$, so that it only spans the dimensions actually needed to summarize $x_{\bar{N} (B)}$, rather than all the dimensions spanned by $\frac{\partial f}{\partial x_{B}} |_{x^{0}}$. Given that reduction of $U$, the basic trick is to first go through the process from the above proof, but after that choose a new basis $B^{'}$ which includes $\dim (U)$ variables from $x_{\bar{N} (B)}$, and go through the whole construction again with $B^{'}$ to get a new $U^{'}$. Terms dependent only on $x_{\bar{N} (B)}$ or only on $x_{\bar{N} (B^{'})}$ can be summarized via ${U^{'}}^{†} (f_{i} (x) - f_{i} (x^{0}))$, so the only terms which can’t be summarized this way are those dependent on variables in both $x_{N (B)}$ and $x_{N (B^{'})}$. We should be able to iterate this process until no further reduction of the “exception” terms occurs, which should happen when $| B | + \dim (U)$ is equal to the maximum rank of $\frac{\partial f}{\partial x}$.

In the special case where $f_{i}$ depends only on $x_{i}$, this process of iteratively reducing the number of exception terms is relatively straightforward, and we can indeed impose $| B | + \dim (U) \leq D$. (I'm not going to go through the proof here; consider it an exercise for the reader.) (In case anyone isn't familiar with what "exercise for the reader" means in math: don't actually do that exercise, it's a pain in the ass.)

Some Special Cases

There are two main classes of special cases: special “neighborhood” structure, and symmetry.

Structure

The simplest example of special neighborhood structure is when $f_{i}$ depends only on $x_{i}$ (corresponding to conditionally independent variables in the generalized KPD theorem). As alluded to above, we can then strengthen the theorem so that $| B | + \dim (U) \leq D$. Furthermore, “neighbors” are trivial: $N (B) = B$, so $| N (B) | + \dim (U) \leq D$. That means the summary $G (x) = (x_{N (B)}, \sum_{i^{'}} g_{i^{'}} (x)) = (x_{B}, \sum_{i^{'}} g_{i^{'}} (x))$ is at-most $D$-dimensional. Thus we have the converse of the theorem; it becomes an if-and-only-if.

Another useful structural constraint is when each $x_{j}$ has at most $k$ neighbors (including itself), i.e. $| N (j) | \leq k$ for all $j$. In that case, $| N (B) | \leq k | B | \leq k D$. If the number of variables is much larger than $k D$, i.e. $n \gg k D$, then this guarantees that the large majority of variables influence $f$ only via the at-most $D$-dimensional summary $\sum_{i^{'}} g_{i^{'}} (x)$.
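To make the bound $| N (B) | \leq k | B |$ concrete, here is a small sketch for the Markov Chain structure $N_{f} (i) = \{i, i - 1\}$, where each variable has at most $k = 3$ neighbors (itself and the two adjacent variables). The helper `neighbors_of_set`, the chain length, and the particular set `B` are hypothetical choices for illustration.

```python
def neighbors_of_set(term_neighborhoods, B):
    """N(B): indices that share at least one term with some index in B."""
    B = set(B)
    N_B = set()
    for indices in term_neighborhoods.values():
        if indices & B:
            N_B |= indices
    return N_B

n, D = 1000, 3
# Markov chain: N_f(i) = {i, i-1}; the first term depends on x_0 only.
N_f = {i: ({i, i - 1} if i > 0 else {0}) for i in range(n)}
k = 3                     # each x_j neighbors at most x_{j-1}, x_j, x_{j+1}
B = [10, 500, 900]        # hypothetical choice of |B| = D basis indices

N_B = neighbors_of_set(N_f, B)
print(sorted(N_B))        # [9, 10, 11, 499, 500, 501, 899, 900, 901]
print(len(N_B) <= k * len(B))
print(len(N_B) / n)       # fraction of variables that are "exceptions"
```

With these numbers only 9 of the 1000 variables fall in $N (B)$; the rest can influence $f$ only through the summary.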

Symmetry

By “symmetry”, I mean that $f$ is invariant under swapping some variables, e.g. swapping $x_{1}$ with $x_{2}$ . This is interesting mainly when we can swap a variable in $x_{N (B)}$ with a variable not in $x_{N (B)}$ . When that happens, both variables must be summarizable by $\sum_{i^{'}} g_{i^{'}} (x)$ . In particular, if every variable potentially in $x_{N (B)}$ can always be swapped with a variable not in $x_{N (B)}$ , then we can eliminate the exception terms altogether.

For instance, the original KPD assumed conditionally IID variables, corresponding to a summary equation with $f (x) = \sum_{i} f^{'} (x_{i})$ - i.e. each term is the same function $f^{'}$ acting on a different variable. In this case, any variable can be swapped with any other, so we can eliminate the exception terms; we must have $f (x) = U \sum_{i} g_{i} (x_{i}) + C$ for at-most D-dimensional $\sum_{i} g_{i} (x_{i})$ . In fact, this is somewhat stronger than the corresponding result in the original KPD: it applies even when the number of variables is finite, whereas the original KPD only requires that the summary have a finite dimension as the number of variables increases to infinity.
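A small worked instance of the symmetric form, using a made-up term function chosen purely for illustration: take $m = 3$, $D = 2$, and

$f^{'} (x_{i}) = (x_{i}, x_{i}^{2}, 3 x_{i} - x_{i}^{2})^{T}$

Then $f (x) = \sum_{i} f^{'} (x_{i}) = U \sum_{i} g (x_{i}) + C$ with

$U = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 3 & -1 \end{pmatrix}, \quad g (x_{i}) = (x_{i}, x_{i}^{2})^{T}, \quad C = 0$

… so the 2-dimensional summary is $G (x) = (\sum_{i} x_{i}, \sum_{i} x_{i}^{2})$, and no exception terms are needed.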

Leon Lang

“Then $U U^{†}$ is a projection matrix, projecting into the span.”

To clarify: for this, you probably need the basis $U$ to be orthonormal?

johnswentworth

The "dagger" indicates a pseudoinverse, not a transpose, which is why this works even with non-orthonormal U. But an orthonormal basis would probably be most convenient; in that case the pseudoinverse is just the transpose.

Leon Lang

Thanks!
