# Ω 7

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post contains some theorems and proofs needed for a hopefully-upcoming post on some powerful generalizations of the Koopman-Pitman-Darmois (KPD) Theorem. Unless you find functional equations interesting in their own right, and want to read some pretty dense math, you should probably skip this post. The theorems are pretty self-contained, and will be summarized in any future posts which need them.

## The Summary Equation

We can represent the idea of a -dimensional “summary” of  for a function  via a functional equation:

Given the function , we try to find some -dimensional “summary”  such that  can be computed from  - i.e. we want some  such that  for all .

In order for this to be meaningful, we need some mild assumptions on , and ; at the very least, we certainly need to exclude space-filling curves, which would defeat the point of a “-dimensional summary”. Throughout this post, we’ll assume differentiability, although this should be easy to relax somewhat by taking limits of differentiable functions.

Easy theorem: The -dimensional summary equation for  is solvable only if the rank of the matrix  is at most  for all values of . I’ll call this the “Summarizability Theorem”. (If you want a more official-sounding name, it’s the global converse of the constant-rank theorem.)

Proof: differentiate both sides of the equation to get . Since  is -dimensional, this is itself a rank-at-most- decomposition of .

In practice, the converse will also usually hold: if the rank of  is at most  for all values of , then we can usually find a -dimensional summary . Indeed, if the rank is constant near some point , then we can always find a local -dimensional summary near ; that’s what the constant rank theorem says. However, Weird Stuff can sometimes prevent stitching these local summaries together into a global summary. (Thank you to Vanessa for pointing me to an example of such “Weird Stuff”, as well as the name of the constant rank theorem.)

Minor notation point: each variable  corresponds to a column of . This convention will be used throughout the post. We will also assume that each  is one-dimensional; higher-dimensional variables are represented by their components.

The heart of the generalized KPD theorems is a family of special cases of the Summary Equation in which  is a sum of terms, each of which depend on only a few variables. I’ll call this the Additive Summary Equation. The most general version looks like this:

… where  are (known) smooth functions of output dimension , and  specify (known) indices of . Notation example: if we have a term , then  and .

The notation  here stands for a “neighborhood” induced by , specifying the indices of -variables on which  depends. In the following sections, we’ll talk about the neighborhood of a variable , denoted . This consists of all the variables which are neighbors of  in any of the -induced neighborhoods, i.e. . In other words,  contains the indices of variables  for which some  depends on both  and .

In the generalized KPD theorems, the neighborhoods  reflect the graphical structure of the distribution. If  factors according to a DAG:

… then the corresponding functional equation looks like

… i.e. , with  derived from . (Here the “parents”  are nodes with arrows into node  in the DAG.) For instance, if the  are all conditionally independent (as in the original KPD), then the equation is simply

… i.e. . Another example: if the variables form a Markov Chain with , then the corresponding equation is

… i.e. .

### Main Theorem

Let . Then the additive summary equation  is solvable for  and -dimensional  only if  can be expressed as

… for some at-most -dimensional functions , constant matrix  of column-dimensional at-most , constant vector , and a set of at-most  -indices . The notation  denotes “neighbors” of , meaning -indices  for which some  depends on both  and a variable in . (In particular, this means that all  which depend on  are constant when  is held constant.) Furthermore, the sparsity structure of each  (i.e. the set of variables  on which  depends) matches one of the . (See the end of the “Rest of the Proof” section for the exact correspondence.)

The theorem is interesting mainly when:

• The number of variables  is much larger than the summary dimension , i.e. , and...
• The number of variables  on which each  depends is small, so each  has few neighbors

When these conditions hold,  will be nonzero only for a very small fraction of terms, so the impact of the vast majority of terms/variables on  is mediated by the at-most -dimensional ; this sum serves as a summary for the  (i.e. the variables which are not neighbors of the  variables in ).

Intuitively, the simple result we’d “really like” is , with  at-most -dimensional. This is not true in general for functions  with  dimensional summaries, but it is “almost true”: it holds for all but a few “exceptions”, i.e. a few extra terms/variables which can influence  in more general ways. The number of exceptional variables is  - i.e. the  variables  plus their neighbors.

Note that the theorem claims “only if”, but not “if”. In the other direction, we can make a slightly weaker statement: any  satisfying the above form has a summary-function , with dimension at-most . The summary is just the at-most D-dimensional summary of , i.e. , plus the "exception" variables.

### Main Trick of the Proof

Pick some point  at which the rank of  takes its maximum value (which is at most  by the Summarizability Theorem). Then we can pick a set  of -indices, of size at most  (i.e. ), such that  is a basis for the (at-most -dimensional) column span of . If the system is very sparse, then  will only depend on a few of the -variables, namely .

Since  spans the maximum number of dimensions, all columns of  must fall within that span - otherwise the rank of  would be greater. And since  depends only on , this must hold for any values of the other variables . So, we can change the other variables any way we please, holding  constant, and  will remain in the span of .

Let  be any basis for the span of . (Of course we could choose  itself, but often there’s some cleaner basis, depending on the application.) Then  is a projection matrix, projecting into the span. For any  with ,   must fall within the span, which is equivalent to

(for )

### Rest of the Proof

Next, we integrate. We’ll start at , then take any path from  to  holding  constant. Then, we’ll go from  to  holding  constant. So:

For the first integral,  is held constant at , so by the previous section :

… and we’ll expand :

Now, we break the sum up into terms which do not depend on , i.e.  for which  (for which the second integral contributes zero), and terms which do depend on , i.e.  for which  (for which we can’t say anything nontrivial):

… and simplify a bit:

Since  is -dimensional (on the right), that proves the theorem; we can choose  with  ranging over , and .

This theorem is strong enough for my immediate needs, but still a little weaker than I’d ideally like.

First, there’s the converse of the Summarizability Theorem. In practice, when  is rank at-most  everywhere, I generally expect there to be a -dimensional summary. But there are exceptions, and I haven’t found a simple, convenient condition which is sufficient to ensure the existence of the summary and easily applies to most of our day-to-day functions. On the other hand, I haven’t spent that much effort looking for such a condition, so maybe someone can point me to it. It’s definitely the sort of thing I’d expect somebody else to have already solved to death.

Second, there’s probably room to reduce the freedom available to  and the -dependent terms. In particular, I believe we can impose , rather than just  and  separately. This requires first reducing , so that it only spans the dimensions actually needed to summarize , rather than all the dimensions spanned by . Given that reduction of , the basic trick is to first go through the process from the above proof, but after that choose a new basis  which includes  variables from , and go through the whole construction again with  to get a new . Terms dependent only on   or only on  can be summarized via , so the only variables which can’t be summarized this way are those dependent on variables in both  and . We should be able to iterate this process until no further reduction of the “exception” terms occurs, which should happen when  is equal to the maximum rank of .

In the special case where  depends only on , this process of iteratively reducing the number of exception terms is relatively sraightforward, and we can indeed impose . (I'm not going to go through the proof here; consider it an exercise for the reader.) (In case anyone isn't familiar with what "exercise for the reader" means in math: don't actually do that exercise, it's a pain in the ass.)

## Some Special Cases

There are two main classes of special cases: special “neighborhood” structure, and symmetry.

### Structure

The simplest example of special neighborhood structure is when  depends only on  (corresponding to conditionally independent variables in the generalized KPD theorem). As alluded to above, we can then strengthen the theorem so that . Furthermore, “neighbors” are trivial: , so . That means the summary  is at-most D-dimensional. Thus we have the converse of the theorem; it becomes an if-and-only-if.

Another useful structural constraint is when each  has at most  neighbors (including itself), i.e.  for all . In that case, . If the number of variables is much larger than , i.e. , then this guarantees that the large majority of variables influence  only via the at-most -dimensional summary .

### Symmetry

By “symmetry”, I mean that  is invariant under swapping some variables, e.g. swapping  with . This is interesting mainly when we can swap a variable in  with a variable not in . When that happens, both variables must be summarizable by . In particular, if every variable potentially in  can always be swapped with a variable not in , then we can eliminate the exception terms altogether.

For instance, the original KPD assumed conditionally IID variables, corresponding to a summary equation with  - i.e. each term is the same function  acting on a different variable. In this case, any variable can be swapped with any other, so we can eliminate the exception terms; we must have  for at-most D-dimensional . In fact, this is somewhat stronger than the corresponding result in the original KPD: it applies even when the number of variables is finite, whereas the original KPD only requires that the summary have a finite dimension as the number of variables increases to infinity.

Mentioned in
New Comment