Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Mentioned in

The Additive Summary Equation

3Leon Lang

3johnswentworth

1Leon Lang

New Comment

3 comments, sorted by Click to highlight new comments since: Today at 8:49 AM

The "dagger" indicates a pseudoinverse, not a transpose, which is why this works even with non-orthonormal U. But an orthonormal basis would probably be most convenient; in that case the pseudoinverse is just the transpose.

This post contains some theorems and proofs needed for a hopefully-upcoming post on some powerful generalizations of the

Koopman-Pitman-Darmois (KPD) Theorem. Unless you find functional equations interesting in their own right, and want to read some pretty dense math, you should probably skip this post. The theorems are pretty self-contained, and will be summarized in any future posts which need them.## The Summary Equation

We can represent the idea of a D-dimensional “summary” of x for a function f via a functional equation:

F(G(x))=f(x)

Given the function f, we try to find some D-dimensional “summary” G(x) such that f can be computed from G - i.e. we want some F,G such that F(G(x))=f(x) for all x.

In order for this to be meaningful, we need some mild assumptions on f, F, and G; at the very least, we certainly need to exclude

space-filling curves, which would defeat the point of a “D-dimensional summary”. Throughout this post, we’ll assume differentiability, although this should be easy to relax somewhat by taking limits of differentiable functions.Easy theorem:

TheD-dimensional summary equation forfis solvable only if the rank of the matrix∂f∂xis at mostDfor all values ofx. I’ll call this the “Summarizability Theorem”. (If you want a more official-sounding name, it’s the global converse of theconstant-rank theorem.)Proof: differentiate both sides of the equation to get ∂F∂G∂G∂x=∂f∂x. Since G is D-dimensional, this is itself a rank-at-most-D decomposition of ∂f∂x.

In practice, the converse will also usually hold: if the rank of ∂f∂x is at most D for all values of x, then we can usually find a D-dimensional summary G(x). Indeed, if the rank is constant near some point x0, then we can

alwaysfind alocalD-dimensional summary near x0; that’s what the constant rank theorem says. However, Weird Stuff can sometimes prevent stitching these local summaries together into a global summary. (Thank you toVanessafor pointing me toan exampleof such “Weird Stuff”, as well as the name of the constant rank theorem.)Minor notation point: each variable xj corresponds to a

columnof ∂f∂x. This convention will be used throughout the post. We will also assume that each xj is one-dimensional; higher-dimensional variables are represented by their components.## The Additive Summary Equation

The heart of the generalized KPD theorems is a family of special cases of the Summary Equation in which f(x) is a sum of terms, each of which depend on only a few variables. I’ll call this the Additive Summary Equation. The most general version looks like this:

F(G(x))=f(x)=∑ifi(xNf(i))

… where fi are (known) smooth functions of output dimension m>D, and Nf(i) specify (known) indices of x. Notation example: if we have a term f2(x5,x7), then Nf(2)={5,7} and xNf(2)=(x5,x7).

The notation Nf(i) here stands for a “neighborhood” induced by fi, specifying the indices of x-variables on which fi depends. In the following sections, we’ll talk about the neighborhood of a

variablexj, denoted N(j). This consists of all the variables which are neighbors of xj in any of the f-induced neighborhoods, i.e. N(j)=∪i:j∈Nf(i)Nf(i). In other words, N(j) contains the indices of variables xj′ for which some fi depends on both xj and xj′.In the generalized KPD theorems, the neighborhoods Nf(i) reflect the graphical structure of the distribution. If P[X|Θ] factors according to a DAG:

P[X|Θ]=∏iP[Xi|Xpa(xi),Θ]

… then the corresponding functional equation looks like

F(G(x))=∑ifi(Xi,Xpa(i))

… i.e. Nf(i)=pa(i)∪{i}, with fi derived from P[Xi|Xpa(xi),Θ]. (Here the “parents” pa(i) are nodes with arrows into node i in the DAG.) For instance, if the Xi are all conditionally independent (as in the original KPD), then the equation is simply

F(G(x))=∑ifi(Xi)

… i.e. Nf(i)={i}. Another example: if the variables form a Markov Chain with P[X|Θ]=∏iP[Xi|Xi−1,Θ], then the corresponding equation is

F(G(x))=∑ifi(Xi,Xi−1)

… i.e. Nf(i)={i,i−1}.

## Main Theorem

Let f(x):=∑ifi(x). Then the additive summary equation F(G(x))=f(x) is solvable for F and D-dimensional G only if f can be expressed as

f(x)=∑i:∂fi∂xN(B)≢0fi(x)+U∑i′gi′(x)+C

… for some at-most D-dimensional functions {gi′}, constant matrix U of column-dimensional at-most D, constant vector C, and a set of at-most D x-indices B. The notation N(B) denotes “neighbors” of B, meaning x-indices j for which some fi depends on both xj and a variable in xB. (In particular, this means that all fi which depend on xB are constant when xN(B) is held constant.) Furthermore, the sparsity structure of each gi′ (i.e. the set of variables {xj} on which gi′ depends) matches one of the fi. (See the end of the “Rest of the Proof” section for the exact correspondence.)

The theorem is interesting mainly when:

When these conditions hold, ∂fi∂xN(B) will be nonzero only for a very small fraction of terms, so the impact of the vast majority of terms/variables on f(x) is mediated by the at-most D-dimensional ∑i′gi′(x); this sum serves as a summary for the x¯N(B) (i.e. the variables which are

notneighbors of the D variables in xB).Intuitively, the simple result we’d “really like” is f(x)=U∑i′gi′(x)+C, with ∑i′gi′(x) at-most D-dimensional. This is not true in general for functions f with D dimensional summaries, but it is “almost true”: it holds for all but a few “exceptions”, i.e. a few extra terms/variables which can influence f in more general ways. The number of exceptional variables is |N(B)| - i.e. the D variables xB plus their neighbors.

Note that the theorem claims “only if”, but not “if”. In the other direction, we can make a slightly weaker statement: any f satisfying the above form has a summary-function G(x)=(xN(B),∑i′gi′(x)), with dimension at-most D+|N(B)|. The summary is just the at-most D-dimensional summary of x¯N(B), i.e. ∑i′gi′(x), plus the "exception" variables.

## Main Trick of the Proof

Pick some point x0 at which the rank of ∂f∂x takes its maximum value (which is at most D by the Summarizability Theorem). Then we can pick a set B of x-indices, of size at most D (i.e. |B|≤D), such that ∂f∂xB|x0 is a basis for the (at-most D-dimensional) column span of ∂f∂x|x0. If the system is very sparse, then ∂f∂xB will only depend on a few of the x-variables, namely xN(B).

Since ∂f∂xB|x0 spans the maximum number of dimensions,

allcolumns of ∂f∂x|x0 must fall within that span - otherwise the rank of ∂f∂x would be greater. And since ∂f∂xB depends only on xN(B), this must hold foranyvalues of the other variables x¯N(B). So, we can change the other variables any way we please, holding xN(B) constant, and ∂f∂x will remain in the span of ∂f∂xB|x0.Let U be any basis for the span of ∂f∂xB|x0. (Of course we could choose U=∂f∂xB|x0 itself, but often there’s some cleaner basis, depending on the application.) Then UU† is a projection matrix, projecting into the span. For any x with xN(B)=x0N(B), ∂f∂x must fall within the span, which is equivalent to

∂f∂x=UU†∂f∂x (for xN(B)=x0N(B))

## Rest of the Proof

Next, we integrate. We’ll start at x0, then take any path from x0¯N(B) to x¯N(B) holding xN(B) constant. Then, we’ll go from x0N(B) to xN(B) holding x¯N(B) constant. So:

f(x)=f(x0)+∫∂f∂x¯N(B)|x0N(B)dx¯N(B)+∫∂f∂xN(B)|x¯N(B)dxN(B)

For the first integral, xN(B) is held constant at x0N(B), so by the previous section ∂f∂x=UU†∂f∂x:

=f(x0)+U∫U†∂f∂x¯N(B)|x0N(B)dx¯N(B)+∫∂f∂xN(B)|x¯N(B)dxN(B)

… and we’ll expand f=∑ifi:

=∑ifi(x0)+U∑i∫U†∂fi∂x¯N(B)|x0N(B)dx¯N(B)+∑i∫∂f∂xN(B)|x¯N(B)dxN(B)

Now, we break the sum up into terms which do not depend on xN(B), i.e. fi for which ∂fi∂xN(B)≡0 (for which the second integral contributes zero), and terms which do depend on xN(B), i.e. fi for which ∂fi∂xN(B)≢0 (for which we can’t say anything nontrivial):

=∑ifi(x0)+U∑i:∂fi∂xN(B)≡0∫U†∂fi∂x¯N(B)|x0N(B)dx¯N(B)+∑i:∂fi∂xN(B)≢0(fi(x)−fi(x0))

… and simplify a bit:

=∑i:∂fi∂xN(B)≢0fi(x)+∑i:∂fi∂xN(B)≡0fi(x0)+U∑i:∂fi∂xN(B)≡0U†(fi(x)−fi(x0))

Since U is D-dimensional (on the right), that proves the theorem; we can choose gi′(x)=U†(fi′(x)−fi′(x0)) with i′ ranging over {i:∂fi∂xN(B)≡0}, and C=∑i:∂fi∂xN(B)≡0fi(x0).

## Loose Threads

This theorem is strong enough for my immediate needs, but still a little weaker than I’d ideally like.

First, there’s the converse of the Summarizability Theorem. In practice, when ∂f∂x is rank at-most D everywhere, I generally expect there to be a D-dimensional summary. But there are exceptions, and I haven’t found a simple, convenient condition which is sufficient to ensure the existence of the summary and easily applies to most of our day-to-day functions. On the other hand, I haven’t spent that much effort looking for such a condition, so maybe someone can point me to it. It’s definitely the sort of thing I’d expect somebody else to have already solved to death.

Second, there’s probably room to reduce the freedom available to B and the xN(B)-dependent terms. In particular, I believe we can impose |B|+dim(U)≤D, rather than just |B|≤D and dim(U)≤D separately. This requires first reducing U, so that it only spans the dimensions actually needed to summarize x¯N(B), rather than all the dimensions spanned by ∂f∂xB|x0. Given that reduction of U, the basic trick is to first go through the process from the above proof, but after that choose a new basis B′ which includes dim(U) variables from x¯N(B), and go through the whole construction again with B′ to get a new U′. Terms dependent only on x¯N(B) or only on x¯N(B′) can be summarized via U′†(fi(x)−fi(x0)), so the only variables which

can’tbe summarized this way are those dependent on variables inbothxN(B) and xN(B′). We should be able to iterate this process until no further reduction of the “exception” terms occurs, which should happen when |B|+dim(U) is equal to the maximum rank of ∂f∂x.In the special case where fi depends only on xi, this process of iteratively reducing the number of exception terms is relatively sraightforward, and we can indeed impose |B|+dim(U)≤D. (I'm not going to go through the proof here; consider it an exercise for the reader.) (In case anyone isn't familiar with what "exercise for the reader" means in math: don't actually do that exercise, it's a pain in the ass.)

## Some Special Cases

There are two main classes of special cases: special “neighborhood” structure, and symmetry.

## Structure

The simplest example of special neighborhood structure is when fi depends only on xi (corresponding to conditionally independent variables in the generalized KPD theorem). As alluded to above, we can then strengthen the theorem so that |B|+dim(U)≤D. Furthermore, “neighbors” are trivial: N(B)=B, so |N(B)|+dim(U)≤D. That means the summary G(x)=(xN(B),∑i′gi′(x))=(xB,∑i′gi′(x)) is at-most D-dimensional. Thus we have the converse of the theorem; it becomes an if-and-only-if.

Another useful structural constraint is when each xj has at most k neighbors (including itself), i.e. N(j)≤k for all j. In that case, |N(B)|≤k|B|≤kD. If the number of variables is much larger than kD, i.e. n>>kD, then this guarantees that the large majority of variables influence f only via the at-most D-dimensional summary ∑i′gi′(x).

## Symmetry

By “symmetry”, I mean that f is invariant under swapping some variables, e.g. swapping x1 with x2. This is interesting mainly when we can swap a variable in xN(B) with a variable

notin xN(B). When that happens,bothvariables must be summarizable by ∑i′gi′(x). In particular, ifeveryvariable potentially in xN(B) can always be swapped with a variable not in xN(B), then we can eliminate the exception terms altogether.For instance, the original KPD assumed conditionally IID variables, corresponding to a summary equation with f(x)=∑if′(xi) - i.e. each term is the same function f′ acting on a different variable. In this case, any variable can be swapped with any other, so we can eliminate the exception terms; we must have f(x)=U∑igi(xi)+C for at-most D-dimensional ∑igi(xi). In fact, this is somewhat stronger than the corresponding result in the original KPD: it applies even when the number of variables is finite, whereas the original KPD only requires that the summary have a finite dimension as the number of variables increases to infinity.