You've left out the condition for the theorem to be true that the region with non-zero probabilitiy density can't vary with the parameter.
That's how, for example, the uniform distribution over (0,theta) can have a scalar sufficient statistic (the maximum data point) desipite it not being an exponential family model.
I'm a bit curious about what job "dimension" is doing here. Given that I can map an arbitrary vector in to some point in via a bijective measurable map (https://en.wikipedia.org/wiki/Standard_Borel_space#Kuratowski's_theorem), it would seem that the KPD theorem is false. Is there some other notion of "sufficient statistic complexity" hiding behind the idea of dimensionality, or am I missing something?
There's a smoothness assumption. I assumed differentiability, although that could be weakened somewhat. (The assumption is hidden in the sister post to this one, The Additive Summary Equation.)
Correct.
The term along with for all is a sufficient statistic, of dimension at most 2D. So if we assume that the 's are 1-dimensional, then there's a "gap" in the theorem between D and 2D: there exists a summary statistic of dimension D only if the distribution has that form, and if the distribution has that form then and for all together form a summary statistic of dimension at-most 2D.
I already understood that because you explain it in the text; the further doubt I have is: concerning only the "only if" part, given a -dimensional sufficient statistic exists by assumption, if is also a -dimensional sufficient statistic or not.
I think not, because it should not be able to capture what goes on with the variables, that's hidden in the completely arbitrary term.
This annoys me because I can't see the form of the sufficient statistic like in the i.i.d. case.
Summary
An abstraction of a high-dimensional random variable X is a low-dimensional summary G(X) that can be used to make predictions about X. In the case that X is sampled from some parameterized distribution P(X | theta), G(X) may take the form of a sufficient statistic, i.e., a function of X such that P(theta | X) = P(theta | G(X)). To make predictions about X, one may then determine theta from P(theta | G(X)), and predict a new data point X from theta.
In this post, John shows that if you have a very low-dimensional sufficient statistic G(X), then in many cases, X will “almost follow” the exponential family form and thus be fairly easy to deal with.
Further Thoughts
I don’t yet quite understand what John wants to use the theorem for, but I will hopefully learn this in his project update post. My current guess is that he would like to identify abstractions as sufficient statistics of exponential families and that this might be a data structure that is “easier to identify” in the world and in trained machine learning models than our initially “broad guess” for what abstractions could be.
Note that I’m only writing this down to have a prediction to update once I’m reading John’s update post. This seems strictly more useful than not having a prediction at all, even though I don’t place a high chance on my prediction actually being fully correct/nuanced enough.
Another thought is that I feel slightly uneasy about the viewpoint that abstractions are the thing we use to make “predictions about X”. In reality, if a person is in a particular state (meaning that the person is represented by an extremely high-dimensional sample vector X), then to make predictions, I definitely only use a very low-dimensional summary based on my sense of the person’s body part positions and coarse brain state. However, these predictions are not about X: I don’t use the summary to sample vectors that are consistent with the summary; instead, I use the summary to make predictions about summaries themselves. I.e., what will be the person’s mental state and body positions a moment from now, and how will this impact abstractions of other objects in the vicinity of that person? There should then be a “commutative diagram” relating objects in reality to their low-dimensional abstractions, and real-world state transitions to predictions.
I hope to eventually learn more about how this abstraction work feeds into such questions.
Is there a generalization of "sufficient statistic" that applies to summaries which grow as the log (or polynomial of the log) in the size of the data?
Despite the fact that your summary keeps growing, this seems like it might also be a useful definition. In the limit of infinite data, a log(n) summary is vanishingly small relative to the size of the data you have.
Good question.
The theorems here should apply directly to that situation. The summary will eventually be lower-dimensional than the data, and that's all we need for these theorems to apply. At any data size n sufficiently large that the O(log n) summary dimension is smaller than the data dimension, the distribution of those n data points must be exponential family.
I see! To reiterate: for a fixed n, having a smaller summary dimension means that the distribution of those n points is part of the exponential family.
It seems like there are three cases here:
The size of the summary remains constant as number n of samples grows. The distribution of these samples remains constant. Once enough data is collected, you can estimate the summary statistics and infer the overall distribution.
The size of the summary grows as O(log(n)). The distribution of these samples changes for each n as the size of the summary grows. You converge on the correct distribution in the limit of infinite n. (This seems weird/incorrect, I might be missing something here. I am trying to think of a concrete example)
degenerate case: the size of the required "summary" grows faster than O(n) so you are better off just using the data itself as the summary.
For the second case, each data point can measure something different, possibly correlated with each other, and related in different ways to the parameters we're trying to estimate. For instance, maybe we're trying to estimate some parameters of a car, so we measure the wheel sizes, axle length, number of gears, engine cylinder volume, etc, etc. Every now and then we measure something which gives us a totally different "kind" of information from the other things we measured - something which forces a non-exponential-family update. When that happens, we have to add a new summary component. Eventually other data points may measure the same "kind" of information and also contribute to that component of the summary. But over time, it becomes more and more rare to measure something which no other data point has captured before, so we add summary dimensions more and more slowly.
(Side note: there's no reason I know why O(log n) growth would be special here; the qualitative story would be similar for any sub-linear summary growth.)
Suppose we want to estimate the mean and variance of a normal distribution from some samples X1…Xn from the distribution. It turns out that we don’t need to keep around all of the data: only the sample mean ^μ=1n∑iXi and sample variance ^Σ=1n∑i(Xi−^μ)2 are needed for the estimation problem. From those, we can calculate the entire posterior distribution of μ,Σ. The pair (^μ,^Σ) is a “sufficient statistic” for X1…Xn: a statistic which summarizes all the estimation-relevant information from the data.
In principle, every estimation problem has a sufficient statistic: just let the summary be all the data. But this isn’t very useful. We really want to compress the data, pull out the actually-useful parts and throw away the rest. In the normal distribution example, our n samples of data are each 1-dimensional for a total of n dimensions of data, but the summary is only 2-dimensional, no matter how large n grows.
When will an estimation problem have a fixed-dimensional sufficient statistic as the number of data points increases, like the normal-distribution estimation problem? The Koopman-Pitman-Darmois (KPD) Theorem says that, when estimating distribution parameters from IID samples, there is a fixed-dimensional sufficient statistic if-and-only-if the distribution can be expressed in exponential family form:
P[X|Θ]=1Z(Θ)ef(Θ)T∑ig(Xi)∏ih(Xi)
… for some vector-valued functions f,g and positive-real-valued function h. The sufficient statistic is then ∑ni=1g(Xi). For instance, for a normal distribution, the parameters Θ are (μ,Σ), and we can use g(x)=(x,x2) and f(Θ)=f(μ,Σ)=(−Σ−1μ,−12Σ−1).
Intuitively, this says the “coupling” between X and Θ is only via the fixed-dimensional “summary” vectors f(Θ) and ∑ig(Xi).
That’s the original KPD Theorem. We’ll see in this post that the theorem generalizes a lot. It’s not just estimation of distribution parameters from IID samples - KPD can also generalize to estimation of parameters for Markov chains, or causal models with symmetry - aka stochastic programs. The theorem also extends to causal models without any symmetry, albeit slightly weaker: a small number of variables can violate the exponential form, but (roughly speaking) the number of variables which can violate the exponential form is proportional to the summary dimension. So if we have a low-dimensional summary of lots of variables, the vast majority of variables must still satisfy the exponential form. The condition which does most of the work is sparsity: we’ll assume the conditional distribution factors according to a causal model with a sparse graph. Or, in more physical terms: direct interactions are local; long-range interactions are always mediated by short-range interactions. Conditionally independent variables are just the most extreme version of this - their conditional distribution factors according to a graph with no edges at all (i.e. there are no interactions at all).
In particular, I now expect most abstractions to mostly satisfy the exponential family form, though that’s a topic for future posts.
The Theorems
We’ll cover three generalized KPD theorems, each with different assumptions on the structure of P[X|Θ]:
The first is intended as a relatively-simple case to illustrate the core ideas, before we introduce more complexity. The second is the most general theorem. The third is intended to illustrate how we can leverage symmetry to rule out the annoying “exception” variables.
This is definitely not a comprehensive set of KPD generalizations; these theorems and proofs are mainly intended to illustrate the general tools and techniques by which KPD generalizations can be derived.
Independent But Non-Identical KPD
The simplest starting point for these theorems is parameter estimation from independent-but-not-identically-distributed variables. (If you want a concrete image in mind, picture scientists trying to estimate the values of some physical constants using a bunch of different experiments.)
Theorem
Let (X1,…,Xn) be continuous random variables, conditionally independent given Θ (i.e. P[X|Θ]=∏iP[Xi|Θ]). There exists a D-dimensional sufficient statistic G(X) summarizing all information in X relevant to Θ only if the distribution has the form
P[X|Θ]=1Z(Θ)ef(Θ)⋅∑i∉Egi(Xi)∏i∉EP[Xi|Θ0]∏i∈EP[Xi|Θ]
… where:
Intuitively, this is like the original KPD, but some variables can be “exceptions” and not follow the exponential form - their coupling with Θ is not mediated by the summary vectors f(Θ) and ∑i∉Egi(xi). However, there can only be at-most D exceptions. When the number of variables is much greater than the summary dimension (i.e. n>> D), this means that the vast majority of variables must satisfy the exponential form. The summary vectors mediate the interactions with Θ of all but at-most D variables.
Note that this is an “only if” theorem, not an “if-and-only-if” theorem. It does establish that a fixed-dimension sufficient statistic exists if-and-only-if the distribution takes the above form, since any distribution with the above form can use G(X)=(XE,∑i∉Egi(Xi)) as a sufficient statistic. However, that sufficient statistic has dimension at-most 2D, whereas the “only if” theorem above assumes the existence of a statistic of dimension D. So there’s a “gap” in the if-and-only-if characterization: any conditionally-independent distribution with a D-dimensional sufficient statistic satisfies the form, but given a distribution which satisfies the form, we may only be able to find a 2D-dimensional sufficient statistic rather than a D-dimensional sufficient statistic.
Of course, for applications with n>> D, the difference between D and 2D won't matter much anyway.
Proof
The first key idea is the Minimal Map Theorem (also proved more elegantly here): if we want to summarize all the information in X which is relevant to Θ, then the posterior distribution (θ↦P[Θ=θ|X]) is a sufficient statistic. It’s also “minimal” in the sense that the posterior distribution can be calculated from any other sufficient statistic, therefore any other sufficient statistic must contain at least as much information (this can be proven via the Data Processing Inequality).
So, given our hypothetical fixed-dimensional summary statistic G(X) (where X is a vector of all the data points), we can write
F(G(X))=(θ↦P[Θ=θ|X])
… for some function F. The right-hand side is a data structure containing the values of P[Θ=θ|X] for EVERY θ-value; we can think of it as a vector indexed by θ. (In the case where θ has m possible values θ1,…,θm, any map (θ↦f(θ)) with real-valued f can be represented as a length-m vector v in which vi is f(θi).)
Now, the key property underlying (this version of) the KPD is conditional independence of X, i.e. P[X|θ]=∏iP[Xi|θ]. We want to leverage that factorization, so we’ll transform the distribution into a representation we can factor. First, transform the distribution to a log-odds representation:
logOdds(F(G(X)))=(θ↦lnP[Θ=θ|X]P[Θ=θ0|X])
… for some arbitrary reference parameters θ0. Next, we apply Bayes’ Rule:
=(θ↦lnP[X|Θ=θ]P[X|Θ=θ0]+lnP[Θ=θ]P[Θ=θ0])
… then subtract off the prior lnP[Θ=θ]P[Θ=θ0] from both sides, since it’s not a function of X:
logOdds(F(G(X)))−(θ↦lnP[Θ=θ]P[Θ=θ0])=(θ↦lnP[X|Θ=θ]P[X|Θ=θ0])
To clean up the notation, we’ll define F′(G):=logOdds(F(G))−(θ↦lnP[Θ=θ]P[Θ=θ0]). Finally, we can apply the factorization (remember, that was the point of all these transformations) and get
F′(G(X))=∑i(θ↦lnP[Xi|Θ=θ]P[Xi|Θ=θ0])
The right-hand side of this expression is a sum of vectors indexed by θ. Each vector is a function of just one of the Xi, and the left-hand side says that we can express the sum in terms of a D-dimensional summary G(X). So, this is an Additive Summary Equation (see appendix below for details). In fact, it’s a particularly simple Additive Summary Equation: each fi depends only on xi, and each xi has no neighbors besides itself.
The Additive Summary Equation Theorem then tells us that the equation is solvable (i.e. a D-dimensional summary statistic exists) only if
lnP[X|Θ=θ]P[X|Θ=θ0]=∑i∈B(lnP[Xi|Θ=θ]P[Xi|Θ=θ0])+U(θ)∑i′∉Bgi′(xi′)+C(θ)
… for some B consisting of at-most D X-indices, and some U(θ) of dimension at-most D.
Exponentiating both sides and simplifying a bit, this condition becomes
P[X|Θ=θ]=1Z(θ)eU(θ)∑i′∉Bgi′(xi′)∏i∈BP[Xi|Θ=θ]∏i′∉BP[Xi′|Θ=θ0]
… which is the desired form.
Key Takeaways
The main takeaway from the theorem is the idea of “exception” variables, which circumvent the exponential family form, and the idea that the number of exception variables is limited by the dimensionality of the sufficient statistic.
The main takeaways from the proof are:
We’ll reuse this general structure in the next proof.
Causal Model/Bayes Net KPD
Now we address a much more general case: factorization according to a Bayes Net/Causal Model. We still imagine estimating some parameters Θ, but our data is no longer conditionally independent. Conditional on the parameters, interactions between data points are no longer non-existent. However, the direct interactions are sparse - i.e. most data points interact directly with only a few other data points (though they may indirectly interact with everything). This would be the case if, for example, we are trying to estimate parameters of some complicated physical system in real-time from a single instance of the system.
Theorem
Let (X1,…,Xn) be continuous random variables, whose distribution conditional on Θ forms a Bayes Net on a DAG (i.e. P[X|Θ]=∏iP[Xi|Xpa(i),Θ], where pa(i) denotes the nodes with edges directly into i in the DAG). There exists a D-dimensional sufficient statistic G(X) summarizing all information in X relevant to Θ only if the distribution has the form
P[X|Θ]=1Z(Θ)ef(Θ)∑i∉Egi(xi,xpa(i))∏i∉EP[Xi|Xpa(i),Θ=θ0]∏i∈EP[Xi|Xpa(i),Θ=θ]
… where:
This mostly generalizes the previous theorem in the obvious way, with one major exception: we can now have a lot more exception variables. Roughly speaking, exceptions happen in “chunks”: neighborhoods each consisting of a variable, its “neighbors” in the Markov blanket, and its neighbors' children. So, in order for the theorem to say anything nontrivial, the DAG has to be sparse: each variable has to have only a few neighbors. If a single variable directly touches most of the DAG, then an exception-neighborhood around that variable could include most of the variables in the model.
As long as the DAG is sparse and n>> D, the vast majority of variables must still satisfy the exponential form. If each variable has at-most k variables in its Markov blanket, then the number of exception variables is less than k2D. For instance, in a minimal computationally-complete circuit model (e.g. NAND gates), k=6 (two inputs per node, two children per node to allow copying, and one other parent for each child), so the number of exceptions would be less than 36D. If the number of variables exceeds the summary dimension by several orders of magnitude, this would be a drop in the bucket - the vast majority of variables would still need to satisfy the exponential form.
Proof
This follows broadly the same structure as the proof for the Independent But Non-Identical KPD, so we’ll speed through the parts which are the same and just focus on differences.
As in that proof, we apply the Minimal Map Theorem, then apply a log odds transformation, and factor the distribution. The factorization is now P[X|Θ]=∏iP[Xi|Xpa(i),Θ], so the summary equation is
F′(G(X))=(θ↦lnP[X|Θ=θ]P[X|Θ=θ0])=∑i(θ↦lnP[Xi|Xpa(i),Θ=θ]P[Xi|Xpa(i),Θ=θ0])
The neighbors N(j) of a variable Xj are no longer trivial; they are exactly the Markov blanket of Xj (i.e. parents, children, and children of parents of Xj in the Bayes net). The terms which depend on XN(j) are the factors corresponding to neighbors plus children-of-neighbors, i.e. N(j)∪ch(N(j)). The Additive Summary Equation Theorem then says that a D-dimensional summary exists only if
lnP[X|Θ=θ]P[X|Θ=θ0]=∑i∈E(lnP[Xi|Xpa(i),Θ=θ]P[Xi|Xpa(i),Θ=θ0])+U(θ)∑i′∉Egi′(xi′,xpa(i′))+C(θ)
… where E=N(B)∪ch(N(B)), and |B| is at-most D.
Exponentiating and simplifying a bit yields:
P[X|Θ]=1Z(Θ)eU(Θ)∑i′∉Egi′(xi′,xpa(i′))∏i∉EP[Xi|Xpa(i),Θ0]∏i∈EP[Xi|Xpa(i),Θ]
… which is the desired form.
Key Takeaways
The main takeaway here is the idea that exceptional variables occur in neighborhoods, so sparsity is a necessary condition for the theorem to say anything nontrivial. On the other hand, causal models are extremely general, sparsity is ubiquitous in the physical sciences, and a sparse causal model is all we need to apply the theorem. We should thus expect exponential family distributions to show up in most places where low-dimensional summaries exist, in practice. In particular, we should expect exponential family forms for most abstractions, in practice.
Another important point: the methods used in the proof are “smooth”, in the sense that small violations of the assumptions should produce only small violations of the conclusions. In other words, we should also expect distributions with approximate low-dimensional summaries, which approximately factor as a sparse Bayes Net, to approximately fit the exponential form.
IID KPD
When our variables have some symmetry, we can sometimes leverage it to eliminate the annoying “exception” variables.
The basic requirement is that our distribution P[X|Θ] is invariant under permuting some of the variables. For instance, if we have a time-symmetric Markov Chain, then we can replace every Xt with Xt+1, and the distribution remains the same. In the case of IID variables, we can swap any two variables Xi and Xj, and the distribution remains the same, i.e.
P[Xi=xi,Xj=xj,…|Θ]=P[Xj=xi,Xi=xj,…|Θ]
If we can swap exception variables with non-exception variables, then we can sometimes eliminate the exceptions altogether.
Theorem
Let (X1,…,Xn) be continuous random variables, conditionally independent and identically distributed given Θ (i.e. P[X=x|Θ]=∏iP[X1=x|Θ]). There exists a D-dimensional sufficient statistic G(X) summarizing all information in X relevant to Θ, with D <n, if-and-only-if the distribution has the form
P[X|Θ]=1Z(Θ)ef(Θ)⋅∑ig(Xi)∏iP[Xi|Θ0]
… where f,g are some at-most D-dimensional summary functions.
Other than eliminating the exception terms, there are two differences from the independent-but-not-identically-distributed theorem: the gi’s are all the same function, and we explicitly require D <n. When D ≥n in the earlier theorem, all variables can be exceptions, so the theorem says nothing. For this theorem, we need at least one non-exceptional variable to swap the other variables with, so we require D <n.
Note that this theorem is strictly stronger than the original KPD as typically stated, which also applies to the IID case. The original KPD talks about existence of a finite-dimensional summary for an infinite stream of IID variables, whereas this version applies even to a finite set of variables, so long as the dimension of the sufficient statistic is smaller than the number of variables.
Proof
Since this a special case of independent variables, we can start from our independent but non-identical theorem:
P[X|Θ]=1Z(Θ)ef(Θ)⋅∑i∉Egi(Xi)∏i∉EP[Xi|Θ0]∏i∈EP[Xi|Θ]
Since we assume D <n, there must be at least one non-exception variable. So, without loss of generality, assume X1∉E.
For any other variable i∈E, we can swap Xi with X1 and integrate out all the other variables to find
1Z′(Θ)ef(Θ)⋅g1(x1)P[X1=x1|Θ0]P[Xi=xi|Θ]=P[X1=x1,Xi=xi|Θ]
=P[X1=xi,Xi=x1|Θ]
=1Z′(Θ)ef(Θ)⋅g1(xi)P[X1=xi|Θ0]P[Xi=x1|Θ]
… then integrate out x1:
P[Xi=xi|Θ]=1Z′′(Θ)ef(Θ)⋅g1(xi)P[X1=xi|Θ0]
(Minor note: we’re absorbing constants into Z(Θ) from each integral, which is why it keeps gaining primes.)
For variables i∉E, we can also swap with X1 in order to replace gi with g1. Swapping and integrating as before yields
ef(Θ)⋅g1(xi)P[X1=xi|Θ0]=ef(Θ)⋅gi(xi)P[Xi=xi|Θ0]
Substituting both of those into the original expression from the independent but non-identical theorem yields
P[X=x|Θ]=1Z′′′(Θ)ef(Θ)⋅∑ig1(xi)∏iP[X1=xi|Θ0]
… which is the desired form.
Key Takeaways
While this is a minor improvement over the original KPD, the real point is to show how symmetry can remove the exception terms.
I expect this to be an especially powerful tool in conjunction with the techniques in Writing Causal Models Like We Write Programs. That post illustrates how to turn recursively-defined functions into recursively-defined causal models. The recursive calls become exactly the sort of symmetries which could potentially be used to remove exception terms in the Causal Model/Bayes Net KPD.
Takeaways Summary
The original Koopman-Pitman-Darmois theorem says that a fixed-dimensional sufficient statistic exists for a stream of IID variables if-and-only-if the distribution is of exponential form.
This generalizes a lot. Neither independence nor identical distribution is necessary; we need only a sparse causal model. However, this generalization comes with a cost: some “exception” variables may not follow the exponential form. As long as the system is sparse, and the number of variables is much larger than the dimension of the sufficient statistic, the number of exception variables will be relatively small, and the vast majority of variables in the system will satisfy the exponential form.
The proofs should also generalize to approximations - systems with approximate sufficient statistics should be approximately exponential family, except for a few variables.
We can also sometimes leverage symmetry to eliminate the exception variables. If the distribution is invariant under some permutation of the variables, then we can try to permute exception variables to non-exception variables, in which case they must follow the exponential form.
Appendix: Additive Summary Equation Theorem
An Additive Summary Equation is a functional equation of the form
F(G(x))=f(x)=∑ifi(x)
… with F and G unknown, G of dimension D <dim(f), D <dim(x).
In order to say anything nontrivial about the equation, we need each f-term fi to only depend on a few x-components xj, and each x-component to only influence a few f-terms. Given some xj, we can pick out the (indices of) f-components which it influences via the expression
{i:∂fi∂xj≢0}
We’re also interested in the “neighbors” of xj, i.e. x-components xj′ for which some fi depends on both xj and xj′. We’ll write the (indices of) xj’s neighbors as N(j); note that we do consider xj a neighbor of itself.
The post on the Additive Summary Equation shows that the equation is solvable only if we can choose a set of at-most D components of x, called xB, for which f can be expressed as
f(x)=∑i:∂fi∂xN(B)≢0fi(x)+U∑i′gi′(x)+C
… where the gi’s have output dimension at-most D, U is a constant dim(f) by D matrix, and C is a constant vector. In particular, we can choose
gi′(x)=U†(fi′(x)−fi′(x0)),
C=∑i′fi′(x0)
… for some x0, with i′ ranging over terms not dependent on xN(B), i.e. {i′:∂fi′∂xj≡0}.
Intuitively, this says that we can’t really say anything about terms dependent on xN(B). However, xN(B) consists of at-most D variables plus their neighbors, so in a large sparse system, hopefully most terms will not depend on xN(B). And for all the terms not dependent on xN(B), the information content can be aggregated by summation - i.e. the sum ∑i′gi′(x).
Note that we may sometimes want to think about f(x) whose output dimension is uncountably infinite - i.e. f(x) returns a distribution over a continuous variable θ. In that case, the dot product U†(fi′(x)−fi′(x0)) defining gi′(x) becomes an integral ∫θU†(θ)(fi′(x)(θ)−fi′(x0)(θ))dθ; everything else remains the same. For simplicity, this post's notation implicitly assumes that θ has finitely many values, but the generalization is always straightforward.