KPD is a weak obstruction

JustinSheek

Summary

I'll begin with the notion of sufficient statistics: why they matter for bounded agents, the 90 year old obstruction one encounters, and why this isn't so bad. Then I'll present an alternative approach with proofs and exercises, following Probability Theory: The Logic of Science (Jaynes 2003) extensively.

In pt.1 I use a symmetry argument to show that KPD assumes too much. All the benefits of sufficient statistics can be realized without assuming any kind of independence.

In pt.2 I generalize Maxent to elicit a class of non-iid models exhibiting sufficient statistics that are compatible with pt.1 by construction. Taken together, this can be viewed in some sense as a representation theorem for finite exchangeability.

Pt. 0 -- preliminaries

Let's start with a couple historical remarks on the notion of sufficient statistics, as introduced by RA Fisher (ca. 1920) in On the Mathematical Foundations of Theoretical Statistics.

That the statistic chosen should summarise the whole of the relevant information supplied by the sample. This may be called the Criterion of Sufficiency.

Not long after, three independent (and presumably identically distributed) authors, e.g. Koopman 1936, discovered an obstruction to sufficiency. Today this result is known as the Koopman-Pitman-Darmois (KPD) theorem.

According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. Intuitively, this states that nonexponential families of distributions on the real line require nonparametric statistics to fully capture the information in the data.
Less tersely, suppose are independent identically distributed real random variables whose distribution is known to be in some family of probability distributions, parametrized by $θ$ , satisfying certain technical regularity conditions, then that family is an exponential family if and only if there is a $R^{m}$ -valued sufficient statistic $T (X_{1}, . . ., X_{n})$ whose number of scalar components $m$ does not increase as the sample size $n$ increases.

At first blush, this seems to place a considerable restriction on an agent's ability to compress data for the purpose of inference. Our brains summarize sensory data extensively. Does that mean we throw away most of the relevant data, or that we have representations of exponential families tucked away somewhere in our brains, or else what precisely? Once we accept KPD, we are stuck with the exponential family. Fortunately, the exponential family is actually pretty great!

Uniquely determined by KPD
Uniquely determined by Maxent
Fast parallelizable inference w/ conjugate priors
Used nearly everywhere in physics

Yet I will argue that we can do better than KPD. We can extend Maxent. We can still do fast inference. Physics itself permits more general models. But first, I will motivate sufficiency as an essential ingredient for bounded agentic reasoning.

Why should an agent care about sufficiency?

A typical sufficient statistic will be the sample mean and variance. For $N$ data points storage is $O (N)$ , whereas for sample mean and variance storage is $O (1)$ . This is a dramatic difference for large datasets. For the purpose of this document I will call this discrepancy the agent compression problem. When reasoning about the environment, a bounded agent with access to lots of sensory data will greatly benefit from finding sufficient statistics. Almost every parametric model you can name uses a sufficient statistic. Consider the task of measuring a few thermodynamic variables like temperature, pressure, and chemical potential vs. trying to track the individual motions of $10^{23}$ molecules. Lest I've undersold the importance and ubiquity of this idea, though it's rarely (never?) described this way, the Standard Model is essentially a catalog of sufficient statistics for elementary particle interactions, i.e. sufficient statistics that we've experimentally uncovered and/or proposed on the basis of symmetry (or symmetry-breaking) arguments. From scattering cross-sections to thermodynamic equations of state, physicists have spent a century and a half developing fine-grained microscopic models from coarse-grained macroscopic measurements.

What is sufficiency, really?

Jaynes (ch. 8) introduces the notion of sufficiency, with the following observations:

Properly an invariance of the posterior: "...given $r$ [the sufficient statistic] alone we should be able to draw the same inferences about [a model parameter] $θ$ ..."
Orthogonal to assumptions about sampling characteristics: "Note that we are not assuming independent or exchangeable sampling here..."
Produces a kernel condition on the prior (Jaynes 8.5). Fisher sufficiency is a special case of vanishing kernel, because it has no notion of a prior.
"...use of a particular prior may make certain particular aspects of the data irrelevant. Then a different prior may make different aspects of the data irrelevant. One who is not prepared for this may think that a contradiction or paradox has been found."
Only vanishing kernel implies Fisher-Neyman factorization (Jaynes 8.9), which is the starting point of most proofs of KPD.

Provided vanishing kernel, for a statistic $r (x_{1} . . . x_{n})$ sufficient for $θ$ , the conditional distribution satisfies Fisher-Neyman (F-N) factorization:

p (x_{1} . . . x_{n} | θ) = q (r | θ) b (x_{1} . . . x_{n})

Yet KPD makes restrictive assumptions about sampling characteristics and makes no use of the prior or posterior. Based on the above caveats, we should expect KPD to fail to capture the essence of sufficiency. I suspect KPD is a poor fit because it does not approach the agent compression problem directly -- let's explore that!

Pt. I -- sufficiency from symmetry

Let's approach agent compression directly by describing an "inference-only" agent, i.e. one that is not intervening on its inputs, that is considering the following propositions:

D = "receive data -- from some operationalized process -- in the form of elementary outcomes $x_{1} . . . x_{n}$ "
E = "likewise $x_{n + 1} . . . x_{n + m}$ "
Note: D vs E is a logical boundary, not necessarily a causal or temporal boundary, e.g.

D corresponds to the first run of an experiment, E corresponds to the second run of the same experiment
D corresponds to the kth flips of a coin with k prime, E corresponds to the kth flips of a coin with k composite
D corresponds to a simple microscopic model of the magnetic dipoles of a particular cube of pure nickel in thermal equilibrium with its environment at some particular temperature, within some particular magnetic field, etc. E corresponds to the same degrees of freedom in a different cube of pure nickel under the same conditions

A is some model proposition, e.g.

$A_{f}$ : "in spite of any other evidence, the long-run frequency of Heads under these experimental settings is f"
$A_{p}$ : "in spite of any other evidence, the probability of Heads under these experimental settings is p"
$A_{m e o w}$ : "this sensory input contains a cat"

T(D) = "the output of program T acting on input D"

Define Bayes sufficiency, i.e. "T(D) is Bayes sufficient for A given D":

P (A | D) = P (A | T (D))

Note that this does not require A to be a differentiable parameter (or even a parameter at all). The posterior $P (A | D)$ is itself Bayes sufficient. To wit,

P (A | P (A | D)) = P (A | D)

This last equality is evident in Jaynes' exploration of sufficiency, wherein he alludes to the agent compression problem.

...the only property of E which the robot needs in order to reason out the effect of new information is this density $(A_{p} | E)$ . Everything that the robot has ever learned which is relevant to proposition A may consist of millions of isolated separate facts. But when it receives new information, it does not have to go back and search its entire memory for every little detail of its information relevant to A. Everything it needs in order to reason about A from that past experience is contained summarized in this one function, $(A_{p} | E)$ (Jaynes p. 557)

For sufficiency to hold for any $n$ , it must follow that:

P (A | E D) = P (A | T (E D))

It should be clear from this expression that -- without further assumptions -- an agent needs to retain D to compute $T (E D)$ , $T (F E D)$ , and so on. Sufficiency alone does not solve the compression problem.

Whence KPD?

One way to view KPD is that it makes progress by implicitly assuming a symmetry called infinite exchangeability. To build up to that, first consider the following symmetry under the permutation group -- here denoted $S y m (D)$ -- otherwise known as finite exchangeability:

\forall π \in S y m (D) : P (D) = P (π (D))

Exercise: show that this relation implies the $x_{i}$ are identically distributed.

Infinite exchangeability is the extension of finite exchangeability to any finite symmetric subgroup, in the limit $n \to \infty$ . Representation theorems, due to de Finetti 1937 and others, establish conditions in which infinite exchangeability is equivalent to a mixture of iid. For a survey of various forms of exchangeability, see Diaconis 1988.

Together with F-N factorization -- modulo proof-specific technical assumptions -- infinite exchangeability entails KPD. There are many proofs. What they appear to have in common is a step where the independence assumption is used with a logarithm to break a product of independent probabilities into a sum. Then the identical distribution property is used to argue that the statistic can also be broken into a sum. Matching these expressions yields the exponential family. But one can represent nearly any probability distribution as a sum, by application of Kolmogorov-Arnold. The logarithm is perfectly suited only to independent distributions. This should give us some intuition that KPD's independence assumption is unnecessarily restrictive.

Time for T

I will make a slightly different assumption, informed by Bayes sufficiency -- one that I think is more natural as it involves the sufficient statistic directly:

\forall π \in S y m (D) : T (D) = T (π (D))

A pedagogical note. This assumption differs from exchangeability in that inference on A, given D -- rather than the plausibility of D itself -- is invariant to $D \to π (D)$ :

\begin{matrix} ⟹ P (A | D) & = P (A | T (D)) = P (A | T (π (D))) = P (A | π (D)) \end{matrix}

Due to its permutation symmetry, T can be represented by a particularly nice class of functions. Namely, power sum symmetric polynomials (essentially sample moments in this context) which form a basis of universal approximators via the Stone-Weierstrass theorem. See the Appendix of Deep Sets for a detailed discussion. There the authors show that permutation symmetry permits an invertible embedding of D (n dim) into $ϕ_{k} (x) = x^{k}, k \in 0, 1, . . ., n$ (n+1 dim). From there, a simple way to achieve compression is to assume that inference on A depends only on some subspace, say polynomials up to power m ( $m < n$ ). $ϕ_{k} (x) = x^{k}, k \in 0, 1, . . ., m$ . A more principled approach to compression will be explored in pt.2. With those ingredients in mind:

\begin{matrix} T_{k} (D) & = \sum i ϕ_{k} (x_{i}) T & = [T_{0}, T_{1}, . . ., T_{m}] ⟹ T (E D) & = T (E) + T (D) ⟹ P (A | E D) & = P (A | T (E) + T (D)) \end{matrix}

Now, so long as inference for A is the only goal, the agent is able to discard D in favor of T(D). This leads to the same update power as conjugate priors for the exponential family, apparently without any restrictions on model family.

Note that in thermodynamic settings, $T (E D) = T (E) + T (D)$ is called extensivity.

Low-effort example

Let's try a simple illustration when T(D) are power sums:

\begin{matrix} ϕ_{k} (x) & = x^{k}, k = 0, 1, 2 D & = [3, 12, 4, 9] E & = [15, 4, 8] \end{matrix}

$\begin{matrix} ϕ (D) & = [(1, 3, 9), (1, 12, 144), (1, 4, 16), (1, 9, 81)] ϕ (E) & = [(1, 15, 225), (1, 4, 16), (1, 8, 64)] ϕ (E D) & = [(1, 3, 9), (1, 12, 144), (1, 4, 16), (1, 9, 81), (1, 15, 225), (1, 4, 16), (1, 8, 64)] T (D) & = [(4, 28, 250)] T (E) & = [(3, 27, 305)] T (E D) & = [(7, 55, 555)] T (E) + T (D) & = [(7, 55, 555)] \end{matrix}$

Next steps

That's it. Start with Bayes sufficiency, use the representation theorem implied by permutation symmetry, compress, evade KPD. But what about Maxent? Doesn't that also justify exponential families? Jaynes recasts sufficiency in these terms:

...if we measure information in terms of entropy, then zero information loss in going from the full data set D to a statistic r is equivalent to sufficiency of r... (footnote to Jaynes p. 247)

Evidently, sufficiency has something to do with information entropy. As Maxent does not make an iid assumption (see Jaynes eq. 14.58), it provides a more generic justification for exponential families. Let's explore these ideas more in the next section.

Pt. II -- constructive non-iid models

Much of this section reflects a wider effort I undertook many years ago to generalize Maxent. These results were developed independently from pt.1 and as such do not depend on exchangeability. Instead they will reflect the notion of sufficiency in the sense of information entropy. I will offer a construction of models that simply make use of exchangeability, for the purpose of illustrating the compatibility and practical applicability of these approaches to agent compression.

f-divergence basic properties

Beginning with two probability distributions -- p and q -- and a convex function f, with $R^{+} \subseteq d o m (f)$ , define the f-divergence:

D_{f} (p, q) = \int d x p (x) f (\frac{q (x)}{p (x)})

Discrete and path integral analogues may be formulated mutatis mutandis. Beware that there are conflicting conventions in the literature -- sometimes f-divergence is defined with opposite sign or with arguments reversed. When p and q are in the same parametric family, then through an abuse of notation one also writes:

D_{f} (θ, θ^{'}) = \int d x p (x | θ) f (\frac{p (x | θ^{'})}{p (x | θ)})

Following section 3.10 of Nielsen 2018, the f-divergence shares key desiderata with Kullback–Leibler divergence (KL). Namely Jensen's Inequality, the Data Processing Inequality, invariance under 1:1 transformations, etc.

Exercise: prove 1 or 2 of the above properties.

For well-behaved convex functions, it's useful to define two involutions ( $#$ and $⋄$ ):

\begin{matrix} f^{#} (f^{'} (x)) & = x f^{'} (x) - f (x) f^{⋄} (u) & = u f (\frac{1}{u}) & (u & > 0) \end{matrix}

Exercise: show that these operations are involutions on the space of convex functions.

Without loss of generality and because it makes some of the math simpler, I tend to use so-called "standard" f-divergences, satisfying (note for the last relation there's a typo in Nielsen):

\begin{matrix} f (1) & = 0 & f^{'} (1) & = 0 & f^{''} (1) & = 1 \end{matrix}

Exercise: prove the well-known identity for convex conjugates:

f^{#'} = f^{' - 1}

Exercise: show that:

f^{⋄'} (1) = f (1) - f^{'} (1) = 0

Observe that KL is a special case of f-divergence, ordinarily defined with $f (u) = - l n u$

Exercise: show that KL in "standard" form is $f_{K L} (u) = - 1 - l n u + u$

Lagrangians are neat!

Here is a quick review of the Maxent solution (Jaynes 11.6) via the method of Lagrange multipliers (recall that $T_{0} (D) = n$ ):

\begin{matrix} δ [- \sum D p (D) l n (\frac{q (D)}{p (D)}) - \sum k λ_{k} \sum D p (D) T_{k} (D)] {∣ ∣}_{p = p^{⋆}} = [- l n (\frac{q (D)}{p (D)}) + 1 - \sum k λ_{k} T_{k} (D)] {∣ ∣}_{p = p^{⋆}} δ p = 0 \end{matrix}

\begin{matrix} ⟹ p^{⋆} (D | λ) & = q (D) e^{- 1 + \sum_{k} λ_{k} T_{k} (D)} = q (D) e^{- 1 + ⟨ λ, T (D) ⟩} \end{matrix}

Exercise: convince yourself that this is an exponential family in the so-called "canonical form".

This result is what I mean when I say Maxent justifies the exponential family. Exponential families are the extreme points of KL. The extrema of f-divergences share key properties with the extrema of KL, in particular the form of F-N. I'll show this now.

Beware I'm skipping some details such as using KKT conditions to restrict the feasible set to $p (D) \geq 0$ , which you got "for free" in the exponential family solution. Nevertheless, following the same method as above, but now for f-divergences:

\begin{matrix} δ [\sum D p (D) f (\frac{q (D)}{p (D)}) - \sum k λ_{k} \sum D p (D) T_{k} (D)] {∣ ∣}_{p = p^{⋆}} δ [\sum D q (D) f^{⋄} (\frac{p (D)}{q (D)}) - \sum k λ_{k} \sum D p (D) T_{k} (D)] {∣ ∣}_{p = p^{⋆}} = [f^{⋄'} (\frac{p (D)}{q (D)}) - \sum k λ_{k} T_{k} (D)] {∣ ∣}_{p = p^{⋆}} δ p = 0 \end{matrix}

\begin{matrix} ⟹ p^{⋆} (D | λ) & = q (D) f^{⋄' - 1} (\sum k λ_{k} T_{k} (D)) \end{matrix}

Exercise: show that because $f^{#'} = f^{' - 1}$ and $f^{⋄'} (1) = 0$ , we can write:

p^{⋆} (D | λ) = p^{⋆} (D | λ = 0) f^{⋄ #'} (⟨ λ, T (D) ⟩)

By inspection, this is in the form of F-N, so it automatically reflects Bayes sufficiency:

\begin{matrix} ⟹ P (λ | D) & = P (λ | T (D)) \end{matrix}

Exercise: show this.

Remarks on $p^{⋆}$

Now that we have a solution to our Lagrangian, we can return to the question of compression. Start with any fixed size representation of T. Standard arguments from convex optimization show how a redundant statistic $T_{k}$ will become a slack condition, eliminating that constraint automatically through the vanishing of its Lagrange multiplier ( $λ_{k} \to 0$ ) -- see Jaynes pp. 369-370. In physics, this phenomenon underlies universality. Although a lifetime worth of complications arise in continuum models, techniques like renormalization group (RG) inform particular procedures to enable one to argue that all but finitely many terms vanish. On the other hand, unaccounted statistics will cause recognizable errors in reasoning, which has historically led to new postulates such as Planck's energy quantization -- to explain blackbody radiation -- or the Anderson-Higgs mechanism.

When $T (D)$ is permutation invariant, $p^{⋆} (D | λ)$ models $D = x_{1} . . . x_{n}$ as identically distributed, but not independent. (It's possible to relax the identical distribution property by assuming a weaker symmetry known as partial exchangeability. Jaynes eq. 14.58 exhibits the essential idea.) Independence holds exclusively in the KL case. But KL never gives power laws or heavy tails which are in reality ubiquitous. An f-divergence readily produces power laws.

The f-divergence based on Amari $α$ -divergence has extrema that are power laws:

\begin{matrix} f_{α} (u) & = \frac{4}{1 - α^{2}} (\frac{1 + α}{2} + \frac{1 - α}{2} u - u^{\frac{1}{2} (1 - α)}) f_{α}^{⋄} (u) & = \frac{4}{1 - α^{2}} (\frac{1 - α}{2} + \frac{1 + α}{2} u - u^{\frac{1}{2} (1 + α)}) = f_{- α} (u) y & = f_{α}^{⋄'} (u) = \frac{2}{1 - α} (1 - u^{- \frac{1}{2} (1 - α)}) f_{α}^{⋄' - 1} (y) & = (1 - \frac{1 - α}{2} y)^{- \frac{2}{1 - α}} \end{matrix}

This last expression is sometimes called the q-exponential ( $α = 3 - 2 q$ ). $α$ -divergences lead to useful applications and insights in information geometry and machine learning, e.g. they provide a geometric interpretation of the expectation-maximization (E-M) algorithm.

Exercise: show that ${lim}_{α \to 1} f_{α} (u) = - 1 - l n u + u = f_{K L} (u)$

Exercise: show that ${lim}_{α \to 1} f_{α}^{⋄' - 1} (y) = e^{y}$

Here's a fun example that demonstrates an f-divergence of purely combinatoric origin. If you follow the Wallis derivation (Jaynes 11.4) of entropy, but instead of classical quanta you perform the thought experiment with bosons, as I recall you end up with a non-standard f-divergence with extrema mimicking Bose-Einstein statistics:

\begin{matrix} f (u) & = - (1 + u) l n (1 + u) + u l n u f^{⋄} (u) & = u [- (1 + 1 / u) l n (1 + 1 / u) + (1 / u) l n (1 / u)] = f (u) y & = f^{⋄'} (u) = - l n (1 / u + 1) f^{⋄' - 1} (y) & = \frac{1}{e^{- y} - 1} \end{matrix}

Exercise: show that the "standardized" form satisfies:

\begin{matrix} f_{B E} (u) & = 2 u l n (u) - 2 (1 + u) l n (\frac{1}{2} (1 + u)) f_{B E}^{⋄' - 1} (y) & = \frac{1}{2 e^{- y / 2} - 1} \end{matrix}

Same macro, different micro

Let's see f-divergence in action. Consider the following thought experiment optional -- you can skip ahead to contemplating bit strings if you like. Imagine that you have a big rock primarily composed of two isotopes of carbon, carbon-12 and carbon-14, in roughly 1:1 proportion. You also have a black box. When you put a chunk of your rock in the box, it ejects some sludge and then prints out a sheet of graphene: perfectly rectangular, one atom thick, and without defects. You look at it under a microscope and see a hexagonal lattice (side length $a$ ) of carbon atoms, but your microscope can't distinguish the carbon-12 from the carbon-14. Because of this -- and because it might take months to scan your microscope over the entire sheet, counting hexagons -- you decide to make some macroscopic measurements. First you measure the area of the sheet $A$ , which lets you calculate the number of carbon atoms $N = \frac{2 A}{\sqrt{3} a^{2}}$ . Second you measure the mass of the sheet $M$ , which will help you calculate the proportion of each isotope. Let $ζ$ be the proportion of carbon-12, then $M = ζ N m_{C - 12} + (1 - ζ) N m_{C - 14}$ . Suppose you look up the masses of each isotope in a table and, based on your measurements, use those values to calculate unexpectedly that $ζ = 1 / 4$ . After further testing -- smaller chunks, larger chunks, cutting, balancing, electrifying, and so on -- you convince yourself based on the apparent translation symmetry of the lattice that each unit cell should be statistically identical. So if you let $n = 6$ represent the number of carbon atoms per unit cell, and $k$ represent the number of carbon-12 atoms per unit cell, you can picture your graphene sheet as an ensemble of hexagons with $⟨ k ⟩ = ζ n = 3 / 2$ . Now you are curious, but your apparatus haven't told you, what is the isotopic composition of the unit cells produced by the black box? This motivates the following model:

Imagine that you have access to a bit string:

\begin{matrix} D & = x_{1} . . . x_{n} & (n = 6) \end{matrix}

With the property:

\begin{matrix} ⟨ k ⟩ & = ζ n = 3 / 2 & (ζ = 1 / 4) \end{matrix}

For a finite state space $Ω$ one can always make the choice of uniform reference measure:

\begin{matrix} q (D) & = \frac{1}{| Ω |} = \frac{1}{2^{n}} & (Ω = {0, 1}^{n}) \end{matrix}

There is a physical intuition that this corresponds to: provided that whatever process is happening in the box is unable to differentiate the carbon isotopes when forming graphene, the proportion of isotopes in the output will be the same as it was in the input, 1:1. That this has been experimentally disconfirmed is why there is any non-trivial inference left to do.

Exercise: show that if the experiment had instead confirmed $⟨ k ⟩ = 1 / 2$ , all of the models would be trivial, i.e. starting with any f-divergence:

p^{⋆} (D | λ) \to q (D) = \frac{1}{2^{n}}

Let $T (D) = [\sum_{i} 1, \sum_{i} x_{i}] = [n, k]$ . Canonically, the binomial distribution is expressed in terms of the pushforward measure $T_{⋆} P (T (D))$ rather than $P (D)$ . In other words, it's treated as a distribution acting on $k \in N$ rather than a distribution acting on $D \in Ω$ . This treatment is taken so much for granted that you'll typically see the function $T_{⋆} P (T (D))$ presented as the definition of the binomial distribution. That definition is only sensible in the first place because $k$ is a sufficient statistic... smuggling in exchangeability after all. We will adopt that convention here:

\begin{matrix} T_{⋆} q (T (D)) & = q (T^{- 1} ([n, k])) = \sum D : T (D) = [n, k] q (D) = \frac{1}{2^{n}} (\frac{n}{k}) & = B (k, n, \frac{1}{2}) T_{⋆} p^{⋆} (T (D) | λ) & = p^{⋆} (T^{- 1} ([n, k]) | λ) = q (T^{- 1} ([n, k])) f^{⋄ #'} (⟨ λ, [n, k] ⟩) = \frac{1}{2^{n}} (\frac{n}{k}) f^{⋄ #'} (λ_{0} n + λ_{1} k) & = ν (n, k, λ) \end{matrix}

From here it is easy to proceed. Pick an f. Use your favorite numerical solver to find $[λ_{0}, λ_{1}]$ . That's your model.

$f_{K L} (u) = - 1 - l n u + u$
$ν_{K L} (n, k, λ) \to 0.1780 (\frac{6}{k}) e^{- 1.099 k}$
${lim}_{α \to - 1} f_{α} (u) = 1 - u + u l n u$
$ν_{α = - 1} (n, k, λ) \to \frac{(\frac{6}{k})}{64 (0.039 + 0.641 k)}$
$f_{B E} (u) = 2 u l n (u) - 2 (1 + u) l n (\frac{1}{2} (1 + u))$
$ν_{B E} (n, k, λ) \to \frac{0.014934 (\frac{6}{k})}{- 0.9558 + e^{0.3474 k}}$

Exercise: check for each model that, as expected:

\forall i : P (x_{i} = 1) = ζ = 1 / 4

Now the question of whether any of these models is correct is in principle testable. Build (or apply for beamtime on) a device that can distinguish the isotopes in a unit cell. For example, neutron scattering might be a good technique for this. Once you can distinguish the isotopic configurations with your measurements, you can distinguish your models, because even though they all match your macroscopic measurements, they make remarkably different microscopic predictions $P (x_{1} . . . x_{n})$ . For comparison, here are the values for $P (000000)$ and $P (111111)$ :

\begin{matrix} ν_{K L} (k = 0) & \to 0.1780 & ν_{K L} (k = 6) & \to 0.000244 ν_{α = - 1} (k = 0) & \to 0.40 & ν_{α = - 1} (k = 6) & \to 0.00402 ν_{B E} (k = 0) & \to 0.338 & ν_{B E} (k = 6) & \to 0.002108 \end{matrix}

Because the extrema of f-divergences are not characterized by independence, they should have novel application to source coding, large deviations theory, Solomonoff induction, in fact anywhere that Shannon entropy or KL already appears... just to name a few ideas. A practical roadblock is that the above numerical results for inference are terribly slow to compute, not to mention unsatisfying. I will address this in a future post. (Hint: the answer is something like Feynman diagrams, but worse!)

This feels like a very very high quality post and like... I worry that maybe it belongs on arxiv, and the post here should announce that you put this kind of material on arxiv, and give a silly and readable and fun explainer of "the basic idea" here?

Because like... that's a lot of math! But also... it had good jokes!

Not long after, three independent (and presumably identically distributed) authors... discovered an obstruction to sufficiency.

Dad jokes?? In my math reading? Its more likely than you think! ;-)

Content wise, I feel like the framing here was super super important to Ontology and Semantics works that aim at the general problem of being able to say "Please Become Good and Don't Make Us Regret Creating You Oh Mighty ASI!" and then somehow there is math that we can be confident in to build an ASI be such as to understand what we mean by that, and then the math plus the sentence causes the ASI to simply become good, and then make us not regret having created em?

This part caught my eye at the beginning because it feels like it goes with "uniquely correct ways to think" like Bayes and VNM and so on...

Once we accept KPD, we are stuck with the exponential family. Fortunately, the exponential family is actually pretty great!
Uniquely determined by KPD
Uniquely determined by Maxent
Fast parallelizable inference w/ conjugate priors
Used nearly everywhere in physics
Yet I will argue that we can do better than KPD. We can extend Maxent.

I get the sense that something very very deep and important is being talked about from bits like this...

Lest I've undersold the importance and ubiquity of this idea, though it's rarely (never?) described this way, the Standard Model is essentially a catalog of sufficient statistics for elementary particle interactions, i.e. sufficient statistics that we've experimentally uncovered and/or proposed on the basis of symmetry (or symmetry-breaking) arguments. From scattering cross-sections to thermodynamic equations of state, physicists have spent a century and a half developing fine-grained microscopic models from coarse-grained macroscopic measurements.

But also, I sorta skimmed the math? This bit jumps out at me as maybe close to the essential notation for the machinery you're playing with?

So I can imagine that "T" and also "π" are both ways of "compressing big data 'D' down to some smaller thingy that is still somehow sufficient for computing bayesian probabilities base don D that are relevant to the probability of A"? Maybe?

And then I can sorta get the sense that the rest of the math in your post is defining how they work, proving the method works, and illustrating how to use it?

And "π" is the frequentist way to do it? And "T" is the Jaynesian way to do it? Maybe?

I feel like this post deserves more interaction and so I'd like to request that you correct my errors, or brag about how this is useful, or otherwise link it to things I already sorta know about.

It seems like it is doing similar things to the Natural Latents stuff, and to the Infrabayes stuff, maybe, in terms of looking at the essence of sufficient (mathematical) semantics on the one side, and looking at Bayes-proximate math cleverness on the other side.

Zero-math-just-words summary:
So the natural world seems to be full of this symmetry thing, which is great because I suspect otherwise we wouldn't have any hope of making a good map of it. Here's some math that makes constructive epistemic claims about a particularly simple example -- permutation symmetry -- that people have been generally confused about for 90 years. Then here's how that idea would testably interact with a thought experiment that you could almost pull off in a lab (BYO magic box).

Remember T is a program that's supposed to emit all the computable parts of D that are relevant for doing inference on A. That's P(A|D) = P(A|T(D)). (Naturally, the identity function on D does a perfectly fine job of preserving the relevant parts of D, so to be non-trivial you really want T to compress D in some way.) Here is some permutation of D (this doesn't compress D). T(D) = T( $π$ (D)) just says that T doesn't care about the ordering of D. Then you combine these facts to resolve that inference on A also doesn't care about the ordering of D. That's P(A|D) = P(A| $π$ (D)).

Negative example:
$A_{m e o w}$ = "this sensory input contains a cat"
D = "the pixels of the sensory input"
T = "AlexNet without the final softmax layer"

Observe $π$ destroys inference for $A_{m e o w}$ . P( $A_{m e o w}$ |D) $\neq$ P( $A_{m e o w}$ | $π$ (D)). Classification of natural images isn't invariant to arbitrary reordering of pixels. AlexNet isn't invariant to reordering of pixels because gradient descent would rapidly exit that region of weight space. T(D) $\neq$ T( $π$ (D)).

Perhaps there are some other (maybe approximate) symmetries ( $σ$ ) like reflections, translations, blurring, recoloring, rescaling. P( $A_{m e o w}$ |D) $\approx$ P( $A_{m e o w}$ | $σ$ (D)). AlexNet might be in or near those regions of weight space. T(D) $\approx$ T( $σ$ (D)). This should also apply to your cortex or aliens or whatever has a notion of a cat!

I would be very surprised if this didn't connect to natural latents. It connects to universality. I don't think it connects directly to infrabayes but they might be compatible. It also connects to SLT, but for reasons that are beyond the scope of this post.

This was a cool post! I was familiar with f-divergences as a generalization of KL divergence, and of course familiar with maxent methods, but I hadn't seen the two put together before.

Problem: I'm left unsure when or why I would ever want to use this machinery. Sufficient statistics are an intuitive concept, and I can look around at the world and make guesses about where I would want to use sufficient statistics to model things. Independence is also an intuitive concept, and I can look around the world and make guesses about where to use that. Combining those two, I can notice places in the world where (some version of) KPD should bind, and if it doesn't bind then I'm surprised.

But I don't see how to notice places where max-f-divergence distributions should bind to reality. I can see where sufficient statistics should exist in the world, but that's not a sufficient condition for a max-f-divergence distribution; there are (IIUC) lots of other distributions for which sufficient statistics exist. So why focus on the max-f-divergence class specifically? What intuitive property of (some) real-world systems nails down that class of distributions? Maybe some kind of convexity condition on the distribution or on updates or something?

A practical roadblock is that the above numerical results for inference are terribly slow to compute...

Not sure exactly what you're doing numerically, but here's how I usually handle vanilla maxent problems numerically (in my usual notation, which is not quite the same as yours, apologies for putting marginal translation work on you). We start with

maxent subject to $E [f (X)] = μ$

Transform that to

maxent $_{P [X]}$ subject to $E [f (X) - μ] = 0$

This gives a dual problem for the partition function, which I'll write in full:

min $_{λ}$ $l n (\sum_{x} e^{λ^{T} (f (x) - μ)})$

That's an unconstrained optimization problem, it's convex in the happy direction, and standard optimizers (e.g. scipy using jax for derivatives) can usually handle it very quickly.

I would guess that process generalizes just fine to other f-divergences, since it's basically just relying on convex optimization tricks. And that should yield quite fast numerics.

Your notation is clear to me! It can be shown that:

Even though $f^{⋄}$ and $f^{⋄ #}$ are both convex, their composition is not necessarily convex. Sorry, I don't have any clear counterexamples at the ready. (The Hessian determinants all vanish in the graphene/bit-string model.) It's likely that I've configured my numerical solvers badly for this example.

As far as binding to reality:

Power laws are ubiquitous in nature. Exponential family doesn't admit power laws; f-div does. (It admits many other important non-exponential distributions as well -- see the B-E example above.)
Symmetries in your data that are weaker than -- or just different from -- independence are ubiquitous in nature. Exponential family is only congruent with independence; not so, f-div.
KL / Gibbs entropy is inadequate for many observable systems. This typically appears under the heading nonextensive entropy (Gell-Mann & Tsallis w/ contributions by Touchette and others) or Tsallis entropy. This paper is far enough outside my field that I can only loosely validate it, but it's an attempt to explain dark energy by way of applying observational data to determine Bayes factors of different nonextensive entropy models of our cosmological horizon.
I consider Jensen's, DPI, and coarse-graining monotonicity to be pretty important desiderata (see Nielsen). f-div is the biggest family that supports these, uniquely determined.
[I could write several more posts worth of material]

Because like... that's a lot of math! But also... it had good jokes!

Not long after, three independent (and presumably identically distributed) authors... discovered an obstruction to sufficiency.

Dad jokes?? In my math reading? Its more likely than you think! ;-)

This part caught my eye at the beginning because it feels like it goes with "uniquely correct ways to think" like Bayes and VNM and so on...

Once we accept KPD, we are stuck with the exponential family. Fortunately, the exponential family is actually pretty great!
Uniquely determined by KPD
Uniquely determined by Maxent
Fast parallelizable inference w/ conjugate priors
Used nearly everywhere in physics
Yet I will argue that we can do better than KPD. We can extend Maxent.

I get the sense that something very very deep and important is being talked about from bits like this...

Lest I've undersold the importance and ubiquity of this idea, though it's rarely (never?) described this way, the Standard Model is essentially a catalog of sufficient statistics for elementary particle interactions, i.e. sufficient statistics that we've experimentally uncovered and/or proposed on the basis of symmetry (or symmetry-breaking) arguments. From scattering cross-sections to thermodynamic equations of state, physicists have spent a century and a half developing fine-grained microscopic models from coarse-grained macroscopic measurements.

But also, I sorta skimmed the math? This bit jumps out at me as maybe close to the essential notation for the machinery you're playing with?

And then I can sorta get the sense that the rest of the math in your post is defining how they work, proving the method works, and illustrating how to use it?

And "π" is the frequentist way to do it? And "T" is the Jaynesian way to do it? Maybe?

I feel like this post deserves more interaction and so I'd like to request that you correct my errors, or brag about how this is useful, or otherwise link it to things I already sorta know about.

This was a cool post! I was familiar with f-divergences as a generalization of KL divergence, and of course familiar with maxent methods, but I hadn't seen the two put together before.

A practical roadblock is that the above numerical results for inference are terribly slow to compute...

maxent subject to $E [f (X)] = μ$

Transform that to

maxent $_{P [X]}$ subject to $E [f (X) - μ] = 0$

This gives a dual problem for the partition function, which I'll write in full:

min $_{λ}$ $l n (\sum_{x} e^{λ^{T} (f (x) - μ)})$

That's an unconstrained optimization problem, it's convex in the happy direction, and standard optimizers (e.g. scipy using jax for derivatives) can usually handle it very quickly.

I would guess that process generalizes just fine to other f-divergences, since it's basically just relying on convex optimization tricks. And that should yield quite fast numerics.

Your notation is clear to me! It can be shown that:

As far as binding to reality:

Power laws are ubiquitous in nature. Exponential family doesn't admit power laws; f-div does. (It admits many other important non-exponential distributions as well -- see the B-E example above.)
Symmetries in your data that are weaker than -- or just different from -- independence are ubiquitous in nature. Exponential family is only congruent with independence; not so, f-div.
KL / Gibbs entropy is inadequate for many observable systems. This typically appears under the heading nonextensive entropy (Gell-Mann & Tsallis w/ contributions by Touchette and others) or Tsallis entropy. This paper is far enough outside my field that I can only loosely validate it, but it's an attempt to explain dark energy by way of applying observational data to determine Bayes factors of different nonextensive entropy models of our cosmological horizon.
I consider Jensen's, DPI, and coarse-graining monotonicity to be pretty important desiderata (see Nielsen). f-div is the biggest family that supports these, uniquely determined.
[I could write several more posts worth of material]

LESSWRONG
LW

LESSWRONG
LW

13

KPD is a weak obstruction

13

Summary

Pt. 0 -- preliminaries

Why should an agent care about sufficiency?

What is sufficiency, really?

Pt. I -- sufficiency from symmetry

Whence KPD?

Time for T

Low-effort example

Next steps

Pt. II -- constructive non-iid models

f-divergence basic properties

Lagrangians are neat!

Remarks on $p^{⋆}$

Same macro, different micro

13

13

13

KPD is a weak obstruction

13

Summary

Pt. 0 -- preliminaries

Why should an agent care about sufficiency?

What is sufficiency, really?

Pt. I -- sufficiency from symmetry

Whence KPD?

Time for T

Low-effort example

Next steps

Pt. II -- constructive non-iid models

f-divergence basic properties

Lagrangians are neat!

Remarks on p⋆

Same macro, different micro

13

13

Remarks on $p^{⋆}$