What’s the difference between a map and the raw data used to generate that map?
Suppose, for example, that google sends a bunch of cars out driving around New York City, snapping photos every few feet, then uses the photos to reconstruct a streetmap. In an informational sense, what’s the difference between the giant database full of photos and the streetmap?
Obvious answer: the map throws away a huge amount of information which we don’t care about, at least for the map’s use-case.
Let’s try to formalize this intuition a little. We want the ability to run queries against the map - e.g. things like “How far apart are Times Square and the Met?” or questions about street network topology. Let’s call the whole class of queries we want to support . We also have some input data - e.g. the database full of images from the streets of NYC. The goal of the map-synthesizing process is to throw out as much information as possible from , while still keeping any and all information relevant to queries in .
Natural next question: given the query class and some data-generating process, can we characterize “optimal” map-making processes which maximize the amount of information “thrown out” (measured in an information-theoretic sense) while still keeping all information relevant to ?
For the very simplest query classes, we can answer this question quite neatly.
Suppose that our query class contains only a single query , and the query is a true/false question. For instance, maybe the only query we care about at all is = “Is the distance between Times Square and the Met greater than one mile?”, and we want to throw out as much information as possible without losing any info relevant to the query.
Claim: any solution to this problem is isomorphic to the probability . In other words, is itself a minimal “lossless” map (where by “lossless” we mean it does not throw out any info from relevant to the query, not that it perfectly predicts ), and any other minimal “lossless” map is a reversible transformation of .
Let’s prove it.
We’ll formalize our claim in terms of mutual information . We have two pieces:
- throws away no information relevant to :
- For any function , if throws away no information relevant to , then can be computed from : implies for some .
One important note: our notation is really shorthand for , so is not a function of the “outcome” of - it’s only a function of . Also note that the second condition implies that the entropy of is at least as high as , so is indeed minimal in terms of information content.
In the next two sections, we’ll separately prove our two pieces.
We want to show that throws away no information relevant to : .
We’ll start with a short side-quest to prove a lemma: . Here’s an interpretation: suppose you work late trying to calculate the probability of based on some giant pile of data . You finally compute the result: . You go to bed feeling victorious, but wake up to find that your computer has died and your giant pile of data is lost. However, you still remember the final result - even though you don’t know , you do know that . Based on this information, what probability should you assign to ? Presumably the answer should be .
We can directly prove this in a few lines:
Note that the indicator function in the main sum is zero whenever is not , so we can replace with in that sum:
We can also write this in the more compact form , which we’ll use later.
To build a bit more intuition, let’s expand on this result a bit. Imagine a circuit (aka deterministic causal diagram or computational DAG) which takes in data and outputs . We’ll think about some arbitrary cross-section of that circuit - some set of internal nodes which cut the circuit in two, so that mediates the interaction between and . We can write this as for some function .
Claim: for any cross-section , . Intuitively, we can imagine that we’re partway through calculating , we’ve calculated all the ’s but haven’t finished the full computation yet, when suddenly our computer dies and we lose all the original data . As long as we still have the ’s around, we’re good - we just shrug and continue the calculation, since the ’s were all we needed to finish up anyway.
I won’t write out the proof for this one, but it works exactly like the previous proof.
Notice that our cross-section theorem has two interesting corner cases: all of is a cross-section, and by itself is a cross-section. The former yields the trivial statement , while the latter yields our lemma from earlier: .
Armed with our lemma, let’s return to the main problem: show that throws away no information relevant to , formalized as .
We’ll start by applying an identity from information theory:
… where is the Shannon entropy. So to show that , we can show . Our lemma makes this trivial:
… so contains all the information in relevant to .
The Data Processing Inequality
Next, we’d like to show that is a minimal representation of the information in relevant to . Formally, for any , if throws away no information relevant to , then is a function of : implies for some .
In particular, we’ll show that , so we can choose . Intuitively, this says that if throws out no information from relevant to , then it yields the same probability for .
To prove this, we’ll use a theorem from information theory called the data processing inequality (DPI). Intuitively, the DPI says that if we start with the data , we cannot gain any new information about just by processing . Formally, for any function :
… with equality iff and are independent conditional on (so that tells us nothing else about once we know ). The proof and a bit more discussion can be found in Cover & Thomas’ textbook on page 34.
We’re interested in the equality case , so the DPI tells us that and are independent conditioned on :
But we can always just expand via the chain rule instead:
Equate these two expressions for , and we find
… which is what we wanted.
Why Is This Interesting?
We’ve shown that the probability summarizes all the information in relevant to , and throws out as much irrelevant information as possible. Why does this matter?
First, hopefully this provides some intuition for interpreting a probability as a representation of the information in relevant to . In short: probabilities directly represent information. This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations. That means we can potentially apply it in a broader variety of situations - we can talk about simple mechanical processes which produce “maps” of the world, and the probabilistic calculations embedded in those processes.
Second, this is a natural first step in characterizing map-making processes. The data processing inequality also suggests a natural next step: sufficient statistics, which serve as minimal “lossless” maps when our data are independent samples from a distribution, and our queries can ask anything at all about the parameters of the distribution. From there, we could further generalize to approximate sufficient statistics - maps which throw out a small amount of information relevant to the queries, in cases where there is no “lossless” map smaller than .
That said, this still only talks about half of the full map-making process: the data-processing half. There’s still the data-collection half to address - the process which generates in the first place, and somehow ties it to the territory. And of course, when map-making processes are embedded in the real world, there might not be a clean separation between data-collection and data-processing, so we need to deal with that somehow.
Third, the interpretation of probabilities as a minimal map - or a representation of information - suggests possible avenues to relax traditional probability to deal with things which are time-like separated from us, e.g. for handling self-reference and advanced decision-theoretic problems. In particular, it suggests strategically throwing away information to deal with self-referential tomfoolery - but continuing to use probabilities to represent the reduced information remaining.