Merging of opinions is a classical theorem in probability theory proved by Blackwell and Dubins in 1962 [1]. It is often cited by researcher Vanessa Kosoy, so I decided to try to understand it as a first step toward understanding infra-Bayesianism.

Introduction

Intuitively, the theorem says that under certain conditions two people observing the same sequence of events and conditioning their beliefs on them will eventually reach consensus in their predictions of future events. 

For example, suppose Ann and Bob observe an infinite sequence of coin tosses. They have prior probability measures on the space of infinite binary sequences, and they make predictions about future outcomes by conditioning on the observed ones. Then, as long as Bob's prior probability $Q$ is absolutely continuous with respect to Ann's prior probability $P$ (i.e. $P(A) = 0$ implies $Q(A) = 0$ for every measurable set $A$), Bob's predictions and Ann's predictions will grow close with $Q$-probability 1.
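Here is a small simulation sketch of this merging (not from the theorem itself: the Beta priors, the "true" bias 0.7, and the helper predictive_heads are all illustrative assumptions). Both agents model the coin as i.i.d. with an unknown bias and put different full-support priors on it, which makes their priors over sequences mutually absolutely continuous; the printout shows their one-step-ahead predictions of heads drawing together.

```python
import random

random.seed(0)
TRUE_BIAS = 0.7          # hypothetical data-generating bias
ANN_PRIOR = (1.0, 1.0)   # Ann: Beta(1, 1), a uniform prior on the bias
BOB_PRIOR = (5.0, 2.0)   # Bob: Beta(5, 2), initially leaning toward heads

def predictive_heads(prior, heads, tails):
    """Posterior predictive probability of heads under a Beta(a, b) prior."""
    a, b = prior
    return (a + heads) / (a + b + heads + tails)

heads = tails = 0
for n in range(1, 10001):
    if random.random() < TRUE_BIAS:
        heads += 1
    else:
        tails += 1
    if n in (1, 10, 100, 1000, 10000):
        ann = predictive_heads(ANN_PRIOR, heads, tails)
        bob = predictive_heads(BOB_PRIOR, heads, tails)
        print(f"n={n:5}  Ann={ann:.4f}  Bob={bob:.4f}  |diff|={abs(ann - bob):.4f}")
```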

This theorem gives a philosophical justification to subjective Bayesianism, the branch of Bayesianism that holds that there is no objective, uniquely rational prior. Bayesian agents may start out with different priors, but as long as they agree on which events are possible at all, their predictions will merge as they keep receiving more evidence. Another possible interpretation is that Q describes the true data-generating process: as long as our prior probability distribution assigns positive probability to the true hypothesis, we will converge to accurate predictions of the future.

Formulation of the theorem

To formulate the statement precisely, I will need the concepts of:

  • regular conditional probability
  • variational distance

Definition (kernel)

Given measurable spaces $(Y, T)$ and $(X, S)$, a kernel is a function $\nu : Y \times S \to [0, \infty]$ satisfying two conditions:

  1. For any $y \in Y$, the map $E \mapsto \nu(y, E)$ is a measure on $S$;
  2. For any $E \in S$, the map $y \mapsto \nu(y, E)$ is a $T$-measurable function.

Definition (product regular conditional probability)

Let $P$ be a probability measure on a product measurable space $(X \times Y, S \otimes T)$. A product regular conditional probability (product rcp) given $Y$ is a kernel $\nu : Y \times S \to [0, 1]$ such that for all $E \in S$ and $F \in T$

$$P(E \times F) = \int_F \nu(y, E) \, dP_Y(y),$$

where $P_Y$ is the marginal distribution of $P$ on $Y$: $P_Y(F) = P(X \times F)$.

Definition (predictive probability).

Suppose we have an infinite sequence of measurable spaces $(X_i, S_i)$, $i = 1, 2, \ldots$. A probability measure $P$ defined on their product space $X = \prod_{i=1}^{\infty} X_i$ is called predictive if for every $n$ there exists a product regular conditional probability given the first $n$ coordinates, i.e. a kernel $P_n : \left(\prod_{i=1}^{n} X_i\right) \times \left(\bigotimes_{i>n} S_i\right) \to [0, 1]$ such that for every measurable $F \subseteq \prod_{i=1}^{n} X_i$ and every $E \in \bigotimes_{i>n} S_i$

$$P(F \times E) = \int_F P_n(y, E) \, dP_{\le n}(y),$$

where $P_{\le n}$ is the marginal of $P$ on the first $n$ coordinates.

The Blackwell-Dubins theorem requires that the probability measure P is predictive. (Then from absolute continuity of Q w.r.t. P it follows that Q is also predictive - prove it!) In our coin-tossing example any probability measure will be predictive, because regular conditional probabilities always exist when conditioning on discrete measurable spaces [2]. The disintegration theorem shows that regular conditional probabilities always exist for Borel measures when the product and the space we condition on are both Radon spaces. However, they do not always exist, and at the end of the post I will provide an example of a product measure with no regular conditional probabilities.
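For instance, when the first $n$ coordinates take countably many values (as with coin tosses), an explicit product rcp is given by ordinary conditioning on the observed prefix; a sketch, with any fixed measure (say, the unconditional distribution of the tail) used on the $P$-null prefixes:

$$P_n\bigl((x_1, \dots, x_n),\, E\bigr) \;=\; \frac{P\bigl(\{(x_1, \dots, x_n)\} \times E\bigr)}{P\bigl(\{(x_1, \dots, x_n)\} \times \prod_{i > n} X_i\bigr)} \quad \text{whenever the denominator is positive.}$$

Measurability in the first argument is automatic, because the prefix space is countable.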

Definition (variational distance)

Variational distance between two probability measures P and Q defined on the same sigma-algebra F is defined as $d(P, Q) = \sup_{E \in F} |P(E) - Q(E)|$.

Exercise

Prove that it is a metric.
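For finite outcome spaces the supremum is attained at $E = \{x : P(\{x\}) > Q(\{x\})\}$ and equals half the $\ell_1$ distance between the probability vectors; a minimal sketch (the function name is mine, not part of the post):

```python
def variational_distance(p, q):
    """d(P, Q) = sup_E |P(E) - Q(E)| for distributions over the same finite set."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# One toss of a 2/3-biased coin vs a fair coin:
print(variational_distance({"H": 2/3, "T": 1/3}, {"H": 1/2, "T": 1/2}))  # 1/6 = 0.1667
```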

Blackwell-Dubins theorem

Let $(X_i, S_i)$, $i = 1, 2, \ldots$ be an infinite sequence of measurable spaces, and $P$ a predictive probability on the product space $X = \prod_{i=1}^{\infty} X_i$. Suppose $Q$ is a probability measure absolutely continuous with respect to $P$. Then for every sequence of product regular conditional probabilities $P_n$ for $P$ there exists a sequence of product regular conditional probabilities $Q_n$ for $Q$ such that the variational distance $d\bigl(P_n((x_1, \ldots, x_n), \cdot),\, Q_n((x_1, \ldots, x_n), \cdot)\bigr) \to 0$ as $n \to \infty$ everywhere except for a set of $Q$-measure zero.

Example

Let's consider an example. Suppose Ann believes that the coin is biased and the probability of heads is 2/3, and Bob believes the coin is fair (probability of heads is 1/2). These measures have product regular conditional probabilities $P_n$ and $Q_n$ (conditioning changes nothing: under either prior the future tosses are still i.i.d. with the same bias), and they do not get any closer. Why not? (Scroll down for answer).

 

 

 

 

 

That is because the probabilities P and Q are not absolutely continuous with respect to each other. Suppose $X_1, X_2, \ldots$ is an infinite sequence of i.i.d. binary random variables with $\Pr(X_i = 1) = m$. It follows from the strong law of large numbers that $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ converges to $m$ almost surely, so P assigns measure 0 to the set of all sequences with $\lim_{n} \bar{X}_n = 1/2$, while Q assigns measure 1 to it.
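To see this numerically, here is a quick illustrative simulation (my own sketch, not part of the original argument): sampling tosses from Q, the running frequency of heads settles near 1/2, an event to which P assigns probability 0. Meanwhile both agents' next-toss predictions stay fixed at 2/3 and 1/2, a variational distance of 1/6 that never shrinks.

```python
import random

random.seed(1)
heads = 0
for n in range(1, 100_001):
    heads += random.random() < 0.5   # toss sampled from Q (fair coin)
    if n in (100, 1_000, 10_000, 100_000):
        print(f"n={n:6}  frequency of heads = {heads / n:.4f}")  # approaches 1/2, not 2/3
```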

Discussion

The absolute continuity of prior distributions is not a trivial assumption. There is generally no probability measure with respect to which all other measures are absolutely continuous. For instance, if the measurable space has uncountably many atoms, then, for any probability measure $\mu$ and any $n$, at most $n$ of the atoms can have $\mu$-measure $> 1/n$, so at most countably many can have positive $\mu$-measure. And if $S$ is an atom with $\mu$-measure 0, the measure $\delta_S$ defined as "$\delta_S(E) := 1$ if $S \subseteq E$, 0 otherwise" will not be absolutely continuous w.r.t. $\mu$.

The Universal Prior, the formalization of Occam's razor, assigns positive probabilities to all computable hypotheses. But what if the universe is uncomputable? Is all the Bayesian updating we do bringing us any closer to the truth? Well, I am not yet sure how to answer that. As gedymin points out, the Universal Prior is still useful, because many physical processes are known to be at least approximately computable. But the computability of the universe itself is an open problem, known as the Physical Computability Thesis or the physical Church-Turing thesis.

Another assumption the theorem makes is that the agents are certain about the evidence they receive. But in real life we are often uncertain about the evidence, because our measurements are noisy, or because we interpret the results of experiments through models we are not confident in. Simon Huttegger investigates updating on uncertain evidence in "Merging of opinions and probability kinematics" [3], which also has a super accessible introduction to the merging of opinions theorem and its implications.

Now, as promised:

Example of a product measure with no product rcp.

Notation: 

If $E$ is a subset of a set $X$, $E^c$ denotes its complement, $E^c = X \setminus E$.

Consider the product space of the segment $[0, 1]$ with Lebesgue sigma-algebra $L$ and the real line with Borel sigma-algebra $B$. Define the measure $\mu$ on this product as the pushforward of the Lebesgue measure $m$ on the segment $[0, 1]$ under the diagonal map $g : [0, 1] \to [0, 1] \times \mathbb{R}$, $g(x) = (x, x)$. This measure has no product rcp.

Proof.

Suppose that $\nu : \mathbb{R} \times L \to [0, 1]$ is a product rcp, so for any sets $E \in L$, $F \in B$:

$$\mu(E \times F) = \int_F \nu(y, E) \, d\mu_Y(y),$$

where $\mu_Y$, the marginal of $\mu$, is the Borel measure on $\mathbb{R}$ given by $\mu_Y(F) = \mu([0, 1] \times F)$. For the injection $f : [0, 1] \to \mathbb{R}$, $f(x) = x$, this makes $\nu$ a quotient rcp for $([0, 1], L, m)$ and $f$, defined below:

Definition

Given a measure space $(X, S, P)$ and a Borel measurable function $f : X \to \mathbb{R}$, a quotient rcp is a kernel $\nu : \mathbb{R} \times S \to [0, 1]$ such that for all $E \in S$ and all Borel sets $B \subseteq \mathbb{R}$,

$$P\bigl(E \cap f^{-1}(B)\bigr) = \int_B \nu(y, E) \, d(f_* P)(y),$$

where $f_* P$ is the pushforward of $P$ under $f$: $f_* P(B) = P\bigl(f^{-1}(B)\bigr)$.
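To verify the claim above (a one-line check under the definitions just given): for a Lebesgue measurable $E \subseteq [0, 1]$ and a Borel set $B \subseteq \mathbb{R}$,

$$m\bigl(E \cap f^{-1}(B)\bigr) = m(E \cap B) = \mu(E \times B) = \int_B \nu(y, E) \, d\mu_Y(y),$$

and $\mu_Y = f_* m$, since $\mu_Y(B) = \mu([0, 1] \times B) = m(B \cap [0, 1]) = m\bigl(f^{-1}(B)\bigr)$. So $\nu$ satisfies the defining identity of a quotient rcp for $([0, 1], L, m)$ and $f$.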

Lemma.

Given a probability space $(X, S, P)$, a Borel measurable function $f : X \to \mathbb{R}$ and a quotient rcp $\nu$, there exists a Borel set $B$ with $f_* P(B) = 1$, such that $\nu\bigl(y, f^{-1}(\{y\})\bigr) = 1$ for all $y \in B$.

There is a proof in [2] (lemma 1), or you can prove it yourself as an exercise:

Step 1. Prove that for any Borel set $E$, there exists a Borel set $B_E$ such that

$\nu\bigl(y, f^{-1}(E)\bigr) = 1$ for all $y \in B_E \cap E$;

$\nu\bigl(y, f^{-1}(E)\bigr) = 0$ for all $y \in B_E \cap E^c$;

$f_* P(B_E) = 1$.

Step 2. Let $(I_n)_{n \in \mathbb{N}}$ be an enumeration of all intervals with rational endpoints, $B_{I_n}$ the corresponding Borel sets from step 1. Let $B = \bigcap_{n} B_{I_n}$. Then $f_* P(B) = 1$ and $\nu\bigl(y, f^{-1}(\{y\})\bigr) = 1$ for all $y \in B$.
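One possible route for step 1 (a sketch using only the defining identity of a quotient rcp): for a Borel set $E$,

$$\int_E \nu\bigl(y, f^{-1}(E)\bigr) \, d(f_* P)(y) = P\bigl(f^{-1}(E) \cap f^{-1}(E)\bigr) = f_* P(E),$$

and taking the $S$-set to be all of $X$ in the identity shows $\nu(y, X) = 1$ for $f_* P$-almost all $y$, so the integrand is bounded by 1 almost everywhere; an integrand bounded by 1 whose integral over $E$ equals $f_* P(E)$ must equal 1 almost everywhere on $E$. The same identity over $E^c$ gives $\int_{E^c} \nu\bigl(y, f^{-1}(E)\bigr) \, d(f_* P)(y) = P\bigl(f^{-1}(E) \cap f^{-1}(E^c)\bigr) = 0$, so the integrand vanishes almost everywhere on $E^c$.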

By the lemma, there exists a Borel set $B \subseteq [0, 1]$ of measure 1 ($\mu_Y(B) = 1$) such that $\nu(y, \{y\}) = 1$ for all $y \in B$ (for the injection $f$, $f^{-1}(\{y\}) = \{y\}$). Then for any $y \in B$ and any Lebesgue measurable set $E$, $\nu(y, E) = 1$ if $y \in E$ and $\nu(y, E) = 0$ otherwise, so $\nu(y, E) = \mathbf{1}_E(y)$. Hence for any Lebesgue measurable set $E$, $\{y \in B : \nu(y, E) = 1\} = E \cap B$. Now take a set $A \subseteq B$ that is Lebesgue measurable but not Borel measurable. Then $\{y : \nu(y, A) = 1\} \cap B = A$, so if $y \mapsto \nu(y, A)$ were Borel measurable, $A$ would be Borel; therefore the function $y \mapsto \nu(y, A)$ is not measurable, contradicting the definition of a kernel.

It remains to prove that any set of positive Borel measure has a subset that is Lebesgue measurable but not Borel measurable. The real numbers are a Polish space (a separable completely metrizable topological space). A set of positive Borel measure must be uncountable. The Cantor-Bendixson theorem implies that every uncountable Polish space has a nonempty perfect subset (a closed subset with no isolated points). For every perfect Polish space M, there is a continuous injection of the Cantor set into M ([4], 1.3G); by choosing the pieces in the construction small enough, the image can be taken to have Lebesgue measure zero. This copy of the Cantor set has cardinality continuum, so it has $2^{\mathfrak{c}}$ subsets, and since it has measure 0, all of them are Lebesgue measurable. The cardinality of the Borel sigma-algebra is only continuum. Therefore this copy of the Cantor set, and hence our set, has a subset that is Lebesgue but not Borel measurable. QED.

[1] Blackwell, David, and Lester Dubins. "Merging of Opinions with Increasing Information." The Annals of Mathematical Statistics, vol. 33, no. 3, 1962, pp. 882–886. http://www.jstor.org/stable/2237864

[2] Faden, Arnold M. "The Existence of Regular Conditional Probabilities: Necessary and Sufficient Conditions." The Annals of Probability, vol. 13, no. 1, 1985, pp. 288–298. https://doi.org/10.1214/aop/1176993081

[3] Huttegger, Simon M. "Merging of Opinions and Probability Kinematics." The Review of Symbolic Logic, vol. 8, 2015, pp. 611–648.

[4] Moschovakis, Yiannis N. Descriptive Set Theory. 1980.

Comments

TLW:

Another case that violates the preconditions is if the information source is not considered to be perfectly reliable.

Imagine the following scenario:

Charlie repeatedly flips a coin and tells Alice and Bob the results.

Alice and Bob are choosing between the following hypotheses:

  1. The coin is fair.
  2. The coin always comes up heads.
  3. The coin is fair, but Charlie only reports when the coin comes up heads.

Alice has a prior of 40% / 40% / 20%. Bob has a prior of 40% / 20% / 40%.

Now, imagine that Charlie repeatedly reports 'heads'. What happens?

Answer: Alice asymptotes towards 0% / 66.7% / 33.3%; Bob asymptotes towards 0% / 33.3% / 66.7%. Their opinions remain distinct.
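A quick sketch checking these numbers (assuming, as the scenario implies, that each 'heads' report has likelihood 1/2 under hypothesis 1 and likelihood 1 under hypotheses 2 and 3; the posterior helper is illustrative):

```python
def posterior(prior, n_heads_reports):
    """Posterior over (fair, always-heads, fair-but-only-heads-reported)."""
    likelihoods = (0.5 ** n_heads_reports, 1.0, 1.0)
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [x / total for x in unnormalized]

print(posterior([0.4, 0.4, 0.2], 50))  # Alice: ~[0.0, 0.667, 0.333]
print(posterior([0.4, 0.2, 0.4], 50))  # Bob:   ~[0.0, 0.333, 0.667]
```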

There was a good simulation of a more complicated scenario with many agents exhibiting much the same effect somewhere on this site, but I can't find it. Admittedly, I did not look particularly hard.