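Proposition 39: Given a crisp infradistribution ζ□ over N, an infrakernel K from N to infradistributions over X, and suggestively abbreviating K(i) as Hi (hypothesis i) and K∗(ζ□) as Eζ□Hi (your infraprior where you have Knightian uncertainty over how to mix the hypotheses), then
$$((\mathbb{E}_{\zeta^\square}H_i)|^g_L)(f)=\frac{\mathbb{E}_{\zeta^\square}\left(P^g_{H_i}(L)\cdot(H_i|^g_L)(f)+H_i(0\bigstar_L g)\right)-\mathbb{E}_{\zeta^\square}\left(H_i(0\bigstar_L g)\right)}{\mathbb{E}_{\zeta^\square}\left(H_i(1\bigstar_L g)\right)-\mathbb{E}_{\zeta^\square}\left(H_i(0\bigstar_L g)\right)}$$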
Proof: Assume that L and g are functions of type X→[0,1] and X→R respectively, ie, likelihood and utility don't depend on which hypothesis you're in, just what happens. First, unpack our abbreviations and what an update means.
Then use the definition of an infrakernel pushforward.
For the next thing, we're just making types a bit more explicit, f,L,g only depend on x, not i.
Then we pack the semidirect product back up.
And pack the update back up.
At this point, we invoke the Infra-Disintegration Theorem.
We unpack what our new modified prior is, via the Infra-Disintegration Theorem.
and unpack the semidirect product.
Now we unpack α and β.
And unpack what K′ is
And reabbreviate K(i) as Hi,
And then pack it back up into a suggestive form as a sort of expectation.
And we're done.
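As a sanity check on the statement of Proposition 39, here is a minimal numerical sketch on a three-point X with two hypotheses. Everything in it is a toy assumption rather than the construction used in the proof: infradistributions are modelled as minima over finite sets of a-measures, the crisp ζ□ as a minimum over a finite set of probability distributions, and the update is taken to be (h|gL)(f) = (h(f★Lg) − h(0★Lg)) / (h(1★Lg) − h(0★Lg)) with f★Lg = L·f + (1−L)·g and PgH(L) = H(1★Lg) − H(0★Lg). The helper names (`expect`, `star`, `update`, `crisp_expect`) are made up for this example.

```python
from fractions import Fraction as F

def expect(a_measures, f):
    """Toy infradistribution: min over a-measures (m, b) of m(f) + b."""
    return min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in a_measures)

def star(f, L, g):
    """f star_L g, taken here to be L*f + (1 - L)*g pointwise."""
    return [l * fi + (1 - l) * gi for fi, l, gi in zip(f, L, g)]

def update(h, f, L, g):
    """Assumed update rule (h|^g_L)(f) for a functional h given as a callable."""
    zero, one = [F(0)] * len(L), [F(1)] * len(L)
    low = h(star(zero, L, g))
    return (h(star(f, L, g)) - low) / (h(star(one, L, g)) - low)

def crisp_expect(zeta, values):
    """Crisp infradistribution over hypothesis indices: worst-case mixture."""
    return min(sum(w * v for w, v in zip(dist, values)) for dist in zeta)

# Two hypotheses, each a finite set of (probability measure, b = 0) pairs.
hyps = [
    [([F(1, 2), F(1, 4), F(1, 4)], F(0)), ([F(1, 5), F(2, 5), F(2, 5)], F(0))],
    [([F(1, 10), F(1, 10), F(4, 5)], F(0))],
]
zeta = [[F(1, 2), F(1, 2)], [F(1, 5), F(4, 5)]]
L = [F(1, 2), F(1), F(1, 4)]   # likelihood, strictly positive here
g = [F(0), F(1), F(2)]         # off-event utility
f = [F(1), F(0), F(1, 2)]

H = [lambda u, s=s: expect(s, u) for s in hyps]
mixture = lambda u: crisp_expect(zeta, [Hi(u) for Hi in H])

zero, one = [F(0)] * 3, [F(1)] * 3
lhs = update(mixture, f, L, g)  # update the infraprior directly

low = [Hi(star(zero, L, g)) for Hi in H]
high = [Hi(star(one, L, g)) for Hi in H]
prob = [hi - lo for hi, lo in zip(high, low)]   # P^g_{H_i}(L)
inner = [p * update(Hi, f, L, g) + lo for p, Hi, lo in zip(prob, H, low)]
rhs = (crisp_expect(zeta, inner) - crisp_expect(zeta, low)) / (
    crisp_expect(zeta, high) - crisp_expect(zeta, low))

print(lhs, rhs)
```

Exact rational arithmetic is used so the two sides can be compared with `==` rather than a tolerance. The check only exercises one example with every PgHi(L) positive; it does not replace the proof.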
Proposition 40: If a likelihood function L:X→[0,1] is 0 when f(x)<a, and f≥0 and a>0, then h(L⋅a)≤h(f)
And then we apply Markov's inequality, that for any probability distribution,
Also, 1_{f(x)≥a} ≥ L, where 1_{f(x)≥a} is the indicator function of the set where f(x)≥a (because L is 0 when f(x)<a), so monotonicity means that
So, we can get:
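A quick numerical illustration of Proposition 40, as a sketch under toy assumptions: X is a four-point set, and h is modelled as the minimum over a small finite set of a-measures (m, b). The specific numbers and the helper `expect` are invented for this example.

```python
def expect(a_measures, f):
    """Toy infradistribution: min over a-measures (m, b) of m(f) + b."""
    return min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in a_measures)

f = [0.0, 0.3, 0.5, 2.0]   # nonnegative function
a = 0.5                    # positive threshold
L = [0.0, 0.0, 0.9, 0.4]   # likelihood in [0, 1], zero wherever f(x) < a

h_set = [
    ([0.25, 0.25, 0.25, 0.25], 0.0),
    ([0.1, 0.6, 0.2, 0.1], 0.0),
    ([0.0, 0.0, 0.5, 0.5], 0.1),
]

lhs = expect(h_set, [a * l for l in L])   # h(L * a)
rhs = expect(h_set, f)                    # h(f)
print(lhs, "<=", rhs)
```

The inequality holds here for the same reason as in the proof: a·L ≤ a·1_{f(x)≥a} ≤ f pointwise, and h is monotone.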
Proposition 41: The IKR-metric is a metric.
So, symmetry is obvious, as is one direction of identity of indiscernibles (that the distance from an infradistribution to itself is 0). That just leaves the triangle inequality and the other direction of identity of indiscernibles. For the triangle inequality, observe that for any particular f (instead of the supremum), it would fulfill the triangle inequality, and then it's an easy exercise for the reader to verify that the same property applies to the supremum, so the only tricky part is the reverse direction of identity of indiscernibles, that two infradistributions which have a distance of 0 are identical.
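The "easy exercise" about the supremum can be sketched numerically. The code below assumes the IKR-distance has the form sup over f of |h(f)−h′(f)| / max(1, Li(f), ||f||), and replaces the supremum by a maximum over a finite grid of test functions on a three-point subset of the real line, so it only computes a lower bound on the true distance. The infradistributions are toy minima over finite sets of a-measures, and all names are hypothetical.

```python
from itertools import combinations, product

points = [0.0, 0.5, 1.0]  # toy X, a subset of the real line

def expect(a_measures, f):
    """Toy infradistribution: min over a-measures (m, b) of m(f) + b."""
    return min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in a_measures)

def lipschitz(f):
    """Lipschitz constant of f on the finite set `points`."""
    return max(abs(f[i] - f[j]) / abs(points[i] - points[j])
               for i, j in combinations(range(len(points)), 2))

def discrepancy(h1, h2, f):
    """Normalized disagreement of h1 and h2 on a single test function f."""
    scale = max(1.0, lipschitz(f), max(abs(v) for v in f))
    return abs(expect(h1, f) - expect(h2, f)) / scale

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
test_functions = list(product(grid, repeat=len(points)))

def d_sample(h1, h2):
    """Max over the finite family: a stand-in for the supremum."""
    return max(discrepancy(h1, h2, f) for f in test_functions)

h1 = [([0.2, 0.3, 0.5], 0.0)]
h2 = [([0.5, 0.25, 0.25], 0.0), ([0.1, 0.1, 0.8], 0.0)]
h3 = [([0.3, 0.4, 0.3], 0.0), ([0.6, 0.2, 0.2], 0.0)]

d12, d23, d13 = d_sample(h1, h2), d_sample(h2, h3), d_sample(h1, h3)
print(d13, "<=", d12 + d23)
```

For each fixed f the discrepancy obeys the triangle inequality, and maximizing both sides over f preserves it, because max(x+y) ≤ max(x) + max(y).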
First, if dIKR(h,h′)=0, then h and h′ must perfectly agree on all the Lipschitz functions. And then, because uniformly continuous functions are the uniform limit of Lipschitz functions, h and h′ must perfectly agree on all the uniformly continuous functions.
Now, we're going to need a somewhat more sophisticated argument. Let's say that the sequence fn is uniformly bounded and limits to f in CB(X) equipped with the compact-open topology (ie, we get uniform convergence of fn to f on all compact sets). Then, for any infradistributions, h(fn) will limit to h(f). Here's why. For any ϵ, there's some compact set Cϵ that accounts for almost all of why a function inputted into an infradistribution has the value it does. Then, what we can do is realize that h(fn) will, in the limit, be incredibly close to h(f), due to fn and f disagreeing by a bounded amount outside the set Cϵ and only disagreeing by a tiny amount on the set Cϵ, and the Lipschitzness of h.
Further, according to this mathoverflow answer, uniformly continuous functions are dense in the space of all continuous functions when CB(X) is equipped with the compact-open topology, so given any function f, we can find a sequence of uniformly continuous functions fn limiting to f in the compact-open topology, and then,
And so, h and h′ agree on all continuous functions, and are identical, if they have a distance of 0, giving us our last piece needed to conclude that dIKR is a metric.
Proposition 42: The IKR-metric for infradistributions is strongly equivalent to the Hausdorff distance (w.r.t. the KR-metric) between their corresponding infradistribution sets.
Let's show both directions of this. For the first one, if the Hausdorff-distance between H,H′ is dhau(H,H′), then for all a-measures (m,b) in H, there's an a-measure (m′,b′) in H′ that's only dhau(H,H′) or less distance away, according to the KR-metric (on a-measures).
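To make the Hausdorff-distance statement concrete, here is a small sketch on a two-point space X = {x0, x1} with d(x0, x1) = 1. On this toy space the 1-Lipschitz functions into [−1,1] form a hexagon in the plane, so the KR-distance between a-measures (taken here as the largest disagreement over those functions) is a maximum over six vertices. The finite sets `H` and `H_prime` stand in for generators of infradistribution sets; all names and numbers are assumptions for illustration.

```python
# Vertices of {(u, v): |u| <= 1, |v| <= 1, |u - v| <= 1}, i.e. the extreme
# 1-Lipschitz functions into [-1, 1] on a two-point space at distance 1.
VERTICES = [(1, 1), (1, 0), (0, -1), (-1, -1), (-1, 0), (0, 1)]

def kr_distance(p, q):
    """Largest disagreement of two a-measures on 1-Lipschitz f in [-1, 1]."""
    (m, b), (m2, b2) = p, q
    return max(abs((m[0] - m2[0]) * u + (m[1] - m2[1]) * v + (b - b2))
               for u, v in VERTICES)

def hausdorff(A, B, dist):
    """Hausdorff distance between two finite sets under the metric `dist`."""
    def one_sided(S, T):
        return max(min(dist(s, t) for t in T) for s in S)
    return max(one_sided(A, B), one_sided(B, A))

def expect(a_measures, f):
    """Toy infradistribution: min over a-measures (m, b) of m(f) + b."""
    return min(m[0] * f[0] + m[1] * f[1] + b for m, b in a_measures)

H = [((0.5, 0.5), 0.0), ((0.8, 0.2), 0.0)]
H_prime = [((0.4, 0.6), 0.1), ((0.9, 0.1), 0.0)]

d = hausdorff(H, H_prime, kr_distance)
print("Hausdorff distance:", d)
for f in VERTICES:
    print(f, abs(expect(H, f) - expect(H_prime, f)))
```

Every point of `H` has a partner in `H_prime` within distance `d`, and the printed gaps between the two expectation functionals stay below `d` on the 1-Lipschitz, 1-bounded test functions.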
Now, by LF-duality, a-measures in H correspond to hyperplanes above h. Two a-measures being dhau(H,H′) apart means, by the definition of the KR-metric for a-measures, that they will assign values at most dhau(H,H′) distance apart for 1-Lipschitz functions in [−1,1].
So, translating to the concave functional view of things, H and H′ being dhau(H,H′) apart means that every hyperplane above h has another hyperplane above h′ that can only differ on the 1-Lipschitz 1-bounded functions by at most dhau(H,H′), and vice-versa.
Let's say we've got a Lipschitz function f. Fix an affine functional/hyperplane ψh that touches the graph of h at f. Let's try to set an upper bound on what h′(f) can be. If f is 1-Lipschitz and 1-bounded, then we can craft a ψh′ above h′ that's nearby, and
Symmetrically, we can swap h′ and h to get h(f)≤h′(f)+dhau(H,H′), and put them together to get:
For the 1-Lipschitz functions.
Let's tackle the case where f is either more than 1-Lipschitz, or strays outside of [−1,1]. In that case, f/max(Li(f),||f||) is 1-Lipschitz and bounded in [−1,1]. We can craft a ψh′ that only differs on 1-Lipschitz functions by dhau(H,H′) or less. Then, since, for affine functionals, ψ(ax)=a(ψ(x)−ψ(0))+ψ(0) and using that ψh′ and ψh are close on 1-Lipschitz functions, which f/max(Li(f),||f||) and 0 are, we can go:
And then we swap out ψh′ for ψh with a known penalty in value, we're taking an overestimate at this point.
This argument works for all h. And, even though we just got an upper bound, to rule out h′(f) being significantly below h(f), we could run through the same upper bound argument with h′ instead of h, to show that h(f) can't be more than 2dhau(H,H′)⋅(max(Li(f),||f||)) above h′(f).
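The rescaling step in this argument leans on the identity ψ(ax)=a(ψ(x)−ψ(0))+ψ(0) for affine functionals. A minimal check, assuming ψ is given by an a-measure (m, b) on a two-point space so that ψ(f) = m(f) + b; the numbers are arbitrary, and the sup norm alone is used as the rescaling factor to keep the example short.

```python
from fractions import Fraction as F

def psi(m, b, f):
    """Affine functional induced by an a-measure: psi(f) = m(f) + b."""
    return sum(mi * fi for mi, fi in zip(m, f)) + b

m, b = [F(1, 3), F(2, 3)], F(1, 4)
f = [F(3), F(-2)]                  # strays outside [-1, 1]

a = max(abs(v) for v in f)         # rescaling factor (sup norm only, for brevity)
x = [v / a for v in f]             # rescaled function, bounded in [-1, 1]
zero = [F(0)] * len(f)

lhs = psi(m, b, [a * v for v in x])                          # psi(a x) = psi(f)
rhs = a * (psi(m, b, x) - psi(m, b, zero)) + psi(m, b, zero)
print(lhs, rhs)
```

This is why closeness of ψh and ψh′ on the rescaled function and on 0 transfers to f itself, at the cost of a factor proportional to a.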
So, for all Lipschitz f, |h(f)−h′(f)|≤2dhau(H,H′)⋅(max(Li(f),||f||,1)). Thus, for all Lipschitz f,
This establishes one part of our inequalities. Now for the other direction.
Here's how things are going to work. Let's say we know the IKR-distance between h and h′. Our task will be to stick an upper bound on the Hausdorff-distance between H and H′. Remember that the Hausdorff-distance being low is equivalent to "any hyperplane above h has a corresponding hyperplane above h′ that attains similar values on the 1-or-less-Lipschitz functions".
So, let's say we've got h, and a ψh≥h. Our task is, knowing h′, to craft a hyperplane above h′ that's close to ψh on the 1-Lipschitz functions. Then we can just swap h′ and h, and since every hyperplane above h is close (on the 1-Lipschitz functions) to a hyperplane above h′, and vice-versa, H and H′ can be shown to be close. We'll use Hahn-Banach separation for this one.
Accordingly, let the set A be the set of f,b where (f,b)=p(f′,b′)+(1−p)(f∗,b∗), and:
That's... quite a mess. It can be thought of as the convex hull of the hypograph of h′, and the hypograph of ψh restricted to the 1-Lipschitz functions in [−1,1] and shifted down a bit. If there was a ψh′ that cuts into h′ and scores lower than it, ie ψh′(f∗)<h′(f∗), we could have p=0, and b∗=ψh′(f∗)<h′(f∗) to observe that ψh′ cuts into the set A. Conversely, if an affine functional doesn't cut into the set A, then it lies on-or-above the graph of h′.
Similarly, if ψh′ undershoots ψh−dIKR(h,h′) over the 1-or-less-Lipschitz functions in [−1,1], it'd also cut into A. Conversely, if the hyperplane ψh′ doesn't cut into A, then it sticks close to ψh over the 1-or-less-Lipschitz functions.
This is pretty much what A is doing. If we don't cut into it, we're above h′ and not too low on the functions with a Lipschitz norm of 1 or less.
For Hahn-Banach separation, we must verify that A is convex and open. Convexity is pretty easy.
First verification: Those numbers at the front add up to 1 (easy to verify), are both in [0,1] (this is trivial to verify), and qp1+(1−q)p2 isn't 1 (this is a mix of two numbers that are both below 1, so this is easy). Ok, that condition is down. Next up: Is our mix of f′1 and f′2 1-Lipschitz and in [−1,1]? Yes, the mix of 1-Lipschitz functions in that range is 1-Lipschitz and in that range too. Also, is our mix of f∗1 and f∗2 still in CB(X)? Yup.
That leaves the conditions on the b terms. For the first one, just observe that mixing two points that lie strictly below ψh−dIKR(h,h′) (a hyperplane) lies strictly below it as well. For the second one, since h′ is concave, mixing two points that lie strictly below its graph also lies strictly below its graph. Admittedly, there may be divide-by-zero errors, but only when qp1+(1−q)p2 is 0, in which case, we can have our new f′ and b′ be anything we want as long as it fulfills the conditions, it still defines the same point (because that term gets multiplied by 0 anyways). So A is convex.
But... is A open? Well, observe that the region under the graph of h′ on CB(X) is open, due to Lipschitzness of h′. We can wiggle b and f around a tiny tiny little bit in any direction without matching or exceeding the graph of h′. So, given a point in A, fix your tiny little open ball around (f∗,b∗). Since p can't be 1, when you mix with (f′,b′), you can do the same mix with your little open ball instead of the center point, and it just gets scaled down (but doesn't collapse to a point), making a little tiny open ball around your arbitrarily chosen point in A. So A is open.
Now, let's define a B that should be convex, so we can get Hahn-Banach separation going (as long as we can show that A and B are disjoint). It should be chosen to forbid our separating hyperplane being too much above ψh over the 1-or-less Lipschitz functions. So, let B be:
Obviously, cutting into this means your hyperplane is too far above ψh over the 1-or-less-Lipschitz functions in [−1,1]. And it's obviously convex, because 1-or-less-Lipschitz functions in [−1,1] are a convex set, and so is the region above a hyperplane (ψh+dIKR(h,h′)).
All we need to do now for Hahn-Banach separation is show that the two sets are disjoint. We'll assume there's a point in both of them and derive a contradiction. So, let's say that (f,b) is in both A and B. Since it's in B,
But also, (f,b)=p(f′,b′)+(1−p)(f∗,b∗) with the f's and b's and p fulfilling the appropriate properties, because it's in A. Since b∗<h′(f∗) and b′<ψh(f′)−dIKR(h,h′), we'll write b∗ as h′(f∗)−δ∗ and b′ as ψh(f′)−dIKR(h,h′)−δ′, where δ∗ and δ′ are nonzero. Thus, we rewrite as:
We'll be folding −pδ′−(1−p)δ∗ into a single −δ term so I don't have to write as much stuff. Also, ψh is an affine function, so we can split things up with that, and make:
Remember, ψh(f∗)≥h(f∗) because ψh≥h. So, we get:
And, if h(f∗)≥h′(f∗), we get a contradiction straightaway because the left side is negative, and the right side is nonnegative. Therefore, h′(f∗)>h(f∗), and we can rewrite as:
And now, we should notice something really really important. Since p can't be 1, f∗ does constitute a nonzero part of f, because f=pf′+(1−p)f∗.
However, f is a 1-or-less Lipschitz function, and bounded in [−1,1], due to being in B! If f∗ wasn't Lipschitz, then given any slope, you could find areas where it's ascending faster than that rate. This still happens when it's scaled down, and f′ can only ascend or descend at a rate of 1 or slower there since it's 1-Lipschitz as well. So, in order for f to be 1-or-less Lipschitz, f∗ must be Lipschitz as well. Actually, we get something stronger, if f∗ has a really high Lipschitz constant, then p needs to be pretty high. Otherwise, again, f wouldn't be 1-or-less Lipschitz, since 1−p of it is composed of f∗, which has areas of big slope. Further, if f∗ has a norm sufficiently far away from 0, then p needs to be pretty high, because otherwise f wouldn't be in [−1,1], since 1−p of it is composed of f∗ which has areas distant from 0.
Our most recent inequality (derived under the assumption that there's a point in A and B) was:
Assuming hypothetically we were able to show that
then because δ>0, we'd get a contradiction, showing that A and B are disjoint. So let's shift our proof target to trying to show
Let's begin. So, our first order of business is that
This should be trivial to verify, remember that p∈[0,1).
Now, f=pf′+(1−p)f∗, and f is 1-Lipschitz, and so is f′. Our goal now is to impose an upper bound on the Lipschitz constant of f∗. Let us assume that said Lipschitz constant of f∗ is above 1. We can find a pair of points where the rise of f∗ from the first point to the next, divided by the distance between the points is exceptionally close to the Lipschitz constant of f∗, or equal. If we're trying to have f∗ slope up as hard as it possibly can while mixing to make f, which is 1-Lipschitz, then the best case for that is one where f′ is sloping down as hard as it can, at a rate of -1. Therefore, we have that
Ie, mixing f∗ sloping up as hard as possible and f′ sloping down as hard as possible had better make something that slopes up at a rate of 1 or less. Rearranging this equation, we get:
We can run through almost the same exact argument, but with the norm of f∗. Let us assume that said norm is above 1. We can find a point where f∗ attains its maximum/minimum, whichever is further from 0. Now, if you're trying to have f∗ be as negative/positive as it possibly can be, while mixing to make f, which lies in [−1,1], then the best case for that is one where f′ is as positive/negative as it can possibly be there, ie, has a value of -1 or 1. In both cases, we have:
Now we can proceed. Since we established that all three of these quantities (1, Lipschitz constant, and norm) are upper bounded by (1+p)/(1−p), we have:
And we have exactly our critical
inequality necessary to force a contradiction. Therefore, A and B must be disjoint. Since A is open and convex, and B is convex, we can do Hahn-Banach separation to get something that touches B and doesn't cut into A.
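The slope bound used in this argument can be checked with a few lines of exact arithmetic. The sketch below treats the extreme case described above, with f′ sloping down at rate −1 and f∗ sloping up at some rate s, and confirms that the mix p·f′+(1−p)·f∗ has slope at most 1 exactly when s ≤ (1+p)/(1−p). The function names are made up for this illustration.

```python
from fractions import Fraction as F

def mixed_slope(p, slope_prime, slope_star):
    """Slope of p*f' + (1-p)*f* where both pieces are linear."""
    return p * slope_prime + (1 - p) * slope_star

def bound(p):
    """Largest slope f* can have if the mix must stay 1-Lipschitz."""
    return (1 + p) / (1 - p)

for p in [F(0), F(1, 3), F(1, 2), F(9, 10)]:
    s = bound(p)
    print(p, s, mixed_slope(p, F(-1), s))
```

At the bound the mixed slope is exactly 1, and any steeper f∗ pushes it above 1. The same computation with values in place of slopes gives the matching bound on the norm of f∗.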
Therefore, we've crafted a ψh′ that lies above h′, and is within dIKR(h,h′) of ψh over the 1-or-less-Lipschitz functions in [−1,1], because it doesn't cut into A and touches B.
This same argument works for any ψh≥h, and it works if we swap h′ and h. Thus, since hyperplanes above the graph of an infradistribution function h or h′ correspond to points in the corresponding H and H′, and we can take any point in H/affine functional above h and make a point in H′/affine functional above h′ (and same if the two are swapped) that approximately agree on C1−lip(X,[−1,1]), there's always a point in the other infradistribution set that's close in KR-distance and so H and H′ have
And with that, we get
And we're done! Hausdorff distance between sets is within a factor of 2 of the IKR-distance between their corresponding infradistributions.
Proposition 43: A Cauchy sequence of infradistributions converges to an infradistribution, ie, the space □X is complete under dIKR.
So, the space of closed subsets of Ma(X) is complete under the Hausdorff metric. Pretty much, by proposition 42, a Cauchy sequence of infradistributions hn in the IKR-distance corresponds to a Cauchy sequence of infradistribution sets Hn converging in Hausdorff-distance, so to verify completeness, we merely need to double-check that the Hausdorff-limit of the Hn sets fulfills the various different properties of an infradistribution. Every point in H∞, the limiting set, has the property that there exists some Cauchy sequence of points from the Hn sets that limit to it, and also every Cauchy sequence of points from the Hn sets has its limit point be in H∞.
So, for nonemptiness, you have a sequence of nonempty sets of a-measures limiting to each other in Hausdorff-distance, so the limit is going to be nonempty.
For upper completion, given any point (m,b)∈H∞, and any (0,b′) a-measure, you can fix a Cauchy sequence (mn,bn)∈Hn limiting to (m,b), and then consider the sequence (mn,bn+b′), which is obviously Cauchy (you're just adding the same amount to everything, which doesn't affect the KR-distance), and limits to (m,b+b′), certifying that (m,b)+(0,b′)∈H∞, so H∞ is upper-complete.
For closure, the Hausdorff limit of a sequence of closed sets is closed.
For convexity, given any two points (m,b) and (m′,b′) in H∞, and any p∈[0,1], we can fix a Cauchy sequence (mn,bn)∈Hn and (m′n,b′n)∈Hn converging to those two points, respectively, and then consider the sequence p(mn,bn)+(1−p)(m′n,b′n), which lies in Hn (due to convexity of all the Hn), and converges to p(m,b)+(1−p)(m′,b′), witnessing that this point is in H∞, and we've just shown convexity.
For normalization, it's most convenient to work with the positive functionals, and observe that, because all the hn(0)=0 and all the hn(1)=1 because of normalization, the same property must apply to the limit, and this transfers over to get normalization for your infradistribution set.
Finally, there's the compact-projection property. We will observe that the projection of the a-measures in Hn to just their measure components, call the set pr(Hn), must converge in Hausdorff-distance. The reason for this is because if they didn't, then you could find some ϵ and arbitrarily late pairs of inframeasures where pr(Hn) and pr(Hm) have Hausdorff-distance >ϵ, and then pick a point in pr(Hn) (or pr(Hm)) that's >ϵ KR-distance away from the other projection. Then you can pair that measure with some gigantic b term to get a point in Hn (or Hm, depending on which one you're picking from), and there'd be no point in Hm (or Hn) within ϵ distance of it, because the measure component would only be able to change by ϵ if you moved that far, and you need to change the measure component by >ϵ to land within Hm (or Hn).
Because this situation occurs infinitely often, it contradicts the Cauchy-sequence-ness of the Hn sequence, so the projections pr(Hn) must converge in Hausdorff distance on the space of measures over X. Further, they're precompact by the compact-projection property for the Hn (which are infradistributions), so their closures are compact. Further, the Hausdorff-limit of a sequence of compact sets is compact, so the Hausdorff limit of the projections pr(Hn) (technically, their closures) is a compact set of measures. Further, any sequence (mn,bn) which converges to some (m,b)∈H∞ has projections mn∈pr(Hn) converging to m, which shows that m is in this Hausdorff limit. Thus, all points in H∞ project down to be in a compact set of measures, and we have compact-projection for H∞, which is the last condition we need to check to see if it's an infradistribution.
So, the Hausdorff-limit of a Cauchy sequence of infradistribution sets is an infradistribution set, and by the strong equivalence of the infra-KR metric and Hausdorff-distance, a Cauchy limit of the infra-KR metric must be an infradistribution, and the space □X is complete under the infra-KR metric.
Proposition 44: If a sequence of infradistributions converges in the IKR distance for one complete metric that X is equipped with, it will converge in the IKR distance for all complete metrics that X could be equipped with.
So, as a brief recap, X could be equipped with many different complete metrics that produce the relevant topology. Each choice of metric affects what counts as a Lipschitz function, affecting the infra-KR metric on infradistributions, as well as the KR-distance between a-measures, and the Hausdorff-distance. So, we need to show that regardless of the metric on X, a sequence of convergent infradistributions will still converge. Use d1 for the original metric on X and d2 for the modified metric on X, and similarly, dKR1 and dKR2 for the KR-metrics on measures, and dhau1,dhau2 for the Hausdorff distances induced by the two metrics.
Remember, our infradistribution sets are closed under adding +b to them, and converge according to dhau1 to the set H∞.
What we'll be doing is slicing up the sets in a particular way. In order to do this, the first result we'll need is that, for all b∗≥1, the set
converges, according to dhau1, to the set
So, here's the argument for this. We know that the projection sets
are precompact, ie, have compact closure, and Hausdorff-limit according to dhau1 to the set
(well, actually, they limit to the closure of that set)
According to our Lemma 3, this means that the set
(well, actually, its closure) is a compact set in the space of measures. Thus, it must have some maximal amount of measure present, call that quantity λ⊙, the maximal Lipschitz constant of any of the infradistributions in the sequence. It doesn't depend on the distance metric X is equipped with.
Now, fix any ϵ. There's some timestep n where, for all greater timesteps, dhau1(Hn,H∞)≤ϵ.
Now, picking a point (mn,bn) in Hn with bn≤b∗−ϵ, we can travel ϵ distance according to dKR1 and get a point in H∞, and the b term can only change by ϵ or less when we move our a-measure a little bit, so we know that our nearby point lies in
But, what if our point (mn,bn) in Hn has b∗−ϵ≤bn≤b∗? Well then, we can pick some arbitrary point (mlon,0)∈Hn (by normalization for Hn), and go:
And then we have to be a little careful. bn≤b∗ by assumption. Also, we can unpack the distance to get
And the worst-case for distance, since all the measures have their total amount of measure bounded above by λ⊙, would be f being 1 on one of the measures and -1 on another one of the measures, producing:
So, the distance from (mn,bn) to
according to dKR1 is at most 2ϵλ⊙/b∗+ϵ
And then, because this point has a b value of at most
Because bn≤b∗, the b value upper bound turns into b∗−ϵ
Which is a sufficient condition for that mix of two points to be only ϵ distance from a point in H∞ with a b∗ upper bound on the b term, so we have that the distance from
is at most
Conversely, we can flip Hn and H∞, to get this upper bound on the Hausdorff distance between these two sets according to dhau1.
And, since b∗ and λ⊙