Our task in this post will be to develop the basic theory and notation for inframeasures and sa-measures. The proofs and concepts require some topology and functional analysis. We assume the reader is familiar with topology and linear algebra but not functional analysis, and will explain the functional analysis concepts more. If you wish to read through these posts, PM me to get a link to MIRIxDiscord, we'll be doing a group readthrough where I or Vanessa can answer questions. Here's the previous post and here are the proof sections. Beware, the proof sections are hard.
Feel free to skip this segment and refer back to it when needed. Duplicate the tab, keep one of them on this section, and you can look up notation here.
: some compact metric space. Points in this are denoted by or .
: Some distance metric, the usual one is the KR-metric between signed measures, defined as where is a function that has a Lipschitz constant of 1 or less. In other words, what's the biggest distance between the values that the measures output when you feed in a function that's limited in its ability to distinguish measures that are close to each other due to having a small Lipschitz constant?
: The Banach space of finite signed measures over equipped with the KR-metric/norm, defined as above (the norm is derived from the metric by assessing the distance between the signed measure and the 0 measure). Elements are denoted by . By Jordan decomposition, we can uniquely split it into where the former is all-positive and the latter is all-negative.
: The Banach space of continuous functions . The latter one is the space of continuous functions bounded in . Elements of are typically denoted by .
: We can interpret signed measures as continuous linear functionals on . This is given by . If was actually a probability distribution, this would just be . They're generalized expectations.
: used to refer to the number component of an a-measure or sa-measure.
: The closed convex cone of sa-measures. An sa-meaure is a pair where . Elements of this (sa-measures) are denoted by .
: A positive functional. A continuous linear functional that's nonnegative for all sa-measures.
: A set of sa-measures.
: The expectation of a function relative to a set of sa-measures. Defined as .
: The set of minimal points of (points that can't be written as a different point in plus a nonzero sa-measure), and the upper completion of (), respectively.
The closed convex cone of a-measures. An a-measure is a pair where is a measure (no negative component) and . It can also be written as where , and is a probability distribution.
: Given an a-measure, its value is the from writing the a-measure as . At the end, it's used for lambda-notation to describe complicated functions. Context distinguishes.
: Either the minimal upper bound on the values of the minimal points of a set , or the Lipschitz constant of a function (there's a close link between the two).
: The closure and convex hull of a set, respectively.
: An infradistribution. A set of sa-measures fulfilling the properties of nonemptiness, closure, convexity, upper-completeness, positive-minimals, (weak)-bounded minimals, and normalization.
: The set of infradistributions over . is the set of bounded infradistributions over .
: The space of probability distributions over some set .
: The function induced by an that goes . Or, just a function that's concave, monotone, uniformly continuous, and normalized, there's a link with infradistributions.
: If you're seeing it in the context of a pushforward, is a continuous function , and is the induced function . If you're seeing in the context of updating, it's a continuous function in .
: Used to denote probability distributions used for mixing stuff together, a probability distribution over the natural numbers or a finite subset of them in all cases. is the probability on the number .
: The index variable for mixing, like indexing infradistributions or points or functions.
: A mix of sets, defined as the set of every point that can be constructed by selecting a point from each and mixing according to .
: A function in , thought of as the indicator function for a fuzzy set, that we use for updating.
: The function made by gluing and together via , defined as .
: The support of a function , the set of where .
: The indicator function that's 1 on set , 0 everywhere else.
: The probability of relative to the function according to the infradistribution , it's defined as
: The infradistribution updated on the fuzzy set relative to the function .
: An infrakernel, a function fulfilling some continuity properties, of type signature .
Time to start laying our mathematical groundwork, like the spaces we'll be working in.
is some compact metric space, equipped with the Borel -algebra (which, in this case, is the same as the Baire -algebra) We could probably generalize this further to complete metric spaces, but things get significantly trickier (one of many directions for future research), and compact metric spaces are quite well-behaved.
Concrete examples of compact metric spaces include the space of infinite bitstrings, the color space for a mantis shrimp, the surface of a sphere, a set of finitely many points, the unit interval, the space of probability distributions over a compact metric space, and countable products of compact metric spaces.
Let's recap some functional analysis terminology for those seeing it for the first time, and a bit of the notation we're using, you can skip this part if you already know it. Vector spaces in full generality may lack much of the nice structure present in that's used in Linear Algebra. Going from the strongest and most useful structure to the weakest, there's a chain of implication that goes inner product, norm, metric, topology. If you have an inner product, you can get a norm. If you have a norm, you can get a metric from that via , and if you have a metric, you can get a topology from that (with a basis of open balls centered at points). The structure must be imposed on the vector space, and there may be some freedom in doing so, like how can have the , , or norm.
A Banach space is a vector space equipped with a norm (a notion of size for vectors), that's also closed under taking limits, just like is.
The term "functional" is used for a function to . So, a continuous linear functional on a vector space is a function that's: linear, continuous, and has type signature .
The term "dual space of " is used for the vector space of continuous linear functionals .
The space is the Banach space of continuous functions . We'll also use to denote the subset that just consists of continuous functions bounded in . The dual space of is , the Banach space of finite signed measures over .
Moving on from the functional analysis terminology, let's consider finite signed measures , an element of . A signed measure can be uniquely split into a positive part and a negative part , by the Jordan Decomposition Theorem. The "finite" part just means that doesn't assign any set measure and doesn't asssign any set measure.
Now, we said that the space of finite signed measures was the dual space of (continuous functions ). So... how does a signed measure correspond to a continuous linear functional over ? Well, that corresponds to
If was a probability distribution , then would be , so this is just like taking the expected value, but generalizing to negative regions in . We'll be using this notation a whole lot. Because finite signed measures perfectly match up with continuous linear functionals , we can toggle back and forth between whichever view is the most convenient in the moment, viewing continuous linear functionals on as finite signed measures, and vice-versa.
Our ambient vector space we work in is , a pair of a signed measure and a number. is the direct sum, which is basically Cartesian product for vector spaces. The direct sum of Banach spaces is a Banach space, with the norm defined in the obvious way as .
We should take a moment to talk about what norm/metric we're using. The norm/metric we're using on is the KR(Kantorovich-Rubinstein)-norm/metric.
Definition 1: KR-metric
The metric defined by where is a a continuous function with a Lipschitz constant of 1 or less. The KR-norm is
But why are we using this unfamiliar KR-metric, instead of the trusty old total variation distance? Well, the KR-metric, since it can only query the measure with functions that aren't too steep or big, says that two distributions are close if, y'know, they're close in the intuitive sense. If we have a bunch of Dirac-delta distributions at 0.9, 0.99, 0.999..., then according to the KR-metric, they'd limit to a Dirac-delta distribution at 1. According to total variation distance, all these distributions at distance 2 from each other and don't converge at all. Similarly, if we've got two probability distributions over histories for an environment we're in, and they behave very similarly and then start diverging after the gazillionth timestep, the KR-metric would go "hm, those two distributions are very close to each other", while total variation distance says they're very different. Also, if we've got finitely many points at distance 1 from each other, the KR-metric and total variation distance match up with each other (up to a constant). But total variation distance is a bit too restrictive for the continuous case.
There's a sense in which convergence in total variation distance too strict because it requires a "perfect match" to exist in your hypothesis space, while convergence in KR-distance is just right for nonrealizability because, instead of requiring a "perfect match", it just requires that you get sufficiently close. Instead of getting accurate predictions for the rest of forever, it's a requirement that's something more like "you'll be accurate for the next zillion timesteps, and the time horizon where you start being inaccurate gets further and further away over time". You can draw an analogy to how utility functions with the time discount don't care that much about what happens at very late times.
Going with the KR-metric means we've got very nice dual spaces and compactness properties, while with total variation distance, Wikipedia doesn't even know what the dual space is.
So, tl;dr, the KR-metric is a much better choice for our setting, and we're working in as our vector space, which is equipped with the KR-norm and is closed under limits.
Definition 2: Sa-Measure
A point , which, when written as a pair of a signed measure and a number , has . The set of sa-measures is denoted by .
Definition 3: A-Measure
A point , which, when written as a pair of a signed measure and a number , has as a measure, and . The set of a-measures is denoted by .
Note that and are both closed convex cones. A closed convex cone is a subset of a vector space that: Is closed under multiplication by any , is closed under addition, and is closed under limits. For visual intuition, imagine a literal cone with its point at 0 in that's heading off in some direction, and see how it fulfills those 3 properties.
Basic Inframeasure Conditions
Before proceeding further, we should mention that Theorems 1, 2, and 3 are fairly elementary and have probably been proved in more generality in some different paper on convex analysis. We just call them theorems because they're important, not necessarily original. Past that, things get more novel. Sets of distributions instead of distributions have been considered before under the name "Imprecise Probability", as have nonlinear expectations and some analogues to probability theory, Shige Peng wrote a book on the latter. We found out about this after coming up with it independently. The main innovations that have not been found elsewhere are augmenting the sets of probability distributions with extra data (ie, our sa-measures) to get a dynamically consistent update rule, and how to deal with environments/link the setting to reinforcement learning. Let's move on.
Let be some arbitrary set of sa-measures. We're obviously nowhere near calling it an infradistribution, because we haven't imposed any properties on it. And different may have the same behavior, we're nowhere near our second desideratum of collapsing equivalence classes of 's with the same behavior. Well, nonemptiness should be a fairly obvious property to add.
Condition 1: Nonemptiness:
From here, let's see how we can enlarge without affecting its behavior. But hang on, what do we even mean by "behavior"??
Definition 4: Expectation w.r.t. a Set of Sa-Measures
where and is nonempty.
This is what we mean by "behavior", all these values should be left unchanged, regardless of .
Proposition 1: If then is a positive functional for .
A positive functional for is a continuous linear function that is nonnegative everywhere on .
This suggests two more conditions besides Nonemptiness.
Condition 2: Closure:
Condition 3: Convexity:
Why can we impose closure and convexity? Taking the closure of wouldn't affect any expectation values, it'd only swap for in some cases because is continuous. Also, since everything we're querying our set with is of a linear functional by Proposition 1, we can take the convex hull without changing anything. So, swapping out for its closed convex hull, no expectation values change at all.
But wait, we aren't querying with all linear functionals, we're only querying it with positive functionals that are constructed from a in . Or does this class of positive functionals go further than we think? Yes, it does, actually.
Theorem 1, Functionals To Functions: Every positive functional on can be written as , where , and
Nice! We actually are querying our set with all positive functionals, because we've pretty much got everything with just , and everything else is just a scalar multiple of that.
Upper Completion and Minimal Points
If you have a point , and some other point that's an sa-measure, we might as well add to . Why? Well, given some positive functional (and everything we're querying our set with is a positive functional by Proposition 1,
By linearity and positive functionals being nonnegative on sa-measures, your new point has equal or greater value than , so when we do , the addition of the new point didn't change anything at all, regardless of which positive functional/continuous function (by Theorem 1) we're using. So then, let's add in all the points like this! It's free. This would be done via Minkowski sum.
Definition 5: Upper Completion
The upper completion of a set , , is
Condition 4: Upper Completeness:
Ok, so we add in all those points. Since we're adding two nonempty convex sets, the result is also convex. As for closure...
Lemma 2: The upper completion of a closed set of sa-measures is closed.
However, isn't quite what we wanted. Maybe there's more points we could add! We want to add in every sa-measure we possibly can to our set as long as it doesn't affect the essential "behavior"/worst-case values. So, we should be able to add in a point if every positive functional/continuous function (Proposition 1 and Theorem 1) goes "the value of the point you're looking is undershot by this preexisting point over here". This more inclusive notion of adding points to make as big as possible (adding any more points would start affecting the "behavior" of our set) would be:
Add a point to if, for all in , there's a in where
Actually, this gets us nothing over just taking the upper completion/adding ! Check out the next result.
Proposition 2: For closed convex nonempty ,
Combining Proposition 2 and Theorem 1, our notion of upper closure is exactly the same as "add all the points you possibly can that don't affect the value for any "
Along with the the notion of the upper completion comes the notion of a minimal point.
Definition 6: Minimal Point
A minimal point of a closed nonempty set of sa-measures is a point where, if , and , and then . The set of minimal points is denoted
So, minimal points can't be written as a different point in the same set plus a nonzero sa-measure. It's something that can't possibly have been added by the upper-completion if it wasn't there originally. We'll show a picture of the upper completion and minimal points (these two notions generalize to any closed subset of any closed cone), to make things more concrete.
Theorem 2, Minimal Decomposition: Given a nonempty closed set of sa-measures , the set of minimal points is nonempty and all points in are above a minimal point.
This means that we can take any point and decompose it into , where , and is an sa-measure. We use this a whole lot in proofs. The proof of this uses the axiom of choice in the form of Zorn's Lemma, but separability may let us find some way to dodge the use of the full axiom of choice.
Proposition 3: Given a , and a that is nonempty closed,
So when evaluating , we can just minimize within the set of minimal points. Minimal points are the only thing that matters for the "behavior" of .
Proposition 4: Given a nonempty closed convex , and
The set of minimal points is left unchanged when you take the upper completion, and taking the upper completion of the set of minimal points equals taking the upper completion of . This is fairly intuitive from the picture.
Theorem 3, Full Characterization: If the nonempty closed convex sets and have , then there is some where
Corollary 1: If two nonempty closed convex upper-complete sets and are different, then there is some where
Looking back at our second desideratum, it says "Our notion of a hypothesis in this setting should collapse "secretly equivalent" sets, such that any two distinct hypotheses behave differently in some relevant aspect. This will require formalizing what it means for two sets to be "meaningfully different", finding a canonical form for an equivalence class of sets that "behave the same in all relevant ways", and then proving some theorem that says we got everything."
And we did exactly that. Also, Theorem 3 and the other results justify the view of the minimal points as the "unique identifier" of a closed convex set. If two closed convex sets have the same minimal-point ID, then taking the upper completion gets you the same set, and they behave the same w.r.t all the queries we can throw at them. If two sets have a different minimal-point ID, then when we take the upper completion, they're different, and there's some query that distinguishes them.
Minimal Point Conditions
Well, this is the basics. But we can impose some more conditions. We don't really like these signed measures, it'd be nice to work exclusively with positive measures, if possible. The minimal points are all we really need to care about by Proposition 3, so let's require that they're all in the smaller cone , which has no negative-measure shenanigans afoot. Renormalization may fail if there's minimal points that have negative parts, which is analogous to how you can renormalize a positive measure back to a probability distribution, but a signed measure may not be able to be renormalized back to 1.
Condition 5: Minimal-positivity:
Further, things can get a bit tricky in various places if the minimal points don't lie in some compact set. Applying compactness arguments lets you show that you don't have to close your set after updating, and show up a lot in our proofs of properties of belief functions. However, this next condition isn't essential, just convenient, and it's worthwhile looking at what happens when it's dropped, for future research.
Condition 6a: Minimal-boundedness: There is a compact set s.t. .
Proposition 5: Let denote an arbitrary probability distribution. If , then the condition "there is a where, " is equivalent to "there is a compact s.t. "
We mostly use this formulation of Minimal-boundedness instead of the compact set requirement. We only have to bound the scale-terms on the minimal points and we have this property. Again, it's not essential, but very convenient.
Is there a weakening of bounded-minimals? Yes there is. I haven't figured out what it means for the set of minimal points (EDIT 8/28: got a clean iff result about what it means for minimal points while trying to prove something else), but it's more mathematically essential, and I don't think it can be dropped if some other post wants to go further than we did. It can't be motivated at this point, we'll have to wait until we get to Legendre-Fenchel Duality.
Condition 6b: Weak minimal-boundedness: is uniformly continuous.
So, we have almost everything. Nonemptiness, closure, convexity, upper-completion, bounded-minimals, minimal-positivity, and minimal-boundedness/weak minimal-boundedness are our conditions so far.
However, there's one more condition. What's the analogue of renormalizing a measure back to 1 in this situation? Well, for standard probability distributions , , and . This can be cleanly ported over. We shall require that , , that's our analogue of normalization. Unpacking what the expectation means, this corresponds to: There's minimal points where is arbitrarily close to 0, and there's some minimal point where , and there's no points with a lower value.
Condition 7: Normalization:
Let's recap all the conditions. We'll be using for something fulfilling all of the following conditions except maybe 6a.
4: Upper Completeness:
6b: Weak Bounded-Minimals: The function is uniformly continuous.
Definition 7: (Bounded)-Infradistribution/Inframeasure
An inframeasure is a set of sa-measures that fulfills conditions 1-5 and 6b. An infradistribution is a set of sa-measures that fulfills conditions 1-5, 6b, and 7. The "bounded" prefix refers to fulfilling condition 6a. The set of infradistributions is denoted as , the set of bounded infradistributions is denoted as .
Now, how do we get normalization if it isn't already present? Closure, Convexity, and Upper Completeness can be introduced by closure, convex hull, and upper completion, respectively. How do we turn an inframeasure into an infradistribution?
Well, just take every in your set, and map it to:
This may seem a bit mysterious. This normalization can be thought of as analogous to rescaling a utility function to be in via scale-and-shift.
What we're doing first, is shifting everything down by , which is as much as we possibly can manage without making go negative anywhere. The utility function analogue would be, if your expected utilities are , this is like shifting them down to .
The second thing we do is go "ok, what's the lowest value at 1? Let's scale that back up to 1". Well, it'd be (remember, we shifted first), so we multiply everything by to get our set of interest.
Hang on, what if there's a divide-by-zero error? Well.. yes, that can happen. In order for it to happen, . Does this correspond to any sensible condition which hopefully doesn't happen often?
Proposition 6: occurs iff there's only one minimal a-measure, of the form .
Put another way, divide by zero errors occur exactly when Murphy is like "oh cool, no matter what function they pick, the worst thing I can do is give them value, and then nothing happens so they can't pick up any more value than that", so nothing matters at all. This is exactly like Bayesian renormalization failing when you condition on a probability-0 event (note that the measure component is 0 before rescaling). You give up and cry because, in the worst case, nothing you do matters.
Proposition 7: Renormalizing a (bounded)inframeasure produces a (bounded) infradistribution, if renormalization doesn't fail.
And we're done for now, we've made it up to infradistributions. Now, how can we analyze them?
There's an powerful way of transforming an infradistribution to another form, to look at the same thing in two completely different mathematical contexts, and we can build up a sort of dictionary of what various different concepts are in two completely different settings, or develop concepts in one setting and figure out what they correspond to in the other setting.
An example of this sort of thing is Stone Duality, where you can represent a topological space as its lattice of open sets, to translate a huge array of concepts back and forth between topology and lattice theory and work in whichever setting is more convenient. And, working with special lattices that can't always translate to topological spaces, you get locales and pointless topology! Duality theorems are highly fruitful.
This result will be inspired by the well-known fact that signed measures correspond to continuous linear functionals on via , and probability distributions, when translated over, have extra properties like monotonicity, being 1-Lipschitz, and .
Theorem 4, LF-duality, Sets to Functionals: If is an infradistribution/bounded infradistribution, then is concave, monotone, uniformly continuous/Lipschitz over , , and
So, expectation w.r.t an infradistribution is concave (not linear, as probability distributions are) and if then (monotonicity). Paired with normalization, this means every appropriate has .
You get concavity from convexity, the thing from upper-completeness, monotonicity matches up with "all minimal points are a-measures", Lipschitz corresponds to "all minimal points have ", uniform continuity corresponds to the weak-minimal-bound condition, and obviously corresponds to normalization. This is moderately suggestive, our conditions we're imposing on the set side are manifesting as natural conditions on the concave functional we get from .
Is there a reverse direction? How do we start with a that fulfills the conditions, and get an infradistribution from that?
Theorem 5, LF-Duality, Functionals to Sets: If is a function that is concave, monotone, uniformly-continuous/Lipschitz, , and , then it specifies a infradistribution/bounded infradistribution by:
Where is the function given by , and is the convex conjugate of . Also, going from a infradistribution to an and back recovers exactly the infradistribution, and going from an to a infradistribution and back recovers exactly .
Another name for the convex conjugate is the Legendre-Fenchel transformation, so that's where we get the term Legendre-Fenchel Duality from. If you want, you can shorten it as LF-duality.
So, (bounded)-infradistributions are isomorphic to concave monotone normalized Lipschitz/uniformly continuous functions . This is the LF-Duality. We can freely translate concepts back and forth between "sets of sa-measures that fulfill some conditions" and "concave functionals on that fulfill some conditions".
In particular, actual probability distributions correspond to a: linear, monotone, 1-Lipschitz, normalized functional . So, in the other half of the LF-duality, probability distributions and infradistributions are very nearly the same sort of thing, the only difference between them is that the former is linear and the latter is concave! This is the essential reason why so many probability theory concepts have analogues for infradistributions.
But... what does the LF-transformation actually do? Well, continuous linear functionals are equivalent to signed measures, and continuous linear functionals (with our KR-norm) correspond to continuous functions over . A hyperplane in corresponds to a point in and hyperplanes in the latter correspond to points in the former. Points in turn into hyperplanes above , points on-or-below the graph of turn into hyperplanes below .
What this transform basically does, is take each suitable , check its value w.r.t , convert the pair into a hyperplane in , and go "alright, whichever set came from must be above this hyperplane". Eventually all the hyperplanes are drawn and you've recovered your infradistribution as the region above the graph of the hyperplanes.
Viewing as suitable concave functionals on , we now have a natural notion of distance between infradistributions analogous to total variation distance
At this point, the use of our uniform-continuity condition is clearer. A uniform limit of Lipschitz functions may not be Lipschitz. However, a uniform limit of uniformly continuous functions is uniformly continuous, so the space is complete (has all its limit points).
A lot of probability-theory concepts carry over to infradistributions. We have analogues of products, markov kernels, pushforwards, semidirect products, expectations, probabilities, updating, and mixtures. These are most naturally defined in the concave-functional side of the duality first, and then you can conjecture how they work on sets, and you know you got the right set (modulo closure, convex hull, and upper completion) if the expectations w.r.t your set match up with the defining property in the concave functional picture. The post was getting long-enough as is, so we won't cover most of them in detail, and leave developing inframeasure theory more fully till later posts.
Pushforwards and Mixing
Let's look at the first one, pushforwards of a (bounded) infradistribution via a continuous . The standard probability theory analogue of a pushforward is starting with a probability distribution over , and going "what probability distribution over is generated by selecting a point from according to its probability distribution and applying ?"
On the positive functional level, this is defined by:
Let's take a guess as to what this is on the set level. The obvious candidate is: Take the in , do the usual pushforward of the measure via to get a signed measure over , and keep the term the same, getting a function . If is something that maps everything to the same point, our resulting measures will only be supported on one point, so we'd have to take upper-completion again in that case. But maybe if is surjective we don't need to do that?
Let be the set produced by applying to and taking the upper completion.
Proposition 8: If and is continuous, then
Proposition 9: is a (bounded) inframeasure if is, and the image of doesn't require upper completion if is surjective.
Proposition 8 certifies that our unique defining property is fulfilled by doing this to the set of sa-measures, so we have the proper set-analogue of the pushforward. And proposition 9 says we were right, the only thing that may go wrong is upper completion. Generally, doing operations like may not exactly make an inframeasure, but you're only a closed convex hull and upper completion away from your canonical form as long as you check that the expectations match up with what you'd expect on the concave functional side.
Alright now, what about mixing? Like, mixing hypotheses to make a prior, according to some distribution on the numbers. The concave functional analogue is
This works just fine with no extra conditions if we have uniform continuity, but for Lipschitzness, we need an extra condition. Letting be the Lipschitz constant for the functional , we need in order for the result to be Lipschitz. The set version should just be mixing of sets.
Definition 8: Mixing Sets
Given a countable family of inframeasures where , and a probability distribution , then
Ie, is the set of sa-measures that can be made by choosing one sa-measure from each and mixing them together w.r.t. .
Try sketching out two sets on a piece of paper and figure out what the 50/50 mix of them would be. This corresponds to "Murphy can pick whatever they want from each set, but they're constrained to play the points they selected according to the probability assigned to each "
We should note that for bounded inframeasures, letting be the bound on the value of the minimal points of by minimal-boundedness, we want to preserve our minimal-bounded condition for the mix.
Proposition 11: A mixture of infradistributions is an infradistribution. If it's a mixture of bounded infradistributions with minimal point bounds of , and , then the mixture is a bounded infradistribution.
Again, Proposition 10 certifies that we got the right notion on the set level, and Proposition 11 certifies that we don't need to do any closure or upper completion cleanup. Now, how do mixtures interact with pushforwards?
So, it doesn't matter whether you mix before or after pushing your infradistribution through a continuous function. The proof of this is quite nontrivial if we were to do it on the set level because of exhaustive verification of each of the conditions, but now that we've shown that we have the right set-level analogue of mixing and pushforwards, we can just work entirely in the concave-functional picture and knock the proof out in a couple of lines, so our duality is doing useful work.
Let's move on to updates. We should mention that we'll be taking a two-step view of updating. First, there is chopping down the measure so only measure remaining within the set we're updating on remains, and then there's renormalizing the measure back up to 1. Thinking of updating as a unitary process instead of separated into these two phases will confuse you.
First, what sorts of events can we update on? Well, we only have expectations of continuous functions, and in a classical setting, we can get probabilities from expectations (and then updates from that) by considering the expectation of an indicator function that's 1 on a set and 0 everywhere else. Sadly, in this setting, we can only take the expectation of a continuous function. A sharp indicator function for an event will be discontinuous unless the set we're updating on is clopen. Fortunately, the specific application of "conditioning on a finite history" (with discrete action and observation spaces) only involves conditioning on clopen sets, because the set of histories which have a certain finite history as a prefix are clopen. Similarly, if you've got a finite space of observations, observing "the true thing is in this subset of observations" is a clopen update.
This seems like a very harsh restriction, but we can generalize further to get something adequate for most applications. I had to rewrite this section, though, because the original motivation was too complicated and had too many moving parts, so I figured it'd be more sensible to motivate things from an entirely different direction.
Let's call our clopen set that we're updating on . Our task for the post-update infradistribution is to specify the expectation values of functions that are only defined on . And we only have expectation values for functions that are defined on all of . So, we need some way to extend a function that's only defined on to be defined on all of . The obvious way is to specify some single function that tells you what happens outside of , and look at the expectation of , ie, "look at the function that's on , and outside of ". Admittedly, the obvious way to do this is to have , and that produces the best analogies to probability theory for infradistributions, and is what standard updates do. However, it's worth looking at things in more generality and seeing what comes out.
So, our first stab (on the positive functional level) would be something like:
This is just"extend your function on to all of by extending it with " But, lamentably, this isn't normalized. This is more like a "raw update" without normalizing afterwards. But, we know how to renormalize. So, attempt 2:
That's the renormalized form. As a sanity check, we should make sure that it recovers standard Bayesian updates. So, let's say our infradistribution is actually just , a standard probability distribution. Then,
Ok, so this recovers standard updating for actual probability distributions. But why did we introduce that ? Why does it matter how we extend our function to outside the area we're updating on? Just use , we don't care about what's happening outside the area we're updating on. Well, that's exactly the problem. We actually do care about what's happening outside of what we're updating on. The whole reason dynamic inconsistency comes about for standard Bayesian updating on problems like counterfactual mugging is that past-you cares about what happens down both branches of the coinflip, but once Omega tells you what the coin comes up, you stop caring about what happens own the other branch of the coinflip and so you don't pay up. And that's why having can permit you to get dynamic consistency, past-you and future-you don't have to have mismatches in what you care about.
For a single probability distribution, the reason why doesn't really matter is because, for probability distributions, we can split into "expected value coming from " and "expected value coming from outside " and cleanly add them up. So, specifying how well you do outside just adds the same constant to everything, and then the shift part of scale-and-shift deletes it.
However, for infradistributions in general, you can't split things up like that! Your choice of affects how different 's score relative to each other, so it matters quite a bit how you extend functions from to all of .
So, there's the motivation for updates. Really, it's a question of "how do we extend expectations of functions in to a function over all of so we can take expectations", and it turns out that different choices of (to extend to all of ) produce different results, while this isn't the case with conventional probability distributions. And, while produces the best analogies to conventional probability theory and updating and has the most nice properties, pretty much corresponds to "I don't care about what happens outside of ", and dynamic consistency mandates that you do care about what happens outside of the events you update on.
But still, only being able to update on clopen sets really sucks. Can we generalize it? Well, using a theorem not shown here (because this section is a later-edited addition), we can go:
"Lets say we've got an infradistribution on the space , and a continuous function that gives the probability of a particular point producing the observation we got. We use and to get a joint infradistribution over (ie, did we see the observation or not). Then we update on 1 (ie, we saw the observation, this is a clopen update), and our function tells us how good things would be for a given conditional on not getting the observation. Our new infradistribution is over , which is isomorphic to . This new infradistribution has thus-and-such form."
With that, we can update on observations where different points in have different odds of producing your observation of interest. is your function telling you how likely it is that a point would produce your observation you saw, is your function telling you how well things do for a given point if you didn't see the observation you saw. This is basically updating on any fuzzy set with a continuous indicator function ( can be considered the indicator function for a fuzzy set), instead of just updating on clopen sets. Much more general! Also note how the logical induction paper had to use continuous indicator functions instead of sharp discontinuous ones, this is analogous. How do updates work when we're updating on fuzzy sets?
Well, a nifty feature of a continuous likelihood function is that it lets us glue two continuous functions into another continuous function, in a similar way to our original attempt of "f in the region we're restricting to, g outside it"
Definition 9: L-Gluing
Given three continuous functions , glued to via , , is the function defined as:
So, behaves like inside our region we're updating on, and outside of our region we're we're updating on, and get mixed in regions we're unsure of.
Going back to the positive functional picture with (our analogue of an expectation), we can now define the raw update relative to (likelihood function) and (off-observation value). The following form is derived from our theorem about reducing updates on fuzzy sets to sharp updates on "did I get this observation".
And then the full version of updating on fuzzy sets in all its generality (on the concave functional side of the duality) is:
Squinting at this, this is: "do a raw update. Then subtract the excess value off (first step of renormalizing), and then rescale according to which... well, ignoring the part, looks kinda like "expectation of 1 - expectation of 0"... oh, it's like using the gap between the measure on and the measure on to define how much you need to scale up a measure to be a probability distribution! That term on the bottom is the analogue of the probability of the event you're updating on, relative to the function !"
Let's flesh this out a bit more in preparation for defining updates on the set side of the duality.
Backing up to our definition of renormalization, we required that , and . If we pivot to viewing the function inside the expectation as our that gives an indicator function for something, the normalization condition says something like "probability of is 1, probability of is 0".
Let's generalize this and define probabilities of fuzzy events relative to a function .
Definition 10: Probabilities of Fuzzy Sets
Given two functions , and a nonempty set of sa-measures , the probability of (interpreted as a fuzzy set) w.r.t and is:
If is 0, then for an infradistribution , by normalization, and unpacking what that means, we get . This is much like how the probability of a set is the expectation of the indicator function for the set, just interpret the as the indicator function for a fuzzy set on the right, and as a fuzzy set on the left. So, for , it behaves like an actual probability.
However, when , then this is better interpreted as caring-measure. is the difference between the best score you can get vs Murphy and the worst score you can get vs Murphy if you know how things go outside of (interpreted as fuzzy set). This -dependent "probability" is actually "how much value is at stake here/how much do I care about what happens inside set given that I know what happens outside of it."
And, further, our scaling term for renormalizing an inframeasure that isn't an infradistribution yet was . So, using this reformulation, our rescaling term turns into regardless of . So, our renormalization term is "rescale according to the probability you assign to any event at all occuring"
Alright, we have enough tools to define updating on the set level. For the raw update (no rescaling yet), we need to chop the measure down according to . We should also fold in the off- value (requires specifying ) to the term, by our dynamic consistency example. And then we do appropriate scale-and-shift terms to subtract as much as we can off the term, and rescale according to the probability of relative to the we're also updating on.
Let's use to denote the original finite signed measure but scaled down by the function . If we view as an indicator function, this is just slicing out all of the measure outside of , ie, usual conditioning.
Definition 11: Updating
, updated on and , is the set made by mapping through the following function and taking closure and upper-completion.
Closure is unnecessary if is a bounded inframeasure, and upper-completion is uneccesary if your is the indicator function for a clopen set.
Roughly, this is: Chop down the measure according to your fuzzy set, is the fragment of expected value you get outside of your fuzzy set so you fold that into the term. For rescaling, when we unpack , it's just , so that's the maximum amount of value we can take away from the second vector component without ever making it go negative. And then rescale "how much do I care about this situation" () back up to 1.
Note that this updating process lands us in a different vector space. Our new vector space is . It still fulfills the nice properties we expect, because a closed subset of a compact space is compact so every nice property still carries over. And it still has a closed convex cone of sa-measures w.r.t. the new space, abbreviate that as .
What properties can we show about updates?
Proposition 13: When updating a bounded infradistribution over , if the renormalization doesn't fail, you get a bounded infradistribution over the set . (for infradistributions in general, you may have to take the closure)
Ok, so it's sensible on the set level. And for proposition 14, it means we can break down the expectation of two functions glued together by into the expectation of outside of , and the probability of relative to times the updated expectation of . We get something interesting when we reshuffle this. It rearranges to:
Further unpacking the probability, we get
And then, translating to concave functionals via LF-duality, we get:
So, this shows that we got the right update! What happens when we update twice?
So, updating twice in a row produces the same effect as one big modified update. It may be a bit clearer if we express it as:
Corollary 2: Regardless of and and ,
Corollary 3: If and are clopen sets, then, glossing over the difference between indicator functions and sets,
Now, what happens when you update a prior/mixture of infradistributions? We get something that recovers Bayesian updating.
Theorem 6, InfraBayes:
if the update doesn't fail.
This means that when we update a prior, it's the same as updating everything individually, and then mixing those with probabilities weighted by the probability the infradistribution assigned to , just like standard Bayes!
In particular, if some hypothesis goes "nothing matters anymore" and gives up and cries after seeing , then its probability term is 0, so it will drop out of the updated prior entirely, and now you're only listening to hypotheses that think what you do matters. Thus, with a fairly broad prior, we don't have to worry about the agent giving up on life because nothing matters post-update, just as long as some component in its prior gives it the will to continue living/says different policies have different values. Well, actually, we need to show an analogue of this for belief functions, but it pretty much works there too.
There are more probability theory analogues, but they are primarily material for a future post. We'll just give their forms in the concave functional view. If they look unmotivated, just note that they match up with the standard probability-theory notions if we use infradistributions corresponding to actual probability distributions. We'll be using lambda-notation for functions. If you haven't seen it before, is the function that takes in an , and returns . is the function that maps to the function that maps to .
Definition 12: Product
If and , the product is given by:
These products are noncommutative! The product of bounded infradistributions is a bounded infradistribution.
There's also infrakernels, the infra-analogue of Markov kernels. A Markov kernel is a function X\to\Delta Y that maps each to some probability distribution over . Concrete example: The function mapping income to a probability distribution over house size.
Definition 13: Infrakernel
An infrakernel is a function that is:
1: Pointwise convergent. For all sequences limiting to and all ,
2: Uniformly equicontinuous. For all , there is a where if , then
If there is some where property 2 works with , then it is a Lipschitz infrakernel.
The first two conditions let you preserve uniform continuity, and the strengthening of the second condition lets you preserve Lipschitzness/being a bounded inframeasure.
Now, we can define the semidirect product, . The semidirect product in the probability-theory case is... Consider the aforementioned Markov kernel mapping income to a probability distribution over house size. Given a starting probability distribution over income, the semidirect product of (income distribution) with (house size kernel) would be the joint distribution over income and house size. It's a critically important concept to know that isn't discussed much.
Definition 14: Semidirect Product
If , and is an infrakernel , the semidirect product is given by
Products are just a special case of this, where , regardless of . If is a bounded infradistribution and is a Lipschitz infrakernel, then the semidirect product is a bounded infradistribution.
The pushforward of a probability distribution is... given the house size Markov kernel, and a distribution over income, the pushforward is the induced distribution over house size. We earlier gave pushforwards for a continuous function . What's the analogue of the pushforward for an infrakernel? Or, can you do a pushforward w.r.t. a Markov kernel?
Definition 15: Pushforward
If , and is an infrakernel , the pushforward is given by And if is a continuous (in the KR-metric) Markov kernel , the pushforward is given by
An interesting note about this is, if is a bounded infradistribution, then we need Lipschitz infrakernels to preserve that property for pushforwards, but we do not need any additional condition on a Markov kernel to preserve boundedness besides continuity. Exercise: Try to figure out why.
Really, everything originates from the semidirect product. The product is the special case of a semidirect product for a constant infrakernel, the pushforward is a semidirect product that's projected down to the coordinate, the pushforward w.r.t a Markov kernel is a special case of pushforward w.r.t. an infrakernel, and the pushforward w.r.t. a continuous function is a special case of pushforward w.r.t. a Markov kernel.
And that's about it for now, see you in the next post!
For those like me who hadn't heard of the Kantorovich-Rubinstein metric, note that it's the same as the perhaps-familiar Wasserstein metric.
I got really confused by this in conjunction with proposition 1. A few points of confusion:
Yeah, looking back, I should probably fix the m- part and have the signs being consistent with the usual usage where it's a measure minus another one, instead of the addition of two signed measures, one a measure and one a negative measure. May be a bit of a pain to fix, though, the proof pages are extremely laggy to edit.
Wikipedia's definition can be matched up with our definition by fixing a partial order where (m′,b′)≥(m,b) iff there's a (m∗,b∗) that's a sa-measure s.t. (m,b)+(m∗,b∗)=(m′,b′), and this generalizes to any closed convex cone. I lifted the definition of "positive functional" from Vanessa, though, you'd have to chat with her about it.
We're talking about linear functions, not affine ones. (m,b)↦c(m(f)+b) is linear, not affine (regardless of f and c, as long as they're in C(X) and R, respectively). Observe that it maps the zero of M±(X)⊕R to 0.
Half baked confusion:
How does Parfit's Hitchiker fit into the Infra-Bayes formalism? I was hoping that disutility the agent receives from getting stuck in the desert would be easily representable as negative off-branch utility. I am stuck trying to reconcile that with the actual update rule:
Here, I interpret g as our utility function. Thus: m(0★Lg) gives us the expected utility tracked from the offbranch event. The probability and the expectation are just a scale and shift. This update is applied to each a-measure in a set, H, of a-measures. (is this right so far?)
Since the environment depends on what our agent chooses, we've got to at least have some Knightian uncertainty over the different decisions we could have made. The obvious thought is to give these names, say e1 is the environment corresponding to the hardcoded policy that the agent pays after getting carried to the city, and in e2 the opposite. After updating that we are in the city, e2 seems logically impossible. Does that mean we apply Nirvana to make it go to infinite utility on the basis that infx∈∅f(x)=∞?
We have two policies: we either pay or don't pay. But not paying would lead us to Nirvana in either case since it would contradict the hardcoded policy in e1 and e2 is impossible. Paying would lead us to losing some utility (paying $100) in e1, or contradiction in e2, so Murphy chooses e1. This is where my reasoning gets stuck, is there some way to translate this to a Nirvana free space, where the agent takes the only logically coherent action?
So, the flaw in your reasoning is after updating we're in the city, e2 doesn't go "logically impossible, infinite utility". We just go "alright, off-history measure gets converted to 0 utility", a perfectly standard update. So e2 updates to (0,0) (ie, there's 0 probability I'm in this situation in the first place, and my expected utility for not getting into this situation in the first place is 0, because of probably dying in the desert)
As for the proper way to do this analysis, it's a bit finicky. There's something called "acausal form", which is the fully general way of representing decision-theory problems. Basically, you just give an infrakernel Θ:Π→□(A×O)ω that tells you your uncertainty over which history will result, for each of your policies.
So, you'd have
Ie, if you pay, 99 percent chance of ending up alive but paying and 1 percent chance of dying in the desert, if you don't pay, 99 percent chance of dying in the desert and 1 percent chance of cheating them, no extra utility juice on either one.
You update on the event "I'm alive". The off-event utility function is like "being dead would suck, 0 utility". So, your infrakernel updates to (leaving off the scale-and-shift factors, which doesn't affect anything)
Because, the probability mass on "die in desert" got burned and turned into utility juice, 0 of it since it's the worst thing. Let's say your utility function assigns 0.5 utility to being alive and rich, and 0.4 utility to being alive and poor. So the utility of the first policy is 0.99⋅0.4=0.396, and the utility of the second policy is 0.01⋅0.5=0.005, so it returns the same answer of paying up. It's basically thinking "if I don't pay, I'm probably not in this situation in the first place, and the utility of "I'm not in this situation in the first place" is also about as low as possible."
There's a very mathematically natural way to translate any decision theory to "causal form", and as it turns out, the process which falls directly out of the math is that thing where you go "hard-code in all possible policies, go to Nirvana if I behave differently from the hard-coded policy". This has an advantage and a disadvantage. The advantage is that now your decision-theory problem is in the form of an infra-POMDP, a much more restrictive form, so you've got a much better shot at actually developing a practical algorithm for it. The disadvantage is that not all decision-theory problems survive the translation process unchanged. Speaking informally the "fairness criterion" to translate a decision-theory problem into causal form without too much loss in fidelity is something like "if I was mispredicted, would I actually have a good shot at entering the situation where I was mispredicted to prove the prediction wrong".
Counterfactual mugging fits this. If Omega flubs its prediction, you've got a 50 percent chance of being able to prove it wrong.
XOR blackmail fits this. If the blackmailer flubs its prediction and thinks you'll pay up, you've got like a 90 percent chance of being able to prove it wrong.
Newcomb's problem fits this. If Omega flubs its prediction and thinks you'll 2-box, you'll definitely be able to prove it wrong.
Transparent Newcomb and Parfait's Hitchiker don't fit this "fairness property" (especially for 100 percent accuracy), and so when you translate them to a causal problem, it ruins things. If the predictor screws up and thinks you'll 2-box on seeing a filled transparent box/won't pay up on seeing you got saved, then the transparent box is empty/you die in the desert, and you don't have a significant shot at proving them wrong.
Let's see what's going wrong. Our two a-environments are
(predicted to not pay, probably die in desert,0)
(predicted to pay, probably survive,0)
Update on the event "I didn't die in the desert". Then, neglecting scale-and-shift, our two a-environments are
(0.01in city, pay implies Nirvana,0)
(0.99in city, no-pay implies Nirvana,0)
Letting N be the utility of Nirvana,
If you pay up, then the expected utilities of these are 0.01⋅N and 0.99⋅0.4
If you don't pay up, then the expected utilities of these are 0.01⋅0.5 and 0.99⋅N
Now, if N is something big like 100, then the worst-case utilities of the policies are 0.396 vs 0.005, as expected, and you pay up. But if N is something like 1, then the worst-case utilities of the policies are 0.01 vs 0.005, which... well, it technicallygets the right answer, but those numbers are suspiciously close to each other, the agent isn't thinking properly. And so, without too much extra effort tweaking the problem setup, it's possible to generate decision-theory problems where the agent just straight-up makes the wrong decision after changing things to the causal setting.
Thank you so much for your detailed reply. I'm still thinking this through, but this is awesome. A couple things:
So, if you make Nirvana infinite utility, yes, the fairness criterion becomes "if you're mispredicted, you have any probability at all of entering the situation where you're mispredicted" instead of "have a significant probability of entering the situation where you're mispredicted", so a lot more decision-theory problems can be captured if you take Nirvana as infinite utility. But, I talk in another post in this sequence (I think it was "the many faces of infra-beliefs") about why you want to do Nirvana as 1 utility instead of infinite utility.
Parfit's Hitchiker with a perfect predictor is a perfectly fine acausal decision problem, we can still represent it, it just cannot be represented as an infra-POMDP/causal decision problem.
Yes, the fairness criterion is tightly linked to the pseudocausality condition. Basically, the acausal->pseudocausal translation is the part where the accuracy of the translation might break down, and once you've got something in pseudocausal form, translating it to causal form from there by adding in Nirvana won't change the utilities much.
should be 'an sa-measure' not 'an sa-meaure'.
This open paren seems to me to have no matching close paren.
Should be "Let's say"
Missing math formatting
I've been meaning to dive into this for-e-ver and only now find the time for it! This is really neat stuff, haven't enjoyed a framework this much since logical induction. Thank you for writing this!
(Reformatted Latex so the comment text editor won't reject it.)