This research was produced during SERI MATS. Thanks to Vanessa Kosoy for providing ideas and reading drafts.
Introduction
In this article, we investigate "infra-Bayesian logic" (IBL for short), which is an idea introduced by Vanessa Kosoy. The original idea is outlined in Vanessa Kosoy's Shortform. Here, we expand with some details and examples. We also try to fix the mentioned issues (e.g. lack of continuity for certain maps) by investigating several candidate categories for semantics. However, we fail to find something that works.
As a theory of beliefs and reasoning, infra-Bayesianism gives rise to semantics for a certain logic, which we might call "infra-Bayesian logic" (IBL for short). Formalizing the syntax for this logic has the advantage of providing a notion of description complexity for infra-POMDPs (partially observable Markov decision processes) expressed via this syntax, and hence could serve as a foundation for an infra-Bayesian analogue of AIXI. Infra-Bayesian logic is not the only way to define an infra-Bayesian analog of AIXI. Such an alternative could be to use oracle machines (with an extra input tape for a source of randomness): For a probabilistic oracle machine, the corresponding IB hypothesis is the closed convex hull of all stochastic environments that result from using the oracle machine with a particular oracle, and then one could use the description complexity of the oracle machine (which can be defined by choosing a universal oracle machine) as a basis for an infra-Bayesian AIXI. The analogue of oracle machines for (non-infra-Bayesian) AIXI would be (probabilistic) Turing machines.
However, infra-Bayesian logic might be an interesting alternative to describing infra-Bayesian hypotheses, and this approach would be specific to infra-Bayesianism (there is, as far as we know, no good analogue of infra-Bayesian logic in (ordinary) Bayesianism).
Note that we are currently unaware of a working semantics for the higher order logic. We discuss some candidates considered in Candidates for the Semantics, and the ways in which they fail. It's unclear whether these issues with the semantics are mostly technical in nature, or whether they point to some deeper phenomena. We can, however, restrict to a first order IBL on finite sets, which, albeit less powerful, is still expressive enough for many applications (in particular covers all the examples in Examples).
You can skim/skip the Notation section and use it as a reference instead. The main description of IBL is given in Infra-Bayesian Logic. To get an initial and informal idea what IBL looks like, read Less Formal Description of Infra-Bayesian Logic. For a technical description and a complete list of properties, read Formal Description of Infra-Bayesian Logic. Note that the latter description requires some familiarity with category theory.
In Constructions in IBL we list how we can combine basic elements of IBL to express more complicated things.
In Infra-POMDPs we talk about infra-POMDPs as an application of IBL. Some knowledge of infra-Bayesianism is beneficial here, but not required. We highlight the section on Examples, where we describe the IBL terms for concrete Infra-POMDPs.
In Candidates for the Semantics we discuss the various (failed) candidates we tried for the higher-order logic version of IBL. A typical reader is not expected to read this section.
The Appendix is for mostly technical proofs and counterexamples. We recommend to only look these up when you are interested in the technical details. For example, if you wish to understand why conjunction is not continuous in IBL, you can find the details there.
Notation
Let us describe notation needed for using concepts from infra-Bayesianism.
This notation section is partially copied from the Infra-Bayesian Physicalism article, but extended from finite sets to compact Polish spaces, wich requires a bit more measure theory. Also, we use the symbol □Xfor homogeneus ultracontributions (instead of □cX as in the IBP article).
The notation from the IBP post is a bit different from some of the other posts on infra-Bayesianism in that it uses ultradistributions/ultracontributions rather than infradistributions.
We denote R+:=[0,∞). We will work a lot with compact Polish spaces.
Compact Polish spaces are spaces that are topologically equivalent to some compact metric space (the metric is not fixed, however, only the topology). An important special case of compact Polish spaces are finite (discrete) sets.
Definition 1. Given a compact Polish space X, a contribution on X is a (Borel) measure θ on X with θ(X)≤1. Given a measurable function f:X→R and θ∈ΔcX, we denote θ(f):=∫Xf(x)dθ(x). The space of contributions is denoted ΔcX, equipped with the weak-∗ topology (i.e. the weakest topology such that the functions ΔcX∋θ↦θ(f)∈R are continuous for all continuous functions f:X→R).
Naturally, any (probability) distribution is in particular a contribution, so ΔX⊆ΔcX. There is a natural partial order on contributions: θ1≤θ2 when θ1(A)≤θ2(A) for all measurable A⊂X.
Definition 2. A homogeneous ultracontribution (HUC) on X is non-empty closed convex Θ⊆ΔcX which is downward closed w.r.t. the partial order on ΔcX. The space of HUCs on X is denoted □X.
If a HUC Θ∈□X has the property Θ s.t. Θ∩ΔX≠∅, we call it a homogeneuous ultradistribution.
Sometimes we need a topology on the space of HUCs □X. In this case we will use the topology induced by the Hausdorff metric, (where we use a metric on ΔcX that fits with the weak-∗ topology. The topology induced by the Hausdorff metric is also equivalent to the so-called Vietoris topology, which is a topology that only depends on the underlying toplogy on Δc(X), see here. See also Propositions 42 and 50 in Less Basic Inframeasure Theorey for the same result for infra-distributioins). One can show that □X is a compact Polish space with this topology when X is a compact Polish space, see Lemma 10. Let us introduce more notation.
For a set A⊂Δc(X), clA denotes its closure.
Given another compact Polish space Y and a measurable function s:X→Y, s∗:□X→□Y is the pushforward by s:
s∗Θ:=cl{s∗θ∣θ∈Θ},
where (s∗θ)(A):=θ(s−1(A)) for all measurable A⊂Y. We can omit the cl in the definition if s is continuous.
A map X→□Y is sometimes referred to as an infrakernel.
Given an infrakernel Ξ:X→□Y, Ξ∗:□X→□Y is the pushforward by Ξ:
for all measurable A⊂X, B⊂Y, which can be extended to arbitrary measurable sets C⊂X×Y in the usual measure-theoretic way.
We will also use the notation Ξ⋊Θ∈□(Y×X) for the same HUC with X and Y flipped. And, for Λ∈□Y, Θ⋉Λ∈□(X×Y) is the semidirect product of Θ with the constant ultrakernel whose value is Λ.
For a closed set A⊂X, we define ⊤A:={μ∈ΔcX∣μ(X∖A)=0}∈□X (we'll call these sharp ultradistributions). We also define ⊥∈□X via ⊥={0}.
Infra-Bayesian Logic
The formal description in this section assumes some familiarity with type theory and categorical logic. For an informal overview of the syntax, we point the reader to Less Formal Description of Infra-Bayesian Logic.
The infra-Bayesian notion of beliefs gives rise to a fairly weak form of logic, in the sense that the spaces of predicates in infra-Bayesian logic don't form Heyting algebras, and are generally non-distributive and even non-modular as lattices. Also, due to the lack of function types, we cannot use the full syntax of type theory.
We will therefore work with a simplified syntax, still maintaining versions of many of the usual constructions appearing in higher order logic. This language can be used to describe infra-POMDPs (see Infra-POMDPs), which in turn is a useful beginning to something like an infra-AIXI (see Application: an Infra-Bayesian AIXI Analogue).
Less Formal Description of Infra-Bayesian Logic
Infra-Bayesian logic is a logical language. The language has types. Types can come from a set of initial types Tι, or can be combined from other types using sums and products. Also, 0 and 1 are special types (where 0 contains no element, and 1 contains exactly one element, so to speak). There is also a more complicated construction of additional types: If α is a type, then we also consider predicates on α a type, which we denote by [α].
Infra-Bayesian logic also has maps between types. The set of maps between types α and β is denoted by Fα→β. We will not list every kind of map that we require for infra-Bayesian logic in this informal description. But some examples are ∧α∈F[α]×[α]→[α] and ∨α∈F[α]×[α]→[α], which correspond to a meaning of "and" and "or".
A map of type F1→α is basically the same as something of type α, but it allows us to compose this with other maps. Examples are ⊤∈F1→[1] and ⊥∈F1→[1] which correspond to "true" and "false". There are also maps for equality and predicate evaluation. All kinds of different maps can be combined and composed to yield expressions in infra-Bayesian logic.
How does this connect to infra-Bayesianism? A collection of sentences in infra-Bayesian logic can have a model M. There are lots of rules for a model to ensure that the symbols of infra-bayesian logic correspond to their intended meaning. A model maps types into topological spaces. In order to make sure the above-mentioned types behave reasonably under the model, we require that M(0) is the empty topological space, M(1) is the 1-point topological space, products and sums work as expected (M(α×β)=M(α)×M(β) and M(α+β)=M(α)⊔M(β)). A special case is the topological space for [α] (the type of predicates on α): We require that M([α]) is the topological space of homogeneous ultra-contributions over M(α), i.e. □M(α). We will also explore alternatives to HUCs over topological spaces, but it should be some general notion of infra-distribution. The model of a map f∈Fα→β is a map between the topological spaces M(α)→M(β). We will have conditions that the model of a map is consistent with the meaning. For example, we require that the model of "∧α" maps two HUCs to their intersection.
We will later investigate the precise conditions on toplogical spaces and maps that we would like to have for infra-Bayesian logic.
Formal Description of Infra-Bayesian Logic
Syntax
Given a set Tι of primitive types, we can require that the types form a category C(T), which should be the "freest" category such that:
We won't construct the category C(T) here (uniqueness is clear). We'll use the shorthand Fα→β=C(T)(α,β) for the set of morphisms, and write Vα=F1→α (where 1 is the terminal object).
Definition 3. We say that a category C with finite products and coproducts supports infra-Bayesian logic, if it satisfies the following requirements:
There is a functor [_]:C→Cop (recall that Cop is the opposite category of C). For α∈Ob(C(T)), [α] is the object intended to correspond to predicates on α. For f∈C(α,β), we write f∗:[β]→[α] instead of [f] to denote pullback of predicates.
There are morphisms ∨α,∧α∈F[α]×[α]→[α]. We require that these operations turn V[α] into a bounded lattice, and we require the functor [_] to induce a functor into the category of lattices. In particular, we have ⊤,⊥∈V[1].
We require that pr∗β:[β]→[α×β] have natural left and right adjoints with respect to the poset structure (of the lattice) above. We also require these adjoints to come from morphisms in C, denoted by ∃αβ,∀αβ∈F[α×β]→[β] respectively.
We require the existence of an equality predicate =α∈V[α×α], which is the minimal predicate such that (diag∗α∘=α)=⊤α (cf. A constant function with output top).
We require the existence of evα∈V[[α]×α] (predicate evaluation). Given f:β→[α], we can pull back evα via f×1α to get ^f∈V[β×α]. Note however that we cannot require universality here (i.e. that every predicate in V[β×α] arise this way), due to the failure of the disintegration theorem in infra-Bayesianism. It's not entirely clear currently what the appropriate condition to require from evα should be on the syntax side, even though the intended semantics can be described (cf. Item 5 in the semantics).
For q∈Q∩[0,1] we use δq∈V[1+1] as a syntactic constant symbol corresponding to a coin toss with probability q.
Note that in Item 6 we could require the following more general set of symbols, however, the relevant ones can likely be approximated via the existing syntax. For n∈N, let Dn⊂□n (treating n as an n-point discrete space) be the subset of "describable" ultracontributions under some such notion. We would then introduce constant symbols ┌μ┐∈V[n] for each μ∈Dn.
Remark 4. In order to construct the syntax for the first order theory, we can instead consider two categories C (base types) and D (predicate types), with the functor [_]:C→Dop. In this context the predicate functor [_] cannot be iterated, hence we are left with a first order theory. The requirements in Definition 3 remain the same, except for Item 5 being removed. The construction of a pushforward via a kernel in Pushforward via a kernel no longer makes sense in that generality, but we explain how to recover it for finite sets in the first order theory in Pushforward via kernel for finite sets, which is then used when unrolling the time evolution of an infra-POMDP in Process.
Semantics
We require a model to be a functor M:C(T)→P, which preserves finite products and coproducts.
Moreover, we require P to support infra-Bayesian logic (cf. Definition 3), and the model functor M to respect the logical structure.
In practice, we will want P to be some subcategory of Top. The motivating candidate is to take P to be the category of compact Polish spaces with continuous maps, and the predicate functor on P to be □. This choice however doesn't work for higher order logic.
We nevertheless spell out what the logical semantics translate to, using □ as a generic symbol denoting some appropriate notion of "ultracontributions".
M([α])=□M(α) (cf. Lemma 10), and M(f∗)=M(f)∗ is the pullback (see Continuity Counterexamples about issues with continuity in the case of HUCs).
We require M to induce a lattice homomorphism V[α]→□M(α), where
M(∨α):□M(α)×□M(α)→□M(α) is given by convex hull of the union
M(∧α):□M(α)×□M(α)→□M(α) is given by intersection (see Continuity Counterexamples about issues with continuity in the case of HUCs)
Predicates. Let X=M(α), Y=M(β), and prY:X×Y→Y be projection onto the second factor. The following follow from the adjunction
If X=M(α), then M(evα)=⊤□X⋉1□X, where ⊤□X∈□□X, and 1□X:□X→□X is considered as an infrakernel.
M(δq)∈□(pt+pt) is the crisp ultradistribution corresponding to a coin toss with probability q∈Q∩[0,1] (here pt is the one point space).
Note in the more general setting of describable ultracontributions, we would require M(┌μ┐)=μ.
Definition 5. A subset of V[1] (i.e. a set of sentences) is called a theory. We say that M models the theory T if M(ϕ)=⊤pt for all ϕ∈T.
Remark 6. Finding an appropriate category P turns out to be harder than first expected. We discuss various candidates considered in Candidates for the Semantics.
In general, we note that for infra-POMDPs with finite state spaces, the transition (infra-)kernels are always continuous, and some of the issues with continuity mentioned in Candidates for the Semantics are less of a problem.
Constructions in IBL
In the following we construct various useful components of IBL using the syntax outlined above. These constructions will also be used to build examples of infra-POMDPs in Infra-POMDPs.
Pushforward
Given f:X→Y, we can construct the pushforward f∗:[X]→[Y] as follows. First, consider the following two maps into [X×Y]:
1=Y−−→[Y×Y](f×1Y)∗−−−−−→[X×Y] (this represents the graph of f)
[X]pr∗X−−→[X×Y].
Composing the product of these two with ∧X×Y:[X×Y]×[X×Y]→[X×Y], we get [X]=1×[X]→[X×Y], and finally post-composing with [X×Y]∃XY−−→[Y], we get f∗:[X]→[Y].
Pushforward via a kernel
Given f:X→[Y], we can construct the pushforward f∗:[X]→[Y] as well. This is done similarly to Pushforward, except the element in [X×Y] is replaced with
1evY−−→[[Y]×Y](f×1Y)∗−−−−−→[X×Y],
where evY is predicate evaluation.
Pushforward via kernel for finite sets
If X is a finite set, then we can think of a map f:X→[Y] as a collection of points fx:1D→[Y], which is meaningful in the first order syntax too, as long as the predicate category has a final object 1D∈D. In this case we can construct the pushforward along the kernel f in the first order theory as follows. For each x∈X we have the composite
Gfx:1fx−→[Y](ιx×1Y)∗−−−−−→[X×Y],
where ιx is the inclusion of x∈X, and (ιx×1Y)∗ is the regular pushforward from Pushforward. Now, taking the product of these over x∈X, and taking (iterated) disjunction, we have
Given f,g:α→β, we want to construct a sentence sf=g∈V[1]. This can be done as follows. First, take the composite with the diagonal on α
φ:αdiagα−−−→α×αf×g−−→β×β,
then we have
sf=g:1=β−→[β×β]φ∗−→[α]∀α1−−→[1],
so sf=g∈V[1] is the desired sentence.
Mixing with probability 1/2
Assume that we have two terms s1∈F1→σ1, s2∈F1→σ2. Then we have s1+s2∈F1+1→σ1+σ2. Applying a pushforward yields (s1+s2)∗∈F[1+1]→[σ1+σ2]. Then we can compose with δ1/2∈F1→[1+1], which assigns a fair coin distribution on 1+1.
f:1δ1/2−−→[1+1](s1+s2)∗−−−−−→[σ1+σ2].
A constant function with output bottom
We can construct a constant function term f∈Fα→[β] whose model M(f) is a constant function that maps everything to ⊥M(β) as follows. Let tα∈Fα→1 be the terminal map. Then we can define the term f via
f:αtα−→1⊥→[1]t∗β−→[β].
This does the right thing: By factoring through M(1)=pt the function M(f) has to be constant, and by using ⊥, the output has to be ⊥M(β).
A constant function with output top
This works the same way, just using the symbol ⊤∈F1→[1] instead of ⊥∈F1→[1]:
f:αtα−→1⊤→[1]t∗β−→[β].
Components of infrakernels for sum types
For a function f∈Fα+β→[γ+δ], how do we get access to its components, so that we get a term fα,γ∈Fα→[γ]? We do this by composing with iαβ and i∗γδ, i.e.
fα,γ:αiα−→α+βf→[γ+δ]i∗γ−→[γ].
For the other components this would work essentially in the same way, which would give us functions fα,δ∈Fα→[δ], fβ,γ∈Fβ→[γ], fβ,δ∈Fβ→[δ].
Infra-POMDPs
Setup
We can describe infra-POMDPs in the language of infra-Bayesian logic in the following way. Infra-POMDPs consist of the following.
Finite sets of actions A and observations O
For each observation o∈O, a type σo of states producing the given observation. Let σ∗=∑o∈Oσo be the type of all states.
An initial state s0∈V[σ∗]
For each a∈A, a transition kernel Ka∈Fσ∗→[σ∗].
Process
Given a state at time t, st∈V[σ∗], and an action a∈A, we can apply the kernel pushforward (Ka)∗:[σ∗]→[σ∗], to end up in a new state st+1∈V[σ∗]. Once receiving an observation ot∈O, we can do an update on the observed state. In general, this update depends on the loss function, but we can obtain a "naive" update via pull back and push forward through iot:σot→∑o∈Oσo=σ∗ to get the observed state ^st+1=(iot)∗(iot)∗(st+1)∈V[σ∗].
Laws
In infra-Bayesianism, laws (also known as belief functions in earlier posts on infra-Bayesianism) are functions Θ:Π→□((A×O)ω), where Π is the space of possible (deterministic) policies of the IB agent (a technical note: the space (A×O)ω is a Polish space, when equipped with the topology generated by cylinder sets). Such a law describes infra-Bayesian beliefs how a policy will lead to certain histories.
Let us give an example of an infra-POMDP. The first example is basically just and ordinary MDP, rewritten to fit into the above notation for infra-POMDPs.
Example 7. We set O={0,1}, A={0,1}. This example has (functionally) no relevant hidden states, i.e. states are equivalent to observations. Taking action 0∈A will lead to the same observation as the previous observation. Taking action 1∈A will lead to the other observation as the previous observation. As for the initial conditions, we start out with probability 12 for each observation/corresponding state.
Let us explain how Example 7 can be described using an IBL theory T⊂V1. We have types σ0 and σ1 for the states corresponding to the two possible observations. First, we introduce non-logical symbols s1∈Vσ0, s2∈Vσ1. Then, using the construction in Mixing with probability 1/2, we can define a term in V[σ∗] that assigns probability 1/2 to each of σ0, σ1, as desired. Using function equality from Equality of functions, we can equate the initial state s0∈V[σ∗] with this mixture. We add the equality sentence to our theory T. This ensures that every model of the theory T satisfies M(s0)=12⊤M(σ0)+12⊤M(σ1).
Next, we will add sentences that describe the transition kernels. Note that we have two transition kernels K0,K1∈Fσ∗→[σ∗], one for each possible action. Then, for each transition kernel, we can decomposition it, like in Components of infrakernels for sum types. This gives us 8 possibilities. For example, (K0)σ0,σ1 describes our beliefs that we will be in state σ1 after we take action 0∈A and are in state σ0. For all the 8 possible combinations, the resulting function should be a constant function, with values ⊥ or ⊤σ0,1. Again, we can use the equality of functions as described in Equality of functions to ensure that these components of the transition kernels have the correct values, in accordance with the transition kernel as described in the example. We will add these sentences to the theory T, so that T now has 9 elements.
Example 8. The following is a "fully infra" POMDP.
O={o}, single observation
σo={A,B}, two states
A={a}, single action.
The credal set corresponding to
K(A) is the interval [13A+23B,B]
K(B) is the interval [A,23A+13B].
We can take downward closure to end up with a homogeneous ultradistribution.
So a single step in the Markov process swaps A and B, but with some Knightean uncertainty of landing up to 1/3 away from the target. As we iterate through multiple timesteps, we gain no information through observation (since |O|=1), so the intervals of Knightean uncertainty widen (the portion outside the Knightean interval shrinks as (23)t).
We can construct the above infra-POMDP via IBL as follows. Let
δ1/3,δ2/3∈V[2]
be coin tosses with probability of heads equal to 1/3 and 2/3 respectively. We could in principle approximate these with iterated fair coin tosses, but we assume them as given for convenience. Let
iA,iB:1→1+1=σ∗
be the two states. Then we have
⊤A:1⊤→[1]iA∗−−→[σ∗],
and similarly for ⊤B. We can construct the intervals of length 1/3 from A and B:
IA=⊤A∨δ2/3∈V[σ∗]
IB=⊤B∨δ1/3∈V[σ∗].
Finally, we construct the transition kernel Ka=IB+IA:σ∗→[σ∗].
Application: an Infra-Bayesian AIXI Analogue
A potential application for infra-Bayesian logic is to describe the hypotheses of an AIXI-like infra-Bayesian agent, by using IBL to describe infra-POMDPs. Let us expand a bit on this idea.
Let's assume an infra-Bayesian agent knows that the environment is an infra-POMDP and also happens to know the finitely many actions A the agent can take each time step, and the finitely many possible observations O the agent could be receiving. However, it does not not know how many states there are. (We will leave rewards out of the picture here. We could include rewards by encoding it in the observations, e.g., by adding a reward ro for each observation. But for now let's focus only on the agent's model of the environment.)
An infra-Bayesian agent would have a set of hypotheses of how the environment could be. In our setting, each hypothesis has the form of an infra-POMDP. Like an AIXI agent, we would like to have a variant of Occam's razor: simpler hypotheses should receive a higher weight.
There is a potential approach to use a finite theory in IBL to describe a hypothesis for our infra-POMDP, and then the weight of this hypothesis in the prior is based on the length of the description of the IBL theory, e.g. proportional to 2−2l when l is the description length. Note that some IBL theories might have no valid model, so these theories should be discarded from the hypotheses; and there might be other reason why an IBL theory might be excluded from the space of hypotheses, see below.
Each theory in infra-Bayesian logic can have (possibly none or multiple) models. If we use types σo, σ∗ and symbols s0∈V[σ∗], Ka∈Fσ∗→[σ∗] as in Setup, then each model M describes an infra-POMDP: The initial state can be described by M(s0), and for each a∈A, M(Ka) is the transition kernel.
The question arises what do we do if there are multiple models for the same IBL term. Let us describe two approaches, which both rely on converting the infra-POMDP into a corresponding law, see Laws.
A first approach could be to only allow theories in IBL that produce models which are unique up to equivalence, where we say that two models are equivalent if they produce infra-POMDPs that produce the same laws.
A second approach could be to consider all possible models in a first step, then convert these models into corresponding laws, and then take the disjunction (i.e. closed convex hull of all outputs of the laws) to turn it into the final law.
Candidates for the Semantics
Compact Polish Spaces with Continuous Maps
A first candidate for the category P would be compact Polish spaces with continuous maps, and '□' being the space of HUCs as defined in Notation.
However, we run into issues with continuity in this category. Namely, the maps M(∧), M(∀αβ), M(f)∗ are not always continuous, see Continuity Counterexamples.
Allowing Discontinuity
Another approach could be to bite the bullet on discontinuity and broaden the set of allowed morphisms in the semantics category. The problem with these approaches generally is that pullbacks really only behave nicely with respect to continuous maps, we typically lose functoriality when allowing discontinuity. That is, the desired property (g∘f)∗=f∗∘g∗ can fail to hold for maps f:X→Y, g:Y→Z. This is mostly because for discontinuous functions we have to take the closure in the definition of the pullback (see Notation).
Upper Semicontinuity
This approach is inspired by the fact that while intersection is not continuous on □X, it is upper semicontinuous (wrt to the set inclusion order) on □X.
For this approach, we would use the category of compact polish spaces with a closed partial order, and for the maps we require that they are monotone increasing and upper semicontinuous. Here, a function f:X→Y is upper semicontinuous when
xn→x∧f(xn)→y⟹y≤Yf(x)
holds for all sequences {xn}⊂X and points x∈X, y∈Y. As for the partial order on □X, one could consider the partial order based on set inclusions, or additionally combine this with the stochastic order mentioned in Definition 2.1 in IBP by requiring that elements in □X are also downwards closed with respect to the stochastic order.
We could try to broaden the set of morphisms even further, but naturally the issues with functoriality remain.
A Different Notion of HUCs
A possible remedy for the continuity issues could be to use an alternative definition of HUCs, in which the problematic maps become continuous. Roughly speaking, the high level idea here is to require elements of the modified HUCs to "widen" closer to ⊥, thus avoiding the discontinuities.
Unfortunately, we were not able to find a working semantics of this type, despite trying different approaches, as described below.
We can require HUCs to be closed under the finer partial order ⪯ in Definition 9, and modify Definition 2 accordingly. Let □L(X) denote the space of these alternative HUCs. One downside of this approach is that now the space □LX depends on the metric on X, while previously it only depended on the underlying topology.
Definition 9 (downward closure related to 1-Lipschitz functions). Let X be a compact metric space. For θ1,θ2∈Δc(X) we define the relation
θ1⪯θ2:⟺∀f:X→R+,f1-Lipschitz:θ1(1+f)≤θ2(1+f).
The space □L(X) is defined as the set of closed convex subsets of Δc(X) that are downward closed under the partial order ⪯.
To see that it achieves some of the desired behavior, we show that M(∧) is continuous in Lemma 14. This approach also requires that we work with compact metric spaces, instead of merely with compact polish spaces.
The Category of metric spaces is typically equipped with maps that are 1-Lipschitz continuous, and the metric on a product X×Y of metric spaces X,Y is given by
dX×Y((x1,y1),(x2,y2))=max(dX(x1,x2),dY(y1,y2)).
We do make some further modifications. First, we only consider compact metric spaces whose metric d only has values in [0,1]. This allows us to consider sums/coproducts of metric spaces (usually the category of metric spaces does not have coproducts). For the metric on a sum (disjoint union) of two metric spaces, we set the distance of two points from different components to 1.
Let us define the metric we will use on □L. As a metric on Δc(X) we will use the Kantorovich-Rubenstein metric
d(p,q):=sup{|p(g)−q(g)|:g:X→[−1,1]1-Lipschitz},
which is a metric that is compatible with the weak-∗ topology on Δc(X). Then we use the Hausdorff metric dH on □L. It can be seen that this Hausdorff metric has values in [0,1] again.
A problem with □L(X) is that it is not always preserved under the pullback, i.e. f∗(Θ)∉□L(X) can happen for some f:X→Y, Θ∈□L(Y). Thus, we will redefine the pullback in this context. For a function between metric spaces f:X→Y we define the modified pullback f∗L:□LY→□LX via f∗L(Θ):={μ∈Δc(Y)|∃ν∈f∗(Θ):μ⪯ν}. We will use this alternative pullback for the case of □L.
Under this setting, we run into the issue that M(∧) is not 1-Lipschitz continuous, see Example 16. This is despite the fact that M(∧) is continuous (and probably even Lipschitz continuous).
Allow higher Lipschitz constants for maps
One might wonder what happens, when we pick a higher Lipschitz constant c>1 for maps in our category (instead of c=1). If we have two functions with a Lipschitz constant c>1, then we can only guarantee a Lipschitz constant of c2 for the composition (e.g. with the function [0,1]∋x↦min(1,cx)∈[0,1]). If we repeat this argument, we cannot guarantee a finite Lipschitz constant, because cn→∞ as n→∞.
So, what happens if we have allow maps with Lipschitz constant 2? It turns out that then we run into issues with functoriality, as in Allowing Discontinuity. A concrete counterexample is described in Example 15.
A first problem with this choice is that this is incompatible with any finite bound on d, as is the case in Standard setting for the alternative HUC case. This issue can be avoided by considering extended metric spaces, which are metric spaces whose metric can take values in [0,∞] (in that case the distance between disjoint components in a sum is set to ∞).
A more difficult problem with this approach would be that the diagonal map diag:X→X×X is not 1-Lipschitz continuous for other natural choices of metric on products. We do need the diagonal map, see, e.g., Equality of functions.
Modifying the definition of alternative HUCs
We mention that as an extension, the definition for ⪯ in Definition 9 could be modified to include a constant c>0 and use θ1(c+f)≤θ2(c+f). For smaller c>0, this would make these alternative HUCs sharper. The version with c=1 has some drawbacks as a lot of probability mass gets assigned to points nearby: For X=[0,1], we have 12δ0⪯δ1, so an alternative HUC which contains δ1 also has to contain 12δ0. In the context of infra-Bayesianism, this means we can use these kind of HUCs only in combination together with a smaller class of loss functions (in this case, loss functions that are of type 1+f for a nonnegative Lipschitz continuous f).
One might wonder whether increasing c might help us decrease the Lipschitz constant for M(∧) or even choose a c>1 such that M(∧) is 1-Lipschitz. While the Lipschitz constant does get improved, calculating through Example 16 would show that the Lipschitz constant M(∧) cannot reach 1 even if we use these modified versions of □L.
Appendix
Lemma 10. If X is a compact Polish space then so is □X.
Proof. It is known that the space of nonempty compact subsets of X is compact in the Hausdorff metric. One can also show that □X is a closed subset of the space of nonempty compact subsets of X.
Using the topology induced by the Hausdorff metric on □X (see below Definition 2), we find that in the 'logical' semantics (inverse) limit type constructions (in the category theoretical sense) are generally not continuous (see Continuity Counterexamples for details). This includes the semantics for ∧α, ∀αβ, and the pullback f∗.
If X itself is finite, then □X is nice enough that we don't run into these issues. This observation is sufficient for example in the construction of the infra-POMDPs in Examples when the state spaces are finite. However, we quickly run into issues if we're not careful with iterating the □ construction, since □X is not finite even if X is.
Below is some discussion on a possible approach to trying to fix these issues (we haven't found an entirely satisfactory fix though).
Continuity Counterexamples
Lemma 11. The function M(∧):□M(α)×□M(α)→□M(α), given by M(∧)(μ,ν)=μ∩ν is not always continuous.
Proof. Consider the case where M(α) is the space [0,1] with standard topology. This is a Polish space.
We consider sequences with xi=1/2−1/i and yi=1/2+1/i. We define x0=y0=1/2. Then ⊤{xi} and ⊤{yi} are in □M(α). (For a point z∈[0,1], I understand this notation to be ⊤{z}:={aδz∣a∈[0,1]}, where δz∈Δ([0,1]) is the measure that puts all weight on the single point z).
One can calculate that M(∧)(⊤{xi},⊤{yi})=⊥.
Because of the Hausdorff metric in □M(α) we have ⊤{xi}→⊤{x0} and ⊤{yi}→⊤{x0}.
Lemma 12. The function M(∀αβ) is not always continuous.
Proof. Consider the case that M(α)=M(β)=[0,1]. We use the notation f:=M(∀αβ). We choose the sequence μn=⊤[(0,0),(1,1/n)], where [,] denotes the line between points. Then one can show that f(μn)=⊥.
However, we also have μn→μ0:=⊤[(0,0),(1,0)]. One can show that f(μ0)=⊤{0}∈□M(β) holds. Thus, we have
f(μn)→⊥≠f(μ0),
i.e. f is not continuous. ◻
Lemma 13. There is a continuous function f:X→Y such that f∗:□Y→□X is not continuous.
Proof. Pick f:[−1,1]→[0,1], f(x)=max(0,x). Then f∗(⊤{1/n})=⊤{1/n}, but for the limit we have f∗(⊤{0})=⊤[−1,0], and not ⊤{1/n}→⊤[−1,0]. ◻
Semi-continuous Pullback is not Compositional
Let f,g:[0,1]→[0,1] be given by
g(x)={0if x<1/21if x≥1/2,
And
f(x)=⎧⎨⎩0if x<1/31/2if 1/3≤x<2/31if x≥2/3.
Both f and g are monotone increasing, upper semi-continuous. Let C=[0,1/2]⊂[0,1]. Then B=g−1(C)=[0,1/2), the closure ¯¯¯¯B=[0,1/2]. Hence f−1(B)=[0,1/3), while f−1(¯¯¯¯B)=[0,2/3), so
For the first term, let pk∈μk∩νk be such that d(pk,μ0∩ν0) is maximized. By compactness, there exists a convergent subsequence of pk. Wlog let pk→p0. We have pk∈μk for all k, and taking limits implies p0∈μ0. The same argument also implies p0∈ν0. Thus, we have
supp∈μk∩νkd(p,μ0∩ν0)=d(pk,μ0∩ν0)→d(p0,μ0∩ν0)=0.
For the second term, let p0∈μ0∩ν0 be given. Due to μk→μ0, there exists a sequence pk with pk∈μk and pk→p0. Similarly, there exists a sequence qk with qk∈νk and qk→q0. The idea is now to construct a measure (i.e. a point in Δc(X)) rk that is close to pk and qk but in μk∩νk.
Wlog we assume p0(1)>0 and pk(1)>0, qk(1)>0 for all k. We define δk>0 via
δk:=d(pk,qk)(2+diamX)(pk+qk)(1),
where diamX:=supx,y∈Xd(x,y) is the diameter of X. Clearly, we have δk→0. We also define
rk:=1−δk2(qk+pk).
We want to show rk⪯qk and rk⪯pk. By symmetry we only need to show rk⪯pk. For that purpose, let f:X→R+ be an arbitrary 1-Lipschitz continuous function. We define g:=1+f. We have to show rk(g)≤pk(g).
We can decompose g=gmin+g0, where gmin:=minx∈Xg(x) is a constant, and g0 is a nonnegative function that satisfies g0(x0)=0 for some x0∈X. Since g0 is 1-Lipschitz continuous, this implies that g0(x)≤diamX for all x∈X. Thus, (1+diamX)−1g0 is a function that is 1-Lipschitz continuous and bounded by 1, i.e. admissible in the definition Eqn. (1). We can use this to obtain
This research was produced during SERI MATS. Thanks to Vanessa Kosoy for providing ideas and reading drafts.
Introduction
In this article, we investigate "infra-Bayesian logic" (IBL for short), which is an idea introduced by Vanessa Kosoy. The original idea is outlined in Vanessa Kosoy's Shortform. Here, we expand with some details and examples. We also try to fix the mentioned issues (e.g. lack of continuity for certain maps) by investigating several candidate categories for semantics. However, we fail to find something that works.
As a theory of beliefs and reasoning, infra-Bayesianism gives rise to semantics for a certain logic, which we might call "infra-Bayesian logic" (IBL for short). Formalizing the syntax for this logic has the advantage of providing a notion of description complexity for infra-POMDPs (partially observable Markov decision processes) expressed via this syntax, and hence could serve as a foundation for an infra-Bayesian analogue of AIXI. Infra-Bayesian logic is not the only way to define an infra-Bayesian analog of AIXI. Such an alternative could be to use oracle machines (with an extra input tape for a source of randomness): For a probabilistic oracle machine, the corresponding IB hypothesis is the closed convex hull of all stochastic environments that result from using the oracle machine with a particular oracle, and then one could use the description complexity of the oracle machine (which can be defined by choosing a universal oracle machine) as a basis for an infra-Bayesian AIXI. The analogue of oracle machines for (non-infra-Bayesian) AIXI would be (probabilistic) Turing machines.
However, infra-Bayesian logic might be an interesting alternative to describing infra-Bayesian hypotheses, and this approach would be specific to infra-Bayesianism (there is, as far as we know, no good analogue of infra-Bayesian logic in (ordinary) Bayesianism).
Note that we are currently unaware of a working semantics for the higher order logic. We discuss some candidates considered in Candidates for the Semantics, and the ways in which they fail. It's unclear whether these issues with the semantics are mostly technical in nature, or whether they point to some deeper phenomena. We can, however, restrict to a first order IBL on finite sets, which, albeit less powerful, is still expressive enough for many applications (in particular covers all the examples in Examples).
See also the paragraph on IBL in the overview of the learning theoretic Agenda.
Outline and Reading Guide
You can skim/skip the Notation section and use it as a reference instead. The main description of IBL is given in Infra-Bayesian Logic. To get an initial and informal idea what IBL looks like, read Less Formal Description of Infra-Bayesian Logic. For a technical description and a complete list of properties, read Formal Description of Infra-Bayesian Logic. Note that the latter description requires some familiarity with category theory.
In Constructions in IBL we list how we can combine basic elements of IBL to express more complicated things.
In Infra-POMDPs we talk about infra-POMDPs as an application of IBL. Some knowledge of infra-Bayesianism is beneficial here, but not required. We highlight the section on Examples, where we describe the IBL terms for concrete Infra-POMDPs.
In Candidates for the Semantics we discuss the various (failed) candidates we tried for the higher-order logic version of IBL. A typical reader is not expected to read this section.
The Appendix is for mostly technical proofs and counterexamples. We recommend to only look these up when you are interested in the technical details. For example, if you wish to understand why conjunction is not continuous in IBL, you can find the details there.
Notation
Let us describe notation needed for using concepts from infra-Bayesianism.
This notation section is partially copied from the Infra-Bayesian Physicalism article, but extended from finite sets to compact Polish spaces, wich requires a bit more measure theory. Also, we use the symbol □Xfor homogeneus ultracontributions (instead of □cX as in the IBP article).
The notation from the IBP post is a bit different from some of the other posts on infra-Bayesianism in that it uses ultradistributions/ultracontributions rather than infradistributions.
We denote R+:=[0,∞). We will work a lot with compact Polish spaces.
Compact Polish spaces are spaces that are topologically equivalent to some compact metric space (the metric is not fixed, however, only the topology). An important special case of compact Polish spaces are finite (discrete) sets.
Definition 1. Given a compact Polish space X, a contribution on X is a (Borel) measure θ on X with θ(X)≤1. Given a measurable function f:X→R and θ∈ΔcX, we denote θ(f):=∫Xf(x)dθ(x). The space of contributions is denoted ΔcX, equipped with the weak-∗ topology (i.e. the weakest topology such that the functions ΔcX∋θ↦θ(f)∈R are continuous for all continuous functions f:X→R).
Naturally, any (probability) distribution is in particular a contribution, so ΔX⊆ΔcX. There is a natural partial order on contributions: θ1≤θ2 when θ1(A)≤θ2(A) for all measurable A⊂X.
Definition 2. A homogeneous ultracontribution (HUC) on X is non-empty closed convex Θ⊆ΔcX which is downward closed w.r.t. the partial order on ΔcX. The space of HUCs on X is denoted □X.
If a HUC Θ∈□X has the property Θ s.t. Θ∩ΔX≠∅, we call it a homogeneuous ultradistribution.
Sometimes we need a topology on the space of HUCs □X. In this case we will use the topology induced by the Hausdorff metric, (where we use a metric on ΔcX that fits with the weak-∗ topology. The topology induced by the Hausdorff metric is also equivalent to the so-called Vietoris topology, which is a topology that only depends on the underlying toplogy on Δc(X), see here. See also Propositions 42 and 50 in Less Basic Inframeasure Theorey for the same result for infra-distributioins). One can show that □X is a compact Polish space with this topology when X is a compact Polish space, see Lemma 10. Let us introduce more notation.
For a set A⊂Δc(X), clA denotes its closure.
Given another compact Polish space Y and a measurable function s:X→Y, s∗:□X→□Y is the pushforward by s:
s∗Θ:=cl{s∗θ∣θ∈Θ},where (s∗θ)(A):=θ(s−1(A)) for all measurable A⊂Y. We can omit the cl in the definition if s is continuous.
A map X→□Y is sometimes referred to as an infrakernel.
Given an infrakernel Ξ:X→□Y, Ξ∗:□X→□Y is the pushforward by Ξ:
Ξ∗Θ:=cl{κ∗θ∣θ∈Θ,κ:X→ΔcY,κmeasurable,∀x∈X:κ(x)∈Ξ(x)},where (for measurable A⊂Y)
(κ∗θ)(A):=∫Xκ(x)(A)dθ(x).For a measurable function s:X→Y We also define the pullback operator s∗:□Y→□X via
s∗Θ:=cl{θ∈ΔcX∣s∗θ∈Θ}.We can omit cl here if s is continuous.
prX:X×Y→X is the projection mapping, iX:X→X∐Y is the inclusion map.
Given Θ∈□X and an infrakernel Ξ:X→□Y, Θ⋉Ξ∈□(X×Y) is the semidirect product:
Θ⋉Ξ:=cl{κ⋉θ∣θ∈Θ,κ:X→ΔcY,κmeasurable,∀x∈X:κ(x)∈Ξ(x)},where the measure κ⋉θ is defined by
(κ⋉θ)(A×B):=∫Aκ(x)(B)dθ(x)for all measurable A⊂X, B⊂Y, which can be extended to arbitrary measurable sets C⊂X×Y in the usual measure-theoretic way.
We will also use the notation Ξ⋊Θ∈□(Y×X) for the same HUC with X and Y flipped. And, for Λ∈□Y, Θ⋉Λ∈□(X×Y) is the semidirect product of Θ with the constant ultrakernel whose value is Λ.
For a closed set A⊂X, we define ⊤A:={μ∈ΔcX∣μ(X∖A)=0}∈□X (we'll call these sharp ultradistributions). We also define ⊥∈□X via ⊥={0}.
Infra-Bayesian Logic
The formal description in this section assumes some familiarity with type theory and categorical logic. For an informal overview of the syntax, we point the reader to Less Formal Description of Infra-Bayesian Logic.
The infra-Bayesian notion of beliefs gives rise to a fairly weak form of logic, in the sense that the spaces of predicates in infra-Bayesian logic don't form Heyting algebras, and are generally non-distributive and even non-modular as lattices. Also, due to the lack of function types, we cannot use the full syntax of type theory.
We will therefore work with a simplified syntax, still maintaining versions of many of the usual constructions appearing in higher order logic. This language can be used to describe infra-POMDPs (see Infra-POMDPs), which in turn is a useful beginning to something like an infra-AIXI (see Application: an Infra-Bayesian AIXI Analogue).
Less Formal Description of Infra-Bayesian Logic
Infra-Bayesian logic is a logical language. The language has types. Types can come from a set of initial types Tι, or can be combined from other types using sums and products. Also, 0 and 1 are special types (where 0 contains no element, and 1 contains exactly one element, so to speak). There is also a more complicated construction of additional types: If α is a type, then we also consider predicates on α a type, which we denote by [α].
Infra-Bayesian logic also has maps between types. The set of maps between types α and β is denoted by Fα→β. We will not list every kind of map that we require for infra-Bayesian logic in this informal description. But some examples are ∧α∈F[α]×[α]→[α] and ∨α∈F[α]×[α]→[α], which correspond to a meaning of "and" and "or".
A map of type F1→α is basically the same as something of type α, but it allows us to compose this with other maps. Examples are ⊤∈F1→[1] and ⊥∈F1→[1] which correspond to "true" and "false". There are also maps for equality and predicate evaluation. All kinds of different maps can be combined and composed to yield expressions in infra-Bayesian logic.
How does this connect to infra-Bayesianism? A collection of sentences in infra-Bayesian logic can have a model M. There are lots of rules for a model to ensure that the symbols of infra-bayesian logic correspond to their intended meaning. A model maps types into topological spaces. In order to make sure the above-mentioned types behave reasonably under the model, we require that M(0) is the empty topological space, M(1) is the 1-point topological space, products and sums work as expected (M(α×β)=M(α)×M(β) and M(α+β)=M(α)⊔M(β)). A special case is the topological space for [α] (the type of predicates on α): We require that M([α]) is the topological space of homogeneous ultra-contributions over M(α), i.e. □M(α). We will also explore alternatives to HUCs over topological spaces, but it should be some general notion of infra-distribution. The model of a map f∈Fα→β is a map between the topological spaces M(α)→M(β). We will have conditions that the model of a map is consistent with the meaning. For example, we require that the model of "∧α" maps two HUCs to their intersection.
We will later investigate the precise conditions on toplogical spaces and maps that we would like to have for infra-Bayesian logic.
Formal Description of Infra-Bayesian Logic
Syntax
Given a set Tι of primitive types, we can require that the types form a category C(T), which should be the "freest" category such that:
Tι⊂Ob(C(T))
C(T) has finite products and coproducts
C(T) supports infra-Bayesian logic (Definition 3).
We won't construct the category C(T) here (uniqueness is clear). We'll use the shorthand Fα→β=C(T)(α,β) for the set of morphisms, and write Vα=F1→α (where 1 is the terminal object).
Definition 3. We say that a category C with finite products and coproducts supports infra-Bayesian logic, if it satisfies the following requirements:
There is a functor [_]:C→Cop (recall that Cop is the opposite category of C). For α∈Ob(C(T)), [α] is the object intended to correspond to predicates on α. For f∈C(α,β), we write f∗:[β]→[α] instead of [f] to denote pullback of predicates.
There are morphisms ∨α,∧α∈F[α]×[α]→[α]. We require that these operations turn V[α] into a bounded lattice, and we require the functor [_] to induce a functor into the category of lattices. In particular, we have ⊤,⊥∈V[1].
We require that pr∗β:[β]→[α×β] have natural left and right adjoints with respect to the poset structure (of the lattice) above. We also require these adjoints to come from morphisms in C, denoted by ∃αβ,∀αβ∈F[α×β]→[β] respectively.
We require the existence of an equality predicate =α∈V[α×α], which is the minimal predicate such that (diag∗α∘=α)=⊤α (cf. A constant function with output top).
We require the existence of evα∈V[[α]×α] (predicate evaluation). Given f:β→[α], we can pull back evα via f×1α to get ^f∈V[β×α]. Note however that we cannot require universality here (i.e. that every predicate in V[β×α] arise this way), due to the failure of the disintegration theorem in infra-Bayesianism. It's not entirely clear currently what the appropriate condition to require from evα should be on the syntax side, even though the intended semantics can be described (cf. Item 5 in the semantics).
For q∈Q∩[0,1] we use δq∈V[1+1] as a syntactic constant symbol corresponding to a coin toss with probability q.
Note that in Item 6 we could require the following more general set of symbols, however, the relevant ones can likely be approximated via the existing syntax. For n∈N, let Dn⊂□n (treating n as an n-point discrete space) be the subset of "describable" ultracontributions under some such notion. We would then introduce constant symbols ┌μ┐∈V[n] for each μ∈Dn.
Remark 4. In order to construct the syntax for the first order theory, we can instead consider two categories C (base types) and D (predicate types), with the functor [_]:C→Dop. In this context the predicate functor [_] cannot be iterated, hence we are left with a first order theory. The requirements in Definition 3 remain the same, except for Item 5 being removed. The construction of a pushforward via a kernel in Pushforward via a kernel no longer makes sense in that generality, but we explain how to recover it for finite sets in the first order theory in Pushforward via kernel for finite sets, which is then used when unrolling the time evolution of an infra-POMDP in Process.
Semantics
We require a model to be a functor M:C(T)→P, which preserves finite products and coproducts.
Moreover, we require P to support infra-Bayesian logic (cf. Definition 3), and the model functor M to respect the logical structure.
In practice, we will want P to be some subcategory of Top. The motivating candidate is to take P to be the category of compact Polish spaces with continuous maps, and the predicate functor on P to be □. This choice however doesn't work for higher order logic.
We nevertheless spell out what the logical semantics translate to, using □ as a generic symbol denoting some appropriate notion of "ultracontributions".
M([α])=□M(α) (cf. Lemma 10), and M(f∗)=M(f)∗ is the pullback (see Continuity Counterexamples about issues with continuity in the case of HUCs).
We require M to induce a lattice homomorphism V[α]→□M(α), where
M(∨α):□M(α)×□M(α)→□M(α) is given by convex hull of the union
M(∧α):□M(α)×□M(α)→□M(α) is given by intersection (see Continuity Counterexamples about issues with continuity in the case of HUCs)
Predicates. Let X=M(α), Y=M(β), and prY:X×Y→Y be projection onto the second factor. The following follow from the adjunction
M(∃αβ)=prY∗ is the pushforward
For μ∈□(X×Y), we have
M(∀αβ)(μ)={p∈□Y| ∀q∈Δc(X×Y):(prY∗(q)=p⟹q∈μ)}(see Continuity Counterexamples about issues with continuity in the case of HUCs)
M(=α)=⊤diagM(α) (as a sharp ultradistribution)
If X=M(α), then M(evα)=⊤□X⋉1□X, where ⊤□X∈□□X, and 1□X:□X→□X is considered as an infrakernel.
M(δq)∈□(pt+pt) is the crisp ultradistribution corresponding to a coin toss with probability q∈Q∩[0,1] (here pt is the one point space).
Note in the more general setting of describable ultracontributions, we would require M(┌μ┐)=μ.
Definition 5. A subset of V[1] (i.e. a set of sentences) is called a theory. We say that M models the theory T if M(ϕ)=⊤pt for all ϕ∈T.
Remark 6. Finding an appropriate category P turns out to be harder than first expected. We discuss various candidates considered in Candidates for the Semantics.
In general, we note that for infra-POMDPs with finite state spaces, the transition (infra-)kernels are always continuous, and some of the issues with continuity mentioned in Candidates for the Semantics are less of a problem.
Constructions in IBL
In the following we construct various useful components of IBL using the syntax outlined above. These constructions will also be used to build examples of infra-POMDPs in Infra-POMDPs.
Pushforward
Given f:X→Y, we can construct the pushforward f∗:[X]→[Y] as follows. First, consider the following two maps into [X×Y]:
1=Y−−→[Y×Y](f×1Y)∗−−−−−→[X×Y] (this represents the graph of f)
[X]pr∗X−−→[X×Y].
Composing the product of these two with ∧X×Y:[X×Y]×[X×Y]→[X×Y], we get [X]=1×[X]→[X×Y], and finally post-composing with [X×Y]∃XY−−→[Y], we get f∗:[X]→[Y].
Pushforward via a kernel
Given f:X→[Y], we can construct the pushforward f∗:[X]→[Y] as well. This is done similarly to Pushforward, except the element in [X×Y] is replaced with
1evY−−→[[Y]×Y](f×1Y)∗−−−−−→[X×Y],where evY is predicate evaluation.
Pushforward via kernel for finite sets
If X is a finite set, then we can think of a map f:X→[Y] as a collection of points fx:1D→[Y], which is meaningful in the first order syntax too, as long as the predicate category has a final object 1D∈D. In this case we can construct the pushforward along the kernel f in the first order theory as follows. For each x∈X we have the composite
Gfx:1fx−→[Y](ιx×1Y)∗−−−−−→[X×Y],where ιx is the inclusion of x∈X, and (ιx×1Y)∗ is the regular pushforward from Pushforward. Now, taking the product of these over x∈X, and taking (iterated) disjunction, we have
1∏x∈XGfx−−−−−−→∏x∈X[X×Y]∨→[X×Y],which is the "graph of f", using which we construct the pushforward f∗:[X]→[Y] analogously to Pushforward and Pushforward via a kernel.
Equality of functions
Given f,g:α→β, we want to construct a sentence sf=g∈V[1]. This can be done as follows. First, take the composite with the diagonal on α
φ:αdiagα−−−→α×αf×g−−→β×β,then we have
sf=g:1=β−→[β×β]φ∗−→[α]∀α1−−→[1],so sf=g∈V[1] is the desired sentence.
Mixing with probability 1/2
Assume that we have two terms s1∈F1→σ1, s2∈F1→σ2. Then we have s1+s2∈F1+1→σ1+σ2. Applying a pushforward yields (s1+s2)∗∈F[1+1]→[σ1+σ2]. Then we can compose with δ1/2∈F1→[1+1], which assigns a fair coin distribution on 1+1.
f:1δ1/2−−→[1+1](s1+s2)∗−−−−−→[σ1+σ2].A constant function with output bottom
We can construct a constant function term f∈Fα→[β] whose model M(f) is a constant function that maps everything to ⊥M(β) as follows. Let tα∈Fα→1 be the terminal map. Then we can define the term f via
f:αtα−→1⊥→[1]t∗β−→[β].This does the right thing: By factoring through M(1)=pt the function M(f) has to be constant, and by using ⊥, the output has to be ⊥M(β).
A constant function with output top
This works the same way, just using the symbol ⊤∈F1→[1] instead of ⊥∈F1→[1]:
f:αtα−→1⊤→[1]t∗β−→[β].Components of infrakernels for sum types
For a function f∈Fα+β→[γ+δ], how do we get access to its components, so that we get a term fα,γ∈Fα→[γ]? We do this by composing with iαβ and i∗γδ, i.e.
fα,γ:αiα−→α+βf→[γ+δ]i∗γ−→[γ].For the other components this would work essentially in the same way, which would give us functions fα,δ∈Fα→[δ], fβ,γ∈Fβ→[γ], fβ,δ∈Fβ→[δ].
Infra-POMDPs
Setup
We can describe infra-POMDPs in the language of infra-Bayesian logic in the following way. Infra-POMDPs consist of the following.
Finite sets of actions A and observations O
For each observation o∈O, a type σo of states producing the given observation. Let σ∗=∑o∈Oσo be the type of all states.
An initial state s0∈V[σ∗]
For each a∈A, a transition kernel Ka∈Fσ∗→[σ∗].
Process
Given a state at time t, st∈V[σ∗], and an action a∈A, we can apply the kernel pushforward (Ka)∗:[σ∗]→[σ∗], to end up in a new state st+1∈V[σ∗]. Once receiving an observation ot∈O, we can do an update on the observed state. In general, this update depends on the loss function, but we can obtain a "naive" update via pull back and push forward through iot:σot→∑o∈Oσo=σ∗ to get the observed state ^st+1=(iot)∗(iot)∗(st+1)∈V[σ∗].
Laws
In infra-Bayesianism, laws (also known as belief functions in earlier posts on infra-Bayesianism) are functions Θ:Π→□((A×O)ω), where Π is the space of possible (deterministic) policies of the IB agent (a technical note: the space (A×O)ω is a Polish space, when equipped with the topology generated by cylinder sets). Such a law describes infra-Bayesian beliefs how a policy will lead to certain histories.
Under some conditions, one can convert an infra-POMDP into a corresponding law. We will not explain exactly how, and instead refer to Theorem 4 in "The many faces of infra-beliefs".
Examples
Let us give an example of an infra-POMDP. The first example is basically just and ordinary MDP, rewritten to fit into the above notation for infra-POMDPs.
Example 7. We set O={0,1}, A={0,1}. This example has (functionally) no relevant hidden states, i.e. states are equivalent to observations. Taking action 0∈A will lead to the same observation as the previous observation. Taking action 1∈A will lead to the other observation as the previous observation. As for the initial conditions, we start out with probability 12 for each observation/corresponding state.
Let us explain how Example 7 can be described using an IBL theory T⊂V1. We have types σ0 and σ1 for the states corresponding to the two possible observations. First, we introduce non-logical symbols s1∈Vσ0, s2∈Vσ1. Then, using the construction in Mixing with probability 1/2, we can define a term in V[σ∗] that assigns probability 1/2 to each of σ0, σ1, as desired. Using function equality from Equality of functions, we can equate the initial state s0∈V[σ∗] with this mixture. We add the equality sentence to our theory T. This ensures that every model of the theory T satisfies M(s0)=12⊤M(σ0)+12⊤M(σ1).
Next, we will add sentences that describe the transition kernels. Note that we have two transition kernels K0,K1∈Fσ∗→[σ∗], one for each possible action. Then, for each transition kernel, we can decomposition it, like in Components of infrakernels for sum types. This gives us 8 possibilities. For example, (K0)σ0,σ1 describes our beliefs that we will be in state σ1 after we take action 0∈A and are in state σ0. For all the 8 possible combinations, the resulting function should be a constant function, with values ⊥ or ⊤σ0,1. Again, we can use the equality of functions as described in Equality of functions to ensure that these components of the transition kernels have the correct values, in accordance with the transition kernel as described in the example. We will add these sentences to the theory T, so that T now has 9 elements.
Example 8. The following is a "fully infra" POMDP.
The credal set corresponding to
We can take downward closure to end up with a homogeneous ultradistribution.
So a single step in the Markov process swaps A and B, but with some Knightean uncertainty of landing up to 1/3 away from the target. As we iterate through multiple timesteps, we gain no information through observation (since |O|=1), so the intervals of Knightean uncertainty widen (the portion outside the Knightean interval shrinks as (23)t).
We can construct the above infra-POMDP via IBL as follows. Let
δ1/3,δ2/3∈V[2]be coin tosses with probability of heads equal to 1/3 and 2/3 respectively. We could in principle approximate these with iterated fair coin tosses, but we assume them as given for convenience. Let
iA,iB:1→1+1=σ∗be the two states. Then we have
⊤A:1⊤→[1]iA∗−−→[σ∗],and similarly for ⊤B. We can construct the intervals of length 1/3 from A and B:
IA=⊤A∨δ2/3∈V[σ∗]Finally, we construct the transition kernel Ka=IB+IA:σ∗→[σ∗].
Application: an Infra-Bayesian AIXI Analogue
A potential application for infra-Bayesian logic is to describe the hypotheses of an AIXI-like infra-Bayesian agent, by using IBL to describe infra-POMDPs. Let us expand a bit on this idea.
Let's assume an infra-Bayesian agent knows that the environment is an infra-POMDP and also happens to know the finitely many actions A the agent can take each time step, and the finitely many possible observations O the agent could be receiving. However, it does not not know how many states there are. (We will leave rewards out of the picture here. We could include rewards by encoding it in the observations, e.g., by adding a reward ro for each observation. But for now let's focus only on the agent's model of the environment.)
An infra-Bayesian agent would have a set of hypotheses of how the environment could be. In our setting, each hypothesis has the form of an infra-POMDP. Like an AIXI agent, we would like to have a variant of Occam's razor: simpler hypotheses should receive a higher weight.
There is a potential approach to use a finite theory in IBL to describe a hypothesis for our infra-POMDP, and then the weight of this hypothesis in the prior is based on the length of the description of the IBL theory, e.g. proportional to 2−2l when l is the description length. Note that some IBL theories might have no valid model, so these theories should be discarded from the hypotheses; and there might be other reason why an IBL theory might be excluded from the space of hypotheses, see below.
Each theory in infra-Bayesian logic can have (possibly none or multiple) models. If we use types σo, σ∗ and symbols s0∈V[σ∗], Ka∈Fσ∗→[σ∗] as in Setup, then each model M describes an infra-POMDP: The initial state can be described by M(s0), and for each a∈A, M(Ka) is the transition kernel.
The question arises what do we do if there are multiple models for the same IBL term. Let us describe two approaches, which both rely on converting the infra-POMDP into a corresponding law, see Laws.
A first approach could be to only allow theories in IBL that produce models which are unique up to equivalence, where we say that two models are equivalent if they produce infra-POMDPs that produce the same laws.
A second approach could be to consider all possible models in a first step, then convert these models into corresponding laws, and then take the disjunction (i.e. closed convex hull of all outputs of the laws) to turn it into the final law.
Candidates for the Semantics
Compact Polish Spaces with Continuous Maps
A first candidate for the category P would be compact Polish spaces with continuous maps, and '□' being the space of HUCs as defined in Notation.
However, we run into issues with continuity in this category. Namely, the maps M(∧), M(∀αβ), M(f)∗ are not always continuous, see Continuity Counterexamples.
Allowing Discontinuity
Another approach could be to bite the bullet on discontinuity and broaden the set of allowed morphisms in the semantics category. The problem with these approaches generally is that pullbacks really only behave nicely with respect to continuous maps, we typically lose functoriality when allowing discontinuity. That is, the desired property (g∘f)∗=f∗∘g∗ can fail to hold for maps f:X→Y, g:Y→Z. This is mostly because for discontinuous functions we have to take the closure in the definition of the pullback (see Notation).
Upper Semicontinuity
This approach is inspired by the fact that while intersection is not continuous on □X, it is upper semicontinuous (wrt to the set inclusion order) on □X.
For this approach, we would use the category of compact polish spaces with a closed partial order, and for the maps we require that they are monotone increasing and upper semicontinuous. Here, a function f:X→Y is upper semicontinuous when
xn→x∧f(xn)→y⟹y≤Yf(x)holds for all sequences {xn}⊂X and points x∈X, y∈Y. As for the partial order on □X, one could consider the partial order based on set inclusions, or additionally combine this with the stochastic order mentioned in Definition 2.1 in IBP by requiring that elements in □X are also downwards closed with respect to the stochastic order.
In both cases, the approach with upper semicontinuous maps, runs into the functoriality issues mentioned above, see Semi-continuous Pullback is not Compositional.
Measurable Maps
We could try to broaden the set of morphisms even further, but naturally the issues with functoriality remain.
A Different Notion of HUCs
A possible remedy for the continuity issues could be to use an alternative definition of HUCs, in which the problematic maps become continuous. Roughly speaking, the high level idea here is to require elements of the modified HUCs to "widen" closer to ⊥, thus avoiding the discontinuities.
Unfortunately, we were not able to find a working semantics of this type, despite trying different approaches, as described below.
We can require HUCs to be closed under the finer partial order ⪯ in Definition 9, and modify Definition 2 accordingly. Let □L(X) denote the space of these alternative HUCs. One downside of this approach is that now the space □LX depends on the metric on X, while previously it only depended on the underlying topology.
Definition 9 (downward closure related to 1-Lipschitz functions). Let X be a compact metric space. For θ1,θ2∈Δc(X) we define the relation
θ1⪯θ2:⟺∀f:X→R+,f1-Lipschitz:θ1(1+f)≤θ2(1+f).The space □L(X) is defined as the set of closed convex subsets of Δc(X) that are downward closed under the partial order ⪯.
To see that it achieves some of the desired behavior, we show that M(∧) is continuous in Lemma 14. This approach also requires that we work with compact metric spaces, instead of merely with compact polish spaces.
We give some technical details in Standard setting for the alternative HUC case. Later we will look at alternatives, mostly to justify the choices made in Standard setting for the alternative HUC case.
Standard setting for the alternative HUC case
The Category of metric spaces is typically equipped with maps that are 1-Lipschitz continuous, and the metric on a product X×Y of metric spaces X,Y is given by
dX×Y((x1,y1),(x2,y2))=max(dX(x1,x2),dY(y1,y2)).We do make some further modifications. First, we only consider compact metric spaces whose metric d only has values in [0,1]. This allows us to consider sums/coproducts of metric spaces (usually the category of metric spaces does not have coproducts). For the metric on a sum (disjoint union) of two metric spaces, we set the distance of two points from different components to 1.
Let us define the metric we will use on □L. As a metric on Δc(X) we will use the Kantorovich-Rubenstein metric
d(p,q):=sup{|p(g)−q(g)|:g:X→[−1,1]1-Lipschitz},which is a metric that is compatible with the weak-∗ topology on Δc(X). Then we use the Hausdorff metric dH on □L. It can be seen that this Hausdorff metric has values in [0,1] again.
A problem with □L(X) is that it is not always preserved under the pullback, i.e. f∗(Θ)∉□L(X) can happen for some f:X→Y, Θ∈□L(Y). Thus, we will redefine the pullback in this context. For a function between metric spaces f:X→Y we define the modified pullback f∗L:□LY→□LX via f∗L(Θ):={μ∈Δc(Y)|∃ν∈f∗(Θ):μ⪯ν}. We will use this alternative pullback for the case of □L.
Under this setting, we run into the issue that M(∧) is not 1-Lipschitz continuous, see Example 16. This is despite the fact that M(∧) is continuous (and probably even Lipschitz continuous).
Allow higher Lipschitz constants for maps
One might wonder what happens, when we pick a higher Lipschitz constant c>1 for maps in our category (instead of c=1). If we have two functions with a Lipschitz constant c>1, then we can only guarantee a Lipschitz constant of c2 for the composition (e.g. with the function [0,1]∋x↦min(1,cx)∈[0,1]). If we repeat this argument, we cannot guarantee a finite Lipschitz constant, because cn→∞ as n→∞.
So, what happens if we have allow maps with Lipschitz constant 2? It turns out that then we run into issues with functoriality, as in Allowing Discontinuity. A concrete counterexample is described in Example 15.
Another metric on products
Instead of the choice for dX×Y in Standard setting for the alternative HUC case, we could consider using the metric
dX×Y((x1,y1),(x2,y2))=dX(x1,x2)+dY(y1,y2).A first problem with this choice is that this is incompatible with any finite bound on d, as is the case in Standard setting for the alternative HUC case. This issue can be avoided by considering extended metric spaces, which are metric spaces whose metric can take values in [0,∞] (in that case the distance between disjoint components in a sum is set to ∞).
A more difficult problem with this approach would be that the diagonal map diag:X→X×X is not 1-Lipschitz continuous for other natural choices of metric on products. We do need the diagonal map, see, e.g., Equality of functions.
Modifying the definition of alternative HUCs
We mention that as an extension, the definition for ⪯ in Definition 9 could be modified to include a constant c>0 and use θ1(c+f)≤θ2(c+f). For smaller c>0, this would make these alternative HUCs sharper. The version with c=1 has some drawbacks as a lot of probability mass gets assigned to points nearby: For X=[0,1], we have 12δ0⪯δ1, so an alternative HUC which contains δ1 also has to contain 12δ0. In the context of infra-Bayesianism, this means we can use these kind of HUCs only in combination together with a smaller class of loss functions (in this case, loss functions that are of type 1+f for a nonnegative Lipschitz continuous f).
One might wonder whether increasing c might help us decrease the Lipschitz constant for M(∧) or even choose a c>1 such that M(∧) is 1-Lipschitz. While the Lipschitz constant does get improved, calculating through Example 16 would show that the Lipschitz constant M(∧) cannot reach 1 even if we use these modified versions of □L.
Appendix
Lemma 10. If X is a compact Polish space then so is □X.
Proof. It is known that the space of nonempty compact subsets of X is compact in the Hausdorff metric. One can also show that □X is a closed subset of the space of nonempty compact subsets of X.
For the related concepts (but not HUCs) in IB theory, this result can also be found in Proposition 47 in "Less Basic Inframeasure Theory" ◻
Issues with Continuity
Using the topology induced by the Hausdorff metric on □X (see below Definition 2), we find that in the 'logical' semantics (inverse) limit type constructions (in the category theoretical sense) are generally not continuous (see Continuity Counterexamples for details). This includes the semantics for ∧α, ∀αβ, and the pullback f∗.
If X itself is finite, then □X is nice enough that we don't run into these issues. This observation is sufficient for example in the construction of the infra-POMDPs in Examples when the state spaces are finite. However, we quickly run into issues if we're not careful with iterating the □ construction, since □X is not finite even if X is.
Below is some discussion on a possible approach to trying to fix these issues (we haven't found an entirely satisfactory fix though).
Continuity Counterexamples
Lemma 11. The function M(∧):□M(α)×□M(α)→□M(α), given by M(∧)(μ,ν)=μ∩ν is not always continuous.
Proof. Consider the case where M(α) is the space [0,1] with standard topology. This is a Polish space.
We consider sequences with xi=1/2−1/i and yi=1/2+1/i. We define x0=y0=1/2. Then ⊤{xi} and ⊤{yi} are in □M(α). (For a point z∈[0,1], I understand this notation to be ⊤{z}:={aδz∣a∈[0,1]}, where δz∈Δ([0,1]) is the measure that puts all weight on the single point z).
One can calculate that M(∧)(⊤{xi},⊤{yi})=⊥.
Because of the Hausdorff metric in □M(α) we have ⊤{xi}→⊤{x0} and ⊤{yi}→⊤{x0}.
To put it together, we have
limM(∧)(⊤{xi},⊤{yi})=⊥≠⊤{x0}=M(∧)(⊤{x0},⊤{y0})=M(∧)(lim⊤{xi},lim⊤{yi}).Thus, M(∧) is not continuous. ◻
Lemma 12. The function M(∀αβ) is not always continuous.
Proof. Consider the case that M(α)=M(β)=[0,1]. We use the notation f:=M(∀αβ). We choose the sequence μn=⊤[(0,0),(1,1/n)], where [,] denotes the line between points. Then one can show that f(μn)=⊥.
However, we also have μn→μ0:=⊤[(0,0),(1,0)]. One can show that f(μ0)=⊤{0}∈□M(β) holds. Thus, we have
f(μn)→⊥≠f(μ0),i.e. f is not continuous. ◻
Lemma 13. There is a continuous function f:X→Y such that f∗:□Y→□X is not continuous.
Proof. Pick f:[−1,1]→[0,1], f(x)=max(0,x). Then f∗(⊤{1/n})=⊤{1/n}, but for the limit we have f∗(⊤{0})=⊤[−1,0], and not ⊤{1/n}→⊤[−1,0]. ◻
Semi-continuous Pullback is not Compositional
Let f,g:[0,1]→[0,1] be given by
g(x)={0if x<1/21if x≥1/2,And
f(x)=⎧⎨⎩0if x<1/31/2if 1/3≤x<2/31if x≥2/3.Both f and g are monotone increasing, upper semi-continuous. Let C=[0,1/2]⊂[0,1]. Then B=g−1(C)=[0,1/2), the closure ¯¯¯¯B=[0,1/2]. Hence f−1(B)=[0,1/3), while f−1(¯¯¯¯B)=[0,2/3), so
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯f−1(¯¯¯¯B)≠¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯f−1(B).We can construct a corresponding counterexample where (g∘f)∗≠g∗∘f∗ by considering Θ∈□[0,1], defined as follows. Let Θ=⊤C be the sharp HUC on C. Then
g−1∗(Θ)={ρ∈Δc[0,1]:supp(ρ)⊂[0,1/2)}so
g∗(Θ)=¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯g−1∗(Θ)={ρ∈Δc[0,1]:supp(ρ)⊂[0,1/2]}.Then
(g∘f)∗(Θ)={ρ∈Δc[0,1]:supp(ρ)⊂[0,1/3]},but
(f∗∘g∗)(Θ)=f∗(g∗(Θ))={ρ∈Δc[0,1]:supp(ρ)⊂[0,2/3]},which is different.
Stricter Homogeneity
We collect some technical results for A Different Notion of HUCs
With Definition 9 we can prove continuity for M(∧), which does not hold for ordinary HUCs.
Lemma 14. The function M(∧):□LM(α)×□LM(α)→□LM(α), given by M(∧)(μ,ν)=μ∩ν is continuous.
Proof. We set X:=M(α). As a metric on Δc(X) we will use the Kantorovich-Rubenstein metric
d(p,q):=sup{|p(g)−q(g)|:g:X→[−1,1]1-Lipschitz}.(1)Let μk, νk be sequences in □L(X) with μk→μ0, νk→ν0. We want to show that μk∩νk→μ0∩ν0. We have
dH(μk∩νk,μ0∩ν0)≤supp∈μk∩νkd(p,μ0∩ν0)+supp∈μ0∩ν0d(p,μk∩νk).(2)For the first term, let pk∈μk∩νk be such that d(pk,μ0∩ν0) is maximized. By compactness, there exists a convergent subsequence of pk. Wlog let pk→p0. We have pk∈μk for all k, and taking limits implies p0∈μ0. The same argument also implies p0∈ν0. Thus, we have
supp∈μk∩νkd(p,μ0∩ν0)=d(pk,μ0∩ν0)→d(p0,μ0∩ν0)=0.For the second term, let p0∈μ0∩ν0 be given. Due to μk→μ0, there exists a sequence pk with pk∈μk and pk→p0. Similarly, there exists a sequence qk with qk∈νk and qk→q0. The idea is now to construct a measure (i.e. a point in Δc(X)) rk that is close to pk and qk but in μk∩νk.
Wlog we assume p0(1)>0 and pk(1)>0, qk(1)>0 for all k. We define δk>0 via
δk:=d(pk,qk)(2+diamX)(pk+qk)(1),where diamX:=supx,y∈Xd(x,y) is the diameter of X. Clearly, we have δk→0. We also define
rk:=1−δk2(qk+pk).We want to show rk⪯qk and rk⪯pk. By symmetry we only need to show rk⪯pk. For that purpose, let f:X→R+ be an arbitrary 1-Lipschitz continuous function. We define g:=1+f. We have to show rk(g)≤pk(g).
We can decompose g=gmin+g0, where gmin:=minx∈Xg(x) is a constant, and g0 is a nonnegative function that satisfies g0(x0)=0 for some x0∈X. Since g0 is 1-Lipschitz continuous, this implies that g0(x)≤diamX for all x∈X. Thus, (1+diamX)−1g0 is a function that is 1-Lipschitz continuous and bounded by 1, i.e. admissible in the definition Eqn. (1). We can use this to obtain
(qk−pk)(g0)=(1+diamX)(qk−pk)((1+diamX)−1g0)≤(1+diamX)d(qk,pk)≤d(pk,qk)(1+diamX)(pk+qk)(1)(pk+qk)(g).On the other hand, we have
(qk−pk)(gmin)=gmin(qk−pk)(1)≤gmind(pk,qk)≤d(