Introduction To The Infra-Bayesianism Sequence

by Diffractor, 27th Aug 2020


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

TLDR: Infra-Bayesianism is a new approach to epistemology / decision theory / reinforcement learning theory, which builds on "imprecise probability" to solve the problem of prior misspecification / grain-of-truth / nonrealizability which plagues Bayesianism and Bayesian reinforcement learning. Infra-Bayesianism also naturally leads to an implementation of UDT, and (more speculatively at this stage) has applications to multi-agent theory, embedded agency and reflection. This post is the first in a sequence which lays down the foundation of the approach.


Diffractor and Vanessa proudly present: The thing we've been working on for the past five months. I initially decided that Vanessa's scattered posts about incomplete models were interesting, and could benefit from being written up in a short centralized post. But as we dug into the mathematical details, it turned out it didn't really work, and then Vanessa ran across the true mathematical thing (which had previous ideas as special cases) and scope creep happened. 

This now looks like a new, large, and unusually tractable vein of research. Accordingly, this sequence supersedes all previous posts about incomplete models, and by now we've managed to get quite a few interesting results, and have ideas for several new research directions.

Diffractor typed everything up and fleshed out the proof sketches, Vanessa originated almost all of the ideas and theorems. It was a true joint effort, this sequence would not exist if either of us were absent. Alex Mennen provided feedback on drafts to make it much more comprehensible than it would otherwise be, and Turntrout and John Maxwell also helped a bit in editing.

Be aware this sequence of posts has the math textbook issue where it requires loading a tower of novel concepts that build on each other into your head, and cannot be read in a single sitting. We will be doing a group readthrough on MIRIxDiscord where we can answer questions and hopefully get collaborators, PM me to get a link.



Learning theory traditionally deals with two kinds of setting: "realizable" and "agnostic" or "non-realizable". In realizable settings, we assume that the environment can be described perfectly by a hypothesis inside our hypothesis space (AIXI is an example of this). We then expect the algorithm to converge to acting as if it already knew the correct hypothesis. In non-realizable settings, we make no such assumption. We then expect the algorithm to converge to the best approximation of the true environment within the available hypothesis space.

As long as the computational complexity of the environment is greater than the computational complexity of the learning algorithm, the algorithm cannot use an easy-to-compute hypothesis that would describe the environment perfectly, so we are in the nonrealizable setting. When we discuss AGI, this is necessarily the case, since the environment is the entire world: a world that, in particular, contains the agent itself and can support other agents that are even more complex, much like how halting oracles (which you need to run Solomonoff Induction) are nowhere in the hypotheses which Solomonoff considers. Therefore, the realizable setting is usually only a toy model. So, instead of seeking guarantees of good behavior assuming the environment is easy to compute, we'd like to get good behavior simply assuming that the environment has some easy-to-compute properties that can be exploited.

For offline and online learning there are classical results in the non-realizable setting; in particular, VC theory naturally extends to the non-realizable setting. However, for reinforcement learning there are few analogous results. Even for passive Bayesian inference, the best non-realizable result found in our literature search is Shalizi's, which relies on ergodicity assumptions about the true environment. Since reinforcement learning is the relevant setting for AGI and alignment theory, this poses a problem.

Logical inductors operate in the nonrealizable setting, and the general reformulation of them in Forecasting Using Incomplete Models is of interest for broader lessons applicable to acting in an unknown environment. In said paper, reality can be drawn from any point in the space of probability distributions over infinite sequences of observations, $\Delta(\mathcal{O}^\omega)$. Almost all of the points in this space aren't computable, and because of that, we shouldn't expect convergence to the true environment, as occurs in the realizable setting where the true environment lies in your hypothesis space.

However, even if we can't hope to learn the true environment, we can at least hope to learn some property of the true environment, like "every other bit is a 0", and have our predictions reflect that if it holds. A hypothesis in this setting is a closed convex subset of $\Delta(\mathcal{O}^\omega)$ which can be thought of as "I don't know what the true environment is, but it lies within this set". The result obtained in the above-linked paper was, if we fix a countable family of properties that reality may satisfy, and define the inductor based on them, then for all of those which reality fulfills, the predictions of the inductor converge to that closed convex set and so fulfill the property in the limit.


What About Environments?

However, this just involves sequence prediction. Ideally, we'd want some space that corresponds to environments that you can interact with, instead of an environment that just outputs bits. And then, given a suitable set $B$ in it... Well, we don't have a fixed environment to play against. The environment could be anything, even a worst-case one within $B$. We have Knightian uncertainty over our set of environments, it is not a probability distribution over environments. So, we might as well go with the maximin policy.

$$\operatorname{argmax}_{\pi} \inf_{e \in B} \mathbb{E}_{\pi \cdot e}[U]$$

Where $\pi \cdot e$ is the distribution over histories produced by policy $\pi$ interacting with environment $e$. $U$ is just some utility function.
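To make the maximin rule concrete, here is a minimal Python sketch for a finite set of environments. Everything in it is an assumption made for illustration: the problem is one-step (so a policy is just an action), and the names `POLICIES`, `ENV_SET`, and `UTILITY`, along with the specific probabilities, are hypothetical rather than taken from the sequence.

```python
# Hypothetical one-step toy problem: a "policy" is a single action, and an
# "environment" maps each action to a distribution over outcomes.
POLICIES = ["a0", "a1"]
UTILITY = {"o0": 0.0, "o1": 1.0}

# The set of environments we have Knightian uncertainty over (made-up numbers).
ENV_SET = [
    {"a0": {"o0": 0.5, "o1": 0.5}, "a1": {"o0": 0.1, "o1": 0.9}},
    {"a0": {"o0": 0.4, "o1": 0.6}, "a1": {"o0": 0.8, "o1": 0.2}},
]

def expected_utility(policy, env):
    """Expected utility of the outcome distribution the policy induces in env."""
    return sum(prob * UTILITY[outcome] for outcome, prob in env[policy].items())

def worst_case(policy):
    """Murphy's choice: the lowest expected utility over the environment set."""
    return min(expected_utility(policy, env) for env in ENV_SET)

def maximin_policy():
    """Pick the policy with the best worst-case expected utility."""
    return max(POLICIES, key=worst_case)

best = maximin_policy()
print(best, worst_case(best))
```

In this toy case "a1" does very well in the first environment, but "a0" has the better guaranteed value, so maximin selects "a0". The guarantee is a lower bound: against a non-adversarial environment the agent can only do better.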

When we refer to "Murphy", this is referring to whatever force is picking the worst-case environment to be interacting with. Of course, if you aren't playing against an adversary, you'll do better than the worst-case utility that you're guaranteed. Any provable guarantees come in the form of establishing lower bounds on expected utility if a policy is selected.

The problem of generating a suitable space of environments was solved in Reinforcement Learning With Imperceptible Rewards. If two environments are indistinguishable by any policy they are identified, a mixture of environments corresponds to picking one of the component environments with the appropriate probability at the start of time, and there was a notion of update.

However, this isn't good enough. We could find no good update rule for a set of environments, we had to go further.


Which desiderata should be fulfilled to make maximin policy selection over a set of environments (actually, we'll have to generalize further than this) work successfully? We'll have three starting desiderata.

Desideratum 1: There should be a sensible notion of what it means to update a set of environments or a set of distributions, which should also give us dynamic consistency. Let's say we've got two policies, $\pi_1$ and $\pi_2$, which are identical except they differ after history $h$. If, after updating on history $h$, the continuation of $\pi_1$ looks better than the continuation of $\pi_2$, then it had better be the case that, viewed from the start, $\pi_1$ outperforms $\pi_2$.


Desideratum 2: Our notion of a hypothesis (set of environments) in this setting should collapse "secretly equivalent" sets, such that any two distinct hypotheses behave differently in some relevant aspect. This will require formalizing what it means for two sets to be "meaningfully different", finding a canonical form for an equivalence class of sets that "behave the same in all relevant ways", and then proving some theorem that says we got everything.


Desideratum 3: We should be able to formalize the "Nirvana trick" (elaborated below) and cram any UDT problem where the environment cares about what you would do, into this setting. The problem is that we're just dealing with sets of environments which only depend on what you do, not what your policy is, which hampers our ability to capture policy-dependent problems in this framework. However, since Murphy looks at your policy and then picks which environment you're in, there is an acausal channel available for the choice of policy to influence which environment you end up in.

The "Nirvana trick" is as follows. Consider a policy-dependent environment, a function $\Pi \times (\mathcal{A} \times \mathcal{O})^{<\omega} \times \mathcal{A} \to \Delta\mathcal{O}$ (i.e., the probability distribution over the next observation depends on the history so far, the action you selected, and your policy). We can encode a policy-dependent environment as a set of policy-independent environments that don't care about your policy, by hard-coding every possible deterministic policy into the policy slot, making a family of functions of type $(\mathcal{A} \times \mathcal{O})^{<\omega} \times \mathcal{A} \to \Delta\mathcal{O}$, which is the type of policy-independent environments. It's similar to taking a function $f(x, y)$, and plugging in all possible $x$ to get a family of functions that only depend on $y$.

Also, we will impose a rule that, if your action ever violates what the hard-coded policy predicts you do, you attain Nirvana (a state of high or infinite reward). Then, Murphy, when given this set of environments, will go "it'd be bad if they got high or infinite reward, thus I need to pick an environment where the hard-coded policy matches their actual policy". When playing against Murphy, you'll act like you're selecting a policy for an environment that does pay attention to what policy you pick. As-stated, this doesn't quite work, but it can be repaired.

There are two options. One is making Nirvana count as infinite reward. We will advance this to a point where we can capture any UDT/policy-selection problem, at the cost of some mathematical ugliness. The other option is making Nirvana count as 1 reward forever afterward, which makes things more elegant, and it is much more closely tied to learning theory, but that comes at the cost of only capturing a smaller (but still fairly broad) class of decision-theory problems. We will defer developing that avenue further until a later post.
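As a rough illustration of the Nirvana trick with infinite reward, the sketch below hard-codes each deterministic policy into a policy-dependent reward function and lets Murphy choose among the resulting policy-independent environments. It is a one-shot toy, so a policy is just an action. The Newcomb-style payoffs and the names `policy_dependent_reward`, `hard_code`, and `murphy_value` are assumptions introduced here, not constructions from the sequence, and the sketch ignores the repairs the full formalism needs.

```python
import math

# Hypothetical one-shot Newcomb-style payoffs, used only for illustration.
ACTIONS = ["one-box", "two-box"]
NIRVANA = math.inf  # the "infinite reward" variant of Nirvana

def policy_dependent_reward(policy, action):
    """Reward depends on the action taken AND on the policy the predictor read."""
    opaque_box = 1_000_000 if policy == "one-box" else 0
    transparent_box = 1_000 if action == "two-box" else 0
    return opaque_box + transparent_box

def hard_code(policy):
    """Fix the policy slot, yielding a policy-independent environment."""
    def env(action):
        if action != policy:
            return NIRVANA  # deviating from the hard-coded policy gives Nirvana
        return policy_dependent_reward(policy, action)
    return env

# One policy-independent environment per deterministic policy.
ENV_SET = [hard_code(p) for p in ACTIONS]

def murphy_value(action):
    """Worst-case reward: Murphy picks the environment after seeing the action."""
    return min(env(action) for env in ENV_SET)

best_action = max(ACTIONS, key=murphy_value)
print(best_action, murphy_value(best_action))
```

Murphy never picks an environment whose hard-coded policy mismatches the action, since that would hand over infinite reward. The worst case for each action is therefore the reward from the environment that "predicted" it correctly, and maximin ends up one-boxing.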


A Digression on Deterministic Policies

We'll be using deterministic policies throughout. The reason for using deterministic policies instead of probabilistic policies (despite the latter being a larger class), is that the Nirvana trick (with infinite reward) doesn't work with probabilistic policies. Also, probabilistic policies don't interact well with embeddedness, because they implicitly assume that you have a source of random bits that the rest of the environment can never interact with (except via your induced action) or observe.

Deterministic policies can emulate probabilistic policies by viewing probabilistic choice as deterministically choosing a finite bitstring to enter into a random number generator (RNG) in the environment, and then you get some bits back and act accordingly.

However, we aren't assuming that the RNG is a good one. It could be insecure or biased or nonexistent. Thus, we can model cases like Death In Damascus or Absent-Minded Driver where you left your trusty coin at home and don't trust yourself to randomize effectively. Or a nanobot that's too small to have a high bitrate RNG in it, so it uses a fast insecure PRNG (pseudorandom number generator). Or game theory against a mindreader that can't see your RNG, just the probability distribution over actions you're using the RNG to select from, like an ideal CDT opponent. It can also handle cases where plugging certain numbers into your RNG chip causes lots of heat to be released, or maybe the RNG is biased towards outputting 0's in strong magnetic fields. Assuming you have a source of true randomness that the environment can't read isn't general enough!


Motivating Sa-Measures

Sets of probability distributions or environments aren't enough, we need to add in some extra data. This can be best motivated by thinking about how updates should work in order to get dynamic consistency.

Throughout, we'll be using a two-step view of updating, where first, we chop down the measures accordingly (the "raw update"), and then we renormalize back up to 1.

So, let's say we have a set of two probability distributions $\mu_1$ and $\mu_2$. We have Knightian uncertainty within this set, we genuinely don't know which one will be selected, it may even be adversarial. $\mu_1$ says observation $o$ has 0.5 probability, $\mu_2$ says observation $o$ has 0.01 probability. And then you see observation $o$! The wrong way to update would be to go "well, both probability distributions are consistent with observed data, I guess I'll update them individually and resume being completely uncertain about which one I'm in", you don't want to ignore that one of them assigns 50x higher probability to the thing you just saw.

However, neglecting renormalization, we can do the "raw update" to each of them individually, and get $m_1$ and $m_2$ (finite measures, not probability distributions), where $m_1$ has 0.5 measure and $m_2$ has 0.01 measure.

Ok, so instead of a set of probability distributions, since that's insufficient for updates, let's consider a set of measures $m$, instead. Each individual measure in that set can be viewed as $\lambda\mu$, where $\mu$ is a probability distribution, and $\lambda \geq 0$ is a scaling term. Note that $\lambda$ is not uniform across your set, it varies depending on which point you're looking at.
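The raw update and the scale-times-distribution view can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the outcome space is collapsed to the observed event "o" and its complement "not_o", and the function names `raw_update` and `decompose` are hypothetical.

```python
# Hypothetical two-outcome example: each distribution assigns a probability to
# the observed event "o" and to its complement "not_o".
mu_1 = {"o": 0.5, "not_o": 0.5}
mu_2 = {"o": 0.01, "not_o": 0.99}

def raw_update(dist, observed):
    """Chop the distribution down to the observed event, without renormalizing."""
    return {k: (p if k == observed else 0.0) for k, p in dist.items()}

def decompose(measure):
    """Split a nonzero finite measure into (scale, probability distribution)."""
    scale = sum(measure.values())
    return scale, {k: v / scale for k, v in measure.items()}

m_1 = raw_update(mu_1, "o")
m_2 = raw_update(mu_2, "o")

scale_1, nu_1 = decompose(m_1)
scale_2, nu_2 = decompose(m_2)

# The renormalized distributions agree, but the scaling terms remember that
# the first distribution found the observation far more likely.
print(scale_1, scale_2, nu_1 == nu_2)
```

Updating each distribution separately and renormalizing would throw away `scale_1` and `scale_2`, which is exactly the information about relative likelihood that the raw measures retain.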


However, this still isn't enough. Let's look at a toy example for how to design updating to get dynamic consistency. We'll see we need to add one more piece of data. Consider two environments where a fair coin is flipped, you see it and then say "heads" or "tails", and then you get some reward. The COPY Environment gives you 0 reward if you say something different than what the coin shows, and 1 reward if you match it. The REVERSE HEADS Environment always gives you 0.5 reward if the coin comes up tails, but if it comes up heads, saying "tails" gets you 1 reward and "heads" gets you 0 reward. We have Knightian uncertainty between the two environments.

For finding the optimal policy, we can observe that saying "tails" when the coin is tails helps out in COPY and doesn't harm you in REVERSE HEADS, so that's a component of an optimal policy.

Saying "tails" no matter what the coin shows means you get $0.5 \cdot 0 + 0.5 \cdot 1 = 0.5$ utility on COPY, and $0.5 \cdot 1 + 0.5 \cdot 0.5 = 0.75$ utility on REVERSE HEADS. Saying "tails" when the coin is tails and "heads" when the coin is heads means you get $0.5 \cdot 1 + 0.5 \cdot 1 = 1$ utility on COPY and $0.5 \cdot 0 + 0.5 \cdot 0.5 = 0.25$ utility on REVERSE HEADS. Saying "tails" no matter what has a better worst-case value, so it's the optimal maximin policy.
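The comparison above can be checked by brute force over all four deterministic policies. The following Python sketch encodes the two environments as described in the text. The representation of a policy as a pair (what to say on heads, what to say on tails) and the helper names are assumptions made for this illustration.

```python
from itertools import product

COIN_PROB = {"heads": 0.5, "tails": 0.5}
SAYS = ["heads", "tails"]

def copy_reward(coin, say):
    """COPY: 1 reward for matching the coin, 0 otherwise."""
    return 1.0 if say == coin else 0.0

def reverse_heads_reward(coin, say):
    """REVERSE HEADS: 0.5 on tails regardless; on heads, "tails" pays 1."""
    if coin == "tails":
        return 0.5
    return 1.0 if say == "tails" else 0.0

ENVS = {"COPY": copy_reward, "REVERSE HEADS": reverse_heads_reward}

def expected_reward(policy, reward):
    """policy is a pair (say_on_heads, say_on_tails)."""
    say = {"heads": policy[0], "tails": policy[1]}
    return sum(p * reward(coin, say[coin]) for coin, p in COIN_PROB.items())

def worst_case(policy):
    return min(expected_reward(policy, r) for r in ENVS.values())

policies = list(product(SAYS, repeat=2))
for pi in policies:
    print(pi, {name: expected_reward(pi, r) for name, r in ENVS.items()})

best_policy = max(policies, key=worst_case)
print("maximin policy:", best_policy, "worst case:", worst_case(best_policy))
```

Enumerating the policies confirms that always saying "tails" has worst-case value 0.5, while matching the coin has worst-case value 0.25.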

Now, if we see the coin come up heads, how should we update? The wrong way to do it would be to go "well, both environments are equally likely to give this observation, so I've got Knightian uncertainty re: whether saying heads or tails gives me 1 or 0 utility, both options look equally good". This is because, according to past-you, regardless of what you did upon seeing the coin come up "tails", the maximin expected values of saying "heads" when the coin comes up heads, and saying "tails" when the coin comes up heads, are unequal. Past-you is yelling at you from the sidelines not to just shrug and view the two options as equally good.

Well, let's say you already know that you would say "tails" when the coin comes up tails and are trying to figure out what to do now that the coin came up heads. The proper way to reason through it is going "I have Knightian uncertainty between COPY which has 0.5 expected utility assured off-history since I say "tails" on tails, and REVERSE HEADS, which has 0.25 expected utility assured off-history. Saying "heads" now that I see the coin on heads would get me $0.5 \cdot 1 + 0.5 = 1$ expected utility in COPY and $0.5 \cdot 0 + 0.25 = 0.25$ utility in REVERSE HEADS, saying "tails" would get me $0.5 \cdot 0 + 0.5 = 0.5$ utility in COPY and $0.5 \cdot 1 + 0.25 = 0.75$ utility in REVERSE HEADS, I get higher worst-case value by saying "tails"." And then you agree with your past self re: how good the various decisions are.

Huh, the proper way of doing this update to get dynamic consistency requires keeping track of the fragment of expected utility we get off-history.

Similarly, if you messed up and precommitted to saying "heads" when the coin comes up tails (a bad move), we can run through a similar analysis and show that keeping track of the expected utility off-history leads you to take the action that past-you would advise, after seeing the coin come up heads.
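Both cases (having committed to "tails" on tails, or to "heads" on tails) can be checked with a short Python sketch that contrasts the naive renormalized update with one that carries the off-history expected utility along. The function names `value_after_heads` and `naive_value_after_heads` are hypothetical, and this is only an illustration of the bookkeeping, not the sequence's formal update rule.

```python
COIN_PROB = {"heads": 0.5, "tails": 0.5}

def copy_reward(coin, say):
    return 1.0 if say == coin else 0.0

def reverse_heads_reward(coin, say):
    if coin == "tails":
        return 0.5
    return 1.0 if say == "tails" else 0.0

ENVS = [copy_reward, reverse_heads_reward]

def naive_value_after_heads(say_now):
    """Renormalized update that forgets everything off-history."""
    return min(reward("heads", say_now) for reward in ENVS)

def value_after_heads(say_now, say_on_tails):
    """Worst case of (on-history part) + (off-history expected utility)."""
    values = []
    for reward in ENVS:
        on_history = COIN_PROB["heads"] * reward("heads", say_now)
        # Utility already assured in the branch where the coin came up tails.
        off_history = COIN_PROB["tails"] * reward("tails", say_on_tails)
        values.append(on_history + off_history)
    return min(values)

def best_after_heads(say_on_tails):
    return max(["heads", "tails"], key=lambda s: value_after_heads(s, say_on_tails))

# The naive update sees no difference between the two options.
print(naive_value_after_heads("heads"), naive_value_after_heads("tails"))
# Tracking the off-history term breaks the tie in the way past-you would advise.
print(best_after_heads("tails"), best_after_heads("heads"))
```

With "tails" planned for the tails branch, the update recommends "tails" on heads (0.5 versus 0.25 worst case). With the bad precommitment to "heads" on tails, it recommends "heads" on heads (0.25 versus 0), which matches the worst-case ranking of the corresponding full policies viewed from the start.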


So, with the need to keep track of that fragment of expected utility off-history to get dynamic consistency, it isn't enough to deal with finite measures $m$, that still isn't keeping track of the information we need. What we need is $(m, b)$, where $m$ is a finite measure, and $b$ is a number $\geq 0$. That $b$ term keeps track of the expected value off-history so we make the right decision after updating. (We're glossing over the distinction between probability distributions and environments here, but it's inessential)

We will call such a $(m, b)$ pair an "affine measure", or "a-measure" for short. The reason for this terminology is that a measure can be thought of as a linear function from the space of continuous functions to $\mathbb{R}$. But then there's this $+b$ term stuck on that acts as utility, and a linear function plus a constant is an affine function. So, that's an a-measure. A pair of a finite measure and a $b$ term where $b \geq 0$.


But wait, we can go even further! Let's say our utility function of interest is bounded. Then we can do a scale-and-shift until it's in $[0, 1]$.

Since our utility function is bounded in $[0, 1]$... what would happen if you let in measures with negative parts, but only if they're paired with a sufficiently large $b$ term? Such a thing is called an sa-measure, for signed affine measure. It's a pair of a finite signed measure and a $b$ term that's as-large-or-larger than the amount of negative measure present. No matter your utility function, even if it assigns 0 reward to outcomes with positive measure and 1 reward to outcomes with negative measure, you're still assured nonnegative expected value because of that $+b$ term. It turns out we actually do need to expand in this direction to keep track of equivalence between sets of a-measures, get a good tie-in with convex analysis because signed measures are dual to continuous functions, and have elegant formulations of concepts like minimal points and the upper completion.

Negative measures may be a bit odd, but as we'll eventually see, we can ignore them and they only show up in intermediate steps, not final results, much like negative probabilities in quantum mechanics. And if negative measures ever become relevant for an application, it's effortless to include them.
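The nonnegativity claim for sa-measures is easy to check on a finite outcome space. The Python sketch below is an illustration under assumptions: a signed measure is represented as a dictionary from outcomes to (possibly negative) masses, and the names `negative_mass`, `is_sa_measure`, and `worst_case_value` are hypothetical. It relies on the fact that the measure's pairing with a utility function is linear in that function, so the minimum over utility functions valued in [0, 1] is attained at a 0/1-valued one.

```python
from itertools import product

def negative_mass(m):
    """Total amount of negative measure, reported as a nonnegative number."""
    return sum(-v for v in m.values() if v < 0)

def is_sa_measure(m, b):
    """The b term must be at least as large as the negative mass present."""
    return b >= negative_mass(m)

def worst_case_value(m, b):
    """Minimum of m(f) + b over utility functions f with values in [0, 1].

    Linearity in f means it suffices to check the 0/1-valued functions.
    """
    points = list(m)
    return min(
        sum(m[x] * fx for x, fx in zip(points, f)) + b
        for f in product([0.0, 1.0], repeat=len(points))
    )

signed = {"x": 0.75, "y": -0.25}

print(is_sa_measure(signed, 0.25), worst_case_value(signed, 0.25))
print(is_sa_measure(signed, 0.125), worst_case_value(signed, 0.125))
```

The worst utility function puts 0 on "x" and 1 on "y", which is exactly the adversarial case described above. With a $b$ term of 0.25 the value bottoms out at 0, while a smaller $b$ term lets it go negative, so that pair does not count as an sa-measure.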


Belief Function Motivation

Also, we'll have to drop the framework we set up at the beginning where we're considering sets of environments, because working with sets of environments has redundant information. As an example, consider two environments where you pick one of two actions, and get one of two outcomes. In environment $e_0$, regardless of action, you get outcome 0. In environment $e_1$, regardless of action, you get outcome 1. Then, we should be able to freely add an environment $e_2$, where action 0 implies outcome 0, and where action 1 implies outcome 1. Why?

Well, if your policy is to take action 0, $e_2$ and $e_0$ behave identically. And if your policy is to take action 1, $e_2$ and $e_1$ behave identically. So, adding an environment like this doesn't affect anything, because it's a "chameleon environment" that will perfectly mimic some preexisting environment regardless of which policy you select. However, if you consider the function mapping an action to the set of possible probability distributions over outcomes, adding $e_2$ didn't change that at all. Put another way, if it's impossible to distinguish in any way whether an environment was added to a set of environments because no matter what you do it mimics a preexisting environment, we might as well add it, and seek some alternate formulation instead of "set of environments" that doesn't have the unobservable degrees of freedom in it.

To eliminate this redundancy, the true thing we should be looking at isn't a set of environments, but the "belief function" from policies to sets of probability distributions over histories. This is the function produced by having a policy interact with your set of environments and plotting the probability distributions you could get. Given certain conditions on a belief function, it is possible to recover a set of environments from it, but belief functions are more fundamental. We'll provide tools for taking a wide range of belief functions and turning them into sets of environments, if it is desired.
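The chameleon-environment example can be replayed in a small Python sketch. It is a simplification under stated assumptions: the problem is one-step (a policy is a single action), outcome distributions are written as pairs (probability of outcome 0, probability of outcome 1), the convex closure that the real formalism takes is ignored, and the names `belief_function`, `E0`, `E1`, and `E2` are hypothetical.

```python
ACTIONS = [0, 1]

# Each environment maps an action to a distribution over outcomes, written as
# the pair (P(outcome 0), P(outcome 1)).
E0 = {0: (1.0, 0.0), 1: (1.0, 0.0)}  # always outcome 0
E1 = {0: (0.0, 1.0), 1: (0.0, 1.0)}  # always outcome 1
E2 = {0: (1.0, 0.0), 1: (0.0, 1.0)}  # "chameleon": outcome copies the action

def belief_function(env_set):
    """Map each policy to the set of outcome distributions it could induce."""
    return {a: frozenset(env[a] for env in env_set) for a in ACTIONS}

without_chameleon = belief_function([E0, E1])
with_chameleon = belief_function([E0, E1, E2])

# The two sets of environments differ, yet the belief functions coincide.
print(without_chameleon == with_chameleon)
```

Two different sets of environments collapse to the same belief function here, which is the redundancy the text describes: the extra environment is an unobservable degree of freedom at the level of environment sets, but not at the level of the map from policies to sets of distributions.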

Well, actually, from our previous discussion, sets of probability distributions are insufficient, we need a function from policies to sets of sa-measures. But that's material for later.



So, our fundamental mathematical object that we're studying to get a good link to decision theory is not sets of probability distributions, but sets of sa-measures. And instead of sets of environments, we have functions from policies to sets of sa-measures over histories. This is because probability distributions alone aren't flexible enough for the sort of updating we need to get dynamic consistency, and in addition to this issue, sets of environments have the problem where adding a new environment to your set can be undetectable in any way.


In the next post, we build up the basic mathematical details of the setting, until we get to a duality theorem that reveals a tight parallel between sets of sa-measures fulfilling certain special properties, and probability distributions, allowing us to take the first steps towards building up a version of probability theory fit for dealing with nonrealizability. There are analogues of expectation values, updates, renormalizing back to 1, priors, Bayes' Theorem, Markov kernels, and more. We use the "infra" prefix to refer to this setting. An infradistribution is the analogue of a probability distribution. An infrakernel is the analogue of a Markov kernel. And so on.


The post after that consists of extensive work on belief functions and the Nirvana trick to get the decision-theory tie-ins, such as UDT behavior while still having an update rule, and the update rule is dynamically consistent. Other components of that section include being able to specify your entire belief function with only part of its data, and developing the concept of Causal, Pseudocausal, and Acausal hypotheses. We show that you can encode almost any belief function as an Acausal hypothesis, and you can translate Pseudocausal and Acausal hypotheses to Causal ones by adding Nirvana appropriately (kinda). And Causal hypotheses correspond to actual sets of environments (kinda). Further, we can mix belief functions to make a prior, and there's an analogue of Bayes for updating a mix of belief functions. We cap it off by showing that the starting concepts of learning theory work appropriately, and show our setting's version of the Complete Class Theorem.


Later posts (not written yet) will be about the "1 reward forever" variant of Nirvana and InfraPOMDP's, developing inframeasure theory more, applications to various areas of alignment research, the internal logic which infradistributions are models of, unrealizable bandits, game theory, attempting to apply this to other areas of alignment research, and... look, we've got a lot of areas to work on, alright? 

If you've got the relevant math skills, as previously mentioned, you should PM me to get a link to the MIRIxDiscord server and participate in the group readthrough, and you're more likely than usual to be able to contribute to advancing research further, there's a lot of shovel-ready work available.



Comments

much like how halting oracles (which you need to run Solomonoff Induction) are nowhere in the hypotheses which Solomonoff considers

The Solomonoff prior is a mixture over semi-measures[*] that are lower semi-computable: that is, you can compute increasingly good approximations of the semi-measure from below that converge eventually to the actual semi-measure, but at finite time you don't know how close you are to the right answer. The Solomonoff prior itself is also a lower semi-computable semi-measure. Therefore, there is a real sense in which its hypothesis class includes things as difficult to compute as it is. That being said, my guess is that halting oracles would indeed let you compute more than just the lower semi-computable functions, and it's also true that being able to run Solomonoff induction would also let you build a halting oracle.

[*] semi-measures are probability distributions that have 'missing density', where the probability of a 0 and then a 0, plus the probability of a 0 and then a 1, is less than or equal to the probability of a 0, even though there aren't any other options in the space for what happens next.

The problem with lower semicomputable functions is that it's a class not closed under natural operations. For example, taking minus such a function we get an upper semicomputable function that can fail to be lower semicomputable. So, given a Solomonoff induction oracle we can very easily (i.e. using a very efficient oracle machine) construct measures that are not absolutely continuous w.r.t. the Solomonoff prior.

In fact, for any prior this can be achieved by constructing an "anti-inductive" sequence: a sequence that contains $1$ at a given place if and only if the prior, conditional on the sequence before this place, assigns probability less than $1/2$ to $1$. Such a sequence cannot be accurately predicted by the prior (and, by the merging-of-opinions theorem, a delta-function at this sequence is not absolutely continuous w.r.t. the prior).

Therefore, there is a real sense in which its hypothesis class includes things as difficult to compute as it is. That being said, my guess is that halting oracles would indeed let you compute more than just the lower semi-computable functions, and it's also true that being able to run Solomonoff induction would also let you build a halting oracle.

I guess the way to reconcile this is to think that there's a difference between what you can lower semi-compute, and what you could compute if you could compute lower semi-computable things? But it's been a while since I had a good understanding of this type of thing.

Thanks, I'll make sure to read it!

I started thinking in this direction back in 2016, and more in 2018, but only this year Alex and I nailed the precise definitions that make everything come together, and derived some key foundational theorems. Of course, much work yet remains.

After reading some of your paper, I think that they are actually very different. IIUC, you are talking about pessimism as a method to avoid traps, but you assume realizability. On the other hand, infra-Bayesianism is (to first approximation) orthogonal to dealing with traps, instead it allows dealing with nonrealizability.

The results I prove assume realizability, and some of the results are about traps, but independent of the results, the algorithm for picking actions resembles infra-Bayesianism. So I think we're taking similar objects and proving very different sorts of things.

Well, I agree that both formalisms use maximin so there might be some way to merge them. It's definitely something to think about.

Maybe I'm understanding this wrong... but isn't there another kind of dynamic inconsistency here?

Suppose you walk into an empty room. Then you're told you're in either WORLD A, which gives reward 1 for choosing heads and reward 0 for choosing tails, or WORLD B, which gives reward 0 for choosing heads and reward 1 for choosing tails. You have Knightian uncertainty between the worlds, so you minmax and end up indifferent between them; you arbitrarily choose heads.

Then you hear a voice:

"Actually, I flipped a fair coin that came up heads. If it had instead come up tails, in WORLD A you would have gotten reward 1 for choosing tails and reward 0 for saying heads, but in WORLD B you would have gotten 0.5 reward no matter what. (And my monologue would have changed appropriately.)"

Should you reconsider your decision to choose heads? What you learned doesn't actually change the decision you're faced with, does it?

Do you think this problem is essentially different from "suppose Omega asks you for 10 bucks. You say no. Then Omega says "actually I flipped a fair coin that came up tails, if it had come up heads, I would have given you 100 dollars if I predicted you'd give me 10 dollars on tails"?

(I think I can motivate "reconsider choosing heads" if you're like "yeah, this is just counterfactual mugging with belated notification of what situation you're in, and I'd pay up in that circumstance")

Why are you minmaxing over expected values of policies, instead of over outcomes? Isn't the worst case for the "tails only" policy "I'm in COPY and the coin is heads", not "I'm in COPY"?

Basically I don't understand why "past me, who is screaming at me from the sidelines that it matters whether I pick tails or not" once I see that the coin comes up heads is actually correct and the "me" who's indifferent is wrong; one man's modus ponens is another man's modus tollens.

Here's another example that makes my intuition go "ouch" - suppose that choosing heads in REVERSE HEADS when the coin is heads gives 0.1 utility. Then the "match the coin" policy has an expected value in REVERSE HEADS of 0.3 instead of 0.25 and the minmax rule you picked still tells you to "always pick tails", but conditioning on heads, "pick heads if you see heads" gives you 0.1 utility or 1 utility, while "always pick tails" gives you 1 utility or 0 utility, so isn't "pick heads" a better strategy?

Basically I don't understand why "past me, who is screaming at me from the sidelines that it matters whether I pick tails or not" once I see that the coin comes up heads is actually correct and the "me" who's indifferent is wrong; one man's modus ponens is another man's modus tollens.

You could say the same thing for Bayesianism. Priors are subjective, so why should my beliefs be related to past-me beliefs by the Bayes rule? Indeed, some claim they shouldn't be. But it's still interesting to ask what happens if past-me has the power to enforce eir opinions. What if I'm able to make sure that my descendant agents will act optimally from my subjective point of view? Then you need dynamic consistency: for classical Bayesianism it's the Bayes rule, and for infra-Bayesianism it's our new updating rule.

Certainly if you're interested in learning algorithms, then dynamic consistency seems like a very useful property. Our learning desiderata (regret bounds) are defined from the point of view of the prior, so an algorithm designed for that purpose should remain consistent with this starting point.

On the other hand, we can also imagine situations where past-me has a reason to trust present-me's reasoning better than eir own reasoning, in which case some kind of "radical probabilism" is called for. For example, in Turing reinforcement learning, the agent can update on evidence coming from computational experiments. If we consider the beliefs of such an agent about the external environment only, they would change in a way inconsistent with the usual rule. But it's still true that the updates are not systematically biased: if you already knew where you will update, you would already have updated. And ofc if we do treat the "virtual evidence" explicitly, we return to the standard update rule.

Maximin over outcomes would lead to the agent devoting all its efforts towards avoiding the worst outcomes, sacrificing overall utility, while maximin over expected value pushes towards policies that do acceptably on average in all of the environments that it may find itself in.

Regarding "why listen to past me", I guess to answer this question I'd need to ask about your intuitions on Counterfactual mugging. What would you do if it's one-shot? What would you do if it's repeated? If you were told about the problem beforehand, would you pay money for a commitment mechanism to make future-you pay up the money if asked? (for +EV)
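For reference, the "+EV" claim is just the ex-ante expected value of the policy, computed before the coin is flipped. The sketch below assumes the commonly used stakes (asked for 100 on tails, given 10,000 on heads only if you would have paid); those numbers and the function name are assumptions, not something fixed by this thread.

```python
def ex_ante_value(pays_when_asked, ask=100, reward=10_000, p_heads=0.5):
    """Expected value of a policy, evaluated before the coin is flipped.

    Assumed payoffs: on tails the agent is asked for `ask`; on heads it
    receives `reward` only if its policy is to pay on tails.
    """
    heads_payoff = reward if pays_when_asked else 0
    tails_payoff = -ask if pays_when_asked else 0
    return p_heads * heads_payoff + (1 - p_heads) * tails_payoff


print(ex_ante_value(True), ex_ante_value(False))
```

Under these assumed stakes, the difference between the two values is the most a would-be refuser should be willing to pay beforehand for a commitment mechanism, even though paying looks like a pure loss once tails has been observed.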

Could you defend worst-case reasoning a little more? Worst cases can be arbitrarily different from the average case - so maybe having worst-case guarantees can be reassuring, but actually choosing policies by explicit reference to the worst case seems suspicious. (In the human context, we might suppose that worst case, I have a stroke in the next few seconds and die. But I'm not in the business of picking policies by how they do in that case.)

You might say "we don't have an average case," but if there are possible hypotheses outside your considered space you don't have the worst case either - the problem of estimating a property of a non-realizable hypothesis space is simplified, but not gone.

Anyhow, still looking forward to working my way through this series :)

Infra-Bayesianism doesn't consider the worst case, since, even though each hypothesis is treated using the maximin decision rule, there is still a prior over many hypotheses[1]. One such hypothesis can upper bound the probability you will get a stroke in the next few seconds. An infra-Bayesian agent would learn this hypothesis and plan accordingly.
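A toy version of the stroke example, as a rough sketch: each hypothesis is a set of possible stroke probabilities, each hypothesis is scored by its worst-case expected utility, and the prior mixes those scores. The policies, utilities, and prior weights are made up for illustration, and the sketch leaves out updating and the rest of the actual formalism.

```python
# Hypothetical setup: a hypothesis is a set of possible stroke probabilities.
# Expected utility is linear in that probability, so interval endpoints suffice.
HYPOTHESES = {
    "bounded_risk": [0.0, 1e-6],   # "stroke probability is at most one in a million"
    "no_information": [0.0, 1.0],  # complete Knightian uncertainty
}


def expected_utility(policy, p_stroke):
    if policy == "carry_on":
        return 1.0 - p_stroke  # utility 1 unless the stroke happens
    return 0.1                 # "brace_for_stroke": small utility either way


def infra_value(policy, prior):
    """Prior-weighted sum of each hypothesis's worst-case expected utility."""
    return sum(
        weight * min(expected_utility(policy, p) for p in HYPOTHESES[name])
        for name, weight in prior.items()
    )


def best_policy(prior):
    return max(["carry_on", "brace_for_stroke"], key=lambda pol: infra_value(pol, prior))


print(best_policy({"no_information": 1.0}))
print(best_policy({"bounded_risk": 0.99, "no_information": 0.01}))
```

With all weight on "no_information" the agent braces for the stroke, which is the pure worst-case behavior the question worries about. Once most of the weight sits on the hypothesis that bounds the stroke probability, the same decision rule carries on as normal.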

We might say that infra-Bayesianism assumes the worst only of that which is not only unknown but unknowable. To make a somewhat informal analogy with logic, we assume the worst model of the theory and thereby make any gain that can be gained provably.

One justification often given for Solomonoff induction is: we live in a simple universe. However, Solomonoff induction is uncomputable, so a simple universe cannot contain it. Instead, it might contain something like bounded Solomonoff induction. However, in order to justify bounded Solomonoff induction, we would need to assume that the universe is simple and cheap, which is false. In other words, postulating an "average-case" entails postulating a false dogmatic belief. Bounded "infra-Solomonoff" induction solves the problem by relying instead on the following assumption: the universe has some simple and cheap properties that can be exploited.

  1. Like in the Bayesian case, you can alternatively think of the prior as just a single infradistribution, which is the mixture of all the hypotheses it is comprised of. This is an equivalent view. ↩︎

The REVERSE HEADS Environment always [gives] you 0.5 reward if the coin comes up tails, but [if] it comes up heads, saying "tails" gets you 1 reward and "heads" gets you 0 reward. We have Knightian uncertainty between the two environments.


In the next post, (#2)

The post after that (#3)

2 more posts to look forward to.

Later posts (not written yet) will be about the "1 reward forever" variant of Nirvana and InfraPOMDP's (~#4), developing inframeasure theory more (~#5), applications to various areas of alignment research (~#6), the internal logic which infradistributions are models of (~#7), unrealizable bandits (~#8), game theory (~#9), attempting to apply this to other areas of alignment research (~#10), and... look, we've got a lot of areas to work on, alright? (*)

Plus a speculative/possible 7 more after that assuming no overlap or multi-post topics. (~#6 and ~#10 already being counted as 2 posts.)

*More leaning on the unenumerated possibilities.


I look forward to seeing more of this!

We have Knightian uncertainty over our set of environments, it is not a probability distribution over environments. So, we might as well go with the maximin policy.

For any fixed n, there are computations which can't be correctly predicted in n steps.

Logical induction will consider all possibilities equally likely in the absence of a pattern.

Logical induction will consider a sufficiently good pseudorandom algorithm as being random.

Any kind of Knightian uncertainty agent will consider pseudorandom numbers to be an adversarial superintelligence unless proved otherwise.

Logical induction doesn't depend on your utility function. Knightian uncertainty does.

There is a phenomenon whereby any sufficiently broad set of hypotheses doesn't influence actions. Under the set of all hypotheses, anything could happen whatever you do.

However, there are sets of possibilities that are sufficiently narrow to be winnable, yet sufficiently broad to need to expend resources combating the hypothetical adversary. If it understands most of reality, but not some fundamental particle, it will assume that the particle is behaving in an adversarial manner.

If someone takes data from a (not understood) particle physics experiment, and processes it on a badly coded insecure computer, this agent will assume that the computer is now running an adversarial superintelligence. It would respond with some extreme measure like blowing the whole physics lab up.

Logical induction doesn't have interesting guarantees in reinforcement learning, and doesn't reproduce UDT in any non-trivial way. It just doesn't solve the problems infra-Bayesianism sets out to solve.

Logical induction will consider a sufficiently good pseudorandom algorithm as being random.

A pseudorandom sequence is (by definition) indistinguishable from random by any cheap algorithm: not only by logical induction, but also by a bounded infra-Bayesian.

If it understands most of reality, but not some fundamental particle, it will assume that the particle is behaving in an adversarial manner.

No. Infra-Bayesian agents have priors over infra-hypotheses. They don't start with complete Knightian uncertainty over everything and gradually reduce it. The Knightian uncertainty might "grow" or "shrink" as a result of the updates.

I am confused about how the mechanisms and desiderata you lay out here can give meaningful differences of prediction over complete spaces of environments. Maybe it is possible to address this problem separately.

In particular, imagine the following environments:

E1: the outcome is deterministically 0 at even time steps and 1 at odd time steps.

E2: the outcome is deterministically 0 at even time steps up to step 100 and 1 at odd time steps up to step 100, then starts to be drawn randomly based on some uncomputable process.

E3: the outcome is drawn deterministically based on the action taken in a way which happens to give 0 for the first 100 even step actions and 1 for the odd step actions.

All of these deterministically predict all of the first 200 observations with probability 1. I have an intuition that if you get that set of 200 observations, you should be favoring E1, but I don't see how your update rule makes that possible without some prior measure over environments or some notion of Occam's Razor.
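To spell out the worry in the classical Bayesian case: when every hypothesis assigns probability 1 to the data, Bayes' rule leaves the relative weights untouched, so any preference for E1 has to come from the prior. The sketch below uses an Occam-style prior built from made-up description lengths. Those bit counts are assumptions for illustration only, and this is the ordinary Bayes rule, not the infra-Bayesian update.

```python
def bayes_update(prior, likelihoods):
    """Posterior over hypotheses given each one's likelihood for the observed data."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


# Hypothetical description lengths in bits, used only to build an Occam-style prior.
description_bits = {"E1": 2, "E2": 3, "E3": 3}
raw = {h: 2.0 ** -bits for h, bits in description_bits.items()}
prior = {h: w / sum(raw.values()) for h, w in raw.items()}

# E1, E2 and E3 each assign probability 1 to the first 200 observations.
likelihoods = {"E1": 1.0, "E2": 1.0, "E3": 1.0}
posterior = bayes_update(prior, likelihoods)

print(prior)
print(posterior)
```

The posterior is identical to the prior, so E1 ends up favored only because the assumed prior gave it a shorter description.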

In the examples you give there are systemic differences between the environments but it isn't clear to me how the update is handled "locally" for environments that give the same predictions for all observed actions but diverge in the future, which seems sticky to me in practice.

I think I see what I was confused about, which is that there is a specific countable family of properties, and these properties are discrete, so you aren't worried about locally distinguishing between hypotheses.

Can you elaborate on what you meant by locally distinguishing between hypotheses?

I mean distinguishing between hypotheses that give very similar predictions--like the difference between a coin coming up heads 50% vs. 51% of the time.
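As a rough sense of scale for the coin example, here is a back-of-the-envelope sketch. It uses the KL divergence between the two coins as the expected log-likelihood ratio gained per flip, and asks how many flips it takes for that expectation to reach a target odds ratio. The function names and the 20:1 target are arbitrary choices for illustration.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def flips_for_odds(p_true, p_alt, target_odds):
    """Flips needed for the expected log-likelihood ratio to reach log(target_odds)."""
    return math.ceil(math.log(target_odds) / kl_bernoulli(p_true, p_alt))


print(kl_bernoulli(0.51, 0.50))
print(flips_for_odds(0.51, 0.50, target_odds=20))
```

The expected evidence per flip is tiny, so it takes on the order of ten thousand flips to reach 20:1 odds in expectation. This is the sense in which nearby hypotheses are hard to distinguish "locally".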

As I said in my other comment, I think the assumption that you have discrete hypotheses is what I was missing.

Though for any countable set of hypotheses, you can expand that set by prepending some finite number of deterministic outcomes for the first several actions. The limit of this expansion is still countable, and the set of hypotheses that assign probability 1 to your observations is the same at every time step. I'm confused in this case about (1) whether or not this set of hypotheses is discrete and (2) whether hypotheses with shorter deterministic prefixes assign enough probability to allow meaningful inference in this case anyway.

I may mostly be confused about more basic statistical inference things that don't have to do with this setting.
