FUNDAMENTALS OF INFRA-BAYESIANISM
This sequence introduces the fundamentals of infra-Bayesianism, a mathematical framework for studying artificial learning and intelligence that grew out of Vanessa Kosoy’s Learning Theoretic AI Alignment Research Agenda.
In part one, we describe the motivations for this framework, including the thought experiment of Dutch-booking Bayesians.
In part two, we summarize basic reinforcement learning theory, including the notions of environments, histories, destinies, policies, expected loss, Bayes mixture of environments, optimal and Bayes-optimal policies, regret, Bayesian regret, traps, and learnability.
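As a rough preview of the notation in part two (the symbols below are ours and may differ from those used in the post), write $\ell(\pi, \mu)$ for the expected loss of a policy $\pi$ in an environment $\mu$. The regret of $\pi$ in $\mu$ and the Bayesian regret with respect to a prior $\zeta$ over a hypothesis class $\mathcal{H}$ are then

$$\mathrm{Reg}(\pi, \mu) = \ell(\pi, \mu) - \min_{\pi'} \ell(\pi', \mu), \qquad \mathrm{BReg}(\pi, \zeta) = \mathbb{E}_{\mu \sim \zeta}\left[\mathrm{Reg}(\pi, \mu)\right],$$

so a Bayes-optimal policy minimizes expected loss under the Bayes mixture of environments, while learnability asks, roughly, for the regret to vanish on every environment in the class in an appropriate asymptotic limit (e.g. as the time discount approaches 1).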
In part three, we generalize the concepts of the second post using infra-Bayesianism. For mathematical simplicity, we stay in the setting of credal sets (closed, convex sets of probability distributions). In particular, we discuss infra-kernels, various important topologies, crisp causal laws, the minimax decision rule, infra-regret, and infra-Bayes learnability and optimality.
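To give a taste of the crisp setting (again with illustrative notation), a credal set $\mathcal{C}$ is a closed, convex set of environments, and the minimax decision rule chooses a policy by optimizing against the worst case over that set:

$$\pi^{*} \in \operatorname*{arg\,min}_{\pi} \; \max_{\mu \in \mathcal{C}} \; \ell(\pi, \mu).$$

Infra-regret is then measured, roughly speaking, against this worst-case value rather than against a single fixed environment.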
In part four, we describe a generalization of Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) called crisp supra-MDPs and supra-POMDPs. The new feature of these decision processes is that the stochastic transition dynamics are multivalued, i.e. specified by credal sets. We discuss:
- Connections between these decision processes and crisp causal laws, the hypotheses of infra-Bayesian reinforcement learning
- How supra-MDPs can approximate MDPs by a coarsening of the state space that allows an agent to be agnostic about the detailed dynamics while still having performance guarantees for the full MDP
- An algorithm to compute a Markov optimal policy for supra-MDPs with finite time horizons (see the sketch after this list)
- The existence of a stationary optimal policy for supra-MDPs with infinite time horizons and geometric time discount
- The connection between supra-MDPs and stochastic games
- Known regret bounds for supra-MDPs
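To make the finite-horizon bullet above concrete, here is a minimal sketch of robust backward induction on a toy supra-MDP. It is our own illustration, not the algorithm from the post: we use losses rather than rewards, and we represent each credal set by a finite list of extreme-point transition distributions, so the worst case over the convex hull is attained at one of them. The names (`robust_backward_induction`, `credal_sets`, `loss`) are hypothetical.

```python
import numpy as np

def robust_backward_induction(n_states, n_actions, horizon, credal_sets, loss):
    """Backward induction for a toy finite-horizon supra-MDP.

    credal_sets[s][a]: list of transition distributions (each a length-n_states
        array summing to 1); the credal set is their convex hull, so the worst
        case over the hull is attained at one of these extreme points.
    loss[s][a]: immediate loss for taking action a in state s.
    Returns value[t][s] and a Markov policy policy[t][s] minimizing
    worst-case expected total loss.
    """
    value = np.zeros((horizon + 1, n_states))       # value at the final time is 0
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in range(horizon - 1, -1, -1):
        for s in range(n_states):
            q = np.empty(n_actions)
            for a in range(n_actions):
                # The adversary ("Murphy") picks the worst distribution in the credal set.
                worst = max(dist @ value[t + 1] for dist in credal_sets[s][a])
                q[a] = loss[s][a] + worst
            policy[t, s] = int(np.argmin(q))        # the agent minimizes worst-case loss
            value[t, s] = q[policy[t, s]]
    return value, policy

# Toy example: 2 states, 2 actions, ambiguous transitions.
credal_sets = [
    [[np.array([0.9, 0.1]), np.array([0.6, 0.4])],   # state 0, action 0
     [np.array([0.2, 0.8])]],                        # state 0, action 1
    [[np.array([1.0, 0.0])],                         # state 1, action 0
     [np.array([0.5, 0.5]), np.array([0.3, 0.7])]],  # state 1, action 1
]
loss = np.array([[0.0, 1.0],
                 [2.0, 0.5]])
value, policy = robust_backward_induction(2, 2, horizon=3,
                                          credal_sets=credal_sets, loss=loss)
print(value[0], policy)
```

In the infinite-horizon case with geometric time discount, the same worst-case Bellman backup can, roughly speaking, be iterated to a fixed point, which is one way to see why a stationary optimal policy should exist there.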
In part five (not yet published), we introduce contributions and supracontributions, basic objects of infra-Bayesianism that go beyond the crisp case (the case of credal sets). We then define supra-POMDPs, a generalization of POMDPs. We use supra-POMDPs to formalize various Newcombian problems in a learning-theoretic setting where an agent repeatedly encounters the problem. In particular, we discuss the opaque, transparent, and epsilon-noisy versions of Newcomb's problem, as well as XOR blackmail and counterfactual mugging. We conclude by stating a theorem describing how, under a condition known as pseudocausality, optimality for the supra-POMDP relates to optimality for the Newcombian problem.
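For orientation, the payoff arithmetic behind the epsilon-noisy variant is simple, on our reading that the noise parameter is the predictor's error probability $\varepsilon$ and with the traditional illustrative stakes (not necessarily those used in the post):

$$\mathbb{E}[\text{one-box}] = (1-\varepsilon)\cdot 10^{6}, \qquad \mathbb{E}[\text{two-box}] = \varepsilon\cdot 10^{6} + 10^{3},$$

so one-boxing yields more in expectation whenever $\varepsilon < \frac{10^{6} - 10^{3}}{2 \cdot 10^{6}} \approx 0.4995$, i.e. whenever the predictor is even slightly better than chance.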