Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The Learning-Theoretic Agenda: Status 2023

5Marcus Ogren

3harfe

3Vanessa Kosoy

3Frank_R

3Vanessa Kosoy

9harfe

3matolcsid

2RogerDearnaley

2Frank_R

2Vanessa Kosoy

1Frank_R

2Mikhail Samin

1Review Bot

New Comment

(Disclosure: Vanessa is my wife.)

I want to share my thoughts on how the LTA can have a large impact. I think the main plan - to understand agency and intelligence fully enough to construct a provably aligned AI (perhaps modulo a few reasonable assumptions about the real world) - is a good plan. It’s how a competent civilization would go about solving the alignment problem, and a non-negligible chunk of the expected impact of the LTA comes from it working about as planned. But there are also plenty of other, less glamorous ways for it to make a big difference as well.

The LTA gives us tools to think about AI better. Even without the greater edifice of LTA and without a concentrated effort to complete the LTA and build an aligned AI in accordance with it, it can yield insights that help other alignment researchers. The LTA can identify possible problems that need to be solved and currently-unknown pitfalls that could make an AI unsafe (along the lines of the problem of privilege and acausal attack). It can also produce “tools” for solving certain aspects of the alignment problem that could be applied in an ad-hoc manner (such as the individual components of PSI). While this is decidedly inferior to creating a provably aligned AI, it is also far more likely to happen.

As for PSI, I think it’s a promising plan for creating an aligned AI in and of itself; it doesn’t appear to require greatly reduced capabilities and gives the AI an unhackable pointer to human values. But its main significance is as a proof of concept: the LTA has delivered this alignment proposal, and the LTA isn’t even close to being finished. My best guess is that, given enough time, some variant of PSI could be created as a provably aligned AI (modulo assumptions about human psychology, etc.). But I also expect better ideas in the future. PSI demonstrates that considering the fundamental questions of agency can lead to novel and elegant solutions to the alignment problem. Before Vanessa came up with PSI, I thought the main value of her research lay in solving weird-sounding problems (like acausal attack) that originally sounded more like an AI being stupid than like the AI being misaligned. PSI shows that the LTA is much, much more than this.

Regarding direction 17: There might be some potential drawbacks to ADAM. I think its possible that some very agentic programs have relatively low score. This is due to explicit optimization algorithms being low complexity.

(Disclaimer: the following argument is not a proof, and appeals to some heuristics/etc. We fix for these considerations too.) Consider an utility function . Further, consider a computable approximation of the optimal policy (AIXI that explicitly optimizes for ) and has an approximation parameter n (this could be AIXI-tl, plus some approximation of ; higher is better approximation). We will call this approximation of the optimal policy . This approximation algorithm has complexity , where is a constant needed to describe the general algorithm (this should not be too large).

We can get better approximation by using a quickly growing function, such as the Ackermann function with . Then we have .

What is the score of this policy? We have . Let be maximal in this expression. If , then .

For the other case, let us assume that if , the policy is at least as good at maximizing than . Then, we have .

I don't think that the assumption ( maximizes better than ) is true for all and , but plausibly we can select such that this is the case (exceptions, if they exist, would be a bit weird, and if ADAM working well due to these weird exceptions feels a bit disappointing to me). A thing that is not captured by approximations such as AIXI-tl are programs that halt but have insane runtime (longer than ). Again, it would feel weird to me if ADAM sort of works because of low-complexity extremely-long-running halting programs.

To summarize, maybe there exist policies which strongly optimize a non-trivial utility function with approximation parameter , but where is relatively small.

Yes, this is an important point, of which I am well aware. This is why I expect unbounded-ADAM to only be a toy model. A more realistic ADAM would use a complexity measure that takes *computational complexity* into account instead of . For example, you can look at the measure I defined here. More realistically, this measure should be based on the frugal universal prior.

I have a question about the conjecture at the end of Direction 17.5. Let be a utility function with values in and let be a strictly monotonous function. Then and have the same maxima. can be non-linear, e.g. . Therefore, I wonder if the condition should be weaker.

Moreover, I ask myself if it is possible to modify by a small amount at a place far away from the optimal policy such that is still optimal for the modified utility function. This would weaken the statement about the uniqueness of the utility function even more. Think of an AI playing Go. If a weird position on the board has the utility -1.01 instead of -1, this should not change the winning strategy. I have to go through all of the definitions to see if I can actually produce a more mathematical example. Nevertheless, you may have a quick opinion if this could happen.

I have a question about the conjecture at the end of Direction 17.5. Let be a utility function with values in and let be a strictly monotonous function. Then and have the same maxima. can be non-linear, e.g. . Therefore, I wonder if the condition should be weaker.

No, because it changes the *expected value* of the utility function under various distributions.

Moreover, I ask myself if it is possible to modify by a small amount at a place far away from the optimal policy such that is still optimal for the modified utility function.

Good catch, the conjecture as stated is obviously false. Because, we can e.g. take to be the same as everywhere except after some action which doesn't actually take, in which case make it identically 0. Some possible ways to fix it:

- Require the utility function to be of the form (i.e. not depend on actions).
- Use (strictly) instrumental reward functions.
- Weaken the conclusion so that we're only comparing and on-policy (but this might be insufficient for superimitation).
- Require to be optimal off-policy (but it's unclear how can this generalize to finite ).

I think the conjecture is also false in the case that utility functions map from to .

Let us consider the case of and . We use , where is the largest integer such that starts with (and ). As for , we use , where is the largest integer such that starts with (and ). Both and are computable, but they are not locally equivalent. Under reasonable assumptions on the Solomonoff prior, the policy that always picks action is the optimal policy for both and (see proof below).

Note that since the policy is computable and very simple, is not true, and we have instead. I suspect that the issues are still present even with an additional condition, but finding a concrete example with an uncomputable policy is challenging.

**proof**:
Suppose that and are locally equivalent. Let be an open neighborhood of the point and , be such that for all .

Since , we have . Because is an open neighborhood of , there is an integer such that for all . For such , we have This implies . However, this is not possible for all . Thus, our assumption that and are locally equivalent was wrong.

*Assumptions about the solomonoff prior*:
For all , the sequence of actions that produces the sequence of with the highest probability is (recall that we start with observations in this setting).
With this assumption, it can be seen that the policy that always picks action is among the best policies for both and .

*I think this is actually a natural behaviour for a reasonable Solomonoff prior*:
It is natural to expect that is more likely than . It is natural to expect that the sequence of actions that leads to over has low complexity. Always picking is low complexity.

*It is possible to construct an artificial UTM that ensures that
"always take " is the best policy for , *:
An UTM can be constructed such that the corresponding Solomonoff prior assigns 3/4 probability to the program/environment "start with o_1. after action a_i, output o_i". The rest of the probability mass gets distributed according to some other more natural UTM.

Then, for , in each situation with history the optimal policy has to pick (the actions outside of this history have no impact on the utility): With 3/4 probability it will get utility of at least . And with probability at least . Whereas, for the choice of , with probability it will have utility of , and with probability it can get at most . We calculate

TLDR:I give an overview of (i) the key problems that the learning-theoretic AI alignment research agenda is trying to solve, (ii) the insights we have so far about these problems and (iii) the research directions I know for attacking these problems. I also describe "Physicalist Superimitation" (previously knows as "PreDCA"): a hypothesized alignment protocol based on infra-Bayesian physicalism that is designed to reliably learn and act on the user's values.I wish to thank Steven Byrnes, Abram Demski, Daniel Filan, Felix Harder, Alexander Gietelink Oldenziel, John S. Wentworth^{[1]}and my spouse Marcus Ogren for reviewing a draft of this article, finding errors and making helpful comments and suggestions. Any remaining flaws are entirely my fault.## Preamble

I already described the learning-theoretic agenda (henceforth LTA) in 2018. While the overall spirit stayed similar, naturally LTA has evolved since then, with new results and research directions, and slightly different priorities. This calls for an updated exposition.

There are several reasons why I decided that writing this article is especially important at this point:

I am fairly optimistic that LTA leads to a solution to alignment. On the other hand, it's much harder to say whether the solution will arrive in time. I think that the rate of progress relative to person-hours invested was pretty good so far, and we didn't hit any obstacle that cast doubt on the entire endeavor. At the same time, in

absoluteterms we still have most of the work in front of us. In the coming years, I am planning to put considerable effort into scaling up: getting more researchers on board, and hopefully accelerating progress by a significant factor.## Philosophy

The philosophy of LTA wascoveredin the original article, I mostly stand behind what I wrote there and don't want to repeat it. The goal of this section is merely to recap and add a couple of points.The goal of LTA is creating a general mathematical theory of intelligent agents. There are a number of reasons why this is necessary for AI alignment:

assumptionsof the model. This is a much better situation than dealing with a sequence of heuristic steps, each of which might be erroneous and/or hiding assumptions that are not stated clearly. Of course, similarly to how in cybersecurity having a provably secure protocol doesn't protect you from e.g. someone using their birthday for a password, here too we will need to meticulously verify that the assumptions hold in the actual implementation. This might require knowledge from domains outside of theoretical computer science, e.g. physics, cognitive science, evolutionary biology etc.After I wrote the original article about LTA, more was written about the feasibility and importance of mathematics for alignment, both by me and by others.

Why is LTA concerned specifically with agents? Aren't there AI systems which are not agents? The reason is: the sort of risks I want to address are risks that arise from AIs displaying agentic behavior (i.e. building sophisticated models of the world and using these models to construct plans that pursue unaligned goals, with catastrophic consequences), and the sort of solution I envision also relies on AIs displaying agentic behavior (i.e. building sophisticated models of the world and using these models to construct plans that pursue an aligned goal, s.t. the result is protecting humanity from unaligned AI).

Finally, I want to address a common point of confusion. Sometimes I tell someone about subproblems in LTA and they respond by trying to understand why each individual subproblem (e.g. nonrealizability, or cartesian privilege, or value ontology) is relevant to alignment. Often there

aresome specific ways in which the subproblem can be connected to alignment. But the more important point is thatanyfundamental question about intelligent agency is relevant to alignment, just because without answering this question we cannot understand intelligent agency. For any aspect of intelligent agency that we don't understand and any AI that we design, one of the two will hold: either the AI lacks this aspect, in which case it is probably insufficient to protect humanity, or the AI has this aspect but we don't understand why or how, in which case it is probably unaligned (because alignment is a small target, and significant unaccounted for phenomena are likely to make you miss).## Key Problems

The starting point of LTA is Marcus Hutter's AIXI: a simplistic model of ideal agency.

Intuitively, an "intelligent agent" is a system that builds sophisticated models of the world and uses them to construct and execute plans that lead to particular goals (see also Yudkwosky). Building models implies starting from a state of ignorance and updating on observations, which can be formalized by Bayesian probability theory. Building sophisticated models requires using Occam's razor, which can be formalized by the Solomonoff prior. Constructing and executing plans can be formalized as maximizing expected utility. Putting all these ingredients together gives us AIXI.

However, there are both important gaps both in our understanding of AIXI-like agents, and significant weaknesses in the AIXI framework itself. Solving these problems is the natural framing for the entire foundational

^{[2]}part of LTA.## Problem 1: Computational Resource Constraints

AIXI is uncomputable. Obviously, real-world agents are not only computable but operate under strong computational resource constraints. This doesn't mean AIXI is not a useful

toy modelfor studying the interaction of Occam's razor with Bayesian optimization or reinforcement learning. However, itisa major limitation.One obvious attempted solution is to impose some computational resource constraints on the programs that appear in Solomonoff's prior, for example as in Schmiduber's speed prior or in some other way. This typically leads to agents that are computable but still computationally intractable. Relatedly, a recent line of research connects the hardness of time-bounded Kolmogorov complexity with the existence of one-way functions.

On the other hand, we know some priors for which asymptotic Bayes-optimality is possible in polynomial time, for example priors supported on Markov decision processes (MDPs) with a small (i.e. polynomial in the security parameter) number of states. Some stronger feasible models are also known, e.g. MDPs with linear features or kernel-MDPs. However, all such priors have significant shortcomings:

## Problem 2: Frequentist Guarantees

AIXI is Bayes-optimal (by definition) but is not necessarily optimal in any sense for any particular environment. In contrast, in statistical learning theory we usually demand a

frequentistguarantee, i.e. that a learning algorithm converges to optimality (in some sense) for any^{[3]}data source (subject to some assumptions which depend on the setting). Such guarantees are important for several reasons:Learningis a key ability of intelligent agents, and played a fundamental role in AI progress in recent decades. A natural way to operationalize learning is: an agent learned a fact when its behavior is optimal conditional on this fact. In other words, learning which of a set of hypotheses is true means converging to optimal behavior for the true hypothesis, i.e. precisely a frequentist guarantee. This means that a theory of frequentist guarantees tell us bothwhichfacts agents can learn and howfastthey learn them (or how much data they require to learn them). These are important questions that a theory of intelligent agents has to answer. In particular, it is required to analyze the feasibility of alignment protocols: e.g. if we expect the AI to learn human values, we need to make sure it has sufficient information to learn them within a practical time frame.Moreover, in addition to single-agent frequentist guarantees it is desirable to derive

multi-agentfrequentist guarantees. Interactions between agents are quite important in the real-world, and have significance specifically for AI alignment as well (AI-human, AI-other AI, AI-acausal trade partner, to some extent human-human too). A multi-agent frequentist guarantee can take the form of convergence to a particular game-theoretic solution concept, or an asymptotic lower bound on expected utilities corresponding to some notion of game value.There are multiple difficulties in deriving frequentist guarantees for AIXI-like agents. The first two difficulties ("traps" and "password guessing games" below) are not problems with the agent, but rather with the way conventional frequentist guarantees are defined (which are nonetheless non-trivial to solve). The third difficulty (nonrealizability) might be a problem with the way the agent is defined.

## Subproblem 2.1: Traps

The most common type of frequentist guarantee in reinforcement learning (RL) is regret bounds

^{[4]}. However, AIXI doesn't satisfy any regret bound w.r.t. the underlying hypothesis class (i.e. the class of all computable environments) because these hypotheses involve irreversible transitions (traps). For example, suppose that in environment 1 taking action A sends you to a sink state with reward −1, whereas taking action B sends you to a sink state with reward +1, and in environment 2 the opposite is true. Obviously a hypothesis class which contains both environments doesn't admit a regret bound.This problem is especially acute in the multi-agent setting, because other agents have memory. Either it's impossible to erase memory in which case the environment is irreversible by design, or it's possible to erase memory which breaks conventional learning theory in other ways.

## Subproblem 2.2: Password guessing games

What if we arbitrarily rule out traps? Consider an agent with action set A and observation set O. Suppose that the reward is a function of the last observation only r:O→R. We can specify an environment

σ:(O×A)∗×O→Sα:(O×A)∗→Bω:S→O^{[5]}μ:(O×A)∗→ΔO by providing a communicating MDP with state set S and action set B augmented by representation mappingsHere, α(ha) is required to be an onto mapping from A to B for any h∈(O×A)∗×O. Also, we require that for any h∈(O×A)∗ and o∈O

ω(σ(ho))=oHowever, this only enables a fairly weak regret bound. Naively, it seems reasonable to expect a regret bound of the form ~O(KατβTδ) where K is the Kolmogorov complexity of the MDP+representation, τ is the diameter

^{[6]}and T is the time horizon^{[7]}, and α,β,δ are constants s.t. δ<1. Indeed, K can be regarded as the amount of information that needs to be learned and e.g. Russo and Van Roy showed that in a bandit setting regret scales as the square root of the entropy of the prior (which also expresses the amount of information to be learned). However, this is impossible, as can be seen from the following example.Fix n∈N and p∈{0,1}n+1 (the "password"), let the action set be B:={0,1} and the state space be S:={0,1}≤n⊔⊤. We think of a state s∈{0,1}≤n as "the agent entered the string of bits s into the terminal". The initial state is ϵ (the empty string). The transition kernel works as follows:

In state s∈{0,1}<n, taking action i produces the state si (i.e. the agent enters the digit i). In state s∈{0,1}n, taking action i produces the state ⊤ if si=p (the agent entered the correct password) or state ϵ (the agent entered an incorrect password and has to start from the beginning). In state ⊤ taking action 0 stays in state ⊤ and taking action 1 produces state ϵ (restarts the game).

All states have reward 0 except for the state ⊤ which has reward 1.

Obviously the agent needs time 2n to learn the correct password, in expectation w.r.t. the uniform distribution over passwords. At the same time, K≈n and τ=n+1. This is incompatible with having a regret bound of the desired form.

## Subproblem 2.3: Nonrealizability

As we discussed in Problem 1, a realistic agent cannot be Bayes-optimal for a prior over all computable environments. Typically, each hypothesis in the prior has to admit a computationally feasible approximately optimal policy in order for it to be feasible to approximate Bayes-optimality (in particular this has to be the case if the prior is learnable), which usually implies the hypotheses are computationally feasible to simulate

^{[8]}. However, in this situation we can no longer assume the true environment is within the hypothesis class. Even for AIXI itself, it is not really fair to assume the environment is computable since that would exclude environments that contain e.g. other AIXIs.A setting in which the true environment is not in the hypothesis class is known in learning theory as "nonrealizable". For offline (classification/regression) and online learning, there is a rich theory of nonrealizable learning

^{[9]}. However, for RL (which, among classical learning theory frameworks, is the most relevant for agents), the nonrealizable setting is much less understood (although there are some results, e.g. Zanette et al).Therefore, even if we arbitrarily assume away traps, and are willing to accept a factor of 2K in the regret bound, a satisfactory theory of frequentist guarantees would require good understanding of nonrealizable RL.

This problem is also especially acute in the multi-agent case, because it's rarely the case that each agent is a realizable environment from the perspective of the other agent (roughly speaking, if Alice would simulate Bob simulating Alice, Alice would enter an infinite loop). This is known as the "grain of truth" problem

^{[10]}. One attempt to solve the problem is by Leike, Taylor and Fallenstein, using agents that are equipped with "reflective oracles". However, there are many different reflective oracles, and the solution relies on all agents in the system to use the same one, i.e. that the design of the agents has been essentially centrally coordinated to make them compatible.## Problem 3: Value Ontology

AIXI's utility function depends on its direct observations. This is also assumed in RL theory. While this might be a legitimate type of agent, it is insufficiently general. We can easily imagine agents that place value on parameters they don't directly observe (e.g. the number of paperclips in the observable universe, or the number of people who suffer from malaria). Arguably, humans are such agents.

The obvious workaround is to translate the utility function from its original domain to observations by taking a conditional expectation. That is, suppose the utility function is U:X→R for some space X, let D be the space of everything that is directly observed (e.g. action-observation time sequences) and suppose we have some prior over X×D. Then, we can define U′:D→R by

U′(d):=E[U(x)|d]However, there are problems with this:

Therefore, it is desirable to have a theory in which it is possible to explicitly talk about utility functions with domains other than observations. (See also de Blanc.)

## Problem 4: Cartesian Privilege

The formalization of Occam's razor in AIXI relates to the description complexity of hypotheses represented in terms of the agent's actions A and observations O (as functions (O×A)∗→ΔO). However, this is not how Occam's razor is used in science. When we talk about a "simple" or "elegant" scientific theory, we refer to the equations governing the fundamental degrees of freedom on that theory (e.g. particles, or fields or strings),

notto the equations that would be needed to describe the RGB values of points on a person's retina. Indeed, expressing physics via the latter equations would be horrifically complicated.In other words, AIXI-like agents believe they occupy a privileged position in the universe, which has various pathological consequences. See "infra-Bayesian physicalism" for more discussion and Demski and Garrabrant for related prior work.

## Problem 5: Descriptive Agent Theory

The framing of AIXI is: given certain subjective choices (e.g. universal Turing machine and utility function), what is an ideal agent? However, in the real-world agents are not ideal. Moreover, we want to be able to understand which systems are agents, and what kind of agents they are, without already assuming e.g. a specific utility function. In particular, such a theory would have applications to:

Legg and Hutter defined an intelligence measure that for every policy produces a number determining its intelligence, s.t. AIXI has maximal intelligence. The measure is essentially the policy's expected utility w.r.t. the Solomonoff prior. This allows us to talk how close any given system is to an ideal agent. However, it suffers from two major limitations (in addition to the other problems of AIXI discussed before):

isa choice (and a very restrictive one).Moreover, naive attempts to ascribe a utility function to a policy run into difficulties. Specifically, any policy can arguably be ascribed a utility function which rewards following precisely this policy. With respect to such a utility function, the policy is Bayes-optimal for any prior. Armstrong and Mindermann have argued on the basis of this type of reasoning that ascribing a utility function is essentially impossible in a purely behaviorist framework (but I argue the opposite, see Direction 17.5 below).

## Nonproblem: Expected Utility

Finally, I want to briefly discuss one issue which I

don'tconsider a key problem. Namely, the reliance of AIXI on expected utility maximization.Expected utility is often justified by coherence theorems such as von Neumann–Morgenstern (VNM) and Savage. These theorems are indeed important: they show that a small number of natural assumptions produce a fairly narrow mathematical object (expected utility). This should indeed increase our credence in expected utility as a useful model. Often, objections are put forward that argue with individual assumptions. IMO these objections miss the point: a coherence theorem is a piece of evidence, not a completely watertight proof that expected utility is philosophically correct. And, if expected utility is philosophically correct, such a watertight proof is still not something we should expect to exist (what would it even look like?)

The more important justification of expected utility is the large body of work it supports: control theory, game theory, reinforcement learning theory etc. The methodology I believe in is, start with

plausibleassumptions and try to build a theory. If the theory that results is rich, has synergy with other bodies of knowledge, has explanatory power, is useful in practice, each of these is evidence that it's on the right track. (And if we failed to build such a theory, we probably also learned something.)Another objection that is often raised is "humans have no utility function". This is ostensibly supported by research in behavioral economics which shows that humans behave irrationally. I consider this deeply unconvincing. For one thing, I suspect that a large part of that research doesn't replicate. But even putting that aside, the interesting thing about irrational behavior is that we

recognizeit as irrational: i.e. learning about it makes you behave differently (unless it's so minor that it isn't worth the effort). This already indicates that this irrationality is better viewed not as a fundamental fact about human preferences, but as some combination of:All of this is not to say that expected utility cannot be questioned. Rather that a

productiveobjection would be to either come up with a demonstrably superior alternative, or at least a good argument why some extremelyconcreteproblem cannot be solved without an alternative to expected utility. Lacking that, the best strategy is to press forward unless and until we encounter such a problem.Now, we actually

dohave examples where deviating from purist VNM in specific ways is useful, e.g.:maximizingexpected utility we sample a random choice out of some top fraction of choices.However, these examples came about from studying various concrete problems, not from arguing with the VNM model per se. Moreover, all of these examples can still be mathematically recast in VNM form: infra-Bayesianism can be regarded as a VNM zero-sum game (against "Murphy"), quantilization and Nash bargaining can be regarded as special cases of infra-Bayesianism (as will be discussed in an

~~upcoming~~article by Appel). Therefore, even here the theory built around the VNM model remains useful.## Research Directions

This section focuses entirely on the foundational part of the programme. I will mention some of the applied part in the next section (although different applications are also possible, e.g. studying quantilization or IDA), but for the most part I believe the foundational part to be the top priority.

In the following, I roughly grouped the research directions by the problems they are trying to address. However, the real relationship between directions and problems is many-to-many: a single direction can have implications on many problems. Moreover, there are many connections between the different directions. And, even for directions that are not especially connected at present, if they are successful, we will be faced with the task of merging them into a unified theory.

## Subproblem 1.1: Frugal Universal (Infra-)Prior

One part of solving Problem 1 (computational resource constraints) is finding a prior (more precisely a family of priors) with the following properties:

More generally, we might want an infra-prior (see Direction 3 below), or a metacognitive (infra-)prior (see Direction 6 below), or a physicalist infra-prior (see Direction 18 below) with analogous properties.

## Direction 1: Frugal Compositional Languages

The time complexity of finding the optimal policy for a generic (tabular) MDP scales with its number of states. The same is true of the sample complexity of learning an unknown generic MDP. However, the number of states in a sophisticated model of the real world has to be enormous. For example, if I reason about the world as it's comprised of n objects with m possible states each, the over number of states is already mn.

Therefore, any realistic RL algorithm has to exploit some

structurein its environment. For example, the environment might be comprised of spatially separated parts, or different processes happening on different spatial scales, or different processing happening on different temporal scales. We want to find an appropriatecompositional languagefor describing such structure.The compositional language used by AIXI is that of programs for some fixed UTM. This language is very powerful, but also unfeasible to work with, because programs can consume arbitrarily large amounts of computational resources or even fail to halt altogether. Imposing ad hoc computational resource bounds also doesn't work well, perhaps because such bounds are extraneous to the compositional properties of the language.

Arguably we need some new "frugal" compositional language. Here are the candidate starting points I currently know. It is possible that the right answer will involve combining several of them.

Direction 1.1: Compositional Polytope-MDPsCompositional polytope-MDPs is a language with two operations: union of asynchronous noninteracting parts and time hierarchy. Importantly, the optimal policy is computable in time polynomial in description length. The time hierarchy operation also offers a natural formalization for the notion of "instrumental values".

Further research might involve:

Direction 1.2: String Machines (for RL)String machines is a category-theoretic framework that allows constructing automata and transducers from string diagrams

^{[11]}(that can be probabilistic). In particular, it can be used to describe MDPs. In fact, there is a version of it that allows constructing arbitrary Turing machines, but there are natural ways to "throttle" to expressiveness of the language.Further research might involve:

EDIT:See also initial foray using filtered string machines.]Direction 1.3: Infra-Bayesian LogicInfra-Bayesian logic is a sort of higher-order logic where the semantics has infradistributions instead of sets corresponding to predicates, and can be viewed as a synthesis of logic and probability theory. In particular, it is fairly natural for describing infra-(PO)MDPs because the main part of the latter is the transition infrakernel.

However, the semantics are not fully defined because of some technical difficulties. There is also a first-order version which

isfully defined but seems less natural for describing RL hypotheses. Finally, there are hints that propositional infra-Bayesian logic might be more computationally tractable than classical propositional logic, but it's as yet unclear.## Direction 2: Deep Learning Theory

It seems plausible that deep learning algorithms (e.g. transformers) are already learning algorithms for the frugal universal prior. Therefore, building a rigorous theory of deep learning (in particular, its inductive biases and generalization properties) is another potential path to understanding this prior and its properties.

While this is a relevant research direction, I don't prioritize it for two reasons:

Nevertheless, it is a type of research that has implications on LTA and alignment researchers can reasonably consider to participate in it, if they feel especially well positioned for it and have the discipline and judgement required to self-censor if necessary.

## Subproblem 2.3: Nonrealizability

The research directions below are different ways to get frequentist guarantees in the nonrealizable setting, but they are not necessarily mutually incompatible. In addition to exploring each approach separately, it can be interesting to study various combinations.

## Direction 3: Infra-Bayesian RL

Infra-Bayesian RL (IBRL) is a way to combine reinforcement learning theory with imprecise probabilities which produces a specific kind of frequentist guarantee in the nonrealizable setting (infra-Bayesian regret). A rough summary of the results we got in this framework so far:

~~upcoming~~paper: "Imprecise Multi-Armed Bandits" (IMAB).In particular, the result about Newcombian environments and also infra-Bayesian physicalism (see Direction 18 below) are two products of infra-Bayesianism that I don't see, at present, how to reproduce with any of the alternative approaches to nonrealizability.

Directions for further research include:

## Direction 4: Exploiting properties of simplicity priors

In general, Bayesian inference has only fairly weak guarantees in the nonrealizable setting. However, simplicity priors might produce stronger guarantees. In particular, Lattimore, Hutter and Gavane (LHG) showed

^{[12]}that (a specific version of) Solomonoff induction converges to predicting a deterministic computable subsequence even when the full sequence is uncomputable.Directions for further research include:

## Direction 5: Agnostic RL

The conventional approach to nonrealizability in offline and online learning is to replace convergence to the true hypothesis with convergence to the hypothesis closest to the truth. Similarly, in RL, given some natural metric or divergence d on the space of environments, we can try to prove regret bounds of the form

Reg(T)=o(T)+infμ∈Hd(μ,μ∗)⋅THere, T is the time horizon, H is the hypothesis class and μ∗ is the true environment.

That is, we allow a sublinear term (as in the realizable case) and a linear term which scales with the distance of μ∗ from H. Indeed, a result of this form was proven in Zanette et al, although in a model-free setting (i.e. instead of the environments themselves they consider their corresponding state-action value functions, although their divergence also depends directly on the environment).

At least superficially, a potential drawback of this approach is that Bayes-optimality doesn't imply this "agnostic" regret bound (AFAICT), as opposed to ordinary regret bound of the realizable setting. There is also no immediately obvious agnostic analogue of Bayesian regret. However, we can rectify this by recasting the setting in IBRL language. Indeed, for any μ∈H it is possible to construct a (non-crisp) infra-Bayesianian law

EΛμπ[f]=minμ′(Eμ′π[f]+d(μ,μ′))^{[13]}Λμ s.t., for any policy π and f:(O×A)∗→[0,1]:Notice that when d is total variation distance, we can take Λμ=μ.

Then, learning the hypothesis class {Λμ∣μ∈H} is equivalent to agnostically learning H. Moreover, IBRL does have Bayesian regret, and a notion of Bayes-optimality that implies infra-Bayesian regret bounds, which thereby carries over to agnostic RL.

A few examples of directions for further research:

^{[14]}.## Subproblem 1.2/2.1: Traps

Allowing traps in the environment creates two different problems:

It makes some sense to consider these problems together, but different direction emphasize different sides.

## Direction 6: Metacognitive Agents

This research direction is motivated not only by traps, but also by producing especially strong frequentist guarantees via metacognition (a form of self-improvement). More speculatively, this model might be also applicable to humans (and related specifically to conscious reasoning), which makes it relevant to superimitation (see relevant section below). It was originally called Turing RL, but that name seems incorrect now because "RL" assumes learnability (no traps).

In a metacognitive agent, in addition to the usual "external" interfaces (actions and observations), there is also an "internal" interface that allows the agent to execute arbitrary programs and observe their outputs. This allows us to use a learning algorithm to form subjective beliefs about computations. The beliefs can be entangled with the agent's beliefs about the external world.

For example, let Λ be the set of all lambda terms. Then our agent might have a belief Θ over Γ:=2Λ×Λ that refers to the relation of β-equivalence on Λ. In a Bayesian framework Θ is a distribution, and the agent might also have a belief about the external world that takes the form of an RL environment that depends on a point in the support of Θ. In an infra-Bayesian framework, Θ is an infradistribution and there is an appropriate way to define a notion of "law" s.t. Θ is its marginal

^{[15]}.However, it might be more interesting to put all beliefs about the world "inside quotation". Indeed, let ζ be a prior about the external world that a program can sample from: for example the unnormalized Solomonoff prior. Given a β-equivalence oracle, we can sample this prior using some oracle machine M (since such an oracle allows to instantly know whether a given program produces given output). Since any point in Γ has the right type signature to serve as such an oracle, we can construct a kernel Z from Γ into the external world s.t. Z(y) is defined by plugging the oracle y into M. We can then combine any Θ with Z (i.e. form the semidirect product Θ⋉Z) to obtain the overall belief.

Notice that this setting is inherently nonrealizable, since the ground truth about Γ is uncomputable. On the other hand, the nonrealizability of the external world can in principle be circumvented, because we have computationally simple hypotheses of the form "the external world behaves according to program P". Moreover, even "the external world is sampled from the unnormalized Solomonoff prior" is a computationally simple hypothesis. Hence, as long as the world is computable, we can describe it exactly modulo uncertainty about computations.

While the external world can contain irreversible transitions, it seems reasonable to employ an approximation in which executing most

^{[16]}programs doesn't have irreversible harmful side-effects. As a consequence, if the agent survives long enough, it can learn most facts about computations. In the "quoted prior" setting, it will then use what it learned about computations to approximate Bayes-optimal behavior in the external world, which addresses subproblem 1.2. Moreover, since some hypotheses assert quoted priors aboutcomputations, it can arguably use what it learned to improve its own learning algorithm.This framework is closely related to MIRI's notion of "logical uncertainty". Hence, it makes sense to compare it Garrabrant's logical induction (LI). While LI might turn out to have a useful role in this framework, in itself, LI doesn't address some key relevant questions:

There are several potential starting points for research into metacognitive agents:

Direction 6.1: Bayesian planning via query-based learning of computationsArguably tree automata and more generally string machines are a natural class of hypotheses about computations. Therefore, we can try building on existing research in automata inference

^{[17]}. This field is infamous for its many impossibility results, but there is one model of learning that has been quite successful. Namely, learning automata assuming two types of allowed queries:That is, algorithms exist that guarantee convergence to a correct automaton with a number of queries that is polynomial in minimal automaton size.

In our setting, the first type of query can be implemented by running a specific program. The second type of query cannot be implemented directly, but is effectively available in various natural setups. For example:

Therefore, it is a natural question whether we can implement Bayesian planning by some protocol of this type, assuming (for starters) that it is possible to efficiently sample from the prior. For example, assume that the hypothesis about computations translates to a hypothetical infra-Bayesian coarsening (or Bayesian approximation?) of the prior for which optimal planning is feasible. Assume also that there if resulting policy performs less well than promised on the true prior, there is some way to produce a counterexample to the hypothesis about computations. Then, counterexample queries are effectively available.

Automata learning might also be feasible in a setting where the automaton generates a sequence that we need to predict. (At least for a deterministic automaton this is trivial because the sequence is periodic.) Speculatively, the amount of available computational resources can be treated as defining a "sequence" (possibly related to "logical time") and this can be exploited to get guarantees even when the prior cannot be sampled efficiently.

Direction 6.2: Expressiveness of string machinesString machines are a potential starting point for finding a frugal universal prior for metacognitive agents. In order to strike the right balance between "frugality" and "universality" we need to understand the expressiveness of the string language and the computational complexity of evaluating string machines (or answering more complicated questions about string machines) under various assumptions. In particular, parameterized complexity theory might be relevant here, since there seem to be more relevant parameters than just description size (e.g. the number of meta-vertex levels we allow).

Direction 6.3: Query learning of string machinesWe can also study the sample complexity of learning string machines in the two-type query model above. While existing results guarantee learning with a number of queries polynomial in the

number of statesof an automaton, we would like query complexity to scale well with thedescription sizeof an automaton in string language. More general string machines are not even automata.## Direction 7: Approximation algorithms for Bayesian planning

Subproblem 1.2 comes from an NP-hardness result, but that still leaves room for studying approximation algorithms. There are many NP-hard optimization problems for which we have conjectured optimal approximation ratios (e.g. based on the PCP theorem or the unique games conjecture). It is conceivable that we can prove a result of this form for Bayesian planning. Some approaches:

- (Direction 7.1) Fix a prior ζ over set of MDPs with state set S and action set A, and an arbitrary stationary policy π0:S→A. Let π∗ be the (unknown) Bayes-optimal policy. For any policy π we can consider the approximation ratio

Eζπ[U]−Eζπ0[U]Eζπ∗[U]−Eζπ0[U]## Direction 8: Expanding Safety Envelope

The Expanding Safety Envelope (XSE) is an approach to traps which can be summarized as follows:

One significant limitation of this approach is, when using a rich prior, discovery of new safe actions can at best be approximate. While we can study variants where we sometimes try risky actions, guaranteeing any sort of optimality there is likely to be intractable. However, maybe it is feasible to require that, whenever the posterior admits an approximation (e.g. in the sense of total variation distance) by a prior with iteratively discoverable safe actions, we get a corresponding lower bound on the performance of our algorithm starting from this point.

## Better regret bounds for simple priors

This section is a grab bag of research directions about improving regret bound theory for simple hypothesis classes based on MDPs. They are not directly related to AIXI, but rather fit perfectly within classical RL theory

^{[18]}. However, it seems plausible that they are necessary pieces of the bigger puzzle (i.e. for a full solution of Problem 2).## Direction 9: Unifilar MDPs

[

EDIT:What I called "unifilar MDP" is already known in the literature under the name "regular decision process" (RDP): see Ronca and De Giacomo. Notably, Subproblem 2.2 (password games) already emerges in learning RDPs with small number of states. Ronca and De Giacomo address it by assuming that the uniformly random policy has a high probability to reach every RDP state: a condition which is clearly insufficiently general, but does serve as an interesting starting point.]Most of RL theory literature focuses on learning an MDP with a known state space, implicitly assume a known representation. Ortner et al prove a regret bound in a setting with unknown representation, but it scales with the

cardinalityof the set of possible representations. For real-world agents, I expect this cardinality to be something like doubly-exponentially large, so their bound is extremely weak.As a first step towards a better theory of representation learning, I propose the following setting. Fix action set A and observation set O. Then, a

unifilar MDP(UMDP: combination of MDP and unifilar hidden Markov model) M consists of the following data: the state set S, the initial observation distribution β0∈ΔO, the state initialization function ι:O→S, the response kernel ρ:S×A→ΔO and the transition function T:S×A×O→S. The interaction of a policy π with a UMDP works as follows:Notice that the state at any moment of time can be recovered from the observation-action history, i.e. the above data defines some σM:(O×A)∗×O→S. Hence this is indeed an MDP rather than a POMDP.

Fix a reward function r:(O×A)∗→[0,1] that can itself be represented by a deterministic finite state machine. We then constrain M by requiring that there exists some rM:S×A→[0,1] s.t. for any h∈(O×A)∗×O and a∈A we have

r(ha)=rM(σM(h),a)The problem is then proving the existence of an algorithm for interacting with an unknown communicating UMDP M compatible with r which satisfies a regret bound that scales with M's number of states, number of actions and diameter.

## Direction 10: Foreground MDPs

Regret bounds in (lifelong/non-episodic) RL scale with MDP diameter because of the need to rule out traps. On the other hand, sequence prediction / online learning requires no such constraint and even allows environments with infinitely many states. However, sequence prediction / online learning can also be regarded as a degenerate special case of RL. This suggests that there should be a natural theory that combines both.

Motivated by the above, I propose to study MDPs of the following form. Assume that the state set of the MDP factors as S=Sback×Sfore (background and foreground). Moreover, there is some Tback:Sback→ΔSback s.t. the Sback-marginal distribution of T(s,t,a) is Tback(s) for any s∈Sback, t∈Sfore and a∈A, where T is the transition kernel of the MDP. That is, the background dynamics depends neither on the foreground state nor on the agent's actions. On the other hand, the foreground dynamics can depend both on the background state and the agent's actions.

We can then define a notion of diameter for the foreground only, and derive regret bounds that scale with it. Such a regret bound can scale with the total number of states, or with some dimension parameter (see Direction 11 below). In the latter case, Sback can be infinite.

## Direction 11: Canonical RL Dimension

[

EDIT:Foster et al achieved significant progress on this, which I didn't know at the time of writing.]This direction already receives some attention in mainstream academia, so it might be of relatively lower priority.

In multiple areas of learning theory, we have a fairly complete characterization of how sample complexity depends on the hypothesis class. In many cases, the key parameter of the hypothesis class is some type of "dimension". For example:

In RL, our understanding is less good. We do have some results of similar nature, most of them derived from Russo and Van Roy's "eluder" dimension. For example Jin, Liu and Miryoosefi demonstrate a model-free regret bound that scales with "Bellman-Eluder" dimension and the logarithm of a covering number (the later can be said to correspond to Minkowski-Bouligand dimension). However, we don't have matching

lower boundsfor these results. This means that we don't know whether the results we have cannot be subsumed by some better theory (indeed, Du et al show results that are neither strictly weaker nor strictly stronger than Jin, Liu and Miryoosefi). In contrast, the flavors of learning theory mentioned above come with proofs that their notions of dimension are optimal (albeit, only under the assumption of a worst-case distribution).## Direction 12: Epistemic Regret Bounds

This direction attempts to address Subproblem 2.2.

The motivating intuition is: formulating the correct frequentist guarantee requires correctly distinguishing between

epistemicandaleatoricuncertainty. The problem with password games is that the password is completely random information that is impossible to infer by any method other than brute force. Hence, we should treat this information as part of "aleatoric" uncertainty.Let's recall the definition of Bayesian regret. Let H be a set representing hypotheses and ζ∈ΔH the prior. Since each hypothesis defines an environment, we have a mapping

μ:H×(O×A)∗→ΔOThe Bayesian regret of a policy π is given by

BReg(π,T):=Eθ∼ζ[maxπ′Eμπ′[T−1∑i=0ri∣∣ ∣∣θ]−Eμπ[T−1∑i=0ri∣∣ ∣∣θ]]Here T is the time horizon, and ri is the reward at time i.

For any environment ν:(O×A)∗→ΔO and history h=o0a0…on−1an−1on, denote

νabs(h):=n∏i=0ν(oi|o0a0…oi−1ai−1)That is, νabs(h) is the probability that ν produces h assuming a deterministic policy compatible with the actions in h.

Notice that minimizing Bayesian regret is equivalent to maximizing expected utility. However, expected utility only depends on μ and ζ via μBayes:(O×A)∗→ΔO defined by

μBayes:=Eθ∼ζ[μ(θ)]Here, the expectation is in the sense of the Bayes' rule for environments, not pointwise. That is, first θ is sampled out of ζ and then the agent faces the environment μ(θ). Explicitly:

μBayes(ha)=Eθ∼ζ[μabs(h∣θ)μ(ha∣θ)]Eθ∼ζ[μabs(h∣θ)]On the other hand, Bayesian regret depends directly on μ and ζ. Indeed, the minimal Bayesian regret is typically non-zero even for learnable H. On the other hand, if we took H′ to be a single element set {θ0} and defined μ′:H′×(O×A)∗→ΔO by μ′(θ0)=μBayes, we would get μ′Bayes=μBayes but the corresponding Bayesian regret of the Bayes-optimal policy would vanish.

This means that the definition of Bayesian regret treats the uncertainty coming from ζ and the uncertainty coming from the stochasticity of μ differently. We can call the former "epistemic" and the latter "aleatoric". Intuitively, the T-dependence of Bayesian regret measures the rate at which the agent reduces its goal-relevant epistemic uncertainty, and it is sublinear iff the agent asymptotically reduces it to 0 i.e. learns all the "epistemic information" it needs. On the other hand, we don't expect the agent to learn all "aleatoric information" but only to maximize expected utility w.r.t. to aleatoric uncertainty.

In the case of the Solomonoff prior, the naive approach regards the uncertainty about the program for the UTM as epistemic uncertainty and the uncertainty corresponding to the probabilities computed by that program as aleatoric uncertainty. However, in algorithmic statistics

^{[19]}there are more nuanced tools to separate "structured" and "random" information, like the notion of "sophistication".Given any string x∈{0,1}∗, we can consider ways to represent it as running program p (which we assume to halt on

|p|+|y|≤K(x)+callinputs) on input y. Let's constrain this representation to be nearly-optimal compression, i.e. require that for some some "small" number c∈NThe minimal length of p over all such representations is called the c-

sophisticationof x and can be interpreted as the amount of "structured" information in x. In our case, the minimal p of an MDP might be considered as the "epistemic" information it contains, which leads to a novel decomposition of the Solomonoff prior into some ζ and μ. Care might be required for choosing c. For example, maybe we can set c=logK(x).In purely frequentist terms, we can hope for a regret bound of e.g. the form

~O(Pα2η(K+c(K)−P)τβTδ)Here P is the c(K)-sophistication of the MDP+representation (s.t. K+c(K)−P is the amount of "random" information) and η is another constant.

More generally, instead of a single MDP we can consider a computable distribution ξ over a D-dimensional family of MDPs of diameter at most τ, using some notion of dimension from RL theory (see Direction 11 above). Then, we can expect that the average (ξ-Bayesian) regret of the family can be bounded by

~O(Pα2η(K+c(K)−P)τβDκTδ)Here κ is yet another constant, whereas P and K refer to the sophistication and Kolmogorov complexity of ξ.

Given an appropriate notion of "computable distribution over MDPs", this would encompass learning of MDPs with arbitrary (uncomputable) transition probabilities.

## Subproblem 2.4: Multi-Agent Guarantees

One special case is completely straightforward in IBRL, namely two-player zero-sum games: IBRL essentially

isa zero-sum game, so it's enough that each player views the other as "Murphy". So, for example two IBRL agents equipped with sources of randomness (s.t. effectively their actions are lotteries over game moves) playing a stochastic game necessarily converge to a Nash equilibrium, assuming that the corresponding infra-MDPs are in their respective hypothesis classes.Beyond that, there are roughly two approaches to multi-agent guarantees:

superrationalbehavior from learning theory.AFAIK either of the following possibilities is conceivable:

Different directions below have different relationships with these worlds.

[

EDIT:See also Infra-Bayesian Haggling, a new direction for Worlds 2.4.2/3]## Direction 13: Population Games

This direction is motivated by Worlds 2.4.1/2.

A population game is a setting where large populations of players play against each other by being sorted into random groups on each turn, s.t. the probability to meet the same opponent twice is negligible. This setting is designed to reduce incentives for superrationality, although it's still unclear whether this is sufficient to justify classical game theory.

An appealing feature of this setting is that it can be naturally interpreted as a dynamical system. For learning agents satisfying very mild assumptions, every fixed point corresponds to an ϵ-Nash equilibrium (where ϵ→0 as the geometric time discount constant γ→1). Moreover, a fixed point exists for any tuple of player policies.

It is therefore natural to ask whether any of the fixed points have to be attractors (under some assumptions about the learning algorithms), study their attraction basins, study other fixed submanifolds etc.

## Direction 14: Logit equilibria of finite-state repeated games

This direction is motivated by World 2.4.1. (See this for early thinking on the topic.)

A logit equilibrium is defined similarly to a Nash equilibrium with "optimal response" replaced by "soft-optimal response" (i.e. the players behave according to the softmax distribution associated with the expected payoffs of different plays for fixed opponent distributions). Consider the iterated prisoner's dilemma where each player's pure strategies are constrained to be of the form:

In particular, tit-for-tat corresponds to p0=C and f(q)=q.

For geometric time discount γ close to 1, both always-defect/always-defect and tit-for-tat/tit-for-tat are Nash equilibria. Moreover, tit-for-tat/tit-for-tat is also an approximate logit equilibrium for softmax parameters

λ=ω(1−γ)^{[20]}However, always-defect/always-defect is

λ=o(1)notan approximate logit equilibrium as long asThis happens for the following reason. Tit-for-tat is almost as good a response to always-defect as always-defect: it only loses O(1) utility because of round 0. Therefore, the soft-optimal response to always-defect is a mixture of at least always-defect

andtit-for-tat. Moreover, the soft-optimal response to a mixture of always-defect and tit-for-tat places little probability on always-defect, because tit-for-tat does much better.As a result, in this asymptotic regime logit equilibrium entails a probability approaching 1 of cooperating forever. More generally, I propose the following conjecture: in any repeated (or even stochastic?) game in which we constrain the strategies to be finite state machines (with/without a uniform bound on the number of states?), while allowing a sufficient number of states, any logit equilibrium converges to Pareto efficiency in the asymptotic regime above.

This seems like a potential method of obtaining superrational behavior as a corollary of classical game theory in some situations.

## Direction 15: Infra-Bayesian Veil of Ignorance

This direction is motivated by Worlds 2.4.2/3.

As a warm-up, consider a repeated symmetric game between two agents who are copies of each other, either without randomization or with both using the same random generator. In this setting, each agent learns that the other parrots its actions precisely. Therefore, even a simple learning algorithm will converge to the symmetric Pareto efficient outcome.

Direction 15.1: IBRL with exact clonesWhat if the game is not symmetric, or there is randomization which breaks the symmetry? If we assume all the agents are exact clones (i.e. they execute the exact same code / follow the exact same policy), each agent can regard this as a Newcombian environment (the clones are implemented by "Omega"). And, we know that Newcombian environments can be represented as infra-Bayesian laws. We will call it the

clone lawin this case.Notice that, even though exact clones must have the same utility function, this doesn't rule out asymmetric games. Indeed, the utility functions are the same

in terms of the particular agent's actions and observations, but this is consistent with each agent identifying their own role ("which clone/player am I") and receiving payoffs accordingly. As a simple example, we can imagine the payoff being a part of the observation channel.Notice also that the clone law is

nota convex combination of laws corresponding to the different roles the agent can play. Rather, the clone law requires Murphy to predict the behavior of the agent in each possible rolein advanceof the random role selection. This way pseudocausality is ensured (i.e. Murphy can't "cheat" by only predicting the behavior of one role correctly.)Moreover, two clone laws which are identical except for a different probability distribution over roles typically cannot coexist in the same learnable hypothesis class. This is because there is no way to infer the distribution over roles from observations. Instead, we should think of this distribution as determined by the prior. Indeed, mixing two clone laws is equivalent to mixing the role distributions in the γ→1 limit, as long as the role distributions have full support. Hence, we can imagine all clone laws for the same game inside the prior effectively aggregating into a single clone law with the average role distribution.

The optimal policy for the clone law is maximizing the linear combination of utilities of different roles according to the prior distribution over roles. Hence, the prior determines which point on the Pareto frontier is selected, and can be regarded as defining a "notion of fairness".

One issue with clone laws is that they cannot be represented as e.g. infra-POMDPs with a finite number of states (the state has to encode Murphy's prediction i.e. the entire relevant part of the policy). This leads to a concrete research problem:

find natural hypotheses classes containing clone laws that admit efficient learning algorithms.Direction 15.2: IBRL with inexact clonesWhat if we want to allow agents with different utility functions? One trick to achieve this is postulate that the agent receives a description of its utility function upon waking up (e.g. in the form of a program, if it's a metacognitive agent). This way, the policy has to account for all admissible utility functions. It is possible that we can apply similar methods without this trick, using instead the observation that most utility functions can appear as

instrumentalgoalsin some situations (e.g. using the formalization of compositional polytope-MDPs).What if the agents have different priors (in particular over utility functions but also in general)? Now they are not exact clones anymore. However, it is conceivable that their policies are still

similarin some sense that can be used to represent the setting as Newcombian. For example, we can equip the space of utility functions with some metric. We can then define a metric onpoliciesvia the Hausdorff distance between theirgraphs.As a toy model, we can consider a fixed game in normal form, with the agent having some prior over:

We can then try to prove that, under some assumptions, priors that are "close" (according to some divergence) lead to policies that are close. Moreover, when agents with close priors are facing each other (and know their mutual distances are small), they should play close to Pareto efficiency. It would be fascinating if we could recover something like Yudkowky's proposal.

In particular, consider a game with a

uniquePareto efficient outcome. Assume that all admissible payoff matrices are monotonically increasing transformations of each other. Then, the agents will always play Pareto efficiently, because doing so unconditionally guarantees that all opponents will do the same thing, for any finite distance from opponents. (In particular, the actual distance is 0).## Direction 16: Hidden Rewards

This direction attempts to address Problem 3 (value ontology).

In order to define a utility function, we need some ontology Ω to specify its domain. One natural way to represent an ontology is using an infra-POMDP. That is, we have the state set S0, the initial infradistribution σ0∈□S0, the observation mapping ω0:S0→O and the transition infrakernel T0:S0×A→□S0. Here, □S0 stands for credal sets over S0, i.e. non-empty closed convex subsets of ΔS0 (we can also allow more general infradistributions). The reason we're using an infra-POMDP instead of a POMDP is, we don't want Ω to fully specify the environment.

To specify an environment over Ω (in ontology Ω), we consider a

maximal refinementof the infra-POMDP (we call this ahidden environment^{[21]}). That is, a state set S, an initial distribution σ∈ΔS, an observation mapping ω:S→O, a transition kernel T:S×A→ΔS and a translation mapping λ:S→S0 s.t. the following conditions hold:If we want to do IBRL rather than Bayesian RL, we can consider general (non-maximal) refinements, i.e. refinements that are infra-POMDPs rather than POMDPs.

To specify a reward function over Ω ("hidden reward function"), we consider a function r:S0×A→R. More generally, we can allow bounded functions r:(S0×A)∗→R. Given a reward function, the corresponding utility function is defined as usual i.e. as a time-discount sum of rewards.

An interesting special case of a hidden environment is a UMDP equipped with a translation mapping. We will call it a

relativeUMPD (RUMDP). Assuming a hidden reward function of the form r:S0×A→R, it admits efficiently computing the optimal policy.Given hidden environments/rewards instead of observable (i.e. ordinary) environments/rewards, it is straightforward to define analogues to concepts from RL theory such as prior, Bayes-optimality, regret, learnability etc.

## Direction 16.1: Agency with partial monitoring

Learnability is more difficult to establish with hidden reward functions than with observable reward functions, since ruling out traps is insufficient (see Example 1 here). For stateless environments, we already have the closely related theory of partial monitoring, which provides a comprehensive classification of possible regret bounds

^{[22]}. Therefore, we can try building an extension of this theory to our RL-like setting.## Direction 16.2: Specifying semi-instrumental reward functions

A related setting in which learnability is established is instrumental reward functions (and very likely we can extend that to semi-instrumental reward functions as well). Its drawback is that (as opposed to hidden rewards) it doesn't come with a simple natural way to specify reward function. A formal relationship between the two frameworks could allow us enjoying both worlds.

Given any hidden reward function r, we can ask whether there exists a semi-instrumental reward function R s.t. for any hidden environment μ and history h∈(S0×A)∗ compatible with μ

R(h,I(μ∣h))=r(h)Here, I(μ∣h) stands for the instrumental state corresponding to μ∣h, and we abuse notation on the left hand side by implicitly projecting h to (O×A)∗ using ω0.

The problem, then, is to characterize hidden reward functions with this property. Also, we can try to characterize Ω for which all hidden reward functions have this property.

## Direction 16.3: Undogmatic Ontologies

Every hidden environment defines an observable environment. However, for an arbitrary Ω, not every observable environment can be obtained this way. In other words, Ω constraints what the agent can expect to see. Arguably, this is undesirable: why would the real world agree with an ontology which is just an arbitrary/subjective feature of the agent's axiology?

Hence, it is natural to consider

undogmaticontologies: Ω that admit all observable environments. However, it seems tricky to find pairs of Ω and a hidden reward function r that simultaneously satisfy all of the following conditions:As an existence proof, we can consider the ontology whose states are instrumental states (S0=ISω) and initialization is completely uncertain (σ0=⊤S0), and r corresponding to any instrumental reward function. It would be interesting to find a good general characterization.

~~In particular, I don't even know whether~~S0~~can be finite.~~[EDIT:Here is a way to construct many examples, including finite ones.]## Direction 16.3: Bayes-optimal planning for communicating RUMDPs

We know that approximating Bayes-optimal planning is intractable for MDPs with traps. However, maybe for communicating RUMDPs it is tractable even when they form an unlearnable class. This seems like an interesting question to investigate.

## Direction 17: Algorithmic Descriptive Agency Measure (ADAM)

This direction attempts to address Problem 5 (descriptive agent theory).

ADAM is a formal way of quantifying the extent to which some arbitrary policy π:(O×A)∗×O→A is "agentic".

Fix some UTM M0.

We will consider computable utility functions of the form U:(O×A)ω→[0,1]. Notice that such a function is automatically continuous. Since it is computable, we can talk about its Kolmogorov complexity K(U) (w.r.t. M0).

Also, for any UTM M we denote ζM the corresponding Solomonoff (semi-)environment (prior). We can talk about K(M): the Kolmogorov complexity of M w.r.t. M0. Also, given U as above we can consider the

jointKolmogorov complexity K(U,M).We can now tentatively define ADAM by

g(π):=supU,M(minπ′:EζMπ′[U]≥EζMπ[U]K(π′)−K(U,M))Here the maximum is over all computable utility functions U and UTMs M.

Intuitively, we are looking for a way to interpret π as maximizing U w.r.t. prior ζM s.t., on the one hand, π does a good job at maximizing (enforced by the first term), and on the other hand, this interpretation is not contrived (enforced by the second term).

Notice that g inherits all the other problems of AIXI. Hence, we will ultimately need to modify it according to the respective solutions (e.g. use frugal universal priors and their associated complexity measure instead of Solomonoff priors and Kolmogorov complexity).

g has a few easy to check nice properties:

However, at present I don't know much else about its properties. This leaves a lot of questions to explore:

## Direction 17.1: Comparing ADAM variants

Here are some superficially similar definitions:

g′(π):=supU,M(log1Prπ′∼ξ[EζMπ′[U]≥EζMπ[U]]−K(U,M))g′′(π):=logEU,M∼ψ[1Prπ′∼ξ[EζMπ′[U]≥EζMπ[U]]]Here, ξ is a Solomonoff prior over policies and ψ is a Solomonoff prior over utility functions and UTMs

^{[23]}.At present, I slightly prefer g over g′ and g′′ because for the latter two, I don't know property 3 above (I know that e.g. g′(πn)≥n but not g′(πn)≈K(πn)).

It is easy to see that

g(π)≳g′(π)≲g′′(π)Open problem:Are g′,g′′ the same as g up to O(1)?## Direction 17.2: ADAM of random policies

Deterministically generating an object of high Kolmogorov complexity is hard: for any program p that outputs a bitstring x we have K(x)≤|p|, so generating x with very high K(x) requires a very long program. Since g(π)≲K(π), this means that deterministically generating an agentic policy is also hard. But,

randomlygenerating an object of high Kolomogorov complexity is easy: most bitstrings x of length k have K(x)≈k. On the other hand, I expect most policies to be unagentic.

Er∼μ0[g(ϖ(r))]≲K(ϖ)Conjecture:For any computable ϖ:{0,1}ω×(O×A)∗×O→A, we haveHere, ϖ is computable in the sense that it's a Turing machine that receives the first argument on a separate input tape, and μ0∈Δ{0,1}ω is the IID fair coin.

## Direction 17.3: ADAM Hierarchy Theorem

It would be interesting to understand how dense the possible values of g are. That is, find a function f:N→N, as slow growing as we can, s.t. for every n∈N there is a policy π for which

n≤g(π)≤f(n)## Direction 17.4: ADAM for finite-state policies

Consider policies that can be implemented by a deterministic finite state machine. That is, such a machine has a finite state set S, an initial state s0∈S and a transition function T:S×O→S×A. Suppose that |S|=n. Clearly, the policy π implemented by the machine has

g(π)≲K(π)=O(nlogn)Open problem:Is this bound on g(π) in terms of n close to tight? Are there infinitely many n∈N s.t. for some n-state policy π it holds that g(π)=Ω(n)?## Direction 17.5: Inferring the utility function

Given an agentic policy, can we infer its utility function and prior? For simplicity, let's start with the case when the prior is known

g0(π):=supU(minπ′:Eζπ′[U]≥Eζπ[U]K(π′)−K(U))^{[24]}. For this purpose, we can consider the "semidescriptive" agency measureHere, ζ:=ζM0.

The obvious candidates for the utility function associated with π are all U that attain the supremum on the right hand side, or at least come close to attaining the supremum. Notice that the contrived utility function Uconπ which just rewards behavior according to π will typically fail this criterion: K(Uconπ)≈K(π) and hence the expression inside the maximum is ≲0 when U=Uconπ.

As opposed to g, g0 is not approximately UTM invariant, strictly speaking. However, it is still approximately invariant w.r.t. changing the UTM used for defining K(π′) and K(U) while holding ζ fixed. That is, we imagine the Kolmogorov complexities defined w.r.t. some UTM M1 and ζ w.r.t. a different UTM M0, and we have approximate invariance w.r.t. changing M1.

This raises the question of whether the inferred utility function is at least somewhat stable w.r.t. changing M1. Intuitively, it is unlikely to be the case for arbitrary π: there is no canonical way to ascribe a utility function to a rock. However, we can expect it to be the case for agentic π, i.e. in the limit g0(π)≫0.

To simply matters even further, let's focus on the case g0(π)=+∞. Moreover, let's assume that there is exists a computable U s.t. π is the associated AIXI

π=argmaxπ′Eζπ′[U]^{[25]}, i.e.We probably still don't have full uniqueness since small changes in rewards that happen a bounded number of times can be insufficient to affect the optimal policy. However, the

asymptoticrewards might be unique. Formally, I propose the following uniqueness conjecture.

u(y)=αv(y)+βDefinition:Given a topological space X and bounded function u,v:X→R, u and v are said to belocally equivalent utility functionswhen for any x∈X there is V an open neighborhood of x, α>0 and β∈R s.t. for any y∈VConjecture:Let π∗ be a policy and U1,2 computable utility functions s.t. π∗ is AIXI for both U1 and U2. Then U1 and U2 are locally equivalent in the sense of the product topology on (O×A)ω.[

EDIT:The Conjecture as stated is false, but other versions might be true, see discussion here.]## Direction 18: Infra-Bayesian Physicalism

This research direction attempts to address Problem 4.

Infra-Bayesian Physicalism (IBP) is a framework that builds on metacognitve agents, introducing a way to talk about hypotheses, priors and counterfactuals that does away with cartesian privilege.

IBP comes with an agent-independent ontology for values, which provides another possible answer to Problem 3. There might be a natural way to translate hidden rewards to the IBP framework (see original article), so these two answers are not mutually exclusive. However, the price tag is a bizarre constraint on values which we call the "monotonicity principle", for which I only have wildly speculative explanations at present. Hopefully, further research can shed some light on the question.

IBP also has an interesting relationship with ADAM: it seems that in that framework it's impossible to define an "ideal" agent, but we have to be content with defining an agency measure. At the very least, "ideal agent" is a less natural object in that context.

See the original article for concrete directions for further research.

## Physicalist Superimitation

## Motivation

Physicalist Superimitation (PSI) is a (currently informal) approach to creating a rigorous alignment protocol, i.e a formal abstract model of a (hopefully feasible) AGI design that will be provably aligned under some (hopefully realistic) assumptions.

PSI is not themain motivation for LTA. The case for creating a mathematical theory of intelligent agents, both in general and even in particular via LTA, is much more robust than the case for PSI specifically. Moreover, I expect the theory produced by LTA to be applicable to analyzing many different alignment protocols (e.g. IDA, debate or AQRL), not just to PSI.

However, there are several reasons why discussing PSI here seems important:

^{[26]}: the AI is optimizing directly towards goals in the world, without e.g. routing through prediction or through human understanding of plans, and also without any constraints on the AI's knowledge, methods or magnitude of impact.## Superimitation

The first component of PSI is

superimitation: an agent (henceforth: the "imitator") that receives the policy of another agent (henceforth: the "original"), and produces behavior which pursues the same goals butsignificantlybetter. (Later, we will apply the framework in a way in which the role of "policy" is played by something broader than externally visible behavior: the required "behaviorist" assumptions about values are a lot weaker than they might seem.) This is different from mainstream research in inverse reinforcement learning where the goal is usually producing behavior which is merely equally good, or better only by the virtue of being less noisy. For now, we will put aside the question ofhowto obtain the original policy (but, see Agent Detection below).While PSI heavily leans on IBP, basic superimitation can be formulated and studied in a cartesian setting as well. In fact, starting the research from a cartesian setting seems like the reasonable approach. More concretely, the starting point can be ADAM and in particular Direction 17.5.

However, the successful completion of Direction 17.5 is would still fall short of a model of superimitation. Indeed, presumably the utility function can only be inferred (more or less) perfectly in the limit g≫0, but in this limit the original is already optimal and cannot be improved upon. And, for any finite g, the inaccuracy of the inference might negate any improved optimization. Therefore we need to add to our model some mechanism by which the imitator gains an

advantageover the original.One advantage mechanism is allowing the imitator more actions and/or observations. Examples of operationalizations:

pulling backthe original utility function U:(O×A)ω→R using f and g. (One problem with this is: the AI having e.g. the apparent experience of a person enjoying themself is very different from the person having that experience.)The parenthetical issues related to the domain of the utility function can hopefully be addressed given a solution of Problem 3. But, even putting these aside, this mechanisms seems somewhat disappointing: we want the AI to be superhuman not only thanks to better input/output channels, but also thanks to some deeper notion of "cognitive power". In other words, the resulting system might be very uncompetitive.

An improved advantage mechanism is suggested by the framework of metacognitive agents. The imitator can have superior computing hardware which allows it run computations faster than the original (via the internal interface), or even run computations that the original cannot run

^{[27]}. It has some formal similarity to the first mechanism, since the computing hardware can be regarded as a type of input/output channel, but seems like a better model of "cognitive power".## Applicable to humans

I often hear objections along the lines of "but humans don't have utility functions / are not rational / are not coherent / are not agents". I think that that reasoning is confused.

I already touched on this topic in the section Nonproblem above. But, in the context of alignment, there is a different important argument: alignment is fundamentally a

normativeproblem. Whenever we talk about whether an AI is "aligned" or "good" or "safe" we're using value-laden concepts that only makes sense within amodel(approximation) in which we actually possess (more or less) well-defined values. Whenever we discuss what AI design we shouldchoose, we're working on a problem that only makes sense within amodelin which we are rational agents that make choices according to our values. Let's call this model "the anthroposinepic^{[28]}view".This is not to say that the anthroposinepic view is a perfect description of reality. The map is not the territory, all models are wrong - but some are useful. For example, "the center of mass of the sun" is also not a well-defined concept: it's not clear which particles are part of the sun, in QFT particle positions are inherently fuzzy within the Compton wavelength, and in quantum gravity

anynotion of position is probably approximate at best. Nevertheless there are many contexts in astrophysics in which speaking of the center of mass of the sun is perfectly sensible. Similarly, there are contexts in which the anthroposinepic view is perfectly sensible: in particular normative questions, since they assume this view from the onset!Another objection of similar flavor is: what if humans are actually made of multiple coexisting agents (see e.g. Sotala), or have dynamically inconsistent preferences (i.e. a utility function that varies over time, such as hyperbolic discounting)? My answer is: maybe it means that we will have to extend the framework to deal with those possibilities. But it's also possible that it's unnecessary or that it works out "automatically": multiple agents (coexisting or distributed along the subjective timeline) behave effectively as a single agent (thanks to superrationality). My methodology it to always start with the simplest model (in this case, humans as unitary agents) and only go beyond it when:

Most importantly, I don't think this debate can be definitely resolved purely by philosophical arguments. Instead, we should study the emerging theory, check whether it satisfies intuitive desiderata, cross-reference it with cognitive science, and see what it says about the edge cases in which philosophical confusion arises. If the theory is correct, it will ultimately serve as a source of intuition which will lay all philosophical confusion to rest. And if it isn't, well, then we will see where it breaks and grow wiser from that too.

## Agent Detection

Superimitation assumes we can access the original's policy. In the context of an alignment protocol, this assumption has major issues:

PSI addresses this by combining IBP and ADAM. IBP provides us with a rigorous notion of "which computations is the universe running". ADAM provides us with a rigorous way to decide which sequences of computations are agents. Putting the two together should allow specifying an "agent detector": a mathematical operator that transforms any given hypotheses about the universe into the collections of agents that inhabit this universe.

This can solve the problems above modulo the question of

whichagent should we superimitate (for this, see User Identification below). That said, I don't know how the solution would cache out in practice for humans. It would be fascinating to combine this theory with brain science when the former is much more mature. Speculatively, I suspect that a metacognitive version of ADAM would interpret some of human inner mental life as input/output through the internal interface.Also, I think that there might be some formal relation between this "physicalist agent detector" and some version of Garrabrant's Cartesian Frames, but I won't try to articulate it here.

## User Identification

We don't want the AI to superimitate any and all agents. Some of those agents might be descendants of the AI. Some of those agents might be other AIs, aliens and animals. We don't necessarily want the AI to superimitate humans who lived in the past either.

For now, let's make the simplifying assumption that there is a single human we want the AI to imitate: the "user". Generalizing the method to groups of humans is left as an exercise to the reader

^{[29]}.It seems that IBP has a natural notion of causality. Since IBP is a computationalist framework, this notion talks about causal relationships between

instantiated computations. Specifically, the results of some computations control the instantiation or input^{[30]}of other computations. Since an agent is a type of computation, this gives us a notion of causality between agents: the action Alice takes in a particular Alice-counterfactual can affect the observation Bob makes in a particular Bob-counterfactual.We can use causality to design a protocol for user identification. For example, imagine that the AI wakes up in a room with the user. The user can observe all of the AI's outputs (e.g. characters on a terminal) and the AI can observe many of the user's outputs (e.g. through a camera pointed at the user). This creates a short and tight causal loop between the user and the AI during the AI's early history that doesn't exist between the AI and any other agent

^{[31]}.Moreover, we can use causality to specify the point on the user's subjective timeline at which the AI begins to influence them. The AI is designed to superimitate only the user's

past: this way there is no perverse incentive to modify the user. Since agents in this framework areprograms, this still allows access to the user's entire policy. (Although we still need a to disambiguate between equivalent problems, probably based on description complexity.)## Full Specification

We can now put all the pieces of PSI together:

At least, this is the simplest version. The above is probably not the best way to aggregate preferences across hypothesis, since, for example, it doesn't facilitate acausal trade. Here are additional steps that can fix it

Notice that all this is just an (informal) specification of the function we're optimizing, it's

notthe algorithm used for optimization. Similarly to how asymptotic Bayes-optimality can be achieved (for certain priors, e.g. small MDPs) by a feasible learning algorithm, without actually tracking every possible hypothesis and calculating the true Bayesian update, this specification can (hopefully) be approximated by some feasible algorithm the details of which we currently don't know (and which we will only begin to describe after substantial additional progress on the foundational part of LTA).## Alignment Guarantee

## Formal Alignment Criterion

How to formally define alignment? An appealing approach is, formalizing the following criterion:

Criterion A:An AI is (approximately) aligned when it (approximately) optimizes the expectation of the user's utility functionUusrwith respect to the user's beliefsΘusr.Indeed, if I am the user, and I am choosing which AI to run, and I am an agent with a utility function Uusr and beliefs Θusr, clearly it is rational for me (given my subjective epistemic vantage point) to choose an AI satisfying Criterion A.

This might seem confusing at first: shouldn't I want the AI to have more accurate beliefs than the beliefs I hold myself? But, this isn't a contradiction. If we imagine the AI as having access to more

new evidencethan the user, then starting from the same beliefs it arrives at a better outcome. This is closely related to the "advantage mechanisms" we discussed in the section on superimitation.We will call the part of Criterion A about Uusr

axiological alignment. It is more or less the same as "outer alignment". We will call the part of criterion A about Θusrepistemic alignment. It is closely related to (and implies) "inner alignment". The last point might be confusing so let's dwell on it a little.Inner misalignment is usually defined as the formation of malign mesa-optimizers: models produced by the learning algorithm which are in themselves malign agents (see Hubinger et al). In learning theory, possible models correspond to

hypothesesin the hypothesis class of algorithm. If the AI's starting beliefs are sufficiently different from Θusr (it is epistemically misaligned), it might converge to a hypothesis that the user would not have endorsed. Such a hypothesis may lead to actions that are catastrophic from the user's point of view: for example, because the hypothesis encodes a malign agent. Two (not mutually exclusive) examples:simulation hypothesis.neverhas side effects, it might end up running a simulation of a malign agent which will escape the "box". (This is a non-Cartesian daemon.)Finally, notice that Criterion A is an approach to formalizing

alignment, not formalizing corrigibility. Corrigibility seems to me a not especially natural concept, which is henceforth especially difficult to formalize and therefore also especially difficult to implement. For this reason, I am relatively pessimistic about approaches based on corrigibility (which PSI is not). That said, some weak versions of corrigibility might be feasible, e.g. the Hippocractic principle.In the following subsections, I will give a rough outline of why I think that it might be possible to prove a formal alignment guarantee about PSI, along the lines of Criterion A.

## Axiological alignment of PSI

The axiological alignment of PSI relies on the ultimate success of the ADAM research direction (in conjunction with the necessary parts of the rest of the foundational programme). However, the fact PSI relies on physicalism makes the following problem especially acute:

should PSI use cartesian-ADAM (i.e. model the user as a cartesian agent) or physicalist-ADAM (i.e. model the user as a physicalist agent)?Currently I see the following possibilities:

someways this can happen, they all require biting bullets. Alternatively, it might be possible to extend or overhaul the IBP framework, in a way that gets rid of monotonicity.transcartesianagent: i.e. based on some framework for agents which assign cartesian privilege to adifferentagent (in this case the user). I can see a sketch of how this would be formalized (using similar mathematical building blocks as IBP), but this is very early stage at the moment.maa...aaybethe difference is not significant, since PSI doesn't have strong incentives to instantiate those computations.## Epistemic alignment of PSI

The case for the epistemic alignment of PSI relies on the following (informal) conjecture:

Physicalist Agreement Conjecture (PhyAC):Two IBP agents in the same universe that start with similar priors, and share most of their evidence, will converge to similar posteriors^{[32]}.Notice that this is

notthe case forcartesianagents. This is because cartesian agents withsyntacticallysimilar priors have very different priorssemantically: each prior refers to the respective agent's actions and observation. Each agent assumes cartesian privilege foritselfbut not for the other agent.A strong demonstration for how it fails with cartesian agents is through simulation hypotheses: if Alice is much more influential than Bob, then Alice will assign a simulation hypothesis much more credence than Bob. Even if both Alice and Bob end up believing simulation hypotheses, they will often be

differentsimulation hypothesis: each will postulate the kind of simulators that would have incentives to simulate that person's respective subjective viewpoint.On the other hand, for IBP agents this argument doesn't work anymore. For example, suppose that universe A has simulators with incentives to simulate Alice's viewpoint. Since Alice observes Bob fairly closely (by assumption of PhyAC), simulating Alice's viewpoint requires also running a simulation of Bob. Moreover, since Bob is an IBP agent, and IBP is computationalist, this implies that Bob expects to experience this simulation (created for Alice's sake) as well. Hence, both of them assign similar credence to being in universe A.

In the case of PSI, a possible objection is, maybe the simulators only need a low fidelity simulation of the user to fool the AI, which the user wouldn't identify as themself. This is unclear (since this low-fidelity simulation has to be good enough to be indistinguishable for the AI from the null hypothesis that we're not in a simulation). However, if this

isa concern, we can imagine a version of the agent detector that would also accept low fidelity simulations.If the user is an IBP agent, PhyAC can be applicable to prove epistemic alignment of PSI. Is the user an IBP agent? This is unclear, but it's essentially the same debate as in the previous subsection.

Another concern is, even assuming that the user is an IBP agent, how similar is its prior to whatever prior we put into PSI? Presumably, the difference should be comparable to using Solomonoff induction with different UTMs, but hopefully not

toodifferent (i.e. the associated bisimulation is not too complex). As long as we are in a learnable setting, this doesn't matter. In reality, there are traps, so itdoesmatter: but perhaps not critically. Alternatively, we might need to modify the specification of PSI to make stronger use of the inferred user prior in order to close this gap.## A Plan

Assuming PSI

isthe correct approach to aligning AI, how would we get from here to a practical implementation? I imagine the plan roughly as follows:Informal model:The model isinspiredby mathematical ideas but is not fully rigorous yet. This where we are now. It gradually acquires more details and precision, until we are ready to formalize at least a part of it.Toy model:The model is rigorous but makes simplifying assumptions that are completely unrealistic. We study the toy model's formal properties (including alignment), while gradually moving to more realistic assumptions. One axis along which this evolution can happen is computational resource constraints. For example, it can be the following progression:(Semi-)Realistic model:The assumptions are sufficiently realistic that we could conceivably implement the model at least in some utopian scenario (e.g. if all capability research stopped and we had 100 years to bootstrap the AI). We continue moving to more realistic assumptions, while also gaining semiformal understanding about which properties of the model are crucial to alignment and which aren't.## Summary

In this article, I tried to convey the following points:

There is a lot of work and perhaps not that much time. I hope that this article will inspire more researchers to joint the effort. See you all in 2028

^{[33]}!^{^}listed alphabetically by last name

^{^}By "foundational part" I mean the theory of intelligent agents qua intelligent agents, as opposed to the "applied part" where we use this theory to study alignment per se.

^{^}Sometimes we are content with "any except for a set of prior probability 0", such as when we only bound the Bayesian regret in a Bayesian online/bandit/reinforcement learning setting.

^{^}See e.g. Lattimore and Szepesvari for an introduction to regret bounds, in particular section 38 talks about reinforcement learning.

^{^}Here, we assume that the first observation comes before of the first action, in contrast to my usual convention, because this time it's more convenient.

^{^}The diameter is the maximal expected time to reach one state from another state. The bound has to scale with τ since as τ→∞, a trap can develop. Alternative parameterizations are also possible, such as using mixing time or bias span. The latter might be advantageous since bounding the diameter also caps the number of states at O(2τ), whereas bounding bias span allows an infinite number of states. On the other hand, the bias span depends on the reward function.

^{^}Alternatively, we can consider geometric time discount γ, in which case T is replaced by 11−γ.

^{^}The existence of a computationally feasible optimal policy is usually a much stronger condition than the computational feasibility of simulation: for example finding an approximately optimal policy for a POMDP is PSPACE-hard while simulating it in P. Of course there are degenerate cases in which the optimal policy is easy even though simulation is hard.

^{^}See e.g. Shalev-Shwartz and Ben-David.

^{^}The name comes from Kalai and Lehrer.

^{^}I haven't properly studied the prior work relating automata to category theory (which definitely exists) so I make no strong claims to originality here.

^{^}The paper uses a theorem stated without proof, for which the citation given is "Lempp, Miller, Ng and Turetsky, 2010, unpublished, private communication". However, Lattimore kindly provided me with this unpulished communication upon request, and, to the best of my judgement, the proof is valid.

^{^}Infra-Bayesian laws were originally called "belief functions" in the infra-Bayesianism sequenece.

^{^}I haven't done a sufficiently thorough literature survey, so there might already be such results.

^{^}It's not just a law that depends on a point in the support of Θ because there is no disintegration theorem for infradistributions. See this.

^{^}While running

mostprograms doesn't have irreversible harmful side-effects, but it probably doesn't hold forallprograms: for example we can imagine code that uses some hardware exploit. In particular, this is related to what I previous called non-cartesian daemons. It should be possible to get guarantees that take this into account by e.g. somewhat randomizing the precise code we run each time (it might be related to quantilization).^{^}See e.g. Kearns and Vazirani chapter 8, Higuera 2004 and Higuera 2010.

^{^}As a consequence, this section carries an especially high risk of missing some pertinent known results that I'm unfamiliar with.

^{^}See e.g. Vereshchagin and Shen.

^{^}The convention I'm using here doesn't normalize the sum of payoffs over time, i.e.

Utot=∞∑i=0γiUiand the probability of playing a strategy is proportional to exp(λE[Utot]).

^{^}It would be more natural to say that the hidden environment is an equivalence class of such objects, where two are considered equivalent if it is not possible to distinguish them in terms of actions and states in S0. However, the distinction is not critical for the exposition.

^{^}See e.g. chapter 37 in Lattimore and Szepesvari.

^{^}Or maybe just TMs, that would still allow making sense of ζM as the lower semicomputable environment computed by M.

^{^}If we want to infer the utility function and the prior simultaneously, we probably need to take into account that some ways to redefine both of them together produce an equivalent decision problem. Hence, it makes sense to focus on recovering their

product. Formally, any environment ζ corresponds to a measure ζpol on (O×A)ω s.t. for any π, the distribution induced by ζ and π is equal to ζpolχπ, where χπ is the function that's 1 on sequences compatible with π and 0 otherwise. We can then define theutility measureto be the product Uζpol.^{^}This doesn't seem to automatically follow from the assumption g0(π)=+∞. However, it might be possible to show that there is always at least an uncomputable utility function that can be computed with a slow-growing amount of advice.

^{^}There might still be alignment overhead. Specifically, (i) the unusual structure of the loss function, (ii) prior shaping to deal with potential traps and (iii) prior shaping to deal with potential side-effects of computations, might incur overhead. We need to return to this question when we have better models. Maybe the overhead is small, or maybe we can prove that significant overhead is unavoidable.

^{^}Technically, the mathematical analysis will probably need to focus on the asymptotic in which the agents can run all computations, but the original and the imitator can converge to this limit with different speeds.

^{^}from Greek: anthropos (human) + sinepis (coherent, according to ChatGPT)

^{^}Speculatively, we might not even have to do that: the AI itself can facilitate superrational cooperation between all agents who could affect the creation of this or similar AI.

^{^}Input is a special case of instantiation: saying that program p runs with input x is equivalent to saying that the computation p(x) is instantiated.

^{^}We would have to be careful to set the ADAM threshold correctly and make sure the room indeed doesn't contain other agents. Otherwise, we are risking the AI version of "The Fly".

^{^}In some "behaviorist" sense. IBP doesn't really have an update rule, and therefore there is no well-defined "posterior" in general.

^{^}I hope...