Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is joint work by Vanessa Kosoy and Alexander "Diffractor" Appel. For the proofs, see 1 and 2.

TLDR: We present a new formal decision theory that realizes naturalized induction. Our agents reason in terms of infra-Bayesian hypotheses, the domain of which is the cartesian product of computations and physical states, where the ontology of "physical states" may vary from one hypothesis to another. The key mathematical building block is the "bridge transform", which, given such a hypothesis, extends its domain to "physically manifest facts about computations". Roughly speaking, the bridge transform determines which computations are executed by the physical universe. In particular, this allows "locating the agent in the universe" by determining on which inputs its own source is executed.

0. Background

The "standard model" of ideal agency is Bayesian reinforcement learning, and more specifically, AIXI. We challenged this model before due to its problems with non-realizability, suggesting infra-Bayesianism as an alternative. Both formalisms assume the "cartesian cybernetic framework", in which (i) the universe is crisply divided into "agent" and "environment" and (ii) the two parts interact solely via the agent producing actions which influence the environment and the environment producing observations for the agent. This is already somewhat objectionable on the grounds that this division is not a clearly well-defined property of the physical universe. Moreover, once we examine the structure of the hypothesis such an agent is expected to learn (at least naively), we run into some concrete problems.

The modern understanding of the universe is that no observer plays a privileged role[1]. Therefore, the laws of physics are insufficient to provide a cartesian description of the universe, and must, to this end, be supplemented with "bridge rules" that specify the agent's location inside the universe. That is, these bridge rules need to translate the fundamental degrees of freedom of a physical theory (e.g. quantum wavefunction) to the agent's observations (e.g. values of pixels on a camera), and translate the agent's actions (e.g. signal to robot manipulators) in the other direction[2]. The cost of this is considerable growth in the description complexity of the hypothesis.

A possible retort is something like "how many bits can it really take to pick out the computer within the universe state and describe how to translate universe state to observations? Sure, it might be a considerable chunk of data. But, since Solomonoff induction only needs to make a kilobyte worth of predictive mistakes to learn to predict the inputs as well as any thousand-bit predictor, it should learn this sort of stuff pretty fast." This objection is addressed further later on in this post. A first counter-objection is that, for practical algorithms, this can be a bigger obstacle, especially when sample complexity scales superlinearly with description complexity. For example, the Russo-Van Roy regret bound for Thompson sampling in multi-armed bandits has the time horizon necessary to achieve a particular regret bound scale as the square of the entropy, and the computational complexity cost can also go way up as the description complexity rises, because you need to evaluate more hypotheses more times to test them. For reinforcement learning it's even worse, as in the case of a game where an agent must enter an n-digit password: if it gets it right, it gets reward, and if it gets it wrong, it gets no reward and can try again. The learning time for this game scales exponentially with n, far superlinearly with description complexity.
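To make the scaling gap concrete, here is a toy calculation (ours, purely illustrative; the alphabet size b is an assumption): the password's description complexity grows linearly in n while the worst-case trial-and-error learning time grows exponentially.

```python
import math

# Toy illustration (not part of the formal results): an n-digit password over an
# alphabet of size b has description complexity ~ n*log2(b) bits, but a learner
# that is only rewarded for the exact password may need ~ b**n attempts.
def description_bits(n, b=10):
    return n * math.log2(b)

def worst_case_attempts(n, b=10):
    return b ** n

for n in (2, 4, 8):
    print(f"n={n}: {description_bits(n):.1f} bits, up to {worst_case_attempts(n):,} attempts")
```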

Another assumption of AIXI is the simplicity prior, and we expect some form of this assumption to persist in computable and infra-Bayesian analogues. This reflects the intuitive idea that we expect the world to follow simple laws (or at least contain simple patterns). However, from the cartesian perspective, the "world" (i.e. the environment) is, prima facie, not simple at all (because of bridge rules)! Admittedly, the increase in complexity from the bridge rule is low compared to the cost of specifying the universe state, but once the agent learns the transition rules and bridge rule for the universe it's in, learning the state of the universe in addition doesn't seem to yield any particular unforeseen metaphysical difficulties. Further, the description complexity cost of the bridge rule seems likely to be above the description complexity of the laws of physics.

Hence, there is some disconnect between the motivation for using a simplicity prior and its implementation in a cartesian framework.

Moreover, if the true hypothesis is highly complex, it implies that the sample complexity of learning it is very high. And, as previously mentioned, the sample complexity issues are worse in practice than Solomonoff suggests. This should make us suspect that such a learning process is not properly exploiting Occam's razor. Intuitively, such an agent a priori considers it equally plausible to discover itself to be a robot or to discover itself to be a random clump of dust in outer space, since it's about as hard to specify a bridge rule interface between the computer and observations as it is to specify a bridge rule interface between the dust clump and observations, and it needs a lot of data to resolve all those possibilities for how its observations connect to a world. Also, though Solomonoff is extremely effective at slicing through the vast field of junk hypotheses that do not describe the thing being predicted, once it's whittled things down to a small core of hypotheses that do predict things fairly accurately, the data to further distinguish between them may be fairly slow in coming. If there's a simple predictor of events occurring in the world but it's running a malign computation, then you don't have the luxury of 500 bits of complexity wiggle room (to quickly knock out this hypothesis), because that's a factor of 2^500 difference in probability. Doing worst-case hypothesis testing as in KWIK learning would require a very aggressive threshold indeed, and mispredictions can be rare but important.

Furthermore, some events are simple to describe from the subjective (cartesian) point of view, but complex to describe from an objective (physical) point of view. (For example, all the pixels of the camera becoming black.) Modifying a hypothesis by positing exceptional behavior following a simple event only increases its complexity by the difficulty of specifying the event and what occurs afterwards, which could be quite low. Hence, AIXI-like agents would have high uncertainty about the consequences of observationally simple events. On the other hand, from an objective perspective such uncertainty seems irrational. (Throwing a towel on the camera should not break physics.) In other words, cartesian reasoning is biased to privilege the observer.

Yet another failure of cartesian agents is the inability to reason about origin theories. When we learn that a particular theory explains our own existence (e.g. evolution), this serves as a mark in favor of the theory. We can then exploit this theory to make useful predictions or plans (e.g. anticipate that using lots of antibiotics will cause bacteria to develop resistance). However, for a cartesian agent the question of origins is meaningless. Such an agent perceives its own existence as axiomatic, hence there is nothing to explain.

Finally, cartesian agents are especially vulnerable to acausal attacks. Suppose we deploy a superintelligent Cartesian AI called Kappa. And, imagine a superintelligent agent Mu that inhabits some purely hypothetical universe. If Mu is motivated to affect our own (real) universe, it can run simulations of Kappa's environment. Kappa, who doesn't know a priori whether it exists in our universe or in Mu's universe, will have to seriously consider the hypothesis it is inside such a simulation. And, Mu will deploy the simulation in such a manner as to make the simulation hypothesis much simpler, thanks to simpler bridge rules. This will cause Kappa to become overwhelmingly confident that it is in a simulation. If this is achieved, Mu can cause the simulation to diverge from our reality at a strategically chosen point such that Kappa is induced to take an irreversible action in Mu's favor (effectively a treacherous turn). Of course, this requires Kappa to predict Mu's motivations in some detail. This is possible if Kappa develops a good enough understanding of metacosmology.

An upcoming post by Diffractor will discuss acausal attacks in more detail.

In the following sections, we will develop a "physicalist" formalism that entirely replaces the cartesian framework, curing the abovementioned ills, though we have not yet attained the stage of proving improved regret bounds with it, just getting the basic mathematical properties of it nailed down. As an additional benefit, it allows naturally incorporating utility functions that depend on unobservables, thereby avoiding the problem of "ontological crises". At the same time, it seems to impose some odd constraints on the utility function. We discuss the possible interpretations of this.

1. Formalism

Notation

It will be more convenient to use ultradistributions rather than infradistributions. This is a purely notational choice: the decision theory is unaffected, since we are going to apply these ultradistributions to loss functions rather than utility functions. As support for this claim, Diffractor originally wrote down most of the proofs in infradistribution form, and then changing their form for this post was rather straightforward to do. In addition, for the sake of simplicity, we will stick to finite sets: more general spaces will be treated in a future article. So far, we're up to countable products of finite sets.

We denote . Given a finite set , a contribution on is s.t. (it's best to regard it as a measure on ). The space of contributions is denoted . Given and , we denote . There is a natural partial order on contributions: when . Naturally, any distribution is in particular a contribution, so . A homogenous ultracontribution (HUC) on is non-empty closed convex which is downward closed w.r.t. the partial order on . The space of HUCs on is denoted . A homogenous ultradistribution (HUD) on is a HUC s.t. . The space of HUDs on is denoted . Given and , we denote .
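As a concrete (and heavily simplified) illustration of these objects, here is a toy Python encoding made up for intuition only: a contribution is a nonnegative measure of total mass at most 1, the order is pointwise domination, and a HUC presented by finitely many generating contributions evaluates a function by maximizing the ordinary expectation over the generators.

```python
# Toy encoding for intuition only; the real objects are closed convex
# (generally infinite) sets of contributions.
def is_contribution(mu, tol=1e-9):
    # A contribution on a finite set: nonnegative masses with total mass <= 1.
    return all(v >= -tol for v in mu.values()) and sum(mu.values()) <= 1 + tol

def expectation(mu, f):
    # E_mu[f] = sum_x mu(x) * f(x)
    return sum(p * f(x) for x, p in mu.items())

def leq(mu1, mu2):
    # The partial order on contributions: pointwise domination (equivalently,
    # E_{mu1}[f] <= E_{mu2}[f] for every nonnegative f).
    keys = set(mu1) | set(mu2)
    return all(mu1.get(x, 0.0) <= mu2.get(x, 0.0) for x in keys)

def huc_expectation(generators, f):
    # A HUC presented by finitely many generators evaluates f by maximizing;
    # a linear functional attains its maximum over the convex hull at a generator.
    return max(expectation(mu, f) for mu in generators)

mu = {"a": 0.3, "b": 0.5}   # a contribution (total mass 0.8)
nu = {"a": 0.5, "b": 0.5}   # a distribution
assert is_contribution(mu) and leq(mu, nu)
print(huc_expectation([mu, nu], lambda x: 1.0 if x == "a" else 0.0))  # 0.5
```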

Given , is the pushforward by :

Given , is the pushforward by :

is the projection mapping and . We slightly abuse notation by omitting the asterisk in pushforwards by these.

Given and , is the semidirect product:

We will also use the notation for the same HUC with and flipped. And, for , is the semidirect product of with the constant ultrakernel whose value is [3].
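Continuing the toy encoding above, pushforward and semidirect product look as follows for plain contributions and kernels (a sketch only; lifting to full HUCs would additionally involve convex hulls and downward closures):

```python
from collections import defaultdict

def pushforward(mu, g):
    # Pushforward of a contribution on X along g: X -> Y.
    out = defaultdict(float)
    for x, p in mu.items():
        out[g(x)] += p
    return dict(out)

def semidirect(mu, kernel):
    # Semidirect product of a contribution on X with a kernel mapping each x
    # to a contribution on Y: a joint contribution on X x Y.
    return {(x, y): p * q
            for x, p in mu.items()
            for y, q in kernel(x).items()}

mu = {"rain": 0.4, "sun": 0.6}
kernel = lambda x: {"wet": 1.0} if x == "rain" else {"wet": 0.1, "dry": 0.9}
joint = semidirect(mu, kernel)
print(pushforward(joint, lambda xy: xy[1]))  # the marginal on Y
```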

For more discussion of HUDs, see previous article, where we used the equivalent concept of "cohomogenous infradistribution".

Notation Reference

If you got lost somewhere and wanted to scroll back to see some definition, or see how the dual form of this with infradistributions works, that's what this section is for.

is a contribution, a measure with total mass at most 1. The dual concept is an a-measure with .

is a HUC (homogenous ultracontribution) or HUD (homogenous ultradistribution), a closed convex downwards closed set of contributions. The dual concepts are cohomogenous inframeasure and cohomogenous infradistribution, respectively.

are the spaces of contributions, homogenous ultracontributions, and homogenous ultradistributions respectively.

are the expectations of functions , defined in the usual way. For , it's just the expectation of a function w.r.t. a measure, and for , it's , to perfectly parallel a-measures evaluating functions by just taking expectation, and inframeasures evaluating functions via .

is the ordering on contributions/HUC's/HUD's, which is the function ordering, where iff for all , , and similar for the HUC's. Inframeasures are equipped with the opposite ordering.

is the pushforward along the function . This is a standard probability theory concept which generalizes to all the infra and ultra stuff.

is the pushforward of along the ultrakernel . This is just the generalization to infra and ultra stuff of the ability to push a probability distribution on through a probabilistic function to get a probability distribution on .

is the semidirect product of and , an element of . This is the generalization of the ability to take a probability distribution on , and a probabilistic kernel , and get a joint distribution on .

are the set of actions and observations.

is the time horizon.

is the space of programs, is the space of outputs.

is the space of functions from programs to the results they output. An element of can be thought of as the state of the "computational universe", for it specifies what all the programs do.

is the space of histories, action-observation sequences that can end at any point, ending with an observation.

is the space of destinies, action-observation sequences that are as long as possible, going up to the time horizon.

is a relation on that says whether a computational universe is consistent with a given destiny. A very important note is that this is not the same thing as "the policy is consistent with the destiny" (the policy's actions are the same as what the destiny advises). This is saying something more like "if the destiny has an observation that the computer spat out result 1 when run on computation A, then that is only consistent with mathematical universes which have computation A outputting result 1". Except we don't want to commit to the exact implementation details of it, so we're leaving it undefined besides just "it's a relation".

is the space of "physics outcomes", it can freely vary depending on the hypothesis. It's not a particular fixed space.

is the variable typically used for physicalist hypotheses, elements of . Your uncertainty over the joint distribution over the computational universe and the physical universe.

is the code of the program-which-is-the-agent. So, would be the computation that runs what the agent does on history , and returns its resulting action.

is the subset of the space consisting of pairs s.t. . The can be thought of as the mathematical universe, and can be thought of as the set of mathematical universes that are observationally indistinguishable from it.

is the indicator function that's 1 on the set , and 0 everywhere else. It can be multiplied by a measure, to get the restriction of a measure to a particular set.

is the bridge transform of , defined in definition 1.1, an ultracontribution over the space .

is the set of instantiated histories, relative to mathematical universe , and set of observationally indistinguishable universes . It's the set of histories where all the "math universes" in agree on how the agent's source code reacts to all the prefixes of , and where the history can be extended to some destiny that's consistent with math universe . I.e., for a history to be in here, all the prefixes have to be instantiated, and it must be consistent with the selected math universe.

Setting

As in the cartesian framework, we fix a finite set of actions and a finite set of observations. We assume everything happens within a fixed finite[4] time horizon. We assume that our agent has access to a computer[5] on which it can execute some finite[4:1] set of programs with outputs in a finite alphabet . Let be the set of "possible computational universes"[6]. We denote (the set of histories) and (the set of "destinies").

To abstract over the details of how the computer is operated, we assume a relation whose semantics is, (our notation for ) if and only if destiny is a priori consistent with computational universe . For example, suppose some implies a command to execute program and if follows , it implies observing the computer return output for . Then, if contains the substring and , it must be the case that .
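Since the relation is deliberately left abstract, here is one possible toy implementation (the encoding of computer queries inside a destiny is entirely our own assumption): a destiny is consistent with a computational universe exactly when every (program, output) record appearing in the destiny matches what that universe says the program outputs.

```python
# One possible toy implementation of the consistency relation; the way computer
# interactions are recorded inside a destiny is an assumption made for this sketch.
def consistent(universe, destiny):
    # universe: dict mapping program names to their outputs.
    # destiny: sequence of steps; a step ("ran", program, output) records that
    # the computer was observed returning `output` for `program`.
    for step in destiny:
        if step[0] == "ran":
            _, program, observed_output = step
            if universe.get(program) != observed_output:
                return False
    return True

universe = {"A": 1, "B": 0}
print(consistent(universe, [("act", "query A"), ("ran", "A", 1)]))  # True
print(consistent(universe, [("ran", "A", 0)]))                      # False
```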

A physicalist hypothesis is a pair, where is a finite[4:2] set representing the physical states of the universe and represents a joint belief about computations and physics. By slight abuse of notation we will refer to such as a physicalist hypothesis, understanding to be implicitly specified. Our agent will have a prior over such hypotheses, ranging over different .

Two questions stand out to us at this point. The first is, what is the domain over which our loss function should be defined? The second is, how do we define the counterfactuals corresponding to different policies ? The answers to both questions turn out to require the same mathematical building block.

For the first question, we might be tempted to identify as our domain. However, prima facie this doesn't make sense, since is hypothesis-dependent. This is the ontological crisis problem: we expect the agent's values to be defined within a certain ontology which is not the best ontology for formulating the laws of the physical universe. For example, a paperclip maximizer might benefit from modeling the universe in terms of quantum fields rather than paperclips. In principle, we can circumvent this problem by requiring our to be equipped with a mapping where is the "axiological" ontology. However, this is essentially a bridge rule, carrying with it all the problems of bridge rules: acausal attack is performed by the adversarial hypothesis imprinting the "axiological rendering" of the target universe on the microscopic degrees of freedom of the source universe in order to have a low-complexity function; the analogue of the towel-on-camera issue is that, once you've already coded up your uncertainty over math and physics, along with how to translate from physics to the ontology of value, it doesn't take too much extra complexity to tie "axiologically-simple" results (the analogue of low-complexity observations) to physics-simple results (the analogue of a low-complexity change in what happens), like "if all the paperclips are red, the fine-structure constant doubles in value".

Instead, we will take a computationalist stance: value is not a property of physical states or processes, but of the computations realized by physical processes. For example, if our agent is "selfish" in the sense that rewards/losses are associated purely with subjective histories, the relevant computation is the agent's own source code. Notice that, for the program-which-is-the-agent , histories are input. Hence, given loss function we can associate the loss with the computation . Admittedly, there is an implicit assumption that the agent has access to its own source code, but modal decision theory made the same assumption. For another example, if our agent is a diamond maximizer, then the relevant computations are simulations of the physics used to define "diamonds". A more concrete analogue of this is worked out in detail in section 3, regarding Conway's Game of Life.

For the second question, we might be tempted to follow updateless decision theory: counterfactuals correspond to conditioning on . Remember, G is the code of the agent. However, this is not "fair" since it requires the agent to be "responsible" for copies of itself instantiated with fake memories. Such a setting admits no learning-theoretic guarantees, since learning requires trusting your own memory. (Moreover, the agent also has to be able to trust the computer.) Therefore our counterfactuals should only impose when is a "real memory", which we again interpret through computationalism: is real if and only if is physically realized for any prefix of .

Both of our answers require a formalization of the notion "assuming hypothesis , this computation is physically realized". More precisely, we should allow for computations to be realized with certain probabilities, and more generally allow for ultradistributions over which computations are realized. We will now accomplish this formalization.

Bridge transform

Given any set, "supp" stands for "support" and the characteristic function of the set is the function that is 1 on it and 0 everywhere else.

Definition 1.1: Let be finite sets and . The bridge transform of is s.t. if and only if

  • for any , .

We will use the notation when is obvious from the context.

Notice that we are multiplying by , not pushforwarding.

The variable of the bridge transform denotes the "facts about computations realized by physics". In particular, if this takes the form for some and , then we may say that the computations are "realized" and the computations are "not realized". More generally, talking only about which computations are realized is imprecise since might involve "partial realization" and/or "entanglement" between computations (i.e. not be of the form above).

Intuitively, this definition expresses that the "computational universe" can be freely modified as long as the "facts known by physics" are preserved. However, that isn't what originally motivated the definition. The bulk of its justification comes from its pleasing mathematical properties, discussed in the next section.

A physicalist agent should be equipped with a prior over physicalist hypotheses. For simplicity, suppose it's a discrete Bayesian prior (it is straightforward to generalize beyond this): hypothesis is assigned probability and . Then, we can consider the total bridge transform of the prior. It can't be given by mixing the hypotheses together and applying the bridge transform, because every hypothesis has its own choice of , its own ontology, so you can't mix them before applying the bridge transform. You have to apply the bridge transform to each component first, forget about the choice of via projecting it out, and then mix them afterwards. This receives further support from Proposition 2.13 which takes an alternate possible way of defining the bridge transform for mixtures (extend all the hypotheses from to in the obvious way so you can mix them first) and shows that it produces the same result.

Definition 1.2:
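Structurally, the total bridge transform works as in the following sketch (ours; `bridge_transform`, `project_out_phi` and `mix` are treated as black boxes, since the point is only the order of operations):

```python
def total_bridge_transform(prior, bridge_transform, project_out_phi, mix):
    # prior: list of (weight, hypothesis) pairs, each hypothesis carrying its own
    # private physical state space. Mixing first would be ill-typed because those
    # state spaces differ, so: transform each hypothesis, project out its private
    # ontology, and only then take the weighted mixture.
    transformed = [(weight, project_out_phi(bridge_transform(hypothesis)))
                   for weight, hypothesis in prior]
    return mix(transformed)
```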

Evaluating policies

Given , is defined as . I.e., it's total uncertainty over which point in will be selected.

We need to assume that contains programs representing the agent itself. That is, there is some , and for each , . Pretty much, if you have large numbers of actions available, and a limited number of symbols in your language, actions (and the outputs of other programs with a rich space of outputs) can be represented by m-tuples of programs, like "what's the first bit of this action choice" and "what's the second bit of this action choice" and so on. So, is just how many bits you need, is the mapping from program outputs to the actual action, and is the m-tuple of programs which implements the computation "what does my source code do on input history ". The behavior of the agent in a particular mathematical universe is given by taking each program in , using the mathematical universe to figure out what each of the programs outputs, and then using to convert that bitstring to an action.
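A minimal sketch of that encoding (illustrative helper names are ours): with m bit-valued programs per history, a mathematical universe assigns each bit-program an output, and a fixed decoding map turns the resulting bit tuple into an action.

```python
import math

actions = ["left", "right", "up", "down"]
m = math.ceil(math.log2(len(actions)))  # number of bit-programs needed per history

def decode(bits):
    # The fixed map from program outputs (a bit tuple) to an actual action.
    return actions[int("".join(str(b) for b in bits), 2)]

def agent_action(universe, history):
    # universe maps each bit-program (keyed here by (history, bit_index)) to its
    # output; the agent's behavior on `history` is read off by evaluating all m
    # bit-programs in that universe and decoding.
    return decode([universe[(history, i)] for i in range(m)])

universe = {(("obs0",), 0): 1, (("obs0",), 1): 0}
print(agent_action(universe, ("obs0",)))  # "up"
```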

Definition 1.3: Given and , we define to be for given by .

We can now extract counterfactuals. Specifically, given any policy we define some (the set of stuff consistent with behaving according to , for a yet-to-be-defined notion of consistency) and define the counterfactual as . We could use the "naive" definition:

Definition 1.4: is the set of all s.t. for any , .

Per discussion above, it seems better to use a different definition. We will use the notation to mean " is a proper prefix of " and to mean " is a prefix of ".

Definition 1.5: Given , let be the set of all s.t. the following two conditions hold:

  1. For all and where and , .

  2. for some s.t. .

is the set of all s.t. for any , .

Here, condition 1 says that we only "take responsibility" for the action on a particular history when the history was actually observed (all preceding evaluations of the agent are realized computations). Condition 2 says that we only "take responsibility" when the computer is working correctly.

At this stage, Definition 1.5 should be regarded as tentative, since we only have one result so far that validates this definition, namely that the set of in only depends on what the policy does on possible inputs, instead of having the set depending on what the policy does on impossible inputs where one of the past memorized actions is not what the policy would actually do. We hope to rectify this in future articles.

Putting everything together: given a loss function , which depends on how the mathematical universe is, and which computations are realized or not-realized, the loss of policy is given by just applying the bridge transform to coax the hypotheses into the appropriate form, intersecting with the set of possibilities consistent with the agent behaving according to a policy, and evaluating the expectation of the loss function, as detailed below.

Evaluating agents

So far we regarded the agent's source code and the agent's policy as independent variables, and explained how to evaluate different policies given fixed . However, in reality the policy is determined by the source code. Therefore, it is desirable to have a way to evaluate different codes. We achieve this using an algorithmic information theory approach.

We also want to allow the loss function to depend on . That is, we postulate . Specifically, since can be thought of as the space of actions, is basically the space of policies. In section 3 we will see why in some detail, but for now think of the difference between "maximizing my own happiness" and "maximizing Alice's happiness": the first is defined relative to the agent (depends on ) whereas the second is absolute (doesn't depend on ). In particular, for a selfish agent that just cares about its own observations, its loss function must reference its own source code.

Definition 1.6: Denote the policy actually implemented by . Fix . The physicalist intelligence of relative to the baseline policy mixture , prior and loss function is defined by:

Notice that depends on in two ways: through the direct dependence of on and through .

In particular, it makes sense to choose and as simplicity priors.

There is no obvious way to define "physicalist AIXI", since we cannot have . For one thing, is not even defined for uncomputable agents. In principle we could define it for non-uniform agents, but then we get a fixpoint problem that doesn't obviously have a solution: finding a non-uniform agent s.t. . On the other hand, once we spell out the infinitary () version of the formalism, it should be possible to prove the existence of agents with arbitrarily high finite . That's because our agent can use quining to access its own source code , and then brute force search a policy with .

2. Properties of the bridge transform

In this section we will not need to assume is of the form : unless stated otherwise, it will be any finite set.

Sanity test

Proposition 2.1: For any , and , exists and satisfies . In particular, if then .

Downwards closure

Roughly speaking, the bridge transform tells us which computations are physically realized. But actually it only bounds it from one side: some computations are definitely realized but any computation might be realized. One explanation for why it must be so is: if you looked at the world in more detail, you might realize that there are small-scale, previously invisible, features of the world which depend on novel computations. There is a direct tension between bounding both sides (i.e. being able to say definitively that a computation isn't instantiated) and having the desirable property that learning more about the small-scale structure of the universe narrows down the uncertainty. To formalize this, we require the following definitions:

Definition 2.1: Let be a partially ordered set (poset). Then the induced partial order on is defined as follows. Given , if and only if for any monotonically non-decreasing function , .

This is also called the stochastic order (which is standard mathematical terminology). Intuitively, means that has its measure further up in the poset than does. To make that intuition formal, we can also characterize the induced order as follows:

Proposition 2.2: Let be a poset, . Then, if and only if there exists s.t.:

  • For all , : .

Proposition 2.3: Let be a poset, . Then, if and only if there exists s.t.:

  • For all , : .

Or, in words, you can always go from to by moving probability mass upwards from where it was, since is always supported on the set of points at-or-above .
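For intuition, on a small finite poset the induced (stochastic) order can be checked by brute force, using the standard equivalence that it suffices to compare the mass of every upward-closed set (a toy check, not part of the proofs):

```python
from itertools import combinations

def upward_closed_sets(elements, leq):
    # All upward-closed subsets of a small poset given by the relation leq(x, y).
    result = []
    for r in range(len(elements) + 1):
        for subset in combinations(elements, r):
            s = set(subset)
            if all(y in s for x in s for y in elements if leq(x, y)):
                result.append(s)
    return result

def stochastic_leq(theta1, theta2, elements, leq):
    # theta1 precedes theta2 in the induced order iff every upward-closed set
    # carries at least as much mass under theta2 (equivalently, expectations of
    # all nonnegative monotone functions are ordered).
    mass = lambda theta, s: sum(theta.get(x, 0.0) for x in s)
    return all(mass(theta1, u) <= mass(theta2, u) + 1e-9
               for u in upward_closed_sets(elements, leq))

# The chain 0 < 1 < 2: moving mass upward respects the order, moving it down does not.
elements, leq = [0, 1, 2], (lambda x, y: x <= y)
print(stochastic_leq({0: 0.5, 1: 0.5}, {1: 0.5, 2: 0.5}, elements, leq))  # True
print(stochastic_leq({2: 1.0}, {0: 1.0}, elements, leq))                  # False
```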

We can now state the formalization of only bounding one side of the bridge transform. Let be equipped with the following order. if and only if , and . Then:

Proposition 2.4: For any , and , is downwards closed w.r.t. the induced order on . That is, if and then .

Simple special case

Let's consider the special case where there's only one program, which can produce two possible outputs, a 0 and a 1. And these two outputs map to two different distributions over physics outcomes in . Intuitively, if the computation isn't realized/instantiated, the two distributions over physics outcomes should be identical, while if the computation is realized/instantiated, it should be possible to look at the physics results to figure out how the computation behaves. The two probability distributions may overlap some intermediate amount, in which case it should be possible to write the two probability distributions as a mixture between a probability distribution that behaves identically regardless of the program output (the "overlap" of the two distributions), and a pair of probability distributions corresponding to the two different program outputs which are disjoint. And the total variation distance () between the two probability distributions is connected to the size of the distribution overlap. Proposition 2.5 makes this formal.
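A small numerical sketch of that decomposition (illustrative only): the shared piece is the pointwise minimum of the two distributions, its mass equals 1 minus their total variation distance, and the two remainders are mutually disjoint.

```python
def overlap_decomposition(p, q):
    # Decompose two distributions on the same finite set into a common "overlap"
    # part (the pointwise minimum) plus two mutually disjoint remainders.
    keys = set(p) | set(q)
    common = {x: min(p.get(x, 0.0), q.get(x, 0.0)) for x in keys}
    rest_p = {x: p.get(x, 0.0) - common[x] for x in keys}
    rest_q = {x: q.get(x, 0.0) - common[x] for x in keys}
    tv = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys) / 2
    assert abs(sum(common.values()) - (1 - tv)) < 1e-9   # overlap mass = 1 - d_TV
    assert all(rest_p[x] == 0 or rest_q[x] == 0 for x in keys)  # disjoint remainders
    return common, rest_p, rest_q

p = {"x": 0.7, "y": 0.3}
q = {"x": 0.2, "y": 0.5, "z": 0.3}
common, _, _ = overlap_decomposition(p, q)
print(sum(common.values()))  # 0.5, i.e. 1 - d_TV(p, q)
```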

Proposition 2.5: Consider a finite set , , , and . Then, . Conversely, consider any . Then, there exist and s.t. for .

The bridge transform should replicate this same sort of analysis. We can interpret the case of "total uncertainty over math, but knowing how physics turns out conditional on knowing how math turns out" by for some . Taking the special case where for one program that can output two possible answers, it should return the same sort of result, where the two probability distributions can be decomposed into their common overlap, and non-overlapping pieces, and the overlapping chunk of probability measure should be allocated to the event "the computation is not instantiated", while the disjoint ones should be allocated to the event "the computation is instantiated". As it turns out, this does indeed happen, with the same total-variation-distance-based upper bound on the probability that the computation is unrealized (namely, )

Proposition 2.6: Consider any and . Denote (the event "program is unrealized"). Let . Then,

Bound in terms of support

If the physical state is , then it is intuitively obvious that the physically manifest facts about the computational universe must include the fact that . Formally:

Proposition 2.7: If is a poset and , then will denote downward closure of . For any , and if and , then . Moreover, define by . Then, (we slightly abuse notation by treating as a mapping that doesn't depend on the first argument, and also playing loose with the order of factors in the set on which our HUCs live).

Putting that into words, all the that are in the support of the bridge transform have being a subset of the support of restricted to . The bridge transform knows not to provide sets which are too large. Further, the bridge transform is smaller/has more information than the transform that just knows about what the support of is.

Idempotence

Imagine replacing the physical state space by the space of manifest facts . Intuitively, the latter should contain the same information about computations as the former. Therefore, if we apply the bridge transform again after making such a replacement, we should get the same thing. Given any set , we denote the identity mapping and the diagonal mapping.

Proposition 2.8: For any , and

Refinement

If we refine a hypothesis (i.e. make it more informative) in the ultradistribution sense, the bridge transform should also be more refined:

Proposition 2.9: For any , and , if then .

Another way of refining a hypothesis is refining the ontology, i.e. moving to a richer state space. For example, we can imagine one hypothesis that describes the world in terms of macroscopic objects and another hypothesis that describes the world in terms of atoms, which are otherwise perfectly consistent with each other. In this case, we also expect the bridge transform to become more refined. This desideratum is where the downwards closure comes from, because it's always possible to pass to a more detailed view of the world and new computations could manifest then. It's also worth noting that, if you reduce your uncertainty about math, or physics, or move to a more detailed state space, that means more computations are instantiated, and the loss goes down as a result.

Proposition 2.10: For any , , , and

In general, when we refine the ontology, the manifest facts can become strictly refined: the richer state "knows more" about computations than the poorer state. However, there is a special case when we don't expect this to happen, namely when the rich state depends on the computational universe only through the poor state. Roughly speaking, in this case once we have the poor state we can sample the rich state without evaluating any more programs. To formalize this, it is convenient to introduce the pullback of HUCs, which is effectively the probabilistic version of taking a preimage.

Definition 2.2: Let be finite sets and a mapping. We define the pullback[7] operator by

Proposition 2.11: Consider any , , , , and s.t. . Then,

In particular,

In other words, we can think of as the coarse-grained version, and as the fine-grained version, and as the coarse-graining function, and as the function mapping a coarse grained state to some uncertainty over what the corresponding fine-grained state is. is your starting uncertainty over computations and the coarse-grained state. Then, in order from most to least informative, you could apply after the bridge transform, apply before the bridge transform, or pull back along after the bridge transform (because the pullback is the most uninformative inverse, in a sense). But all this stuff only affects the information about the fine-grained state . If you were to forget about that, the data about computations and how they're instantiated is the same no matter whether you do the bridge transform with the coarse state, or try your best to fill in what the fine state is first, because the "fill in the fine state" function doesn't depend on your uncertainty about computations, so it doesn't contain any more information about how math is connected to physics than the original coarse-grained view had.

Mixing hypotheses

In general, we cannot expect the bridge transform to commute with taking probabilistic mixtures. For example, let , , , . Clearly either for or for , we expect the program to be realized (since the physical state encodes its output). On the other hand, within we have both the distribution "50% on 00, 50% on 01" and the distribution "50% on 10, 50% on 11". The physical state no longer necessarily "knows" the output, since the mixing "washes away" information. However, there's no reason why mixing would create information, so we can expect the mixture of bridge transforms to be a refinement of the bridge transform of the mixture.

Proposition 2.12: For any , , and

Now suppose that we are mixing hypotheses whose state spaces are disjoint. In this case, there can be no washing away since the state "remembers" which hypothesis it belongs to.

Proposition 2.13: Consider some , , , , and . Regard as subsets of , so that . Then,

In particular, if we consider the hypotheses within a prior as having disjoint state spaces because of using different ontologies, then the total bridge transform (Definition 1.2) is just the ordinary bridge transform, justifying that definition.

Conjunction

Given a hypothesis and a subset of , we can form a new hypothesis by performing conjunction with the subset. Intuitively, the same conjunction should apply to the manifest facts.

Proposition 2.14: Consider some , , and . Let be defined by . Then,

Moreover, let and be the natural injections. Then,

That first identity says that it doesn't matter whether you update on a fact about math first and apply the bridge transform later, or whether you do the bridge transform first and then "update" (actually it's slightly more complicated than that since you have to narrow down the sets ) the hypothesis afterwards. The second identity shows that, post conjunction with , taking the bridge transform with respect to is essentially the same as taking it with respect to . If a batch of math universes is certain not to occur, then dropping them entirely from the domain produces the same result.

As a corollary, we can eliminate dependent variables. That is, suppose some programs can be expressed as functions of other programs (using ). Intuitively, such programs can be ignored. Throwing in the dependent variables and doing the bridge transform on that augmented space of "math universes" produces the same results as trying to compute the dependent variables after the bridge transform is applied. So, if you know how to generate the dependent variables, you can just snip them off your hypothesis, and compute them as-needed at the end. Formally:

Proposition 2.15: Consider some , , , , . Let be given by . Then,

Factoring

When is of the form , we can interpret the same in two ways: treat as part of the computational universe, or treat it as part of the physical state. We can also "marginalize" over it altogether staying with a HUD on . The bridge transforms of these objects satisfy a simple relationship:

Proposition 2.16: Consider some , , and . Define by . Then,

Intuitively, the least informative result is given by completely ignoring the part of math about how a particular batch of computations turns out. A more informative result is given by treating as an aspect of physics. And the most informative result is given by treating as an aspect of math, and then pruning the resulting down to be over by plugging in the particular value.

Continuity

Proposition 2.17: is a continuous function of .

Lamentably, this only holds when all the involved sets are finite. However, a generalization of continuity, Scott-continuity, still appears to hold in the infinite case. Intuitively, the reason is that, if is a continuous space, like the interval , you could have a convergent sequence of hypotheses where the events "program outputs 0" and "program outputs 1" are believed to result in increasingly similar outputs in physics, and then in the limit, where they produce the exact same physics outcome, the two possible outputs of the program suddenly go from "clearly distinguishable by looking at physics" to "indistinguishable by looking at physics" and there's a sharp change in whether the program is instantiated. This is still a Scott-continuous operation, though.

3. Constructing loss functions

In section 1 we allowed the loss function to be arbitrary. However, in light of Proposition 2.4, we might as well require it to be monotonically non-decreasing. Indeed, we have the following:

Proposition 3.1: Let be a finite poset, and downward closed. Define by . Observe that is always non-decreasing. Then, .

This essentially means that downwards-closed HUC's are entirely pinned down by their expectation value on monotone functions, and all their other expectation values are determined by the expectation value of the "nearest monotone function", the .

This requirement on (we will call it the "monotonicity principle") has strange consequences that we will return to below.

The informal discussion in section 1 hinted at how the computationalist stance allows describing preferences via a loss function of the type . We will now proceed to make it more formal.

Selfish agents

A selfish agent naturally comes with a cartesian loss function

Definition 3.1: For any , the set of experienced histories is the set of all (see Definition 1.5) s.t. for any , . The physicalized loss function is

Notice that implicitly depends on through .

Intuitively, if is a large set, there are different mathematical universes that result in the agent's policy being different, but which are empirically indistinguishable (i.e., they're in the same set ), so the agent's policy isn't actually being computed on those histories where the contents of disagree on what the agent does. The agent takes the loss of the best history that occurs, but if there's uncertainty over what happens afterwards/later computations aren't instantiated, it assumes the worst-case extension of that history.
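A heavily hedged sketch of that evaluation (the set of experienced histories and the enumeration of continuations are treated as black-box inputs; the best-copy/worst-continuation reading follows the description above):

```python
def physicalized_loss(experienced_histories, continuations, cartesian_loss):
    # experienced_histories: the histories that are actually instantiated
    # (Definition 3.1's set, taken as given here).
    # continuations(h): all destinies extending the history h.
    # Each instantiated history is scored by its worst-case continuation, and the
    # agent then receives the loss of the best-off instantiated history (copy).
    def worst_case(history):
        return max(cartesian_loss(destiny) for destiny in continuations(history))
    return min(worst_case(history) for history in experienced_histories)
```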

This definition has the following desirable properties:

  • If , and , then

  • is monotone. If the set of indistinguishable mathematical universes gets larger, fewer histories are instantiated, so the loss is larger because of the initial minimization.

  • Fix . Suppose there is s.t. . Then . If the set of indistinguishable mathematical universes all agree on what the destiny of the agent is and disagree elsewhere, then the loss is just the loss of that destiny.

We can think of inconsistent histories within as associated with different copies of the agent. If is s.t. for any , , then represents the history of a copy which reached the end of its life. If there is only one copy and its lifespan is maximal, then produces the cartesian loss of this copy. If the lifespan is less than maximal, produces the loss of the worst-case continuation of the history which was experienced. In other words, death (i.e., the future behavior of the agent's policy not being a computation that has any effect on the state of the universe) is always the worst-case outcome. If there are multiple copies, the best-off copy determines the loss.

Definition 3.1 is somewhat arbitrary, but the monotonicity principle is not. Hence, for a selfish agent (i.e. if only depends on through ) death must be the worst-case outcome (for fixed ). Note that this weighs against getting the agent to shut down if it's selfish. And creating additional copies must always be beneficial. For non-selfish agents, similar observations hold if we replace "death" by "destruction of everything that matters" and "copies of the agent" by "copies of the (important sector of the) world".

This is an odd state of affairs. Intuitively, it seems coherent to have preferences that contradict the monotonicity principle. Arguably, humans have such preferences, since we can imagine worlds that we would describe as net negative. At present, we're not sure how to think about it, but here are some possibilities, roughly in order of plausibility:

  1. Perfect physicalists must obey the monotonicity principle, but humans are not perfect physicalists. Indeed, why would they be? The ancestral environment didn't require us to be physicalists, and it took us many centuries to stop thinking of human minds as "metaphysically privileged". This might mean that we need a theory of "cartesian agents in a physicalist universe". For example, perhaps from a cartesian perspective, a physicalist hypothesis needs to be augmented by some "theory of epiphenomena" which predicts e.g. which of several copies "I" will experience.

  2. Something is simply wrong about the formalism. One cheap way to escape monotonicity is by taking the hull of maximal points of the bridge transform. However, this spoils compatibility with refinements (Propositions 2.9 and 2.10). We expect this to spoil the ability to learn in some way.

  3. Death is inherently an irreversible event. Every death is the agent's first death, therefore the agent must always have uncertainty about the outcome. Given a non-dogmatic prior, there will be some probability on outcomes that are not actually death. These potential "afterlives" will cause the agent to rank death as better than the absolute worst-case. So, "non-survivalist" preferences can be reinterpreted as a prior biased towards not believing in death. Similar reasoning applies to non-selfish agents.

  4. We should just bite the bullet and prescriptively endorse the odd consequences of the monotonicity principle.

In any case, we believe that continuing the investigation of monotonic physicalist agents is a promising path forward. If a natural extension or modification of the theory exists which admits non-monotonic agents, we might discover it along the way. If we don't, at least we will have a better lower bound for the desiderata such a theory should satisfy.

Unobservable states

What if the loss depends on an unobservable state from some state space ? In order to encode states, let's assume some and . That is, just as the action of the agent can be thought of as a tuple of programs returning the bits of the computation of the agent's action, a state can be thought of as a tuple of programs returning the bits of the computation of what the next state is. Let be a finite set ("nature's actions"). Just as observations, in the selfish case, occur based on what computations are instantiated, nature's choice of action depends on what computations are instantiated. Denote , . Let be the "ontological"[8] transition rule program, the observation rule and the loss function.

Definition 3.2: For any , the set of admissible histories is the set of all s.t. the following conditions hold:

  1. If , and are s.t. then .

  2. If , , , and are s.t. and then (the notation is analogous to Definition 1.3).

Pretty much, condition 1 says that the visible parts of the history are consistent with how math behaves in the same sense as we usually enforce with . Condition 2 says that the history should have the observations arising from the states in the way that demands. And condition 3 says that the history should have the states arising from your action and nature's action and the previous state in the way the transition program dictates (in that batch of mathematical universes).

The physicalized loss function is

This arises in a way nearly identical to the loss function from Definition 3.1; it's precisely analogous.

That is, for a history to be admissible we require not only the agent to be realized on all relevant inputs but also the transition rule.

We can also define a version with in which the agent is not an explicit part of the ontology. In this case, the agent's influence is solely through the "entanglement" of with . I.e., the agent's beliefs over math have entanglements between how the agent's policy turns out and how the physics transition program behaves, s.t. when conditioning on different policies, the computations corresponding to different histories become instantiated.

Cellular automata

What if our agent maximizes the number of gliders in the game of life? More generally, consider a cellular automaton with cell state space and dimension . For any , let be the set of cells whose states are necessary to compute the state of the origin cell at timestep . Let be the program which computes the time evolution of the cellular automaton. I.e., it's mapping the state of a chunk of space to the computation which determines the value of a cell steps in the future. Denote . This is the space of all possible histories, specifying everything that occurs from the start of time. Let be the loss function (the coordinate is time).

Definition 3.3: For any , the set of admissible histories is the set of all pairs where is finite and s.t. the following conditions hold:

  1. If , and then .

  2. If , and then .

Basically, the is a chunk of spacetime, and condition 1 says that said chunk must be closed under taking the past lightcone, and condition 2 says the state of that spacetime chunk must conform with the behavior of the transition computation.
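A toy sketch of condition 1 for a one-dimensional, radius-1 automaton (the geometry here is an assumption chosen for simplicity): the cells needed to compute a given cell t steps later form its radius-t neighborhood, and a spacetime chunk is admissible only if it is closed under taking these past neighborhoods.

```python
def past_cells(cell, steps, radius=1):
    # Cells whose states `steps` time steps earlier are needed to compute `cell`,
    # for a 1D automaton whose update rule looks `radius` cells to each side.
    return {cell + dx for dx in range(-radius * steps, radius * steps + 1)}

def past_closed(chunk, radius=1):
    # chunk: a set of (time, cell) pairs. Condition 1: for every point of the
    # chunk, its entire past lightcone (at every earlier time) is also in the chunk.
    return all((s, y) in chunk
               for (t, x) in chunk
               for s in range(t)
               for y in past_cells(x, t - s, radius))

cone = {(t, x) for t in range(3) for x in range(-(2 - t), (2 - t) + 1)}
print(past_closed(cone))      # True: a backward lightcone is past-closed
print(past_closed({(2, 0)}))  # False: the point's past is missing
```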

The physicalized loss function is defined very much as it used to be, except that for what used to be "past histories" now it has states of past-closed chunks of spacetime.

Notice that, since the dynamics of the cellular automaton is fixed, the only variables "nature" can control are the initial conditions. However, it is straightforward to generalize this definition to cellular automata whose time evolution is underspecified (similarly to the we had before).

Diamond maximizer

What if our agent maximizes the amount of diamond (or some other physical property)? In order to define "diamond" we need a theory of physics. For example, we can start with macroscopic/classical physics and define diamond by its physical and/or chemical properties. Or, we can start with nonrelativistic quantum mechanics and define diamond in terms of its molecular structure. These definitions might be equivalent in our universe, but different in other hypothetical universes (this is similar to twin Earth). In any case, there is no single "right" choice, it is simply a matter of definition.

A theory of physics is mathematically quite similar to a cellular automaton. This theory will usually be incomplete, something that we can represent in infra-Bayesianism by Knightian uncertainty. So, the "cellular automaton" has underspecified time evolution. Using this parallel, the same method as in the previous subsection should allow physicalizing the loss function. Roughly speaking, a possible history is "real" when the computation simulating the time evolution of the theory of physics with the corresponding initial condition is realized. However, we are not sure what to do if the theory of physics is stochastic. We leave it as an open problem for now.

To summarize, you can just take a hypothesis about how computations are entangled with some theory of physics, some computations simulating the time evolution of the universe (in which the notion of "diamond" is actually well-defined letting the amount of diamond in a history be computed), and a utility function over these states. The bridge transform extracts which computations are relevant for the behavior of physics in the hypothesis (which may or may not include the diamond-relevant computations), and then you just check whether the computations for a particular diamond-containing history are instantiated, take the best-case one, and that gives you your loss.

Human preferences seem not substantially different from maximizing diamonds. Most of the things we value are defined in terms of people and social interactions. So, we can imagine a formal loss function defined in terms of some "theory of people" implicit in our brain.

4. Translating Cartesian to Physicalist

Physicalism allows us to do away with the cartesian boundary between the agent and its environment. But what if there is a cartesian boundary? An important sanity test is demonstrating that our theory can account for purely cartesian hypotheses as a special case.

Ordinary laws

We will say "law" to mean what was previously called "belief function". We start from examining a causal law . I.e., it maps a policy to an ultradistribution over destinies, in a way that looks like a closed convex set of environments. Let and so that becomes the set of policies (we identify the history with the computation given by what the agent's source does on input .). In particular, for now there is no computer, or the computer can only simulate the agent itself. Let . This is our set of "physics outcomes", complete action-observation sequences. Then, induces the physicalist hypothesis by . is what we called the "first-person static view" of the law. Total uncertainty over the choice of policy, interacting with the kernel which maps a policy to the resulting uncertainty over the environment.

We want to find the bridge transform of and evaluate the corresponding counterfactuals, in order to show that the resulting is the same as in the cartesian setting. In order to do this, we will use some new definitions.

Definition 4.1: Let be finite sets and a relation. A polycontribution on is s.t. for any :

is a polydistribution when for any

We denote the space of polycontributions by and the space of polydistributions by .

Here, stands for . Polycontributions and polydistributions are essentially a measure on a space, together with a family of subsets indexed by , s.t. restricting the measure to any particular subset yields a contribution or a distribution, respectively.

Definition 4.2: Let and be finite sets and a relation. A polycontribution kernel (PoCK) over is s.t. there is s.t. for any and

Or, in other words, a PoCK is the kernel that you get by taking a polycontribution and using the input point to decide which -indexed subset to restrict the measure to. Let be defined by . This is the "destiny is consistent with the policy" relation where the policy produces the actions that the destiny says it does, when placed in that situation. Let be a Bayesian causal law (an environment). Then, is a PoCK over ! Conversely, any which is a PoCK is an environment.

Essentially, for an environment, each choice of policy results in a different probability distribution over outcomes. However, these resulting probability distributions over outcomes are all coherent with each other where the different policies agree on what to do, so an alternate view of an environment is one where there's one big measure over destinies (where the measure on a particular destiny is the probability of getting that destiny if the policy plays along by picking the appropriate actions) and the initial choice of policy restricts to a particular subset of the space of destinies, yielding a probability distribution over destinies compatible with the given policy.
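A small sketch of that alternate view in a toy two-step setting (all encoding choices are ours): build one measure over destinies by multiplying the environment's observation probabilities along each destiny, then recover a particular policy's distribution by keeping only the destinies whose actions that policy would actually take.

```python
from itertools import product

def env_prob(history, observation):
    # The environment's P(observation | history); it does not depend on which
    # policy produced the past actions. A maximally boring choice for illustration.
    return 0.5

def destiny_measure(destinies):
    # One big measure over destinies: each destiny is weighted by the product of
    # the environment's observation probabilities along it ("the probability of
    # this destiny if the policy plays along").
    out = {}
    for destiny in destinies:
        p, history = 1.0, ()
        for (action, obs) in destiny:
            p *= env_prob(history, obs)
            history += ((action, obs),)
        out[destiny] = p
    return out

def restrict_to_policy(measure, policy):
    # Keep only destinies whose actions agree with the policy; this recovers the
    # usual distribution over histories induced by the (policy, environment) pair.
    def agrees(destiny):
        history = ()
        for (action, obs) in destiny:
            if policy(history) != action:
                return False
            history += ((action, obs),)
        return True
    return {d: p for d, p in measure.items() if agrees(d)}

destinies = [tuple(zip(acts, obss))
             for acts in product([0, 1], repeat=2)
             for obss in product([0, 1], repeat=2)]
measure = destiny_measure(destinies)
policy = lambda history: 0  # the policy that always takes action 0
print(sum(restrict_to_policy(measure, policy).values()))  # 1.0: a distribution
```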

Clearly, the next step is to generalize this beyond probability distributions.

Definition 4.3: Let and be finite sets and a relation. A polyultracontribution kernel (PUCK) over is s.t. there is a set of PoCKs s.t. for any , is the closure of the convex hull of .

In particular, is a PUCK over . Conversely, any which is a PUCK is a causal law.

Proposition 4.1: Consider some , , a relation and a PUCK over . Let . Then,

Here, . The first identity says that the inequality of Proposition 2.7 is saturated.

Specifically, as shown in the next proposition, for our case of cartesian causal laws, the computations realized according to are exactly the histories that occur.

Corollary 4.3: Suppose that for any and s.t. , it holds that . That is, the observations predicts to receive from the computer are consistent with the chosen policy. Let be a cartesian loss function and a policy. Then,

Notice that the assumption that is causal, combined with the assumption that is consistent with , implies that it's not possible to use the computer to simulate the agent on any history other than factual past histories. For pseudocausal the bridge transform no longer agrees with the cartesian setting: if a selfish agent is simulated in some counterfactual, then the computationalist stance implies that it can incur the loss for that counterfactual. This artifact should be avoidable for non-selfish agents, but we leave this for future articles.

Turing laws

Now let's examine a cartesian setting which involves a general computer. Without further assumptions, we again cannot expect the physicalist loss to agree with the cartesian, since the programs running on the computer might be entangled with the agent. If the agent runs simulations of itself then it should expect to experience those simulations (as opposed to the cartesian view). Therefore, we will need a "no entanglement" assumption.

Let for some (the computations that don't involve the agent) and is still . We want a notion of "joint belief about the environment and " in the cartesian setting, which generalizes causal laws. We will call it a "Turing law" or "Turing environment" in the Bayesian case.

For our particular application (though the theorems are more general), will be generally taken to be the space of computations that aren't about what you personally do, is the space of computations about what you personally do, and is the space of destinies/physics outcomes, if the reader wants to try their hand at interpreting some of the more abstract theorems.

Definition 4.4: Let , and be finite sets and a relation. A -polycontribution on is s.t.

is a -polydistribution on when it is both a -polycontribution on and a polydistribution on . Here, we regard as a relation between and .

We denote the space of -polycontributions by and the space of -polydistributions by .

Notice that a Z-polycontribution on is in particular a polycontribution on : .

Definition 4.5: Let , , be finite sets and a relation. A Z-polycontribution kernel (Z-PoCK) over is s.t. there is s.t. for any , and

In the context of the identification we'll be using where is the rest of math, is the parts of math about what you personally do, and is the space of physics outcomes/destinies, this definition is saying that the Z-PoCK is mapping a policy to the joint distribution over math and destinies you'd get from restricting the measure to "I'm guaranteed to take this policy, regardless of how the rest of math behaves".

A Turing environment can be formalized as a -PoCK over . Indeed, in this case Definition 4.4 essentially says that the environment can depend on a variable (Remember, is the rest of math), and also we have a belief about this variable that doesn't depend on the policy. Now, how to transform that into a physicalist hypothesis:

Proposition 4.2: Let , and be finite sets, a relation, a Z-PoCK over and . Then, there exist and s.t. for all , is a PoCK over , and for all , . Moreover, suppose that and are both as above[9]. Then,

Or, for a more concrete example, this is saying that it's possible to decompose the function mapping a policy to a distribution over the rest of math and the destiny into two parts. One part is just a probability distribution over the rest of math; the second part is a function mapping a policy and the behavior of the rest of math to a probability distribution over destinies. Further, this decomposition is essentially unique, in that for any two ways you do it, if you take the probability distribution over the rest of math, have it interact with some arbitrary way of mapping the rest of math to policies, and then have that interact with the way to get a distribution over destinies, that resulting joint distribution could also be produced by any alternate way of doing the decomposition.
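Here is a toy numerical check of this picture (the numbers, the two-element sets, and the variable names are all invented; "z" stands in for the rest of math): the joint belief is engineered so that its marginal over z is the same for every policy, and it then factors into a prior over z and a (policy, z)-indexed kernel over destinies whose semidirect product recovers the original.

```python
# Toy check of the Proposition 4.2-style decomposition under a "no entanglement"
# assumption: the marginal over Z (the rest of math) is policy-independent.
POLICIES = ["pi0", "pi1"]
ZS = ["z0", "z1"]
DESTINIES = ["d0", "d1"]

# For each policy, a joint distribution over (z, destiny), engineered so that
# the Z-marginal (0.6 on z0, 0.4 on z1) does not depend on the policy.
joint = {
    "pi0": {("z0", "d0"): 0.4, ("z0", "d1"): 0.2, ("z1", "d0"): 0.1, ("z1", "d1"): 0.3},
    "pi1": {("z0", "d0"): 0.1, ("z0", "d1"): 0.5, ("z1", "d0"): 0.3, ("z1", "d1"): 0.1},
}

def z_marginal(dist):
    return {z: sum(p for (zz, _), p in dist.items() if zz == z) for z in ZS}

# No-entanglement check: the belief about Z is the same for every policy.
m0 = z_marginal(joint["pi0"])
for pi in POLICIES:
    m = z_marginal(joint[pi])
    assert all(abs(m[z] - m0[z]) < 1e-9 for z in ZS)

prior_over_z = m0

def destiny_kernel(pi, z):
    # Conditional distribution over destinies given the policy and z.
    return {d: joint[pi][(z, d)] / prior_over_z[z] for d in DESTINIES}

# The semidirect product of the prior and the kernel recovers the joint belief,
# illustrating the (essential) uniqueness of the decomposition.
for pi in POLICIES:
    recomposed = {(z, d): prior_over_z[z] * destiny_kernel(pi, z)[d]
                  for z in ZS for d in DESTINIES}
    assert all(abs(recomposed[k] - joint[pi][k]) < 1e-9 for k in recomposed)
print("decomposition verified")
```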

In the setting of Proposition 4.2, we define by . The physicalist hypothesis corresponding to Turing environment turns out to be . The no-entanglement assumption is evident in how the and variables are "independent". Well, the actual choice of policy can depend on how the rest of math turns out, but the set of available choices of policy doesn't depend on how the rest of math turns out, and that's the sense in which there's independence.

To go from Turing environments to Turing laws, we will need the following generalization to sets of contributions.

Definition 4.6: Let , and be finite sets and a relation. A Z-polyultracontribution kernel (Z-PUCK) over is s.t. there is a set of Z-PoCKs s.t. for any , is the closure of the convex hull of .

A Turing law is a -PUCK over .

Definition 4.7: Let , and be finite sets, a relation, a Z-PUCK over and . Let be the set of Z-PoCKs s.t. for any , . Then, is the closure of the convex hull of .

As a sanity test for the sensibility of Definition 4.7, we have the following proposition. It demonstrates that instead of taking all compatible Z-PoCKs, we could take any subset whose convex hull is all of them. If that was not the case, it would undermine our justification for using convex sets (namely, that taking convex hull doesn't affect anything).

Proposition 4.3: Let , and be finite sets, a relation, Z-PoCKs over , and . Then, is also a Z-PoCK over , and

The physicalist hypothesis corresponding to Turing law is : total uncertainty over the available choices of policy, interacting with the correspondence between the policy, the rest of math, and the resulting destiny. Now we need to study its bridge transform; the following intermediate result, needed to proceed further, bounds the bridge transform between an upper and a lower bound:

Proposition 4.4: Let , and be finite sets, a relation and a Z-PUCK over . Denote . Define by , . Then,

Here, we slightly abuse notation by implicitly changing the order of factors in .

The application of this is that the resulting uncertainty-over-math/info about what computations are realized (i.e., the set) attained when you do the bridge transform is bounded between two possible results. One bound is perfect knowledge of what the rest of math besides your policy is (everything's realized), paired with exactly as much policy uncertainty as is compatible with the destiny of interest. The other bound is no knowledge of the rest of math besides your policy (possibly because the policy was chosen without looking at other computations, and the rest of reality doesn't depend on those computations either), combined with exactly as much policy uncertainty as is compatible with the destiny of interest.

Corollary 4.5: Suppose that for any , and s.t. , it holds that . That is, the observations predicts to receive from the computer are consistent with the chosen policy and 's beliefs about computations. Let be a cartesian loss function and a policy. Define by . Then,

And so, the Turing law case manages to add up to normality in the cartesian setting (with computer access) as well.

5. Discussion

Summary

Let's recap how our formalism solves the problems stated in section 0:

  • Physicalist hypotheses require no bridge rules, since they are formulated in terms of a physical state space (that can be chosen to be natural/impartial) rather than the agent's actions and observations. Therefore, they accrue no associated description complexity penalty, and should allow for a superior learning rate. Intuitively, a physicalist agent knows it is a manifestation of the program rather than just any object in the universe.

  • A physicalist agent doesn't develop uncertainty after subjectively-simple but objectively-complex events. It is easy to modify a cartesian hypothesis such that the camera going black leads to completely new behavior. However, it would require an extremely contrived (complex) modification of physics, since the camera is not a simple object and has no special fundamental significance. Basically, the simplest correspondence between computations and reality that instantiates the computations of the agent seeing black forever afterward incurs a description complexity penalty of roughly the complexity of the agent's own source code.

  • A physicalist agent doesn't regard its existence as axiomatic. A given physicalist hypothesis might contain 0 copies of the agent, 1 copy or multiple copies (more generally, some ultradistribution over the number of copies). Therefore, hypotheses that predict the agent's existence with relatively high probability gain an advantage (i.e. influence the choice of policy more strongly), as they intuitively should. At the same time, hypotheses that predict too many copies of the agent to have predictive power are typically[10] also suppressed, since policy selection doesn't affect their loss by much (whatever you do, everything happens anyway).

  • Acausal attacks against physicalist agents are still possible, but the lack of bridge rule penalty means the malign hypotheses are not overwhelmingly likely compared to the true hypothesis. This should allow defending using algorithms of "consensus" type.

What about Solomonoff's universality?

For selfish agents the physicalist framework can be reinterpreted as a cartesian framework with a peculiar prior. Given that the Solomonoff prior is universal (dominates every computable prior), does it mean AIXI is roughly as good as a selfish physicalist agent? Not really. The specification of that "physicalist prior" would involve the source code of the agent. (And, if we want to handle hypotheses that predict multiple copies, it would also involve the loss function.) For AIXI, there is no such source code. More generally, the intelligence of an agent w.r.t. the Solomonoff prior is upper bounded by its Kolmogorov complexity (see this). So, highly intelligent cartesian agents place small prior probability on physicalism, and it would take such an agent a lot of data to learn it.

Curiously, the converse is also true: the translation of section 4 also involves the source code of the agent, so intelligent physicalist agents place small prior probability on cartesian dualism. We could maybe use description complexity relative to the agent's own source code, in order to define a prior[11] which makes dualism and physicalism roughly equivalent (for selfish agents). This might be useful as a way to make cartesian algorithms do physicalist reasoning. But in any case, you still need physicalism for the analysis.

It would be interesting to do a similar analysis for bounded Solomonoff-like priors, which might require understanding the computational complexity of the bridge transform. In the bounded case, the Kolmogorov complexity of intelligent agents need not be high, but a similar penalty might arise from the size of the environment state space required for the physicalist-to-cartesian translation.

Are manifest facts objective?

Physicalism is supposed to be "objective", i.e. avoid privileging the observer. But, is the computationalist stance truly objective? Is there an objective fact of the matter about which facts about computations are physically manifest / which computations are realized?

At first glance it might seem there isn't. By Proposition 2.7, if the physicalist hypothesis is s.t. for some , then . So, if the hypothesis asserts a fact then is always physically manifest. But, isn't merely the subjective knowledge of the agent? And if so, doesn't it follow that physical manifesting is also subjective?

At second glance things get more interesting. Suppose that agent learned hypothesis . Since is part of the physical universe, it is now indeed manifest in the state of the physical universe that hypothesis (and in particular ) is true. Looking at it from a different angle, suppose that a different agent knows that is a rational physicalist. Then it must agree that if will learn then is manifest. So, possibly there is a sense in which manifesting is objective after all.

It would be interesting to try and build more formal theory around this question. In particular, the ability of the user and the AI to agree on manifesting (and agree in general) seems important for alignment in general (since we want them to be minimizing the same loss) and for ruling out acausal attacks in particular (because we don't want the AI to believe in a hypothesis s.t. it wouldn't be rational for the user to believe in it).

Physicalism and alignment

The domain of the physicalist loss function doesn't depend on the action and observation sets, which is an advantage for alignment since it can make it easier to "transfer" the loss function from the user to the AI when they have different action or observation sets.

On the other hand, the most objectionable feature of physicalism is the monotonicity principle (section 3). The default conclusion is, we need to at least generalize the formalism to allow for alignment. But, suppose we stick with the monotonicity principle. How bad is the resulting misalignment?

Consider an alignment protocol in which the AI somehow learns the loss function of the user. As a toy model, assume the user has a cellular automaton loss function , like in section 3. Assume the AI acts according to the loss function of Definition 3.3. What will result?

Arguably, in all probable scenarios the result will be pretty close to optimal from the user's perspective. One type of scenario where the AI makes the wrong choice is, when choosing between destruction of everything and a future which is even worse than destruction of everything. But, if these are the only available options then probably something has gone terribly wrong at a previous point. Another type of scenario where the AI possibly makes the wrong choice is, if it creates multiple disconnected worlds each of which has valence from the perspective of the user, at least one of which is good but some of which are bad. However, there usually seems to be no incentive to do that[12].

In addition, the problems might be milder in approaches that don't require learning the user's loss function, such as IDA of imitation.

So, the monotonicity principle might not introduce egregious misalignment in practice. However, this requires more careful analysis: there can easily be failure modes we have overlooked.

Future research directions

Here's a list of possible research directions, not necessarily comprehensive. Some of them were already mentioned in the body of the article.

Directions which are already more or less formalized:

  • Generalizing everything to infinite spaces (actually, we already have considerable progress on this and hopefully will publish another article soon).

  • Generalizing section 4 to certain pseudocausal laws. Here, we can make use of the monotonicity principle to argue that if Omega discontinues all of its simulations, they correspond to copies which are not the best-off and therefore don't matter in the loss calculus. Alternatively, we can design a non-selfish loss function that only counts the "baseline reality" copies. Moreover, it would be interesting to understand the connection between the pseudocausality condition and the "fairness" of Definition 1.5.

  • Define a natural candidate for the Solomonoff prior for physicalist hypotheses.

  • Prove that the physicalist intelligence is truly unbounded, and study its further properties.

Directions which are mostly clear but require more formalization:

  • Prove learning-theoretic results (regret bounds). For example, if we assume that there is a low-cost experiment that reveals the true hypothesis, then low regret should be achievable. In particular, such theorems would allow validating Definition 1.5.

  • Generalize section 3 to stochastic ontologies.

  • Allow uncertainty about the loss function, and/or about the source code. This would introduce another variable into our physicalist hypothesis, and we need to understand how the bridge transform should act on it.

  • Analyze the relationship between cartesian and physicalist agents with bounded simplicity priors.

  • For (cartesian) reinforcement learning, MDPs and POMDPs are classical examples of environments for which various theoretical analyses are possible. It would be interesting to come up with analogous examples of physicalist hypotheses and study them. For example, we can consider infra-Markov chains where the transition kernel depends on .

Directions where it's not necessarily clear how to approach the problem:

  • Understand the significance of the monotonicity principle better, and whether there are interesting ways to avoid it.

  • Formally study the objectivity of manifesting, perhaps by deriving some kind of agreement theorems for physicalist agents.

  • Define and analyze physicalist alignment protocols.

  • It is easy to imagine how a physicalist hypothesis can describe a universe with classical physics. But, the real universe is quantum. So, it doesn't really have a distribution over some state space but rather a density matrix on some Hilbert space. Is there a natural way to deal with this in the present formalism? If so, that would seem like a good solution to the interpretation of quantum mechanics. [EDIT: Indeed, there is a solution.]


  1. If we ignore ideas such as the von Neumann-Wigner interpretation of quantum mechanics. ↩︎

  2. This other direction also raises issues with counterfactuals. These issues are also naturally resolved in our formalism. ↩︎

  3. The notation is reserved for a different, commutative, operation which we will not use here. ↩︎

  4. A simplifying assumption we are planning to drop in future articles. ↩︎ ↩︎ ↩︎

  5. We have considered this type of setting before with somewhat different motivation. ↩︎

  6. The careful reader will observe that programs sometimes don't halt which means that the "true" computational universe is ill-defined. This turns out not to matter much. ↩︎

  7. Previously we defined pullback s.t. it can only be applied to particular infra/ultradistributions. Here we avoid this limitation by using infra/ultracontributions as the codomain. ↩︎

  8. Meaning that, this rule is part of the definition of the state rather than a claim about physical reality. ↩︎

  9. Disintegrating a distribution into a semidirect product yields a unique result, but for contributions that's no longer true, since it's possible to move scalars between the two factors. ↩︎

  10. One can engineer loss functions for which they are not suppressed, for example if the loss only depends on the actions and not on the observations. But such examples seem contrived. ↩︎

  11. It's a self-referential definition, but we can probably resolve the self-reference by quining. ↩︎

  12. One setting in which there is an incentive: Suppose there are multiple users and the AI is trying to find a compromise between their preferences. Then, it might decide to create disconnected worlds optimized for different users. But, this solution might be much worse than the AI thinks, if Alice's world contains Bob!atrocities. ↩︎

Comments

Personally I am very very confused about infra-Bayesianism, so I cannot comment on this post. This is despite the fact that I would really like to understand it as I'm pretty interested in the notion of naturalized induction.

I think a big constraint preventing me and others from getting into infra-Bayesianism is that you are immediately hit with a giant wall of math that you have no intuition for. Meanwhile when it comes to ordinary Bayesianism, I have a lot of intuition due to having been exposed to lots of basic examples and such. I think production of extremely beginner-friendly material could get a lot more people discussing this.

(I could probably learn it without extremely beginner-friendly material if I took a lot of time to sit down and work with the hard math. But then it competes for my attention with many other topics that I also really should put effort into, so it is unlikely to happen.)

Well, Alex is working on an infra-Bayesianism textbook; maybe that will help.

I think the thing I would like most is a post that's "explain why infrabayes matters like I'm 5, basically without math." (Admittedly I'm not the main target audience, but, I think this is a valuable thing to exist that will help figure out the pedagogical architecture of an eventual textbook)

I'm somewhat worried, from some past experience hearing people try to learn about infrabayes, that you and Alex keep overshooting in complexity. Curious if for the textbook Alex is talking to people with various degrees of background knowledge and seeing how different explanations land?

Personally I feel like I get why it matters from the other posts, and would just like a gentle introduction to the math.

Cool! I too was intimidated by all the math, but I want to understand it at some point!

The post is still largely up-to-date. In the intervening year, I mostly worked on the theory of regret bounds for infra-Bayesian bandits, and haven't made much progress on open problems in infra-Bayesian physicalism. On the other hand, I also haven't found any new problems with the framework.

The strongest objection to this formalism is the apparent contradiction between the monotonicity principle and the sort of preferences humans have. While my thinking about this problem evolved a little, I am still at a spot where every solution I know requires biting a strange philosophical bullet. On the other hand, IBP is still my best guess about naturalized induction, and, more generally, about the conjectured "attractor submanifold" in the space of minds, i.e. the type of mind to which all sufficiently advanced minds eventually converge.

One important development that did happen is my invention of the PreDCA alignment protocol, which critically depends on IBP. I consider PreDCA to be the most promising direction I know at present to solving alignment, and an important (informal) demonstration of the potential of the IBP formalism.

fwiw that strange philosophical bullet fits remarkably well with a set of thoughts I had while reading Anthropic Bias about 'amount of existence' being the fundamental currency of reality (a bunch of the anthropic paradoxes felt like they were showing that if you traded sufficiently large amounts of "patterns like me exist more" then you could get counterintuitive results like bending the probabilities of the world around you without any causal pathway), and infraBayes requiring it actually updated me a little towards infraBayes being on the right track.

My model of why humans seem to prefer non-existence to existence in some cases is that our ancestors faced situations which could reduce their ability to self-propagate to almost zero, and needed to avoid these really hard. Evolution gave us training signals which can easily generate subagents which are single-mindedly obsessed with avoiding certain kinds of intense suffering. This motivates us to avoid a wide range of realistic things which cost us existence, but as a side-effect of being emphasized so much make it possible to tip into suicidality in cases where, in our history, it was not too costly because things were bad enough anyway that the agent wouldn't propagate much (suicide when the cues for self-propagation being relatively likely for on-distribution humans should have been weeded out). This strikes me as unintended and a result of a hack which works pretty well on-distribution, and likely not reflectively consistent in the limit. An evolution which could generate brains with unbounded compute would not make agents which ever preferred suicide or non-existence.

Another angle on this is thinking of evolution having set things up for a sign-flipped subagent to be reinforced, which just wants to Not Be. This is not a natural shape for an agent to be, but it's useful enough that the pattern to generate it is common.

This is all pretty handwave-y and I don't claim high confidence that it's correct or useful, but might be interesting babble.

Could you explain what the monotonicity principle is, without referring to any symbols or operators? I gathered that it is important, that it is problematic, that it is a necessary consequence of physicalism absent from cartesian models, and that it has something to do with the min-(across copies of an agent) max-(across destinies of the agent copy) loss. But I seem to have missed the content and context that makes sense of all of that, or even in what sense and over what space the loss function is being monotonic.

Your discussion section is good. I would like to see more of the same without all the math notation.

If you find that you need to use novel math notation to convey your ideas precisely, I would advise you to explain what every symbol, every operator, and every formula as a whole means every time you reference them. With all the new notation, I forgot what everything meant after the first time they were defined. If I had a year to familiarize myself with all the symbols and operators and their interpretations and applications, I imagine that this post would be much clearer.

That being said, I appreciate all the work you put into this. I can tell there's important stuff to glean here. I just need some help gleaning it.

Could you explain what the monotonicity principle is, without referring to any symbols or operators?

The loss function of a physicalist agent depends on which computational facts are physically manifest (roughly speaking, which computations the universe runs), and on the computational reality itself (the outputs of computations). The monotonicity principle requires it to be non-decreasing w.r.t. the manifesting of fewer facts. Roughly speaking, the more computations the universe runs, the better.

This is odd, because it implies that the total destruction of the universe is always the worst possible outcome. And, the creation of an additional, causally disconnected, world can never be net-negative. For a monotonic agent, there can be no net-negative world[1]. In particular, for selfish monotonic agents (i.e., agents that only assign value to their own observations), this means death is the worst possible outcome and the creation of additional copies of the agent can never be net-negative.

With all the new notation, I forgot what everything meant after the first time they were defined.

Well, there are the "notation" and "notation reference" subsections, that might help.

That being said, I appreciate all the work you put into this. I can tell there's important stuff to glean here.

Thank you!


  1. At least, all of this is true if we ignore the dependence of the loss function on the other argument, namely the outputs of computations. But it seems like that doesn't qualitatively change the picture. ↩︎

Thank you for explaining this! But then how can this framework be used to model humans as agents?  People can easily imagine outcomes worse than death or destruction of the universe.

The short answer is, I don't know.

The long answer is, here are some possibilities, roughly ordered from "boring" to "weird":

  1. The framework is wrong.
  2. The framework is incomplete, there is some extension which gets rid of monotonicity. There are some obvious ways to make such extensions, but they look uglier and without further research it's hard to say whether they break important things or not.
  3. Humans are just not physicalist agents, you're not supposed to model them using this framework, even if this framework can be useful for AI. This is why humans took so much time coming up with science.
  4. Like #3, and also if we thought long enough we would become convinced of some kind of simulation/deity hypothesis (where the simulator/deity is a physicalist), and this is normatively correct for us.
  5. Because the universe is effectively finite (since it's asymptotically de Sitter), there are only so many computations that can run. Therefore, even if you only assign positive value to running certain computations, it effectively implies that running other computations is bad. Moreover, the fact the universe is finite is unsurprising since infinite universes tend to have all possible computations running which makes them roughly irrelevant hypotheses for a physicalist.
  6. We are just confused about hell being worse than death. For example, maybe people in hell have no qualia. This makes some sense if you endorse the (natural for physicalists) anthropic theory that only the best-off future copy of you matters. You can imagine there always being a "dead copy" of you, so that if something worst-than-death happens to the apparent-you, your subjective experiences go into the "dead copy".

The monotonicity principle requires it to be non-decreasing w.r.t. the manifesting of less facts. Roughly speaking, the more computations the universe runs, the better.

I think this is what I was missing. Thanks.

So, then, the monotonicity principle sets a baseline for the agent's loss function that corresponds to how much less stuff can happen to whatever subset of the universe it cares about, getting worse the fewer opportunities become available, due to death or some other kind of stifling. Then the agent's particular value function over universe-states gets added/subtracted on top of that, correct?

No, it's not a baseline, it's just an inequality. Let's do a simple example. Suppose the agent is selfish and cares only about (i) the experience of being in a red room and (ii) the experience of being in a green room. And, let's suppose these are the only two possible experiences, it can't experience going from a room in one color to a room in another color or anything like that (for example, because the agent has no memory). Denote the program corresponding to "the agent deciding on an action after it sees a green room" and the program corresponding to "the agent deciding on an action after it sees a red room". Then, roughly speaking[1], there are 4 possibilities:

  • : The universe runs neither nor .
  • : The universe runs but not .
  • : The universe runs but not .
  • : The universe runs both and .

In this case, the monotonicity principle imposes the following inequalities on the loss function :

That is, running neither program must be the worst case and running both must be the best case.


  1. In fact, manifesting of computational facts doesn't amount to selecting a set of realized programs, because programs can be entangled with each other, but let's ignore this for simplicity's sake. ↩︎
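For concreteness, a small sketch of this constraint (the program names G and R are mine, and realized computations are treated as a plain set, ignoring the entanglement caveat from the footnote): the loss is a function of the set of realized programs, and monotonicity forces the empty set to be the worst case and the full set the best.

```python
# Sketch of the monotonicity constraint for the two-program example
# (program names G and R are illustrative; realized computations treated as a set).
from itertools import chain, combinations

PROGRAMS = ["G", "R"]  # "agent sees a green room", "agent sees a red room"

def powerset(items):
    return [frozenset(c) for c in
            chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))]

# One admissible loss over sets of realized programs (smaller is better).
loss = {
    frozenset(): 1.0,            # neither program runs: forced worst case
    frozenset({"G"}): 0.3,
    frozenset({"R"}): 0.6,
    frozenset({"G", "R"}): 0.1,  # both run: forced best case
}

# Monotonicity: realizing more programs can never make the loss larger.
for a in powerset(PROGRAMS):
    for b in powerset(PROGRAMS):
        if a <= b:
            assert loss[a] >= loss[b], (a, b)
print("monotonicity holds")
```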

Okay, so it's just a constraint on the final shape of the loss function. Would you construct such a loss function by integrating a strictly non-positive computation-value function over all of space and time (or at least over the future light-cones of all its copies, if it focuses just on the effects of its own behavior)?

Space and time are not really the right parameters here, since these refer to (physical states), not (computational "states") or (physically manifest facts about computations). In the example above, it doesn't matter where the (copy of the) agent is when it sees the red room, only the fact the agent does see it. We could construct such a loss function by a sum over programs, but the constructions suggested in section 3 use minimum instead of sum, since this seems like a less "extreme" choice in some sense. Ofc ultimately the loss function is subjective: as long as the monotonicity principle is obeyed, the agent is free to have any loss function.
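As a quick illustration of why a minimum-based construction is compatible with the principle (a sketch with invented per-program values, not the exact constructions of section 3): defining the loss of a set of realized programs as the minimum of per-program losses, with a fixed worst value on the empty set, is automatically monotone, since enlarging the set can only lower the minimum.

```python
# Sketch: a min-over-realized-programs loss is automatically monotone.
from itertools import chain, combinations

per_program_loss = {"G": 0.3, "R": 0.6}  # illustrative values
WORST = 1.0  # loss when nothing is realized (at least as large as every per-program loss)

def min_loss(realized):
    return min((per_program_loss[p] for p in realized), default=WORST)

# Enlarging the realized set can only lower (or keep) the loss.
subsets = [frozenset(c) for c in
           chain.from_iterable(combinations("GR", r) for r in range(3))]
for a in subsets:
    for b in subsets:
        if a <= b:
            assert min_loss(a) >= min_loss(b)
```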


A theory of physics is mathematically quite similar to a cellular automaton. This theory will usually be incomplete, something that we can represent in infra-Bayesianism by Knightian uncertainty. So, the "cellular automaton" has underspecified time evolution.

What evidence is there that incomplete models with Knightian uncertainty are a way to turn rough models of physics into loss functions? Can the ideas behind it be applied to regular Bayesianism?

A physicalist hypothesis is a pair ), where  is a finite[4:2] set representing the physical states of the universe and  represents a joint belief about computations and physics. [...] Our agent will have a prior over such hypotheses, ranging over different .

I am confused what the state space Φ is adding to your formalism and how it is supposed to solve the ontology identification problem. Based on what I understood, if I want to use this for inference, I have this prior , and now I can use the bridge transform to project phi out again to evaluate my loss in different counterfactuals. But when looking at your loss function, it seems like most of the hard work is actually done in your relation that determines which universes are consistent, but its definition does not seem to depend on . How is that different from having a prior that is just over and taking the loss, if Φ is projected out anyway and thus not involved?

First, the notation makes no sense. The prior is over hypotheses, each of which is an element of . is the notation used to denote a single hypothesis.

Second, having a prior just over doesn't work since both the loss function and the counterfactuals depend on .

Third, the reason we don't just start with a prior over , is because it's important which prior we have. Arguably, the correct prior is the image of a simplicity prior over physicalist hypotheses by the bridge transform. But, come to think about it, it might be about the same as having a simplicity prior over , where each hypothesis is constrained to be invariant under the bridge transform (thanks to Proposition 2.8). So, maybe we can reformulate the framework to get rid of (but not of the bridge transform). Then again, finding the "ultimate prior" for general intelligence is a big open problem, and maybe in the end we will need to specify it with the help of .

Fourth, I wouldn't say that is supposed to solve the ontology identification problem. The way IBP solves the ontology identification problem is by asserting that is the correct ontology. And then there are tricks how to translate between other ontologies and this ontology (which is what section 3 is about).

Γ=Σ^R, it's a function from programs to what result they output. It can be thought of as a computational universe, for it specifies what all the functions do.

Should this say "elements are function... They can be thought of as...?"

Can you make a similar theory/special case with probability theory, or do you really need infra-bayesianism? If the second, is there a simple explanation of where probability theory fails?

Should this say "elements are function... They can be thought of as...?"

Yes, the phrasing was confusing, I fixed it, thanks.

Can you make a similar theory/special case with probability theory, or do you really need infra-bayesianism?

We really need infrabayesianism. On bayesian hypotheses, the bridge transform degenerates: it says that, more or less, all programs are always running. And, the counterfactuals degenerate too, because selecting most policies would produce "Nirvana".

The idea is, you must have Knightian uncertainty about the result of a program in order to meaningfully speak about whether the universe is running it. (Roughly speaking, if you ask "is the universe running 2+2?" the answer is always yes.) And, you must have Knightian uncertainty about your own future behavior in order for counterfactuals to be meaningful.

It is not surprising that you need infrabayesianism in order to do naturalized induction: if you're thinking of the agent as part of the universe then you are by definition in the nonrealizable setting, since the agent cannot possibly have a full description of something "larger" than itself.


Can someone break Definition 1.1 down for me? I got lost in all the notation and what acts on what, what is projected to where..

stands for "support"

I am guessing this refers to this notion of support I found on Wikipedia:

   

Edit: fixed Definition.

It should be . More generally, there is the notion of support from measure theory, which sometimes comes up, although in this post we only work with finite sets so it's the same.