This is joint work by Vanessa Kosoy and Alexander "Diffractor" Appel. For the proofs, see 1 and 2.
TLDR: We present a new formal decision theory that realizes
naturalized induction.
Our agents reason in terms of infra-Bayesian hypotheses, the domain
of which is the cartesian product of computations and physical states,
where the ontology of "physical states" may vary from one hypothesis
to another. The key mathematical building block is the "bridge transform",
which, given such a hypothesis, extends its domain to "physically
manifest facts about computations". Roughly speaking, the bridge
transform determines which computations are executed by the physical
universe. In particular, this allows "locating the agent in the
universe" by determining on which inputs its own source is executed.
0. Background
The "standard model" of ideal agency is Bayesian reinforcement
learning, and more specifically, AIXI.
We challenged
this model before due to its problems with non-realizability, suggesting
infra-Bayesianism as an alternative. Both formalisms assume the "cartesian
cybernetic framework", in which (i) the universe is crisply divided
into "agent" and "environment" and (ii) the two parts interact
solely via the agent producing actions which influence the environment
and the environment producing observations for the agent. This is
already somewhat objectionable on the grounds that this division is
not a clearly well-defined property of the physical universe. Moreover,
once we examine the structure of the hypothesis such an agent is expected
to learn (at least naively), we run into some concrete problems.
The modern understanding of the universe is that no observer plays
a privileged role[1]. Therefore, the laws of
physics are insufficient to provide a cartesian description of the universe,
and must, to this end, be supplemented with "bridge rules"
that specify the agent's location inside the universe. That is, these
bridge rules need to translate the fundamental degrees of freedom
of a physical theory (e.g. quantum wavefunction) to the agent's observations
(e.g. values of pixels on a camera), and translate the agent's actions
(e.g. signal to robot manipulators) in the other direction[2].
The cost of this is considerable growth in the description complexity
of the hypothesis.
A possible retort is something like "how many bits can it really take to pick out the computer within the universe state and describe how to translate universe state to observations? Sure, it might be a considerable chunk of data. But, since Solomonoff induction only needs to make a kilobyte worth of predictive mistakes to learn to predict the inputs as well as any thousand-bit predictor, it should learn this sort of stuff pretty fast." This objection is addressed more later on in this post. A first counter-objection is that, for practical algorithms, this can be a bigger obstacle, especially when sample complexity scales superlinearly with description complexity. For example, in the Russo-Van Roy regret bound for Thompson sampling in multi-armed bandits, the time horizon necessary to attain a given regret scales as the square of the entropy of the prior. The computational complexity cost can also go way up as the description complexity rises, because you need to evaluate more hypotheses more times to test them. For reinforcement learning it's even worse. Consider a game where an agent must enter an n-digit password; if it gets it right, it gets reward, and if it gets it wrong, it gets no reward and can try again. The learning time for this game scales as Ω(2^n), far superlinearly with description complexity.
Another assumption of AIXI is the simplicity prior, and we expect
some form of this assumption to persist in computable and infra-Bayesian
analogues. This reflects the intuitive idea that we expect the world
to follow simple laws (or at least contain simple patterns). However,
from the cartesian perspective, the "world" (i.e. the environment)
is, prima facie, not simple at all (because of bridge rules)! Admittedly, the increase in complexity from the bridge rule is low compared to the cost of specifying the state of the universe. But the state is less of a problem: once the agent learns the transition rules and bridge rule for the universe it's in, learning the state of the universe in addition doesn't seem to pose any particular unforeseen metaphysical difficulties. Further, the description complexity cost of the bridge rule seems likely to be above the description complexity of the laws of physics themselves.
Hence,
there is some disconnect between the motivation for using a simplicity
prior and its implementation in a cartesian framework.
Moreover, if the true hypothesis is highly complex, it implies that
the sample complexity of learning it is very high. And, as previously mentioned, the sample complexity issues are worse in practice than Solomonoff suggests. This should make
us suspect that such a learning process is not properly exploiting
Occam's razor. Intuitively, such an agent a-priori considers it equally plausible
to discover itself to be a robot or to discover itself to be a random
clump of dust in outer space, since specifying a bridge-rule interface between the computer and observations is about as hard as specifying one between the dust clump and observations, and the agent needs a lot of data to resolve all those possibilities for how its observations connect to a world. Also, though Solomonoff induction is extremely effective at slicing through the vast field of junk hypotheses that do not describe the thing being predicted, once it has whittled things down to a small core of hypotheses that do predict things fairly accurately, the data to further distinguish between them may be fairly slow in coming. If there's a simple predictor of events occurring in the world but it's running malign computation, then you don't have the luxury of 500 bits of complexity wiggle room (to quickly knock out this hypothesis), because that's a factor of 2^500 probability difference. Doing worst-case hypothesis testing as in KWIK learning would require a very aggressive threshold indeed, and mispredictions can be rare but important.
Furthermore, some events are simple to describe from the subjective
(cartesian) point of view, but complex to describe from an objective
(physical) point of view. (For example, all the pixels of the camera
becoming black.) Modifying a hypothesis by positing exceptional behavior
following a simple event only increases its complexity by the difficulty of specifying the event and what occurs afterwards, which could be quite low. Hence,
AIXI-like agents would have high uncertainty about the consequences
of observationally simple events. On the other hand, from an objective perspective such
uncertainty seems irrational. (Throwing a towel on the camera should
not break physics.) In other words, cartesian reasoning is biased
to privilege the observer.
Yet another failure of cartesian agents is the inability to reason
about origin theories. When we learn that a particular theory explains
our own existence (e.g. evolution), this serves as a mark in favor
of the theory. We can then exploit this theory to make useful predictions
or plans (e.g. anticipate that using lots of antibiotics will cause
bacteria to develop resistance). However, for a cartesian agent the
question of origins is meaningless. Such an agent perceives its own
existence as axiomatic, hence there is nothing to explain.
Finally, cartesian agents are especially vulnerable to acausal
attacks. Suppose we deploy a superintelligent Cartesian AI called Kappa. And, imagine a superintelligent agent Mu that inhabits some purely hypothetical universe. If Mu is motivated to affect our own (real) universe, it can run simulations of Kappa's environment. Kappa, who doesn't know a priori whether it exists in our universe or in Mu's universe, will have to seriously consider the hypothesis that it is inside such a simulation. And, Mu will deploy the simulation in such a manner as to make the simulation hypothesis much simpler, thanks to simpler bridge rules. This will cause Kappa to become overwhelmingly confident that it is in
a simulation. If this is achieved, Mu can cause the simulation to diverge from our reality at a strategically chosen point such that Kappa is induced to take an irreversible action in Mu's favor (effectively a treacherous turn). Of course, this requires Kappa to predict Mu's motivations in some detail. This is possible if Kappa develops a good enough understanding of metacosmology.
An upcoming post by Diffractor will discuss acausal attacks in more detail.
In the following sections, we will develop a "physicalist" formalism
that entirely replaces the cartesian framework, curing the above-mentioned ills. We have not yet reached the stage of proving improved regret bounds with it; so far, we have nailed down its basic mathematical properties. As an additional benefit, it allows naturally
incorporating utility functions that depend on unobservables, thereby avoiding the problem of "ontological crises". At the
same time, it seems to impose some odd constraints on the utility
function. We discuss the possible interpretations of this.
1. Formalism
Notation
It will be more convenient to use ultradistributions rather than infradistributions.
This is a purely notational choice: the decision theory is unaffected,
since we are going to apply these ultradistributions to loss functions
rather than utility functions. As support for this claim, Diffractor originally wrote down most of the proofs in infradistribution form, and then changing their form for this post was rather straightforward to do. In addition, for the sake of simplicity,
we will stick to finite sets: more general spaces will be treated
in a future article. So far, we're up to countable products of finite sets.
We denote R+:=[0,∞). Given a finite set X, a
contribution on X is θ:X→R+
s.t. ∑xθ(x)≤1 (it's best to regard it as a measure
on X). The space of contributions is denoted ΔcX. Given f:X→R
and θ∈ΔcX, we denote θ(f):=∑xθ(x)f(x).
There is a natural partial order on contributions: θ1≤θ2
when ∀x∈X:θ1(x)≤θ2(x). Naturally,
any distribution is in particular a contribution, so ΔX⊆ΔcX.
A homogenous ultracontribution (HUC) on X is a non-empty closed
convex Θ⊆ΔcX which is downward closed w.r.t. the
partial order on ΔcX. The space of HUCs on X is denoted □cX.
A homogenous ultradistribution (HUD) on X is a HUC Θ
s.t. Θ∩ΔX≠∅. The space of HUDs on X
is denoted □X. Given f:X→R and Θ∈□cX,
we denote Θ(f):=maxθ∈Θθ(f).
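As a concrete anchor for this notation, here is a minimal Python sketch (all names are ours, not from the post) that represents a contribution as a dict and a HUC by a finite list of generating points; since θ↦θ(f) is linear and Θ is convex, the maximum is attained at one of the listed points, and for f≥0 the downward closure doesn't change the maximum.

```python
# Minimal sketch: a contribution on a finite set X is a dict x -> mass,
# with nonnegative values summing to at most 1.
def is_contribution(theta):
    return all(v >= 0 for v in theta.values()) and sum(theta.values()) <= 1.0

# theta(f) := sum_x theta(x) * f(x)
def expect(theta, f):
    return sum(mass * f(x) for x, mass in theta.items())

# A HUC is represented here by a finite list of contributions whose convex,
# downward closure is the intended set. Theta(f) := max over theta in Theta of
# theta(f); by linearity the max over the closure is attained at a listed point
# (for f >= 0, ignoring the downward closure is harmless).
def huc_expect(extreme_points, f):
    return max(expect(theta, f) for theta in extreme_points)

# Example: "top", total uncertainty over X = {0, 1}, generated by the two Diracs.
top = [{0: 1.0, 1: 0.0}, {0: 0.0, 1: 1.0}]
assert huc_expect(top, lambda x: float(x)) == 1.0  # worst case for the loss f(x) = x
```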
Given s:X→Y, s∗:□cX→□cY is the
pushforward by s:
s∗Θ:={s∗θ∣θ∈Θ}
(s∗θ)(y):=∑x∈s−1(y)θ(x)
Given Ξ:X→□cY, Ξ∗:□cX→□cY
is the pushforward by Ξ:
Ξ∗Θ:={κ∗θ∣θ∈Θ,κ:X→ΔcY,∀x∈X:κ(x)∈Ξ(x)}
(κ∗θ)(y):=∑x∈Xθ(x)κ(x;y)
prX:X×Y→X is the projection mapping and pr−Y:=prX.
We slightly abuse notation by omitting the asterisk in pushforwards
by these.
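Continuing the sketch above (same hypothetical encoding), the two pushforwards act on a single contribution as below; at the HUC level, Ξ∗Θ additionally ranges over every θ∈Θ and every selection κ with κ(x)∈Ξ(x).

```python
from collections import defaultdict

# (s_* theta)(y) = sum of theta(x) over x in the preimage s^{-1}(y).
def pushforward(theta, s):
    out = defaultdict(float)
    for x, mass in theta.items():
        out[s(x)] += mass
    return dict(out)

# (kappa_* theta)(y) = sum_x theta(x) * kappa(x)(y), for a kernel kappa
# mapping each point x to a contribution on Y.
def kernel_pushforward(theta, kappa):
    out = defaultdict(float)
    for x, mass in theta.items():
        for y, k in kappa(x).items():
            out[y] += mass * k
    return dict(out)
```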
Given Θ∈□cX and Ξ:X→□cY, Θ⋉Ξ∈□c(X×Y)
is the semidirect product:
Θ⋉Ξ:={κ⋉θ∣θ∈Θ,κ:X→ΔcY,∀x∈X:κ(x)∈Ξ(x)}
(κ⋉θ)(x,y):=θ(x)κ(x;y)
We will also use the notation Ξ⋊Θ∈□c(Y×X)
for the same HUC with X and Y flipped. And, for Λ∈□cY, Θ⋉Λ∈□c(X×Y)
is the semidirect product of Θ with the constant ultrakernel
whose value is Λ[3].
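In the same sketch, the semidirect product of a single contribution with a kernel is the joint "measure x first, then y given x"; Θ⋉Ξ then collects κ⋉θ over all θ∈Θ and all selections κ of Ξ.

```python
# (kappa ⋉ theta)(x, y) = theta(x) * kappa(x; y): a joint contribution on X x Y.
def semidirect(theta, kappa):
    return {(x, y): mass * k
            for x, mass in theta.items()
            for y, k in kappa(x).items()}
```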
For more discussion of HUDs, see previous article,
where we used the equivalent concept of "cohomogenous infradistribution".
Notation Reference
If you got lost somewhere and wanted to scroll back to see some definition, or see how the dual form of this with infradistributions works, that's what this section is for.
θ is a contribution, a measure with 1 or less measure in total. The dual concept is an a-measure (λμ,b) with λ+b=1.
Θ is a HUC (homogenous ultracontribution) or HUD (homogenous ultradistribution), a closed convex downwards closed set of contributions. The dual concepts are cohomogenous inframeasure and cohomogenous infradistribution, respectively.
ΔcX,□cX,□X are the spaces of contributions, homogenous ultracontributions, and homogenous ultradistributions respectively.
θ(f),Θ(f) are the expectations of functions f:X→[0,1], defined in the usual way. For θ, it's just the expectation of a function w.r.t. a measure, and for Θ, it's Θ(f):=maxθ∈Θθ(f), to perfectly parallel a-measures evaluating functions by just taking expectation, and inframeasures evaluating functions via min(m,b)∈Ψm(f)+b.
≤ is the ordering on contributions/HUC's/HUD's, which is the function ordering, where θ≤θ′ iff for all f:X→[0,1], θ(f)≤θ′(f), and similar for the HUC's. Inframeasures are equipped with the opposite ordering.
s∗Θ is the pushforward along the function s:X→Y. This is a standard probability theory concept which generalizes to all the infra and ultra stuff.
Ξ∗Θ is the pushforward of Θ along the ultrakernel Ξ:X→□cY. This is just the generalization to infra and ultra stuff of the ability to push a probability distribution on X through a probabilistic function X→ΔY to get a probability distribution on Y.
Θ⋉Ξ is the semidirect product of Θ and Ξ, an element of □c(X×Y). This is the generalization of the ability to take a probability distribution on X, and a probabilistic kernel X→ΔY, and get a joint distribution on X×Y.
A,O are the sets of actions and observations.
N is the time horizon.
R is the space of programs, Σ is the space of outputs.
Γ=ΣR is the space of functions from programs to the results they output. An element of Γ can be thought of as the state of the "computational universe", for it specifies what all the programs do.
H is the space of histories, action-observation sequences that can end at any point, ending with an observation.
D is the space of destinies, action-observation sequences that are as long as possible, going up to the time horizon.
C is a relation on D×Γ that says whether a computational universe is consistent with a given destiny. A very important note is that this is not the same thing as "the policy is consistent with the destiny" (the policy's actions are the same as what the destiny advises). This is saying something more like "if the destiny has an observation that the computer spat out result 1 when run on computation A, then that is only consistent with mathematical universes which have computation A outputting result 1". Except we don't want to commit to the exact implementation details of it, so we're leaving it undefined besides just "it's a relation".
Φ is the space of "physics outcomes"; it can freely vary depending on the hypothesis. It's not a particular fixed space.
Θ is the variable typically used for physicalist hypotheses, elements of □(Γ×Φ). Your uncertainty over the joint distribution over the computational universe and the physical universe.
G is the code of the program-which-is-the-agent. So, G(h) would be the computation that runs what the agent does on history h, and returns its resulting action.
elΓ is the subset of the space Γ×2Γ consisting of (y,α) pairs s.t. y∈α. The y can be thought of as the mathematical universe, and α can be thought of as the set of mathematical universes that are observationally indistinguishable from it.
χA is the indicator function that's 1 on the set A, and 0 everywhere else. It can be multiplied by a measure, to get the restriction of a measure to a particular set.
Br(Θ) is the bridge transform of Θ, defined in definition 1.1, an ultracontribution over the space elΓ×Φ.
Hyα is the set of instantiated histories, relative to mathematical universe y, and set of observationally indistinguishable universes α. It's the set of histories h where all the "math universes" in α agree on how the agent's source code reacts to all the prefixes of h, and where the history h can be extended to some destiny that's consistent with math universe y. Ie, for a history to be in here, all the prefixes have to be instantiated, and it must be consistent with the selected math universe.
Setting
As in the cartesian framework, we fix a finite set A of actions
and a finite set O of observations. We assume everything happens
within a fixed finite[4] time horizon N∈N.
We assume that our agent has access to a computer[5]
on which it can execute some finite[4:1] set of
programs R with outputs in a finite alphabet Σ. Let
Γ:=ΣR be the set of "possible computational universes"[6].
We denote H:=(A×O)<N (the set of histories) and D:=(A×O)N−1×A
(the set of "destinies").
To abstract over the details of how the computer is operated, we assume
a relation C⊆D×Γ whose semantics is: dCy
(our notation for (d,y)∈C) if and only if destiny d is a
priori consistent with computational universe y. For example, suppose
some a∈A implies a command to execute program r∈R, and
if o∈O follows a, it implies observing the computer return
output i∈Σ for r. Then, if d contains the substring
ao and dCy, it must be the case that y(r)=i.
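The formalism leaves C abstract, but to make the example concrete, here is one hypothetical encoding (ours, not the post's; it also glosses over the dangling final action of a destiny): actions tagged "run" execute a program, and the following observation reports the computer's output.

```python
# Hypothetical encoding of the example: an action is ("run", r) for a program r,
# or ("act", ...) for anything else; the observation after a "run" action is
# ("result", i) for an output i in Sigma. A destiny is a list of
# (action, observation) pairs, and a computational universe y is a dict from
# programs to outputs.
def consistent(destiny, y):  # the relation d C y
    for action, observation in destiny:
        if action[0] == "run":
            _, r = action
            tag, i = observation
            if tag == "result" and y[r] != i:
                return False  # the computer reported an output math disagrees with
    return True
```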
A physicalist hypothesis is a pair (Φ,Θ), where Φ
is a finite[4:2] set representing the physical states
of the universe and Θ∈□(Γ×Φ) represents
a joint belief about computations and physics. By slight abuse of
notation we will refer to such Θ as a physicalist hypothesis,
understanding Φ to be implicitly specified. Our agent will have
a prior over such hypotheses, ranging over different Φ.
Two questions stand out to us at this point. The first is, what is
the domain over which our loss function should be defined? The second
is, how do we define the counterfactuals corresponding to different
policies π:H→A? The answers to both questions turn
out to require the same mathematical building block.
For the first question, we might be tempted to identify Φ as
our domain. However, prima facie this doesn't make sense, since Φ
is hypothesis-dependent. This is the ontological crisis
problem: we expect the agent's values to be defined within a certain
ontology which is not the best ontology for formulating the laws of
the physical universe. For example, a paperclip maximizer might benefit
from modeling the universe in terms of quantum fields rather than
paperclips. In principle, we can circumvent this problem by requiring
our Φ to be equipped with a mapping ν:Φ→Φ0, where
Φ0 is the "axiological" ontology. However, this ν
is essentially a bridge rule, carrying with it all the problems of
bridge rules. An acausal attack is performed by the adversarial hypothesis imprinting the "axiological rendering" of the target universe on the microscopic degrees of freedom of the source universe, in order to have a low-complexity Φ→Φ0 function. The analogue of the towel-on-camera issue is that, once you've already coded up your uncertainty over math and physics, along with how to translate from physics to the ontology of value, it doesn't take too much extra complexity to tie "axiologically-simple" results (the analogue of low-complexity observations) to physics-simple results (the analogue of a low-complexity change in what happens), like "if all the paperclips are red, the fine-structure constant doubles in value".
Instead, we will take a computationalist stance: value is not a property of physical states or processes, but of the computations
realized by physical processes. For example, if our agent is "selfish"
in the sense that rewards/losses are associated purely with subjective
histories, the relevant computation is the agent's own source code.
Notice that, for the program-which-is-the-agent G, histories are inputs. Hence, given a loss function l:H→R we
can associate the loss l(h) with the computation G(h). Admittedly, there is an implicit assumption that the agent has access to its own source code, but modal decision theory made the same assumption. For another
example, if our agent is a diamond maximizer, then the relevant computations
are simulations of the physics used to define "diamonds". A more concrete analogue of this is worked out in detail in section 3, regarding Conway's Game of Life.
For the second question, we might be tempted to follow updateless
decision theory:
counterfactuals correspond to conditioning on G=π. Remember, G is the code of the agent. However, this
is not "fair" since it requires the agent to be "responsible"
for copies of itself instantiated with fake memories. Such a setting
admits no learning-theoretic guarantees, since learning requires trusting
your own memory. (Moreover, the agent also has to be able to trust
the computer.) Therefore our counterfactuals should only impose G(h)=π(h)
when h is a "real memory", which we again interpret through computationalism: h is real if and only if G(h′) is physically realized for any prefix h′ of h.
Both of our answers require a formalization of the notion "assuming
hypothesis Θ, this computation is physically realized".
More precisely, we should allow for computations to be realized with
certain probabilities, and more generally allow for ultradistributions
over which computations are realized. We will now accomplish this
formalization.
Bridge transform
Given any set A, we denote elA={(a,B)∈A×2A∣a∈B}. supp stands for "support" and χA is the characteristic function of A.
Definition 1.1: Let Γ,Φ be finite sets and
Θ∈□c(Γ×Φ). The bridge transform of
Θ is BrΓ(Θ)∈□c(Γ×2Γ×Φ)
s.t. θ∈BrΓ(Θ) if and only if
suppθ⊆elΓ×Φ
for any s:Γ→Γ, prΓ×ΦχelΓ×Φ(s×id2Γ×Φ)∗θ∈Θ.
We will use the notation Br(Θ) when Γ is obvious
from the context.
Notice that we are multiplying by χelΓ×Φ, not pushforwarding.
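Since everything is finite, Definition 1.1 can be checked by brute force, which may make the two conditions easier to parse. The sketch below is ours (in_Theta is an assumed membership oracle for Θ, and enumerating all |Γ|^|Γ| maps s is only feasible for tiny Γ); it tests whether a single contribution θ on Γ×2Γ×Φ belongs to the bridge transform.

```python
from itertools import product

def in_bridge_transform(theta, in_Theta, Gamma):
    """theta: dict (y, alpha, x) -> mass, with alpha a frozenset of elements
    of Gamma. in_Theta: membership oracle for Theta, applied to dicts
    (y, x) -> mass. Gamma: list of all computational universes."""
    # Condition 1: support contained in el(Gamma) x Phi, i.e. y lies in alpha.
    if any(m > 0 and y not in alpha for (y, alpha, x), m in theta.items()):
        return False
    # Condition 2: for every s: Gamma -> Gamma, push y to s(y), multiply by the
    # indicator of el(Gamma) x Phi (keep mass only where s(y) is still in alpha
    # -- multiplying, not pushforwarding), project out alpha, and test in Theta.
    for s_values in product(Gamma, repeat=len(Gamma)):
        s = dict(zip(Gamma, s_values))
        projected = {}
        for (y, alpha, x), m in theta.items():
            if s[y] in alpha:
                projected[(s[y], x)] = projected.get((s[y], x), 0.0) + m
        if not in_Theta(projected):
            return False
    return True
```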
The 2Γ variable of the bridge transform denotes the "facts
about computations realized by physics". In particular, if this
α∈2Γ takes the form {y∈Γ∣∀r∈R0:y(r)=y0(r)}
for some R0⊆R and y0∈ΣR0,
then we may say that the computations R0 are "realized"
and the computations R∖R0 are "not realized".
More generally, talking only about which computations are realized
is imprecise since α might involve "partial realization"
and/or "entanglement" between computations (i.e. not be of the
form above).
Intuitively, this definition expresses that the "computational universe"
can be freely modified as long as the "facts known by physics"
are preserved. However, that isn't what originally motivated the definition. The bulk of its justification comes from its
pleasing mathematical properties, discussed in the next section.
A physicalist agent should be equipped with a prior over physicalist
hypotheses. For simplicity, suppose it's a discrete Bayesian prior
(it is straightforward to generalize beyond this): hypothesis Θi
is assigned probability ζ(i) and ∑iζ(i)=1. Then,
we can consider the total bridge transform of the prior. It can't be given by mixing the hypotheses together and applying the bridge transform, because every hypothesis has its own choice of Φ, its own ontology, so you can't mix them before applying the bridge transform. You have to apply the bridge transform to each component first, forget about the choice of Φ via projecting it out, and then mix them afterwards. This receives further support from Proposition 2.13 which takes an alternate possible way of defining the bridge transform for mixtures (extend all the hypotheses from Γ×Φi to Γ×⨆iΦi in the obvious way so you can mix them first) and shows that it produces the same result.
Definition 1.2: Br(ζ⋉Θ):=∑iζ(i)prelΓBr(Θi)∈□elΓ
Evaluating policies
Given A⊆X, ⊤A∈□X is defined as {θ∈ΔcX∣suppθ⊆A}. I.e., it's total uncertainty over which point in A will be selected.
We need to assume that R contains programs representing the
agent itself. That is, there is some M∈N, dec:ΣM→A
and for each h∈H, ┌G(h)┐∈RM. Pretty much, if you have large numbers of actions available, and a limited number of symbols in your language, actions (and the outputs of other programs with a rich space of outputs) can be represented by M-tuples of programs, like "what's the first bit of this action choice" and "what's the second bit of this action choice" and so on. So, M is just how many bits you need, dec is the mapping from program outputs to the actual action, and ┌G(h)┐ is the M-tuple of programs which implements the computation "what does my source code do on input history h". The behavior of the agent in a particular mathematical universe is given by taking each program in ┌G(h)┐, using the mathematical universe to figure out what each of the programs outputs, and then using dec to convert that bitstring to an action.
Definition 1.3: Given y∈Γ and h∈H,
we define Gy(h)∈A to be dec(a) for a∈ΣM
given by ai:=y(┌G(h)┐i).
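A sketch of Definition 1.3 under the same assumed encoding; source(h) stands for the M-tuple ┌G(h)┐ and dec for the decoding map, both names ours.

```python
# G_y(h): the action the agent's source code takes on history h, according to
# mathematical universe y (a dict from programs to outputs in Sigma).
def agent_action(y, h, source, dec):
    programs = source(h)                     # the M-tuple of programs for G(h)
    outputs = tuple(y[r] for r in programs)  # a_i := y( "G(h)"_i )
    return dec(outputs)                      # decode the tuple into an action
```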
We can now extract counterfactuals from any Λ∈□elΓ.
Specifically, given any policy π we define some Cπ⊆elΓ (the set of stuff consistent with behaving according to π, for a yet-to-be-defined notion of consistency)
and define the counterfactual as Λ∩⊤Cπ.
We could use the "naive" definition:
Definition 1.4: Cπnaive is the set of all
(y,α)∈elΓ s.t. for any h∈H, Gy(h)=π(h).
Per discussion above, it seems better to use a different definition.
We will use the notation h⊏g to mean "h is a proper
prefix of g" and h⊑g to mean "h is a prefix
of g",
Definition 1.5: Given (y,α)∈elΓ, let
Hyα be the set of all h∈H s.t. the following two
conditions hold:
For all ga∈H×A and y′∈Γ where ga⊏h and y′∈α, Gy′(g)=a.
h⊏d for some d∈D s.t. dCy.
Cπfair is the set of all (y,α)∈elΓ
s.t. for any h∈Hyα, Gy(h)=π(h).
Here, condition 1 says that we only "take responsibility" for
the action on a particular history when the history was actually observed
(all preceding evaluations of the agent are realized computations).
Condition 2 says that we only "take responsibility" when the computer
is working correctly.
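To spell the definition out operationally, here is a sketch with our own encoding assumptions: a history is a tuple alternating actions and observations (actions at even indices), act(y, h) is a curried form of the Definition 1.3 sketch above, and destinies_of_y enumerates the destinies d with dCy, which is only feasible for toy instances.

```python
# h is in H_y^alpha iff (1) every action recorded in h is what the source code
# outputs in *every* universe of alpha, and (2) h extends to a destiny
# consistent with y.
def in_instantiated_histories(h, y, alpha, act, destinies_of_y):
    for i in range(0, len(h), 2):            # even indices hold actions
        g, a = h[:i], h[i]
        if any(act(y2, g) != a for y2 in alpha):
            return False                     # some prefix isn't instantiated
    return any(len(d) > len(h) and d[:len(h)] == h for d in destinies_of_y(y))

# (y, alpha) is in C^fair_pi iff the agent agrees with pi on every
# instantiated history.
def in_C_fair(y, alpha, pi, histories, act, destinies_of_y):
    return all(act(y, h) == pi(h)
               for h in histories
               if in_instantiated_histories(h, y, alpha, act, destinies_of_y))
```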
At this stage, Definition 1.5 should be regarded as tentative, since
we only have one result so far that validates this definition, namely that the set of (y,α) in Cπfair only depends on what the policy π does on possible inputs, rather than on what it does on impossible inputs where one of the past memorized actions is not what the policy would actually do. We hope to rectify this in future
articles.
Putting everything together: given a loss function L:elΓ→R, which depends on the mathematical universe and on which computations are realized or not realized, the loss of policy π is given by applying the bridge transform to coax the hypotheses into the appropriate form, intersecting with the set of possibilities consistent with the agent behaving according to the policy, and evaluating the expectation of the loss function, as detailed below.
Lpol(π,ζ):=(Br(ζ⋉Θ)∩⊤Cπfair)(L)
Evaluating agents
So far we regarded the agent's source code G and the agent's policy
π as independent variables, and explained how to evaluate different
policies given fixed G. However, in reality the policy is determined
by the source code. Therefore, it is desirable to have a way to evaluate
different codes. We achieve this using an algorithmic information theory
approach.
We also want to allow the loss function to depend on G. That is,
we postulate L:(RM)H×elΓ→R. Specifically, since RM can be thought of as the space of actions, (RM)H is basically the space of policies. In
section 3 we will see why in some detail, but for now think of the
difference between "maximizing my own happiness" and "maximizing
Alice's happiness": the first is defined relative to the agent
(depends on G) whereas the second is absolute (doesn't depend on
G). In particular, for a selfish agent that just cares about its own observations, its loss function must reference its own source code.
Definition 1.6: Denote G∗:H→A the policy actually implemented by G. Fix ξ∈Δ(AH). The physicalist intelligence of G relative to the baseline policy mixture ξ, prior ζ and loss function L is defined by:
g(G):=−lnξ({π∈AH∣Lpol(┌G┐,π,ζ)≤Lpol(┌G┐,G∗,ζ)})
Notice that Lpol depends on ┌G┐ in two ways:
through the direct dependence of L on ┌G┐ and
through Cπfair.
In particular, it makes sense to choose ζ and ξ as simplicity
priors.
There is no obvious way to define "physicalist AIXI", since we
cannot have g=∞. For one thing, g is not even
defined for uncomputable agents. In principle we could define it for
non-uniform
agents, but then we get a fixpoint problem that doesn't obviously
have a solution: finding a non-uniform agent G s.t. Lpol(┌G┐,G∗,ζ)=minπLpol(┌G┐,π,ζ). On the other hand, once we spell out the infinitary
(N=∞) version of the formalism, it should be possible to prove
the existence of agents with arbitrarily high finite g. That's
because our agent can use quining to access its own source code ┌G┐,
and then brute-force search for a policy π∗ϵ with Lpol(┌G┐,π∗ϵ,ζ)<infπ:H→ALpol(┌G┐,π,ζ)+ϵ.
2. Properties of the bridge transform
In this section we will not need to assume Γ is of the
form ΣR: unless stated otherwise, it will be any finite
set.
Sanity test
Proposition 2.1: For any Γ, Φ and Θ∈□c(Γ×Φ),
Br(Θ) exists and satisfies prΓ×ΦBr(Θ)=Θ.
In particular, if Θ∈□(Γ×Φ) then Br(Θ)∈□(elΓ×Φ).
Downwards closure
Roughly speaking, the bridge transform tells us which computations
are physically realized. But actually it only bounds this from one side: it can tell us that some computations are definitely realized, but never that a computation is definitely not realized. One explanation for why it must be so is: if you looked at the world in more detail, you might realize that there are small-scale, previously invisible features of the world which depend on novel computations. There is a direct tension between bounding both sides (i.e. being able to say definitively that a computation isn't instantiated) and having the desirable property that learning more about the small-scale structure of the universe narrows down the uncertainty. To formalize this, we require the following definitions:
Definition 2.1: Let X be a partially ordered set
(poset). Then the induced partial order on ΔcX is
defined as follows. Given θ,η∈ΔcX, θ⪯η
if and only if for any monotonically non-decreasing function f:X→R+,
θ(f)≤η(f).
This is also called the stochastic order (which is standard mathematical terminology). Intuitively, θ⪯η means that η has its measure further up in the poset than θ does. To make that intuition formal, we can also characterize the induced order as follows:
Proposition 2.2: Let X be a poset, θ,η∈ΔcX.
Then, θ⪯η if and only if there exists κ:X→ΔX
s.t.:
For all x∈X, y∈suppκ(x): x≤y.
κ∗θ≤η
Proposition 2.3: Let X be a poset, θ,η∈ΔcX.
Then, θ⪯η if and only if there exists κ:X→ΔX
s.t.:
For all x∈X, y∈suppκ(x): x≥y.
θ≤κ∗η
Or, in words, you can always go from θ to η by moving probability mass upwards from where it was, since κ(x) is always supported on the set of points at-or-above x.
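On a finite poset the induced order is easy to check by brute force: every nonnegative monotone f is a nonnegative combination of indicators of upward-closed sets, so it suffices to compare the mass of every upward-closed set. A sketch (names ours), feasible only for tiny posets:

```python
from itertools import combinations

def stochastically_below(theta, eta, points, leq):
    """theta is below eta in the induced order iff theta(U) <= eta(U) for every
    upward-closed U (indicators of such U generate the monotone f >= 0)."""
    def mass(contrib, U):
        return sum(m for x, m in contrib.items() if x in U)
    for r in range(len(points) + 1):
        for subset in combinations(points, r):
            U = set(subset)
            upward_closed = all(y in U for x in U for y in points if leq(x, y))
            if upward_closed and mass(theta, U) > mass(eta, U) + 1e-12:
                return False
    return True
```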
We can now formalize the claim that the bridge transform only bounds realization from one side. Let elΓ×Φ be
equipped with the following order. (y,α,x)≤(y′,α′,x′)
if and only if y=y′, x=x′ and α⊆α′. Then:
Proposition 2.4: For any Γ, Φ and Θ∈□c(Γ×Φ), Br(Θ)
is downwards closed w.r.t. the induced order on Δc(elΓ×Φ).
That is, if θ∈Br(Θ) and η⪯θ then η∈Br(Θ).
Simple special case
Let's consider the special case where there's only one program, which can produce two possible outputs, a 0 and a 1. And these two outputs map to two different distributions over physics outcomes in Φ. Intuitively, if the computation isn't realized/instantiated, the distribution over physics outcome should be identical, while if the computation is realized/instantiated, it should be possible to look at the physics results to figure out how the computation behaves. The two probability distributions may overlap some intermediate amount, in which case it should be possible to write the two probability distributions as a mixture between a probability distribution that behaves identically regardless of the program output (the "overlap" of the two distributions), and a pair of probability distributions corresponding to the two different program outputs which are disjoint. And the total variation distance (dTV(ϕ(0),ϕ(1))) between the two probability distributions is connected to the size of the distribution overlap. Proposition 2.5 makes this formal.
Proposition 2.5: Consider a finite set X, ϕ?∈ΔX,
ϕ!:{0,1}→ΔX, p∈[0,1] and ϕ:=(1−p)ϕ?+pϕ!.
Then, p≥dTV(ϕ(0),ϕ(1)). Conversely, consider any ϕ:{0,1}→ΔX.
Then, there exist ϕ?∈ΔX and ϕ!:{0,1}→ΔX
s.t. ϕ=(1−p)ϕ?+pϕ! for p:=dTV(ϕ(0),ϕ(1)).
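A sketch of the decomposition direction of Proposition 2.5 (names ours): given the two distributions as dicts, peel off their overlap and renormalize.

```python
# Decompose phi0, phi1 into a shared part phi_overlap and disjoint parts
# phi_disjoint[0], phi_disjoint[1], with mixture weight p = d_TV(phi0, phi1).
def decompose(phi0, phi1):
    keys = set(phi0) | set(phi1)
    overlap = {x: min(phi0.get(x, 0.0), phi1.get(x, 0.0)) for x in keys}
    p = 1.0 - sum(overlap.values())          # the total variation distance
    phi_overlap = {x: v / (1.0 - p) for x, v in overlap.items()} if p < 1 else None
    phi_disjoint = [
        {x: (phi.get(x, 0.0) - overlap[x]) / p for x in keys} if p > 0 else None
        for phi in (phi0, phi1)
    ]
    return p, phi_overlap, phi_disjoint

# decompose({"a": 0.5, "b": 0.5}, {"b": 0.5, "c": 0.5}) gives p = 0.5, the
# overlap concentrated on "b", and disjoint parts on "a" and on "c".
```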
The bridge transform should replicate this same sort of analysis. We can interpret the case of "total uncertainty over math, but knowing how physics turns out conditional on knowing how math turns out" by ⊤Γ⋉ϕ for some ϕ:Γ→ΔΦ. Taking the special case where Γ={0,1}, for one program that can output two possible answers, it should return the same sort of result: the two probability distributions can be decomposed into their common overlap and non-overlapping pieces, with the overlapping chunk of probability measure allocated to the event "the computation is not instantiated", and the disjoint pieces allocated to the event "the computation is instantiated". As it turns out, this does indeed happen, with the same total-variation-distance-based bound on the probability that the computation is unrealized (namely, 1−dTV(ϕ(0),ϕ(1))).
Proposition 2.6: Consider any Φ and ϕ:{0,1}→ΔΦ.
Denote U:={0,1}×{{0,1}}×Φ (the event "program
is unrealized"). Let Λ:=Br(⊤{0,1}⋉ϕ).
Then,
Λ(χU)=1−dTV(ϕ(0),ϕ(1))
Bound in terms of support
If the physical state is x∈Φ, then it is intuitively obvious
that the physically manifest facts about the computational universe
y∈Γ must include the fact that (y,x)∈suppΘ. Formally:
If X is a poset and Θ∈□cX, then Θ↓ will denote the downward closure of Θ.
Proposition 2.7: For any Γ, Φ and Θ∈□c(Γ×Φ), if (y,α,x)∈suppBr(Θ) and y′∈α, then (y′,x)∈suppΘ.
Moreover, define susΘ:Φ→2Γ by susΘ(x):={y∈Γ∣(y,x)∈suppΘ}.
Then, Br(Θ)⊆(Θ⋉susΘ)↓
(we slightly abuse notation by treating susΘ as a mapping
Γ×Φ→2Γ that doesn't depend on the
first argument, and also playing loose with the order of factors in
the set on which our HUCs live).
Putting that into words, all the (α,x) that are in the support of the bridge transform have α being a subset of the support of Θ restricted to x. The bridge transform knows not to provide α sets which are too large. Further, the bridge transform is smaller/has more information than the transform that just knows about what the support of Θ is.
Idempotence
Imagine replacing the physical state space Φ by the space of
manifest facts 2Γ. Intuitively, the latter should contain
the same information about computations as the former. Therefore,
if we apply the bridge transform again after making such a replacement,
we should get the same thing. Given any set X, we denote idX:X→X
the identity mapping and diagX:X→X×X the diagonal
mapping.
Proposition 2.8: For any Γ, Φ and Θ∈□c(Γ×Φ)
Br(prΓ×2ΓBr(Θ))=[(idΓ×diag2Γ)∗prΓ×2ΓBr(Θ)]↓
Refinement
If we refine a hypothesis (i.e. make it more informative) in the ultradistribution
sense, the bridge transform should also be more refined:
Proposition 2.9: For any Γ, Φ and Θ1,Θ2∈□c(Γ×Φ),
if Θ1⊆Θ2 then Br(Θ1)⊆Br(Θ2).
Another way of refining a hypothesis is refining the ontology, i.e.
moving to a richer state space. For example, we can imagine one hypothesis
that describes the world in terms of macroscopic objects and another
hypothesis that describes the worlds in term of atoms which are otherwise
perfectly consistent with each other. In this case, we also expect
the bridge transform to become more refined. This desiderata is where the downwards closure comes from, because it's always possible to pass to a more detailed view of the world and new computations could manifest then. It's also worth noting that, if you reduce your uncertainty about math, or physics, or move to a more detailed state space, that means more computations are instantiated, and the loss goes down as a result.
Proposition 2.10: For any Γ, Φ1, Φ2,
t:Φ2→Φ1 and Θ∈□c(Γ×Φ2)
(idelΓ×t)∗Br(Θ)⊆Br((idΓ×t)∗Θ)
In general, when we refine the ontology, the manifest facts can become
strictly refined: the richer state "knows more" about computations
than the poorer state. However, there is a special case when we don't
expect this to happen, namely when the rich state depends on the computational
universe only through the poor state. Roughly speaking, in this case
once we have the poor state we can sample the rich state without evaluating
any more programs. To formalize this, it is convenient to introduce
pullback of HUCs, which is effectively the probabilistic version of taking a preimage.
Definition 2.2: Let X,Y be finite sets and t:X→Y
a mapping. We define the pullback[7] operator
t∗:□cY→□cX by
t∗Θ:={θ∈ΔcX∣t∗θ∈Θ}
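Operationally (a sketch with our names, in_Theta again an assumed membership oracle): θ belongs to the pullback exactly when its pushforward along t belongs to Θ.

```python
from collections import defaultdict

def in_pullback(theta, t, in_Theta):
    pushed = defaultdict(float)
    for x, mass in theta.items():
        pushed[t(x)] += mass          # compute t_* theta
    return in_Theta(dict(pushed))     # theta is in t* Theta iff t_* theta is in Theta
```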
Proposition 2.11: Consider any Γ, Φ1,
Φ2, t:Φ2→Φ1, Ξ:Φ1→□cΦ2
and Θ∈□c(Γ×Φ1) s.t. tΞ=idΦ1.
Then,
(idelΓ×Ξ)∗Br(Θ)⊆Br((idΓ×Ξ)∗Θ)⊆(idelΓ×t)∗Br(Θ)
In particular,
prelΓBr(Θ)=prelΓBr((idΓ×Ξ)∗Θ)
In other words, we can think of Φ1 as the coarse-grained version, and Φ2 as the fine-grained version, and t as the coarse-graining function, and Ξ as the function mapping a coarse grained state to some uncertainty over what the corresponding fine-grained state is. Θ is your starting uncertainty over computations and the coarse-grained state. Then, in order from most to least informative, you could apply Ξ after the bridge transform, apply Ξ before the bridge transform, or pull back along t after the bridge transform (because the pullback is the most uninformative inverse, in a sense). But all this stuff only affects the information about the fine-grained state Φ2. If you were to forget about that, the data about computations and how they're instantiated is the same no matter whether you do the bridge transform with the coarse state, or try your best to fill in what the fine state is first, because the "fill in the fine state" function doesn't depend on your uncertainty about computations, so it doesn't contain any more information about how math is connected to physics than the original coarse-grained view had.
Mixing hypotheses
In general, we cannot expect the bridge transform to commute with
taking probabilistic mixtures. For example, let Γ={0,1},
Φ={0,1}, Θ1=⊤{00,11}, Θ2=⊤{
This is joint work by Vanessa Kosoy and Alexander "Diffractor" Appel. For the proofs, see 1 and 2.
TLDR: We present a new formal decision theory that realizes naturalized induction. Our agents reason in terms of infra-Bayesian hypotheses, the domain of which is the cartesian product of computations and physical states, where the ontology of "physical states" may vary from one hypothesis to another. The key mathematical building block is the "bridge transform", which, given such a hypothesis, extends its domain to "physically manifest facts about computations". Roughly speaking, the bridge transforms determines which computations are executed by the physical universe. In particular, this allows "locating the agent in the universe" by determining on which inputs its own source is executed.
0. Background
The "standard model" of ideal agency is Bayesian reinforcement learning, and more specifically, AIXI. We challenged this model before due to its problems with non-realizability, suggesting infra-Bayesianism as an alternative. Both formalisms assume the "cartesian cybernetic framework", in which (i) the universe is crisply divided into "agent" and "environment" and (ii) the two parts interact solely via the agent producing actions which influence the environment and the environment producing observations for the agent. This is already somewhat objectionable on the grounds that this division is not a clearly well-defined property of the physical universe. Moreover, once we examine the structure of the hypothesis such an agent is expected to learn (at least naively), we run into some concrete problems.
The modern understanding of the universe is that no observer plays a privileged role[1]. Therefore, the laws of physics are insufficient to provide a cartesian description of the universe, and must, to this end, be supplemented with "bridge rules" that specify the agent's location inside the universe. That is, these bridge rules need to translate the fundamental degrees of freedom of a physical theory (e.g. quantum wavefunction) to the agent's observations (e.g. values of pixels on a camera), and translate the agent's actions (e.g. signal to robot manipulators) in the other direction[2]. The cost of this is considerable growth in the description complexity of the hypothesis.
A possible retort is something like "how many bits can it really take to pick out the computer within the universe state and describe how to translate universe state to observations? Sure, it might be a considerable chunk of data. But, since Solomonoff induction only needs to make a kilobyte worth of predictive mistakes to learn to predict the inputs as well as any thousand-bit predictor, it should learn this sort of stuff pretty fast." This objection is addressed more later on in this post. A first counter-objection is that, for practical algorithms, this can be a bigger obstacle, especially when sample complexity scales superlinearly with description complexity. For example, the Russo-Van Roy regret bound for Thompson sampling in multi-armed bandits has the time horizon necessary to get a particular regret bound scale as the square of entropy, and the computational complexity cost can also go way up as the description complexity rises, because you need to evaluate more hypotheses more times to test them. For reinforcement learning it's even worse, as in the case of a game where an agent must enter an n-digit password and if it gets it right, it gets reward, and if it gets it wrong, it does not get reward and can try again. The learning time for this game scales as Ω(2n), far superlinearly with description complexity.
Another assumption of AIXI is the simplicity prior, and we expect some form of this assumption to persist in computable and infra-Bayesian analogues. This reflects the intuitive idea that we expect the world to follow simple laws (or at least contain simple patterns). However, from the cartesian perspective, the "world" (i.e. the environment) is, prima facie, not simple at all (because of bridge rules)! Admittedly, the increase in complexity from the bridge rule is low compared to the cost of specifying the universe state, but once the agent learns the transition rules and bridge rule for the universe it's in, learning the state of the universe in addition doesn't seem to yield any particular unforeseen metaphysical difficulties. Further, the description complexity cost of the bridge rule seems likely to be above the description complexity of the laws of physics.
Hence, there is some disconnect between the motivation for using a simplicity prior and its implementation in a cartesian framework.
Moreover, if the true hypothesis is highly complex, it implies that the sample complexity of learning it is very high. And, as previously mentioned, the sample complexity issues are worse in practice than Solomonoff suggests. This should make us suspect that such a learning process is not properly exploiting Occam's razor. Intuitively, such an agent a-priori considers it equally plausible to discover itself to be a robot or to discover itself to be a random clump of dust in outer space, since it's about as hard to specify a bridge rule interface between the computer and observations as it is to specify a bridge rule interface between the dust clump and observations and needs a lot of data to resolve all those possibilities for how its observations connect to a world. Also, though Solomonoff is extremely effective at slicing through the vast field of junk hypotheses that do not describe the thing being predicted, once it's whittled things down to a small core of hypotheses that do predict things fairly accurately, the data to further distinguish between them may be fairly slow in coming. If there's a simple predictor of events occurring in the world but it's running malign computation, then you don't have the luxury of 500 bits of complexity wiggle room (to quickly knock out this hypothesis), because that's a factor of 2500 probability difference. Doing worst-case hypothesis testing as in KWIK learning would require a very aggressive threshold indeed, and mispredictions can be rare but important.
Furthermore, some events are simple to describe from the subjective (cartesian) point of view, but complex to describe from an objective (physical) point of view. (For example, all the pixels of the camera becoming black.) Modifying a hypothesis by positing exceptional behavior following a simple event only increases its complexity by the difficulty of specifying the event and what occurs afterwards, which could be quite low. Hence, AIXI-like agents would have high uncertainty about the consequences of observationally simple events. On the other hand, from an objective perspective such uncertainty seems irrational. (Throwing a towel on the camera should not break physics.) In other words, cartesian reasoning is biased to privilege the observer.
Yet another failure of cartesian agents is the inability to reason about origin theories. When we learn that a particular theory explains our own existence (e.g. evolution), this serves as a mark in favor of the theory. We can then exploit this theory to make useful predictions or plans (e.g. anticipate that using lots of antibiotics will cause bacteria to develop resistance). However, for a cartesian agent the question of origins is meaningless. Such an agent perceives its own existence as axiomatic, hence there is nothing to explain.
Finally, cartesian agents are especially vulnerable to acausal attacks. Suppose we deploy a superintelligent Cartesian AI called Kappa. And, imagine a superintelligent agent Mu that inhabits some purely hypothetical universe. If Mu is motivated to affect our own (real) universe, it can run simulations of Kappa's environment. Kappa, who doesn't know a priori whether it exists in our universe or in Mu's universe, will have to seriously consider the hypothesis it is inside such a simulation. And, Mu will deploy the simulation in such manner as to make the simulation hypothesis much simpler, thanks to simpler bridge rules. This will cause Kappa to become overwhelmingly confident that it is in a simulation. If this is achieved, Mu can cause the simulation to diverge from our reality in a strategically chosen point such that Kappa is induced to take an irreversible action in Mu's favor (effectively a treacherous turn). Of course, this requires Kappa to predict Mu's motivations in some detail. This is possible if Kappa develops a good enough understanding of metacosmology.
An upcoming post by Diffractor will discuss acausal attacks in more detail.
In the following sections, we will develop a "physicalist" formalism that entirely replaces the cartesian framework, curing the abovementioned ills, though we have not yet attained the stage of proving improved regret bounds with it, just getting the basic mathematical properties of it nailed down. As an additional benefit, it allows naturally incorporating utility functions that depend on unobservables, thereby avoiding the problem of "ontological crises". At the same time, it seems to impose some odd constraints on the utility function. We discuss the possible interpretations of this.
1. Formalism
Notation
It will be more convenient to use ultradistributions rather than infradistributions. This is a purely notational choice: the decision theory is unaffected, since we are going to apply these ultradistributions to loss functions rather than utility functions. As support for this claim, Diffractor originally wrote down most of the proofs in infradistribution form, and then changing their form for this post was rather straightforward to do. In addition, for the sake of simplicity, we will stick to finite sets: more general spaces will be treated in a future article. So far, we're up to countable products of finite sets.
We denote R+:=[0,∞). Given a finite set X, a contribution on X is θ:X→R+ s.t. ∑xθ(x)≤1 (it's best to regard it as a measure on X). The space of contributions is denoted ΔcX. Given f:X→R and θ∈ΔcX, we denote θ(f):=∑xθ(x)f(x). There is a natural partial order on contributions: θ1≤θ2 when ∀x∈X:θ1(x)≤θ2(x). Naturally, any distribution is in particular a contribution, so ΔX⊆ΔcX. A homogenous ultracontribution (HUC) on X is non-empty closed convex Θ⊆ΔcX which is downward closed w.r.t. the partial order on ΔcX. The space of HUCs on X is denoted □cX. A homogenous ultradistribution (HUD) on X is a HUC Θ s.t. Θ∩ΔX≠∅. The space of HUDs on X is denoted □X. Given f:X→R and Θ∈□cX, we denote Θ(f):=maxθ∈Θθ(f).
Given s:X→Y, s∗:□cX→□cY is the pushforward by s:
s∗Θ:={s∗θ∣θ∈Θ}
(s∗θ)(y):=∑x∈s−1(y)θ(x)
Given Ξ:X→□cY, Ξ∗:□cX→□cY is the pushforward by Ξ:
Ξ∗Θ:={κ∗θ∣θ∈Θ,κ:X→ΔcY,∀x∈X:κ(x)∈Ξ(x)}
(κ∗θ)(y):=∑x∈Xθ(x)κ(x;y)
prX:X×Y→X is the projection mapping and pr−Y:=prX. We slightly abuse notation by omitting the asterisk in pushforwards by these.
Given Θ∈□cX and Ξ:X→□cY, Θ⋉Ξ∈□c(X×Y) is the semidirect product:
Θ⋉Ξ:={κ⋉θ∣θ∈Θ,κ:X→ΔcY,∀x∈X:κ(x)∈Ξ(x)}
(κ⋉θ)(x,y):=θ(x)κ(x;y)
We will also use the notation Ξ⋊Θ∈□c(Y×X) for the same HUC with X and Y flipped. And, for Λ∈□cY, Θ⋉Λ∈□c(X×Y) is the semidirect product of Θ with the constant ultrakernel whose value is Λ[3].
For more discussion of HUDs, see previous article, where we used the equivalent concept of "cohomogenous infradistribution".
Notation Reference
If you got lost somewhere and wanted to scroll back to see some definition, or see how the dual form of this with infradistributions works, that's what this section is for.
θ is a contribution, a measure with 1 or less measure in total. The dual concept is an a-measure (λμ,b) with λ+b=1.
Θ is a HUC (homogenous ultracontribution) or HUD (homogenous ultradistribution), a closed convex downwards closed set of contributions. The dual concepts are cohomogenous inframeasure and cohomogenous infradistribution, respectively.
ΔcX,□cX,□X are the spaces of contributions, homogenous ultracontributions, and homogenous ultradistributions respectively.
θ(f),Θ(f) are the expectations of functions f:X→[0,1], defined in the usual way. For θ, it's just the expectation of a function w.r.t. a measure, and for Θ, it's Θ(f):=maxθ∈Θθ(f), to perfectly parallel a-measures evaluating functions by just taking expectation, and inframeasures evaluating functions via min(m,b)∈Ψm(f)+b.
≤ is the ordering on contributions/HUC's/HUD's, which is the function ordering, where θ≤θ′ iff for all f:X→[0,1], θ(f)≤θ′(f), and similar for the HUC's. Inframeasures are equipped with the opposite ordering.
s∗Θ is the pushforward along the function s:X→Y. This is a standard probability theory concept which generalizes to all the infra and ultra stuff.
Ξ∗Θ is the pushforward of Θ along the ultrakernel Ξ:X→□cY. This is just the generalization to infra and ultra stuff of the ability to push a probability distribution on X through a probabilistic function X→ΔY to get a probability distribution on Y.
Θ⋉Ξ is the semidirect product of Θ and Ξ, an element of □c(X×Y). This is the generalization of the ability to take a probability distribution on X, and a probabilistic kernel X→ΔY, and get a joint distribution on X×Y.
A,O are the set of actions and observations.
N is the time horizon.
R is the space of programs, Σ is the space of outputs.
Γ=ΣR is the space of functions from programs to the results they output. An element of Γ can be thought of as the state of the "computational universe", for it specifies what all the programs do.
H is the space of histories, action-observation sequences that can end at any point, ending with an observation.
D is the space of destinies, action-observation sequences that are as long as possible, going up to the time horizon.
C is a relation on Γ×D that says whether a computational universe is consistent with a given destiny. A very important note is that this is not the same thing as "the policy is consistent with the destiny" (the policy's actions are the same as what the destiny advises). This is saying something more like "if the destiny has an observation that the computer spat out result 1 when run on computation A, then that is only consistent with mathematical universes which have computation A outputting result 1". Except we don't want to commit to the exact implementation details of it, so we're leaving it undefined besides just "it's a relation"
Φ is the space of "physics outcomes", it can freely vary depending on the hypothesis. It's not a particular fixed space.
Θ is the variable typically used for physicalist hypotheses, elements of □(Γ×Φ). Your uncertainty over the joint distribution over the computational universe and the physical universe.
G is the code of the program-which-is-the-agent. So, G(h) would be the computation that runs what the agent does on history h, and returns its resulting action.
elΓ is the subset of the space Γ×2Γ consisting of (y,α) pairs s.t. y∈α. The y can be thought of as the mathematical universe, and α can be thought of as the set of mathematical universes that are observationally indistinguishable from it.
χA is the indicator function that's 1 on the set A, and 0 everywhere else. It can be multiplied by a measure, to get the restriction of a measure to a particular set.
Br(Θ) is the bridge transform of Θ, defined in definition 1.1, an ultracontribution over the space elΓ×Φ.
Hyα is the set of instantiated histories, relative to mathematical universe y, and set of observationally indistinguishable universes α. It's the set of histories h where all the "math universes" in α agree on how the agent's source code reacts to all the prefixes of h, and where the history h can be extended to some destiny that's consistent with math universe y. Ie, for a history to be in here, all the prefixes have to be instantiated, and it must be consistent with the selected math universe.
Setting
As in the cartesian framework, we fix a finite set A of actions and a finite set O of observations. We assume everything happens within a fixed finite[4] time horizon N∈N. We assume that our agent has access to a computer[5] on which it can execute some finite[4:1] set of programs R with outputs in a finite alphabet Σ. Let Γ:=ΣR be the set of "possible computational universes"[6]. We denote H:=(A×O)<N (the set of histories) and D:=(A×O)N−1×A (the set of "destinies").
To abstract over the details of how the computer is operated, we assume a relation C⊆D×Γ whose semantics is, dCy (our notation for (d,y)∈C) if and only if destiny d is a priori consistent with computational universe y. For example, suppose some a∈A implies a command to execute program r∈R,and if o∈O follows a, it implies observing the computer return output i∈Σ for r. Then, if d contains the substring ao and dCy, it must be the case that y(r)=i.
A physicalist hypothesis is a pair (Φ,Θ), where Φ is a finite[4:2] set representing the physical states of the universe and Θ∈□(Γ×Φ) represents a joint belief about computations and physics. By slight abuse of notation we will refer to such Θ as a physicalist hypothesis, understanding Φ to be implicitly specified. Our agent will have a prior over such hypotheses, ranging over different Φ.
Two questions stand out to us at this point. The first is, what is the domain over which our loss function should be defined? The second is, how do we define the counterfactuals corresponding to different policies π:H→A? The answers to both questions turn out to require the same mathematical building block.
For the first question, we might be tempted to identify Φ as our domain. However, prima facie this doesn't make sense, since Φ is hypothesis-dependent. This is the ontological crisis problem: we expect the agent's values to be defined within a certain ontology which is not the best ontology for formulating the laws of the physical universe. For example, a paperclip maximizer might benefit from modeling the universe in terms of quantum fields rather than paperclips. In principle, we can circumvent this problem by requiring our Φ to be equipped with a mapping ν:Φ→Φ0,where Φ0 is the "axiological" ontology. However, this ν is essentially a bridge rule, carrying with it all the problems of bridge rules: acausal attack is performed by the adversarial hypothesis imprinting the "axiological rendering" of the target universe on the microscopic degrees of freedom of the source universe in order to have a low-complexity Φ→Φ0 function; the analogue of the towel-on-camera issue is that, once you've already coded up your uncertainty over math and physics, along with how to translate from physics to the ontology of value, it doesn't take too much extra complexity to tie "axiologically-simple" results (the analogue of low-complexity observations) to physics-simple results (the analogue of a low-complexity change in what happens), like "if all the paperclips are red, the fine-structure constant doubles in value".
Instead, we will take a computationalist stance: value is a not property of physical states or processes, but of the computations realized by physical processes. For example, if our agent is "selfish" in the sense that, rewards/losses are associated purely with subjective histories, the relevant computation is the agent's own source code. Notice that, for the program-which-is-the-agent G, histories are input. Hence, given loss function l:H→R we can associate the loss l(h) with the computation G(h). Admittedly, there is an implicit assumption that the agent has access to its own source code, but modal decision theory made the same assumption. For another example, if our agent is a diamond maximizer, then the relevant computations are simulations of the physics used to define "diamonds". A more concrete analogue of this is worked out in detail in section 3, regarding Conway's Game of Life.
For the second question, we might be tempted to follow updateless decision theory: counterfactuals correspond to conditioning on G=π. Remember, G is the code of the agent. However, this is not "fair", since it requires the agent to be "responsible" for copies of itself instantiated with fake memories. Such a setting admits no learning-theoretic guarantees, since learning requires trusting your own memory. (Moreover, the agent also has to be able to trust the computer.) Therefore, our counterfactuals should only impose G(h)=π(h) when h is a "real memory", which we again interpret through computationalism: h is real if and only if G(h′) is physically realized for every prefix h′ of h.
Both of our answers require a formalization of the notion "assuming hypothesis Θ, this computation is physically realized". More precisely, we should allow for computations to be realized with certain probabilities, and more generally allow for ultradistributions over which computations are realized. We will now accomplish this formalization.
Bridge transform
Given any set A, we denote elA={(a,B)∈A×2A∣a∈B}. supp stands for "support" and χA is the characteristic function of A.
Definition 1.1: Let Γ,Φ be finite sets and Θ∈□c(Γ×Φ). The bridge transform of Θ is BrΓ(Θ)∈□c(Γ×2Γ×Φ) s.t. θ∈BrΓ(Θ) if and only if
1. suppθ⊆elΓ×Φ
2. For any s:Γ→Γ, prΓ×ΦχelΓ×Φ(s×id2Γ×Φ)∗θ∈Θ.
We will use the notation Br(Θ) when Γ is obvious from the context.
Notice that we are multiplying by χelΓ×Φ, not pushforwarding.
The 2Γ variable of the bridge transform denotes the "facts about computations realized by physics". In particular, if this α∈2Γ takes the form {y∈Γ∣∀r∈R0:y(r)=y0(r)} for some R0⊆R and y0∈ΣR0, then we may say that the computations R0 are "realized" and the computations R∖R0 are "not realized". More generally, talking only about which computations are realized is imprecise since α might involve "partial realization" and/or "entanglement" between computations (i.e. not be of the form above).
Intuitively, this definition expresses that the "computational universe" can be freely modified as long as the "facts known by physics" are preserved. However, that isn't what originally motivated the definition. The bulk of its justification comes from its pleasing mathematical properties, discussed in the next section.
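To make Definition 1.1 concrete, here is a brute-force membership test on tiny finite sets. This is a minimal sketch: Θ is represented abstractly as a membership predicate on contributions, and the data representations are our own choices, not part of the formalism.

```python
from itertools import product

def in_bridge_transform(theta, Gamma, in_Theta, tol=1e-12):
    """Brute-force membership test for Definition 1.1, for tiny finite sets.
    theta: contribution on Gamma x 2^Gamma x Phi, as a dict mapping
           (y, alpha, x) -> mass, where alpha is a frozenset of Gamma-elements.
    in_Theta: membership predicate for Theta, taking a dict (y, x) -> mass."""
    # Condition 1: support contained in el(Gamma) x Phi, i.e. y is in alpha.
    if any(m > tol and y not in alpha for (y, alpha, x), m in theta.items()):
        return False
    # Condition 2: for every map s: Gamma -> Gamma, push y forward along s,
    # keep only the mass where s(y) still lies in alpha (the characteristic
    # function of el(Gamma)), project to Gamma x Phi, and test membership in Theta.
    Gamma = list(Gamma)
    for outputs in product(Gamma, repeat=len(Gamma)):
        s = dict(zip(Gamma, outputs))
        pushed = {}
        for (y, alpha, x), m in theta.items():
            if s[y] in alpha:
                key = (s[y], x)
                pushed[key] = pushed.get(key, 0.0) + m
        if not in_Theta(pushed):
            return False
    return True
```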
A physicalist agent should be equipped with a prior over physicalist hypotheses. For simplicity, suppose it's a discrete Bayesian prior (it is straightforward to generalize beyond this): hypothesis Θi is assigned probability ζ(i) and ∑iζ(i)=1. Then, we can consider the total bridge transform of the prior. It can't be given by mixing the hypotheses together and then applying the bridge transform, because every hypothesis has its own choice of Φ, its own ontology, so the hypotheses can't be mixed before the transform. You have to apply the bridge transform to each component first, forget about the choice of Φ by projecting it out, and then mix them afterwards. This receives further support from Proposition 2.13, which takes an alternative way of defining the bridge transform for mixtures (extend all the hypotheses from Γ×Φi to Γ×⨆iΦi in the obvious way, so they can be mixed first) and shows that it produces the same result.
Definition 1.2: Br(ζ⋉Θ):=∑iζ(i)prelΓBr(Θi)∈□elΓ
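In code (assuming, purely for illustration, that each projected piece prelΓBr(Θi) is represented by a finite list of spanning contributions), the mixture of Definition 1.2 looks like this:

```python
from itertools import product

def total_bridge(zeta, pieces):
    """Definition 1.2 (sketch): mix the already-projected bridge transforms.
    zeta: dict i -> prior probability; pieces: dict i -> list of contributions
    on el(Gamma), each a dict (y, alpha) -> mass, taken to span
    pr_elGamma Br(Theta_i). Returns spanning points of the mixture."""
    indices = sorted(zeta)
    mixed = []
    # A point of the mixture picks one contribution from each hypothesis
    # and combines them with weights zeta(i).
    for choice in product(*(pieces[i] for i in indices)):
        theta = {}
        for i, piece in zip(indices, choice):
            for point, m in piece.items():
                theta[point] = theta.get(point, 0.0) + zeta[i] * m
        mixed.append(theta)
    return mixed
```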
Evaluating policies
Given A⊆X, ⊤A∈□X is defined as {θ∈ΔcX∣suppθ⊆A}. I.e., it's total uncertainty over which point in A will be selected.
We need to assume that R contains programs representing the agent itself. That is, there is some M∈N, dec:ΣM→A and for each h∈H, ┌G(h)┐∈RM. Pretty much, if you have a large number of actions available and a limited number of symbols in your language, actions (and the outputs of other programs with a rich space of outputs) can be represented by M-tuples of programs, like "what's the first bit of this action choice", "what's the second bit of this action choice", and so on. So, M is just how many symbols you need, dec is the mapping from program outputs to the actual action, and ┌G(h)┐ is the M-tuple of programs which implements the computation "what does my source code do on input history h". The behavior of the agent in a particular mathematical universe is given by taking each program in ┌G(h)┐, using the mathematical universe to figure out what each of those programs outputs, and then using dec to convert that tuple of outputs to an action.
Definition 1.3: Given y∈Γ and h∈H, we define Gy(h)∈A to be dec(a) for a∈ΣM given by ai:=y(┌G(h)┐i).
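In code, Definition 1.3 is a one-liner (a sketch; `G_programs` and `dec` stand in for whatever encoding the agent actually uses, and the names are hypothetical):

```python
def agent_action(y, h, G_programs, dec):
    """G_y(h): the agent's action in computational universe y on history h.
    y: dict mapping each program to its output symbol in Sigma.
    G_programs(h): the M-tuple of programs asking 'what does G do on h?'.
    dec: function from M-tuples of symbols to actions in A."""
    return dec(tuple(y[r] for r in G_programs(h)))
```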
We can now extract counterfactuals from any Λ∈□elΓ. Specifically, given any policy π we define some Cπ⊆elΓ (the set of stuff consistent with behaving according to π, for a yet-to-be-defined notion of consistency) and define the counterfactual as Λ∩⊤Cπ. We could use the "naive" definition:
Definition 1.4: Cπnaive is the set of all (y,α)∈elΓ s.t. for any h∈H, Gy(h)=π(h).
Per the discussion above, it seems better to use a different definition. We will use the notation h⊏g to mean "h is a proper prefix of g" and h⊑g to mean "h is a prefix of g".
Definition 1.5: Given (y,α)∈elΓ, let Hyα be the set of all h∈H s.t. the following two conditions hold:
1. For all ga∈H×A and y′∈Γ where ga⊏h and y′∈α, Gy′(g)=a.
2. h⊏d for some d∈D s.t. dCy.
Cπfair is the set of all (y,α)∈elΓ s.t. for any h∈Hyα, Gy(h)=π(h).
Here, condition 1 says that we only "take responsibility" for the action on a particular history when the history was actually observed (all preceding evaluations of the agent are realized computations). Condition 2 says that we only "take responsibility" when the computer is working correctly.
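A direct transcription of Definition 1.5 for small finite sets might look like this (a sketch: histories and destinies are flat alternating tuples as in the earlier `consistent` sketch, `agent_action` here is a curried (y, h) ↦ Gy(h), and `pi` maps histories to actions):

```python
def fair_histories(y, alpha, histories, destinies, consistent, agent_action):
    """H_{y,alpha}: the histories the agent 'takes responsibility' for."""
    real = []
    for h in histories:  # h is a flat tuple (a0, o0, a1, o1, ...)
        # Condition 1: every remembered action is what G outputs on the
        # corresponding prefix, in every universe y2 still allowed by alpha.
        cond1 = all(agent_action(y2, h[:k]) == h[k]
                    for k in range(0, len(h), 2) for y2 in alpha)
        # Condition 2: h extends to a destiny consistent with y, i.e. the
        # computer was working correctly.
        cond2 = any(len(h) < len(d) and tuple(d[:len(h)]) == tuple(h)
                    and consistent(d, y) for d in destinies)
        if cond1 and cond2:
            real.append(h)
    return real

def in_C_fair(y, alpha, pi, histories, destinies, consistent, agent_action):
    """Membership of (y, alpha) in C^pi_fair."""
    return all(agent_action(y, h) == pi(h)
               for h in fair_histories(y, alpha, histories, destinies,
                                       consistent, agent_action))
```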
At this stage, Definition 1.5 should be regarded as tentative, since we only have one result so far that validates it: the set of (y,α) in Cπfair depends only on what the policy π does on possible inputs, rather than on what it does on impossible inputs where one of the past memorized actions is not what the policy would actually do. We hope to rectify this in future articles.
Putting everything together: given a loss function L:elΓ→R, which depends on the mathematical universe and on which computations are realized or not realized, the loss of policy π is given by applying the bridge transform to coax the hypotheses into the appropriate form, intersecting with the set of possibilities consistent with the agent behaving according to π, and evaluating the expectation of the loss function, as detailed below.
Lpol(π,ζ):=(Br(ζ⋉Θ)∩⊤Cπfair)(L)
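If the total bridge transform is represented by a finite list of contributions (as in the sketches above), intersecting with ⊤Cπfair keeps exactly those contributions whose support lies inside Cπfair, and the ultradistribution evaluates L by a supremum of expectations. Note that filtering a finite list of points can under-approximate the true convex intersection, so this is only illustrative:

```python
def policy_loss(candidates, in_C_fair_pred, L, tol=1e-12):
    """L_pol (sketch): sup of E_theta[L] over contributions theta from the
    total bridge transform whose support lies inside C^pi_fair.
    candidates: list of dicts (y, alpha) -> mass; L: function (y, alpha) -> loss."""
    feasible = [theta for theta in candidates
                if all(m <= tol or in_C_fair_pred(y, alpha)
                       for (y, alpha), m in theta.items())]
    if not feasible:  # empty intersection: sup over the empty set
        return float("-inf")
    return max(sum(m * L(y, alpha) for (y, alpha), m in theta.items())
               for theta in feasible)
```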
Evaluating agents
So far we regarded the agent's source code G and the agent's policy π as independent variables, and explained how to evaluate different policies given fixed G. However, in reality the policy is determined by the source code. Therefore, it is desirable to have a way to evaluate different codes. We achieve this using an algorithmic information theory approach.
We also want to allow the loss function to depend on G. That is, we postulate L:(RM)H×elΓ→R. Specifically, since RM can be thought of as the space of actions, (RM)H is basically the space of policies. In section 3 we will see why in some detail, but for now think of the difference between "maximizing my own happiness" and "maximizing Alice's happiness": the first is defined relative to the agent (depends on G) whereas the second is absolute (doesn't depend on G). In particular, for a selfish agent that just cares about its own observations, its loss function must reference its own source code.
Definition 1.6: Denote G∗:H→A the policy actually implemented by G. Fix ξ∈Δ(AH). The physicalist intelligence of G relative to the baseline policy mixture ξ, prior ζ and loss function L is defined by:
g(G∣ξ;ζ,L):=−logPrπ∼ξ[Lpol(┌G┐,π,ζ)≤Lpol(┌G┐,G∗,ζ)]
Notice that Lpol depends on ┌G┐ in two ways: through the direct dependence of L on ┌G┐ and through Cπfair.
In particular, it makes sense to choose ζ and ξ as simplicity priors.
There is no obvious way to define "physicalist AIXI", since we cannot have g=∞. For one thing, g is not even defined for uncomputable agents. In principle we could define it for non-uniform agents, but then we get a fixpoint problem that doesn't obviously have a solution: finding a non-uniform agent G s.t. Lpol(┌G┐,G∗,ζ)=minπLpol(┌G┐,π,ζ). On the other hand, once we spell out the infinitary (N=∞) version of the formalism, it should be possible to prove the existence of agents with arbitrarily high finite g. That's because our agent can use quining to access its own source code ┌G┐, and then brute-force search for a policy π∗ϵ with Lpol(┌G┐,π∗ϵ,ζ)<infπ:H→ALpol(┌G┐,π,ζ)+ϵ.
2. Properties of the bridge transform
In this section we will not need to assume Γ is of the form ΣR: unless stated otherwise, it will be any finite set.
Sanity test
Proposition 2.1: For any Γ, Φ and Θ∈□c(Γ×Φ), Br(Θ) exists and satisfies prΓ×ΦBr(Θ)=Θ. In particular, if Θ∈□(Γ×Φ) then Br(Θ)∈□(elΓ×Φ).
Downwards closure
Roughly speaking, the bridge transform tells us which computations are physically realized. But actually it only bounds realization from one side: some computations are definitely realized, but any computation might turn out to be realized. One explanation for why it must be so is: if you looked at the world in more detail, you might realize that there are small-scale, previously invisible features of the world which depend on novel computations. There is a direct tension between bounding both sides (i.e. being able to say definitively that a computation isn't instantiated) and having the desirable property that learning more about the small-scale structure of the universe narrows down the uncertainty. To formalize this, we require the following definitions:
Definition 2.1: Let X be a partially ordered set (poset). Then the induced partial order on ΔcX is defined as follows. Given θ,η∈ΔcX, θ⪯η if and only if for any monotonically non-decreasing function f:X→R+, θ(f)≤η(f).
This is also called the stochastic order (which is standard mathematical terminology). Intuitively, θ⪯η means that η has its measure further up in the poset than θ does. To make that intuition formal, we can also characterize the induced order as follows:
Proposition 2.2: Let X be a poset, θ,η∈ΔcX. Then, θ⪯η if and only if there exists κ:X→ΔX s.t.:
1. For all x∈X, y∈suppκ(x): x≤y.
2. κ∗θ≤η
Proposition 2.3: Let X be a poset, θ,η∈ΔcX. Then, θ⪯η if and only if there exists κ:X→ΔX s.t.:
1. For all x∈X, y∈suppκ(x): x≥y.
2. θ≤κ∗η
Or, in words, you can always go from θ to η by moving probability mass upwards from where it was, since κ(x) is always supported on the set of points at-or-above x.
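For finite posets the induced order can also be checked directly: a monotone f:X→R+ is a nonnegative combination of indicators of upward-closed sets (by the layer-cake decomposition, and each superlevel set of a monotone function is upward closed), so it suffices to compare masses on every upward-closed subset. A brute-force sketch:

```python
from itertools import chain, combinations

def stochastically_leq(theta, eta, points, leq, tol=1e-12):
    """Check theta ⪯ eta in the induced order on contributions over a
    finite poset. theta, eta: dicts point -> mass; leq(x, y): the order.
    Compares theta(U) <= eta(U) on every upward-closed U (including U = X,
    which compares total masses)."""
    def mass(m, U):
        return sum(m.get(x, 0.0) for x in U)
    subsets = chain.from_iterable(combinations(points, r)
                                  for r in range(len(points) + 1))
    for U in map(set, subsets):
        upward_closed = all(z in U for x in U for z in points if leq(x, z))
        if upward_closed and mass(theta, U) > mass(eta, U) + tol:
            return False
    return True
```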
We can now state the formalization of only bounding one side of the bridge transform. Let elΓ×Φ be equipped with the following order. (y,α,x)≤(y′,α′,x′) if and only if y=y′, x=x′ and α⊆α′. Then:
Proposition 2.4: For any Γ, Φ and Θ∈□c(Γ×Φ), Br(Θ) is downwards closed w.r.t. the induced order on Δc(elΓ×Φ). That is, if θ∈Br(Θ) and η⪯θ then η∈Br(Θ).
Simple special case
Let's consider the special case where there's only one program, which can produce two possible outputs, 0 and 1, and these two outputs map to two different distributions over physics outcomes in Φ. Intuitively, if the computation isn't realized/instantiated, the two distributions over physics outcomes should be identical, while if the computation is realized/instantiated, it should be possible to look at the physics results to figure out how the computation behaved. The two probability distributions may overlap some intermediate amount, in which case it should be possible to write them as a mixture of a probability distribution that behaves identically regardless of the program output (the "overlap" of the two distributions) and a pair of probability distributions, corresponding to the two different program outputs, which have disjoint supports. The total variation distance dTV(ϕ(0),ϕ(1)) between the two probability distributions determines the size of the overlap. Proposition 2.5 makes this formal.
Proposition 2.5: Consider a finite set X, ϕ?∈ΔX, ϕ!:{0,1}→ΔX, p∈[0,1] and ϕ:=(1−p)ϕ?+pϕ!. Then, p≥dTV(ϕ(0),ϕ(1)). Conversely, consider any ϕ:{0,1}→ΔX. Then, there exist ϕ?∈ΔX and ϕ!:{0,1}→ΔX s.t. ϕ=(1−p)ϕ?+pϕ! for p:=dTV(ϕ(0),ϕ(1)).
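The decomposition in Proposition 2.5 is easy to compute explicitly: the overlap is the pointwise minimum of the two distributions, and p = dTV(ϕ(0),ϕ(1)) = 1 − (total overlap mass). A sketch:

```python
def tv_decompose(phi0, phi1):
    """Split phi0, phi1 (dicts x -> prob on a common finite X) as
    phi_i = (1 - p) * phi_q + p * phi_bang_i with p = d_TV(phi0, phi1).
    phi_q is the shared part; the phi_bang_i have disjoint supports."""
    X = set(phi0) | set(phi1)
    overlap = {x: min(phi0.get(x, 0.0), phi1.get(x, 0.0)) for x in X}
    p = 1.0 - sum(overlap.values())  # total variation distance
    phi_q = {x: m / (1 - p) for x, m in overlap.items()} if p < 1 else None
    phi_bang = tuple({x: (phi.get(x, 0.0) - overlap[x]) / p for x in X}
                     if p > 0 else None for phi in (phi0, phi1))
    return p, phi_q, phi_bang
```

Here p=0 exactly when the two distributions coincide, and p=1 exactly when their supports are disjoint, matching the intuition that a computation leaving no trace in physics is unrealized.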
The bridge transform should replicate this same sort of analysis. We can represent "total uncertainty over math, but knowing how physics turns out conditional on knowing how math turns out" by ⊤Γ⋉ϕ for some ϕ:Γ→ΔΦ. Taking the special case Γ={0,1}, for one program that can output two possible answers, the bridge transform should return the same sort of result: the two probability distributions decompose into their common overlap and non-overlapping pieces, with the overlapping chunk of probability measure allocated to the event "the computation is not instantiated" and the disjoint pieces allocated to the event "the computation is instantiated". As it turns out, this does indeed happen, with the same total-variation-distance-based upper bound on the probability that the computation is unrealized (namely, 1−dTV(ϕ(0),ϕ(1))).
Proposition 2.6: Consider any Φ and ϕ:{0,1}→ΔΦ. Denote U:={0,1}×{{0,1}}×Φ (the event "program is unrealized"). Let Λ:=Br(⊤{0,1}⋉ϕ). Then,
Λ(χU)=1−dTV(ϕ(0),ϕ(1))
Bound in terms of support
If the physical state is x∈Φ, then it is intuitively obvious that the physically manifest facts about the computational universe y∈Γ must include the fact that (y,x)∈suppΘ. Formally:
Proposition 2.7: If X is a poset and Θ∈□cX, then Θ↓ will denote downward closure of Θ. For any Γ, Φ and Θ∈□c(Γ×Φ) if (y,α,x)∈suppBr(Θ) and y′∈α, then (y′,x)∈suppΘ. Moreover, define susΘ:Φ→2Γ by susΘ(x):={y∈Γ∣(y,x)∈suppΘ}. Then, Br(Θ)⊆(Θ⋉susΘ)↓ (we slightly abuse notation by treating susΘ as a mapping Γ×Φ→2Γ that doesn't depend on the first argument, and also playing loose with the order of factors in the set on which our HUCs live).
Putting that into words: every (y,α,x) in the support of the bridge transform has α a subset of the support of Θ restricted to x. The bridge transform knows not to produce α sets which are too large. Further, the bridge transform is smaller (has more information) than the transform that only knows what the support of Θ is.
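The function susΘ and the resulting support bound are straightforward to compute when the supports are given as explicit sets (a small sketch):

```python
def sus(Theta_support):
    """sus_Theta: maps x to {y | (y, x) in supp Theta} (Proposition 2.7)."""
    out = {}
    for y, x in Theta_support:
        out.setdefault(x, set()).add(y)
    return out

def respects_support_bound(bridge_support, Theta_support):
    """Check the consequence of Proposition 2.7: every (y, alpha, x) in
    supp Br(Theta) has alpha a subset of sus_Theta(x)."""
    s = sus(Theta_support)
    return all(set(alpha) <= s.get(x, set())
               for (y, alpha, x) in bridge_support)
```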
Idempotence
Imagine replacing the physical state space Φ by the space of manifest facts 2Γ. Intuitively, the latter should contain the same information about computations as the former. Therefore, if we apply the bridge transform again after making such a replacement, we should get the same thing. Given any set X, we denote idX:X→X the identity mapping and diagX:X→X×X the diagonal mapping.
Proposition 2.8: For any Γ, Φ and Θ∈□c(Γ×Φ)
Br(prΓ×2ΓBr(Θ))=[(idΓ×diag2Γ)∗prΓ×2ΓBr(Θ)]↓
Refinement
If we refine a hypothesis (i.e. make it more informative) in the ultradistribution sense, the bridge transform should also be more refined:
Proposition 2.9: For any Γ, Φ and Θ1,Θ2∈□c(Γ×Φ), if Θ1⊆Θ2 then Br(Θ1)⊆Br(Θ2).
Another way of refining a hypothesis is refining the ontology, i.e. moving to a richer state space. For example, we can imagine one hypothesis that describes the world in terms of macroscopic objects and another hypothesis that describes the world in terms of atoms, which are otherwise perfectly consistent with each other. In this case, we also expect the bridge transform to become more refined. This desideratum is where the downwards closure comes from, because it's always possible to pass to a more detailed view of the world, and new computations could manifest then. It's also worth noting that if you reduce your uncertainty about math or physics, or move to a more detailed state space, more computations are instantiated, and the loss goes down as a result.
Proposition 2.10: For any Γ, Φ1, Φ2, t:Φ2→Φ1 and Θ∈□c(Γ×Φ2)
(idelΓ×t)∗Br(Θ)⊆Br((idΓ×t)∗Θ)
In general, when we refine the ontology, the manifest facts can become strictly refined: the richer state "knows more" about computations than the poorer state. However, there is a special case when we don't expect this to happen, namely when the rich state depends on the computational universe only through the poor state. Roughly speaking, in this case, once we have the poor state we can sample the rich state without evaluating any more programs. To formalize this, it is convenient to introduce the pullback of HUCs, which is effectively the probabilistic version of taking a preimage.
Definition 2.2: Let X,Y be finite sets and t:X→Y a mapping. We define the pullback[7] operator t∗:□cY→□cX by
t∗Θ:={θ∈ΔcX∣t∗θ∈Θ}
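Pullback is simple to express with membership predicates, in the same representation as the bridge-transform sketch above:

```python
def pullback(t, in_Theta):
    """t^* (Definition 2.2). Given t: X -> Y as a dict and a membership
    predicate for Theta on contributions over Y, return the membership
    predicate for t^* Theta on contributions over X: theta lies in
    t^* Theta iff its pushforward t_* theta lies in Theta."""
    def in_pullback(theta):
        pushed = {}
        for x, m in theta.items():
            pushed[t[x]] = pushed.get(t[x], 0.0) + m
        return in_Theta(pushed)
    return in_pullback
```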
Proposition 2.11: Consider any Γ, Φ1, Φ2, t:Φ2→Φ1, Ξ:Φ1→□cΦ2 and Θ∈□c(Γ×Φ1) s.t. tΞ=idΦ1. Then,
(idelΓ×Ξ)∗Br(Θ)⊆Br((idΓ×Ξ)∗Θ)⊆(idelΓ×t)∗Br(Θ)
In particular,
prelΓBr(Θ)=prelΓBr((idΓ×Ξ)∗Θ)
In other words, we can think of Φ1 as the coarse-grained version, Φ2 as the fine-grained version, t as the coarse-graining function, and Ξ as the function mapping a coarse-grained state to some uncertainty over the corresponding fine-grained state. Θ is your starting uncertainty over computations and the coarse-grained state. Then, in order from most to least informative, you could apply Ξ after the bridge transform, apply Ξ before the bridge transform, or pull back along t after the bridge transform (because the pullback is the most uninformative inverse, in a sense). But all of this only affects the information about the fine-grained state Φ2. If you forget about that, the data about computations and how they're instantiated is the same whether you do the bridge transform with the coarse state or try your best to fill in the fine state first, because the "fill in the fine state" function doesn't depend on your uncertainty about computations, so it doesn't contain any more information about how math is connected to physics than the original coarse-grained view had.
Mixing hypotheses
In general, we cannot expect the bridge transform to commute with taking probabilistic mixtures. For example, let Γ={0,1}, Φ={0,1}, Θ1=⊤{00,11}, Θ2=⊤{