Followup to: Updateless intelligence metrics in the multiverse

In the previous post I explained how to define a quantity that I called "the intelligence metric" which allows comparing intelligence of programs written for a given hardware. It is a development of the ideas by Legg and Hutter which accounts for the "physicality" of the agent i.e. that the agent should be aware it is part of the physical universe it is trying to model (this desideratum is known as naturalized induction). My construction of the intelligence metric exploits ideas from UDT, translating them from the realm of decision algorithms to the realm of programs which run on an actual piece of hardware with input and output channels, with all the ensuing limitations (in particular computing resource limitations).

In this post I present a variant of the formalism which overcomes a certain problem implicit in the construction. This problem has to do with overly strong sensitivity to the choice of a universal computing model used in constructing Solomonoff measure. The solution sheds some interesting light on how the development of the seed AI should occur.

Structure of this post:

- A 1-paragraph recap of how the updateless intelligence formalism works. The reader interested in technical details is referred to the previous post.
- Explanation of the deficiencies in the formalism I set out to overcome.
- Explanation of the solution.
- Concluding remarks concerning AI safety and future development.

# TLDR of the previous formalism

The metric is a utility expectation value over a Solomonoff measure in the space of hypotheses describing a "Platonic ideal" version of the target hardware. In other words it is an expectation value over all universes containing this hardware in which the hardware cannot "break" i.e. violate the hardware's intrinsic rules. For example, if the hardware in question is a Turing machine, the rules are the time evolution rules of the Turing machine, if the hardware in question is a cellular automaton, the rules are the rules of the cellular automaton. This is consistent with the agent being Physicalist since the utility function is evaluated on a different universe (also distributed according to a Solomonoff measure) which isn't constrained to contain the hardware or follow its rules. The coupling between these two different universes is achieved via the usual mechanism of interaction between the decision algorithm and the universe in UDT i.e. by evaluating expectation values conditioned on logical counterfactuals.

# Problem

The Solomonoff measure depends on choosing a universal computing model (e.g. a universal Turing machine). Solomonoff induction only depends on this choice weakly in the sense that any Solomonoff predictor converges to the right hypothesis given enough time. This has to do with the fact that Kolmogorov complexity only depends on the choice of universal computing model through an O(1) additive correction. It is thus a natural desideratum for the intelligence metric to depend on the universal computing model weakly in some sense. Intuitively, the agent in question should always converge to the right model of the universe it inhabits regardless of the Solomonoff prior with which it started.

The problem with realizing this expectation has to do with exploration-exploitation tradeoffs. Namely, if the prior strongly expects a given universe, the agent would be optimized for maximal utility generation (exploitation) in this universe. This optimization can be so strong that the agent would lack the faculty to model the universe in any other way. This is markedly different from what happens with AIXI since our agent has limited computing resources to spare *and* it is physicalist therefore its source code might have side effects important to utility generation that have nothing to do with the computation implemented by the source code. For example, imagine that our Solomonoff prior assigns very high probability to a universe inhabited by Snarks. Snarks have the property that once they see a robot programmed with the machine code "000000..." they immediately produce a huge pile of utilons. On the other hand, when they see a robot programmed with any other code they immediately eat it *and* produce a huge pile of negative utilons. Such a prior would result in the code "000000..." being assigned the maximal intelligence value even though it is everything but intelligent. Observe that there is nothing preventing us from producing a Solomonoff prior with such bias since it is possible to set the probabilities of any finite collection of computable universes to any non-zero values with sum < 1.

More precisely, the intelligence metric involves two Solomonoff measures: the measure of the "Platonic" universe and the measure of the physical universe. The latter is not really a problem since it can be regarded to be a part of the utility function. The utility-agnostic version of the formalism assumes a program for computing the utility function is read by the agent from a special storage. There is nothing to stop us from postulating that the agent reads *another* program from that storage which is the universal computer used for defining the Solomonoff measure over the physical universe. However, this doesn't solve our problem since even if the physical universe is distributed with a "reasonable" Solomonoff measure (assuming there is such a thing), the Platonic measure determines in which portions of the physical universe (more precisely multiverse) our agent manifests.

There is another way to think about this problem. If the seed AI knows nothing about the universe except the working of its own hardware and software, the Solomonoff prior might be insufficient "information" to prevent it from making irreversible mistakes early on. What we would *like* to do is to endow it from the first moment with the sum of our own knowledge, but this might prove to be very difficult.

# Solution

Imagine the hardware architecture of our AI to be composed of two machines. One I call the "child machine", the other the "adult machine". The child machine receives data from the same input channels (and "utility storage") as the adult machine and is able to read the internal state of the adult machine itself or at least the content of its output channels. However, the child machine has no output channels of its own. The child machine has special memory called "template memory" into which it has unlimited write access. There a single moment in time ("end of childhood"), determined by factors external to both machines (i.e. the human operator) in which the content of the template memory is copied into the instruction space of the adult machine. Thus, the child machine's entire role is making observations and using them to prepare a program for the adult machine which will be eventually loaded into the latter.

The new intelligence metric assigns intelligence values to programs for the *child* machine. For each hypothesis describing the Platonic universe (which now contains both machines, the end of childhood time value and the entire ruleset of the system) we compute the utility expectation value under the following logical counterfactual condition: "The program loaded into template memory at the end of childhood is the same as would result from the given program for the child machine if this program for the child machine would be run with the inputs actually produced by the given hypothesis regarding the Platonic universe". The intelligence value is then the expectation value of that quantity with respect to a Solomonoff measure over hypotheses describing the Platonic universe.

The important property of the logical counterfactual is that it doesn't state the given program is *actually* loaded into the child machine. It only says the resulting content of the template memory is the same as which *would* be obtained from the given program *assuming all the laws of the Platonic universe hold*. This formulation prevents exploitation of side effects of the child source code since the condition doesn't fix the source code, only its output. Effectively, the child agents considers itself to be Cartesian, i.e. can consider neither the side effects of its computations nor the possibility the physical universe will violate the laws of its machinery. On the other hand the child's output (the mature program) is a physicalist agent since it affects the physical universe by manifesting in it.

If such an AI is implemented in practice, it makes sense to prime the adult machine with a "demo" program which will utilize the output channels in various ways and do some "exploring" using its input channels. This would serve to provide the child with as much as possible information.

To sum up, the new expression for the intelligence metric is:

I(q) = E_{HX}[E_{HY(Ec(X))}[E_{L}[U(Y, Eu(X)) | Q(X, t(X)) = Q*(X; q)]] | N]

Here:

- q is the program priming the child machine
- HX is the hypothesis producing the Platonic universe X (a sequence of bits encoding the state of the hardware as a function of time and the end-of-childhood time t(X)). It is a program for a fixed universal computing model C.
- HY is the hypothesis producing the Physical universe (an abstract sequence of bits). It is a program for the universal computer program ("virtual machine") Ec(X) written into storage E in X.
- E
_{L}is logical expectation value defined e.g. using evidence logic. - Eu(X) is a program for computing the utility function which is written into storage E in X.
- U is the utility function which consists of applying Eu(X) to Y.
- Q(X, t(X)) is the content of template memory at time t(X).
- Q*(X; q) is the content that
*would*be in the template memory if it was generated by program q receiving the inputs going into the child machine under hypothesis HX. - N is the full ruleset of the hardware including the reprogramming of the adult machine that occurs at t(X).

# Concluding Remarks

- It would be very valuable to formulate and prove a mathematical theorem which expresses the sense in which the new formalism depends on the choice of universal computing model weakly (in particular it would validate the notion).
- This formalism might have an interesting implication on AI safety. Since the child agent is Cartesian and has no output channels (it cannot
*create*output channels*because*it is Cartesian) it doesn't present as much risk as an adult AI. Imagine template memory is write-only (which is not a problem for the formalism) and is implemented by a channel that doesn't store the result*anywhere*(in particular the mature program is never run). There can still be risk due to side effects of the mature program that manifest through presence of its partial or full versions in (non-template) memory of the child machine. For example, imagine the mature program is s.t. any person who reads it experiences compulsion to run it. This risk can be mitigated by allowing both machines to interact only with a virtual world which receives no inputs from the external reality. Of course the AI might still be able to deduce external reality. However, this can be prevented by exploiting prior bias: we can equip the AI with a Solomonoff prior that favors the virtual world to such extent that it would have no reason to deduce the real world. This way the AI is safe unless it invents a "generic" box-escaping protocol which would work in a*huge*variety of different universes that might contain the virtual world. - If we factor finite logical uncertainty into evaluation of the logical expectation value E
_{L}, the plot thickens. Namely, a new problem arises related to bias in the "logic prior". To solve this new problem we need to introduce yet another stage into AI development which might be dubbed "fetus". The fetus has no access to external inputs and is responsible for building a sufficient understanding of mathematics in the same sense the child is responsible to build a sufficient understanding of physics. Details will follow in subsequent posts, so stay tuned!

I would find it helpful to see this calculation carried through for some concrete actual computing device that is specified completely. A 32-bit adder circuit, say, or a thermostat.

It is very complicated to do literally since it involves Solomonoff probabilities which are uncomputable and also logical probabilities which is something I'm not even sure we already nailed the correct definition of (and even if we did it involves stuff like counting weighted proofs which is not easy to do in practice).

It might be possible to demonstrate the idea by making some artificial simplifications e.g. replacing the Solomonoff ensemble by some computable ensemble. I'll consider to write something like this, thanks!