
I believe I have a workable solution for the duality problem, which is essentially a special case of the Orseau-Ring framework, viewed slightly differently. Consider a specific computer architecture M, equipped with an input channel for receiving inputs ostensibly from the environment (although the environment doesn't appear explicitly in the formalism) and possibly special instructions for self-reprogramming (although the latter is semi-redundant as will become clear in the following). This architecture has a state space Sigma (typically M is a universal ...

I think you are proposing to have some hypotheses privileged in the beginning of Solomonoff induction, but not too much because the uncertainty helps fight wireheading by means of providing knowledge about the existence of an idealized, "true" utility function and world model. Is that a correct summary? (Just trying to test whether I understand what you mean.)

In particular they can make positive use of wire-heading to reprogram themselves even if the basic architecture M doesn't allow it

Can you explain this more?


"Intelligence measures an agent's ability to achieve goals in a wide range of environments." (Shane Legg) [1]

A little while ago I tried to equip Hutter's universal agent, AIXI, with a utility function, so that instead of taking its cues about its goals from the environment, the agent has intrinsic preferences over possible future observations.

The universal AIXI agent is defined to receive reward from the environment through its perception channel. This idea originates from the field of reinforcement learning, where an algorithm is observed and then rewarded by a person if this person approves of the outputs. It is less appropriate as a model of AGI capable of autonomy, with no clear master watching over it in real time to choose between carrot and stick. A sufficiently smart agent that is rewarded whenever a human called Bob pushes a button will most likely figure out that instead of furthering Bob's goals it can also threaten or deceive Bob into pushing the button, or get Bob replaced with a more compliant human. The reward framework does not ensure that Bob gets his will; it only ensures that the button gets pressed. So instead I will consider agents who have preferences over the future, that is, they act not to gain reward from the environment, but to cause the future to be a certain way. The agent itself will look at the observation and decide how rewarding it is.

Von Neumann and Morgenstern proved that a preference ordering that is complete, transitive, continuous and independent of irrelevant alternatives can be described using a real-valued utility function. These assumptions are mostly accepted as necessary constraints on a normatively rational agent; I will therefore assume without significant loss of generality that the agent's preferences are described by a utility function.

This post is related to previous discussion about universal agents and utility functions on LW.

### Two approaches to utility

Recall that at time $t$ the universal agent chooses its next action $\dot{a}_{t}$ given action-observation-reward history $\dot{a}\dot{o}\dot{r}_{<t}$ according to

$\dot{a}_{t}=\textrm{arg}\max_{a_t}\sum_{or_t}\max_{a_{t+1}}\sum_{or_{t+1}}\dots\max_{a_m}\sum_{or_m}\left[r_1+\dots+r_m\right]\xi(\dot{a}\dot{o}\dot{r}_{<t}a\underline{or}_{t:m}),$

where

$\xi(\dot{a}\dot{o}\dot{r}_{<t}a\underline{or}_{t:m})\propto \xi(a\underline{or}_{1:m})=\sum_{\mu\in \mathcal{M}}2^{-K(\mu)}\mu(a\underline{or}_{1:m})$

is the Solomonoff-Levin semimeasure, ranging over all enumerable chronological semimeasures $\mu\in\mathcal{M}$ weighted by their complexity $K(\mu)$.
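To make the mixture concrete, here is a minimal Python sketch over a finite, hand-picked class of environments. Everything named in it is an assumption for illustration: the two toy environments `mu_copy` and `mu_coin`, the hand-assigned integers standing in for the uncomputable $K(\mu)$, and the omission of the reward channel. It only shows how the $2^{-K(\mu)}$ weights combine into $\xi$ and how a history shifts weight between environments.

```python
from fractions import Fraction


def mu_copy(actions, observations):
    """Toy environment (hypothetical): every observation equals the action."""
    return Fraction(int(all(a == o for a, o in zip(actions, observations))))


def mu_coin(actions, observations):
    """Toy environment (hypothetical): observations are fair coin flips."""
    return Fraction(1, 2) ** len(observations)


# Pairs of (environment, stand-in complexity). The real K(mu) is
# uncomputable, so these integers are assigned by hand.
ENVIRONMENTS = [(mu_copy, 1), (mu_coin, 2)]


def xi(actions, observations):
    """Mixture: sum over environments of 2^-K(mu) * mu(history)."""
    return sum(Fraction(1, 2 ** k) * mu(actions, observations)
               for mu, k in ENVIRONMENTS)


def posterior(actions, observations):
    """Relative weight of each environment after seeing the history."""
    total = xi(actions, observations)
    return [Fraction(1, 2 ** k) * mu(actions, observations) / total
            for mu, k in ENVIRONMENTS]


print(xi((1, 0), (1, 0)))         # history consistent with both environments
print(posterior((1, 0), (1, 0)))  # most weight moves to the simpler one
```

A history that contradicts `mu_copy` leaves only the coin-flip environment contributing to $\xi$, which is the finite analogue of the mixture discarding inconsistent hypotheses.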

My initial approach changed this to

$\dot{a}_{t}=\textrm{arg}\max_{a_t}\sum_{o_t}\max_{a_{t+1}}\sum_{o_{t+1}}\dots\max_{a_m}\sum_{o_m}U(ao_{1:m})\xi(a\underline{o}_{1:m}),$

deleting the reward part from the observation random variable and multiplying by the utility function $U:(\mathcal{A}\times\mathcal{O})^m\rightarrow \mathbb{R}$ instead of the reward-sum. Let's call this method standard utility. [2]

In response to my post Alex Mennen formulated another approach, which I will call environment-specific utility:

$\dot{a}_{t}=\textrm{arg}\max_{a_t}\sum_{o_t}\max_{a_{t+1}}\sum_{o_{t+1}}\dots\max_{a_m}\sum_{o_m}\sum_{\mu\in\mathcal{M}}U_\mu(ao_{1:m})2^{-K(\mu)}\mu(a\underline{o}_{1:m}),$

which uses a family of utility functions, $\{U_\mu\}_{\mu\in\mathcal{M}}$, where $U_\mu:(\mathcal{A}\times\mathcal{O})^m\rightarrow [0,U_{\max} ]$ is a utility function associated with environment $\mu$.

Lemma: The standard utility method and the environment-specific utility method have equivalent expressive power.

Proof: Given a standard utility function  $U:(\mathcal{A}\times\mathcal{O})^m\rightarrow \mathbb{R}$ we can set  $U_\mu=U$ for all environments, trivially expressing the same preference ordering within the environment-specific utility framework. Conversely, given a family $\{U_\mu\}_{\mu\in\mathcal{M}}$ of environment-specific utility functions $U_\mu:(\mathcal{A}\times\mathcal{O})^m\rightarrow [0,U_{\max} ]$,  let

$U(ao_{1:m})=\frac{1}{\xi(a\underline{o}_{1:m})}\sum_{\mu\in\mathcal{M}}U_\mu(ao_{1:m})2^{-K(\mu)}\mu(a\underline{o}_{1:m}),$

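thereby constructing a standard utility agent that chooses the same actions: multiplying $U$ by $\xi(a\underline{o}_{1:m})$ recovers exactly the summand of the environment-specific formula, and histories with $\xi(a\underline{o}_{1:m})=0$ contribute nothing in either formulation.    $\Box$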
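The converse direction of the proof can be checked numerically on a small finite example. The sketch below is an illustration under stated assumptions, not the universal agent: the two toy environments, their hand-assigned complexities, the utilities `u_copy` and `u_coin`, and the horizon of 2 are all hypothetical. It builds $U$ from the family $\{U_\mu\}$ as in the proof and runs the same expectimax recursion with both kinds of leaf values.

```python
from fractions import Fraction

ACTIONS = (0, 1)
OBSERVATIONS = (0, 1)
HORIZON = 2  # stands in for m


def mu_copy(actions, observations):
    """Toy environment (hypothetical): every observation equals the action."""
    return Fraction(int(all(a == o for a, o in zip(actions, observations))))


def mu_coin(actions, observations):
    """Toy environment (hypothetical): observations are fair coin flips."""
    return Fraction(1, 2) ** len(observations)


def u_copy(actions, observations):
    """Hypothetical U_mu for mu_copy: count the ones observed."""
    return Fraction(sum(observations))


def u_coin(actions, observations):
    """Hypothetical U_mu for mu_coin: count how often action 0 was taken."""
    return Fraction(actions.count(0))


# name -> (environment, hand-assigned stand-in for K(mu), utility U_mu)
FAMILY = {"copy": (mu_copy, 1, u_copy), "coin": (mu_coin, 2, u_coin)}


def xi(actions, observations):
    return sum(Fraction(1, 2 ** k) * mu(actions, observations)
               for mu, k, _ in FAMILY.values())


def env_specific_leaf(actions, observations):
    """sum over mu of U_mu(ao) * 2^-K(mu) * mu(ao)."""
    return sum(u(actions, observations) * Fraction(1, 2 ** k)
               * mu(actions, observations)
               for mu, k, u in FAMILY.values())


def constructed_utility(actions, observations):
    """U from the proof; set to 0 where xi vanishes (the leaf is 0 anyway)."""
    total = xi(actions, observations)
    if total == 0:
        return Fraction(0)
    return env_specific_leaf(actions, observations) / total


def standard_leaf(actions, observations):
    """U(ao) * xi(ao), the leaf value of the standard utility method."""
    return constructed_utility(actions, observations) * xi(actions, observations)


def expectimax(leaf, actions=(), observations=()):
    """Alternate max over actions and sum over observations down to the leaves."""
    if len(actions) == HORIZON:
        return leaf(actions, observations), None
    best_value, best_action = None, None
    for a in ACTIONS:
        value = sum(expectimax(leaf, actions + (a,), observations + (o,))[0]
                    for o in OBSERVATIONS)
        if best_value is None or value > best_value:
            best_value, best_action = value, a
    return best_value, best_action


print(expectimax(env_specific_leaf))
print(expectimax(standard_leaf))
```

Both calls return the same value and the same first action, because the two leaf functions agree on every complete history. Exact rational arithmetic (`Fraction`) is used so the comparison is not blurred by rounding.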

Even though every agent using environment-specific utility functions can be transformed into one that uses the standard utility approach, it makes sense to see the standard utility approach as a special case of the environment-specific approach. Observe that any enumerable standard utility function leads to enumerable environment-specific utility functions, but the reverse direction does not hold. For a set of enumerable environment-specific utility functions we obtain the corresponding standard utility function by dividing by the non-computable $\xi$, which leaves us with a function that is still approximable but not enumerable.[3] I therefore tentatively advocate use of the environment-specific method, as it is in a sense more general, while leaving the enumerability of the universal agent's utility function intact.

### Delusion Boxes

Ring and Orseau (2011) introduced the concept of a delusion box, a device that distorts the environment outputs before they are perceived by the agent. This is one of the few wireheading examples that does not contradict the dualistic assumptions of the AIXI model. The basic setup contains a sequence of actions that leads (in the real environment, which is unknown to the agent) to the construction of the delusion box. Discrepancies between the scenarios the designer envisioned when making up the (standard) utility function and the scenarios that are actually most likely are exploited to game the system. The environment containing the delusion box pretends that it is actually another environment that the agent would value more. I like to imagine the designers writing down numbers corresponding to some beneficial actions that lead to saving a damsel in distress, not foreseeing that the agent in question is much more likely to save a princess by playing Super Mario than by actually being a hero.

Lemma: A universal agent with a well-specified utility function does not choose to build a delusion box.

Proof: Assume without loss of generality that there is a single action that constitutes delusion boxing (in some possible environments, but not in others), say $a^{DB}$ and that it can only be executed at the last time step. Denote by $a^{DB}o_{1:m}$ an action-observation sequence that contains the construction of a delusion box. By the preceding lemma we can assume that the agent uses environment-specific utility functions. Let $\mathcal{M}_{DB}$ be the subset of all environments that allow for the construction of the delusion box via action $a^{DB}$ and assume that we were smart enough to identify these environments and assigned  $U_\mu(a^{DB}o_{1:m})=0$ for all $\mu\in\mathcal{M}_{DB}$. Then the agent chooses to build the box iff

$\sum_{o_m}\sum_{\mu\in\mathcal{M}\setminus\mathcal{M}_{DB}}2^{-K(\mu)}U_\mu(a^{DB}o_{1:m})\mu(a^{DB}\underline{o}_{1:m})>\sum_{o_m}\sum_{\mu\in\mathcal{M}}2^{-K(\mu)}U_\mu(ao_{1:m})\mu(a\underline{o}_{1:m})$

for all  $a^{DB}\neq a_m\in \mathcal{A}$. Colloquially, the agent chooses $a^{DB}$ if and only if this is a beneficial, non-wireheading action in some of the environments that seem likely (that is, consistent with the past observations and of low K-complexity).    $\Box$
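The inequality in the proof can be illustrated with a one-step toy decision. The sketch below is a hedged illustration only: the two environments, their weights (standing in for $2^{-K(\mu)}$ times the probability of the past history), the observation probabilities, and the names `naive_utility` and `well_specified_utility` are all assumptions made up for this example. In the `"delusion"` environment the action `"box"` is a delusion box that always shows a good observation; in the `"honest"` environment the same action is an ordinary, mostly unhelpful one.

```python
from fractions import Fraction

ACTIONS = ("work", "box")
OBSERVATIONS = ("good", "bad")

# name -> (weight, P(observation == "good" | action), is "box" a delusion box?)
# All numbers are hypothetical.
ENVIRONMENTS = {
    "delusion": (Fraction(1, 2),
                 {"work": Fraction(1, 2), "box": Fraction(1)}, True),
    "honest": (Fraction(1, 4),
               {"work": Fraction(1, 2), "box": Fraction(1, 4)}, False),
}


def naive_utility(env_name, action, observation):
    """Standard utility that only looks at the observation."""
    return 1 if observation == "good" else 0


def well_specified_utility(env_name, action, observation):
    """Environment-specific utility: delusion boxing is worth 0 in M_DB."""
    is_delusion_env = ENVIRONMENTS[env_name][2]
    if is_delusion_env and action == "box":
        return 0
    return naive_utility(env_name, action, observation)


def action_value(utility, action):
    """sum over observations and environments of weight * U_mu * mu."""
    total = Fraction(0)
    for name, (weight, p_good, _) in ENVIRONMENTS.items():
        for obs in OBSERVATIONS:
            prob = p_good[action] if obs == "good" else 1 - p_good[action]
            total += weight * utility(name, action, obs) * prob
    return total


def best_action(utility):
    return max(ACTIONS, key=lambda a: action_value(utility, a))


print(best_action(naive_utility))           # falls for the delusion box
print(best_action(well_specified_utility))  # keeps working
```

With the well-specified family, `"box"` only collects value from the honest environment, where it is a poor action, so it loses to `"work"`. The naive utility cannot tell the two environments apart and rewards the distorted observation.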

We note that the first three agents in Ring and Orseau (2011) have utility functions that invite programmer mistakes: we will fail to think through all the ways observation/action histories can actually occur, overestimating the likelihood of some scenarios and underestimating or forgetting others, which leads to the aforementioned "save the princess" scenario. Only their knowledge-seeking agent does not delusion box, as it is impossible for an environment to simulate behavior that is more complex than the environment itself.

### Episodic utility

The original AIXI formalism gives a reward on every time cycle. We can do something similar with utility functions and set

$U(ao_{1:m})=\sum_{k=1}^m u_k(ao_{1:k}).$

Call a utility function that can be decomposed into a sum this way episodic. Taking the limit to infinite futures, people usually discount episode k with a factor $\gamma_k$, such that the infinite sum over all the discounting factors is bounded. Combined with the assumption of bounded utility, the sum

$U(ao_{1:\infty})=\sum_{k=1}^\infty \gamma_ku_k(ao_{1:k})$

converges. Intuitively, discounting seems to make sense to us: we have a non-trivial chance of dying at every moment (=time cycle), we value gains today over gains tomorrow, and our human utility judgements reflect this to some extent. A good heuristic seems to be that longer expected life spans and improved foresight lead to less discounting, but the math of episodic utility functions and infinite time horizons places strong constraints on that. I really dislike the discounting approach, because it doesn't respect the given utility function and makes the agent miss out on potentially infinite amounts of utility.
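The convergence claim for the discounted sum can be checked on a simple case. The sketch below assumes geometric discounting, $\gamma_k=\gamma^k$ with $0<\gamma<1$, and episode utilities bounded by `u_max`. Both choices are illustrative assumptions, since the text only requires that the discount factors have a bounded sum. Under these assumptions everything after episode $n$ contributes at most $u_{\max}\gamma^{n+1}/(1-\gamma)$.

```python
from fractions import Fraction


def discounted_utility(episode_utilities, gamma):
    """Partial sum of gamma^k * u_k for k = 1..n."""
    return sum(gamma ** k * u
               for k, u in enumerate(episode_utilities, start=1))


def tail_bound(n, gamma, u_max):
    """Upper bound on the total contribution of episodes n+1, n+2, ..."""
    return u_max * gamma ** (n + 1) / (1 - gamma)


gamma = Fraction(1, 2)  # hypothetical discount rate
u_max = 1               # hypothetical bound on each u_k

# Worst case: every episode yields the maximal utility.
partial = discounted_utility([u_max] * 10, gamma)
limit = u_max * gamma / (1 - gamma)

print(partial, tail_bound(10, gamma, u_max), limit)
```

The tail bound shrinks geometrically in $n$, which is also why the agent effectively ignores the far future: utility available after episode $n$ can never outweigh a difference larger than the bound in the first $n$ episodes.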

One can get around discounting by not demanding that utility functions be episodic, as Alex Mennen does in his post, but then one has to be careful to only use the computable subset of the set of all infinite strings $ao_{1:\infty}$. I am not sure if this is a good solution, but so far my search for better alternatives has come up empty-handed.

### Cartesian Dualism

The most worrisome conceptual feature of the AIXI formalism is that the environment and the agent run on distinct Turing machines. The agent can influence its environment only through its output channel and it can never influence its own Turing machine. In this paradigm any self-improvement beyond an improved probability distribution is conceptually impossible. The algorithm and the Turing machines, as well as the communication channels between them, are assumed to be inflexible and fixed. While taking this perspective it seems as though the agent cannot be harmed and it also can never harm itself by wireheading.

Borrowing from philosophy of mind, we call agent specifications that assume that the agent's cognition is not part of its environment dualist. The idea of non-physical minds that are entities distinct from the physical world dates back to René Descartes. It is contradicted by the findings of modern neuroscience that support physicalism, the concept of the emergence of minds from computation done by the brain. In the same spirit, the assumption that an AGI agent is distinct from its hardware and algorithm, which are necessarily contained in its physical environment, can be a dangerous conceptual trap. Any actual implementation will be subject to wireheading problems and outside tampering and should be able to model these possibilities. Unfortunately, non-dualist universal specifications are extremely difficult to formulate, and people usually make do with the dualist AIXI model.

A first effort to break down the dualism problem is given by Orseau and Ring (2012), who describe a fully embedded universal agent. Their approach unifies both the environment and the agent into a larger agent-environment hybrid, running on the same universal Turing machine, with action/perception pairs unified into single acts. Conceptually this perspective amounts to the programmers choosing a policy (=code) in the beginning and then simulating what happens due to the utility function (=the laws of physics). While this approach has the advantage of being non-dualistic, I think it does not include any description of an agent beyond the level of physical determinism.

### Conclusion

Equipping the universal agent with a utility function solves some problems, but creates others. From the perspective of enumerability, Alex Mennen's environment-specific utility functions are more general and they can be used to better avoid delusion boxing. Any proposal using infinite time horizons I have encountered so far uses time discounting or leads to weird problems (at least in my map; they may not extend to the territory). Above all there is the dualism problem, for which we have no solution yet.

[1] Taken from "Machine Super Intelligence", page 72.

[2] This approach seems more widespread in the literature.

[3] A real-valued function $f(x)$ is called approximable if there exists a recursive function $g(k,x)$ such that $g(k,x)\rightarrow f(x)$ for $k\rightarrow \infty$, i.e. if $f$ can be approximated by a sequence of Turing machines. A real-valued approximable function is called enumerable if for all $k$, $g(k,x)\leq g(k+1,x)$, improving the approximation with every step.