I'm Anja Heinisch, the new visiting fellow at SI. I've been researching replacing AIXI's reward system with a proper utility function. Here I will describe my AIXI+utility function model, address concerns about restricting the model to bounded or finite utility, and analyze some of the implications of modifiable utility functions, e.g. wireheading and dynamic consistency. Comments, questions and advice (especially about related research and material) will be highly appreciated.

### Introduction to AIXI

Marcus Hutter's (2003) universal agent AIXI addresses the problem of rational action in a (partially) unknown computable universe, given infinite computing power and a halting oracle. The agent interacts with its environment in discrete time cycles, producing an action-perception sequence with actions (agent outputs) and perceptions (environment outputs) chosen from finite sets and . The perceptions are pairs , where is the observation* *part and denotes a reward*.* At time k the agent chooses its next action according to the expectimax principle:

Here M denotes the updated Solomonoff prior summing over all programs that are consistent with the history* *^{[1]} and which will, when run on the universal Turing machine T with successive inputs , compute outputs , i.e.

AIXI is a dualistic framework in the sense that the algorithm that constitutes the agent is not part of the environment, since it is not computable. Even considering that any running implementation of AIXI would have to be computable, AIXI accurately simulating AIXI accurately simulating AIXI ad infinitem doesn't really seem feasible. Potential consequences of this separation of mind and matter include difficulties the agent may have predicting the effects of its actions on the world.

### Utility vs rewards

So, why is it a bad idea to work with a reward system? Say the AIXI agent is rewarded whenever a human called Bob pushes a button. Then a sufficiently smart AIXI will figure out that instead of furthering Bob’s goals it can also threaten or deceive Bob into pushing the button, or get another human to replace Bob. On the other hand, if the reward is computed in a little box somewhere and then displayed on a screen, it might still be possible to reprogram the box or find a side channel attack. Intuitively you probably wouldn't even blame the agent for doing that -- people try to game the system all the time.

You can visualize AIXI's computation as maximizing bars displayed on this screen; the agent is unable to connect the bars to any pattern in the environment, they are just there. It wants them to be as high as possible and it will utilize any means at its disposal. For a more detailed analysis of the problems arising through reinforcement learning, see Dewey (2011).

Is there a way to bind the optimization process to actual patterns in the environment? To design a framework in which the screen informs the agent about the patterns it should optimize for? The answer is, yes, we can just define a utility function

that assigns a value to every possible future history and use it to replace the reward system in the agent specification:

When I say "we can just define" I am actually referring to the *really hard* question of how to recognize and describe the patterns we value in the universe. Contrasted with the necessity to specify rewards in the original AIXI framework, this is a strictly harder problem, because the utility function has to be known ahead of time and the reward system can always be represented in the framework of utility functions by setting

For the same reasons, this is also a strictly safer approach.

### Infinite utility

The original AIXI framework must necessarily place upper and lower bound on the rewards that are achievable, because the rewards are part of the perceptions and is finite. The utility function approach does not have this problem, as the expected utility

is always finite as long as we stick to a finite set of possible perceptions, even if the utility function is not bounded. Relaxing this constraint and allowing to be infinite and the utility to be unbounded creates divergence of expected utility (for a proof see de Blanc 2008). This closely corresponds to the question of how to be a consequentialist in an infinite universe, discussed by Bostrom (2011). The underlying problem here is that (using the standard approach to infinities) these expected utilities will become incomparable. One possible solution to this problem could be to use a larger subfield than of the surreal numbers, my favorite^{[2]} so far being the Levi-Civita field generated by the infinitesimal :

with the usual power-series addition and multiplication. Levi-Civita numbers can be written and approximated as

(see Berz 1996), which makes them suitable for representation on a computer using floating point arithmetic. If we allow the range of our utility function to be , we gain the possibility of generalizing the framework to work with an infinite set of possible perceptions, therefore allowing for continuous parameters. We also allow for a much broader set of utility functions, no longer excluding the assignment of infinite (or infinitesimal) utility to a single event. I recently met someone who argued convincingly that his (ideal) utility function assigns infinite negative utility to every time instance that he is not alive, therefore making him prefer life to any finite but huge amount of suffering.

Note that finiteness of is still needed to guarantee the existence of actions with maximal expected utility, and the finite (but dynamic) horizon remains a very problematic assumption, as described in Legg (2008).

### Modifiable utility functions

Any implementable approximation of AIXI implies a weakening of the underlying dualism. Now the agent's hardware is part of the environment and at least in the case of a powerful agent, it can no longer afford to neglect the effect its actions may have on its source code and data. One question that has been asked is whether AIXI can protect itself from harm. Hibbard (2012) shows that an agent similar to the one described above, equipped with the ability to modify its policy responsible for choosing future actions, would not do so, given that it starts out with the (meta-)policy to always use the optimal policy, and the additional constraint to change only if that leads to a strict improvement. Ring and Orseau (2011) study under which circumstances a universal agent would try to tamper with the sensory information it receives. They introduce the concept of a delusion box, a device that filters and distorts the perception data before it is written into the part of the memory that is read during the calculation of utility.

A further complication to take into account is the possibility that the part of memory that contains the utility function may get rewritten, either by accident, by deliberate choice (programmers trying to correct a mistake), or in an attempt to wirehead. To analyze this further we will now consider what can happen if the screen flashes different goals in different time cycles. Let

denote the utility function the agent will have at time k.

Even though we will only analyze instances in which the agent knows at time k, which utility function it will have at future times (possibly depending on the actions before that), we note that for every fixed future history the agent knows the utility function that is displayed on the screen because the screen is part of its perception data .

This leads to three different agent models worthy of further investigation:

**Agent 1**will optimize for the goals that are displayed on the screen right now and act as if it would continue to do so in the future. We describe this with the utility function**Agent 2**will try to anticipate future changes to its utility function and maximize the utility it experiences at every time cycle as shown on the screen at that time. This is captured by**Agent 3**will, at time k, try to maximize the utility it derives in hindsight, displayed on the screen at the time horizon

Of course arbitrary mixtures of these are possible.

The type of wireheading that is of interest here is captured by the Simpleton Gambit described by Orseau and Ring (2011), a Faustian deal that offers the agent maximal utility in exchange for its willingness to be turned into a Simpleton that always takes the same default action at all future times. We will first consider a simplified version of this scenario: The Simpleton future, where the agent knows for certain that it will be turned into a Simpleton at time k+1, no matter what it does in the remaining time cycle. Assume that for all possible action-perception combinations the utility given by the current utility function is not maximal, i.e. holds for all . Assume further that the agents actions influence the future outcomes, at least from its current perspective. That is, for all there exist with . Let be the Simpleton utility function, assigning equal but maximal utility to all possible futures. While Agent 1 will optimize as before, not adapting its behavior to the knowledge that its utility function will change, Agent 3 will be paralyzed, having to rely on whatever method its implementation uses to break ties. Agent 2 on the other hand will try to maximize only the utility .

Now consider the actual Simpleton Gambit: At time k the agent gets to choose between changing, , resulting in and (not changing), leading to for all . We assume that has no further effects on the environment. As before, Agent 1 will optimize for business as usual, whether or not it chooses to change depends entirely on whether the screen specifically mentions the memory pointer to the utility function or not.

Agent 2 will change if and only if the utility of changing compared to not changing according to what the screen currently says is strictly smaller than the comparative advantage of always having maximal utility in the future. That is,

is strictly less than

This seems quite analogous to humans, who sometimes tend to choose maximal bliss over future optimization power, especially if the optimization opportunities are meager anyhow. Many people do seem to choose their goals so as to maximize the happiness felt by achieving them at least some of the time; this is also advice that I have frequently encountered in self-help literature, e.g. here. Agent 3 will definitely change, as it only evaluates situations using its final utility function.

Comparing the three proposed agents, we notice that Agent 1 is dynamically inconsistent: it will optimize for future opportunities, that it predictably will not take later. Agent 3 on the other hand will wirehead whenever possible (and we can reasonably assume that opportunities to do so will exist in even moderately complex environments). This leaves us with Agent model 2 and I invite everyone to point out its flaws.

[1] Dotted actions/ perceptions, like denote past events, underlined perceptions denote random variables to be observed at future times.

[2] Bostrom (2011) proposes using hyperreal numbers, which rely heavily on the axiom of choice for the ultrafilter to be used and I don't see how those could be implemented.