I'm Anja Heinisch, the new visiting fellow at SI. I've been researching replacing AIXI's reward system with a proper utility function. Here I will describe my AIXI+utility function model, address concerns about restricting the model to bounded or finite utility, and analyze some of the implications of modifiable utility functions, e.g. wireheading and dynamic consistency. Comments, questions and advice (especially about related research and material) will be highly appreciated.

### Introduction to AIXI

Marcus Hutter's (2003) universal agent AIXI addresses the problem of rational action in a (partially) unknown computable universe, given infinite computing power and a halting oracle. The agent interacts with its environment in discrete time cycles, producing an action-perception sequence with actions (agent outputs) and perceptions (environment outputs) chosen from finite sets and . The perceptions are pairs , where is the observation* *part and denotes a reward*.* At time k the agent chooses its next action according to the expectimax principle:

Here M denotes the updated Solomonoff prior summing over all programs that are consistent with the history* *^{[1]} and which will, when run on the universal Turing machine T with successive inputs , compute outputs , i.e.

AIXI is a dualistic framework in the sense that the algorithm that constitutes the agent is not part of the environment, since it is not computable. Even considering that any running implementation of AIXI would have to be computable, AIXI accurately simulating AIXI accurately simulating AIXI ad infinitem doesn't really seem feasible. Potential consequences of this separation of mind and matter include difficulties the agent may have predicting the effects of its actions on the world.

### Utility vs rewards

So, why is it a bad idea to work with a reward system? Say the AIXI agent is rewarded whenever a human called Bob pushes a button. Then a sufficiently smart AIXI will figure out that instead of furthering Bob’s goals it can also threaten or deceive Bob into pushing the button, or get another human to replace Bob. On the other hand, if the reward is computed in a little box somewhere and then displayed on a screen, it might still be possible to reprogram the box or find a side channel attack. Intuitively you probably wouldn't even blame the agent for doing that -- people try to game the system all the time.

You can visualize AIXI's computation as maximizing bars displayed on this screen; the agent is unable to connect the bars to any pattern in the environment, they are just there. It wants them to be as high as possible and it will utilize any means at its disposal. For a more detailed analysis of the problems arising through reinforcement learning, see Dewey (2011).

Is there a way to bind the optimization process to actual patterns in the environment? To design a framework in which the screen informs the agent about the patterns it should optimize for? The answer is, yes, we can just define a utility function

that assigns a value to every possible future history and use it to replace the reward system in the agent specification:

When I say "we can just define" I am actually referring to the *really hard* question of how to recognize and describe the patterns we value in the universe. Contrasted with the necessity to specify rewards in the original AIXI framework, this is a strictly harder problem, because the utility function has to be known ahead of time and the reward system can always be represented in the framework of utility functions by setting

For the same reasons, this is also a strictly safer approach.

### Infinite utility

The original AIXI framework must necessarily place upper and lower bound on the rewards that are achievable, because the rewards are part of the perceptions and is finite. The utility function approach does not have this problem, as the expected utility

is always finite as long as we stick to a finite set of possible perceptions, even if the utility function is not bounded. Relaxing this constraint and allowing to be infinite and the utility to be unbounded creates divergence of expected utility (for a proof see de Blanc 2008). This closely corresponds to the question of how to be a consequentialist in an infinite universe, discussed by Bostrom (2011). The underlying problem here is that (using the standard approach to infinities) these expected utilities will become incomparable. One possible solution to this problem could be to use a larger subfield than of the surreal numbers, my favorite^{[2]} so far being the Levi-Civita field generated by the infinitesimal :

with the usual power-series addition and multiplication. Levi-Civita numbers can be written and approximated as

(see Berz 1996), which makes them suitable for representation on a computer using floating point arithmetic. If we allow the range of our utility function to be , we gain the possibility of generalizing the framework to work with an infinite set of possible perceptions, therefore allowing for continuous parameters. We also allow for a much broader set of utility functions, no longer excluding the assignment of infinite (or infinitesimal) utility to a single event. I recently met someone who argued convincingly that his (ideal) utility function assigns infinite negative utility to every time instance that he is not alive, therefore making him prefer life to any finite but huge amount of suffering.

Note that finiteness of is still needed to guarantee the existence of actions with maximal expected utility, and the finite (but dynamic) horizon remains a very problematic assumption, as described in Legg (2008).

### Modifiable utility functions

Any implementable approximation of AIXI implies a weakening of the underlying dualism. Now the agent's hardware is part of the environment and at least in the case of a powerful agent, it can no longer afford to neglect the effect its actions may have on its source code and data. One question that has been asked is whether AIXI can protect itself from harm. Hibbard (2012) shows that an agent similar to the one described above, equipped with the ability to modify its policy responsible for choosing future actions, would not do so, given that it starts out with the (meta-)policy to always use the optimal policy, and the additional constraint to change only if that leads to a strict improvement. Ring and Orseau (2011) study under which circumstances a universal agent would try to tamper with the sensory information it receives. They introduce the concept of a delusion box, a device that filters and distorts the perception data before it is written into the part of the memory that is read during the calculation of utility.

A further complication to take into account is the possibility that the part of memory that contains the utility function may get rewritten, either by accident, by deliberate choice (programmers trying to correct a mistake), or in an attempt to wirehead. To analyze this further we will now consider what can happen if the screen flashes different goals in different time cycles. Let

denote the utility function the agent will have at time k.

Even though we will only analyze instances in which the agent knows at time k, which utility function it will have at future times (possibly depending on the actions before that), we note that for every fixed future history the agent knows the utility function that is displayed on the screen because the screen is part of its perception data .

This leads to three different agent models worthy of further investigation:

**Agent 1**will optimize for the goals that are displayed on the screen right now and act as if it would continue to do so in the future. We describe this with the utility function**Agent 2**will try to anticipate future changes to its utility function and maximize the utility it experiences at every time cycle as shown on the screen at that time. This is captured by**Agent 3**will, at time k, try to maximize the utility it derives in hindsight, displayed on the screen at the time horizon

Of course arbitrary mixtures of these are possible.

The type of wireheading that is of interest here is captured by the Simpleton Gambit described by Orseau and Ring (2011), a Faustian deal that offers the agent maximal utility in exchange for its willingness to be turned into a Simpleton that always takes the same default action at all future times. We will first consider a simplified version of this scenario: The Simpleton future, where the agent knows for certain that it will be turned into a Simpleton at time k+1, no matter what it does in the remaining time cycle. Assume that for all possible action-perception combinations the utility given by the current utility function is not maximal, i.e. holds for all . Assume further that the agents actions influence the future outcomes, at least from its current perspective. That is, for all there exist with . Let be the Simpleton utility function, assigning equal but maximal utility to all possible futures. While Agent 1 will optimize as before, not adapting its behavior to the knowledge that its utility function will change, Agent 3 will be paralyzed, having to rely on whatever method its implementation uses to break ties. Agent 2 on the other hand will try to maximize only the utility .

Now consider the actual Simpleton Gambit: At time k the agent gets to choose between changing, , resulting in and (not changing), leading to for all . We assume that has no further effects on the environment. As before, Agent 1 will optimize for business as usual, whether or not it chooses to change depends entirely on whether the screen specifically mentions the memory pointer to the utility function or not.

Agent 2 will change if and only if the utility of changing compared to not changing according to what the screen currently says is strictly smaller than the comparative advantage of always having maximal utility in the future. That is,

is strictly less than

This seems quite analogous to humans, who sometimes tend to choose maximal bliss over future optimization power, especially if the optimization opportunities are meager anyhow. Many people do seem to choose their goals so as to maximize the happiness felt by achieving them at least some of the time; this is also advice that I have frequently encountered in self-help literature, e.g. here. Agent 3 will definitely change, as it only evaluates situations using its final utility function.

Comparing the three proposed agents, we notice that Agent 1 is dynamically inconsistent: it will optimize for future opportunities, that it predictably will not take later. Agent 3 on the other hand will wirehead whenever possible (and we can reasonably assume that opportunities to do so will exist in even moderately complex environments). This leaves us with Agent model 2 and I invite everyone to point out its flaws.

[1] Dotted actions/ perceptions, like denote past events, underlined perceptions denote random variables to be observed at future times.

[2] Bostrom (2011) proposes using hyperreal numbers, which rely heavily on the axiom of choice for the ultrafilter to be used and I don't see how those could be implemented.

How do your proposed changes to the utility function affect e.g. the proof of AIXI's pareto optimality (allured to in pp. 30 in the paper) and other provable properties of AIXI?

I am quite sure that pareto optimality is untouched by the proposed changes, but I haven't written down a proof yet.

If only the problem was that easy. Telling an agent to optimise a utility function over external world states - rather than a reward function - gets into the issue of how you tell a machine the difference between real and apparent utility - when all they have to go on is sensory data.

It isn't easy to get this right when you have a superintelligent agent working to drive a wedge between your best efforts, and the best possible efforts.

I don't think infinitessimal utilities are really Nick Bostrom's idea. To quote from me in 2009:

But early versions of Bostrom's "Infinite Ethics" paper have been online since at least May 2004.

Er, I wasn't trying to take credit. I was saying the idea dates back to Conway. Decision theory was the motivation behind the invention of surreal numbers in the first place.

Game theory motivated surreal numbers. Game theory != decision theory.

Gotcha.

Could you explain more about why you're down on agent 1, and think agent 2 won't wirehead?

My first impression is that agent 1 will take its expected changes into account when trying to maximize the time-summed (current) utility function, and so it won't just purchase options it will never use, or similar "dumb stuff." On the other topic, the only way agent 2 can't wirehead is if there's no possible way for it to influence its likely future utility functions - otherwise it'll act to increase the probability that it chooses big, easy utility functions, and then it will choose same big easy utility functions, and then it's wireheaded.

I am pretty sure that Agent 2 will wirehead on the Simpleton Gambit, depending heavily on the number of time cycles to follow, the comparative advantage that can be gained from wireheading and the negative utility the current utility function assigns to the change.

Agent 1 will have trouble modeling how its decision to change its utility function now will influence its own decisions later, as described in AIXI and existential despair. So basically the two futures look very similar to the agent except that for the part where the screen says something different and then it all comes down to whether the utility function has preferences over that particular fact.

Ah, right, that abstraction thing. I'm still fairly confused by it. Maybe a simple game will help see what's going on.

The simple game can be something like a two-step choice. At time T1, the agent can send either A or B. Then at time T2, the agent can send A or B again, but its utility function might have changed in between.

For the original utility function, our payoff matrix looks like AA: 10, AB: -1, BA: 0, BB: 1. So if the utility function didn't change, the agent would just send A at time T1 and A at time T2, and get a reward of 10.

But suppose in between T1 and T2, a program predictably changes the agent's payoff matrix, as stored in memory, to AA: -1, AB: 10, BA: 0, BB: 1. Now if the agent sent A at time T1, it will send

Bat time T2, to claim the new payoff for AB of 10 units. Even though AB is lowest on the preference ordering of the agent at T1. So if our agent is clever, it sends B at time T1 rather than A, knowing that the future program will also pick B, leading to an outcome (BB, for a reward of 1) that the agent at T1 prefers to AB.So, is our AIXI Agent 1 clever enough to do that?

I would assume that it is not smart enough to forsee its own future actions and therefore dynamically inconsistent. The original AIXI does not allow for the agent to be part of the environment. If we tried to relax the dualism then your question depends strongly on the approximation to AIXI we would use to make it computable. If this approximation can be scaled down in a way such that it is still a good estimator for the agent's future actions, then maybe an environment containing a scaled down, more abstract AIXI model will, after a lot of observations, become one of the consistent programs with lowest complexity. Maybe. That is about the only way I can imagine right now that we would not run into this problem.

Thanks, that helps.

Be warned that that post made practically no sense - and surely isn't a good reference.

This seems so flawed as to be pretty much useless. Specification for an agent that optimizes for its current utility function under the knowledge that its utility function will change:

First, replace the action-perception sequence with an action-perception-utility sequence u1,y1,x1,u2,y2,x2,etc. Let the action-generating function be represented by action(k), where k is the step. This will make use of a recursive helper function modeled_action(n, k), representing what it thinks it will do in the future, where n-k is the number of steps forward it looks.

action(k) = modeled_action(m_k, k).

modeled_action(k, k) = argmax(y_k) u_k(yx_<k, yx_k)*M(uyx_<k, uyx_k)

for n>k: modeled_action(n, k) = argmax(y_k) u_k(yx_k.

Apologies for the lack of LaTeX.

This seems unnecessary. The information u_i is already contained in x_i.

This completely breaks the expectimax principle. I assume you actually mean something like =\textrm{arg}\max_{y_k}\sum_{x_k}u_k(\.{y}\.{x}_{%3Ck}y\underline{x}_{k:n})M(\.{y}\.{x}_{%3Ck}y\underline{x}_{k:n}))

which is just Agent 2 in disguise.

Oops. Yes, that's what I meant. But it is not the same as Agent 2, because this (Agent 4?) uses its current utility function to evaluate the desirability of future observations and actions, even though it knows that it will use a different utility function to choose between them later. For example, Agent 4 will not take the Simpleton's Gambit because it cares about its current utility function getting satisfied in the future, not about its future utility function getting satisfied in the future.

Agent 4 can be seen as a set of agents, one for each possible utility function, that are using game theory with each other.

I second the general sentiment that it would be good for an agent to have these traits, but if I follow your equations I end up with Agent 2.

No, you don't. If you tried to represent Agent 2 in that notation, you would get

modeled_action(n, k) = argmax(y_k) sum(x_k) [u_k(yx_k.

You were using u_k to represent the utility of the last step of its input, so that total utility is the sum of the utilities of its prefixes, while I was using u_k to represent the utility of the whole sequence. If I adapt Agent 4 to your use of u_k, I get

modeled_action(n, k) = argmax(y_k) sum(x_k) [u_k(yx_k.

I am starting to see what you mean. Let's stick with utility functions over histories of length m_k (whole sequences) like you proposed and denote them with a capital U to distinguish them from the prefix utilities. I think your Agent 4 runs into the following problem: modeled_action(n,m) actually depends on the actions and observations yx_{k:m-1} and needs to be calculated for each combination, so y_m is actually

)

which clutters up the notation so much that I don't want to write it down anymore.

We also get into trouble with taking the expectation, the observations x_{k+1:n} are only considered in modeling the actions of the future agents, but not now. What is M(yx_<k,yx_k:n) even supposed to mean, where do the x's come from?

So let's torture some indices:

=\textrm{arg}\max_{y_n}\sum_{x_{n:m_k}}U_n(yx_{1:n}\hat{y}_{n+1,k}(yx_{1:n})x_{n+1}\dots)

x_{m_k})M(\.{y}\.{x}_{%3Ck}yx_{k:n-1}\hat{y}\underline{x}_{n:m_k}))

where n>=k and

This is not really AIXI anymore and I am not sure what to do with it, but I like it.

Yes.

Oops, you are right. The sum should have been over x_{k:n}, not just over x_k.

Yes, that is a cleaner and actually correct version what I was trying to describe. Thanks.

It looks like AIXI is already dynamically inconsistent, since it assumes that on step k+1, it will look m_k - (k+1) steps ahead, when it will in fact look m_(k+1) - (k+1) steps ahead. I suppose if the utility of a prefix of a string is a good heuristic for the utility of the whole string, this isn't a huge problem?

AIXI actually has a configurable horizon function. It's described on page 30 of AIXIgentle.

There is also a more detailed paper by Lattimore and Hutter (2011) on discounting and time consistency that is interesting in that context.

This is a very interesting paper. Reminds me of HIGHLANDER for some reason... those guys lived for thousands of years and weren't even rich? They hadn't usurped control of vast econo-political empires? No hundred-generations-long family of bodyguards?

I think people would get pretty antsy when it became clear that the guy running their town was an immortal. If I were a 13th century peasant with a hankering for revolt and a touch of the plague, I would do terrible, terrible things to someone who was both immortal and rich. Probably best not to get too showy.

If a human line of descent can't do that, why should an immortal be able to do that?

Consistency? And, in fairness, human lines of descent have become monarchies, which worked out pretty well for a while.

This generalizes to the horizon problem: If at time k you only look ahead to time step m_k but have unlimited life span you will make infinitely large mistakes.