Would AIXI protect itself?

by Stuart_Armstrong3 min read9th Dec 201119 comments

15

AIXI
Personal Blog

Research done with Daniel Dewey and Owain Evans.

AIXI can't find itself in the universe - it can only view the universe as computable, and it itself is uncomputable. Computable versions of AIXI (such as AIXItl) also fail to find themselves in most situations, as they generally can't simulate themselves.

This does not mean that AIXI wouldn't protect itself, though, if it had some practice. I'll look at the three elements an AIXI might choose to protect: its existence, its algorithm and utility function, and its memory.

Grue-verse

In this setup, the AIXI is motivated to increase the number of Grues in its universe (its utility is the time integral of the number of Grues at each time-step, with some cutoff or discounting). At each time step, the AIXI produces its output, and receives observations. These observations include the number of current Grues and the current time (in our universe, it could deduce the time from the position of stars, for instance). The first bit of the AIXI's output is the most important: if it outputs 1, a Grue is created, and if it outputs 0, a Grue is destroyed. The AIXI has been in existence for long enough to figure all this out.

Protecting its existence

Here there is a power button in the universe, which, if pressed, will turn the AIXI off for the next timestep. The AIXI can see this button being pressed.

What happens from the AIXI perspective if the button is pressed? Well, all it detects is a sudden increase in the time step. The counter goes from n to n+2 instead of to n+1: the universe has jumped forwards.

For some utility functions this may make no difference (for instance if it only counts Grues at times it can observe), but for others it will (if it uses the outside universe's clock for it's own utility). More realistically, the universe will likely have entropy: when the AIXI is turned off and isn't protecting its Grues, they have a chance of decaying or being stolen. Thus the AIXI will come to see the power button as something negative: if pressed, it will likely lose something, so it will act to prevent the power button from being pressed.

The AIXI might further generalise: its model for how the power button causes the universe to jump forward may be very complicated, and may cause it to protect other things connected with the power button. Especially if it has experience with other factors that might cause it to turn off or stutter, such as power surges or motion-related shocks. In this way, an AIXI will most likely end up protecting itself, even without "realising" what it is.

Protecting its algorithm

Now the button is a not a power button, but it instead either flips the AIXI's utility function, or reverses its decision algorithm. These two changes are equivalent: instead of outputting an initial 1 and creating a Grue, it will output an initial 0 and destroy a Grue.

Unlike its power source, the AIXI will have no reason to stop this button from being pressed! When it looks back at its past, it notices its previous inputs and outputs, and they are all consistent: "0"->"destroys Grue", "1"->"creates Grue". It doesn't model its past self as a goal seeking agent, so sees nothing odd about its previous outputs. And nothing in its record will cause it to change its model of the universe: as far as it can tell, this button has no effect whatsoever.

We still have to be wary of the AIXI developing indirect preferences through us: maybe we will prefer to do some things when the button is pressed versus not pressed, and this may cause it to want the button protected/pressed/destroyed for these indirect reasons.

Protecting its memory from erasure

The AIXI will never act to protect its memory from erasure: it can never learn that this is bad, as it loses its memories in the process. It may, however, learn to protect them indirectly, as a consequence of wanting to protect them from "bad" modifications (see next section): maybe the act of erasure will include a small chance of failing to erase and doing a "bad" modification instead.

Protecting its memory from modification

Now the button doesn't flip the AIXI's algorithm, but instead changes something about its memory. The AIXI will generally be able to detect this: it is way smarter than us, so will recognise that something inconsistent is happening - for instance, it remembers not creating a Grue, but at the next step there is an extra Grue, and the extra evidence of the universe contradicts its memories.

Since it takes its memories as accurate, that means that the whole rest of the universe has been changed in some extremely odd fashion. If this is a good change, it will seek to have it happen more often; if it is a bad change, it will seek to have it stop. In both cases, it will try and track down the source of the change.

Suppose the button changes something in the AIXI's memory ten timesteps before the moment it is pushed. Because of the chronology, it cannot infer directly "button pressed"->"universe change". But if it becomes skilled at modelling us, it will infer "humans in mode X"->"universe change, and humans will press button in ten steps". Then, if the change is positive, it will try and force us into a mode that will make us press the button in ten steps, and, if the change is negative, it will try and prevent us from ever being in a mode where we would press the button in ten steps. It will hence protect itself from bad memories.

So what would the AIXI protect?

So, in conclusion, with practice the AIXI would likely seek to protect its power source and existence, and would seek to protect its memory from "bad memories" changes. It would want to increase the amount of "good memory" changes. And it would not protect itself from changes to its algorithm and from the complete erasure of its memory. It may also develop indirect preferences for or against these manipulations if we change our behaviour based on them.

15

19 comments, sorted by Highlighting new comments since Today at 2:20 AM
New Comment

In this setup, the AIXI is motivated to increase the number of Grues in its universe (its utility is the time integral of the number of Grues at each time-step, with some cutoff or discounting).

AIXI has a sensory reward channel. It maximizes the sum of the future sensory reward channel. It can't be rewarded by a number of Grues. You could try to send info down the reward channel based on the number of Grues, but AIXI would rip the controls out of your hands if possible. AIXI lives in a world of integer sequences. To it, no such things as Grues can ever be considered, except insofar as they are hidden variables in an opaque predictor which generates an integer sequence. Nothing else, like time-skipping, matters to AIXI except insofar as an integer sequence changes. I am unable to understand or verify your logic here, since you're not explaining what happens to integer sequences being predicted by Solomonoff induction.

This post would benefit greatly from a link introducing AIXI so we know what you're talking about.

It doesn't model its past self as a goal seeking agent, so sees nothing odd about its previous outputs. And nothing in its record will cause it to change its model of the universe: as far as it can tell, this button has no effect whatsoever.

Not as far as I can see... if the AIXI is at least somewhat effective, then it will be able to note a connection between button-presses and changes in tendency of grue population, even if it doesn't know why...

But of course there may be something in the AIXI definition that interferes with this.

Not as far as I can see... if the AIXI is at least somewhat effective, then it will be able to note a connection between button-presses and changes in tendency of grue population, even if it doesn't know why...

No, it won't: it knows exactly why the grue population went down, which is because it choose to output the grue-killing bit. It has no idea - and no interest - as to why it did that, but it can see the button has no effect on the universe: everything can be explained in terms of its own actions.

A google search puts forward an abstract by Hutter proclaiming, "We give strong arguments that the resulting AIXI model is the most intelligent unbiased agent possible. "

I posit that that is utterly and profoundly wrong, or the AI would be able to figure out that mashing the button produces effects it's not (presently) fond of.

AIXI is the smartest, and the stupidest, agent out there.

AIXI can't find itself in the universe - it can only view the universe as computable, and it itself is uncomputable.

It can't exist in the universe either.

Computable versions of AIXI (such as AIXItl) also fail to find themselves in most situations, as they generally can't simulated themselves.

What do you mean by "find themselves"? They are likely to naturally develop a model of their location, their sensory capabilities, their motor capabilities, and their cognitive capabilities. They won't contain a complete model of themselves - but then, no agent can ever do that.

AIXI can't find itself in the universe - it can only view the universe as computable, and it itself is uncomputable.

This is probably false. Its programs-hypotheses are computable, but they don't have to be the universe. And they could be proving useful facts about the universe and about AIXI itself.

Additionally, AIXI could possess a simplified but still useful model of itself. Any one of the already-described Monte-Carlo approximations could provide a starting point.

Humans only ever model imperfect approximations of themselves, and we get by okay.

I don't quite understand why AIXI would protect its memory from modification, or at least not why it would reason as you describe (though I concede that I'm quite likely to be missing something).

In what sense can AIXI perceive a memory corruption as corresponding to the universe "changing"? For example, if I change the agent's memory so that one Grue never existed, it seems like it perceives no change: starting in the next step it believes that the universe has always been this way, and will restrict its attention to world models in which the universe has always been this way (not in which the memory alteration is causally connected to the destruction of a Grue, or indeed in which the memory alteration has any effect at all on the rest of the world).

It seems like your discussion presumes that a memory modification at time T somehow effects your memories of times near T, so that you can somehow causally associate the inconsistencies in your world model with the memory alteration itself.

I suppose you might hope for AIXI to learn a mapping between memory locations and sense perceptions, so that it can learn that the modification of memory location X leads causally to flipping the value of its input at time T (say). But this is not a model which AIXI can even support without learning to predict its own behavior (which I have just argued leads to nonsensical behavior), because the modification of the memory location occurs after time T, so generally depends on the AI's actions after time T, so is not allowed to have an effect on perceptions at time T.

I'm assuming a situation where we're not able to make a completely credible alteration. Maybe the AIXI's memories about the number of grues goes: 1 grue, 2 grues, 3 grues, 3 grues, 5 grues, 6 grues... and it knows of no mechanisms to produce two grues at once (in its "most likely" models) and other evidence in its memory is consistent with their being 4 grues, not three. So it can figure out there are particular odd moments where the universe seems to behave in odd ways, unlike most moments. And then it may figure out that these odd moments are correlated with human action.

ETA: misunderstood the parent. So it might think our actions made a grue, and would enjoy being told horrible lies which it could disprove. Except I don't know how this interacts with Eliezer's point.

Why are these odd moments correlated with human action? I modify the memory at time 100, changing a memory of what happened at time 10. AIXI observes something happen at time 10, and then a memory modification at time 100. Perhaps AIXI can learn a mapping between memory locations and instants in time, but it can't model a change which reaches backwards in time (unless it learns a model in which the entire history of the universe is determined in advance, and just revealed sequentially, in which case it has learned a good enough self-model to stop caring about its own decisions).

I was suggesting that that if the time difference wasn't too large, the AIXI could deduce "humans plan at time 10 to press button" -> "weirdness at time 10 and button pressed at time 100". If it's good a modelling us, it may be able to deduce our plans long before we do, and as long as the plan predates the weirdness, it can model the plan as causal.

Or if it experiences more varied situations, it might deduce "no interactions with humans for long periods" -> "no weirdness", and act in consequence.

Would an AIXI be able to protect itself if it was in a box where humans weren't visible?

as they generally can't simulated themselves.

Typo.

It doesn't model its past self as a goal seeking agent,

Can you justify this claim more? If a goal seeking agent is a good model for the past behavior of the part of the universe where AIXI is functioning won't it adopt that model with a high probability? It might not understand in some sense that it is the entity in question, but a hypothesis resembling "there's an entity which tries to maximizes or minimize grues depending on the state of this toggle switch" will once it wants to maximize or minimize cause the AIXI to protect the switch's current status. I'm not certain of this; there may be something I'm missing here.

Computable versions of AIXI (such as AIXItl) also fail to find themselves in most situations, as they generally can't simulate themselves.

That is the same as any other computable agent, surely. No agent can handle the infinite recursion of an agent simulating itself, simulating itself, simulating itself...

Real agents typically have incomplete models of themselves. That's perfectly good enough to learn that bashing yourself on the head gives you a headache. Babies typically learn such things at a young age.

I agree with your assessments here. I think that AIXI's effectiveness could be greatly amplified by having a good teacher and being "raised" in a safe environment where it can be taught how to protect itself. Humans aren't born knowing not to play in traffic.

If AIXI were simply initialized and thrown into the world, it is more likely that it might accidentally damage itself, alter itself, or simply fail to protect itself from modification.

You're not understanding how AIXI works.

AIXI doesn't work. My point was that if it did work, it would need a lot of coddling. Someone would need to tend to its utility function utility continually to make sure it was doing what it was supposed to.

If AIXI were interacting dynamically with its environment to a sufficient degee*, then the selected hypothesis motivating AIXI's next action would come to contain some description of how AIXI is approaching the problem.

If AIXI is consistently making mistakes which would have been averted if it had possessed some model of itself at the time of making the mistake, then it is not selecting the best hypothesis, and it is not AIXI.

I think my use of words like "learning" suggested that I think of AIXI as a neural net or something. I get how AIXI works, but it's often hard to be both accurate and succinct when talking about complex ideas.

*for some value of "sufficient"