Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary of the Acausal Attack Issue for AIXI

13th Dec 2021


I still feel like there's just too many pigeons and not enough holes.

Like, if you're an agent in some universe with complexity K(U) and you're located by a bridging rule with complexity K(B), you are *not* an agent with complexity K(U). In the average case you have complexity (or really, you think the world has complexity) K(U)+K(B) minus some small constant. We can illustrate this fact by making U simple and B complicated - like locating a particular string within the digits of pi.
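To make the pi example concrete, here is a minimal Python sketch. It assumes the "universe" is just the first 50 decimal digits of pi (hardcoded), and it uses the bit length of an offset into those digits as a crude stand-in for K(B). The names `PI_DIGITS` and `bridge_cost_bits` are made up for illustration, and real Kolmogorov complexity cannot be computed this way.

```python
# First 50 decimal digits of pi after the "3." (a tiny stand-in for a simple universe U).
PI_DIGITS = "14159265358979323846264338327950288419716939937510"

def bridge_cost_bits(target):
    """Return (offset, bits) for the first occurrence of target in PI_DIGITS.

    bits is the length of the offset written in binary, used here as a
    rough proxy for the cost of the bridging rule that locates target.
    """
    offset = PI_DIGITS.find(target)
    if offset < 0:
        raise ValueError("target not found in the stored digits")
    return offset, max(1, offset.bit_length())

# Longer targets tend to sit further out, so locating them costs more bits.
print(bridge_cost_bits("9"))    # (4, 3)
print(bridge_cost_bits("93"))   # (13, 4)
print(bridge_cost_bits("937"))  # (44, 6)
```

The digit source never changes, so all of the growth in cost comes from the locating rule, which is the sense in which U can be simple while B is complicated.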

And if an adversary in a simple universe (complexity K(U')) "hijacks" you by instantiating you at an easy-to-bridge location (cost K(B')), in their universe, what you've learned is your complexity is actually K(U')+K(B').

But of course there are vastly fewer agents with the small "hijacked" complexity than there are with the large "natural" complexity. I'm really skeptical that you can bridge this gap with arguments like "the universe will be simpler than it naively seems because we can search for ones that are majorly impacted by agents."

But wait, there can only be so many low-complexity universes, and if they're launching *successful* attacks, said attacks would be distributed amongst a far far far larger population of more-complex universes.

Can't you just condition on the input stream to affect all the more-complex universes, rather than targeting a single universe? Specifically: look at the input channel, run basically-Solomonoff-induction yourself, then figure out which universe you're being fed inputs of and pick outputs appropriately. You can't be incredibly powerful this way, since computable universes can't actually contain Solomonoff inductors, but you can predict well a pretty wide variety of universes, e.g. all those computable in polynomial time.

I feel like the word "attack" here is slightly confusing given that AIXI is fully deterministic. If you're an agent with free will, then by definition you are **not** in a universe that is being used for Solomonoff Induction.

if you learn that there's an input channel to your universe

There's absolutely no requirement that someone in a simulation be able to see the input/output channels. The whole point of a simulation is that it should be indistinguishable from reality to those inside.

Consider the following pseudocode:

```
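def predictSequence(seq):
    # Pseudocode: initializeUniverse() stands for starting the simulated universe.
    universe = initializeUniverse()
    obs = []
    while True:
        # The "output channel" is just the state at one fixed point, (0, 0).
        obs.append(universe.stateAtPoint(0, 0))
        universe.step()
        # Once the recorded history equals seq, the next state is the prediction.
        # (If the history diverges from seq, this loop never returns.)
        if obs == seq:
            return universe.stateAtPoint(0, 0)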
```

Now suppose that every point in space is Topologically Indistinguishable (as in our universe). There is literally **no way** for an agent inside the universe to distinguish the "output channel" from any other point in the universe.

But wait, there can only be so many low-complexity universes, and if they're launching *successful* attacks, said attacks would be distributed amongst a far far far larger population of more-complex universes.

This is precisely the point of Solomonoff Induction. Because there are so few low-complexity Turing machines, a machine with the property "accurately predicts my data" is much more likely than a machine with the property "accurately predicts my data and then do something malicious".

Well, by virtue of running an AIXI-like agent that will have large influence on the future, that's an *especially interesting* property of a universe which would tend to draw a whole lot more attention from agents interested in influencing other computations than just being some generic high-complexity computation.

The fact that you are running AIXI means you have access to a halting-oracle. This means it is literally impossible for an agent inside a Turing Machine to out-think you. This is also a kind of "disappointing" property of AIXI. It means that you can't use it to predict things about your own universe (where halting oracles exist), only about simpler universes (which can be simulated on Turing machines). This is kind of like how in our universe there exists a system of logic (first-order logic) that has a consistent and complete definition, but most of the questions we care about in math arise from second-order logic, which is inherently incomplete.

For complex enough bridge rules relative to the complexity of your universe, hypotheses that produce powerful optimizers that target your universe (and an output channel), can come in substantially shorter than "here's the description of the universe, here's the bridge rule"

I don't get why we are assuming the bridge rules will be complicated? Imagine we are simulating the universe using the Game of Life: why not just have a rule like "output the sequence of values at position 0,0"? I mean, I guess you could intentionally choose a bad bridge rule, but you could also intentionally use AIXI to output the most malicious thing possible. So I guess I figure before we learn how to build Halting Oracles we'll also learn to not do that.
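As a rough illustration of that kind of simple bridge rule, here is a small Python sketch of a Game of Life universe on a wrap-around grid, with the rule "output the value at position 0,0 at each step". The grid size, the starting pattern, and the names `step` and `read_channel` are all assumptions chosen for the example, not anything specified in the post.

```python
def step(grid):
    """One Game of Life update on a torus (edges wrap around)."""
    rows, cols = len(grid), len(grid[0])
    new = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Count the eight neighbours, wrapping at the edges.
            neighbours = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            alive = grid[r][c] == 1
            row.append(1 if neighbours == 3 or (alive and neighbours == 2) else 0)
        new.append(row)
    return new

def read_channel(grid, steps, cell=(0, 0)):
    """Bridge rule: report the value of one fixed cell at every time step."""
    out = []
    for _ in range(steps):
        out.append(grid[cell[0]][cell[1]])
        grid = step(grid)
    return out

# A horizontal blinker centred on (0, 1) in an otherwise empty 5x5 torus.
grid = [[0] * 5 for _ in range(5)]
for c in (0, 1, 2):
    grid[0][c] = 1

bits = read_channel(grid, 4)
print(bits)  # [1, 0, 1, 0]
```

Here the whole bridge rule is the pair of coordinates passed as `cell`, which is about as cheap to specify as a bridge rule can be.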

If you're an agent with free will, then by definition you are **not** in a universe that is being used for Solomonoff Induction.

You choosing your actions is compatible with a deterministic universe. https://www.lesswrong.com/posts/NEeW7eSXThPz7o4Ne/thou-art-physics

Now suppose that every point in space is Topologically Indistinguishable (as in our universe).

Then `initializeUniverse()` or `universe.step()` must somehow break the symmetry of the initial state, perhaps through nondeterminism. Simple universes that put a lot of weight on one timeline will be asymmetric, right?

much more likely than a machine with the property "accurately predicts my data and then do something malicious"

The idea is that "accurately predicts my data" is implied by "do something malicious", which you will find contains one fewer word :P.

This means it is literally impossible for an agent inside a Turing Machine to out-think you.

In Robust Cooperation in the Prisoner's Dilemma, agents each prove that the other will cooperate. The halting problem may be undecidable in the general case, but haltingness can sure be proven/disproven in many particular cases.

I don't get why we are assuming the bridge rules will be complicated?

I don't expect our own bridge rules to be simple: Maxwell's equations look simple enough, but locating our Earth in the quantum multiverse requires more bits of randomness than there are atoms.

Another confusion I have is the idea of "input channel". As I understand it there aren't in general any input channels, there are just initial conditions. Certainly some subset of initial conditions will be *equivalent to* consistently sampling a deterministic universe with persistent agents in it that can determine the sampling rules and have their deliberations correlated to some significant degree with the sampling result. The measure of all such subsets *combined* will be, let's just say, small. The measure of any one of them will be so much smaller.

It's also *possible* for an environment to mimic the outputs of such agents embedded in an N-bit program where the true rule is an M-bit program with M>N, and where the consequences of the incorrect evaluation are very seriously detrimental. It's also a situation with so insanely small measure that I'd happily describe *the environment itself* as adversarial in such a scenario.

There are far simpler scenarios in which AIXI (or any other agent model) will fail with measures that are incredibly larger, so focusing on this one seems bizarre.

What is the difference between the "correct" output and a "maliciously modified" output? The agent here is a deterministic consequence of running the given Turing machine, and so anything that they do to affect the "output channel" is by definition *correct*, no matter whether the agent thinks they're being malicious or not.

Attention conservation notice: To a large extent, this is redundant with Paul's previous post about this, but I figured that some people might be interested in my restatement of the argument in my own words, as I did not start out believing that it was an issue, or start out agreeing with Vanessa about the attack operating via bridge rules.

Solomonoff induction/AIXI runs all possible computations, so it's possible in theory to alter which predictions a particular Turing machine outputs, by making stuff happen in your own universe, and this would then influence any process running that Turing machine, such as an AIXI agent. Of course, doing such a thing is subject to the obvious limitation where, if, say, you make the Turing machine that's reading your universe output a 0, you'll have less ability to influence the predictions and decisions of AIXI-like agents in the worlds where they see a 1 instead, because the Turing machine that you're controlling got eliminated for mispredictions.
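The elimination step can be sketched with a toy mixture over deterministic predictors. This is only an illustration under strong assumptions: real Solomonoff induction ranges over all programs and is uncomputable, and the two hypotheses, their bit lengths, and the function name `posterior` are invented for the example.

```python
from fractions import Fraction

def posterior(hypotheses, observed):
    """Weight each deterministic predictor by 2**-length, drop any that
    mispredicts an observed bit, and renormalise the survivors."""
    weights = {}
    for name, (length_bits, predict) in hypotheses.items():
        if all(predict(t) == bit for t, bit in enumerate(observed)):
            weights[name] = Fraction(1, 2 ** length_bits)
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical toy hypotheses: the lengths and behaviours are made up.
hypotheses = {
    # Predicts 0, 1, 0, 1, ... forever.
    "honest": (20, lambda t: t % 2),
    # Copies the honest predictions until step 4, then always outputs 1.
    "controlled": (12, lambda t: t % 2 if t < 4 else 1),
}

before = posterior(hypotheses, [0, 1, 0, 1])     # both still agree with the data
after = posterior(hypotheses, [0, 1, 0, 1, 0])   # the true bit at step 4 is 0

print(before["controlled"])  # 256/257
print(after)                 # only "honest" survives, with weight 1
```

The controlled machine holds most of the weight only while its outputs match what the victim actually sees. In the branch where the victim sees a different bit, it is gone.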

Taking the attacker's perspective, if you were trying to influence *one particular universe* (this assumption will be loosened later) containing an AIXI or sufficiently AIXI-like target agent, and you had sufficiently high predictive abilities, you could try finding two low description complexity spots in your universe, one to check the state of, and one to write data to, and committing to the strategy "if the input data from this simple spot looks like the data I'd predict to receive from the world I'm interested in, I will submit output data accordingly (mostly accurate predictions of the victim's environment, but with whatever tweaks are needed), in order to influence what the target agent does in the targeted universe."

Basically, if you learn that there's an input channel to your universe, it's worthwhile to try to hack the output channel. It doesn't even take much effort to do, you just need to keep an eye on the conjectured input channel, commit to responding accordingly if it looks like it's transmitting data, and do other things in the meantime, and the Turing machine "run your universe, signal in via this channel, read output via this channel" is now under your control.

So... from the perspective of the targeted agent/universe, how well would your hacking attempt work? Well, it critically depends on whether the complexity of specifying "your universe + the I/O channels" is more than, or less than, the complexity of the shortest "honest" predictor of the observations of the targeted agent. If the honest predictor of the victim's observations is less complex than the specification of your universe and the I/O channels to it, then you messing around with the output channel and its predictions of observations would end up just affecting the 100th decimal place of the target AIXI's probabilities or something, because each bit is a factor of 2 difference in probability, and so you're at a huge disadvantage if you want to intentionally screw up the victim's probability estimates.

However, if the complexity of specifying your universe and the I/O channels is shorter than the "honest" predictor of the victim's observations, then after Solomonoff induction is done weeding out the mispredicting "chaff" hypotheses, the Turing machine that you're controlling is dominant over the "honest" predictor by an overwhelming factor (because each extra bit is a 2x difference in probability, so the situation is reversed). Now, just predict doom if the victim doesn't do what you want, and bam! You've taken control of the future of that target universe.
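The "each bit is a factor of 2" arithmetic in the last two paragraphs can be checked with a few lines of Python. This is a sketch only: it assumes both hypotheses have predicted every observation so far, it ignores all other hypotheses in the mixture, and the bit counts are invented for illustration.

```python
import math

def posterior_odds(attacker_bits, honest_bits):
    """Odds of the attacker-controlled machine against the honest predictor,
    assuming both have matched all of the data so far."""
    return 2.0 ** (honest_bits - attacker_bits)

def first_affected_decimal_place(bit_disadvantage):
    """Position of the first nonzero decimal digit of 2**-bit_disadvantage
    (valid for positive bit_disadvantage)."""
    return math.ceil(bit_disadvantage * math.log10(2))

# Attacker's description is 8 bits shorter: it dominates by a factor of 256.
print(posterior_odds(12, 20))   # 256.0

# Attacker's description is 332 bits longer: its weight is roughly 1e-100,
# so it only perturbs the 100th decimal place of the victim's probabilities.
print(posterior_odds(352, 20))
print(first_affected_decimal_place(332))  # 100
```

So the same mechanism that makes a longer attacker negligible makes a shorter one overwhelming, and the sign of the bit difference is the whole story.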

"But wait, how could specifying an entire universe containing an agent interested in hacking a target universe and competent enough to do so end up simpler than just... accurately specifying the target universe?". Ah, it's because we're just measuring the complexity of the shortest
*recipe* (Turing machine code) for specifying the universe (and I/O channels) interested in hacking others. Very short recipes/TM's can unpack into exceptionally intricate and long-running computations, and specifying aspects of the *intermediate state*, such as the specification of the universe that's being targeted, *does* take a lot of bits. There's no obstacle against a complex structure showing up as a (complex to specify) intermediate result of a simple computation.

But wait, there can only be so many low-complexity universes, and if they're launching *successful* attacks, said attacks would be distributed amongst a far far far larger population of more-complex universes. So, switching perspective to whoever is nervously wondering whether to run an AIXI agent, there's probably no low-complexity jerk (low-complexity enough to beat the "right" predictor for your universe) interested in *your* universe in particular (well... it's a bit more complicated, but it looks like that at first glance). In a certain sense, launching the prediction attack exactly as specified here means the attacker is only able to "punch down" in K-complexity.

Admittedly, it's possible to launch multiple attacks, by using a bunch of low-complexity channels in the attacker's universe instead of just one, but there's only so many low-complexity spots available to go around in the attacker's universe, which means that the basic analysis of "the low-complexity universe can only target a relatively small amount of high-complexity universes compared to their total number" still holds.
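The counting behind this can be sketched as follows. The sketch treats every bitstring as a distinct program, which overcounts real universes and so gives an upper bound, and it assumes each attack channel is aimed at exactly one target universe (the single-target setup used so far). The function names and the numbers are illustrative only.

```python
def programs_shorter_than(n_bits):
    """Number of binary strings of length 0 .. n_bits - 1."""
    return 2 ** n_bits - 1

def max_fraction_targeted(attacker_bits, target_bits, channels_per_attacker=1):
    """Upper bound on the fraction of target_bits-long descriptions that can
    be attacked by programs shorter than attacker_bits, if every such
    program attacks one target per channel."""
    attackers = programs_shorter_than(attacker_bits)
    targets = 2 ** target_bits  # strings of exactly target_bits bits
    return min(1.0, attackers * channels_per_attacker / targets)

print(programs_shorter_than(3))                               # 7
print(max_fraction_targeted(10, 20))                          # just under 0.1%
print(max_fraction_targeted(10, 20, channels_per_attacker=4)) # just under 0.4%
```

Adding channels only multiplies the bound by a constant, while each extra bit of target complexity halves it, which is why extra channels do not rescue the single-target version of the attack.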

So the next question is, is there a way to intensify this to a *generic* problem for anything AIXI/Solomonoff-like, instead of it just being a problem for the few unlucky high-complexity universes specifically being targeted by low-complexity ones? From the perspective of a world with a higher-complexity true predictor of observations, is there a *generic* argument that there's a simple Turing machine interested in targeting that world in particular? Surely no, as there's only so many simple Turing machines to go around, right?

Well, by virtue of running an AIXI-like agent that will have large influence on the future, that's an *especially interesting* property of a universe which would tend to draw a whole lot more attention from agents interested in influencing other computations than just being some generic high-complexity computation.

The other issue, which ties into Vanessa's worries about bridge rules, is as follows: The complexity measure the attacker must beat is "target universe + the bridge rule for the observations of the victim", not just the complexity of the target universe. So, if the bridge rules are complex, then "target universe + bridge rule for observations of victim agent" might end up more complex than "very simple universe containing a powerful optimization process that cares about affecting the target universe (among others) + simple I/O channels".

The compression of observations is achieved via the complexity of the bridge rule being tucked into the process of the attacker going "time to figure out where the influential points in this target universe are" (and working out the complex bridge rule itself), since, again, short-description-length computations can unfold into huge computations, where it's complex to slice out a particular intermediate result of the computation.

But, if it's possible to shunt the bridge rule complexity elsewhere like that, then couldn't the target agent compress its prediction of its sensory data with a simple hypothesis containing another agent motivated to make accurate predictions (and so it'd figure out the bridge rule)?

Well, the problem with that is that it'd probably be less complex to specify physical laws that eventually produce an agent that is incentivized to perform acausal attacks, than to specify an agent which has a utility function that, at its maximum, makes accurate predictions of your target universe.

So, that's the basic argument, as I understand it. For complex enough bridge rules relative to the complexity of your universe, hypotheses that produce powerful optimizers that target your universe (and an output channel), can come in substantially shorter than "here's the description of the universe, here's the bridge rule", because the former hypothesis is shunting the bridge complexity into the process of computation itself, and hypotheses like the former are practically guaranteed to have goals that are not your own and so mess with your beliefs to get you to take particular actions.