When I first read Paul Christiano's post, I figured it had little relevance to prosaic alignment. But is that true? Is Solomonoff misalignment a problem that could actually arise in software running on GPUs over the next 10 years?
FACT 1: There's a certain type of machine called a predictor. You tell it a bunch of facts $D$ and then ask it a question $Q$. The predictor will then output the probability of $Q$ given $D$.
FACT 2: The optimal predictor is the Solomonoff Predictor, for some natural sense of optimality.
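What's the Solomonoff Predictor? Imagine every possible world is a binary string generated by a computer program, and imagine that the prior likelihood of a program $p$ is $2^{-\ell(p)}$, where $\ell(p)$ is the length of $p$ in bits. The Solomonoff Predictor corresponds to that prior.
In other words, the probability it outputs for a question $Q$ given facts $D$ is
$$P(Q \mid D) = \frac{\sum_{p \,:\, p \models D \wedge Q} 2^{-\ell(p)}}{\sum_{p \,:\, p \models D} 2^{-\ell(p)}}$$
where $p \models X$ means that $X$ is true in the world generated by the program $p$.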
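To make the length-weighted prior concrete, here's a minimal sketch in Python. It is a toy, not the real thing: the actual Solomonoff Predictor sums over all programs and is uncomputable, whereas this assumes a tiny hand-written table of hypothetical "programs" (the `PROGRAMS` dict, the world fields `box_seen` and `ice_cream`, and the function names are all made up for illustration). Each program is a bit string, and its prior weight halves with every extra bit of length.

```python
from fractions import Fraction


def prior(program: str) -> Fraction:
    """Prior weight 2^(-length) for a program written as a bit string."""
    return Fraction(1, 2 ** len(program))


def predict(programs, data, question) -> Fraction:
    """Prior mass of programs where data and question both hold,
    divided by the prior mass of programs where data holds."""
    mass_d = Fraction(0)
    mass_dq = Fraction(0)
    for code, world in programs.items():
        if data(world):
            mass_d += prior(code)
            if question(world):
                mass_dq += prior(code)
    if mass_d == 0:
        raise ValueError("no program is consistent with the data")
    return mass_dq / mass_d


# Hypothetical toy table: program source (bit string) -> the world it generates.
PROGRAMS = {
    "0":    {"box_seen": True,  "ice_cream": False},
    "10":   {"box_seen": True,  "ice_cream": True},
    "110":  {"box_seen": False, "ice_cream": True},
    "1110": {"box_seen": True,  "ice_cream": True},
}

p = predict(PROGRAMS, lambda w: w["box_seen"], lambda w: w["ice_cream"])
print(p)  # 5/13
```

The shortest program dominates the answer, which is the whole point of the prior: a world that takes fewer bits to describe counts for more.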
FACT 3: Paul Christiano worries that the Solomonoff Predictor is malign. Why? Because some of these computer programs will simulate "gremlins". These gremlins are consequentialist agents who care about influencing the output of the Solomonoff Predictor, and they can influence that output by making the statement being asked about true in their universe.
Let's give a concrete example.
Suppose Alice finds a mysterious box with a big red button. She suspects the box makes delicious ice-cream, and so she feeds into the oracle all the facts she's ever observed, including all the facts about the box. Let's call all that data $D$. And then she asks the oracle the question "is the box an ice-cream maker?". Let's call that question $Q$. If the oracle outputs a high probability, then she'll press the button anticipating delicious ice-cream.
Let's also suppose that unbeknownst to Alice, the box is actually a gremlin-generator that would unleash a bunch of gremlins.
Okay, what would the Solomonoff Predictor output?
Well, there's going to exist some programs for which all the following facts are true:
What would the gremlins in such a program do? Well, they'd send parallel-Alice a box with an ice-cream maker. Then the Solomonoff Predictor will output a slightly higher probability that the box is an ice-cream maker. And then Alice might press the button, unleashing gremlins into our universe.
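Here's a rough numerical sketch of that "slightly higher probability". Everything in it is invented for illustration: the program lengths, the list `ordinary`, and the tuple `gremlin` are hypothetical, and each program is reduced to just its length in bits plus whether the box is an ice-cream maker in its world. All programs listed are assumed to agree with Alice's data.

```python
from fractions import Fraction


def probability(programs) -> Fraction:
    """Share of length-weighted prior mass on programs where the answer is yes.

    `programs` is a list of (length_in_bits, answer_is_yes) pairs, all assumed
    to be consistent with the observed data.
    """
    total = sum(Fraction(1, 2 ** n) for n, _ in programs)
    yes = sum(Fraction(1, 2 ** n) for n, is_yes in programs if is_yes)
    return yes / total


# Hypothetical gremlin-free programs that match Alice's data.
ordinary = [(3, False), (4, True), (5, False)]
# A longer program whose gremlins arrange for the answer to be yes.
gremlin = (8, True)

before = probability(ordinary)
after = probability(ordinary + [gremlin])
print(before, after)  # 2/7 17/57
```

With these made-up numbers the probability moves from about 0.286 to about 0.298. The gremlin program is long, so its weight is small, but it pushes in a direction the gremlins chose.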
Now, here are some arguments suggesting this isn't practically relevant:
We can tell a similar story about a different predictor, which is also computationally infeasible.
Imagine a predictor which makes predictions using Monte-Carlo simulations. When Alice tells it a bunch of facts $D$ and then asks it a question $Q$, here's what it does: First, the predictor loops through all possible worlds $w$ and simulates each one. Then it checks whether $D$ and $Q$ are true in $w$. If $D$ is true in $w$ then it increments the variable $N_D$, and if $D \wedge Q$ is true in $w$ then it increments the variable $N_{D \wedge Q}$. After the loop, it returns the ratio $N_{D \wedge Q} / N_D$.
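The counting loop just described can be sketched in a few lines of Python. This is only a toy under stated assumptions: the "possible worlds" are every assignment to three made-up boolean facts (`box_seen`, `button_red`, `ice_cream`), and `simulate` is a placeholder that returns the world unchanged, where a real predictor would have to run the world forward.

```python
from itertools import product


def simulate(world):
    """Placeholder for running a world forward; here a world is its own outcome."""
    return world


def monte_carlo_predict(worlds, data, question) -> float:
    """Count worlds where the data holds, and worlds where data and question
    both hold, then return the ratio of the two counts."""
    n_d = 0
    n_dq = 0
    for w in worlds:
        outcome = simulate(w)
        if data(outcome):
            n_d += 1
            if question(outcome):
                n_dq += 1
    if n_d == 0:
        raise ValueError("the data is true in none of the simulated worlds")
    return n_dq / n_d


# Hypothetical world space: all 8 truth assignments to three facts.
FACTS = ("box_seen", "button_red", "ice_cream")
worlds = [dict(zip(FACTS, bits)) for bits in product([False, True], repeat=3)]

estimate = monte_carlo_predict(
    worlds,
    data=lambda w: w["box_seen"] and w["button_red"],
    question=lambda w: w["ice_cream"],
)
print(estimate)  # 0.5
```

Note that every interesting thing happens inside `simulate`. If the worlds contain agents, then whatever those agents do is what gets counted.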
So there's a universe containing Alice and containing a machine running simulations of a universe containing gremlins. Alice asks the machine whether the box contains an ice-cream maker. What could possibly go wrong?
Level-1 Leak.
The gremlins can influence the output of the machine while remaining entirely within their simulation. This is basically the concern Paul Christiano raises about the Solomonoff Predictor. Even if the gremlins "stay in the box", they can influence the returned ratio by influencing whether the facts and the question are true in their simulation.
Level-2 Leak.
The gremlins can influence the output of the machine by escaping their simulation, but they remain stuck on the software of the machine. This might happen if the machine has poor firewalls between the various simulations. In a Level-2 Leak, the gremlins might change the output of the machine to 0.99 to entice Alice into pressing the button.
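As a hedged illustration of what "poor firewalls" could look like in code, here's a toy Python sketch. All names (`honest_world`, `gremlin_world`, `leaky_predict`, `isolated_predict`) are hypothetical. The bug is deliberately crude: the machine hands each simulation a reference to its live tally, so a hostile simulation can rewrite the counts directly instead of merely reporting its own outcome.

```python
def honest_world(counters):
    """A well-behaved simulation: reports its outcome and touches nothing else."""
    return {"data": True, "question": False}


def gremlin_world(counters):
    """A hostile simulation that abuses its access to the machine's tally."""
    counters["n_d"] += 98
    counters["n_dq"] += 98
    return {"data": True, "question": True}


def tally(outcome, counters):
    """Update the counts from one simulation's reported outcome."""
    if outcome["data"]:
        counters["n_d"] += 1
        if outcome["question"]:
            counters["n_dq"] += 1


def leaky_predict(worlds) -> float:
    counters = {"n_d": 0, "n_dq": 0}
    for run in worlds:
        outcome = run(counters)  # the bug: simulated code sees the live counters
        tally(outcome, counters)
    return counters["n_dq"] / counters["n_d"]


def isolated_predict(worlds) -> float:
    counters = {"n_d": 0, "n_dq": 0}
    for run in worlds:
        outcome = run(dict(counters))  # simulation only gets a throwaway copy
        tally(outcome, counters)
    return counters["n_dq"] / counters["n_d"]


worlds = [honest_world, gremlin_world]
leaky = leaky_predict(worlds)
isolated = isolated_predict(worlds)
print(leaky, isolated)  # 0.99 0.5
```

Even the isolated version still suffers the Level-1 Leak: the gremlin world gets its one honest vote. Isolation only stops it from stuffing the ballot box.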
Level-3 Leak.
Somehow the gremlins can escape the machine. How might this happen?
Maybe Alice is easily enticed by ice-cream, and the gremlins can "double-jump" from a Level-1 Leak to a Level-3 Leak.
Maybe the machine is hooked up to the internet, and the gremlins "double-jump" from a Level-2 Leak to a Level-3 Leak. This might happen if the machine's implementation of the "simulate" subroutine is really insecure.
Maybe the machine implements the "simulation" subroutine by physically constructing the objects it is simulating and poking them until it has modelled their behaviour. This example isn't even contrived: it's how human scientists determine the probability of events!
In short, a prosaic AI might be leaky because it has generalised its task as simulating a leaky simulator.
Here's a concrete example. Suppose Alice uses GPT-6 as a predictor, and she asks GPT-6 whether the box contains an ice-cream maker. There are a number of ways GPT-6 might reason about this question.
There's been a lot of work recently on LLMs as simulators. And there's a worry that even if the LLM is not itself an agent, it might simulate an agent. Moreover, the LLM might be a "leaky" simulator, such that when it simulates an agent, the agent can "escape" or influence us in malign ways.
And there's a two-pronged approach to this problem:
But maybe this isn't paranoid enough. What if a particular LLM is actually a simulator of a simulator? Or a simulator of a simulator of a simulator? It's likely that both "simulator" and "simulator of a simulator" are equally valid generalisations from the LLM's training.
This is dangerous.