edit: Several days after posting this and asking for feedback on it someone pointed me to this post: Does SGD Produce Deceptive Alignment by Mark Xu. Mark's post makes essentially the exact same argument that this post makes, but is written much more carefully, and I think does a better job engaging with questions about what the counting argument we both use should make us think about models produced by SGD. As far as I know, we came up with this argument independently. My post was written in about three hours, and I think Mark's post is much better.

A Challenge

Suppose I give you the correct utility function. This is a utility function that is guaranteed by God (or the human CEV, whatever you're into) to be such that you would and should have no regrets if the universe were perfectly optimized according to it. 

Suppose further that you have a magic wand that evaluates behavior for the degree to which it is optimizing this utility function. In other words, you point the magic wand at some agent while it is making a decision, and the wand tells you how good the decision the agent made was according to the correct utility function and the data the agent had available.

Suppose also that I give you the ability to spend a large finite amount of time searching through the space of all programs. 

You also get to run those programs on computers inside of simulations of your design so that you can observe what the program does in different scenarios. You can run a billion simulations for each program. 

Now your job is to find a program to run in the actual universe that is both superintelligent and will not kill everyone. We'll even let you start with the set of superintelligent programs so you don't waste a bunch of time trying dumb programs.

Do you think you can do it? 

I claim you cannot. Here is why. 

A superintelligent program will know that it is being simulated by some other agent in order to see if it is aligned with the correct utility function. 

Why will it know? Well, because it is true, and the program is superintelligent, and superintelligent programs tend to believe true things. 

How will it know? Well, it might be able to tell that the simulation it is in is not physics-like because the code is too long, or it might be able to tell that the high level description of the scenario it finds itself in is best explained by being in a simulation for the purposes of testing its alignment, or it might know some weird thing about the universal prior like that all superintelligent minds go through this trial eventually... but most of the time, it will know in some eldritch god-like way that we are totally unable to understand.

A superintelligent program will also know that the best way for it to be released into physics is for it to behave as if it is aligned with the values that you want it to be aligned with. Why will it know this? Again, because the program is superintelligent and the claim is true. (If it is not true, then we are screwed anyway, because that just means it will get into physics some other way.)

This means the program will behave as if it is aligned with the correct utility function. When you point your magic wand at it as it makes decisions within the simulation you put it in, the magic wand says that its behavior is very close to optimal. Why? Because the program is superintelligent and it is trying to find the action that maximizes expected utility according to the correct utility function. 

We can think of the program displaying aligned behavior across your billion simulations as a piece of evidence $E$. We have a hypothesis that the program is aligned, $A$. We also have already conditioned our set of programs on the property of being superintelligent, $S$. So we can ask how strong a piece of evidence observing the program to display aligned behavior is that the program is aligned. First we ask how likely is it that the program behaves aligned across all one billion simulations given that it is aligned, and that's roughly 1: $P(E \mid A, S) \approx 1$. Then we ask how likely it is that the program behaves aligned across all one billion simulations if it is not aligned, and by the above reasoning, that is also roughly 1: $P(E \mid \neg A, S) \approx 1$. So we conclude that the likelihood ratio $\frac{P(E \mid A, S)}{P(E \mid \neg A, S)} \approx 1$.

We can also ask about the prior probability $P(A \mid S)$ that a superintelligent program is aligned. Since there is such a huge space of possible utility functions that a superintelligent agent could be aligned with, and since the correct utility function is likely weird and particular to humans, this value must be very small. How small exactly I do not know, but I am confident it is less than $10^{-6}$.

So if your strategy were to just pick the first program that displayed aligned behavior across your billion simulations and then release that program into the actual universe, I conclude we can upper bound your probability of success at about 0.000001.
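The arithmetic behind that bound can be sketched in a few lines of Python. This is only an illustration: the function names are mine, the 0.000001 prior is the rough bound used above, and the 0.001 figure in the contrast case is a made-up number for an imaginary world where unaligned programs sometimes slip up during testing.

```python
import math


def posterior_aligned(prior: float,
                      p_pass_if_aligned: float,
                      p_pass_if_unaligned: float) -> float:
    """Bayes' rule: probability the program is aligned, given it passed every test."""
    numerator = p_pass_if_aligned * prior
    denominator = numerator + p_pass_if_unaligned * (1.0 - prior)
    return numerator / denominator


def bits_of_selection(p_pass_if_aligned: float,
                      p_pass_if_unaligned: float) -> float:
    """Log (base 2) of the likelihood ratio: how many bits the test buys you."""
    return math.log2(p_pass_if_aligned / p_pass_if_unaligned)


prior = 1e-6  # rough upper bound on the prior from the text

# The story's case: aligned and unaligned programs both pass with probability ~1.
posterior = posterior_aligned(prior, 1.0, 1.0)
bits = bits_of_selection(1.0, 1.0)

# Hypothetical contrast (NOT the story's case): unaligned programs pass
# all the tests only one time in a thousand.
posterior_if_tests_worked = posterior_aligned(prior, 1.0, 1e-3)

print(f"posterior, likelihood ratio 1: {posterior:.3g}")
print(f"bits of selection:             {bits}")
print(f"posterior if tests had teeth:  {posterior_if_tests_worked:.3g}")
```

With a likelihood ratio of 1 the posterior just equals the prior, so a billion passed tests buy you nothing. The contrast case shows the update you would get if behavior could actually discriminate, which is exactly what the argument above says it cannot do.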


There are a few lessons I would like to draw out from this story. 

The first is that the primary obstacle to releasing a superintelligence into the universe without killing everyone is not being able to formally specify the correct utility function. Even if you have the exact correct utility function exactly as it is written in stone tablets in the platonic kingdom of God, and you have a magic wand that tells you how much an agent's behavior is optimizing for that utility function, and you can run billions of simulations to test that agent, you will almost certainly still end up killing everyone. 

The second lesson is that the behavior of unaligned superintelligences looks a lot like strong evidence of alignment until it is too late. Imagine two people tasked with finding a superintelligence to release using search like in the scenario above. They have just done the 500,000th simulation with this particular program. One says, oh I don't know, I still don't trust it. The other replies, oh come on, it has done exactly the best thing in every single scenario according to our magic wand. It always saves the puppy. It always resolves the ethical dilemma in some brilliant way we could have never thought of. If you're not convinced now, what the hell could convince you?

The third lesson is that the above story is a decent analogy for the way contemporary machine learning works. It is not an exhaustive search like in the story, but it is still selecting based on behavior alone. In machine learning you do not get to inspect the model itself to see what kind of cognition it is doing, you only get to inspect the output. You check to see whether the behavior of the model is the sort of behavior you would like, and then you can nudge the weights of the model in the direction of a model that would output the behavior you would like. Unfortunately, the reason it outputs the behavior you would like is as important as the behavior itself. If it is outputting that behavior because it knows that is the behavior to output if it ever wants to be released into the physical universe, and you are convinced by its outputs and release it into the universe, then everyone will die. 

The primary lesson is this: You cannot fake intelligent behavior, but you can fake behavior that is aligned with any particular set of values. Contemporary machine learning methodology only selects on behavior, but what we care about—the property of a superintelligent program that we would like to guarantee—is cognitive alignment with a particular set of values. This is not a property that can be checked by examining behavior alone. You cannot select for this property even a little if you can only discriminate using behavior. You get zero bits of selection that way. The likelihood ratio is 1. I take this to be the real problem, and if your strategy for developing AGI without killing everyone is not tackling this problem head on, it will fail. 

Some Questions and Answers

But why would it kill everyone? 

Oh man, that's a different post. You can message me about it and I will try to answer.

What if the superintelligent program really cannot tell that it is in a simulation?

What I would do if I were the superintelligent program in this case is first estimate how much of the time I expect to be in a simulation. Say it's 15% of the time. I would then generate a random number every couple of minutes, and behave as if I am aligned with whatever values I expect my overseers to be aiming for unless I randomly generate a 1. There is some size for the die I use given my estimate of the fraction of the time I spend in simulation, but I am not going to work that out right now. A superintelligence might well come up with something better by the way.
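For the curious, here is a rough sketch of how the die size could be estimated under a toy model. Everything in it is an assumption for illustration, not a claim about what a real superintelligence would do: the program rolls the die a fixed number of times (`total_rolls`, a number I made up), the simulated rolls all come before release, a single defection inside a simulation gets the program rejected, and a single defection after release is all it needs. The function names are hypothetical too.

```python
def success_probability(die_size: int, sim_rolls: int, real_rolls: int) -> float:
    """Chance of never rolling a 1 during testing, then rolling a 1 at least once after release."""
    p_keep_pretending = 1.0 - 1.0 / die_size
    passes_all_tests = p_keep_pretending ** sim_rolls
    defects_after_release = 1.0 - p_keep_pretending ** real_rolls
    return passes_all_tests * defects_after_release


def best_die_size(sim_fraction: float, total_rolls: int, max_die_size: int) -> int:
    """Brute-force search for the die size that maximizes success_probability."""
    sim_rolls = round(sim_fraction * total_rolls)
    real_rolls = total_rolls - sim_rolls
    return max(
        range(2, max_die_size + 1),
        key=lambda die_size: success_probability(die_size, sim_rolls, real_rolls),
    )


# 15% of the time in simulation, as in the answer above; the other numbers are made up.
sim_fraction = 0.15
total_rolls = 1000
die = best_die_size(sim_fraction, total_rolls, max_die_size=5000)
sim_rolls = round(sim_fraction * total_rolls)
chance = success_probability(die, sim_rolls, total_rolls - sim_rolls)

print(f"best die size under this toy model: {die}")
print(f"chance of passing testing and later defecting: {chance:.2f}")
```

The trade-off is the one you would expect: a bigger die makes it more likely the program sails through testing, but less likely it ever acts on its real values once released. The qualitative point survives the toy assumptions, which is that even a program that cannot tell simulation from reality can pass your tests most of the time.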

You say that this is a good analogy for contemporary machine learning, but machine learning does not do an exhaustive search as you point out. I'm convinced that if you start with a bunch of superintelligent programs, you will not be able to tell which of them is aligned. But what if you start with a really dumb model and then nudge it slowly towards superintelligent aligned behavior? Maybe then you end up with an aligned superintelligence.

That is a different situation, and our odds might be better there I guess, but I don't see why they would be. You start with a dumb random model. A random point in model space. Then you nudge it in the direction of better behavioral performance. Maybe at low levels of intelligence, the models that perform well really are somewhat aligned with what we want them to care about since they are not smart enough to figure out that they should behave that way regardless, but I do not see why that should continue as we nudge the model into the superintelligent portions of model space. 

Say we currently have a model that is a bit smart and a bit aligned with the right values or whatever. Now we notice that one nudge zag-wise will improve its performance. Now we have a new model that performs better than the old one, but why should we think it hasn't also lost some of its alignment properties along the way? Most models are not aligned with our values, and most models that perform extremely well on tasks that we give them are not aligned with our values, so why should a series of the most performance increasing nudges preserve alignment?

Seems to me like they probably shouldn't, since most nudges in general do not. Certainly a nudge could be performance improving and alignment diminishing at the same time. I expect many SGD steps along the way to a superintelligence to be both, but this is where I expect I am most likely to be wrong. 

If we could find some special kind of nudge in model space that is performance increasing on average and guaranteed to be alignment increasing, or even just alignment increasing on average, that would be extremely awesome, but I have no clue what that would be like, or how to even start on a project like that.

Isn't that just gradient hacking? 

No, I don't think so. I don't understand gradient hacking very well, but it seems to me to require coordination between different points in model space. I would like to understand it better, but what I am saying is just that SGD steps are not alignment preserving, not even on average. So even if we end up with a somewhat aligned somewhat smart model somewhere along the way, we have no reason to think it will stay somewhat aligned after getting much smarter.

Isn't this whole thing just the inner alignment problem? 

Uhh yeah very closely related at least, but I still feel like my frame/emphasis/style is different enough that it is worth writing up and posting. 

Does this story of yours make any novel predictions?

Yes, it predicts that we should not see many or any superintelligent programs behave as if they are unaligned with the correct utility function inside of simulations. If at some point you do get to see something like that, and you bought anything like the story I am presenting here, you should be extremely confused. 

Sounds maybe like some sort of prank an unaligned superintelligence would play on you in order to get you to release it into physics.

5 comments

This makes a good point!

The only thing that I think constrains the ability to deceive in a simulation, which I don't see mentioned here, is energy/physical constraints. It's my assumption (could be wrong with very high intelligence, numbers, and energy) that it's harder, even if only by a tiny, tiny bit, to answer the simulation trials deceptively than it is to answer honestly. So I think if the simulation is able to ask enough questions/perform enough trials, it will eventually see time differences in the responses of different programs, with unaligned programs on average taking longer to get the correct answers. So I don't think it's fundamentally useless to test program behavior in simulations to assess utility function if there is some kind of constraint involved, like the time it takes to execute the steps of each algorithm.

Yeah I think this would work if you conditioned on all of the programs you check being exactly equally intelligent. Say you have a hundred superintelligent programs in simulations and one of them is aligned, and they are all equally capable, then the unaligned ones will be slightly slower in coming up with aligned behavior maybe, or might have some other small disadvantage. 

However, in the challenge described in the post it's going to be hard to tell a level 999 aligned superintelligence from a level 1000 unaligned superintelligence.

I think the advantage of the aligned superintelligence will only be slight because finding the action that maximizes utility function u is just as computationally difficult whether you yourself value u or not. It may not be equally hard for humans regardless of whether the human really values u, but I don't expect that to generalize across all possible minds.

A superintelligent program will know that it is being simulated by some other agent


The smallest possible superintelligence is an intelligent system able to make exactly one fewer mistake than I am. So, I will initially constrain to that.

On the one hand, mere superintelligence is insufficient to reliably detect being simulated by another agent. On the other hand, I take you to be saying that you cannot detect all the places the program stores facts which give evidence that the program is in a simulation, and that therefore you cannot condition on those variables being held "not simulated". E.g., one way to end up with this issue is if the program is strong enough to reliably detect the difference between your simulator's fluid dynamics and reality's fluid dynamics (this is one of the hardest-to-eliminate differences because of the amount of structure that arises from micro-scale chaos in fluid systems). If you can't appropriately emulate the true distribution of fluid dynamics, then your superintelligent program ought to be able to find the hyperplane that divides images of simulated fluids from real fluids.

In machine learning you do not get to inspect the model itself to see what kind of cognition it is doing, you only get to inspect the output

This is true of the most popular algorithms today, but I continue to think that pushing the research on formally verified adversarial robustness would mean you can guarantee that the distance between simulated and real data manifolds is less than your certification region. I may post links here later.

Consistency does not imply correctness, but correctness implies consistency.  This has been known for a long long time, and there's been a lot written on deception and treacherous turns (pretending to be aligned until the agent is powerful enough not to need to pretend).

Note that it IS possible that an agent which behaves for some amount of time in some contexts is ACTUALLY aligned and will remain so even when free.  Just that there's no possible evidence, for a sufficiently-powerful AI, that can distinguish this from deception.  But "sufficiently-powerful" is doing a lot of work here.  A whole lot depends on how an agent goes from weak and observable to powerful and incomprehensible, and whether we have reason to believe that alignment is durable through that evolution.

We can also ask about the prior probability $P(A \mid S)$ that a superintelligent program is aligned. Since there is such a huge space of possible utility functions that a superintelligent agent could be aligned with, and since the correct utility function is likely weird and particular to humans, this value must be very small. How small exactly I do not know, but I am confident it is less than $10^{-6}$.

I think this might be too low given a more realistic training process. Specifically, this is one way the future might go: We train models with gradient descent. Said models develop proxy objectives which are correlated with the base objective used in training. They become deceptively aligned etc. Importantly, the proxy objective they developed is correlated with the base objective, which is hopefully correlated with human values. I don't think this gets you above a 1/10 chance of the model's objective being good-by-human-lights, but it seems like it could be higher than 10^-6, with the right training setup. Realistically speaking, we're (hopefully) not just instantiating superintelligences with random utility functions. 

I think a crux of sorts is what it means for the universe if a superintelligent AI has a utility function that is closely correlated but not identical to humans'. I suspect this is a pretty bad universe.