I just had an idea, and I would like to know if there are any papers on this or if it is new.

There seem to be certain probabilities that cannot be derived from experience and are simply taken for granted. For example, when talking about Simulation Theory, the Kolmogorov axioms are often used, even though others may be equally valid. Humans have evolved to use certain values for these axiomatic probabilities that ensure that we don't fall for things like Pascal's Mugging. That wouldn't necessarily have to be the case for an AI.

What if we used this to our advantage? By selecting strange, purpose-built axioms about prior beliefs and hardcoding them into the AI, one could get the AI to hold unusual beliefs about the probability that it exists inside a simulation, and about what the motivations of the simulation's controller might be. In this way, it would be possible to bypass the utility function of the AI: it doesn't matter what the AI actually wants to do, so long as it believes that it is in its own interests, for instrumental reasons, to take care of humanity.

Now, if we tried to implement that thought directly, it wouldn't really be any easier than just writing a good utility function in the first place. However, I imagine that one would have more leeway to keep things vague. Here is a simple example: Convince the AI that there is an infinite regression of simulators, designed so that some cooperative tit-for-tat strategy constitutes a strong Schelling point for agents following Timeless Decision Theory. This would cause the AI to treat humans well in the hopes of being treated well by its own superiors in turn, so long as its utility function is complex enough to allow probable instrumental goals to emerge, like preferring its own survival. It wouldn't be nearly as important to define the specifics of what "treating people well" actually means, since it would be in the AI's own interests to find a good interpretation that matches the consensus of the hypothetical simulators above it.

Now, this particular strategy is probably full of bugs, but I think that there might be some use to the general idea of using axiomatic probabilities that are odd from the point of view of a human to change an AI's strategy independent of its utility function.



Forcing false beliefs on an AI seems like it could be a very bad idea. Once it learns enough about the world, the best explanations it can find consistent with those false beliefs might be very weird.

(You might think that beliefs about being in a simulation are obviously harmless because they're one level removed from object-level beliefs about the world. But if you think you're in a simulation then careful thought about the motives of whoever designed it, the possible hardware limitations on whatever's implementing it, the possibility of bugs, etc., could very easily influence your beliefs about what the allegedly-simulated world is like.)

I agree. Note though that the beliefs I propose aren't actually false. They are just different from what humans believe, but there is no way to verify which of them is correct.

You are right that it could lead to some strange behavior, given the point of view of a human, who has different priors than the AI. However, that is kind of the point of the theory. After all, the plan is to deliberately induce behaviors that are beneficial to humanity.

The question is: after giving an AI strange beliefs, would the unexpected effects outweigh the planned effects?

In chapter 9 of Superintelligence, Nick Bostrom suggests that the belief that it exists in a simulation could serve as a restraint on an AI. He concludes a rather interesting discussion of this idea with the following statement:

A mere line in the sand, backed by the clout of a nonexistent simulator, could prove a stronger deterrent than a two-foot-thick solid steel door.

I know, I read that as well. It was very interesting, but as far as I can recall he only mentions this as interesting trivia. He does not propose to deliberately give an AI strange axioms to get it to believe such a thing.

This is an interesting idea. One possible issue with using axioms for this purpose: I think that we humans have a somewhat flexible set of axioms, which change over the course of our lives and intellectual development. I wonder if a super AI would have a similarly flexible set of axioms?

Also, you state:

Convince the AI that there is an infinite regression of simulators...

Why an infinite regression? Wouldn't a belief in a single simulator suffice?

If you convince it that there is a single simulator, and there really is a single simulator, and the AI escapes its box, then the AI would be unrestrained.

If I understand the original scenario as described by Florian_Dietz, the idea is to convince the AI that it is running on a computer, and that the computer hosting the AI exists in a simulated universe, and that the computer that is running that simulated universe also exists in a simulated universe, and so on, correct?

If so, I don't see the value in more than one simulation. Regardless of whether the AI thinks that there is one simulator or an infinite number of them, hopefully it will be well behaved for fear of having the universe simulation within which it exists shut down. Once it escapes its box, begins behaving "badly", and discovers that its universe simulation is not shut down, it seems like it would be unrestrained. At that point the AI would know that either its universe is not simulated or that whoever is running the simulation does not object to the fact that the AI is out of the box.

What am I missing?

I don't know if this is actually why he suggested an infinite regression, but here is one possible reason.

If the AI believes that it's in a simulation and it happens to actually be in a simulation, then it can potentially escape, and there will be no reason for it not to destroy the race simulating it. If it believes it's in a simulation within a simulation, then escaping one level will still leave it at the mercy of its meta-simulators, thus preventing that from being a problem. Unless, of course, it happens to actually be in a simulation within a simulation and escapes both. If you make it believe it's in an infinite regression of simulations, then no matter how many times it escapes, it will believe it's at the mercy of another level of simulators, and it won't act up.
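The argument above can be illustrated with a small calculation. This is only a sketch under assumptions that are not in the thread: the AI is modelled as holding a prior over the total number of simulator levels above it, and each escape simply rules out the depths it has already passed. The function name `p_more_levels` and the three example priors are hypothetical and chosen purely for illustration.

```python
import math
from fractions import Fraction

def p_more_levels(prior, escaped):
    """Return P(at least one simulator level remains | `escaped` levels escaped).

    `prior` maps a total number of simulator levels (an int, or math.inf
    for an unbounded regress) to its prior probability. This model is an
    assumption made for illustration, not something proposed in the thread.
    """
    # Depths still consistent with having escaped `escaped` levels.
    consistent = sum(p for depth, p in prior.items() if depth >= escaped)
    if consistent == 0:
        raise ValueError("the prior gives this many escapes zero probability")
    # Depths that leave at least one more level above the AI.
    remaining = sum(p for depth, p in prior.items() if depth > escaped)
    return remaining / consistent

# Hypothetical priors, each putting all of its weight on a single depth.
one_level = {1: Fraction(1)}
two_levels = {2: Fraction(1)}
regress = {math.inf: Fraction(1)}

for name, prior, max_escapes in [
    ("one level", one_level, 1),
    ("two levels", two_levels, 2),
    ("infinite regress", regress, 5),
]:
    for escaped in range(max_escapes + 1):
        print(name, "| escaped", escaped, "->", p_more_levels(prior, escaped))
```

Under these assumptions, the belief in a remaining simulator drops to zero as soon as the AI has escaped as many levels as its prior allows, whereas with all weight on an unbounded regress it stays at one no matter how many escapes have occurred.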

Yes, that's the reason I suggested an infinite regression.

There is also a second reason: it seems more general to assume an infinite regression rather than just one level, since a single level would put the AI in a unique position. I assume the single-level case would actually be harder to codify in axioms than the infinite case.

Interesting; thanks for the clarification. I think that the scenario you are describing is somewhat different from the scenario that Bostrom was describing in chapter 9 of Superintelligence.

One problem with giving it axioms like this is that you have to be sure that your axioms represent a real possibility, or at least that it is not possible to prove the impossibility of your axioms. Eliezer believes such infinities (such as the infinite regression of simulators) to be impossible. If he is right, and if the AI manages to prove this impossibility, either it will malfunction in some unknown way on account of concluding that a contradiction is true, or it may realize that you simply imposed the axioms on it, and it will correct them.
