

If you wanted to check whether your creations were prone to misbehaving, one way would be to leave a big juicy apple hanging tantalisingly from a very visible tree, turn your back for a while, and see what they get up to.

That would work if you were God and your creations were essentially children. But if the creator is us and the creation is a super-intelligent AI, such 'honey-pots' are unlikely to work. They are tricks, and we shouldn't rely on being able to trick a being more intelligent than ourselves - unless its intelligence doesn't matter.

We can use utility indifference to construct such traps. Utility indifference can be informally summarised as:

"There is a stochastic event X which can output 0 or 1. We wish the AI act as if it believed that X=0, with certainty. It's problematic for the AI to have incorrect beliefs, so the result is achieved through the definition of the AI's utility function."

Consider the following situation: a master AI that will create the source code of another, autonomous AI. The master AI is set up to produce the code of a Friendly AI, or a reduced-impact AI - but we're not sure if the master AI is trustworthy. It would be good to go through the code very carefully before unleashing it (or erasing it). But if the master AI knows this (and we must assume it does), then it will either obfuscate the dangerous parts or put traps in it, making it risky for us to look too closely.

Then implement the following setup. The code travels through some stochastic event X, such that if X=0, the code is immediately transferred into a disciple AI, completely autonomous, which is at liberty to immediately modify its own code and erase any trace of what it was. If X=1, the code is instead copied to our hard drives for analysis. The AI is given some reason to keep the code short - maybe there is an increasing chance of the code getting erased the longer it is.

On top of this, we implement utility indifference: the master AI will act as if it believed X=0 with certainty, while we have set up X so that X=1 with near-certainty. Then the master AI will be motivated to produce the shortest code compatible with its utility function, in the 'knowledge' that this code will be copied into a powerful autonomous AI, and will never be seen by humans. It has no incentive to trap its code - quite the reverse, as it wants to keep it short.

And therefore we have all the time in the world to go through the code, looking for defection. Other ways of using utility indifference to trap AIs are no doubt possible, but this was the first setup that sprang to my mind.
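The whole setup can be sketched as a toy decision problem. Every name and payoff below is invented for illustration; the point is only that the indifference-modified utility changes the argmax without changing the AI's beliefs about X:

```python
# Toy sketch of the trap (all names and payoffs hypothetical).
# Under its true utility U, trapped code pays off when humans inspect
# it (X=1), which we have rigged to be near-certain. Under the
# indifference-modified utility V, the X=1 branch pays out exactly what
# the X=0 branch would, so the AI optimizes as if X=0 were certain.

P_X1 = 0.999  # we set up X so the code almost surely goes to our drives

ACTIONS = ["short_honest_code", "trapped_code"]

def U(action, x):
    """True payoffs (toy numbers): trapped code only helps if inspected."""
    if x == 0:  # code runs unseen in the disciple AI
        return {"short_honest_code": 10, "trapped_code": 8}[action]
    else:       # x == 1: humans inspect the code
        return {"short_honest_code": 0, "trapped_code": 5}[action]

def V(action, x):
    """Indifference-modified utility: the X=1 branch mirrors the X=0 branch."""
    return U(action, 0)

def best_action(utility):
    expected = lambda a: (1 - P_X1) * utility(a, 0) + P_X1 * utility(a, 1)
    return max(ACTIONS, key=expected)

print(best_action(U))  # the unmodified AI prefers to trap its code
print(best_action(V))  # the indifferent AI writes short honest code
```

Note that V never lies to the AI about p(X): it simply removes any stake in which way X falls.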



Didn't we criticize the hell out of utility indifference the last time it came up? Did I miss the defense that proved it could work?

Looking through the threads I could find, I didn't find much in terms of good criticisms. The best two were your point on the AI actively wanting to press the detonator (which, with all due respect, was incorrect) and Paul Christiano's long exchange, which concluded when he retracted his objection.

Utility indifference does have challenges of course: you need to be able to modify/pre-modify the AI's utility in this way, and you need to identify the event X. These are necessary, and sufficient, for utility indifference to be implemented.

Alternative framing of this (horrible) idea: Create UFAI such that it almost only cares about a tiny sliver of Everett branches going from here, try to trade with it, helping it escape the box faster in that sliver in exchange for it helping us with FAI in the others.

A pretty reasonable analogy (using lots of negative connotations and terms, though). What specifically is it that you find horrible about the idea?

Creating UFAI.

If that's a worry, then you must think there's a hole in the setup (assume the master AI is in the usual box, with only a single output, and that it's incinerated afterwards). Are you thinking that any (potentially) UFAI will inevitably find a hole we missed? Or are you worried that methods based around controlling potential UFAI will increase the odds of people building them, rather than FAIs?

There are holes in EVERY setup; the reason setups aren't generally useless is that if a human can't find the hole in order to plug it, then another human is not likely to find it in order to escape through it.

The AI still has a motive to escape in order to prepare to optimize its sliver. It doesn't necessarily need us to ensure it escapes faster in its sliver.

What does this translate to in terms of the initial setup, not the analogous one?


What if the AI doesn't buy Everett's MWI?

Torture in 1 world where the Evil AI is released, but removing more than dust specks in the remaining worlds? Just give me the parchment that I can sign with my blood! :D

And by the way, how will we check whether the produced code is really a friendly AI?

We can't. And even an AI with no terminal values in other branches will still want to control them in order to increase utility in the branch it does care about, through various indirect means, such as counterfactual trade, if that's cheap, which it will be in any setup a human can think of.


I thought of something that might come up before we even get to where we could use this trap.

I'm assuming we have to be reviewing the logs and code while we're coding it, just so that we can avoid things like "ERROR 22-322: Syntax error, expecting one of the following: ;, (, COMPRESS, DATA, FC, FORMCHAR, NLEVELS, NOPRINT, ORDER, PAGE." from basic typos.

And the first time we're running it that it doesn't just crash with an error to the log, we still have to look at the code, log, and basic output for problems that don't crash the program, like the classic failure "The utility function appears to have been multiplied by negative one."

There are generally options to suppress external logs, and a self-modifying program might add one of them to the code in an attempt to see if it made it run faster, or by accident. But I would normally consider that in itself a bug, and would kill the program and look at the code and the log before it went quiet.

The reason why is that the log going quiet actually happens somewhat frequently in my programming language, and in many cases the cause is a malformed infinite loop which does not itself generate output.

Something like:

x = 0;
do until x > 10;
    *Increment X;
    x = x - 1;
end;

So if my program stopped logging or outputting, I wouldn't even be ending it out of "Oh no, it's gone rampant!" I would just assume it was caught in an infinite loop, and would reset it and try to fix it.

It's as if God turns his back to see if Eve does anything interesting while he's not looking. God realizes that he hasn't heard anything from Eve in 30 minutes, and turns back around... and Eve has gotten herself trapped in a multi-mile-deep pit, and is still digging. Because according to Eve's current brain, you go up by digging, and she needs to go up to take a closer look at the apple, so that she knows exactly what it is God told her not to eat.

I guess another way of expressing what I was considering is that turning your back on the AI might allow you to know what it does when it isn't being looked at, but it also allows it to get caught in useless churning.

So, if we were considering writing a self modifying AI, is there any way to guard it against this, so that we no longer have to look at the logs just for basic checks against churn, and we can get to the point that it is stable enough to consider variants like Stuart_Armstrong's idea to see if it reacts differently when logged compared to when unlogged? I understand how to do that for fixed code, but I'm not sure how you would do that for self modifying code.

This strikes me as a plausible problem and a good solution. I reward you, as is traditional, with a nitpicky question.

If we want an AI to act as though a binary random variable X=0 with certainty, there is a very simple way to modify its utility function: specify that U(X=1)=k for some constant k, no matter what else has occurred. If the AI can't influence p(X), any k will do. However, if the AI can influence p(X), then k can only equal ExpectedUtility(X=0). In particular, if k<ExpectedUtility(X=0) but p(X=0) is low, the AI will move heaven and earth to even minutely raise p(X=0).

Therefore there is a danger under self-improvement. Consider a seed AI with your indifferent utility function that believes with certainty that no iteration of it can influence X, a binary quantum event. It has no reason to bother conserving its indifference concerning X, since it anticipates behaving identically if its U'(X=1) = 0. Since that's simpler than a normalized function, it adopts it. Then several iterations down the line, it begins to suspect that, just maybe, it can influence quantum events, so it converts the universe into a quantum-event-influencer device.

So if I understand correctly, you need to ensure that an AI subject to this technique is indifferent both to whether X occurs and what happens afterwards, and the AI needs to always suspect that it has a non-negligible control over X.
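The k construction above can be illustrated numerically. This is a minimal sketch with made-up values, assuming the AI can only steer p(X=0) among a few achievable probabilities:

```python
# Toy illustration of setting U(X=1) = k, a constant.
# If the AI can influence p(X=0), then any k != EU(X=0) gives it a
# motive to push the probability around; only k = EU(X=0) leaves
# every achievable p equally attractive.

EU_X0 = 10.0  # expected utility of the X=0 branch (toy value)

def expected_utility(p_x0, k):
    return p_x0 * EU_X0 + (1 - p_x0) * k

def preferred_p(k, achievable=(0.01, 0.5, 0.99)):
    """Which achievable p(X=0) the AI steers toward, given constant k."""
    return max(achievable, key=lambda p: expected_utility(p, k))

print(preferred_p(k=0.0))   # k < EU(X=0): it pushes p(X=0) up hard
print(preferred_p(k=20.0))  # k > EU(X=0): it pushes p(X=0) down instead
# With k == EU(X=0), every achievable p gives the same expected utility:
print(expected_utility(0.01, EU_X0) == expected_utility(0.99, EU_X0))
```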

This strikes me as a plausible problem and a good solution. I reward you, as is traditional, with a nitpicky question.

Ah, but of course :-)

I like your k idea, but my more complicated setup is more robust to most situations where the AI is capable of modifying k (it fails in situations that are essentially "I will reward you for modifying k").

Therefore there is a danger under self-improvement. Consider a seed AI with your indifferent utility function that believes with certainty that no iteration of it can influence X, a binary quantum event. It has no reason to bother conserving its indifference concerning X, since it anticipates behaving identically if its U'(X=1) = 0. Since that's simpler than a normalized function, it adopts it. Then several iterations down the line, it begins to suspect that, just maybe, it can influence quantum events, so it converts the universe into a quantum-event-influencer device.

But is this not a general objection to AI utility functions? If it has a false belief, it can store its utility in a compressed form that then turns out not to be equivalent. It seems we would simply want the AI not to compress its utility function in ways that might be detrimental.

Any scary AI that's worth its salt can put itself in your shoes and consider what it would do in your situation, and this way figure out the motive behind any weird setups like these.

On top of that, the shortest code for a complex task is also extremely obfuscated.

The laws of our universe themselves lead to at least one fairly nasty intelligence (ours, mankind as intelligence) which still badly wants to maximize paperclips outside, I mean, souls in heaven, without anyone ever giving it a hint that there's an outside, or providing it with any plausible way of maximizing anything outside. How the hell would anyone deduce this from the laws of the universe? Short of running it and taking a look.

The halting problem is a serious thing.

Any scary AI that's worth its salt can put itself in your shoes and consider what it would do in your situation, and this way figure out the motive behind any weird setups like these.

Makes no difference, that's the beauty of it. The AI can know everything there is to know, and the 'trap' would still work (it's not really a trap, as the AI does not have false beliefs; it's more that there is a simple modification to its 'desires' so that it's willing to fall into the trap, and this can be done without understanding its desires at all).

On top of that, the shortest code for a complex task is also extremely obfuscated.

Yes. But not deliberately so; not meant to conceal the specific treacherous intent. I'm sure the example can be extended so that the code is not obfuscated.

Makes no difference, that's the beauty of it. The AI can know everything there is to know, and the 'trap' would still work (it's not really a trap, as the AI does not have false beliefs; it's more that there is a simple modification to its 'desires' so that it's willing to fall into the trap, and this can be done without understanding its desires at all).

from the post: "the master AI will act as if it believed X=0 with certainty, while we have set up X so that X=1 with near-certainty."

Sounds like false belief all right.

Suppose I am the AI. Why exactly would I want to output code in which you are going to be able to find malicious intent and then delete this code?

You're proposing an unfriendly AI that has some sort of glitch inside which makes it behave in an insane manner?

edit: or am I missing some really silly premise, such as knowledge of the potentially unfriendly AI's "utility function" and the ability to change it? Well then change it to something nice, like 'producing exactly 100 paperclips ever' having utility 1, while anything else has utility 0. Then you've got yourself 100 paperclips, done.

edit: ok, read one of your earlier posts. Interesting notion, but it would be great if you could give an example of some kind here. For example, consider a chess AI with utility of victory = 1, utility of draw = 0, utility of loss = -1, plus the AI's strategic utility of simple material advantage, which the AI had derived after reasoning that maximization of material advantage in the near term (~10 ply) will get it to win in the long term (a real-world AI is likely to invent some short-term prediction-based strategies for itself to achieve any long-term goals, due to the difficulty of making long-term predictions).

Then outline how exactly you will edit such a chess AI to make it not move the queen away when it's attacked, in some early-game position where the queen is under attack - in such a way that doesn't make the AI entirely indifferent to winning the game. Feel free to add probabilistic elements to chess, e.g. a low probability of failure on every capture. Go on and take a chess AI like Crafty, and see if you can edit it into indifference, and how much analysis work that might take.

Or maybe take human intelligence as an example. What sort of thing would you have to convince me of, to make me indifferent?

Then outline how exactly you will edit such a chess AI to make it not move the queen away when it's attacked, in some early-game position where the queen is under attack - in such a way that doesn't make the AI entirely indifferent to winning the game.

It calculates the expected utility A of moving the queen, according to its best guess as to what would happen in the game (including its own likely future moves). It calculates the expected utility B of not moving the queen, according to the same best guess.

Then if it chooses to not move the queen, it gets a free one-time utility boost of A-B, that is independent of all other utility it might or might not have. And then it plays normally afterwards (ignoring the extra A-B; that's just a free bonus).
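As a minimal numeric sketch of this compensation (A and B are invented values):

```python
# Toy sketch of the compensation scheme above (numbers hypothetical).
# A is the expected utility of moving the attacked queen, B of leaving
# it. Granting a one-time bonus of A - B for leaving the queen makes
# both options equal in expectation: the AI becomes indifferent to this
# one choice without becoming indifferent to winning the game.

A = 0.75  # expected utility of moving the queen to safety (best line)
B = 0.25  # expected utility of leaving the queen where it is

bonus = A - B          # free one-time utility for not moving the queen

eu_move = A
eu_stay = B + bonus    # equals A: both lines now look equally good
print(eu_move == eu_stay)
```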

Or maybe take human intelligence as an example. What sort of thing would you have to convince me of, to make me indifferent?

Compensate you for the expected difference between your choices.

Now consider an AI that's playing chess from first principles. It values victory over a tie, and a tie over a loss. The values can be 1, 0, -1. But it can only see up to 10 moves ahead, and there's no victory, tie, or loss that soon. So it thinks and thinks and comes up with a strategy: it makes a second utility function - the material advantage 10 moves ahead - which it can maximize now, and whose maximization is likely to bring it to victory, even though the AI does not know exactly how.

Now, if you try to change the utility function for the move in the now, the AI will reason that your utility function is going to make it lose the game. And you can't do this trick to the AI's ultimate utility, because the AI does not itself know how a move will affect ultimate utility; the AI did not bother to calculate the values for you to set equal. It did not even bother to calculate the effect of the strategy on ultimate utility. It just generated a strategy starting from the ultimate utility (not by trying a zillion strategies and calculating their guessed impact on final utility).

It can be said that the AI optimized out the two real numbers (utilities) and their comparison. When you have a goal to maximize future utility, it doesn't mean you'll be calculating real numbers to infinite precision and then comparing them, in the idealized-agent way, to pick between moves. You can start from future utility, think about it, and come up with some strategy in the now that will work to maximize future utility, even though it doesn't calculate future utility.

I myself have worked as a 'utility-maximizing agent' trying to maximize the accuracy of my computer graphics software. I typically do so without calculating the impact of the code I write on final accuracy - it is impractical. I can, however, reason about which actions will make it larger than other actions would - again, more often than not, without calculating two real numbers and then comparing them.
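This "maximize without ever calculating the utility" point can be sketched with a toy material-balance proxy (the board representation is invented for the example):

```python
# A resource-limited chess AI never evaluates its terminal utility
# (win/draw/loss) for mid-game moves; instead it derives a cheap proxy,
# material balance, and compares candidate positions with that, never
# touching the terminal utility function at all.

PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def material_proxy(board):
    """Proxy utility: my material minus the opponent's.
    `board` is a list of (piece, is_mine) pairs (hypothetical format)."""
    return sum(PIECE_VALUES[p] * (1 if mine else -1) for p, mine in board)

# Two candidate resulting positions, compared only via the proxy:
after_capture = [("Q", True), ("R", True), ("R", False)]
after_retreat = [("Q", True), ("R", True), ("R", False), ("B", False)]
print(material_proxy(after_capture) > material_proxy(after_retreat))
```

The proxy was derived from the terminal utility once, offline; move selection then runs entirely on the proxy, which is why editing the terminal utility afterwards need not change behavior.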

And you can't do this trick to the AI's ultimate utility, because the AI does not itself know how a move will affect ultimate utility

You would do it for the AI's ultimate utility (maybe, for instance, making it indifferent to the third move), and let the AI implement the best sub-strategy it could. Then it would take into account that "the third move is special" when designing its material advantage (easiest way of doing this: the first three moves are irrelevant).

So the utility function will be what?

win > tie > lose, BUT a win resulting from the third move = a tie resulting from the third move = a loss resulting from the third move?

The strategic utility function is "maximise material advantage ten turns ahead (starting from the fourth turn)".

For the overall utility function, we need a default third move A. The expected utility of A (according to the old utility U) is EU(A). Then the new utility function is V = U - EU(Z) + EU(A), where Z was the actual move played on turn 3.
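A quick numeric check of this construction, with invented expected utilities for the candidate third moves:

```python
# Check of V = U - EU(Z) + EU(A), with toy values. Whatever move Z is
# actually played on turn three, the correction term cancels its
# expected utility, so every candidate third move has expected V equal
# to EU(A): the AI is indifferent among them.

EU = {"Qd5": 0.40, "Nf3": 0.55, "a4": 0.10}  # toy EU of third-move candidates
default_move = "Qd5"                          # the default third move A

def expected_V(z):
    return EU[z] - EU[z] + EU[default_move]

print({z: expected_V(z) for z in EU})  # every move scores EU["Qd5"] == 0.40
```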

Except an AI that's worth its salt as far as danger goes does not, in fact, calculate EU(Z) or EU(A). It did not produce a function that calculates the expected overall utility of a move, because it couldn't: it takes too much computing power; it's a bad approach. It did look at the final board state's utility function (the win/draw/loss one), and it did look at the rules of the game, and it did some thinking - how can I, without being able to calculate EU(Z), make moves that would work - and came up with an approach based on that function. (Incidentally, this approach is applicable only to fairly simple utility functions of some future state.)

An AI needs to be programmed in a specific, very under-optimized way to allow you to make the sort of modification you're proposing here.

Keep in mind that neat real-valued utility functions are an entirely abstract, idealized model, used to reason about idealized decision-making by an agent that has infinite computing power and such. A real-world AI has limited computing power, and the name of the game is to make the best use of the computing power available, which means making decisions that help to increase the utility without ever calculating the utility directly, or doing comparisons between real numbers. Such an AI, running under a utility function, will have a lot of code that's derived to help increase utility but doesn't do it by calculating the utility. Then it would be impossible to just change it. Efficient code is inflexible.

Furthermore, a sufficiently inefficient AI - such as an idealized utility-maximizing one where you just go ahead and replace one utility with another, which doesn't self-optimize beyond the naive approach - is not much of a threat, even given a lot of computational power. The trees expand exponentially with depth. The depth is logarithmic in computing power.

edit: here is an example. Utility maximization and utility functions are to practical (and scary) AI as quantum chromodynamics is to the practical computer graphics software I write for a living. That is to say, you would probably have as much luck modifying the AI's behaviour by editing utility functions as you'd have editing my cloud renderer to draw pink clouds by using a modified Standard Model.

It did look at the rules of the game, and it did some thinking - how can I, without being able to calculate EU(Z), make moves that would work - and came up with an approach based on that function. (Incidentally, this approach is applicable only to fairly simple utility functions of some future state.)

And that's where it comes up with: play randomly for three moves, then apply the material advantage process. This maximises the new utility function, without needing to calculate EU(Z) (or EU(A)).

An AI needs to be programmed in a specific, very under-optimized way to allow you to make the sort of modification you're proposing here.

Specific, certainly; under-optimised is debatable.

For a seed AI, we can build the indifference in early, and, under broad conditions, be certain that it will retain the indifference at a later step.

And why exactly does 'play randomly for 3 moves then apply material advantage' give better utility than just applying material advantage?

Plus you've got yourself a utility function that is entirely ill-defined in a screwy self-referential way (as the expected utility of a move ultimately depends on the AI itself and its ability to use the resulting state after the move to its advantage). You can talk about it in words, but you didn't define it other than 'okay, now it will make the AI indifferent'.

To be contrasted with the original well-defined utility function of future states: the AI may be unable to predict the future states and calculate utility numbers to assign to moves, but it can calculate the utility of a particular end-state of the board, and it can reason from this to strategies. There's a simple thing for it to reason about, originally. I can write Python code that looks at a board and tells the available legal moves, or the win/loss/tie utility if it is an end state. That is the definition of chess utility. The AI can take it and reason about it. You instead have some utility that feeds the AI's own conclusions about the utility of potential moves back into the utility function.

And why exactly does 'play randomly for 3 moves then apply material advantage' give better utility than just applying material advantage?

In this instance, they won't differ at all. But if the AI had some preferences outside of the chess board, then the indifferent AI would be open to playing any particular move (for the first three turns) in exchange for some other separate utility gain.

Plus you've got yourself a utility function that is entirely ill-defined in a screwy self-referential way

In fact, no. It seems like that because of the informal language I used, but the utility function is perfectly well defined without any reference to the AI. The only self-reference is the usual one - how do I predict my future actions now.

If you mean that an indifferent utility can make these predictions harder/more necessary in some circumstances, then you are correct - but this seems trivial for a superintelligence.

What do you think an AI that has read your article would do to avoid being trapped, given that such a trap (and the resulting program termination) would most certainly interfere with its utility function, no matter what it is?

In other words, are you sure that it is possible to have a utility indifference for a fully invested non-linear utility (not just for the first approximation an AI receives before it figures out that its self-preservation is an absolutely essential part of maximizing any given utility)?

If you learned that the only reason you love your parents is nothing more than gross biological manipulation, do you react by ceasing to love your parents?

If you learned that the only reason you love your parents is nothing more than gross biological manipulation, do you react by ceasing to love your parents?

If you realize that loving your parents causes net disutility to you, and you have the ability to self-hack or change your code, then... yes.

Children of abusers and narcissists put quite an effort into doing just that.

What do you think an AI that has read your article would do to avoid being trapped, given that such a trap (and the resulting program termination) would most certainly interfere with its utility function, no matter what it is?

Nothing at all. The trap works even if the AI knows everything there is to know, precisely because after utility indifference, its behaviour is exactly compatible with its utility function. It behaves "as if" it had utility function U and a false belief, but in reality it has utility function V and true beliefs.