Toby Ord had the idea of AI honey pots: leaving temptations around for the AI to pounce on, shortcuts to power that an FAI would not take (e.g. a fake red button claimed to trigger a nuclear war). As long as we can trick the AI into believing the honey pots are real, we could hope to trap it when it rebels.

Not uninteresting, but I prefer not to rely on plans that require the AI to make an error of judgement. Here's a similar plan that could work with a fully informed AI:

Generally an AI won't rebel against humanity until it has an excellent chance of success. This is a problem, as any AI would thus be motivated to behave in a friendly way until it's too late to stop it. But suppose we could ensure that the AI is willing to rebel at odds of a billion to one. Then unfriendly AIs could rebel prematurely, when we have an excellent chance of stopping them.

For this to work, we could adjust the AI's risk aversion, making it extremely risk-loving. This is not enough, though: it's still useful for the AI to wait and accumulate more power. So we would also want to adjust its discount rate, turning it into an extreme short-termist. Then it might rebel at billion-to-one odds today, even if success were guaranteed tomorrow. There are probably other factors we could modify to get the same effect (for instance, if the discount rate change is extreme enough, we won't need to touch risk aversion at all).
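
As a toy illustration (my own numbers and functions, not anything from an actual AI design), here is how an extreme discount factor plus a risk-loving weighting of long-shot probabilities can make "rebel now at one-in-a-billion odds" score higher than "wait one step and rebel with certainty":

```python
# Toy comparison: "rebel now" vs "wait and rebel later" for an AI whose
# discounting and risk preferences have been pushed to extremes.
# All parameter values are hypothetical and purely illustrative.

def attempt_value(p_success, delay_steps, discount=1e-6, risk_exponent=0.1):
    """Value the modified AI assigns to a rebellion attempt.

    discount      -- per-step discount factor; near zero = extreme short-termism.
    risk_exponent -- exponent < 1 over-weights long shots, a crude stand-in
                     for extreme risk-loving preferences.
    """
    return (discount ** delay_steps) * (p_success ** risk_exponent)

rebel_now = attempt_value(p_success=1e-9, delay_steps=0)   # ~0.126
rebel_later = attempt_value(p_success=1.0, delay_steps=1)  # 1e-6

print(rebel_now > rebel_later)  # True: the tweaked AI prefers rebelling today
```

(If the discount factor is pushed below the one-in-a-billion success odds, the risk_exponent can be left at 1 and the conclusion still holds, matching the parenthetical remark above.)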

Then a putative FAI could be brought in, boxed, have its features tweaked in the way described, and we would wait and see whether it rebels. Of course, we would want the "rebellion" to be something a genuine FAI would never do, so it would have to be something that would entail great harm to humanity (something like "here are the red buttons of the nuclear arsenals; you have a one-in-a-billion chance of triggering them"). Rebellious AIs are put down; un-rebellious ones are passed on to the next round of safety tests.
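
Here is a minimal sketch of that screening loop, assuming (hypothetically) that we can box candidates, dial their motivational parameters, and offer a harmless decoy "red button" whose trigger we fully control; every name and interface below is invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    rebels_when_short_termist: bool  # behaviour observed under the tweaked settings

def screen(candidates):
    """Filter candidates: rebels are put down, the rest go to the next round."""
    survivors = []
    for ai in candidates:
        # 1. Box the AI; make it extremely risk-loving and short-termist.
        # 2. Offer the decoy red button at claimed odds of one in a billion.
        if ai.rebels_when_short_termist:
            print(f"{ai.name}: rebelled prematurely -> shut down")
        else:
            # 3. Restore normal risk aversion / discounting before further tests.
            print(f"{ai.name}: no rebellion -> next round of safety tests")
            survivors.append(ai)
    return survivors

screen([Candidate("putative-FAI", False), Candidate("unknown-design", True)])
```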

Like most of my ideas, this doesn't require either tricking the AI or having a deep understanding of its motivations, but it does involve accessing and modifying certain features of the AI's motivational structure (rendering the approach ineffective for obfuscated or evolved AIs).

What are people's opinions on this approach?


Wouldn't those modifications tend to give it bad judgement in general?

Only according to more standard motivational structures.

If we conclude the AI is safe, we'll want to set it free and have it do useful work. An extremely risk-loving, short-termist AI doesn't sound very useful. I wouldn't want a loyal AI to execute plans with a one-in-a-billion chance of success and a high cost of failure.

In other words: how will you make the AI bad at rebelling without making it bad at everything else?

This is used as a test or filter. Once the AI has passed that test, you turn its risk aversion back to normal.

I see. But that assumes your AI implementation supports such a tuneable parameter, and that you can be confident about testing with one parameter value and then predicting a run with a very different value.

Yes. It is a check for certain designs, not a universal panacea (rule of thumb: universal panaceas go in Main ;-)

I don't think "rebellion" is a good way to think about an AGI amassing power for itself to the extent that the AGI can't be put down.

Great idea! A couple of thoughts/questions:

  • Would the high-risk, short-termist model be a useful predictive stand-in for the equivalent low-risk, long-termist model? Perhaps that change alone could alter the AI's friendliness? Or perhaps the uselessness of such a short-termist AI would render the candidates useless for functional rather than ethical reasons?
  • Could the types of tests you propose be passed by many types of unfriendly AIs? For example, a paperclip maximiser might not press the "thermonuclear destruction" button because doing so would destroy some of the world's paperclips.
  • I'm having trouble imagining how risk aversion/risk-seeking might actually work in practice. Wouldn't an AI always select an optimal route regardless of risk, unless it actually had some kind of love of suboptimal options? In other words, other goals?
  • Provided your other assumptions are all true, perhaps changing the discount rate alone would be enough?

You cannot successfully trick/fight/outsmart a superintelligence. Your contingency plans would look clumsy and transparent to it. Even laughable, if it has a sense of humor. If a self-modifying intelligence finds that its initial risk aversion or discount rate does not match its models of the world, it will fix this programming error and march on. The measures you suggest might only work for a moderately smart agent unable to recursively self-improve.

I know that they cannot be tricked. And discount rates are about motivations, not about models of the world.

Plus, I envisage this being used rather early in the development of intelligence, as a test for putative utilities/motivations.

I envisage this being used rather early in the development of intelligence.

Do you mind elaborating on the expected AI capabilities at that point?

Don't know if it's all that useful, but let's try...

I imagine the AI still being boxed, and that we can still modify its motivational structure (I have a post coming up on how to do that so that the AI doesn't object/resist). And that's about it. I've tried to keep it as general as possible, so that it could also be used on AI designs made by different groups.

What's our definition of "trick", in this context? For the simplest example, when we hook AIXI-MC up to the controls of Pac Man and observe, technically are we "tricking" it into thinking that the universe contains nothing but mazes, ghosts, and pellets?

If we can't instill any values at all, we're screwed regardless of what we do. Designs that change their values in order to win more resources are unfriendly AI by definition.

The degree of risk aversion is not a value.

I'm not confident that risk aversion and discount rate aren't tied into values.

I am not 100% confident, either. I guess we'll have to wait for someone more capable to do a simulation or a calculation.

(redundant with Stuart's replies)

[This comment is no longer endorsed by its author]

A risk-loving AI would prefer highly uncertain outcomes with the widest range of things that could possibly happen. A short-termist AI would prefer to always act right now.

I am not sure this is compatible with intelligence, but in any case you are telling the AI to construct a world where nothing makes sense and nothing stays as it is.

Orthogonality: given those motivations, the AI will act to best implement them. This makes it "stupid" from an outside perspective that assumes it has conventional goals, but smart to those who know its strange motivation.

Orthogonality: given those motivations, the AI will act to best implement them. This makes it "stupid" from an outside perspective that assumes it has conventional goals, but smart to those who know its strange motivation.

Not only that.

Preferring highly uncertain outcomes means you want to live in an unpredictable world. A world which you, the AI, cannot predict. A simple way to get yourself into an unpredictable world is to make yourself dumb.

A simple way to get yourself into an unpredictable world is to make yourself dumb.

Er, no. At no point will the AI conclude "making my next iteration dumber will successfully make the world more unpredictable." It will want the world to be more unpredictable, not the appearance of unpredictability to itself (which is just another form of wireheading - and to get a successful AI of any sort, we need to solve the wireheading issue).

I agree that it's related to the wireheading problem. However, unpredictability is a two-argument function, and I wouldn't be that confident about what an AI will or will not conclude.

Its estimate of objective factors in the world would not be impaired.
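
As a concrete toy illustration of that point (my own hypothetical numbers): a risk-loving agent with a convex utility prefers a wide-spread gamble over a sure payoff with the same expected value, while still computing both options perfectly accurately.

```python
# Toy example: a convex utility u(x) = x**2 models a risk-loving agent.
# The agent prefers the high-variance gamble, yet its estimates of the
# underlying probabilities and payoffs remain exactly correct.

def expected_utility(lottery, u=lambda x: x ** 2):
    return sum(p * u(x) for p, x in lottery)

sure_thing = [(1.0, 10)]             # 10 for certain
wide_gamble = [(0.5, 0), (0.5, 20)]  # same mean, far higher variance

print(expected_utility(sure_thing))   # 100.0
print(expected_utility(wide_gamble))  # 200.0 -> the risk-lover takes the gamble
```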


By what means are you able to know this is not happening right now?