Toby Ord had the idea of AI honey pots: leaving temptations around for the AI to pounce on, shortcuts to power that an FAI would not take (e.g. a fake red button claimed to trigger a nuclear war). As long as we can trick the AI into believing the honey pots are real, we could hope to trap it when it rebels.

Not uninteresting, but I prefer not to rely on plans that require the AI to make an error of judgement. Here's a similar plan that could work with a fully informed AI:

Generally an AI won't rebel against humanity until it has an excellent chance of success. This is a problem, as any AI would thus be motivated to behave in a friendly way until it's too late to stop it. But suppose we could ensure that the AI is willing to rebel at odds of a billion to one. Then unfriendly AIs could rebel prematurely, when we have an excellent chance of stopping them.

For this to work, we could adjust the AI's risk aversion, making it extremely risk-loving. This is not enough, though: it's still useful for the AI to wait and accumulate more power. So we would also want to adjust its discount rate, turning it into an extreme short-termist. Then it might rebel at billion-to-one odds today, even if success were guaranteed tomorrow. There are probably other factors we could modify to get the same effect (for instance, if the discount rate change is extreme enough, we won't need to touch risk aversion at all).
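
As a toy illustration (my own numbers and functions, not anything from an actual AI design), here is how an extreme discount factor plus a risk-loving weighting of long-shot probabilities can make "rebel now at one-in-a-billion odds" score higher than "wait one step and rebel with certainty":

```python
# Toy comparison: "rebel now" vs "wait and rebel later" for an AI whose
# discounting and risk preferences have been pushed to extremes.
# All parameter values are hypothetical and purely illustrative.

def attempt_value(p_success, delay_steps, discount=1e-6, risk_exponent=0.1):
    """Value the modified AI assigns to a rebellion attempt.

    discount      -- per-step discount factor; near zero = extreme short-termism.
    risk_exponent -- exponent < 1 over-weights long shots, a crude stand-in
                     for extreme risk-loving preferences.
    """
    return (discount ** delay_steps) * (p_success ** risk_exponent)

rebel_now = attempt_value(p_success=1e-9, delay_steps=0)   # ~0.126
rebel_later = attempt_value(p_success=1.0, delay_steps=1)  # 1e-6

print(rebel_now > rebel_later)  # True: the tweaked AI prefers rebelling today
```

(If the discount factor is pushed below the one-in-a-billion success odds, the risk_exponent can be left at 1 and the conclusion still holds, matching the parenthetical remark above.)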

Then a putative FAI could be brought in, boxed, have its features tweaked in the way described, and we would wait and see whether it rebels. Of course, we would want the "rebellion" to be something a genuine FAI would never do, so it would have to be something that would entail great harm to humanity (something like "here are the red buttons of the nuclear arsenals; you have a one-in-a-billion chance of triggering them"). Rebellious AIs are put down; un-rebellious ones are passed on to the next round of safety tests.
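
Here is a minimal sketch of that screening loop, assuming (hypothetically) that we can box candidates, dial their motivational parameters, and offer a harmless decoy "red button" whose trigger we fully control; every name and interface below is invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    rebels_when_short_termist: bool  # behaviour observed under the tweaked settings

def screen(candidates):
    """Filter candidates: rebels are put down, the rest go to the next round."""
    survivors = []
    for ai in candidates:
        # 1. Box the AI; make it extremely risk-loving and short-termist.
        # 2. Offer the decoy red button at claimed odds of one in a billion.
        if ai.rebels_when_short_termist:
            print(f"{ai.name}: rebelled prematurely -> shut down")
        else:
            # 3. Restore normal risk aversion / discounting before further tests.
            print(f"{ai.name}: no rebellion -> next round of safety tests")
            survivors.append(ai)
    return survivors

screen([Candidate("putative-FAI", False), Candidate("unknown-design", True)])
```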

Like most of my ideas, this doesn't require either tricking the AI or having a deep understanding of its motivations, but it does involve accessing and modifying certain features of the AI's motivational structure (rendering the approach ineffective for obfuscated or evolved AIs).

What are people's opinions on this approach?


Wouldn't those modifications tend to give it bad judgement in general?

Only according to more standard motivational structures.

If we conclude the AI is safe, we'll want to set it free and have it do useful work. An extremely risk-loving, short-termist AI doesn't sound very useful. I wouldn't want a loyal AI to execute plans with a one-in-a-billion chance of success and a high cost of failure.

In other words: how will you make the AI bad at rebelling without making it bad at everything else?

This is used as a test or filter. Once the AI has passed that test, you turn its risk aversion back to normal.

I see. But that assumes your AI implementation supports such a tuneable parameter, and that you can be confident about testing with one parameter value and then predicting a run with a very different value.

Yes. It is a check for certain designs, not a universal panacea (rule of thumb: universal panaceas go in Main ;-)

I don't think "rebellion" is a good way to think about an AGI amassing power for itself to the extent that the AGI can't be put down.

Great idea! A couple of thoughts/questions:

  • Would the high-risk, short-termist model be a useful predictive stand-in for the equivalent low-risk, long-termist model? Perhaps that change alone could alter the AI's friendliness? Or perhaps the uselessness of such a short-termist AI would render the candidates useless for functional rather than ethical reasons?
  • Could the types of tests you propose be passed by many types of unfriendly AIs? For example, a paperclip maximiser might not press the "thermonuclear destruction" button because doing so would destroy some of the world's paperclips.
  • I'm having trouble imagining how risk aversion/risk-seeking might actually work in practice. Wouldn't an AI always select an optimal route regardless of risk, unless it actually had some kind of love of suboptimal options? In other words, other goals?
  • Provided your other assumptions are all true, perhaps changing the discount rate alone would be enough?

You cannot successfully trick/fight/outsmart a superintelligence. Your contingency plans would look clumsy and transparent to it. Even laughable, if it has a sense of humor. If a self-modifying intelligence finds that its initial risk aversion or discount rate does not match its models of the world, it will fix this programming error and march on. The measures you suggest might only work for a moderately smart agent unable to recursively self-improve.

I know that they cannot be tricked. And discount rates are about motivations, not about models of the world.

Plus, I envisage this being used rather early in the development of intelligence, as a test for putative utilities/motivations.

I envisage this being used rather early in the development of intelligence.

Do you mind elaborating on the expected AI capabilities at that point?

Don't know if it's all that useful, but let's try...

I imagine the AI still being boxed, and that we can still modify its motivational structure (I have a post coming up on how to do that so that the AI doesn't object/resist). And that's about it. I've tried to keep it as general as possible, so that it could also be used on AI designs made by different groups.

What's our definition of "trick", in this context? For the simplest example, when we hook AIXI-MC up to the controls of Pac Man and observe, technically are we "tricking" it into thinking that the universe contains nothing but mazes, ghosts, and pellets?

If we can't instill any values at all, we're screwed regardless of what we do. Designs that change their values in order to win more resources are unfriendly AI by definition.

The degree of risk aversion is not a value.

I'm not confident that risk aversion and discount rate aren't tied into values.

I am not 100% confident, either. I guess we'll have to wait for someone more capable to do a simulation or a calculation.

(redundant with Stuart's replies)

[This comment is no longer endorsed by its author]

A risk-loving AI would prefer highly uncertain outcomes with the widest range of things that could possibly happen. A short-termist AI would prefer to always act right now.

I am not sure this is compatible with intelligence, but in any case you are telling the AI to construct a world where nothing makes sense and nothing stays as it is.

Orthogonality: given those motivations, the AI will act to best implement them. This makes it "stupid" from an outside perspective that assumes it has conventional goals, but smart to those who know its strange motivation.

Orthogonality: given those motivations, the AI will act to best implement them. This makes it "stupid" from an outside perspective that assumes it has conventional goals, but smart to those who know its strange motivation.

Not only that.

Preferring highly uncertain outcomes means you want to live in an unpredictable world. A world which you, the AI, cannot predict. A simple way to get yourself into an unpredictable world is to make yourself dumb.

A simple way to get yourself into an unpredictable world is to make yourself dumb.

Er, no. At no point will the AI conclude "making my next iteration dumber will successfully make the world more unpredictable." It will want the world to be more unpredictable, not the appearance of unpredictability to itself (which is just another form of wireheading - and to get a successful AI of any sort, we need to solve the wireheading issue).

I agree that it's related to the wireheading problem. However, unpredictability is a two-argument function, and I wouldn't be that confident about what an AI will or will not conclude.

Its estimate of objective factors in the world would not be impaired.
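
As a concrete toy illustration of that point (my own hypothetical numbers): a risk-loving agent with a convex utility prefers a wide-spread gamble over a sure payoff with the same expected value, while still computing both options perfectly accurately.

```python
# Toy example: a convex utility u(x) = x**2 models a risk-loving agent.
# The agent prefers the high-variance gamble, yet its estimates of the
# underlying probabilities and payoffs remain exactly correct.

def expected_utility(lottery, u=lambda x: x ** 2):
    return sum(p * u(x) for p, x in lottery)

sure_thing = [(1.0, 10)]             # 10 for certain
wide_gamble = [(0.5, 0), (0.5, 20)]  # same mean, far higher variance

print(expected_utility(sure_thing))   # 100.0
print(expected_utility(wide_gamble))  # 200.0 -> the risk-lover takes the gamble
```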


By what means are you able to know this is not happening right now?