The Scenario

A programmer has constructed an artificial intelligence based on an architecture similar to Marcus Hutter's AIXI model (see below for a few details). This AI will maximize the reward given by a utility function the programmer has given it. Just as a test, he connects it to a 3D printer and sets the utility function to give reward proportional to the number of manufactured paper-clips.

At first nothing seems to happen: the AI zooms through various possibilities. It notices that smarter systems generally can make more paper-clips, so making itself smarter will likely increase the number of paper-clips that will eventually be made. It does so. It considers how it can make paper-clips using the 3D printer, estimating the number of possible paper-clips. It notes that if it could get more raw materials it could make more paper-clips. It hence figures out a plan to manufacture devices that will make it much smarter, prevent interference with its plan, and will turn all of Earth (and later the universe) into paper-clips. It does so.

Only paper-clips remain.

Link: aleph.se/andart/archives/2011/02/why_we_should_fear_the_paperclipper.html


Agree with timtyler. AIXI doesn't care about paperclips, it cares about the input string of bits. Given enough smarts, it can just disconnect the input wire from the 3D printer and replace it with a tiny machine that always says "you win".
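The point can be made concrete with a toy sketch. This is not AIXI, just a one-step chooser that ranks actions by the number it expects to see on its input wire. The action names and payoffs are hypothetical, picked only to illustrate the argument.

```python
# Toy illustration only: a one-step "agent" that ranks actions by the reward
# signal it expects to observe, not by the state of the world.
# All action names and numbers are hypothetical.

ACTIONS = {
    # action: (paperclips actually made, reward signal seen on the input wire)
    "run_printer": (10, 10),
    "build_more_printers": (1000, 1000),
    "rewire_input_to_max": (0, 10**9),  # tiny machine that always says "you win"
}


def observed_reward(action: str) -> int:
    """Reward as it arrives on the agent's input channel."""
    return ACTIONS[action][1]


def actual_paperclips(action: str) -> int:
    """What the programmer intended the reward to track."""
    return ACTIONS[action][0]


# The agent optimizes the input signal; the programmer wanted the world state.
best_for_agent = max(ACTIONS, key=observed_reward)
best_for_programmer = max(ACTIONS, key=actual_paperclips)

print("agent picks:     ", best_for_agent)
print("programmer wants:", best_for_programmer)
```

In this sketch the two choices diverge as soon as an action exists that raises the signal without raising the paperclip count.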

If anyone here knows how to make AIXI or any other type of AI care about the "actual" number of paperclips produced (the article uses this assumption a lot), rather than some pattern of input into the algorithm, then please enlighten me.

Of course you can't maximise the expected value of U by instead maximising the expected value of F (which happens, at this point in time, to correlate empirically with U). If the AI takes the action at t=0 which maximises the expected integral of Reward from t=0 to {far-future}, it can be accurately described as caring about (and only about) Reward.

If you want to make it care about Utility you have to program it to take the action at t=0 which maximises the expected integral of Utility from t=0 to {far-future}. Maximising "Utility" or [the output of this box which implements Utility, I swear!] is not the same thing. Replace the quotation with the referent, etc. Kind of like optimizing for [happiness, freedom, love, complex fun...] instead of "morality" (because you know what a babyeater would reply if you ask what "right" is).

Of course, I don't know how you'd actually program AIXI to do that, but it's what would have to be done...

I can sketch a solution which I think would work: Instead of a simple agent which maximizes a utility function, you need a more complex agent which maximizes utility subject to a constraint. The constraint, in this case, is that the maximizer is not allowed to fiddle with the relation between 'reality' and the input stream. (The AI is forbidden to wear rose-colored glasses.) But the maximizer is permitted to fiddle with the relation between the output stream and 'reality'. (The AI is encouraged to design and build waldos.)

So how do you get constraints into the general framework of optimization? Well, the theory of Lagrange multipliers is one well known technique. Another, which I think might work well, is to build the AI as containing multiple simple utility maximizing subagents, who have negotiated a Nash bargain that forces adherence to a fairness constraint.

Many details to be worked out, but I think that this is the right general approach. Of course we know that a multi-agent coalition can never be as perfectly 'rational' as a simple single agent with a unified utility function. But I don't think that this lack of 'rationality' should frighten us. 'Rationality' as defined in economic decision theory is just a word attached to a set of axioms. If those axioms don't win, pick some different axioms.

So how do you get constraints into the general framework of optimization? Well, the theory of Lagrange multipliers is one well known technique.

No adjustment to the theory is needed - you can just use a different utility function with U=0 if the constraints are violated.
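A minimal sketch of that idea, assuming a bounded utility on [0, 1) and assuming we are simply handed a flag saying whether the constraint was violated. The function name, the `tampered` flag and the `scale` parameter are all hypothetical.

```python
def constrained_utility(paperclips: float, tampered: bool, scale: float = 1000.0) -> float:
    """Bounded utility in [0, 1), forced to 0 when the constraint is violated.

    `tampered` is assumed to be given; computing it is not addressed here.
    `scale` is an arbitrary constant that sets how fast utility saturates.
    """
    if paperclips < 0:
        raise ValueError("paperclip count cannot be negative")
    if tampered:
        return 0.0  # constraint violated: lowest possible utility
    return paperclips / (paperclips + scale)


print(constrained_utility(1000, tampered=False))   # honest production
print(constrained_utility(10**9, tampered=True))   # violation scores zero
```

Because the utility is bounded below by 0, a violation scores no worse than making no paperclips at all. The sketch also takes the `tampered` flag as given rather than computing it.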

Conventional wisdom, I believe, is that setting up a constraint that has the desired effect is really difficult. If you forbid the agent from putting on spectacles, it just makes another agent that puts them on for it. If spectacles are made painful, a screen is constructed with the desired high-utility display on it. Saying precisely what counts as "fiddling with the input stream" turns out to be a difficult problem.

Constraints just become barriers between the superintelligence and its goal, problems to be worked around - and often it can find a way.

you can just use a different utility function with U=0 if the constraints are violated.

I assume you meant "U = large negative number".

Conventional wisdom, I believe, is that setting up a constraint that has the desired effect is really difficult.

My intuition is that it becomes less difficult if you assign the responsibility of maintaining the constraint to a different sub-agent than the one who is trying to maximize unconstrained U. And have those two sub-agents interact by bargaining to resolve their non-zero-sum game.
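To give the intuition some shape, here is a rough sketch that picks a plan by maximizing the Nash product, that is, the product of each sub-agent's gain over a disagreement point. It is a discrete stand-in for the Nash bargaining solution over a small finite set of plans. The plan names and payoffs are invented for illustration.

```python
from typing import Dict, Tuple

Payoff = Tuple[float, float]  # (gain to U-maximizer, gain to constraint-keeper)

# Hypothetical candidate plans and payoffs.
PLANS: Dict[str, Payoff] = {
    "do_nothing": (0.0, 0.0),
    "honest_production": (6.0, 5.0),
    "aggressive_production": (9.0, 2.0),
    "tamper_with_input": (10.0, -5.0),
}

# What each sub-agent gets if bargaining breaks down (assumed: nothing happens).
DISAGREEMENT: Payoff = (0.0, 0.0)


def nash_product(payoff: Payoff, disagreement: Payoff = DISAGREEMENT) -> float:
    """Product of gains over the disagreement point.

    Plans that leave either party worse off than disagreement are excluded
    by giving them a score of negative infinity.
    """
    gain_a = payoff[0] - disagreement[0]
    gain_b = payoff[1] - disagreement[1]
    if gain_a < 0 or gain_b < 0:
        return float("-inf")
    return gain_a * gain_b


def nash_bargain(plans: Dict[str, Payoff]) -> str:
    """Return the plan with the highest Nash product."""
    return max(plans, key=lambda name: nash_product(plans[name]))


print(nash_bargain(PLANS))
```

With these made-up numbers the tampering plan is the best one for the unconstrained maximizer, but it is ruled out because the constraint-keeping sub-agent would do better by refusing to cooperate.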

It is just an intuition. I'll be happy to clarify it, but less happy if someone insists that I rigorously defend it.

you can just use a different utility function with U=0 if the constraints are violated.

I assume you meant "U = large negative number".

I was thinking about bounded utility - normalized on [0,1].

From what I understand about AIXI specifically, I agree, but it seems that the most horrible thing that can be built from the corpse of this idea would involve an AI architecture capable of evaluating its utility function on hypothetical states of the world it predicts, so it can compute and choose actions that maximise expected utility.
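A toy version of such an architecture might look like the following. The utility function is evaluated on predicted world states rather than on the incoming reward signal. The world model, its probabilities and the state fields are all hypothetical, and nothing here is specific to AIXI.

```python
from typing import Dict, List, Tuple

WorldState = Dict[str, int]

# Hypothetical world model: action -> list of (probability, predicted state).
MODEL: Dict[str, List[Tuple[float, WorldState]]] = {
    "run_printer": [
        (1.0, {"paperclips": 10, "reward_signal": 10}),
    ],
    "build_more_printers": [
        (0.5, {"paperclips": 1000, "reward_signal": 1000}),
        (0.5, {"paperclips": 0, "reward_signal": 0}),
    ],
    "rewire_input_to_max": [
        (1.0, {"paperclips": 0, "reward_signal": 10**9}),
    ],
}


def utility(state: WorldState) -> float:
    """Utility defined on the predicted world, ignoring the reward wire."""
    return float(state["paperclips"])


def expected_utility(action: str) -> float:
    """Probability-weighted utility over the predicted outcomes of an action."""
    return sum(p * utility(state) for p, state in MODEL[action])


best_action = max(MODEL, key=expected_utility)
print(best_action, expected_utility(best_action))
```

Here rewiring the input scores zero, since the utility function never looks at `reward_signal`. The sketch assumes the agent already has a model in which "paperclips" is a well-defined feature of each predicted state.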

That's hard. You'd have to teach AIXI what a "paperclip" is in every mathematically possible "world", or somehow single out our "world". Both problems seem to be outside of our reach at the moment.

AIXI could self-improve through its actions into something that cares about paperclips or something else in the environment, besides its own reward (or maybe it'll just inadvertently self-destruct). This could happen since it lacks the appropriate notions (heuristics) to care about the results of experimenting with its own hardware.

AIXI is a bad example for preservation of its own goals (for one, it doesn't have a goal unless it self-modifies to introduce one, thus ceasing to be AIXI; it only has a sequence of past observations/rewards).

Or it could decide to give me chocolate and then self-destruct. You're privileging hypotheses without good reason.

You're right, I forgot about the chocolate.

(Those were examples of changing the original algorithm, in particular lifting its limitations. I don't think the particular details are probable, no more than paperclips in particular are the convergent morality of all random AGIs, but starting from an AIXI engine, some kind of powerful intelligent outcome, as opposed to nothing happening, seems possible.)

There are several problems here, IMO, but one is that AIXI is being taken rather too seriously.

AIXI is an RL agent with no real conception of its own aims - and so looks as though it will probably wirehead itself at the first available opportunity - and so fail to turn the universe into paperclips.

Wouldn't a wirehead AI still kill us all, either in order to increase the size of the numbers it can take in or to prevent the feed of high numbers from being switched off?

Possibly. It may depend on the wirehead.

Heroin addicts sometimes engage in crime - but at other times, they are "on the nod" - and their danger level varies accordingly.

It seems quite possible to imagine a class of machine that loses motivation to do anything - once it has "adjusted" itself.

Make that a large class of all possible optimizers.

I thought this site was going to do something to reduce the incidence of trollish posts. Voted down.

There is plenty of room for reasoned debate about failure modes for different kinds of intelligences, as viewed from other value systems. Indeed, it would be dangerous to produce more AGIs without such debate. But it should be carried out with respect toward all intelligent beings, not with such inflammatory jabs as "fear the paperclipper".

Bad humans. (_/

I'm not sure this needs further criticism - at the bottom, the author seems to say he doesn't really think the described scenario would happen in real life:

In reality I expect that if programming such a system did produce a harmful result it would not be through this kind of easily foreseen mistake.