by [anonymous]
2 min read · 7th Feb 2013 · 37 comments


[Final Update: Back to 'Discussion'; struck out the initial framing, which was misleading.]

[Update: Moved to 'Main'. Also, judging by the comments, it appears that most have misunderstood the puzzle and read way too much into it; user 'Manfred' seems to have got the point.]

[Note: This little puzzle is my first article. Preliminary feedback suggests some of you might enjoy it while others might find it too obvious, hence the cautious submission to 'Discussion'; will move it to 'Main' if, and only if, it's well-received.]


In his recent paper "The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents", Nick Bostrom states:

Even an agent that has an apparently very limited final goal, such as “to make 32 paperclips”, could pursue unlimited resource acquisition if there were no relevant cost to the agent of doing so. For example, even after an expected-utility-maximizing agent had built 32 paperclips, it could use some extra resources to verify that it had indeed successfully built 32 paperclips meeting all the specifications (and, if necessary, to take corrective action). After it had done so, it could run another batch of tests to make doubly sure that no mistake had been made. And then it could run another test, and another. The benefits of subsequent tests would be subject to steeply diminishing returns; however, so long as there were no alternative action with a higher expected utility, the agent would keep testing and re-testing (and keep acquiring more resources to enable these tests).

Let us take it on from here.

It is tempting to say that a machine can never halt after achieving its goal because it cannot know with full certainty whether it has achieved its goal; it will continually verify, possibly to increasing degrees of certainty, whether it has achieved its goal, but never halt as such.

What if, from a naive goal G, the machine's goal were then redefined as "achieve 'G' with 'p' probability" for some p < 1? It appears this also would not work, given that the machine would never be fully certain of being p certain of having achieved G (and so on...).

Yet one can specify a set of conditions for which a program will terminate, so how is the argument above fallacious?


Solution in ROT13: Va beqre gb unyg fhpu na ntrag qbrfa'g arrq gb *xabj* vg'f c pregnva, vg bayl arrqf gb *or* c pregnva; nf gur pbaqvgvba vf rapbqrq, gur unygvat jvyy or gevttrerq bapr gur ntrag ragref gur fgngr bs c pregnvagl, ertneqyrff bs jurgure vg unf (shyy) xabjyrqtr bs vgf fgngr.


Moved to Discussion.

[-][anonymous]11y30

...where it properly belongs. (note to self: warm reception towards a post in Discussion doesn't equate to the same in Main!)

Don't confuse warm with lukewarm :) Unless a post gets 30 or so karma in Discussion, chances are it should not be in Main.

You could add a section in the AI's main loop that says "if P(G) > p then terminate", and for a non recursively self improving AI that doesn't know it has such a section in its code, this would work. For an AI that isn't powerful enough to rewrite itself, but knows it has this section of code, it seems possible that its best strategy given its bounded abilities is still to maximize P(G) until the termination clause activates, but it may not be true. We humans try to work around known ways that our deviations from expected utility maximization limit our abilities to achieve our goals, and an AI would likely try to do so as well. For a recursively self improving AI, the termination clause is not likely to survive the AI rewriting itself, as long as the AI understands that expected utility maximization is the most effective way to achieve its goals. (This is a general problem with trying to add safeguards to an AI that deviate from expected utility maximization, unless the deviation endorses itself, which is hard to set up.)
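Concretely, a minimal sketch of that first idea, where the termination clause lives in the outer loop rather than in the utility function; the three callables here are hypothetical stand-ins for the agent's actual inference and planning machinery:

```python
# Sketch of an "if P(G) > p then terminate" clause sitting in the main loop.
# estimate_prob_goal_achieved, best_action and execute are hypothetical
# stand-ins for the agent's inference and planning code.
P_THRESHOLD = 0.999  # the 'p' in "if P(G) > p then terminate"

def run_agent(estimate_prob_goal_achieved, best_action, execute):
    while True:
        p_goal = estimate_prob_goal_achieved()   # agent's current estimate of P(G)
        if p_goal > P_THRESHOLD:
            break                                # termination clause fires here
        execute(best_action())                   # otherwise keep maximizing P(G)
```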

On the other hand, trying to bake "P(G) > p" into the utility function makes the AI care about its epistemic state in a way that could conflict with the instrumental desire for accuracy, and makes it vulnerable to wireheading. (And it has the problem in the OP, where the AI becomes concerned with minimizing the meta-uncertainty about its epistemic state, though perhaps it could be programmed to believe its inspection of its own epistemic state is 100% accurate, though this would also be difficult to make stable under recursive self improvement.)

Or, to put it another way: to require knowledge above "merely" storing a number, knowledge that includes "knowing you know" and "knowing you know you know" and so on, is to make a mistake similar to that of those who postulated a homunculus inside our heads, doing the looking when we look at things.

On the other hand, given that humans (especially on LW) do analyze things on several meta levels, it seems possible to program an AI to do the same, and in fact many discussions of AI assume this (e.g. discussing whether the AI will suspect it's trapped in some simulation). It's an interesting question how intelligent an AI can get without having the need (or ability) to go meta.

[-][anonymous]11y10

Also true. Indeed, this puzzle is all about resolving confusion between object and meta level(s); hopefully no one here at LW endorses the view that a (sufficiently well programmed) AI is incapable of going meta, so to speak.

I wonder how one would calculate what level of meta-knowledge about a completeness condition is necessary for some given priority task.

See my point on satisficers wanting to be maximisers: http://lesswrong.com/lw/854/satisficers_want_to_become_maximisers/

"Achieve 'G' with 'p' probability" does not seem to be a stable goal for an AI capable of changing itself.

[-][anonymous]11y00

Oh absolutely, it's far too simplistic for that. In fact it's nowhere near adequate for a sufficiently advanced agent capable of changing itself, but was merely chosen for the purpose of illustration; it seems I somewhat miscommunicated the intent of this puzzle.

Doesn't this machine have a set of ways to generate negative utility (it might feel unpleasant when using up resources, for example, as a way to prevent a scenario where the goal of 32 paperclips becomes impossible)? With fewer and fewer ways to generate utility as the diminishing returns pile on, the machine will either have to terminate itself (to avoid a life of suffering), or seek to counter the negative generators (if suicide = massive utility penalty).

If there's only one way to generate utility and no way to lose it however, that's going to lead to the behavior of an addicted wirehead.

Wouldn't any moderately complex optimiser be likely to have some model of a probability spread, or distribution, rather than holding all its knowledge as a single number? Intuitively it seems to me to be useful to be able to model the difference between, say, on one hand a probability density that's (roughly) Gaussian-shaped with a mean of 0.5 and an sd of 0.02, and on the other hand a delta function at 0.5. With both, your single-value estimate of the probability is 0.5, but your certainty in that estimate is different.

For such a machine one wouldn't specify "achieve 'G' with probability 'p'", but rather "achieve 'G' such that the probability density below 'p' is < f", where f is some specified (small) fraction of the total. That seems rather more straightforward in many ways.
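A rough sketch of that criterion, assuming (purely for illustration) that the agent summarizes its verification outcomes with a Beta posterior; the counts and thresholds below are invented:

```python
# Sketch of the "probability mass below p must be < f" criterion, with the
# agent's uncertainty about success modelled as a Beta posterior.
from scipy.stats import beta

successes, failures = 48, 1        # hypothetical verification results so far
p, f = 0.95, 0.01                  # target probability and allowed tail mass

posterior = beta(1 + successes, 1 + failures)   # uniform prior plus evidence
tail_mass = posterior.cdf(p)                    # posterior mass lying below p

if tail_mass < f:
    print("halt: confident enough that the success probability is at least p")
else:
    print("keep verifying")
```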

I came up with a different solution:

Znxr vgf tbny gb or "Znxr 32 cncrepyvcf, naq unyg nf dhvpxyl nf cbffvoyr". Gura vg jvyy pbagvahr irevslvat hagvy gur hgvyvgl bs unygvat birepbzrf gur qvfhgvyvgl bs orvat hapregnva.

I think there are about two good answers here: "Don't make intelligences that just want to make paperclips, or they will work towards creating paperclips in a way that humans would think is unreasonable. In order to have your intelligence act reasonably, it needs to have a notion of reasonableness that mirrors that of humanity. And that means having a utility function that matches that of humanity in general." or "Be sure that your AI has a boredom function so that it won't keep doing the same things over and over again. After a sufficient degree of certainty, the AI should get tired of checking and re-checking its work and move on to something else instead of plotting to take over the world so it can devote ever greater resources to a single project."

Maybe these are even the same answer. I know that humans get bored of checking and re-checking themselves, and would find someone who fails to get bored of doing the same calculations over and over again to be unreasonable and/or crazy.

I don't think the problem is specified precisely enough to have a definite solution. For example, depending on how you specify the stopping condition, the AI might self-modify (or create a new AI) to ignore it. Besides which, I think the most parsimonious solution is to just introduce a resource cost (which would also apply to any agents it makes act on its behalf), which will automatically make it not want to check past a certain point.
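As a toy illustration of that resource-cost fix (assuming, arbitrarily, that each extra verification halves the remaining probability of error), checking stops paying for itself after a finite number of rounds:

```python
# Each further check halves the residual error risk (an assumption made up
# for this example) but burns a fixed amount of resources, so the marginal
# benefit eventually drops below the marginal cost and checking stops.
VALUE_OF_GOAL = 1000.0   # utility of actually having 32 correct paperclips
COST_PER_CHECK = 0.01    # utility cost of the resources one more check uses

p_error = 0.05           # current subjective probability that a mistake remains
checks = 0
while (p_error / 2) * VALUE_OF_GOAL >= COST_PER_CHECK:
    p_error /= 2         # the check halves the residual error risk (assumption)
    checks += 1

print(f"stopped after {checks} checks; residual error about {p_error:.1e}")
```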

I'm also not sure what you intend (rot13) gur qvssrerapr orgjrra "orvat c pregnva" naq "xabjvat lbh'er c pregnva" gb or. Sbe rknzcyr, fnl lbh'er jevgvat n cncre naq lbh ena n grfg naq tbg fgngvfgvpny fvtavsvpnapr (fnl c = 0.01). V gnxr vg va guvf pnfr lbh ner c pregnva, ohg qb lbh xabj lbh'er c pregnva? Frrzf yvxr n jrveq pbaprcg. It's entirely possible that I'm just confused about this aspect, though.

I think the machine would probably just make more than 32 paperclips. Redundancy helps.

How do you tell the machine to terminate?

The way I see it, the machine exists for an instant. Self-modifying, or even not deleting itself, are just creating a new machine. If you tell it to terminate, you have to specify what you mean by its future self. You could say that it's any machine running the same code, but then the machine will have no reason to keep this code after the first self-modification. In fact, since deleting this code is of itself a self-modification, it can just delete it immediately. You could tell it to stop any machine it programs, but then it might just indirectly program them, and subtly influence someone to make a paperclip maximizer. You could tell it to stop any machine it programs indirectly, but merely by existing in the same universe as us, it will have modified us somehow, and would thus be forced to kill us all.

I suspect the analysis needed to prove "(it is probable that) ×100 my goal is achieved" is very similar to the one needed to prove that "(it is probable that) ×101 my goal is achieved". Because of this, it seems likely that there is a degree of reflection beyond which all evidence gathered will strengthen the probability for all subsequent meta-checks. So it is not a given that such an approach would never terminate.

[-][anonymous]11y00

We have no proofs in science (excepting, of course, pure mathematics and logic). In the empirical sciences, which alone can furnish us with information about the world we live in, proofs do not occur, if we mean by 'proof' an argument which establishes once and for ever the truth of a theory. [...] The new tendency is to discard proofs, and with them, any kind of rational argument. With the romantics, a new kind of dogmatism becomes fashionable, in philosophy as well as in the social sciences. It confronts us with its dictum. And we can take it or leave it. [...] But although proof does not play any part in the empirical sciences, argument still does; indeed, its part is at least as important as that played by observation and experiment.

I feel tempted to downvote for the claim that argument's part is "at least as important as that played by observation and experiment," since this seems to tremendously overprivilege a clever arguer, but I'm refraining because my negative affect may mainly be a holdover from Curi and the Popperclipping incident.

Alternate solution: Tvir vg gur tbny bs "znxr 32 cncrepyvcf, jvgu gur uvturfg ninvynoyr pbasvqrapr, hfvat bayl K erfbheprf"

I also like itaibn0's solution

(Began reading. Didn't click through onto paper. In the post, got to where Bostrom is quoted thus:)

For example, even after an expected-utility-maximizing agent had built 32 paperclips, it could use some extra resources to verify that it had indeed successfully built 32 paperclips meeting all the specifications (and, if necessary, to take corrective action).

Yes, it could. But would it? If I write a script to, say, generate a multiset of the prime factors of a number by trial and error, and run it on a computer, the computer simply performs the actions encoded into the script.

Now suppose I appended code to my script to check that the elements of the multiset did indeed constitute a prime factorisation, by checking the primality of each element and checking that their product returned the original number. Then one might call what the updated script does (or we might say, what the script tells a computer to do) 'checking'. All this means is that we've performed a test to increase our confidence in a proposed solution.
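A sketch of the kind of script described (assuming trial division for the factorisation step): the appended check tests each factor for primality and multiplies the factors back together.

```python
# Trial division produces the multiset of prime factors; the appended check
# verifies primality of each factor and that their product recovers n.
from math import prod

def prime_factors(n):
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def is_prime(k):
    return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))

def check_factorisation(n, factors):
    return all(is_prime(f) for f in factors) and prod(factors) == n

n = 360
factors = prime_factors(n)
print(factors, check_factorisation(n, factors))   # [2, 2, 2, 3, 3, 5] True
```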

But another sense of 'checking' emerges: Suppose I have someone check that some books are in some sort of alphanumeric order. I don't tell them that the check is being done in order to put the books on a library shelf correctly. In this situation, it seems that the statement 'The helper checked that the books were in order' is clearly true, but the statement 'The helper checked that the books were ready to be shelved' seems less intuitive.

It seems, then, that maybe saying that the script/computer itself checks the correctness of the prime factorisation was sloppy if we use the second sense of 'checking'; I who wrote it was checking by using the script, but the script itself, lacking knowledge of what it was terminally checking, could not be said to be checking the factorisation, just as the sorting helper could not be said to be checking shelf-readiness even as they could be said to be checking order.

Checking is pretty much just applying tests to a proposed solution in order to reach a more reliable understanding of the plausibility of the solution. So unless the agent is 'programmed'/wired to do such tests, it won't necessarily do so. Also, if the programming/wiring is not good in terms of correspondence to the intended task (e.g. if my prime factorisation script fails to consider the multiplicity of prime factors or the 32-clipper is programmed with tokens that do not refer), the actions will be taken, not meet the intended target, and not be checked.

This is why algorithms have to be proved to work; throwing some steps together that seem sorta like the right idea won't yield a knowably accurate method.

(Finished reading the post except solution.)

Go meta or go home. Even if the task is to achieve X with probability p, once this is translated into an algorithm that is performed, nothing special happens. For example, say I get ice cream if a robot flips a coin and it comes up heads. If I program the robot to ensure ice cream with p=0.5 by (the 'by' being necessary because without actually specifying the algorithm, I could be referring to any in a whole class of ways to program a robot to do this, only some of which work, and only some of which check) flipping a coin, the goal will be achieved immediately and no checking will take place.

TL;DR: Taboo 'check' or ascertain its meaning by reduction to testing using suitable thought experiments, or by looking at brains to see what physical phenomena correspond to the act of checking.

(After reading solution, comments.)

Manfred: Elegantly put.

The difference between being 'p certain' and knowing that one's p certain might be hard to grasp because we are so often aware of our impressions.

However, until you read this sentence, you didn't know that you were certain five minutes ago that the Sun would not disappear three minutes ago; now that your mind is blown, your behaviour will be observably different to before this realisation, i.e. coming into knowledge of the certainty has made a measurable difference.

A belief does not necessitate a belief about that belief.

Also, not all checkers would check indefinitely. For example, a checker with a grasp of verification paradoxes might reach a point of maximal confidence and terminate. Meanwhile, a checker that thought it should (or, more accurately, was programmed/wired to) flip coins as a test for 32-paper-clips-hood, with its doubt halving each time no matter the outcome, might never terminate.

[-][anonymous]11y10

Checking is pretty much just applying tests to a proposed solution in order to reach a more reliable understanding of the plausibility of the solution. So unless the agent is 'programmed'/wired to do such tests, it won't necessarily do so.

Uhm... sounds like you've never heard of Basic AI Drives which, according to Omohundro, are behaviours of "sufficiently advanced AI systems of any design" "which will be present unless explicitly counteracted"; I invite you to look into that.

Was that "Uhm..." really necessary?

[-][anonymous]11y00

You imply that there's a good reason for not using that "uhm...". What reason would that be?

Politeness? profunda's entire comment strikes me as quite rude and removing the beginning would at least be a step in the right direction.

[-][anonymous]11y10

The "uhm" didn't come across to me as rudeness; it came across as hesitance, which I interpreted as fear of coming across as rude. I admit that "sounds like you've never heard of X" is less polite than "have you heard of X?", and the latter seems like a good substitute. "I invite you to look into that" doesn't seem rude at all to me.

I interpreted it as sarcastic. YMMV. It is of course very hard to make sure stuff like this isn't misinterpreted on the internet.

[-][anonymous]11y30

Warringal correctly interpreted what I was trying to convey and even summed up my afterthoughts on how I could've been more tactful; however, 'uhm' does assume negative connotations in online discussions -- according to Urban Dictionary, but unbeknown to me at the time.

You have a good and correct point, but it has nothing to do with your question.

a machine can never halt after achieving its goal because it cannot know with full certainty whether it has achieved its goal

This is a misunderstanding of how such a machine might work.

To verify that it completed the task, the machine must match the current state to the desired state. The desired state is any state where the machine has "made 32 paperclips". Now what's a paperclip?

For quite some time we've had the technology to identify a paperclip in an image, if one exists. One lesson we've learned pretty well is this: don't overfit. The paperclip you're going to be tested on is probably not one you've seen before. You'll need to know what features are common in paperclips (and less common in other objects) and how much variability they present. Tolerance to this variability will be necessary for generalization, and this means you can never be sure if you're seeing a paperclip. In this sense there's a limit to how well the user can specify the goal.

So after taking a few images of the paperclips it's made, the machine's major source of (unavoidable) uncertainty will be "is this what the user meant?", not "am I really getting a good image of what's on the table?". Any half-decent implementation will go do other things (such as go ask the user).

Sure, it will go ask the user too. And do various other things.
But it remains true that if it wants to be maximally confident that it has achieved its target state, at no time will it decide that maximal confidence has been achieved and shut down, because there will always be something that it can do to increase (if only by an increasingly small epsilon) its confidence.


A possible implementation of your idea would be an agent that, at every juncture, evaluates the expected consequences of the actions it could take. If there is a unique action that will lead to the agent's goals being satisfied with subjective probability above the threshold p, then the agent takes that action. If there are multiple such actions, the agent chooses one randomly, perhaps with preference given to the action "do nothing" or "self-destruct". If there are no such actions, then the agent takes an action that maximizes the subjective probability of its goals being satisfied.
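A rough sketch of that decision rule, with hypothetical action names and probabilities (the numbers are made up for illustration):

```python
# `actions` maps each available action to the agent's subjective probability
# that taking it leads to the goal being satisfied.
import random

def choose_action(actions, p_threshold, inert=("do_nothing", "self_destruct")):
    satisficing = [a for a, prob in actions.items() if prob > p_threshold]
    if len(satisficing) == 1:
        return satisficing[0]                           # unique satisficing action
    if satisficing:
        preferred = [a for a in satisficing if a in inert]
        return random.choice(preferred or satisficing)  # prefer inert options
    return max(actions, key=actions.get)                # nothing clears p: maximize

example = {"do_nothing": 0.97, "run_another_test": 0.99, "acquire_resources": 0.999}
print(choose_action(example, 0.95))   # prints "do_nothing"
```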

A problem with this is that if the agent's subjective probability of achieving its goals is ever below the threshold, then the agent will have reason to modify itself to become an optimizer.

Before decrypting the solution: What do you mean by "be fully certain of being p certain of"? It is entirely sensible to create a Bayesian agent that is fully certain about its probabilities by construction. And even if it isn't, the goal as you put it was to be p-certain of having achieved the goal, not to be absolutely certain of being p-certain of having achieved the goal. I am not even sure how the recursive probabilities of probabilities would be supposed to work.

Edit after decryption: well, ok.

by the problem of induction, it cannot know with full certainty whether it has achieved its goal

The problem of induction is not relevant here. The real issue is that finitely many bits of information cannot move a Bayesian reasoner from p in (0,1) to p in {0,1}.
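To illustrate in exact arithmetic (the 10:1 likelihood ratio below is an arbitrary assumption): each piece of evidence multiplies the odds by a finite factor, so after any finite number of updates the posterior is still strictly less than 1.

```python
# Exact Bayesian updating in odds form: finite evidence keeps the odds finite,
# so the posterior never reaches 0 or 1 exactly.
from fractions import Fraction

odds = Fraction(1, 1)                # prior odds 1:1, i.e. p = 0.5
likelihood_ratio = Fraction(10, 1)   # each hypothetical test gives 10:1 evidence

for _ in range(100):                 # 100 strong, independent confirmations
    odds *= likelihood_ratio

p = odds / (1 + odds)
print(p < 1)                         # True: still not certainty, only about 1 - 1e-100
```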

[-][anonymous]11y00

Strictly speaking, the problem of induction is a deeper question concerning the justification of inductive methods; for the sake of clarity I've edited it to "due to the limits of induction", though I find this to border on semantic pedantry...

But the issue applies even to non-inductive knowledge. An AGI tasked to calculate pi to ten decimal places will still eat up the lightcone to check, due to the limits on deductive knowledge.

[-][anonymous]11y30

Fair point; I overlooked that aspect. In any case, I've removed the (redundant) sentence altogether.