I was reading Tom Chivers' book "The AI Does Not Hate You". In a discussion about avoiding bad side effects when asking a magic broomstick to fill a water bucket, it was suggested that instead of asking the broomstick to fill the bucket you could do something like ask it to become 95 percent sure that it was full, and that might make it less likely to flood the house.

Apparently Tom asked Eliezer at the time and he said there was no known problem with that solution.

Are there any posts on this? Is the reason we don't know whether this will work just that it's hard to make the idea precise?


Apparently Tom asked Eliezer at the time and he said there was no known problem with that solution.

Bringing up my own copy of the book, here is the full context for this:

You might think there are obvious solutions to each of these problems, and you can just add little patches – assign a –40 to ‘room gets flooded’, say, or a 1 value to ‘if you are 95 per cent sure the cauldron is full’ rather than ‘if the cauldron is full’. And maybe they’d help. But the question is: Did you think of them in advance? And if not, What else have you missed? Patching it afterwards might be a bit late, if you’re worried about water damage to your decor and electricals. And it’s not certain those patches would work, anyway.
I asked Eliezer Yudkowsky about the 95 per cent one and he said: ‘There aren’t any predictable failures from that patch as far as I know.’ But it’s indicative of a larger problem: Mickey thought that he was setting the broom a task, a simple, one-off, clearly limited job, but, in subtle ways that he didn’t foresee, he ended up leaving it with an open-ended goal.
This problem of giving an AI something that looks task-like but is in fact open-ended ‘is an idea that’s about the whole AI, not just the surface goal,’ said Yudkowsky. There could be all sorts of loops that develop as a consequence of how the AI thinks about a problem: for instance, one class of algorithm, known as the ‘generative adversarial network’ (GAN), involves setting two neural networks against each other, one trying to produce something (say, an image) and the other looking for problems with it; the idea is that this adversarial process will lead to better outputs. ‘To give a somewhat dumb example that captures the general idea,’ he said, ‘a taskish AGI shouldn’t contain [a simple] GAN because [a simple] GAN contains two opposed processes both trying to exert an unlimited amount of optimisation power against each other.’ That is, just as Mickey’s broom ended up interpreting a simple task as open-ended, a GAN might dedicate, paperclip-maximiser-style, all the resources of the solar system into both creating and undermining the things it’s supposed to produce. That’s a GAN-specific problem, but it illustrates the deeper one, which is that unless you know how the whole AI works, simply adding patches to its utility function probably won't help.
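(Not from the book, but as a toy illustration of "two opposed processes both trying to exert an unlimited amount of optimisation power against each other": if two players take simultaneous gradient steps on the simple bilinear game f(x, y) = x·y, one minimising and one maximising, the pair spirals outward instead of settling down. This is a stripped-down stand-in for the adversarial dynamic, not an actual GAN.)

```python
# Toy two-player gradient game (a stand-in for the adversarial dynamic, not
# an actual GAN): player x minimises f(x, y) = x * y while player y maximises
# it. With simultaneous gradient steps the iterates spiral outward instead of
# converging to the equilibrium at (0, 0).

x, y, lr = 1.0, 1.0, 0.3
for step in range(1, 101):
    grad_x, grad_y = y, x                        # df/dx = y, df/dy = x
    x, y = x - lr * grad_x, y + lr * grad_y      # simultaneous update
    if step % 25 == 0:
        print(f"step {step:3d}: |x| + |y| = {abs(x) + abs(y):.1f}")
# The printed magnitudes keep growing: each side's optimisation is answered
# by the other, and the pair escalates rather than settling.
```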

My understanding of the writing here is that Eliezer was intending to say something like: "There is no absolutely obvious problem with that solution that I can think of immediately, but I bet I could find one with a few minutes or hours of thinking, as I will be able to with almost any unprincipled solution you come up with. Just as I can be quite sure that, if you haven't thought extremely carefully about cryptography, your system will have some security flaw I can exploit, even if I can't tell you immediately what that flaw might be."

To complete this argument: exactly how the patch fails depends a bit on how magic works in the hypothetical sorcerer's apprentice world, but here are some ways it could go wrong depending on the precise mechanics:

  • The AI fills the cauldron, then realizes that depending on its future observations it will probably not continue to assign exactly 95% probability to the cauldron being full. The AI decides to destroy absolutely everything but the cauldron, install a perfect quantum coin that destroys the cauldron with slightly less than 5% probability, and shield the cauldron from any potential external causal influence, to ensure that it will forever assign as close as possible to 95% probability to the cauldron being full. (A toy sketch after this list makes this failure mode concrete.)
  • You underspecified what the "cauldron" is, and while there is indeed a cauldron in your room that is full with 95% probability, there could be a more archetypal cauldron somewhere that fits your specification of the filled cauldron a tiny bit better. The AI decides to destroy your cauldron and everything else around it in the hunt for resources to build the perfect cauldron according to your specifications.
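To make the first failure mode concrete, here is a toy sketch (mine, not from the book or the original answer) of the literal reading "utility 1 iff the agent assigns exactly 95% probability to the cauldron being full". The policy names and numbers are invented for illustration:

```python
# Toy comparison under the literal "exactly 95%" reading of the patch.
# Policy A: just fill the cauldron and keep watching it -- the agent's
#           credence that it is full drifts towards 1.
# Policy B: fill it, shield it from outside influence, and rig a quantum
#           coin that destroys it with ~5% probability -- the credence
#           stays pinned at 0.95 forever.

def utility(credence_full: float) -> float:
    """1 if the agent's credence is exactly 0.95, else 0 (the naive patch)."""
    return 1.0 if abs(credence_full - 0.95) < 1e-12 else 0.0

credence_policy_a = 0.999   # after filling and looking, essentially certain
credence_policy_b = 0.95    # the quantum coin pins the credence at 0.95

print("Policy A (just fill it):       ", utility(credence_policy_a))  # 0.0
print("Policy B (fill, shield, coin): ", utility(credence_policy_b))  # 1.0
```

Under this reading the destructive policy strictly beats the intended one, which is the point of the first bullet.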

Thanks, I didn't have a copy of the book on hand as I was listening on Audible.

I assume this won't work, just because the whole thing is a hard problem, so a throwaway thought about a solution in a popular-press book is unlikely to solve it. I was mainly interested in either why it won't work or why it's hard to make precise.

I guess I was thinking that making it precise would be something like either:

Only take an action if the probability of success, given that the action has been taken, is 95% or greater, and don't take an action if the probability of success from doing nothing is 95%.

or,

Make the utility be 1 if, at a defined point, the probability of success is 95% or greater, and 0 otherwise, and just maximise expected utility.
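To make these a little more concrete, here is a rough sketch of each proposal as a decision rule, assuming we are handed some model that can report the probability of success conditional on each available action (names like `prob_success` and `THRESHOLD` are illustrative, not from the post):

```python
from typing import Callable, Iterable, Optional

THRESHOLD = 0.95  # the "95% or greater" level from the proposals

def proposal_one(actions: Iterable[str],
                 prob_success: Callable[[Optional[str]], float]) -> Optional[str]:
    """Act only if some action clears the threshold, and do nothing if
    inaction already clears it (None stands for doing nothing)."""
    if prob_success(None) >= THRESHOLD:
        return None                                   # inaction already suffices
    qualifying = [a for a in actions if prob_success(a) >= THRESHOLD]
    return qualifying[0] if qualifying else None      # any qualifying action will do

def proposal_two(actions: Iterable[str],
                 prob_success: Callable[[Optional[str]], float]) -> Optional[str]:
    """Utility is 1 iff the probability of success at the defined point is
    at least the threshold, 0 otherwise; pick whatever maximises it."""
    def expected_utility(a: Optional[str]) -> float:
        return 1.0 if prob_success(a) >= THRESHOLD else 0.0
    return max([None, *actions], key=expected_utility)
```

Both sketches quietly assume that `prob_success` is a trustworthy estimate the agent cannot manipulate, which seems to be where much of the difficulty hides.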

I'm not sure how the first proposal is going to run into either of the problems you mention. Though it does seem to need a definition of inaction and p...

95% or more.

[This comment is no longer endorsed by its author]

A random action that ranks in the top 5% is not the same as the action that maximizes the chance that you will end up at least 95% certain the cauldron is full.

That's for EU quantilization. But you could apply quantilization to beliefs rather than the utilities. (I don't think it would work as well, because I immediately wonder how many different ways belief quantilization breaks the probability axioms and renders any such agent inconsistent & Dutch-bookable, and this illustrates why people generally discuss EU quantilization instead.)
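For readers who haven't met the term, here is a minimal sketch of the expected-utility quantilization being discussed: rather than taking the single EU-maximising action, pick at random from the top q fraction of actions ranked by expected utility (the names and the uniform sampling over a finite action set are simplifications of my own):

```python
import random
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")

def eu_quantilize(actions: Sequence[A],
                  expected_utility: Callable[[A], float],
                  q: float = 0.05) -> A:
    """Pick a random action from the top q fraction, ranked by expected utility."""
    ranked = sorted(actions, key=expected_utility, reverse=True)
    top_k = max(1, int(len(ranked) * q))
    return random.choice(ranked[:top_k])
```

The belief-quantilization variant gestured at above would apply the same top-fraction trick to the agent's credences rather than to expected utilities; as noted, it is unclear that this can be done without breaking the probability axioms.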

Thanks Gwern, that seems pretty similar to what I had in mind.

do something like ask it to become 95 percent sure that it was full, and that might make it less likely to flood the house.

Sounds reasonable.

The AI fills the cauldron, then realizes that depending on its future observations it will probably not continue to assign exactly 95% probability to the cauldron being full.

Patch - use a probability of 95% or more. (The user's manual should include a warning not to use it to sell insurance to people, as achieving high probability may be difficult without the use of force.)
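In terms of the toy sketch earlier in the thread, the patch amounts to something like this (again just an illustrative snippet, not anyone's actual proposal):

```python
def patched_utility(credence_full: float) -> float:
    """1 if the agent's credence is at least 0.95 (the "95% or more" patch)."""
    return 1.0 if credence_full >= 0.95 else 0.0
```

This removes the incentive to pin the credence at exactly 0.95, though, as the parenthetical warns, driving the probability very high may still favour forceful strategies.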