My bizarre question was just an illustrative example. It seems neither you nor I believe that would be an adequate criterion (though perhaps for different reasons).

If I may translate what you're saying into my own terms, you're saying that for a problem like "shoot first or ask first?" the criteria (i.e., constraints) would be highly complex and highly contextual. Ok. I'll grant that's a defensible design choice.

Earlier in the thread, you said:

the AI is supposed to take an action in spite of the fact that it is getting massive feedback from all the humans on the planet, that they do not want this action to be executed.

This is why I have homed in on scenarios where the AI has not yet received feedback on its plan. In these scenarios, the AI presumably must decide (even if the decision is only implicit) whether to consult humans about its plan first, or to go ahead with its plan first (and halt or change course in response to human feedback). To lay my cards on the table, I want to consider three possible policies the AI could have regarding this choice.

  1. Always (or usually) consult first. We can rule this out as impractical if the AI is performing a large number of atomic actions.
  2. Always (or usually) shoot first, and see what the response is. Unless the AI only makes friendly plans, I think this policy is catastrophic, since I believe there are many scenarios where an AI could initiate a plan and, before we know what hit us, we're in an unrecoverably bad situation. Therefore, implementing this policy in a non-catastrophic way is FAI-complete.
  3. Have some good criteria for picking between "shoot first" and "ask first" on any given chunk of planning. This is what you seem to be favoring in your answer above. (Correct me if I'm wrong.) These criteria will tend to be complex, and not necessarily formulated internally in an axiomatic way. Regardless, I fear that making good choices between "shoot first" and "ask first" is hard, even FAI-complete. Screw up once, and you are in a catastrophe like in case 2.

Can you let me know: have I understood you correctly? More importantly, do you agree with my framing of the dilemma for the AI? Do you agree with my assessment of the pitfalls of each of the 3 policies?

I understand your desire to stick to an exegesis of your own essay, but part of a critical examination of your essay is seeing whether or not it is on point, so these sorts of questions really are "about" your essay.

Regarding your preliminary answer: by "correct" I assume you mean "correctly reflecting the desires of the human supervisors"? (In which case, this discussion feeds into our other thread.)

With respect, your first point doesn't answer my question. My question was, what criteria would cause the AI to submit a given proposed action or plan for human approval? You might say that the AI submits every proposed atomic action for approval (in this case, the criterion is the trivial one, "always submit proposal"), but this seems unlikely. Regardless, it doesn't make sense to say the humans have already heard of the plan about which the AI is just now deciding whether to tell them.

In your second point you seem to be suggesting an answer to my question. (Correct me if I'm wrong.) You seem to be suggesting "context." I'm not sure what is meant by this. Is it reasonable to suppose that the AI would make the decision about whether to "shoot first" or "ask first" based on things like, e.g., the lower end of its 99% confidence interval for how satisfied its supervisors will be?
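To make that concrete, here is a minimal sketch (in Python) of the kind of rule I have in mind under policy 3 from my earlier comment. Every name in it (SatisfactionEstimate, predict_satisfaction, ASK_FIRST_THRESHOLD) is hypothetical and purely illustrative, not something taken from your essay.

```python
# Purely illustrative: a sketch of the kind of "policy 3" decision rule
# discussed above. Every name here is hypothetical; nothing below comes
# from the essay under discussion.

from dataclasses import dataclass


@dataclass
class SatisfactionEstimate:
    """Predicted supervisor satisfaction for a plan, on a 0-1 scale."""
    mean: float
    lower_99: float  # lower end of the 99% confidence interval
    upper_99: float


ASK_FIRST_THRESHOLD = 0.9  # arbitrary; choosing this well is part of the problem


def decide(plan, predict_satisfaction):
    """Return "execute" or "ask" for one chunk of planning.

    predict_satisfaction maps a plan to a SatisfactionEstimate. How the
    AI produces such estimates, and how reliable they are, is exactly
    what is at issue in this thread.
    """
    estimate = predict_satisfaction(plan)
    if estimate.lower_99 >= ASK_FIRST_THRESHOLD:
        return "execute"  # "shoot first": even the pessimistic case looks acceptable
    return "ask"          # "ask first": non-negligible risk of dissatisfying the supervisors
```

The sketch just restates the dilemma in code: all the difficulty is hidden inside predict_satisfaction and the choice of threshold, and my worry is that getting those right is itself FAI-complete.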

There are other methods than "sitting around thinking of as many exotic disaster scenarios as possible" by which one could seek to make AI friendly. Thus, believing that "sitting around [...]" will not be sufficient does not imply that we should halt AI research.

My question was about what criteria would cause the AI to make a proposal to the human supervisors before executing its plan. In this case, I don’t think the criteria can be that humans are objecting, since they haven’t heard its plan yet.

(Regarding the point that you're only addressing the scenarios proposed by Yudkowsky et al., see my remark here.)

The problem with your objecting to the particular scenarios Yudkowsky et al. propose is that the scenarios are merely illustrative. Of course, you can probably guard against any specific failure mode. The claim is that there will be a lot of failure modes, and we can't expect to guard against all of them by just sitting around thinking of as many exotic disaster scenarios as possible.

Mind you, I know your argument is more than just “I can see why these particular disasters could be avoided”. You’re claiming that certain features of AI will in general tend to make it careful and benevolent. Still, I don’t think it’s valid for you to complain about bait-and-switch, since that’s precisely the problem.

Yudkowsky et al. don't argue that the problem is unsolvable, only that it is hard. In particular, Yudkowsky fears it may be harder than creating AI in the first place, which would mean that in the natural evolution of things, UFAI appears before FAI. However, I needn't factor what I'm saying through the views of Yudkowsky. For an even more modest claim, we don't even have to believe that FAI is hard in hindsight in order to claim that AI will be unfriendly unless certain failure modes are guarded against. On this view of the FAI project, a large part of the effort is just noticing the possible failure modes that are only obvious in hindsight, and convincing people that the problem is important and won't solve itself.

Thanks for replying. Yes, it does help. My apologies; I think I misunderstood your argument initially. I confess I still don't see how it works, though.

You criticize the doctrine of logical infallibility, claiming that a truly intelligent AI would not believe such a thing. Maybe so. I'll set the question aside for now. My concern is that I don't think this doctrine is an essential part of the arguments or scenarios that Yudkowsky et al. present.

An intelligent AI might come to a conclusion about what it ought to do, and then recognize "yes, I might be wrong about this" (whatever is meant by "wrong"; this is not at all clear). The AI might always recognize this possibility about every one of its conclusions. Still, so what? Does this mean it won't act?

Can you tell me how you feel about the following two options? Or, if you prefer a third option, could you explain it? You could

1) explicitly program the AI to ask the programmers about every single one of its atomic actions before executing them. I think this is unrealistic. ("Should I move this articulator arm 0.5 degrees clockwise?")

2) or, expect the AI to conclude, through its own intelligence, that the programmers would want it to check in about some particular plan, P, before executing it. Presumably, the reason the AI would check in is that it sees that, as a result of its fallibility, there is a high chance that this course of action, P, might actually be unsatisfying to the programmers. But the point is that this checking-in is triggered by a specific concern the AI has about the risk to programmer satisfaction. It would not be triggered by some other plan, Q, about which the AI had no reasonable concern of a risk to programmer satisfaction.

Do you agree with either of these options? Can you suggest alternatives?

I see a fair amount of back-and-forth where someone says "What about this?" and you say "I addressed that in several places; clearly you didn't read it." Unfortunately, while you may think you have addressed the various issues, I don't think you did (and presumably your interlocutors don't either). Perhaps you will humor me in responding to my comment. Let me try to make the issue as sharp as possible by pointing out what I think is an out-and-out mistake on your part. In the section you call the heart of your argument, you say:

If the AI is superintelligent (and therefore unstoppable), it will be smart enough to know all about its own limitations when it comes to the business of reasoning about the world and making plans of action. But if it is also programmed to utterly ignore that fallibility—for example, when it follows its compulsion to put everyone on a dopamine drip, even though this plan is clearly a result of a programming error—then we must ask the question: how can the machine be both superintelligent and able to ignore a gigantic inconsistency in its reasoning?

Yes, the outcome is clearly the result of a "programming error" (in some sense). However, you then ask how a superintelligent machine could ignore such an "inconsistency in its reasoning." But a programming error is not the same thing as an inconsistency in reasoning: a machine can reason with perfect internal consistency while faithfully pursuing a goal its programmers specified incorrectly, in which case the error lies in the specification, not in the reasoning.

Note: I want to test your argument (at least at first), so I would rather not get a response from you claiming that I've failed to take into account other arguments or other evidence and that my objection is therefore invalid. Let me propose that you either 1) dispute that this was, in fact, a mistake, 2) explain how I have misunderstood, 3) grant that it was a mistake, and reformulate the claim here, or 4) state that this claim is not necessary for your argument.

If you can help me understand this point, I would be happy to continue to engage.

Hi. I'm a long-time lurker (a few years now), and I finally joined so that I could participate in the community and the discussions. This was born partly out of a sense that I'm at a place in my life where I could really benefit from this community (and it could benefit from me), and partly out of a specific interest in some of the things that have been posted recently: the MIRI technical research agenda.

In particular, once I've had more time to digest it, I want to post comments and questions about Reasoning Under Logical Uncertainty.

More about me: I'm currently working as a postdoctoral fellow in mathematics. My professional work is in physics-y differential geometry, so it is only connected to the LW material indirectly, via things like quantum mechanics. I practice Buddhist meditation, without definitively endorsing any of the doctrines. I'm surprised meditation hasn't gotten more airtime in the rationalist community.

My IRL exposure to the LWverse is limited (hi Critch!), but I gather there's a meetup group in Utrecht, where I'm living now.

Anyway, I look forward to good discussions. Hello everyone!
