CuriouslyNuclear (karma: 14060)

Comments, sorted by newest
Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
CuriouslyNuclear · 16d · 11

A goal of "make paperclips while you are switched on" does not. "Make paperclips while that's your goal", even less so.

 

This proposed solution is addressed directly in the article:

“I still don’t see why this is hard,” says the somewhat more experienced computer scientist who is not quite thinking fast enough. “Let V equal U1 in worlds where the button has never been pressed, and let it equal U2 in worlds where the button has been pressed at least once. Then if the original AI is a V-maximizer building more AIs, it will build them to follow V and not U1; it won’t want the successor AI to go on maximizing U1 after the button gets pressed because then it would expect a lower V-score. And the same would apply to modifying itself.”

But here’s the trick: A V-maximizer’s preferences are a mixture of U1 and U2 depending on whether the button is pressed, and so if a V-maximizer finds that it’s easier to score well under U2 than it is to score well under U1, then it has an incentive to cause the button to be pressed (and thus, to scare the user). And vice versa; if the AI finds that U1 is easier to score well under than U2, then a V-maximizer tries to prevent the user from pressing the button.
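The incentive the article describes can be seen in a toy expected-utility calculation. This is my own sketch, not from the article, and all the numbers are made up:

```python
# Toy model of a V-maximizer, where V = U1 in worlds where the button is
# never pressed and V = U2 in worlds where it has been pressed.

def expected_v(p_press: float, u1_score: float, u2_score: float) -> float:
    """Expected V-score given the probability that the button gets pressed."""
    return (1 - p_press) * u1_score + p_press * u2_score

# Suppose the AI finds U2 easier to score well under than U1:
u1_score, u2_score = 3.0, 10.0

baseline = expected_v(0.1, u1_score, u2_score)  # leave the user alone
scare    = expected_v(0.9, u1_score, u2_score)  # scare the user into pressing

print(baseline)  # ≈ 3.7
print(scare)     # ≈ 9.3 -- raising p_press raises expected V
```

Since expected V is linear in the button-press probability, whichever of U1 and U2 is easier to score under gives the AI an incentive to push that probability toward 0 or 1, exactly the manipulation the article warns about.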

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
CuriouslyNuclear · 16d · 10

Explain a non-VNM-rational architecture which is very intelligent, but has goals that are toggleable with a button in a way that is immune to the failures discussed in the article (as well as the related failures).

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
CuriouslyNuclear · 16d · 10

But we learned something from the exercise. We learned not just about the problem itself, but also about how hard it was to get outside grantmakers or journal editors to be able to understand what the problem was.

 

True, and unfortunately it extends beyond just grantmakers and journal editors.

This part was interesting:

“Ah,” says the computer scientist. “Well, in that case, how about if [some other clever idea]?”

Well, you see, that clever idea is isomorphic to the AI believing that it’s impossible for the button to ever be pressed, which incentivizes it to terrify the user whenever it gets a setback, so as to correlate setbacks with button-presses, which (relative to its injured belief system) causes it to think the setbacks can’t happen.

I would love to see Yudkowsky's or Soares's thoughts on Wentworth's Shutdown Problem Proposal, which seems to avoid the problems discussed here. At first glance it appears to fall under the above failure mode, but since it uses do-operations rather than conditioning, Wentworth argues it doesn't have this problem.

Discontinuous Linear Functions?!
CuriouslyNuclear · 4mo* · 50

This is a good post. I remember being so confused in a real analysis class when my professor started talking about how important it is that we restrict our attention to continuous linear functions (what on Earth was a discontinuous linear function supposed to be?). This post explains what's going on better than my professor or textbook did.

I agree with one of the other commenters that this part is not technically phrased accurately:

One way to think about continuity is that a function not having any "jumps" implies that it can't have an "infinitely steep" slope

Because, e.g., the derivative of f(x) = ∛x at x = 0 is +∞, despite the fact that f is continuous there.
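A quick numerical check of this (my own sketch, not from the post): the difference quotients of the cube root at 0 grow without bound as h shrinks, even though the function is continuous there.

```python
# f(x) = x^(1/3) is continuous at 0, but its difference quotients at 0
# are h^(1/3) / h = h^(-2/3), which blows up as h -> 0.
def f(x: float) -> float:
    return x ** (1.0 / 3.0)

for h in (1e-2, 1e-4, 1e-6):
    print(h, (f(h) - f(0.0)) / h)  # grows without bound as h shrinks
```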

Discontinuous Linear Functions?!
CuriouslyNuclear · 4mo · 60

For your example to work you need to restrict your functions to some compact domain, e.g. C^1([0,1]), because the uniform norm requires the functions to be bounded.

Hmm, is this really a substantive problem? Call it an "extended norm" instead of a norm and everything in the post works, right? My reasoning: An extended norm yields an extended metric space, which still generates a topology — it's just that points which are infinitely far apart are in different connected components. Since you get a perfectly valid topology, it makes perfect sense for the post to talk about continuity. Or at least I think so; am I missing something?
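To make the "infinitely far apart" point concrete, here's a quick numerical sketch (my own illustration; the helper name is made up). The sup-distance between f(x) = x and g(x) = 0 over all of ℝ is +∞, which we can approximate by taking the sup over larger and larger intervals:

```python
import numpy as np

def sup_diff(f, g, radius: float, n: int = 10001) -> float:
    """Approximate sup |f - g| over [-radius, radius] on a uniform grid."""
    xs = np.linspace(-radius, radius, n)
    return float(np.max(np.abs(f(xs) - g(xs))))

f = lambda x: x
g = lambda x: np.zeros_like(x)

for r in (10, 1000, 100000):
    print(r, sup_diff(f, g, r))  # grows with the radius: d(f, g) = +inf
```

Under the extended metric d(f, g) = sup |f - g| (allowing the value +∞), these two functions land in different connected components, but the topology itself is perfectly well-defined, which is the point above.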
