Corrigibility or DWIM is an attractive primary goal for AGI

Seth Herd

25th Nov 2023

While rereading the List of Lethalities (LoL), I found the argument against corrigibility compelling. It's really hard to specify a goal of "maximize X, except if someone tells you to shut down". I think the same argument applies to Christiano's goal of achieving corrigibility through RL by rewarding correlates of corrigibility. If other things are rewarded more reliably, you may not get your AGI to shut down when you need it to.

But those arguments don't apply if corrigibility in the broad sense is the primary goal. "Doing what this guy means by what he says" is a perfectly coherent goal. And it's a highly attractive one, for a few reasons. Perhaps "corrigibility" shouldn't be used in this sense, and "do what I mean" (DWIM) is a better term. But the two are closely related: DWIM accomplishes corrigibility, and has other advantages. I think it's fairly likely to be the first goal someone actually gives an AGI.

"Do what I mean" sidesteps the difficulty of outer alignment, which is another point in the LoL. One common plan, which seems sensible, is to keep humans in the loop and have a Long Reflection to decide what we want. DWIM allows you to contemplate and change your mind as much as you like.

Of course, the problem here is: do what WHO means? We'd like an AGI that serves all of humanity, not just one guy or board of directors. And we'd like to not have power struggles.

But from the point of view of a team actually deciding what goal to give their shot at AGI, DWIM will be incredibly attractive for practical reasons. The outer alignment problem is hard. Specifying one person (or a few) to take instructions from is vastly simpler than deciding and specifying a goal that captures all of human flourishing for all time. You don't want to trust an AGI to interpret that goal correctly. Interpreting DWIM is still fraught, but it is naturally self-correcting, and becomes more useful as the AGI gets more capable. A smarter AGI will be better at understanding what you probably mean, and better at realizing when it's not sure what you mean so it can ask for clarification.

This doesn't at all address inner alignment. But when somebody thinks they have good-enough inner alignment to launch a goal-directed, sapient AGI, DWIM is likely to be the goal they'll choose. This could be good or bad, depending on how well they've implemented inner alignment, and what type of people they are.

4 comments

Bogdan Ionut Cirstea (1y)

Agree, and I've had similar/related thoughts on how DWIM seems like a pretty natural target for LLM alignment: https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=65czxJGyBuhqhBRex https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=GRjfMwLDFgw6qLnDv

Seth Herd (1y)

Thanks! This seems pretty obvious, from this perspective, right? But there's a lot of concern that outer alignment being hard makes the alignment problem much harder. It seems like you can easily just punt on outer alignment, so I think it's very likely that's what people will do.

RogerDearnaley (1y)

Have you looked into Value Learning? It's basically "figure out what we mean, then do it".

Seth Herd (1y)

I hadn't seen value learning, thank you! I am familiar with Stuart Russell's inverse reinforcement learning, which I think is very similar, and closer to an implementable proposal. I am not enthusiastic about IRL. The proposal there is to infer a human's value function from their behavior, or from the behavior they reward in their agents. To me this seems like a very clumsy solution relative to asking the human what they want when it's unclear and the consequences are important. That is what I'm proposing: the obvious and simple approach that will likely be tried. It could be coupled with IRL.

My mental model here is not "figure out what we mean, then do it", but "infer what I mean based on your models of human language, then check with me if your estimate of the consequences is past this threshold I set, or if you have conflicting models of what I might mean". You probably would want some cumulative learning of likely intentions, but you would not want to relax the criteria for checking before executing consequential plans by very much.
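To make that check-before-acting rule concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the `Interpretation` structure, the numeric probability and impact scores, and the `ambiguity_margin` used to decide that two readings conflict. It is not a proposal for how a real system would produce those numbers, only a picture of the control flow.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    """One candidate reading of an instruction (hypothetical structure)."""
    meaning: str
    probability: float       # assumed: model's credence that this was meant
    estimated_impact: float  # assumed: model's estimate of how consequential acting is


def decide(candidates, impact_threshold, ambiguity_margin=0.2):
    """Return ("act", meaning) or ("ask", reason) for a list of candidate readings."""
    if not candidates:
        return ("ask", "no interpretation found")
    ranked = sorted(candidates, key=lambda c: c.probability, reverse=True)
    best = ranked[0]
    # Conflicting models of what was meant: the runner-up is nearly as likely.
    if len(ranked) > 1 and best.probability - ranked[1].probability < ambiguity_margin:
        return ("ask", "conflicting interpretations")
    # Consequences past the human-set threshold: check before executing.
    if best.estimated_impact > impact_threshold:
        return ("ask", "consequences exceed threshold")
    return ("act", best.meaning)
```

In this sketch the human sets `impact_threshold`, and the point about not relaxing the checking criteria corresponds to keeping that threshold and the margin roughly fixed even as the system's guesses about intentions improve.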

IRL or other value learning alone puts the weight of understanding human ethics/value functions on the AI system. Even if it works, current human ethics/value functions might be an extraordinarily bad outer alignment target. It could be that maximizing our revealed preferences leads to all-against-all competition or war, or the elimination of humanity in favor of better fits to our inferred value function. We don't know what we want, so we don't know what we'd get from having an AGI figure out what we really want. See Moral Reality Check (a short story) and my comment on it. So I'd prefer we figure out what we want for ourselves, and I think that's going to be a very common motivation among humans. The "long contemplation" suggestion appears to be a common one among people thinking about outer alignment targets.
