Inner Alignment

Discuss the wiki-tag on this page. Here is the place to ask questions and propose changes.

Inner alignment asks the question - “Is the model trying to do what humans want it to do?”

This seems inaccurate to me. An AI can be inner aligned and still not aligned overall if we solve inner alignment but mess up outer alignment.

This text also shows up in the outer alignment tag: Outer Alignment - LessWrong 

I've made an edit to remove this part.

I'm not actually sure about the difference here between this tag and Mesa-Optimizers.

I'm guessing the distinction was intended to be:

  • Mesa-Optimizers: Under what conditions do mesa-optimizers arise, and how can we detect or prevent them (if we want to, and if that's possible)?
  • Inner Alignment: How do you cause mesa-optimizers to have the same goal as the base optimizer? (Or, more generally, how do you cause mesa-optimizers to have the properties we want?)

Or is 'Inner Alignment' meant to be a subcategory of 'Mesa-Optimizers'?