Inner Alignment

Discuss the wiki-tag on this page. Here is the place to ask questions and propose changes.

Inner alignment asks the question - “Is the model trying to do what humans want it to do?”

This seems inaccurate to me. An AI can be inner aligned and still not aligned overall if we solve inner alignment but mess up outer alignment.

This text also shows up in the outer alignment tag: Outer Alignment - LessWrong 

I've made an edit to remove this part.

I'm not actually sure about the difference here between this tag and Mesa-Optimizers.

I'm guessing the distinction was intended to be:

  • Mesa-Optimizers: Under what conditions do mesa-optimizers arise, and how can we detect or prevent them (if we want to, and if that's possible)?
  • Inner Alignment: How do you cause mesa-optimizers to have the same goal as the base optimizer? (Or, more generally, how do you cause mesa-optimizers to have the properties we want?)

Or is 'Inner Alignment' meant to be a subcategory of 'Mesa-Optimizers'?