My model is that

  1. Alignment = an AI using the correct model for interpreting its goals
  2. Succeeding in this model design leads to something akin to CEV
  3. Errors under this correct model (e.g. mesa-optimisation) are unlikely, because high intelligence + a correct model of interpretation = correct extrapolation of goals
  4. The choice of interpretation model seems almost unrelated to intelligence; it appears equivalent to a choice among philosophies of meaning.

a) Is my model accurate?

b) Any recommendations for reading that explores alignment from a similar angle (e.g. which philosophy of meaning is most likely to emerge in LLMs)? So far, Alex Flint's posts come up the most. Which research agendas sound closest to this framing?

Thanks!


Anon User


Note that your "1" has two words that both carry a very heavy load - "uses" and "correct". What does it mean for a model to be correct? How do you create one? How do you ensure that the model you implemented in software is indeed correct? How do you create an AI that actually uses that model under all circumstances? In particular, how do you ensure that it is stable under self-improvement, out-of-distribution environments, etc.? Your "2-4" seem to indicate that you are focusing more on the "correct" part, and not enough on the "uses" part. My understanding is that if both "correct" and "uses" could be solved, it would indeed likely be a solution to the alignment problem, but it's probably not the only path, and not necessarily the most promising one. Other paths could potentially emerge from work on AI corrigibility, negative side-effect minimization, etc.

Yes, it's a tough problem! :) However, your points seem to expand, rather than correct my points, which makes me think it's not a bad way to compress the problem into a few words. Thanks!

Edit: (It seems to me that if an AI can correct its own misinterpretations, then, viewed from the outside, it's accurate to say it uses the correct model of interpretation, but I can see why you might disagree)


Certainly a significant portion of it involves having accurate meaning binding, in the sense of ensuring that references in fact refer to what they are intended to refer to, including references to valence words such as "good" and "bad" according to a particular person. Some keywords that seem to refer to closely related concepts in my head: concept binding/word binding; map vs territory; predictiveness of description; honesty. You might try dropping these into an academic search engine, e.g. arxivxplorer.