We want to work towards a world in which the alignment problem is a mainstream concern among ML researchers. An important part of this is popularizing alignment-related concepts within the ML community. Here are a few recent examples:

  • Reward hacking / misspecification (blog post)
  • Convergent instrumental goals (paper)
  • Objective robustness (paper)
  • Assistance (paper)

(I'm sure this list is missing many examples; let me know if there are any in particular I should include.)

Meanwhile, there are many other things that alignment researchers have been thinking about that are not well known within the ML community. Which concepts would you most want to be more widely known / understood?


4 Answers

This is kind of vague, but I have this sense that almost everybody doing RL and related research takes the notion of "agent" for granted, as if it's some metaphysical primitive*, as opposed to being a (very) leaky abstraction that exists in the world models of humans. But I don't think the average alignment researcher has much better intuitions about agency, either, to be honest, even though some spend time thinking about things like embedded agency. It's hard to think meaningfully about the illusoriness of the Cartesian boundary when you still live 99% of your life and think 99% of your thoughts as if you were a Cartesian agent, fully "in control" of your choices, thoughts, and actions.

(*Not that "agent" couldn't, in fact, be a metaphysical primitive, just that such "agents" are hardly "agents" in the way most people consider humans to "be agents" [and, equally importantly, other things, like thermostats and quarks, to "not be agents"].)

Inner vs. outer alignment, mesa-optimizers

Saints vs. Schemers vs. Sycophants as different kinds of trained models / policies we might get. (I'm drawing from Ajeya's post here).

There are more academic-sounding terms for these concepts too. I forget where, but probably in Paul's posts about "the intended model" vs. "the instrumental policy" and the like.

Human values exist within human-scale models of the world.