A common complaint about highly theoretical alignment research is that it doesn't seem compatible with deep learning. For example, Tammy's QACI proposal seems pretty clearly incompatible with deep learning. Early MIRI work is an even more extreme example.

Prosaic alignment seems more clearly compatible with deep learning, and so does work that rests on interpretability.

Ideally, we'd have an alignment solution that is both compatible with deep learning and at least as theoretically sound (in terms of alignment properties) as QACI (though I'm skeptical of QACI's soundness).

Sometimes clearly defining the constraints of a problem helps with solving it. So: what are the currently known (or probable) constraints that deep learning imposes on alignment solutions (and plans)?
