Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Paul Christiano makes an amazing tree of subproblems in alignment. I saw this talk live and enjoyed it.

Nice find!

I forget what got me thinking about this recently, but seeing this branching tree reminded me of something important: when something with many contributions goes well, it's probably because all of the contributions were a bit above average, not because one of them was wildly off the scale. For example, if I learn that the sum of two independent, identically distributed normal variables is above average, I expect the excess to be divided evenly between the two components, in units of standard deviations.
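
A quick Monte Carlo sketch of that intuition (purely illustrative; the 2-standard-deviation threshold and sample size are arbitrary choices, not anything from the post): condition the sum of two independent standard normals on being well above average and look at how the excess is split.

```python
# Illustrative check: when the sum of two i.i.d. standard normals is
# conditioned on being well above its mean, the excess is (in expectation)
# split evenly between the two components rather than carried by one of them.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
s = x + y                      # the sum has standard deviation sqrt(2)

threshold = 2 * np.sqrt(2)     # "well above average": 2 sd of the sum
mask = s > threshold

print("E[x | sum above threshold]:", x[mask].mean())
print("E[y | sum above threshold]:", y[mask].mean())
# Both conditional means come out around 1.7 (in units of each component's
# standard deviation), i.e. the excess is shared roughly equally; neither
# component ends up far more extreme than the other.
```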

Which is not to say Paul implies anything to the contrary. I was just reminded.

Thought 2: in this way of presenting it, I didn't really see a difference between inner and outer alignment. If you try to teach an AI something and the concept it learns isn't the concept you wanted to teach, that isn't clearly an inner failure rather than an outer one. I'd thought "inner alignment" was normally used in the context of an "inner optimizer" that we might expect to get by applying lots of optimization pressure to a black box.

Hm. You could reframe the "inner alignment" stuff without reference to an inner optimizer by talking about all methods that would work on a heavily optimized black box, but then I think the category becomes broad enough to include Paul's work. But maybe this is for the best? "Transparent box" alignment could still cover important problems in designing alignment schemes where the agent is treated as having separable world-models and planning faculties, though given the leakiness of the "agent" abstraction, any solution will require "black box" alignment as well.