Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Paul Christiano makes an amazing tree of subproblems in alignment. I saw this talk live and enjoyed it.

Nice find!

I forget what got me thinking about this recently, but seeing this branching tree reminded me of something important: when something with many contributions goes well, it's probably because all of the contributions were a bit above average, not because one of them was wildly off the scale. For example, if I learn that the sum of two independent, identically distributed normal variables is above average, I expect the excess to be divided evenly between the two components, in units of standard deviations.
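
A quick Monte Carlo sketch of that intuition (purely illustrative; the 2-standard-deviation threshold and sample size are arbitrary choices, not anything from the post): condition the sum of two independent standard normals on being well above average and look at how the excess is split.

```python
# Illustrative check: when the sum of two i.i.d. standard normals is
# conditioned on being well above its mean, the excess is (in expectation)
# split evenly between the two components rather than carried by one of them.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
s = x + y                      # the sum has standard deviation sqrt(2)

threshold = 2 * np.sqrt(2)     # "well above average": 2 sd of the sum
mask = s > threshold

print("E[x | sum above threshold]:", x[mask].mean())
print("E[y | sum above threshold]:", y[mask].mean())
# Both conditional means come out around 1.7 (in units of each component's
# standard deviation), i.e. the excess is shared roughly equally; neither
# component ends up far more extreme than the other.
```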

Which is not to say Paul implies anything to the contrary. I was just reminded.

Thought 2: in this way of presenting it, I didn't really see a difference between inner and outer alignment. If you try to teach an AI something and the concept it learns isn't the concept you wanted to teach, that isn't clearly an inner failure rather than an outer one. I'd thought "inner alignment" was normally used in the context of an "inner optimizer" that we might expect to get by applying lots of optimization pressure to a black box.

Hm. You could reframe the "inner alignment" stuff without reference to an inner optimizer by talking about all methods that would work on a heavily optimized black box, but then I think the category becomes broad enough to include Paul's work. But maybe this is for the best? "Transparent box" alignment could still cover important problems in designing alignment schemes where the agent is treated as having separable world-models and planning faculties, though given the leakiness of the "agent" abstraction, any solution will require "black box" alignment as well.