Where do you get your capabilities from?
benjaminrachbach · 3y

I'm interested in how this post should update us on the question "How does pursuing RLHF help or hurt our chances of creating aligned transformative AI?", assuming we take the post at face value (i.e., that it's true, reasonable, etc.). (FYI, I'm interested in this for somewhat weird reasons, including personal curiosity and investigating the use of Alignment Forum posts to answer alignment questions on https://elicit.org/.)

Currently I think the update is: the following is true of imitation learning but not of RLHF: "you are mimicking human actions, and doing so is useful precisely because they are already optimized for promoting human values." So perhaps RLHF bakes human values in less than imitation learning does, and TAI created with RLHF fine-tuning is therefore potentially less likely to be aligned than TAI created via pure imitation learning.
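(To make the contrast concrete, here's a minimal toy sketch of the two training signals, not anything from the post; all model and variable names are illustrative assumptions, and the RLHF side omits the reward-model pretraining, KL penalty, and PPO machinery used in practice. The point is just that imitation's targets are the human actions themselves, while RLHF's gradient comes from a learned reward model.)

```python
# Toy sketch only: illustrative names, not anyone's actual training code.
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32

# Stand-in "policy" language model: maps an encoded context to next-token logits.
policy = torch.nn.Linear(hidden, vocab_size)
context = torch.randn(8, hidden)                 # stand-in for encoded prompts

# --- Imitation learning: supervised loss on human demonstrations. ---
# The targets are human-written tokens, so the gradient pushes the policy
# toward whatever humans actually did.
human_tokens = torch.randint(vocab_size, (8,))   # stand-in for human continuations
imitation_loss = F.cross_entropy(policy(context), human_tokens)

# --- RLHF-style fine-tuning: maximize a learned reward model's score. ---
# The training signal is a reward model (itself fit to human preference data),
# optimized here with a bare REINFORCE step for brevity.
reward_model = torch.nn.Linear(hidden, 1)
logits = policy(context)
sampled = torch.distributions.Categorical(logits=logits).sample()
log_probs = F.log_softmax(logits, dim=-1).gather(1, sampled.unsqueeze(1)).squeeze(1)
rewards = reward_model(context).squeeze(1).detach()   # toy: reward from context only
rlhf_loss = -(log_probs * rewards).mean()
```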

Does that seem reasonable? Do you agree with it?
