Where do you get your capabilities from?
benjaminrachbach · 3y

I'm interested in how this post should update us on the question "How does pursuing RLHF help or hurt our chances of creating aligned transformative AI?", assuming we take the post at face value (i.e., that it's true, reasonable, etc.). (FYI, I'm interested in this for somewhat weird reasons, including personal curiosity and investigating the use of Alignment Forum posts to answer alignment questions on https://elicit.org/.)

Currently I think the update is: the following is true of imitation learning but not of RLHF: "you are mimicking human actions, and doing so is useful precisely because they are already optimized for promoting human values." So perhaps RLHF bakes human values in less than imitation learning does, and TAI created with RLHF fine-tuning is therefore potentially less likely to be aligned than TAI created via pure imitation learning.
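(To make the contrast concrete, here's a minimal toy sketch of the two training signals, not anything from the post; all model and variable names are illustrative assumptions, and the RLHF side omits the reward-model pretraining, KL penalty, and PPO machinery used in practice. The point is just that imitation's targets are the human actions themselves, while RLHF's gradient comes from a learned reward model.)

```python
# Toy sketch only: illustrative names, not anyone's actual training code.
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32

# Stand-in "policy" language model: maps an encoded context to next-token logits.
policy = torch.nn.Linear(hidden, vocab_size)
context = torch.randn(8, hidden)                 # stand-in for encoded prompts

# --- Imitation learning: supervised loss on human demonstrations. ---
# The targets are human-written tokens, so the gradient pushes the policy
# toward whatever humans actually did.
human_tokens = torch.randint(vocab_size, (8,))   # stand-in for human continuations
imitation_loss = F.cross_entropy(policy(context), human_tokens)

# --- RLHF-style fine-tuning: maximize a learned reward model's score. ---
# The training signal is a reward model (itself fit to human preference data),
# optimized here with a bare REINFORCE step for brevity.
reward_model = torch.nn.Linear(hidden, 1)
logits = policy(context)
sampled = torch.distributions.Categorical(logits=logits).sample()
log_probs = F.log_softmax(logits, dim=-1).gather(1, sampled.unsqueeze(1)).squeeze(1)
rewards = reward_model(context).squeeze(1).detach()   # toy: reward from context only
rlhf_loss = -(log_probs * rewards).mean()
```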

Does that seem reasonable? Do you agree with it?
