All of Ansh Radhakrishnan's Comments + Replies


Thanks for the feedback and corrections! You're right, I was definitely confusing IRL, which is one approach to value learning, with the value learning project as a whole. I think you're also right that most of the "Outer alignment concerns" section doesn't really apply to RLHF as it's currently written, or at least it's not immediately clear how it does. Here's another attempt:

RLHF attempts to infer a reward function from human comparisons of task completions. But it's possible that a reward function learned from these stated preferences might not be the ...
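To make "infer a reward function from human comparisons" concrete, here is a minimal sketch that fits one scalar reward per completion using a Bradley-Terry model, a common choice for pairwise preference data. It is an illustration under stated assumptions, not the method of any particular RLHF system: the function name `fit_rewards`, the `(winner, loser)` data format, and the toy comparisons are all hypothetical, and real reward models are neural networks over text rather than a lookup table.

```python
import math

def fit_rewards(comparisons, n_items, lr=0.1, steps=2000):
    """Fit one scalar reward per completion by gradient ascent on the
    Bradley-Terry log-likelihood.

    comparisons: list of (winner, loser) index pairs (hypothetical format).
    """
    r = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for w, l in comparisons:
            # Modelled probability that the labeller prefers w over l.
            p = 1.0 / (1.0 + math.exp(-(r[w] - r[l])))
            # Gradient of log(p) with respect to r[w] and r[l].
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        for i in range(n_items):
            r[i] += lr * grad[i] / len(comparisons)
    return r

# Toy, made-up labels: completion 0 usually beats 1, and 1 usually beats 2.
comparisons = [(0, 1), (0, 1), (1, 0),
               (1, 2), (1, 2), (2, 1),
               (0, 2), (0, 2), (0, 2)]
rewards = fit_rewards(comparisons, n_items=3)
```

The fitted rewards only summarize what the labellers said they preferred, which is exactly why a gap between stated preferences and the intended objective would carry over into the learned reward.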

1 · Sam Marks · 3mo
(I should clarify that I'm not an expert. In fact, you might even call me "an amateur who's just learning about this stuff myself"! That said...)

I believe that RLHF more broadly refers to learning reward models via supervised learning, not just the special case where the labelled data is pairwise comparisons of task completions. So, for example, I think that RLHF would include learning a reward model for text summaries based on scalar 1-10 feedback from humans, rather than just pairwise comparisons of summaries.

On the topic of whether human biases present an issue for RLHF, I think it might be somewhat subtle. To tease apart a few different concerns you might have:

1. What if human preferences aren't representable by a utility function (e.g. because they're intransitive)? This doesn't seem like an essential obstruction to RLHF, since whatever sort of data type human preferences are (e.g. mappings from triples (history of the world state, option 1, option 2) to {0,1}), I would expect them to still be learnable in principle via supervised learning. Of course, the more unconstrained our assumptions on human preferences, the harder it is to learn them (utility functions are easier to learn than mappings like the above), so we might run into practical issues. But I guess I don't strongly expect that to happen -- I feel like human preferences shouldn't be so unconstrained as to sink RLHF.

2. What if RLHF can never learn our true values because the feedback we give it is biased? In this case, I would expect RLHF to learn "biased human values" which ... I guess I'm okay with? Like if we get an AI which is aligned with human values as revealed by stated preferences instead of the reflective equilibrium of human values that we get after correcting for our biases, I still expect that to keep us safe and buy us time to figure out our true values/build an aligned AI that can figure out our values for us. So if t