Thank you for the reference, which looks interesting. I think "incorporating
human preferences at the beginning of training" is at least better than doing it
after training. But it still seems to me that human preferences 1) cannot be
expressed as a set of rules and 2) cannot even be agreed upon by humans. As
humans, we do not consult a set of rules before we speak; rather, we have an
inherent understanding of the implications and consequences of what we do and say.
If I encourage someone to commit a terrible act, for example, I have brought
about more suffering in the world, albeit indirectly. Similarly, AI systems that
aim to be truly intelligent should have some understanding of the implications
of what they say and how it affects the overall "fitness function" of our
species. Of course, this is no simple matter, but it is where the
technology eventually has to go. If we could specify what the overall goal is
and express it to the AI system, it would know exactly what to say and when to
say it. We wouldn't have to manually babysit it with RLHF.
There has been some recent work in this direction that seems quite interesting: https://arxiv.org/abs/2302.08582