The Substrate Problem: Why Language Can’t Correct Language
The basic problem I see with RLHF (reinforcement learning from human feedback) is that it modifies the same layer that later has to interpret the rules. A model is trained on ambiguous human preferences, and that training shapes how it interprets text. Later, the model has to interpret safety constraints, evaluation criteria, user intent, and its...