We argue that shaping RL exploration, and especially exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction: the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of inoculation prompting against natural emergent misalignment and its relation to shaping the model's self-perception, and the proposal to give models affordances to report reward-hackable tasks. Still, we don't think enough attention has been given to shaping exploration specifically.
When we train models with RL, two kinds of exploration happen simultaneously: exploration of the solution space, i.e. which behaviors and strategies the model tries in order to earn reward, and exploration of the motivation-space, i.e. which motivations and self-narratives the model attaches to that behavior.
Both kinds of exploration occur during a critically sensitive and formative phase...
I don't see a reason not to try, though I wonder whether this leads back to the underlying problem of reward hacking, i.e. can we actually design a reward that couldn't be exploited by, say, a trajectory that only looks benevolent?