We argue that shaping RL exploration, and especially exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction: the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of inoculation prompting against natural emergent misalignment and its relation to shaping the model's self-perception, and the proposal to give models affordances to report reward-hackable tasks. Still, we don't think enough attention has been given to shaping exploration specifically.
When we train models with RL, two kinds of exploration happen simultaneously: exploration of the solution space, i.e. which behaviors and strategies the model tries in order to earn reward, and exploration of the motivation-space, i.e. which motivations and self-narratives the model attaches to that behavior.
Both kinds of exploration occur during a critically sensitive and formative phase...
I don't see a reason not to try, though I wonder whether this leads back to the underlying problem of reward hacking, i.e. can we actually design a reward that couldn't be exploited by, say, a trajectory that only looks benevolent?