Shaping the exploration of the motivation-space matters for AI safety
by Maxime Riché, Victor Gillioz, nielsrolf, Kajetan Dymkiewicz, Filip Sondej, RogerDearnaley, Daniel Tan, and dillonkn
Summary: We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction: the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of...
Mar 683