Kajetan Dymkiewicz — LessWrong

61

1

2y

61

2y

Shaping the exploration of the motivation-space matters for AI safety

by Maxime Riché, Victor Gillioz, nielsrolf, Kajetan Dymkiewicz, Filip Sondej, RogerDearnaley, Daniel Tan, and dillonkn

Summary: We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction: the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of...