Summary: We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction — the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of...
Epistemic status: untested but seems plausible. TL;DR: making honesty the best policy during RL reasoning training. Reward hacking during Reinforcement Learning (RL) reasoning training[1] in insecure or hackably-judged training environments not only allows the model to cheat on tasks rather than learning to solve them, but also teaches the model to...
Epistemic status: the other thing that keeps me up at night. TL;DR: Even if we solve Alignment, we could well still lose everything. There’s an AI-related existential risk I don’t see discussed much on LessWrong. In fact, it’s so little discussed that it doesn’t even have a good name yet,...
Epistemic status: I just thought this up. There is a well-known style of reasoning called the anthropic argument (which has nothing to do with the AI frontier lab of the same name). It goes something like this: > Scientist 1: “X seems really unlikely! How come I’m observing it?” >...
Alignment Pretraining Shows Promise. TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and this effect persists through post-training. The major labs appear to be taking notice. This is now the third paper on this idea, and excitement seems to be...
Epistemic status: I've been thinking about this topic for over 15 years, which led me to some counterintuitive conclusions, and I'm now writing up my thoughts concisely. [If you disagree, I'd find it very useful to know which step you think fails: even a short comment or crux is helpful.]...
This is a link-post for a new paper I read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al. For a couple of years, I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety...