Summary: We argue that shaping RL exploration, and especially exploration of the motivation space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction: the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of...
Summary: We’d like to share our ongoing work on improving LLM unlearning. [arXiv] [github] There are myriad approaches to unlearning, so over the past 8 months we conducted hundreds of small-scale experiments, comparing many loss functions, variants of meta-learning, various neuron or weight ablations, representation engineering, and many exotic...
Imagine a scenario where you can switch bodies with someone: both of you are anesthetized, then undergo plastic surgery to match each other’s appearance. Your brains are rewired so that, in effect, you switch minds, and finally you swap houses, names, bank accounts, and all social and legal identities. Let's say...
Summary: We tried to figure out how a model's beliefs change during a chain-of-thought (CoT) when solving a logical problem. Measuring this could reveal which parts of the CoT actually causally influence the final answer and which are just fake reasoning manufactured to sound plausible. (Note that prevention of such...
This is a follow-up to my previous post about hidden serial reasoning (but you don't need to have read it). There, I focused on serial reasoning happening internally in the model. Jozdien pointed out that a model could also use its chain-of-thought for hidden serial reasoning, by using steganography. So...
Summary
* Recurrence enables hidden serial reasoning.
* Not every kind of recurrence, though: connections between channels are needed. Notably, the Mamba architecture isn't capable of hidden reasoning.
* Non-linearity isn’t needed for hidden reasoning.
* It’s hard for transformers to learn to use all the layers for serial computation. For example...
TL;DR: GPT-3 and GPT-4 understand the concept of keeping a password secret and can simulate (or write a story about) characters keeping the password. However, this is highly contingent on the prompt (including the characters' names or previously asked questions). The prompt may contain subtle cues regarding what kind of characters...