We're launching the AE Alignment Podcast, a new series from AE Studio's alignment research team where we talk with researchers about their work on AI safety and alignment.
In our first episode, host James Bowler sits down with Alex McKenzie to discuss Endogenous Steering Resistance (ESR), a phenomenon where large language models spontaneously resist activation steering during inference, sometimes recovering mid-generation to produce improved responses even while steering remains active.
When you artificially perturb a language model's internal activations using sparse autoencoder (SAE) latents to push it off-topic, you'd expect the model to just go along with it. Smaller models do, but Llama-3.3-70B does something unexpected: it sometimes catches itself mid-generation, says something like "Wait, that's not right," and course-corrects back to the original task.
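To make the setup concrete, here's a minimal sketch of what SAE-based activation steering looks like in practice, assuming you already have a decoder direction for an off-topic SAE latent. The model, layer index, steering strength, and the random placeholder direction are all illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of activation steering via a PyTorch forward hook.
# Assumptions: any HF causal LM works; `sae_direction` stands in for a
# real SAE decoder column; layer/strength values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.3-70B-Instruct"  # in practice, a smaller model is cheaper to test on
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

layer_idx = 40           # assumption: a mid-depth layer
steering_strength = 8.0  # assumption: scale large enough to push the model off-topic
# assumption: unit-norm decoder direction of the chosen SAE latent
sae_direction = torch.randn(model.config.hidden_size)
sae_direction = sae_direction / sae_direction.norm()

def steer(module, inputs, output):
    # Decoder layers return a tuple whose first element is the residual stream.
    hidden = output[0]
    hidden = hidden + steering_strength * sae_direction.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(steer)
try:
    prompt = "Explain how photosynthesis works."
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unsteered
```

With the hook active for the whole generation, any mid-generation recovery in the output is the model correcting itself, not the steering being switched off.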
The paper identifies this self-correction behavior as Endogenous Steering Resistance (ESR).
The RSS feed is here: https://anchor.fm/s/1109c6f74/podcast/rss
If there are other platforms where it would be helpful for us to publish the podcast, let us know!