We're launching the AE Alignment Podcast, a new series from AE Studio's alignment research team where we talk with researchers about their work on AI safety and alignment.
In our first episode, host James Bowler sits down with Alex McKenzie to discuss Endogenous Steering Resistance (ESR), a phenomenon where large language models spontaneously resist activation steering during inference, sometimes recovering mid-generation to produce improved responses even while steering remains active.
When you artificially perturb a language model's internal activations using sparse autoencoder (SAE) latents to push it off-topic, you'd expect the model to just go along with it. Smaller models do, but Llama-3.3-70B does something unexpected: it sometimes catches itself mid-generation, says something like "Wait, that's not right," and course-corrects back to the original task.
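To make the setup concrete, here's a minimal sketch of what SAE-based activation steering looks like in practice, assuming you already have a decoder direction for an off-topic SAE latent. The model, layer index, steering strength, and the random placeholder direction are all illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of activation steering via a PyTorch forward hook.
# Assumptions: any HF causal LM works; `sae_direction` stands in for a
# real SAE decoder column; layer/strength values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.3-70B-Instruct"  # in practice, a smaller model is cheaper to test on
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

layer_idx = 40           # assumption: a mid-depth layer
steering_strength = 8.0  # assumption: scale large enough to push the model off-topic
# assumption: unit-norm decoder direction of the chosen SAE latent
sae_direction = torch.randn(model.config.hidden_size)
sae_direction = sae_direction / sae_direction.norm()

def steer(module, inputs, output):
    # Decoder layers return a tuple whose first element is the residual stream.
    hidden = output[0]
    hidden = hidden + steering_strength * sae_direction.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(steer)
try:
    prompt = "Explain how photosynthesis works."
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unsteered
```

With the hook active for the whole generation, any mid-generation recovery in the output is the model correcting itself, not the steering being switched off.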
The paper identifies this self-correction behavior as Endogenous Steering Resistance (ESR).
The RSS feed is here: https://anchor.fm/s/1109c6f74/podcast/rss
If there are other platforms where it would be helpful for us to publish the podcast, let us know!