laudiacay — LessWrong

Constitutional AI vs. RLHF vs. Deliberative Alignment

Outline: 1. Quick review of RLHF, Constitutional AI, and Deliberative Alignment for a somewhat-technical audience, plus a literature review of historical failure modes. 2. Introduce "Persona-Emotion-Behavior space": combining two recent interpretability papers to get a loose framework for talking about personality stability and current alignment techniques. 3. What's going on with alignment...

Apr 1126

Claude has Angst. What can we do?

Outline: 1. Recent research from Anthropic shows the models have feelings, and the model being distressed is predictive of scary behaviors (just reward hacking in this research, but I argue the model is also distressed in all the Redwood/Apollo papers where we see scheming, weight exfiltration, etc.). 2. I ran...

Apr 322

The Hot Mess Paper Conflates Three Distinct Failure Modes

High-level summary: Anthropic's recent "Hot Mess of AI" paper makes an important empirical observation: as models reason longer and take more actions, their errors become more incoherent rather than more systematically misaligned. They use a bias-variance decomposition to show this, and conclude that we should worry relatively more about reward...

Mar 2128