Constitutional AI vs. RLHF vs. Deliberative Alignment
Outline:
1. Quick review of RLHF, Constitutional AI, and Deliberative Alignment for a somewhat-technical audience, plus a literature review of historical failure modes.
2. Introduce "Persona-Emotion-Behavior space": combining two recent interpretability papers to get a loose framework for talking about personality stability and current alignment techniques.
3. What's going on with alignment...