Deceptive Alignment
• Applied to Alignment is Hard: An Uncomputable Alignment Problem by Alexander Bistagne 9d ago
• Applied to Predictable Defect-Cooperate? by quetzal_rainbow 10d ago
• Applied to New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" by elifland 13d ago
• Applied to Paul Christiano on Dwarkesh Podcast by ESRogs 25d ago
• Applied to AI Alignment: A Comprehensive Survey by Stephen McAleer 1mo ago
• Applied to Thoughts On (Solving) Deep Deception by Jozdien 1mo ago
• Applied to AI Deception: A Survey of Examples, Risks, and Potential Solutions by Magdalena Wache 2mo ago
• Applied to How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions by Zeyu Qin 2mo ago
• Applied to High-level interpretability: detecting an AI's objectives by Paul Colognese 2mo ago
• Applied to Understanding strategic deception and deceptive alignment by Marius Hobbhahn 2mo ago
• Applied to Paper: On measuring situational awareness in LLMs by Owain_Evans 3mo ago
• Applied to Mesa-Optimization: Explain it like I'm 10 Edition by brook 3mo ago
• Applied to Against Almost Every Theory of Impact of Interpretability by Charbel-Raphaël 3mo ago
• Applied to Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research by Ethan Perez 4mo ago
• Applied to Apollo Research is hiring evals and interpretability engineers & scientists by Marius Hobbhahn 4mo ago
• Applied to 3 levels of threat obfuscation by RobertM 4mo ago
• Applied to When can we trust model evaluations? by Marius Hobbhahn 4mo ago
• Applied to Autonomous Alignment Oversight Framework (AAOF) by Justausername 4mo ago