Produced as the capstone project for the AI Safety Fundamentals course (Oct 2024 - Jan 2025)
Overview
Anthropic's paper "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" demonstrated that it is possible to create a misaligned model that is resilient to our current best safety practices (RLHF, SFT, adversarial training, etc.): specifically, a model that exhibits "bad" behavior (writing intentionally buggy code) when the prompt contains a particular trigger word, and typical helpful, honest, and harmless behavior otherwise.
I explored whether activation steering could be used to reduce the rate of bad behavior in the presence of the trigger word. My preliminary results show that application of steering vectors is not...
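For readers unfamiliar with the technique: activation steering adds a fixed direction vector to a model's residual-stream activations at inference time in order to shift its behavior. The sketch below is illustrative rather than a reproduction of my setup. It uses GPT-2 as a stand-in (the sleeper-agent models are not public), and the layer index, steering strength, and random placeholder vector are all assumptions; in practice the vector would be derived from something like contrastive activation pairs.

```python
# Minimal sketch of activation steering via a PyTorch forward hook.
# GPT-2 is a stand-in model; layer_idx, scale, and the random
# steering_vector are illustrative placeholders, not my actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

layer_idx = 6   # which transformer block to steer (assumption)
scale = 4.0     # steering strength (assumption)

# Stand-in vector; a real one would come from contrastive activations.
steering_vector = torch.randn(model.config.hidden_size)
steering_vector /= steering_vector.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the
    # residual-stream activation of shape (batch, seq, hidden).
    hidden_states = output[0] + scale * steering_vector
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
try:
    ids = tokenizer("Write a function that parses a date.", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```

Because the hook adds the vector at every forward pass, it steers each generated token; removing the hook in the `finally` block guarantees the model is left unmodified even if generation fails.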