Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning
Reinforcement Learning (RL) has become the driving force behind modern artificial intelligence, as evidenced by reward models trained from human feedback and by powerful new systems like OpenAI's o3 and DeepSeek R1, which sit at the top of several capabilities benchmarks[1]. While these advances are exciting, they've exposed critical vulnerabilities in our...