Jeremias Ferrao — LessWrong

Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning

Reinforcement Learning (RL) has become the driving force behind modern artificial intelligence, as evidenced by reward models trained from human feedback and by powerful new systems like OpenAI's o3 and DeepSeek R1, which top several capabilities benchmarks[1]. While these advances are exciting, they've exposed critical vulnerabilities in our...

Apr 18, 2025•10