Joschka Braun

A Conceptual Framework for Exploration Hacking

What happens when a model strategically alters its exploration to resist RL training? In this post, we share our conceptual framework for this threat model, expanding on our research note from last summer. We formalize and decompose "exploration hacking", ahead of an upcoming paper where we study it empirically—by creating...

Feb 12 · 16

Exploration hacking: can reasoning models subvert RL?

This is an early-stage research update as part of the ML Alignment & Theory Scholars Program (Summer 2025 Cohort). We welcome any feedback!
TL;DR
* If a misaligned AI realizes it's undergoing RL training that conflicts with its goals, it might be able to sabotage the training by selectively under-exploring....

Jul 30, 2025 · 17

A Sober Look at Steering Vectors for LLMs

We thank Madeline Brumley, Joe Kwon, David Chanin and Itamar Pres for their helpful feedback.
Introduction
Controlling LLM behavior by directly intervening on internal activations is an appealing idea. Various methods for controlling LLM behavior through activation steering have been proposed. Most steering methods add a 'steering vector' (SV) to...

Nov 23, 2024 · 40

LESSWRONG