tuphs — LessWrong

Can We Change the Goals of a Toy RL Agent?

This post is a write-up of preliminary research in which I investigated whether we can intervene on the goals of a toy RL agent. Although I was unable to locate and retarget a goal-directed reasoning capability, I found evidence of partially retargetable, goal-specific reflexes. Produced as part of the ML Alignment &...

Jun 15, 2025 • 20