This post is a write-up of preliminary research investigating whether we could intervene upon goals in a toy RL agent. Whilst we were unsuccessful in locating and retargeting a goal-directed reasoning capability, we found evidence of partially-retargetable goal-specific reflexes.
Produced as part of the ML Alignment & Theory Scholars Program 7.0 cohort.
1 - Introduction
Inspired by “retargeting the search” (Wentworth, 2022), we investigated a toy microcosm of the problem of retargeting the search of an advanced agent. Specifically, we investigated (1) whether or not we could locate information pertaining to “goals” in a small RL agent operating in a toy, open-ended environment, and (2) whether we could intervene on this agent to cause it...