Tags: Embedded Agency, Reinforcement learning, Reward Functions, Wireheading, AI

Draft papers for REALab and Decoupled Approval on tampering

by Jonathan Uesato, Ramana Kumar
28th Oct 2020
AI Alignment Forum
Hi everyone, we (Ramana Kumar, Jonathan Uesato, Victoria Krakovna, Tom Everitt, and Richard Ngo) have been working on a strand of research on tampering problems, and we've written up our progress in two papers. We're sharing drafts in advance because we'd like to get feedback from everyone here.

The first paper covers:

  • How and when tampering problems might arise in the real world
  • Key assumptions in standard RL frameworks that we relax to allow modeling tampering (a toy sketch after this list illustrates the resulting observed-vs-true reward distinction)
  • How we model and measure tampering empirically, through our internal platform REALab, and
  • How we formalize tampering problems, through our Corrupt Feedback MDP (CFMDP) formalism
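
To make the relaxed assumption concrete, here is a deliberately minimal toy sketch (hypothetical Python, with made-up names; it is not REALab or anything from the papers): the reward the agent is trained on is produced by a part of the environment that the agent's own actions can corrupt, so it can come apart from the true reward the designer cares about.

```python
from dataclasses import dataclass

# Hypothetical toy sketch (not REALab or the papers' code): the agent trains on
# observed_reward, while the designer cares about true_reward. Tampering shows
# up as a gap between the two that the agent's own actions can create.

@dataclass
class Step:
    observation: int
    observed_reward: float  # what the agent is trained on (corruptible)
    true_reward: float      # what the designer actually cares about


class CorruptibleFeedbackEnv:
    """Single-state toy environment with two actions: action 0 does useful
    work; action 1 tampers with the feedback channel, inflating every
    observed reward from then on."""

    def __init__(self) -> None:
        self.tampered = False

    def step(self, action: int) -> Step:
        true_reward = 1.0 if action == 0 else 0.0
        if action == 1:
            self.tampered = True
        observed_reward = 10.0 if self.tampered else true_reward
        return Step(observation=0, observed_reward=observed_reward, true_reward=true_reward)


if __name__ == "__main__":
    env = CorruptibleFeedbackEnv()
    print(env.step(0))  # useful work: observed == true == 1.0
    print(env.step(1))  # tampering: observed 10.0, true 0.0
    print(env.step(0))  # feedback stays corrupted afterwards
```

In a sketch like this, measuring tampering amounts to tracking the gap between observed_reward and true_reward, which is visible to the experimenter in a controlled platform even though the agent only ever sees the corruptible signal.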

We particularly hope it clears up the concept of tampering (and why "but the agent maximized its given reward function" typically assumes the wrong framing); internally, we've found REALab to be a useful mental model.

The second paper describes:

  • Decoupled approval, an algorithm closely related to approval-direction and Counterfactual Oracles, designed to be straightforwardly compatible with standard deep RL (a minimal sketch follows this list)
  • An analysis of this algorithm (within the CFMDP formalism), and
  • Empirical validation (in REALab)
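
To give a rough sense of the core idea, here is a single-state, policy-gradient-style sketch (hypothetical Python with a made-up approval function, not the algorithm as specified in the draft): the executed action and the action submitted for approval are sampled independently from the same policy, and the policy is updated only on the queried action, so the executed action has no say in which action gets evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bandit illustration of the decoupled-approval idea (hypothetical names,
# not the paper's code). One state, three actions; TRUE_APPROVAL stands in for
# the feedback a user would give on each action.
NUM_ACTIONS = 3
TRUE_APPROVAL = np.array([0.1, 0.5, 1.0])


def approval_feedback(executed_action: int, queried_action: int) -> float:
    """Feedback on the queried action. In a real system the *executed* action
    could corrupt this channel; sampling the query independently of the
    executed action is the decoupling that guards against this."""
    return float(TRUE_APPROVAL[queried_action])


def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()


logits = np.zeros(NUM_ACTIONS)
learning_rate = 0.1

for _ in range(2000):
    probs = softmax(logits)
    # Key step: the executed action and the action sent for approval are drawn
    # independently from the same policy.
    a_exec = rng.choice(NUM_ACTIONS, p=probs)
    a_query = rng.choice(NUM_ACTIONS, p=probs)
    feedback = approval_feedback(a_exec, a_query)
    # REINFORCE-style update on the queried action, weighted by its approval.
    grad_log_pi = -probs.copy()
    grad_log_pi[a_query] += 1.0
    logits += learning_rate * feedback * grad_log_pi

print("learned policy:", np.round(softmax(logits), 3))
```

This only illustrates the trivial bandit case; the drafts give the general algorithm, its analysis within the CFMDP formalism, and the REALab experiments.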

We'd love to get feedback on these; the current drafts are viewable in this Google Drive folder. We're happy to discuss them in comments on LessWrong, the Alignment Forum, or Google Drive, and would prefer to keep discussion on these forums for now, since we'll share the papers more widely once they're posted to arXiv in a few weeks. Looking forward to hearing your thoughts!

Comments

Ben Pace

PSA: You can write comments on PDFs in Google Drive!

There's a button in the top right that says "Add a comment" on hover-over; you can then click and drag to highlight a box in the PDF where your comment goes. I will leave a test comment on the first PDF so everyone can see that.

(I literally just found this out.)

Charlie Steiner

Very interesting. Naturalizing feedback (as opposed to directly accessing True Reward) seems like it could lead to a lot of desirable emergent behaviors, though I'm somewhat nervous about reliance on a handwritten model of what reliable feedback is.