Reward splintering as reverse of interpretability

by Stuart_Armstrong
31st Aug 2021
AI Alignment Forum

There is a sense in which reward splintering is the reverse of interpretability.

Interpretability is basically:

  • "This algorithm is doing something really complicated; nevertheless, I want a simple model that explains essentially what it is doing. If there is no obvious simple model, I want an explanatory model to be taught to me with the least amount of complexity, distortion, or manipulation."

Reward splintering is:

  • "Here is my simple model of what the algorithm should be doing. I want the algorithm to essentially do that, even if its underlying behaviour is really complicated. If it must deviate from this simple model, I want it to deviate in a way that has the least amount of complexity, distortion, or manipulation."