Model Mis-specification and Inverse Reinforcement Learning

Overall I think this piece is great and does a nice job of intuitively explaining ways our attempts to model human values can fail. I notice a bit of friction when I read this part, though:

How do we choose between the theory that Bob values smoking and the theory that he does not (but smokes anyway because of the powerful addiction)? Humans choose between these theories based on our experience with addictive behaviours and our insights into people’s preferences and values. This kind of insight can’t easily be captured as formal assumptions about a model, or even as a criterion about counterfactual generalization. (The theory that Bob values smoking does make accurate predictions across a wide range of counterfactuals.) Because of this, learning human values from IRL has a more profound kind of model mis-specification than the examples in Jacob’s previous post. Even in the limit of data generated from an infinite series of random counterfactual scenarios, standard IRL algorithms would not infer someone’s true values.

I see this kind of thing often in people's thinking: they intuitively sense that a person can seem to value something and yet not really value it, because they don't endorse that value. I think this is a confused view, though. It comes from our phenomenology of values and preferences when we are also identified with (subject to) those values and preferences. Not liking what we see in ourselves, we create an ontology in which some values and preferences are not endorsed: we would prefer them to be otherwise, but find ourselves doing something we don't like anyway.

This sets up an interesting dialectic. On one hand we have the very real, felt experience of wanting to do one thing (say, not smoke), then doing the other (smoking anyway), and feeling as if the action (smoking) is not really what we want to do and is not "authentic" to our "true" or real self. On the other hand, there is the very real sense in which behavior gives us information about values and preferences, and that behavior suggests that despite what we say ("I don't want to smoke") we don't act on it. Partly we might attribute this to a lack of reflective equilibrium resulting in an irrational preference ordering, although I think that abstracts away most of the interesting human psychology that produces this result. Anyway, I point this out because I think there is a useful synthesis that gets us beyond these two conflicting approaches, which seem to get in the way of understanding human values: it's correct that we prefer, in this example, to smoke rather than not smoke, but it's also true that we believe we prefer to not smoke rather than smoke. This is only a problem insofar as our model assumes that our preferences match our beliefs.
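To make the gap between revealed preference and believed preference concrete, here is a minimal sketch, assuming a two-action Boltzmann-rational choice model (a common simplification in IRL, not necessarily the model the post has in mind). The counts, the `beta` value, and the function name `fit_reward_gap` are all hypothetical and chosen only for illustration.

```python
import math

def fit_reward_gap(n_smoke, n_abstain, beta=1.0):
    """Maximum-likelihood estimate of r_smoke - r_abstain.

    Assumes P(smoke) = sigmoid(beta * (r_smoke - r_abstain)), so the
    estimate is the logit of the observed smoking frequency over beta.
    """
    if n_smoke <= 0 or n_abstain <= 0:
        raise ValueError("need at least one observation of each action")
    p = n_smoke / (n_smoke + n_abstain)
    return math.log(p / (1.0 - p)) / beta

# Hypothetical observations of Bob: he smokes on 90 of 100 occasions.
gap = fit_reward_gap(n_smoke=90, n_abstain=10)

# What Bob says about himself: "I don't want to smoke."
stated_prefers_smoking = False

# What the behavioral model concludes.
revealed_prefers_smoking = gap > 0

# The model and Bob's self-report disagree, and collecting more
# behavioral data only makes the model more confident.
conflict = revealed_prefers_smoking != stated_prefers_smoking
print(f"inferred reward gap: {gap:.3f}, conflict with stated view: {conflict}")
```

With these numbers the inferred gap is ln(9), about 2.2 in favor of smoking. Nothing in the fit uses Bob's self-report, which is the sense in which the model treats behavior and preference as the same thing.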

Now of course our beliefs can change our preferences, but that sounds a bit confusing if we talk only in terms of beliefs and preferences, because preferences would seem to be a special kind of belief relating to an ordering over actions. I think this shows that beliefs and preferences are a leaky abstraction. To resolve this we have to look a bit deeper, probably in the direction of Friston.

[-]rmoehn7y30

Humans can be assigned any values whatsoever… is a great basis for understanding the last section of this article.

[-]Charlie Steiner7y10

So, to sum up (?):

We want the AI to take the "right" action. In the IRL framework, we think of getting there by a series of ~4 steps - (observations of human behavior) -> (inferred human decision in model) -> (inferred human values) -> (right action).

Going from step 1 to 2 is hard, and ditto with 2 to 3, and we'll probably learn new reasons why 3 to 4 is hard when we try to do it more realistically. You mostly use model mis-specification to illustrate this: because very different models of step 2 can predict similar observations at step 1, that inference is hard in a certain way. Because very different models of step 3 can predict similar decisions at step 2, that inference is also hard.
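The step 2 to step 3 difficulty can be sketched with a toy example: two very different value hypotheses that predict exactly the same decisions. This is a minimal illustration, assuming a softmax (Boltzmann) planner over three actions; the reward numbers and the helper name `boltzmann_policy` are hypothetical, and the "anti-rational" pairing is a swapped-in construction in the spirit of the "Humans can be assigned any values whatsoever" result, not something taken from the post.

```python
import math

def boltzmann_policy(rewards, beta):
    """Action probabilities proportional to exp(beta * reward)."""
    scaled = [beta * r for r in rewards]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothesis A: a (noisily) rational agent with these rewards.
rewards_a = [1.0, 0.0, -2.0]
policy_a = boltzmann_policy(rewards_a, beta=1.0)

# Hypothesis B: an anti-rational agent (negative beta) whose rewards
# are the exact opposite of hypothesis A.
rewards_b = [-r for r in rewards_a]
policy_b = boltzmann_policy(rewards_b, beta=-1.0)

# Hypothesis C: the same rational agent with every reward shifted
# by a constant.
rewards_c = [r + 5.0 for r in rewards_a]
policy_c = boltzmann_policy(rewards_c, beta=1.0)

same_ab = all(abs(a - b) < 1e-12 for a, b in zip(policy_a, policy_b))
same_ac = all(abs(a - c) < 1e-12 for a, c in zip(policy_a, policy_c))
print(same_ab, same_ac)
```

All three hypotheses produce the same action distribution, so no amount of behavioral data generated by this policy can separate them. Choosing between them requires an assumption about the planner that the data itself does not supply.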
