[ Question ]

Can coherent extrapolated volition be estimated with Inverse Reinforcement Learning?

by Jade Bishop, 15th Apr 2019


Given the following conditions, is it possible to approximate the coherent extrapolated volition (CEV) of humanity to a "good enough" level?

  • Some form of reward/cost function estimation is used, such as inverse reinforcement learning (IRL) or inverse optimal control (IOC). The details of the specific IRL/IOC algorithm are not important, only the fact that the reward/cost function is estimated. For the unfamiliar, IRL is essentially the inverse of traditional reinforcement learning: given a set of observations and actions, it tries to determine the reward, utility, or value function (I use these terms interchangeably here) of the agent(s) that generated the set.
  • An agent is able to observe another (presumably human) agent's behaviour and update its estimate of the reward function based on that, without having direct access to the observed agent's sensory input. This is loosely analogous to what mirror neurons are thought to do. Technologically speaking, this is probably the most difficult part to achieve, but it is not too important for the purpose of this question.
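To make the first condition concrete, here is a minimal sketch of reward estimation on a toy problem. It is not any specific published IRL algorithm. It assumes a Boltzmann-rational chooser (each option is picked with probability proportional to exp(reward)) and a reward that is linear in hand-picked features. Both are simplifying assumptions of the sketch, and the function names and feature vectors are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choice_probs(weights, options):
    """Probability of each option under a Boltzmann-rational chooser.

    Assumption: reward(option) = dot(weights, features(option)).
    """
    rewards = [sum(w * f for w, f in zip(weights, feats)) for feats in options]
    return softmax(rewards)


def simulate_choices(true_weights, options, n, rng):
    """Generate n observed choices (option indices) from a simulated agent."""
    probs = choice_probs(true_weights, options)
    return rng.choices(range(len(options)), weights=probs, k=n)


def estimate_weights(options, choices, steps=2000, lr=0.5):
    """Maximum-likelihood estimate of the reward weights by gradient ascent.

    The gradient of the average log-likelihood is the empirical feature mean
    minus the feature mean expected under the current model.
    """
    dim = len(options[0])
    n = len(choices)
    empirical = [sum(options[c][d] for c in choices) / n for d in range(dim)]
    weights = [0.0] * dim
    for _ in range(steps):
        probs = choice_probs(weights, options)
        expected = [
            sum(p * feats[d] for p, feats in zip(probs, options))
            for d in range(dim)
        ]
        weights = [w + lr * (e - x) for w, e, x in zip(weights, empirical, expected)]
    return weights


# Hypothetical toy data: four options described by two binary features.
options = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
observed = simulate_choices([1.0, -0.5], options, 5000, random.Random(0))
estimated = estimate_weights(options, observed)
```

With enough observations, `estimated` should land near the simulated agent's weights of `[1.0, -0.5]`. Note that this only works because the simulated agent really does follow the assumed choice model.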

Here is my reasoning for believing that this approximation will in fact work:

First, we assume that all of the conditions above hold.

The estimated reward function is continuously updated with data from every individual it meets, using some form of weighted experience replay system so as to not overwrite previously-learned information.
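As a rough sketch of what such a replay system might look like: the buffer below keeps every observation and, when sampling a training batch, weights each observation inversely to how many observations its source individual contributed. That particular weighting scheme is an assumption of this sketch, as are the class name and the individual ids; the idea is only that newly met or prolific individuals should not drown out earlier ones.

```python
import random
from collections import Counter


class WeightedReplayBuffer:
    """Stores (individual_id, observation) pairs and samples them so that
    each individual has roughly equal total weight in a training batch."""

    def __init__(self, seed=0):
        self._items = []          # all stored (individual_id, observation) pairs
        self._counts = Counter()  # number of observations per individual
        self._rng = random.Random(seed)

    def add(self, individual_id, observation):
        self._items.append((individual_id, observation))
        self._counts[individual_id] += 1

    def sample(self, k):
        # Weight each item by 1 / (observations from the same individual),
        # so the weights for any one individual sum to 1.
        weights = [1.0 / self._counts[ind] for ind, _ in self._items]
        return self._rng.choices(self._items, weights=weights, k=k)


# Hypothetical usage: one heavily observed individual, one lightly observed.
buffer = WeightedReplayBuffer()
for step in range(900):
    buffer.add("alice", step)
for step in range(100):
    buffer.add("bob", step)
batch = buffer.sample(32)
```

Each sampled batch would then be fed to whatever reward-estimation update is in use. Other weightings, such as by recency or by estimation error, would slot into `sample` in the same way.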

Given that IRL/IOC can already estimate the reward function of one agent, or even a specific class of agents such as streaked shearwater birds¹, with a sufficiently complex system this algorithm should be able to extend to complex (read: human) agents.

As the number of observations n approaches infinity (or some sufficiently large number), the estimated reward function should approach one that is a "good enough" approximation of the coherent extrapolated volition of humanity.
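One way to write this claim down, under assumptions that are not in the argument itself: observations are i.i.d. state-action pairs pooled over individuals, the behaviour model is Boltzmann-rational in a one-step reward, and the usual regularity conditions for maximum-likelihood estimation hold. The notation is illustrative only.

```latex
% Notation is illustrative and not taken from a specific paper.
% D_n = {(s_i, a_i)}_{i=1}^n : observed state-action pairs pooled over individuals.
% R_\theta : parametric reward model (e.g. a neural network with parameters \theta).
\begin{align}
P_\theta(a \mid s) &= \frac{\exp R_\theta(s, a)}{\sum_{a'} \exp R_\theta(s, a')} \\
\hat{\theta}_n &= \arg\max_\theta \frac{1}{n} \sum_{i=1}^{n} \log P_\theta(a_i \mid s_i) \\
\hat{\theta}_n &\xrightarrow[n \to \infty]{} \theta^\ast
  = \arg\min_\theta \, \mathbb{E}_{s}\!\left[ \mathrm{KL}\!\left( P_{\text{human}}(\cdot \mid s) \,\|\, P_\theta(\cdot \mid s) \right) \right]
\end{align}
```

The last line is what standard consistency results give: convergence to the reward model that best predicts observed behaviour. Treating that limiting reward as a "good enough" stand-in for the CEV is an additional assumption of the argument, not something the limit itself guarantees.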

Note that there does not need to exist some actual reward function that is natively used by real humans, evaluated by their brain. As long as human behaviour can be sufficiently approximated by a neural network, this will hold; given the wide abilities of neural networks, from classifiers to learning agents to machine translation, I don't see this as too much of a stretch.

However, I do anticipate certain objections to this explanation. Let me run through a few of them.

  1. Humans are too complex to have their behaviours estimated by inverse reinforcement learning.
    • This seems to me like an argument from human exceptionalism, or anthropocentrism. I don't see any reason for this to be true. Various animals already demonstrate many behaviours considered by anthropocentrists to be unique to humans, such as tool use in various primates and birds, the ability of crows to recognise faces, and the ability of parrots to mimic speech and perform math. Given these examples, I don't see any compelling argument for an anthropocentric objection to this approach.
  2. Getting the input and output necessary to perform online (i.e. real-time) inverse reinforcement learning is infeasible.
    • This is one of the most compelling counterarguments to this approach. However, I think that even if recreating "mirror neurons" (i.e. sensory neurons that fire both when the agent does something and when it observes someone else do something) is too difficult, another approach could be used. A sufficiently realistic VRMMORPG-like environment (Virtual Reality Massively Multiplayer Online Role-Playing Game) could be used to collect input sensory data and behaviours from players. If players are properly incentivised to act as they would in a real environment, then with a sufficient amount of pre-training, a "close-enough" approximation of the CEV should be possible.
  3. "Close-enough" doesn't even mean anything!
    • This is also an issue, yes. There are a number of ways to define "close-enough", but I choose to leave the choice of which up to you. Some examples are: "functionally indistinguishable", "functionally indistinguishable within a society", "functionally indistinguishable within an intra-societal community", or "functionally indistinguishable within a small group". These aren't exhaustive, and I can see any number of ways to define "close-enough".
  4. What do you mean by approximating the CEV? Isn't it by definition incomprehensible to ourselves when extrapolated so far out? Doesn't that mean it would be impossible to approximate it from individual observations?
    • This is where it gets dicey. Since we don't know the CEV, how do we know if we have successfully approximated it? Can it even be approximated? One of the issues I thought of while writing this is that individual human behaviour may not converge to the CEV. My expectation is that as the number of humans whose behaviours have been sampled, as well as the number of samples taken from each individual human, grows large, the estimate approaches the volition an individual would have if they had the same resources as the entirety of the observed population. My assumption is that this is equivalent to the CEV, which may not be true.

However, I'd be interested to see if there are any rebuttals to my responses to these counterarguments, as well as any counterarguments that I didn't bring up, of which there are definitely many. Also, if I made any mistakes or if anything in this post isn't clear, feel free to ask and I'll clarify it.


  1. Hirakawa, Tsubasa, Takayoshi Yamashita, Toru Tamaki, Hironobu Fujiyoshi, Yuta Umezu, Ichiro Takeuchi, Sakiko Matsumoto, and Ken Yoda. "Can AI Predict Animal Movements? Filling Gaps in Animal Trajectories Using Inverse Reinforcement Learning." Ecosphere 9, no. 10 (2018).



1 Answer

Did you read Rohin Shah's value learning sequence? It covers this whole area in a good amount of detail, and I think it answers your question pretty straightforwardly:

Existing error models for inverse reinforcement learning tend to be very simple, ranging from Gaussian noise in observations of the expert’s behavior or sensor readings, to the assumption that the expert’s choices are randomized with a bias towards better actions.
In fact humans are not rational agents with some noise on top. Our decisions are the product of a complicated mess of interacting process, optimized by evolution for the reproduction of our children’s children. It’s not clear there is any good answer to what a “perfect” human would do. If you were to find any principled answer to “what is the human brain optimizing?” the single most likely bet is probably something like “reproductive success.” But this isn’t the answer we are looking for.
I don’t think that writing down a model of human imperfections, which describes how humans depart from the rational pursuit of fixed goals, is likely to be any easier than writing down a complete model of human behavior.
We can’t use normal AI techniques to learn this kind of model, either — what is it that makes a model good or bad? The standard view — “more accurate models are better” — is fine as long as your goal is just to emulate human performance. But this view doesn’t provide guidance about how to separate the “good” part of human decisions from the “bad” part.

Here is a link to the full sequence: https://www.lesswrong.com/s/4dHMdK5TLN6xcqtyc