I’ve finally uploaded to arXiv our work on inferring human biases alongside IRL, which was published at ICML 2019.
Summary of the paper
The IRL Debate
Here’s a quick tour of the debate about inverse reinforcement learning (IRL) and cognitive biases, featuring many of the ideas from the first chapter of the Value Learning sequence:
I had the intuition that the impossibility theorem was like the other no-free-lunch theorems in ML: not actually relevant for what ML could do in practice. So we tried to learn and correct for systematic biases in IRL.
The idea behind the algorithms
The basic idea was to learn the planning algorithm by which the human produces demonstrations, and try to ensure that the planning algorithm captured the appropriate systematic biases. We used a Value Iteration Network to give an inductive bias towards “planners” but otherwise did not assume anything about the form of the systematic bias. Then, we could perform IRL by figuring out which reward would cause the planning algorithm to output the given demonstrations. The reward would be “debiased” because the effect of the biases on the policy would already be accounted for in the planning algorithm.
How could we learn the planning algorithm? Well, one baseline method is to assume that we have access to some tasks where the rewards are known, and use those tasks to learn what the planning algorithm is. Then, once that is learned, we can infer the rewards for new tasks that we haven’t seen before. This requires the planner to generalize across tasks.
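To make the two phases concrete, here is a toy sketch. It is not the code from the paper. As a loud simplifying assumption, the learned planner is a one-parameter Boltzmann planner (a stand-in for the Value Iteration Network), the tasks are one-step bandits, and the demonstrations are exact action frequencies rather than samples. All names (`boltzmann`, `fitted_beta`, `candidates`, and so on) are made up for illustration.

```python
import math

def boltzmann(rewards, beta):
    """Action probabilities proportional to exp(beta * reward)."""
    top = max(rewards)
    weights = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(action_freqs, rewards, beta):
    """Expected log-probability of the demonstrated actions under the planner."""
    probs = boltzmann(rewards, beta)
    return sum(f * math.log(p) for f, p in zip(action_freqs, probs))

# Simulated human: a Boltzmann planner whose beta is hidden from the learner.
TRUE_BETA = 2.0

# Phase 1: tasks with known rewards are used to fit the planner parameter.
known_tasks = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5]]
known_demos = [boltzmann(r, TRUE_BETA) for r in known_tasks]
beta_grid = [0.5, 1.0, 2.0, 4.0]
fitted_beta = max(
    beta_grid,
    key=lambda b: sum(
        log_likelihood(d, r, b) for d, r in zip(known_demos, known_tasks)
    ),
)

# Phase 2: on a new task the reward is unknown; keep the planner fixed and
# pick the candidate reward that best explains the new demonstrations.
new_demo = boltzmann([2.0, 0.0, 1.0], TRUE_BETA)
candidates = [[0.0, 1.0, 2.0], [2.0, 0.0, 1.0], [1.0, 2.0, 0.0]]
inferred_reward = max(
    candidates, key=lambda r: log_likelihood(new_demo, r, fitted_beta)
)
```

The sketch only works because the fitted planner transfers to the new task, which is the generalization requirement mentioned above. The candidate set is there to sidestep the fact that, in this toy, rewards are only identifiable up to an additive constant.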
However, it’s kind of cheating to assume access to ground truth rewards, since we usually wouldn’t have them. What if we learned the planning algorithm and rewards simultaneously? Well, the no-free-lunch theorem gets us then: maximizing the true reward and minimizing the negative of the true reward lead to the same policy, so the demonstrations can’t distinguish between them, and the output of your IRL algorithm could be either the true reward or its negation. It would be really bad if our IRL algorithm said exactly the opposite of what we want. But surely we can at least assume that humans are not expected utility minimizers in order to eliminate this possibility.
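The maximizer/minimizer ambiguity is easy to check on a tiny example. The following is a minimal sketch on a made-up 4-state chain MDP (the environment, `plan`, and the reward values are all assumptions for illustration, not anything from the paper): a planner that maximizes the reward and a planner that minimizes the negated reward end up with identical policies.

```python
# Hypothetical 4-state chain MDP; every name here is illustrative.
N_STATES = 4
ACTIONS = (-1, +1)  # move left or move right
GAMMA = 0.9

def step(state, action):
    """Deterministic transition, clipped to the ends of the chain."""
    return min(max(state + action, 0), N_STATES - 1)

def plan(reward, choose, iters=100):
    """Value iteration where `choose` is max (maximizer) or min (minimizer)."""
    values = [0.0] * N_STATES
    for _ in range(iters):
        values = [
            choose(
                reward[step(s, a)] + GAMMA * values[step(s, a)] for a in ACTIONS
            )
            for s in range(N_STATES)
        ]
    # Pick the best (or worst) action in each state under the final values.
    policy = []
    for s in range(N_STATES):
        q = {a: reward[step(s, a)] + GAMMA * values[step(s, a)] for a in ACTIONS}
        policy.append(choose(q, key=q.get))
    return policy

true_reward = [0.0, 0.0, 0.0, 1.0]          # reward for entering each state
negated_reward = [-r for r in true_reward]

maximizer_policy = plan(true_reward, max)     # maximizes the true reward
minimizer_policy = plan(negated_reward, min)  # minimizes the negated reward
```

Both planners always move right, so no amount of demonstration data from this environment can tell the pair (maximizer, true reward) apart from the pair (minimizer, negated reward).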
So, we make the assumption that the human is “near-optimal”. We initialize the planning algorithm to be optimal, and then optimize for a planning algorithm that is “near” the optimal planner, in gradient-descent-space, that combined with the (learned) reward function explains the demonstrations. You might think that a minimizer is in fact “near” a maximizer; empirically this didn’t turn out to be the case, but I don’t have a particularly compelling reason why that happened.
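Here is a toy sketch of what “initialize at the optimal planner, then jointly fit planner and reward by gradient steps” can look like. It is heavily simplified and is not the paper's algorithm: as an assumption, the “planner” is a single scale parameter on a softmax over rewards (positive scale means a soft maximizer, negative scale means a soft minimizer), the task is a one-step bandit, and the demonstrations are exact action frequencies. The names `scale`, `reward`, and `LR` are hypothetical.

```python
import math

def softmax(logits):
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Simulated human: Boltzmann-rational with beta = 2 on a 3-action bandit.
true_reward = [0.0, 1.0, 2.0]
demo = softmax([2.0 * r for r in true_reward])

scale = 3.0                  # planner parameter, initialised as a soft maximizer
reward = [0.0, 0.0, 0.0]     # learned reward, initialised at zero
LR = 0.1                     # small steps keep the planner near its start

for _ in range(200):
    policy = softmax([scale * r for r in reward])
    # Gradient of the expected log-likelihood with respect to the logits.
    err = [d - p for d, p in zip(demo, policy)]
    grad_scale = sum(e * r for e, r in zip(err, reward))
    grad_reward = [scale * e for e in err]
    # Simultaneous gradient-ascent update of planner and reward.
    scale += LR * grad_scale
    reward = [r + LR * g for r, g in zip(reward, grad_reward)]

best_action = reward.index(max(reward))
```

In this toy the scale stays positive and the learned reward ranks the truly best action highest, so the fit ends up in the “maximizer with roughly the right reward” solution instead of the “minimizer with negated reward” one. That only illustrates the setup; it says nothing about why the same thing happened with the much larger learned planner in our experiments.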
Here’s the graph from our paper, showing the performance of various algorithms on some simulated human biases (higher = better). Both of our algorithms get access to the simulated human policies on multiple tasks. Algorithm 1 is the one that gets access to ground-truth rewards for some tasks, while Algorithm 2 is the one that instead tries to ensure that the learned planner is “near” the optimal planner. “Boltzmann” and “Optimal” mean that the algorithm assumes that the human is Boltzmann rational and optimal respectively.
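For readers who haven't seen the two baseline assumptions before, here is a minimal sketch of how each maps Q-values to a policy. The function names and the rationality parameter `beta` are assumptions for illustration; the exact parameterization used in the paper's baselines may differ.

```python
import math

def boltzmann_policy(q_values, beta=1.0):
    """Boltzmann rationality: P(action) is proportional to exp(beta * Q)."""
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def optimal_policy(q_values):
    """Optimality: all probability on the highest-Q action (first one on ties)."""
    best = max(range(len(q_values)), key=q_values.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(q_values))]

q = [1.0, 2.0, 3.0]
noisy = boltzmann_policy(q, beta=1.0)      # favours action 2, but not always
greedy = optimal_policy(q)                 # always action 2
uniform = boltzmann_policy(q, beta=0.0)    # beta = 0 ignores the Q-values
sharp = boltzmann_policy(q, beta=50.0)     # large beta approaches `greedy`
```

The Boltzmann model interpolates between uniformly random behavior (`beta = 0`) and optimal behavior (large `beta`), but it can only express noise of that one shape, which is why neither baseline is robust to arbitrary systematic biases.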
Our algorithms work better on average, mostly by being robust to the specific kind of bias that the demonstrator had: they tend to perform on par with the better of the Boltzmann and Optimal baseline algorithms. Surprisingly (to me), the second algorithm sometimes outperforms the first, even though the first algorithm has access to more data (since it gets access to the ground truth rewards in some tasks). This could be because the second algorithm exploits the assumption that the demonstrator is near-optimal, which is correct for most of the models we test, while the first algorithm does not. On the other hand, maybe it’s just random noise.
Superintelligent AI alignment
The most obvious way that this is relevant to AI alignment is that it is progress on ambitious value learning, where we try to learn a utility function that encodes all of human values.
“But wait,” you say, “didn’t you argue that ambitious value learning is unlikely to work?”
Well, yes. At the time that I was doing this work, I believed that ambitious value learning was the only option, and that it seemed hard but not doomed. This was the obvious thing to do to try to advance it. But this was over a year ago; the reason it’s only now coming out is that it took a while to publish the paper. (In fact, it predates my state of the world work.) But it’s true that now I’m not very hopeful about ambitious value learning, and so this paper’s contribution towards it doesn’t seem particularly valuable to me. However, a few others remain optimistic about ambitious value learning, and if they’re right, this research might be useful for that pathway to aligned AI.
I do think that the paper contributes to narrow value learning, and I still think that this very plausibly will be relevant to AI alignment. It’s a particularly direct attack on the specification problem, with the goal of inferring a specification that leads to a policy that would outperform the demonstrator. That said, I am no longer very optimistic about approaches that require a specific structure (in this case, world models fed into a differentiable planner with an inductive bias that then produces actions), and I am also less optimistic about using approaches that try to mimic expected value calculations, rather than trying to do something more like norm inference.
(However, I still expect that the impossibility result in preference learning will only be a problem in theory, not in practice. It’s just that this particular method of dealing with it doesn’t seem like it will work.)
Near-term AI issues
In the near term, we will need better ways than reward functions to specify the behavior that we want to an AI system. Inverse reinforcement learning is probably the leading example of how we could do this. However, since the specific algorithms require much better differentiable planners before they will perform on par with existing algorithms, it may be some time before they are useful. In addition, it’s probably better to use specific bias models in the near term. Overall, I think these methods or ideas are about as likely to be used in the near term as the average paper (which is to say, not very likely).