(Re)Posted as part of the AI Alignment Forum sequence on Value Learning.
Rohin’s note: In the last post, we saw that a good broad value learning approach would need to understand the systematic biases in human planning in order to achieve superhuman performance. Perhaps we can just use machine learning again and learn the biases and reward simultaneously? This post by Stuart Armstrong (original here) and the associated paper say: “Not without more assumptions.”
This post comes from a theoretical perspective that may be alien to ML researchers; in particular, it makes an argument that simplicity priors do not solve the problem pointed out here, where simplicity is based on Kolmogorov complexity (which is an instantiation of the Minimum Description Length principle). The analog in machine learning would be an argument that regularization would not work. The proof used is specific to Kolmogorov complexity and does not clearly generalize to arbitrary regularization techniques; however, I view the argument as being suggestive that regularization techniques would also be insufficient to address the problems raised here.
Humans have no values… nor does any agent, unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values.
An agent with no clear preferences
There are three buttons in this world, $B(0)$, $B(1)$, and $X$, and one agent $H$.
$B(0)$ and $B(1)$ can be operated by $H$, while $X$ can be operated by an outside observer. $H$ will initially press button $B(0)$; if ever $X$ is pressed, the agent will switch to pressing $B(1)$. If $X$ is pressed again, the agent will switch back to pressing $B(0)$, and so on. After a large number of turns $N$, $H$ will shut off. That’s the full algorithm for $H$.
So the question is, what are the values/preferences/rewards of $H$? There are three natural reward functions that are plausible:
- $R(0)$, which is linear in the number of times $B(0)$ is pressed.
- $R(1)$, which is linear in the number of times $B(1)$ is pressed.
- $R(2) = I(E,X)R(0) + I(O,X)R(1)$, where $I(E,X)$ is the indicator function for $X$ being pressed an even number of times, and $I(O,X) = 1 - I(E,X)$ is the indicator function for $X$ being pressed an odd number of times.
For $R(0)$, we can interpret $H$ as an $R(0)$-maximising agent which $X$ overrides. For $R(1)$, we can interpret $H$ as an $R(1)$-maximising agent which $X$ releases from constraints. And $R(2)$ is the “$H$ is always fully rational” reward. Semantically, these make sense for the various $R(i)$’s being a true and natural reward, with $X$ = “coercive brain surgery” in the first case, $X$ = “release H from annoying social obligations” in the second, and $X$ = “switch which of $R(0)$ and $R(1)$ gives you pleasure” in the last case.
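To make the ambiguity concrete, here is a minimal sketch of the agent and the three candidate rewards. It rests on assumptions not stated in the post: each reward is counted one unit per turn, an $X$ press takes effect before that turn's button press, and the function names are hypothetical.

```python
from typing import List, Tuple


def run_agent(x_presses: List[bool]) -> List[int]:
    """Simulate H. x_presses[t] is True if X is pressed just before turn t.

    Returns, for each turn, 0 if H pressed B(0) and 1 if H pressed B(1).
    """
    target = 0  # H starts out pressing B(0)
    pressed = []
    for x in x_presses:
        if x:
            target = 1 - target  # every X press toggles the button H uses
        pressed.append(target)
    return pressed


def candidate_rewards(x_presses: List[bool]) -> Tuple[int, int, int]:
    """Return (R(0), R(1), R(2)) accumulated over the whole run.

    Assumption: one unit of reward per matching press, summed per turn.
    """
    r0 = r1 = r2 = 0
    x_count = 0
    for x, button in zip(x_presses, run_agent(x_presses)):
        x_count += int(x)
        r0 += int(button == 0)
        r1 += int(button == 1)
        # R(2) = I(E,X) R(0) + I(O,X) R(1): reward B(0) while X has been
        # pressed an even number of times, B(1) while odd.
        if x_count % 2 == 0:
            r2 += int(button == 0)
        else:
            r2 += int(button == 1)
    return r0, r1, r2


if __name__ == "__main__":
    print(candidate_rewards([False, False, True, False, True, False]))
```

Under these assumptions $R(2)$ always equals the number of turns, which is the sense in which it makes $H$ "always fully rational", while $R(0)$ and $R(1)$ each treat part of the same behaviour as a failure.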
But note that there are no semantic implications here; all that we know is $H$, with its full algorithm. If we wanted to deduce its true reward for the purpose of something like Inverse Reinforcement Learning (IRL), what would it be?
Modelling human (ir)rationality and reward
Now let’s talk about the preferences of an actual human. We all know that humans are not always rational. But even if humans were fully rational, the fact remains that we are physical, and vulnerable to things like coercive brain surgery (and in practice, to a whole host of other more or less manipulative techniques). So there will be the equivalent of “button $X$” that overrides human preferences. Thus, “not immortal and unchangeable” is in practice enough for the agent to be considered “not fully rational”.
Now assume that we’ve thoroughly observed a given human $h$ (including their internal brain wiring), so we know the human policy $\pi(h)$ (which determines their actions in all circumstances). This is, in practice, all that we can ever observe - once we know $\pi(h)$ perfectly, there is nothing more that observing $h$ can teach us.
Let $R$ be a possible human reward function, and $\mathcal{R}$ the set of such rewards. A human (ir)rationality planning algorithm $p$ (hereafter referred to as a planner) is a map from $\mathcal{R}$ to the space of policies (thus $p(R)$ says how a human with reward $R$ will actually behave - for example, this could be bounded rationality, rationality with biases, or many other options). Say that the pair $(p, R)$ is compatible if $p(R) = \pi(h)$. Thus a human with planner $p$ and reward $R$ would behave as $h$ does.
What possible compatible pairs are there? Here are some candidates:
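- $(p(0), R(0))$, where $p(0)$ and $R(0)$ are some “plausible” or “acceptable” planner and reward functions (what this means is a big question).
- $(p(1), R(1))$, where $p(1)$ is the “fully rational” planner, and $R(1)$ is a reward that fits to give the required policy.
- $(p(2), R(2))$, where $R(2) = -R(1)$, and $p(2) = -p(1)$, where $-p(R)$ is defined as $p(-R)$; here $p(2)$ is the “fully anti-rational” planner.
- $(p(3), R(3))$, where $p(3)$ maps all rewards to $\pi(h)$, and $R(3)$ is trivial and constant.
- $(p(4), R(4))$, where $p(4) = -p(0)$ and $R(4) = -R(0)$.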
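The candidates above can be checked on a toy example. This is only a sketch under stated assumptions: the three histories, two actions, and the particular policy are hypothetical, and $R(1)$ is fitted as an indicator of the observed action, which is one way (not the only one) to pick a reward that the rational planner turns into $\pi(h)$. The pair $(p(0), R(0))$ is left out because "plausible" is not defined.

```python
from typing import Callable, Dict

History = str
Action = str
Policy = Dict[History, Action]
Reward = Callable[[History, Action], float]
Planner = Callable[[Reward], Policy]

HISTORIES = ("h1", "h2", "h3")  # hypothetical toy histories
ACTIONS = ("a", "b")            # hypothetical toy actions

# The observed human policy pi(h); these particular choices are arbitrary.
PI_H: Policy = {"h1": "a", "h2": "b", "h3": "a"}


def negate(reward: Reward) -> Reward:
    """Return the reward -R."""
    return lambda eta, act: -reward(eta, act)


def rational(reward: Reward) -> Policy:
    """p(1): choose the reward-maximising action after every history."""
    return {eta: max(ACTIONS, key=lambda act: reward(eta, act))
            for eta in HISTORIES}


def anti_rational(reward: Reward) -> Policy:
    """p(2) = -p(1), using the convention (-p)(R) = p(-R)."""
    return rational(negate(reward))


def constant_planner(reward: Reward) -> Policy:
    """p(3): ignore the reward entirely and output pi(h)."""
    return dict(PI_H)


def r1(eta: History, act: Action) -> float:
    """R(1): 1 for the action the human actually takes, 0 otherwise."""
    return 1.0 if PI_H[eta] == act else 0.0


r2 = negate(r1)  # R(2) = -R(1)


def r3(eta: History, act: Action) -> float:
    """R(3): trivial constant reward."""
    return 0.0


def compatible(planner: Planner, reward: Reward) -> bool:
    """(p, R) is compatible if p(R) equals the observed policy."""
    return planner(reward) == PI_H


if __name__ == "__main__":
    pairs = {"(p1, R1)": (rational, r1),
             "(p2, R2)": (anti_rational, r2),
             "(p3, R3)": (constant_planner, r3)}
    for name, (p, r) in pairs.items():
        print(name, compatible(p, r))
```

All three pairs reproduce the same policy even though $R(1)$ and $R(2)$ are exact opposites, while a mismatched pair such as the rational planner with $R(2)$ does not.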
Distinguishing among compatible pairs
How can we distinguish between compatible pairs? At first appearance, we can’t. That’s because, by the definition of compatible, all pairs produce the correct policy $\pi(h)$. And once we have $\pi(h)$, further observations of $h$ tell us nothing.
I initially thought that Kolmogorov or algorithmic complexity might help us here. But in fact:
Theorem: The pairs $(p(i), R(i))$, $i \geq 1$, are either simpler than $(p(0), R(0))$, or differ in Kolmogorov complexity from it by a constant that is independent of $(p(0), R(0))$.
Proof: The cases of $i=4$ and $i=2$ are easy, as these differ from $i=0$ and $i=1$ by two minus signs. Given $(p(0), R(0))$, a fixed-length algorithm computes $\pi(h)$. Then a fixed-length algorithm defines $p(3)$ (by mapping every input to $\pi(h)$). Furthermore, given $\pi(h)$ and any history $\eta$, a fixed-length algorithm computes the action $a(\eta)$ the agent will take; then a fixed-length algorithm defines $R(1)(\eta, a(\eta)) = 1$ and $R(1)(\eta, b) = 0$ for $b \neq a(\eta)$.
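The "two minus signs" step can be spelled out using only the convention that $-p$ is defined by $(-p)(R) = p(-R)$, together with the compatibility of the un-negated pairs:

```latex
\begin{align*}
p(2)(R(2)) &= (-p(1))(-R(1)) = p(1)(-(-R(1))) = p(1)(R(1)) = \pi(h), \\
p(4)(R(4)) &= (-p(0))(-R(0)) = p(0)(-(-R(0))) = p(0)(R(0)) = \pi(h).
\end{align*}
```

So each negated pair is compatible whenever the original pair is, and describing it needs only a fixed-length "negate" wrapper around the original planner and reward.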
So the Kolmogorov complexity can shift between $p$ and $R$ (all in $R$ for $i=1,2$, all in $p$ for $i=3$), but it seems that the complexity of the pair doesn’t go up during these shifts.
This is puzzling. It seems that, in principle, one cannot assume anything about $h$’s reward at all! $R(2) = -R(1)$, $R(4) = -R(0)$, and $p(3)$ is compatible with any possible reward $R$. If we give up the assumption of human rationality - which we must - it seems we can’t say anything about the human reward function. So it seems IRL must fail.