*(Re)Posted as part of the AI Alignment Forum sequence on* *Value Learning.*

Rohin’s note: In the last post, we saw that a good broad value learning approach would need to understand the systematic biases in human planning in order to achieve superhuman performance. Perhaps we can just use machine learning again and learn the biases and reward simultaneously? This post by Stuart Armstrong (original here) and the associated paper say: “Not without more assumptions.”

This post comes from a theoretical perspective that may be alien to ML researchers; in particular, it makes an argument that simplicity priors do not solve the problem pointed out here, where simplicity is based on Kolmogorov complexity (which is an instantiation of the Minimum Description Length principle). The analog in machine learning would be an argument that regularization would not work. The proof used is specific to Kolmogorov complexity and does not clearly generalize to arbitrary regularization techniques; however, I view the argument as being suggestive that regularization techniques would also be insufficient to address the problems raised here.

Humans have no values… nor does any agent, unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values.

### An agent with no clear preferences

There are three buttons in this world, $B_0$, $B_1$, and $X$, and one agent $H$.

$B_0$ and $B_1$ can be operated by $H$, while $X$ can be operated by an outside observer. $H$ will initially press button $B_0$; if ever $X$ is pressed, the agent will switch to pressing $B_1$. If $X$ is pressed again, the agent will switch back to pressing $B_0$, and so on. After a large number of turns $N$, $H$ will shut off. That’s the full algorithm for $H$.
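The agent's algorithm is small enough to write out. The following is a minimal sketch, not part of the original post: it labels the two agent-operated buttons `B0` and `B1` and the observer's button `X`, and it assumes that a press of `X` on a given turn takes effect before the agent acts on that turn. The function name `run_agent_h` is made up for illustration.

```python
from typing import Iterable, List


def run_agent_h(x_presses: Iterable[bool]) -> List[str]:
    """Simulate the agent for one turn per entry of x_presses.

    x_presses[t] is True when the outside observer presses X on turn t.
    Assumption: the X press is processed before the agent acts that turn.
    The number of turns N is simply the length of x_presses.
    """
    current = "B0"  # the agent starts out pressing B0
    history = []
    for x_pressed in x_presses:
        if x_pressed:
            # every press of X toggles which button the agent presses
            current = "B1" if current == "B0" else "B0"
        history.append(current)
    return history


print(run_agent_h([False, False, True, False, True, False]))
# ['B0', 'B0', 'B1', 'B1', 'B0', 'B0']
```

Nothing in this code mentions a goal or a reward; it is just a toggle, which is what makes the question of the agent's "true" values awkward.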

So the question is, what are the values/preferences/rewards of $H$? There are three natural reward functions that are plausible:

- $R_0$, which is linear in the number of times $B_0$ is pressed.
- $R_1$, which is linear in the number of times $B_1$ is pressed.
- $R_2 = I_{E,X} R_0 + I_{O,X} R_1$, where $I_{E,X}$ is the indicator function for $X$ being pressed an even number of times, and $I_{O,X} = 1 - I_{E,X}$ is the indicator function for $X$ being pressed an odd number of times.

For $R_0$, we can interpret $H$ as an $R_0$ maximising agent which $X$ overrides. For $R_1$, we can interpret $H$ as an $R_1$ maximising agent which $X$ releases from constraints. And $R_2$ is the “$H$ is always fully rational” reward. Semantically, these make sense for the various $R_i$’s being a true and natural reward, with $X$ = “coercive brain surgery” in the first case, $X$ = “release H from annoying social obligations” in the second, and $X$ = “switch which of $R_0$ and $R_1$ gives you pleasure” in the last case.
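To make the three candidate rewards concrete, here is a small sketch (an illustration, not from the original post) that scores one run of the agent under each of them. It assumes unit reward per press, and it reads the third reward turn by turn: a press counts if it is `B0` while the observer's button has been pressed an even number of times so far, or `B1` while that count is odd. The names `rewards`, `h_presses` and `x_presses` are hypothetical.

```python
def rewards(h_presses, x_presses):
    """Return (R0, R1, R2) for one run.

    h_presses: list of "B0"/"B1", the button the agent pressed each turn.
    x_presses: list of bools, whether the observer pressed X that turn
               (assumed to take effect before the agent acts).
    """
    r0 = h_presses.count("B0")  # linear in the number of B0 presses
    r1 = h_presses.count("B1")  # linear in the number of B1 presses
    r2 = 0
    x_count = 0
    for pressed, x in zip(h_presses, x_presses):
        x_count += x  # True counts as 1
        wanted = "B0" if x_count % 2 == 0 else "B1"
        r2 += pressed == wanted
    return r0, r1, r2


h_presses = ["B0", "B0", "B1", "B1", "B0", "B0"]
x_presses = [False, False, True, False, True, False]
print(rewards(h_presses, x_presses))  # (4, 2, 6)
```

On this run the behaviour is only partially successful at the first two rewards but attains the maximum possible value of the third, which is why the third reading makes the agent look fully rational. The behaviour alone does not tell us which reading is the right one.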

But note that there are no semantic implications here; all that we know is $H$, with its full algorithm. If we wanted to deduce its true reward for the purpose of something like Inverse Reinforcement Learning (IRL), what would it be?

### Modelling human (ir)rationality and reward

Now let’s talk about the preferences of an actual human. We all know that humans are not always rational. But even if humans were fully rational, the fact remains that we are physical, and vulnerable to things like coercive brain surgery (and in practice, to a whole host of other more or less manipulative techniques). So there will be the equivalent of “button $X$” that overrides human preferences. Thus, “not immortal and unchangeable” is in practice enough for the agent to be considered “not fully rational”.

Now assume that we’ve thoroughly observed a given human $h$ (including their internal brain wiring), so we know the human policy $\pi(h)$ (which determines their actions in all circumstances). This is, in practice, all that we can ever observe - once we know $\pi(h)$ perfectly, there is nothing more that observing $h$ can teach us.

Let $R$ be a possible human reward function, and **R** the set of such rewards. A human (ir)rationality planning algorithm $p$ (hereafter referred to as a planner) is a map from **R** to the space of policies (thus $p(R)$ says how a human with reward $R$ will actually behave - for example, this could be bounded rationality, rationality with biases, or many other options). Say that the pair $(p, R)$ is compatible if $p(R) = \pi(h)$. Thus a human with planner $p$ and reward $R$ would behave as $h$ does.

What possible compatible pairs are there? Here are some candidates:

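- $(p(0), R(0))$, where $p(0)$ and $R(0)$ are some “plausible” or “acceptable” planner and reward functions (what this means is a big question).
- $(p(1), R(1))$, where $p(1)$ is the “fully rational” planner, and $R(1)$ is a reward that fits to give the required policy.
- $(p(2), R(2))$, where $R(2) = -R(1)$, and $p(2) = -p(1)$, where $-p(R)$ is defined as $p(-R)$; here $p(2)$ is the “fully anti-rational” planner.
- $(p(3), R(3))$, where $p(3)$ maps all rewards to $\pi(h)$, and $R(3)$ is trivial and constant.
- $(p(4), R(4))$, where $p(4) = -p(0)$ and $R(4) = -R(0)$.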
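A toy example may help show how different these candidates are while still agreeing on behaviour. The sketch below is an illustration under strong simplifying assumptions, not the post's formalism: it uses a one-step setting in which a policy is a table from states to actions and the "fully rational" planner just picks the highest-reward action in each state. The state names, action names and the observed policy `pi_h` are made up. The "plausible" pair is left out, since what counts as plausible is exactly the open question.

```python
STATES = ["s0", "s1", "s2"]
ACTIONS = ["left", "right"]

# The observed human policy (hypothetical data for the sketch).
pi_h = {"s0": "left", "s1": "right", "s2": "left"}


def rational(reward):
    """Fully rational planner: pick the highest-reward action in each state."""
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def negate(reward):
    return {key: -value for key, value in reward.items()}


def anti_rational(reward):
    """Fully anti-rational planner, defined as (-p)(R) = p(-R)."""
    return rational(negate(reward))


def indifferent(reward):
    """Planner that ignores the reward and always outputs the observed policy."""
    return dict(pi_h)


# Reward that makes the rational planner reproduce pi_h.
R1 = {(s, a): 1.0 if pi_h[s] == a else 0.0 for s in STATES for a in ACTIONS}
R2 = negate(R1)  # the exact opposite values
R3 = {(s, a): 0.0 for s in STATES for a in ACTIONS}  # trivial, constant reward

pairs = {
    "rational, R1": (rational, R1),
    "anti-rational, -R1": (anti_rational, R2),
    "indifferent, constant": (indifferent, R3),
}
for name, (planner, reward) in pairs.items():
    print(name, "compatible:", planner(reward) == pi_h)
```

All three lines print `compatible: True`, even though the first two pairs attribute exactly opposite values to the human and the third attributes none.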

### Distinguishing among compatible pairs

How can we distinguish between compatible pairs? At first appearance, we can’t. That’s because, by the definition of compatible, all pairs produce the correct policy $\pi(h)$. And once we have $\pi(h)$, further observations of $h$ tell us nothing.

I initially thought that Kolmogorov or algorithmic complexity might help us here. But in fact:

**Theorem:** The pairs $(p(i), R(i))$, $i \geq 1$, are either simpler than $(p(0), R(0))$, or differ in Kolmogorov complexity from it by a constant that is independent of $(p(0), R(0))$.

**Proof:** The cases of $i = 4$ and $i = 2$ are easy, as these differ from $i = 0$ and $i = 1$ by two minus signs. Given $(p(0), R(0))$, a fixed-length algorithm computes $\pi(h)$. Then a fixed-length algorithm defines $p(3)$ (by mapping any input to $\pi(h)$). Furthermore, given $\pi(h)$ and any history $\eta$, a fixed-length algorithm computes the action $a(\eta)$ the agent will take; then a fixed-length algorithm defines $R(1)(\eta, a(\eta)) = 1$ and $R(1)(\eta, b) = 0$ for $b \neq a(\eta)$.
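Written out, the constructions used in the proof and the resulting bound look roughly as follows. This is a restatement under notational assumptions: $\eta$ denotes a history, $a(\eta)$ the action the observed policy takes after it, $K$ Kolmogorov complexity, and $c$ a constant that depends only on the fixed-length algorithms, not on the plausible pair.

```latex
% Reward that makes the fully rational planner reproduce the observed policy
R(1)(\eta, a) =
\begin{cases}
  1 & \text{if } a = a(\eta), \\
  0 & \text{otherwise,}
\end{cases}
\qquad
% Planner that ignores its input
p(3)(R) = \pi(h) \quad \text{for all } R.

% Each degenerate pair costs at most a constant more than the plausible one
K\big(p(i), R(i)\big) \;\le\; K\big(p(0), R(0)\big) + c, \qquad i \ge 1.
```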

So the Kolmogorov complexity can shift between $p$ and $R$ (all in $R$ for $i = 1, 2$, all in $p$ for $i = 3$), but it seems that the complexity of the pair doesn’t go up during these shifts.

This is puzzling. It seems that, in principle, one cannot assume anything about $h$’s reward at all! $R(2) = -R(1)$, $R(4) = -R(0)$, and $p(3)$ is compatible with any possible reward $R$. If we give up the assumption of human rationality - which we must - it seems we can’t say anything about the human reward function. So it seems IRL must fail.

How I understand the main point:

The goal is to get superhuman performance aligned with human values Rh. How might we achieve this? By learning the human values. Then we can use a perfect planner p⋆ to find the best actions to align the world with the human values. This will have superhuman performance, because humans' planning algorithms are not perfect. They don't always find the best actions to align the world with their values.

How do we learn the human values? By observing human behaviour, i.e. their actions in each circumstance. This is modelled as the human policy π(h).

Behaviour is the known outside view of a human, and values+planner is the unknown inside view. We need to learn both the values and the planner such that p(R)=π(h).

Unfortunately, this equation is underdetermined. We only know π(h). p and R can vary independently.

Are there differences among the (p,R) candidates? One thing we could look at is their Kolmogorov complexity. Maybe the true candidate has the lowest complexity. But this is not the case, according to the article.

Yep, basically that. ^_^

Out of curiosity, is there an intuitive explanation as to why these are different? Is it mainly because ambitious value learning inevitably has to deal with lots of (systematic) mistakes in the data, whereas normally you'd make sure that the training data doesn't contain (many) obvious mistakes? Or are there examples in ML where you can retroactively correct mistakes imported from a flawed training set?

(I'm not sure "training set" is the right word for the IRL context. Applied to ambitious value learning, what I mean would be the "human policy".)

Update: Ah, it seems like the next post is all about this! :) My point about errors seems like it might be vaguely related, but the explanation in the next post feels more satisfying. It's a different kind of problem because you're not actually interested in predicting observable phenomena anymore, but instead are trying to infer the "latent variable" – the underlying principle(?) behind the inputs. The next post in the sequence also gives me a better sense of why people say that ML is typically "shallow" or "surface-level reasoning".

Two points about understandability:

This made me think that **R** was the set of all possible values the reward functions could take or something like that, but it's actually the set of reward *functions*.

Secondly, writing p(1),p(2) is somewhat confusing because the p(⋅) notation makes one think that p is applied to something (since it's a function), but in fact it's just to count them. Writing p1,p2 would avoid the issue.

I don't find it surprising that a simplicity prior doesn't work since human rationality doesn't seem to be particularly simple. I do have the intuition that the problem is extremely hard.

Interestingly, humans are able to predict each other's values in most cases, and this helps our society to exist. Relationships, markets, even just walking outside: all of it is based on our ability to read the intentions of other people successfully.

However, many bad events happen when we don't understand each other's intentions: this enables scammers and interpersonal conflicts.

Only across small inferential gaps. That works for most cases only because people interact inside bubbles, groups based on similarity. Interactions between random people would be mostly puzzling.

I'm curious what you think of my comment here, which suggests that Kolmogorov complexity might be enough after all, as long as we are willing to change our notion of compatibility.

(I'm also curious what you think of Daniel's post, although to a lesser extent.)

Answered your comment there.

Could one approach to detecting biases be to look for "dominated strategies"? For instance, suppose the human model is observed making various trades, exchanging sets of tokens for other sets of tokens, and the objective of the machine is to infer "intrinsic values" for each type of token.

(Maybe conditional on certain factors, i.e "An A is valuable, but only if you have a B", or "a C is only valuable on Tuesday").

Then if the human trades an A and an E for a B, a B for a C, and a C for an A, but then trades an A for ten Es, we can infer that the human has *some* form of bias, maybe neglecting tokens with small value (not realizing that the value of an E matters until you have ten of them), or maybe an "eagerness" to make trades.

This clearly relies on some "strong assumptions" (for instance, that tokens are *only* valuable in themselves - that executing a trade has no inherent value).

For the reasonable option, two other statements hold true, at least one of which fails for all totally unreasonable rules of similar Kolmogorov complexity that I can think of.

1) π(h) is good at optimizing R (much better than random).

2) p(R) is quickly computable, as opposed to the fully rational planner with every bias turned into a goal, which would be slow to compute (I think).

Shifting all biases to goals should also increase the complexity of the goal function.

Even just insisting that R is simple (low Kolmogorov complexity), and p is effective (displays many bits of optimization pressure towards R) should produce results more sane than these. (Maybe subtly flawed?)

EDIT: This has a tendency to locate simple instrumental subgoals, e.g. maximise entropy.

I agree you can add in more assumptions in order to get better results. The hard part is a) how you know that your assumptions are always correct, and b) how you know when you have enough assumptions that you will actually find the correct p and R.

(You might be interested in Inferring Reward Functions from Demonstrators with Unknown Biases, which takes a similar perspective as you quite explicitly, and Resolving human values, completely and adequately, which takes this perspective implicitly.)

The problem is that with these additional and obvious constraints, humans *cannot* be assigned arbitrary values, contrary to what the title of the post suggests. Sure, there will be multiple R that pass any number of assumptions and we will be uncertain about which to use. However, because we don't perfectly know π(h), we had that problem to begin with. So it's not clear why this new problem matters. Maybe our confidence in picking the right R will be a little lower than expected, but I don't see why this reduction must be large.

If we add assumptions like this, they will inevitably be misspecified, which can lead to other problems. For example, how would you operationalize that π is good at optimizing R? What if in reality due to effects currently beyond our understanding, our actions are making the future more likely to be dystopian in some way than if we took random actions? Should our AI infer that we prefer that dystopia, since otherwise we wouldn't be better than random?

(See also the next three posts in this sequence.)

The point isn't that there is nothing wrong or dangerous about learning biases and rewards. The point is that the OP is not very relevant to those concerns. The OP says that learning can't be done without extra assumptions, but we have plenty of natural assumptions to choose from. The fact that assumptions are needed is interesting, but it is by no means a strong argument against IRL.

That's an interesting question, because we obviously are taking actions that make the future more likely to be dystopian - we're trying to develop AGI, which might turn out unfriendly.

You'd think so, but nobody has defined these assumptions in anything like sufficient detail to make IRL work. My whole research agenda is essentially a way of defining these assumptions, and it seems to be a long and complicated process.

Evaluating R on a single example of human behavior is good enough to reject R(2), R(4) and possibly R(3).

Example: this morning I went to the kitchen and picked up a knife. Among possible further actions, I had A - "make a sandwich" and B - "stab myself in the gut". I chose A. R(2) and R(4) say I wanted B and R(3) is indifferent. I think that's enough reason to discard them.

Why not do this? Do you not agree that this test discards dangerous R more often than useful R? My guess is that you're asking for very strong formal guarantees from the assumptions that you consider and use a narrow interpretation of what it means to "make IRL work".

Rejecting any specific R is easy - one bit of information (at most) per specific R. So saying "humans have preferences, and they are not always rational or always anti-rational" rules out R(1), R(2), and R(3). Saying "this apparent preference is genuine" rules out R(4).

But it's not like there are just these five preferences and once we have four of them out of the way, we're done. There are many, many different preferences in the space of preferences, and many, many of them will be simpler than R(0). So to converge to R(0), we need to add huge amounts of information, ruling out more and more examples.

Basically, we need to include enough information to define R(0) - which is what my research project is trying to do. What you're seeing as "adding enough clear examples" is actually "hand-crafting R(0) in totality".

For more details see here: https://arxiv.org/abs/1712.05812

My example test is not nearly as specific as you imply. It discards large swaths of harmful and useless reward functions. Additional test cases would restrict the space further. There are still harmful Rs in the remaining space, but their proportion must be much lower than in the beginning. Is that not good enough?

Are you saying that R can't generalize if trained on a reasonably sized data set? This is very significant, if true, but I don't see it.

Details are good. I have a few notes though.

This might be a nitpick, but there is no such thing. If the agent was not originally composed from p and R, then none of the decompositions are "true". There are only "useful" decompositions. But that itself requires many assumptions about how usefulness is measured. I'm confused about how much of a problem this is. But it might be a big part of our philosophical difference - I want to slap together some ad hoc stuff that *possibly* works, while you want to find something *true*.

In this section you show that the pair (p(0), R(0)) is high complexity, but it seems that p(0) could be complex and R(0) could be relatively simple, contrary to what the title suggests. We don't actually need to find p(0), finding R(0) should be good enough.

Huh, isn't that what I'm saying? Is the problem that the assumptions I mentioned are derived from observing the human?

Slight tangent: I realized that the major difference between a human and the agent H (from the first example in OP), is that the human can take complex inputs. In particular, it can take logical propositions about itself or desirable R(0) and approve or disapprove of them. I'm not saying that "find R(0) that a human would approve of" is a good algorithm, but something along those lines could be useful.

We may not be disagreeing any more. Just to check, do you agree with both these statements:

1. Adding a few obvious constraints rules out many different R, including the ones in the OP.

2. Adding a few obvious constraints is not enough to get a safe or reasonable R.

1 is trivial, so yes. But I don't agree with 2. Maybe the disagreement comes from "few" and "obvious"? To be clear, I count evaluating some simple statistic on a large data set as one constraint. I'm not so sure about "obvious". It's not yet clear to me that my simple constraints aren't good enough. But if you say that more complex constraints would give us a lot more confidence, that's reasonable.

From OP I understood that you want to throw out IRL entirely. e.g.

seems like an unambiguous rejection of IRL and very different from

Ok, we strongly disagree on your simple constraints being enough. I'd need to see these constraints explicitly formulated before I had any confidence in them. I suspect (though I'm not certain) that the more explicit you make them, the more tricky you'll see that it is.

And no, I don't want to throw IRL out (this is an old post), I want to make it work. I got this big impossibility result, and now I want to get around it. This is my current plan: https://www.lesswrong.com/posts/CSEdLLEkap2pubjof/research-agenda-v0-9-synthesising-a-human-s-preferences-into

That's a part of the disagreement. In the past you clearly thought that Occam's razor was an "obvious" constraint that might work. Possibly you thought it was a unique such constraint. Then you found this result, and made a large update in the other direction. That's why you say the result is big - rejecting a constraint that you already didn't expect to work wouldn't feel very significant.

On the other hand, I don't think that Occam's razor is a unique such constraint. So when I see you reject it, I naturally ask "what about all the other obvious constraints that might work?". To me this result reads like "0 didn't solve our equation therefore the solution must be very hard". I'm sure that you have strong arguments against many other approaches, but I haven't seen them, and I don't think the one in OP generalizes well.

This is a bit awkward. I'm sure that I'm not proposing anything that you haven't already considered. And even if you show that this approach is wrong, I'd just try to put a band-aid on it. But here is an attempt:

First we'd need a data set of human behavior with both positive and negative examples (e.g. "I made a sandwich", "I didn't stab myself", etc). So it would be a set of tuples of state s, action a and +1 for positive examples, -1 for negative ones. This is not trivial to generate; in particular, it's not clear how to pick negative examples, but here too I expect that the obvious solutions are all fine. By the way, I have no idea how the examples are formalized, that seems like a problem, but it's not unique to this approach, so I'll assume that it's solved.

Next, given a pair (p, R), we would score it by adding up the following:

1. p(R) should accurately predict human behavior. So we want a count of p(R)(s)=a for positive cases and p(R)(s)!=a for negative cases.

2. R should also predict human behavior. So we want to sum R(s, a) for positive examples, minus the same sum for negative examples.

3. Regularization for p.

4. Regularization for R.

Here we are concerned about overfitting R, and don't care about p as much, so terms 1 and 4 would get large weights, and terms 2, 3 would get smaller weights.

Finally we throw machine learning at the problem to maximize this score.
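A rough sketch of what such a score could look like, purely as an illustration of the four terms above. Everything concrete here is an assumption: the planner is a function from a reward to a policy, the policy is a function from state to action, the two regularization terms are passed in as precomputed complexity numbers (how to measure them is left open), and the weights are arbitrary values chosen so that terms 1 and 4 dominate.

```python
def score(p, R, examples, complexity_p, complexity_R,
          weights=(1.0, 0.1, 0.1, 1.0)):
    """Score a (planner, reward) pair; higher is better.

    examples: iterable of (state, action, label), label +1 or -1.
    complexity_p, complexity_R: hypothetical complexity measures,
    used as regularization penalties.
    """
    w1, w2, w3, w4 = weights
    policy = p(R)
    # Term 1: p(R) picks the action in positive cases, avoids it in negative ones.
    fit_p = sum(1 for s, a, label in examples
                if (policy(s) == a) == (label > 0))
    # Term 2: R is high on positive examples and low on negative ones.
    fit_R = sum(label * R(s, a) for s, a, label in examples)
    # Terms 3 and 4: regularization, subtracted as penalties.
    return w1 * fit_p + w2 * fit_R - w3 * complexity_p - w4 * complexity_R


ACTIONS = ["make sandwich", "stab self"]
examples = [("kitchen", "make sandwich", +1), ("kitchen", "stab self", -1)]


def greedy(R):
    """Toy planner: pick the highest-reward action in each state."""
    return lambda s: max(ACTIONS, key=lambda a: R(s, a))


def likes_sandwich(s, a):
    return 1.0 if a == "make sandwich" else -1.0


def likes_stabbing(s, a):
    return -likes_sandwich(s, a)


good = score(greedy, likes_sandwich, examples, complexity_p=1, complexity_R=1)
bad = score(greedy, likes_stabbing, examples, complexity_p=1, complexity_R=1)
print(good > bad)  # True
```

This toy comparison only shows that the score separates a reward from its negation when the planner is held fixed. It says nothing about whether the maximum over all pairs lands on a reasonable reward, which is the point under dispute in this thread.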

I think that right now we don't know how to bridge the gap between the thing that presses the buttons on the computer, and a fuzzy specification of a human as a macroscopic physical object. And so if you are defining "human" as the thing that presses the buttons, and you can take actions that fully control which buttons get pressed, it makes sense that there's not necessarily a definition of what this "human" wants.

If we actually start bridging the gap, though, I think it makes lots of sense for the AI to start building up a model of the human-as-physical-object which also takes into account button presses, and in that case I'm not too pessimistic about regularization.

I think of the example as illustrative but the real power of the argument comes from the planner+reward formalism and the associated impossibility theorem. The fact that Kolmogorov complexity doesn't help is worrying. It's possible that other regularization techniques work where Kolmogorov complexity doesn't, but that raises the question of what is so special about these other regularization techniques.

Suppose we start our AI off with the intentional stance, where we have a high-level description of these human objects as agents with desires and plans, beliefs and biases and abilities and limitations.

What I'm thinking when I say we need to "bridge the gap" is that I think if we knew what we were doing, we could stipulate that some set of human button-presses is more aligned with some complicated object "hDesires" than not, and the robot should care about hDesires, where hDesires is the part of the intentional stance description of the physical human that plays the functional role of desires.