Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A putative new idea for AI control; index here. Crossposted at LessWrong 2.0. This post has nothing really new for this message board, but I'm posting it here because of the subsequent posts I'm intending to write.

Humans have no values... nor does any agent, unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values.


An agent with no clear preferences

There are three buttons in this world, B(0), B(1), and X, and one agent H.

B(0) and B(1) can be operated by H, while X can be operated by an outside observer. H will initially press button B(0); if ever X is pressed, the agent will switch to pressing B(1). If X is pressed again, the agent will switch back to pressing B(0), and so on. After a large number of turns N, H will shut off. That's the full algorithm for H.
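Since that is the full algorithm, it can be written down directly. Below is a minimal Python sketch of H; the function name `run_agent_h`, the boolean encoding of X presses, and the convention that X is pressed just before H acts on a turn are illustrative assumptions, not part of the original description.

```python
from typing import List


def run_agent_h(x_presses: List[bool], n_turns: int) -> List[str]:
    """Simulate H for n_turns turns.

    x_presses[t] is True if the outside observer presses X just before
    H acts on turn t (a timing assumption made for this sketch).
    """
    current = 0  # H initially presses B(0)
    pressed = []
    for t in range(n_turns):
        if t < len(x_presses) and x_presses[t]:
            current = 1 - current  # each press of X toggles the button
        pressed.append(f"B({current})")
    return pressed  # after n_turns, H shuts off


if __name__ == "__main__":
    print(run_agent_h([False, False, True, False, True], n_turns=6))
    # ['B(0)', 'B(0)', 'B(1)', 'B(1)', 'B(0)', 'B(0)']
```

Nothing in this code mentions a reward or a goal; it only describes behaviour.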

So the question is, what are the values/preferences/rewards of H? There are three natural reward functions that are plausible:

  • R(0), which is linear in the number of times B(0) is pressed.
  • R(1), which is linear in the number of times B(1) is pressed.
  • R(2) = I(E,X)R(0) + I(O,X)R(1), where I(E,X) is the indicator function for X being pressed an even number of times, and I(O,X) = 1 - I(E,X) is the indicator function for X being pressed an odd number of times.
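As a rough illustration, the three candidate rewards can be evaluated on a recorded history. The sketch below assumes unit slope for the linear rewards, and it reads R(2) turn by turn (a press counts according to the parity of X presses so far); both choices, and all function names, are assumptions made for the example.

```python
from typing import List, Tuple

# One turn is (number of X presses so far, button pressed),
# where the button is 0 for B(0) and 1 for B(1).
Turn = Tuple[int, int]


def reward_0(history: List[Turn]) -> float:
    """R(0): one unit per press of B(0)."""
    return float(sum(1 for _, button in history if button == 0))


def reward_1(history: List[Turn]) -> float:
    """R(1): one unit per press of B(1)."""
    return float(sum(1 for _, button in history if button == 1))


def reward_2(history: List[Turn]) -> float:
    """R(2): B(0) counts while X has been pressed an even number of
    times, B(1) counts while it has been pressed an odd number of times."""
    total = 0.0
    for x_count, button in history:
        wanted = 0 if x_count % 2 == 0 else 1
        if button == wanted:
            total += 1.0
    return total


if __name__ == "__main__":
    # A history that H itself would produce: X pressed before turns 2 and 4.
    h_history = [(0, 0), (0, 0), (1, 1), (1, 1), (2, 0), (2, 0)]
    print(reward_0(h_history), reward_1(h_history), reward_2(h_history))
    # 4.0 2.0 6.0
```

On this history H collects the maximum possible R(2) on every turn, while it falls short of the maximum for R(0) and R(1) whenever X has intervened.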

For R(0), we can interpret H as an R(0) maximising agent which X overrides. For R(1), we can interpret H as an R(1) maximising agent which X releases from constraints. And R(2) is the "H is always fully rational" reward. Semantically, each R(i) can be read as a true and natural reward: X is "coercive brain surgery" in the first case, "release H from annoying social obligations" in the second, and "switch which of R(0) and R(1) gives you pleasure" in the third.

But note that there are no semantic implications here: all that we know is H, with its full algorithm. If we wanted to deduce its true reward for the purpose of something like Inverse Reinforcement Learning (IRL), what would it be?

Modelling human (ir)rationality and reward

Now let's talk about the preferences of an actual human. We all know that humans are not always rational (how exactly we know this is a very interesting question that I will be digging into). But even if humans were fully rational, the fact remains that we are physical, and vulnerable to things like coercive brain surgery (and in practice, to a whole host of other more or less manipulative techniques). So there will be the equivalent of "button X" that overrides human preferences. Thus, "not immortal and unchangeable" is in practice enough for the agent to be considered "not fully rational".

Now assume that we've thoroughly observed a given human h (including their internal brain wiring), so we know the human policy π(h) (which determines their actions in all circumstances). This is, in practice, all that we can ever observe: once we know π(h) perfectly, there is nothing more that observing h can teach us (ignore, just for the moment, the question of the internal wiring of h's brain; that might be able to teach us more, but we'll need extra assumptions).

Let R be a possible human reward function, and ℛ the set of such rewards. A human (ir)rationality planning algorithm p (hereafter referred to as a planner) is a map from ℛ to the space of policies (thus p(R) says how a human with reward R will actually behave; for example, this could be bounded rationality, rationality with biases, or many other options). Say that the pair (p,R) is compatible if p(R)=π(h). Thus a human with planner p and reward R would behave as h does.

What possible compatible pairs are there? Here are some candidates:

  • (p(0), R(0)), where p(0) and R(0) are some "plausible" or "acceptable" planner and reward functions (what this means is a big question).
  • (p(1), R(1)), where p(1) is the "fully rational" planner, and R(1) is a reward that fits to give the required policy.
  • (p(2), R(2)), where R(2)= -R(1), and p(2)= -p(1), where the planner -p is defined by (-p)(R) = p(-R); here p(2) is the "fully anti-rational" planner.
  • (p(3), R(3)), where p(3) maps all rewards to π(h), and R(3) is trivial and constant.
  • (p(4), R(4)), where p(4)= -p(0) and R(4)= -R(0).
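To make compatibility concrete, here is a toy Python sketch of pairs 1 to 3 on a tiny hypothetical world with three histories and two actions. The policy `PI_H`, the type aliases, and all function names are invented for illustration; pairs 0 and 4 are omitted because they depend on what counts as a "plausible" planner.

```python
from typing import Callable, Dict

History = str
Action = str
Policy = Dict[History, Action]
Reward = Callable[[History, Action], float]
Planner = Callable[[Reward], Policy]

HISTORIES = ("h1", "h2", "h3")
ACTIONS = ("a", "b")

# Hypothetical observed human policy pi(h).
PI_H: Policy = {"h1": "a", "h2": "b", "h3": "a"}


def negate(reward: Reward) -> Reward:
    """Return the reward -R."""
    return lambda hist, act: -reward(hist, act)


def p_rational(reward: Reward) -> Policy:
    """p(1): pick the reward-maximising action at every history."""
    return {h: max(ACTIONS, key=lambda a: reward(h, a)) for h in HISTORIES}


def p_anti_rational(reward: Reward) -> Policy:
    """p(2) = -p(1): behave as the rational planner would under -R."""
    return p_rational(negate(reward))


def p_constant(reward: Reward) -> Policy:
    """p(3): ignore the reward and always output pi(h)."""
    return dict(PI_H)


def r1(hist: History, act: Action) -> float:
    """R(1): reward 1 for the action pi(h) takes, 0 otherwise."""
    return 1.0 if PI_H[hist] == act else 0.0


r2 = negate(r1)  # R(2) = -R(1)


def r3(hist: History, act: Action) -> float:
    """R(3): trivial constant reward."""
    return 0.0


def compatible(planner: Planner, reward: Reward) -> bool:
    """A pair (p, R) is compatible if p(R) equals pi(h)."""
    return planner(reward) == PI_H


if __name__ == "__main__":
    print(compatible(p_rational, r1))       # True
    print(compatible(p_anti_rational, r2))  # True
    print(compatible(p_constant, r3))       # True
    print(compatible(p_rational, r2))       # False: mismatched pair
```

The last line shows that compatibility is a property of the pair: R(2) is the exact opposite of R(1), yet each is compatible with π(h) once matched with the right planner.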

Distinguishing among compatible pairs

How can we distinguish between compatible pairs? At first appearance, we can't. That's because, by the definition of compatible, all pairs produce the correct policy π(h). And once we have π(h), further observations of h tell us nothing.

I initially thought that Kolmogorov or algorithmic complexity might help us here. But in fact:

Theorem: The pairs (p(i), R(i)), i ≥ 1, are either simpler than (p(0), R(0)), or differ in Kolmogorov complexity from it by a constant that is independent of (p(0), R(0)).

Proof: The cases of i=4 and i=2 are easy, as these differ from i=0 and i=1 by two minus signs. Given (p(0), R(0)), a fixed-length algorithm computes π(h). Then a fixed-length algorithm defines p(3) (by mapping every input to π(h)). Furthermore, given π(h) and any history η, a fixed-length algorithm computes the action a(η) the agent will take; then a fixed-length algorithm defines R(1)(η,a(η))=1 and R(1)(η,b)=0 for b≠a(η).
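The steps of the proof can be restated as a chain of inequalities. This is only a restatement of the argument above; the notation K(·) for Kolmogorov complexity and the constants c, c_1, ..., c_4 (each independent of (p(0), R(0))) are introduced here for illustration.

```latex
\begin{align*}
K(\pi(h)) &\le K(p^{(0)}, R^{(0)}) + c
  && \text{since } \pi(h) = p^{(0)}(R^{(0)}) \\
K(p^{(3)}, R^{(3)}) &\le K(\pi(h)) + c_3
  && p^{(3)} \text{ maps every reward to } \pi(h),\ R^{(3)} \text{ constant} \\
K(p^{(1)}, R^{(1)}) &\le K(\pi(h)) + c_1
  && R^{(1)}(\eta, b) = \mathbb{1}[\, b = a(\eta) \,],\ p^{(1)} \text{ fixed} \\
K(p^{(2)}, R^{(2)}) &\le K(p^{(1)}, R^{(1)}) + c_2
  && \text{two minus signs} \\
K(p^{(4)}, R^{(4)}) &\le K(p^{(0)}, R^{(0)}) + c_4
  && \text{two minus signs}
\end{align*}
```

Chaining the first line with each of the others bounds every K(p(i), R(i)), i ≥ 1, by K(p(0), R(0)) plus a constant.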

So the Kolmogorov complexity can shift between p and R (all in R for i=1,2, all in p for i=3), but it seems that the complexity of the pair doesn't go up during these shifts.

This is puzzling. It seems that, in principle, one cannot assume anything about h's reward at all! R(2)= -R(1), R(4)= -R(0), and p(3) is compatible with any possible reward R. If we give up the assumption of human rationality - which we must - it seems we can't say anything about the human reward function. So it seems IRL must fail.

Yet, in practice, we can and do say a lot about the rationality and reward/desires of various human beings. We talk about ourselves being irrational, as well as others being so. How do we do this? What structure do we need to assume, and is there a way to get AIs to assume the same?

This is the question I'll try to partially answer in subsequent posts, using the anchoring bias as a motivating example. The anchoring bias is one of the clearest of all biases; what is it that allows us to say, with such certainty, that it's a bias (or at least a misfiring heuristic) rather than an odd reward function?

1 comment

Intuitively, the reason that our biases are biases and not a different reward function is because:

  1. I would be happy to get rid of my biases, in that I would accept a well-designed self-modification that removed my biases. ("well-designed" is hiding a lot of complexity, but the point is just that such a self-modification exists.)

  2. The bias applies across a variety of different scenarios with very different reward functions.

The first point suggests that you require that the human values are reflectively stable. In particular, if the human would choose action a in state s under the (M, R) pair, then they should also say that they would choose action a if they were in state s even when you explain what the consequences of action a will be. This is not a good solution -- when people's speech and people's behavior disagree, it's certainly possible that the behavior actually reflects their values and not the speech -- but something along these lines seems important.

I'm more interested in the second point though. Let's consider the setting where you have n different tasks for which you observe the human policy. After running IRL, you have a single rationality model M and multiple rewards R_1 ... R_n. Intuitively, the rationality model M is better if the expected complexity of the inferred reward for a new unseen task is lower. That is, if you sample a new task T_{n+1} from the distribution of tasks and run IRL to estimate R_{n+1} using the existing learned rationality model M, you can expect that R_{n+1} will be simple.

What I'm trying to get at here is that the correct rationality model has a lot more explanatory power. Kolmogorov complexity doesn't really capture that.

Under this definition, it seems likely that you could only get (M(0), R(0)) and (M(4), R(4)), at least out of the compatible pairs you suggested. To break this last tie, perhaps you could add in an assumption that humans are closer to rational than anti-rational on simple tasks.

I do agree that in the fully general case where we observe the full policy for all of human behavior and want to determine all of human values, things get murkier. Some possible answers in this scenario:

  • We put a strong prior on humans making plans hierarchically. This could bring us back to the case where we have multiple tasks.

  • Assume humans are optimal given constraints on their resources (that is, bounded rationality). Then, we only need to infer a reward function and not a rationality model. It is far from obvious that this is anywhere close to accurate as a model of humans, but it seems plausible enough to warrant investigation.

Both of these answers feel very unsatisfying to me though -- they feel like hacks that don't model reality perfectly.

Side note: How do I set my username? I logged in with Facebook and it never asked me for my name (Rohin Shah) and now I'm just "user 264".