Crossposted at LessWrong 2.0.

Humans have no values... nor does any agent, unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values.


An agent with no clear preferences

There are three buttons in this world, B(0), B(1), and X, and one agent H.

B(0) and B(1) can be operated by H, while X can be operated by an outside observer. H will initially press button B(0); if ever X is pressed, the agent will switch to pressing B(1). If X is pressed again, the agent will switch back to pressing B(0), and so on. After a large number of turns N, H will shut off. That's the full algorithm for H.
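As a minimal sketch, H's full algorithm can be written in a few lines of Python. The function name `run_h`, the 0-based turn indexing, and the convention that X is pressed just before H acts on a given turn are all assumptions made for illustration; the text above does not fix them.

```python
def run_h(n_turns, x_presses):
    """Return the list of buttons (0 or 1) that H presses over n_turns.

    x_presses is a set of 0-based turn indices at which the outside
    observer presses X just before H acts (this timing is an assumption).
    """
    current = 0  # H initially presses B(0)
    pressed = []
    for turn in range(n_turns):
        if turn in x_presses:
            current = 1 - current  # each press of X toggles H's button
        pressed.append(current)
    return pressed  # after n_turns, H shuts off


print(run_h(6, {2, 4}))  # [0, 0, 1, 1, 0, 0]
```

Nothing in this code mentions a reward: it is only a policy, which is the point of the example.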

So the question is, what are the values/preferences/rewards of H? There are three natural reward functions that are plausible:

  • R(0), which is linear in the number of times B(0) is pressed.
  • R(1), which is linear in the number of times B(1) is pressed.
  • R(2) = I(E,X)R(0) + I(O,X)R(1), where I(E,X) is the indicator function for X being pressed an even number of times, and I(O,X) = 1 - I(E,X) is the indicator function for X being pressed an odd number of times.

For R(0), we can interpret H as an R(0) maximising agent which X overrides. For R(1), we can interpret H as an R(1) maximising agent which X releases from constraints. And R(2) is the "H is always fully rational" reward. Semantically, each R(i) makes sense as a true and natural reward, with X="coercive brain surgery" in the first case, X="release H from annoying social obligations" in the second, and X="switch which of R(0) and R(1) gives you pleasure" in the third.
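To make the three candidates concrete, here is a rough sketch that scores one run of H under each reward. It assumes a per-turn reading of R(2) (on each turn, B(0) is rewarded while X has been pressed an even number of times, and B(1) while odd) and a slope of 1 for the linear rewards; both choices, and all the function names, are illustrative assumptions.

```python
def trajectory(n_turns, x_presses):
    """H's behaviour as (X presses so far, button pressed) pairs."""
    x_count, steps = 0, []
    for turn in range(n_turns):
        if turn in x_presses:
            x_count += 1
        steps.append((x_count, x_count % 2))  # H presses B(parity of X count)
    return steps


def r0(steps):
    return sum(1 for _, button in steps if button == 0)


def r1(steps):
    return sum(1 for _, button in steps if button == 1)


def r2(steps):
    # Per-turn reading (assumption): reward the button matching X's parity.
    return sum(1 for x_count, button in steps if button == x_count % 2)


steps = trajectory(6, {2, 4})
print(r0(steps), r1(steps), r2(steps))  # 4 2 6
```

Under R(2), H collects the maximum possible reward on every turn, which is what "always fully rational" amounts to here. Under R(0) or R(1), the same behaviour falls short of the maximum, and the shortfall has to be explained by X overriding or constraining H.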

But note that there are no semantic implications here: all that we know is H, with its full algorithm. If we wanted to deduce its true reward for the purpose of something like Inverse Reinforcement Learning (IRL), what would it be?


Modelling human (ir)rationality and reward

Now let's talk about the preferences of an actual human. We all know that humans are not always rational (how exactly we know this is a very interesting question that I will be digging into). But even if humans were fully rational, the fact remains that we are physical, and vulnerable to things like coercive brain surgery (and in practice, to a whole host of other more or less manipulative techniques). So there will be the equivalent of "button X" that overrides human preferences. Thus, "not immortal and unchangeable" is in practice enough for the agent to be considered "not fully rational".

Now assume that we've thoroughly observed a given human h (including their internal brain wiring), so we know the human policy π(h) (which determines their actions in all circumstances). This is, in practice, all that we can ever observe - once we know π(h) perfectly, there is nothing more that observing h can teach us (ignore, just for the moment, the question of the internal wiring of h's brain - that might be able to teach us more, but we'll need extra assumptions).

Let R be a possible human reward function, and ℛ the set of such rewards. A human (ir)rationality planning algorithm p (hereafter referred to as a planner) is a map from ℛ to the space of policies (thus p(R) says how a human with reward R will actually behave - for example, this could be bounded rationality, rationality with biases, or many other options). Say that the pair (p,R) is compatible if p(R)=π(h). Thus a human with planner p and reward R would behave as h does.

What possible compatible pairs are there? Here are some candidates:

  • (p(0), R(0)), where p(0) and R(0) are some "plausible" or "acceptable" planners and reward functions (what this means is a big question).
  • (p(1), R(1)), where p(1) is the "fully rational" planner, and R(1) is a reward that fits to give the required policy.
  • (p(2), R(2)), where R(2)= -R(1), and p(2)= -p(1), where -p(R) is defined as p(-R); here p(2) is the "fully anti-rational" planner.
  • (p(3), R(3)), where p(3) maps all rewards to π(h), and R(3) is trivial and constant.
  • (p(4), R(4)), where p(4)= -p(0) and R(4)= -R(0).
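The degenerate pairs in this list can be checked mechanically on a toy example. The sketch below uses a hypothetical three-history, two-action world and a made-up policy `PI_H`; the "fully rational" planner is modelled as a greedy argmax over a one-step reward, which is an assumption rather than a definition taken from the post.

```python
HISTORIES = ["h1", "h2", "h3"]
ACTIONS = ["a", "b"]

# Hypothetical observed policy pi(h): the action taken after each history.
PI_H = {"h1": "a", "h2": "b", "h3": "a"}


def rational(reward):
    """p(1): in each history, pick the action with the highest reward."""
    return {h: max(ACTIONS, key=lambda a: reward[(h, a)]) for h in HISTORIES}


def negate_reward(reward):
    return {key: -value for key, value in reward.items()}


def negate_planner(planner):
    """-p is defined by (-p)(R) = p(-R)."""
    return lambda reward: planner(negate_reward(reward))


def constant_planner(reward):
    """p(3): ignores the reward and always outputs pi(h)."""
    return dict(PI_H)


R1 = {(h, a): 1 if PI_H[h] == a else 0 for h in HISTORIES for a in ACTIONS}
R2 = negate_reward(R1)                                 # R(2) = -R(1)
R3 = {(h, a): 0 for h in HISTORIES for a in ACTIONS}   # trivial, constant

pairs = {
    "(p1, R1)": (rational, R1),
    "(p2, R2)": (negate_planner(rational), R2),
    "(p3, R3)": (constant_planner, R3),
}
for name, (planner, reward) in pairs.items():
    print(name, "compatible:", planner(reward) == PI_H)
```

All three pairs print `compatible: True`, even though R1 and R2 are exact opposites and R3 carries no information at all.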


Distinguishing among compatible pairs

How can we distinguish between compatible pairs? At first appearance, we can't. That's because, by the definition of compatible, all pairs produce the correct policy π(h). And once we have π(h), further observations of h tell us nothing.

I initially thought that Kolmogorov or algorithmic complexity might help us here. But in fact:

Theorem: The pairs (p(i), R(i)), i ≥ 1, are either simpler than (p(0), R(0)), or differ in Kolmogorov complexity from it by a constant that is independent of (p(0), R(0)).

Proof: The cases of i=4 and i=2 are easy, as these differ from i=0 and i=1 by two minus signs. Given (p(0), R(0)), a fixed-length algorithm computes π(h). Then a fixed-length algorithm defines p(3) (by mapping every input to π(h)). Furthermore, given π(h) and any history η, a fixed-length algorithm computes the action a(η) the agent will take; then a fixed-length algorithm defines R(1) by R(1)(η,a(η))=1 and R(1)(η,b)=0 for b≠a(η).
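The "fixed-length algorithm" steps of the proof can be illustrated as small wrappers whose size does not depend on how complicated the policy is. This is only a sketch: the helper names are invented, the rational planner is again assumed to be a greedy one-step argmax, and the toy policy used to exercise the wrappers is agent H from the first section, with a history encoded as a tuple of booleans recording whether X was pressed on each turn.

```python
def make_indicator_reward(policy):
    """R(1)(history, action) = 1 if action == policy(history), else 0."""
    def reward(history, action):
        return 1 if action == policy(history) else 0
    return reward


def make_constant_planner(policy):
    """p(3): maps every reward to the same policy."""
    def planner(_reward):
        return policy
    return planner


def greedy(reward, actions):
    """p(1), modelled (as an assumption) as a one-step argmax planner."""
    def policy(history):
        return max(actions, key=lambda a: reward(history, a))
    return policy


def h_policy(history):
    """Agent H: press B(parity of the number of X presses so far)."""
    return sum(history) % 2


r1 = make_indicator_reward(h_policy)
recovered = greedy(r1, actions=[0, 1])
p3 = make_constant_planner(h_policy)

histories = [(), (True,), (True, False), (True, True), (False, True, True, True)]
print(all(recovered(h) == h_policy(h) for h in histories))  # True
print(all(p3(None)(h) == h_policy(h) for h in histories))   # True
```

Swapping `h_policy` for any other policy leaves the three wrappers unchanged, which is the informal content of "differ by a constant".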


So the Kolmogorov complexity can shift between p and R (all in R for i=1,2, all in p for i=3), but it seems that the complexity of the pair doesn't go up during these shifts.

This is puzzling. It seems that, in principle, one cannot assume anything about h's reward at all! R(2)= -R(1), R(4)= -R(0), and p(3) is compatible with any possible reward R. If we give up the assumption of human rationality - which we must - it seems we can't say anything about the human reward function. So it seems IRL must fail.

Yet, in practice, we can and do say a lot about the rationality and reward/desires of various human beings. We talk about ourselves being irrational, as well as others being so. How do we do this? What structure do we need to assume, and is there a way to get AIs to assume the same?

This is the question I'll try and partially answer in subsequent posts, using the anchoring bias as a motivating example. The anchoring bias is one of the clearest of all biases; what is it that allows us to say, with such certainty, that it's a bias (or at least a misfiring heuristic) rather than an odd reward function?

36 comments

Perhaps I'm missing something, but it seems like "agent H" has nothing to do with an actual human, and that the algorithm and environment as given support even less analogy to a human than a thermostat.

Thus, proofs about such a system are of almost no relevance to moral philosophy or agent alignment research?

Thermostats connected to heating and/or cooling systems are my first go-to example for asking people where they intuitively experience the perception of agency or goal seeking behavior. I like using thermostats as the starting point because:

  1. Their operation has clear connections to negative feedback loops and thus obvious "goals" because they try to lower the temperature when it is too hot and try to raise the temperature when it is too cold.

  2. They have internally represented goals, because their internal mechanisms can be changed by exogenous-to-the-model factors that change their behavior in response to otherwise identical circumstances. Two thermostats placed close together with non-overlapping target ranges automatically end up fighting, without any need for complex philosophy.

  3. They have a natural measure of "optimization strength" in the form of the wattage of their heating and cooling systems, which can be adequate or inadequate relative to changes in the ambient temperature.

  4. They require a working measurement component that detects ambient temperature, giving a very limited analogy for "perception and world modeling". If two thermostats are in a fight, a "weak and fast" thermostat can use a faster sampling rate to get a headstart on the "slower stronger" thermostat that put the temperature where it wanted and then rested for 20 minutes before measuring again. This would predictably give a cycle of temporary small victories for the fast one that turn into wrestling matches that it always loses, over and over.
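The fight described in the last point can be sketched as a toy simulation. Every number here (the target temperatures, the degrees per minute each thermostat can move the room, the one-minute time step) is an assumption chosen only to make the cycle visible; the only detail taken from the description is the 20-minute rest of the strong thermostat.

```python
def step_toward(temp, target, power):
    """Move temp toward target by at most `power` degrees."""
    if temp < target:
        return min(target, temp + power)
    return max(target, temp - power)


def simulate(minutes, weak_target=18.0, strong_target=25.0,
             weak_power=0.5, strong_power=2.0, rest=20):
    temp = strong_target   # start where the strong thermostat left the room
    strong_active = False
    rest_left = rest
    temps = []
    for _ in range(minutes):
        # The weak thermostat samples and acts every minute.
        temp = step_toward(temp, weak_target, weak_power)
        if strong_active:
            temp = step_toward(temp, strong_target, strong_power)
            if temp == strong_target:       # job done: rest again
                strong_active, rest_left = False, rest
        else:
            rest_left -= 1
            if rest_left == 0:              # time to measure again
                if temp != strong_target:
                    strong_active = True
                else:
                    rest_left = rest
        temps.append(temp)
    return temps


temps = simulate(50)
print(min(temps), temps[24], temps[49])  # 18.0 25.0 25.0
```

With these made-up parameters the weak thermostat drags the room down to its own target while the strong one rests, then loses the wrestling match within a few minutes, and the 25-minute cycle repeats.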

I personally bite the bullet and grant that thermostats are (extremely minimal) agents with (extremely limited) internal experiences, but I find that most people I talk about this with do not feel comfortable admitting that these might be "any kind of agent".

Yet the thermostat clearly has more going on than "agent H" in your setup.

A lot of people I talk with about this are more comfortable with a basic chess bot architecture than a thermostat, when talking about the mechanics of agency, because:

  1. Chess bots consider more than simple binary actions.

  2. Chess bots generate iterated tree-like models of the world and perform the action that seems likely to produce the most preferred expected long term consequence.

  3. Chess bots prune possible futures such that they try not to do things that hostile players could exploit now or in the iterated future, demonstrating a limited but pragmatically meaningful theory of mind.

Personally, I'm pretty comfortable saying that chess bots are also agents, and they are simply a different kind of agent than a thermostat, and they aren't even strictly "better" than thermostats because thermostats have a leg up on them in having a usefully modifiable internal representation of their goals, which most chess bots lack!

An interesting puzzle might be how to keep much of the machinery of chess, but vary the agents during the course of their training and development so that they have skillful behavioral dynamics while different chess bots' skills are organized around different preferences. For example, a bot might most prefer to checkmate the opponent while still holding both of its own bishops; lower down its hierarchy of preferences would be being checkmated while retaining both bishops; and even further down would be losing any bishops and also being checkmated.

Imagine a tournament of 100 chess bots where the rules of chess are identical for everyone, but some of the players are in some sense "competing in different games" due to a higher level goal of beating the chess bots that have the same preferences as them. So there might be bishop keepers, bishop hunters, queen keepers, queen hunters, etc.

Part of the tournament rules is that it would not be public knowledge who is in which group (though the parameters of knowledge could be an experimental parameter).

And in a tournament like that I'm pretty sure that any extremely competitive bishop-keeping chess bot would find it very valuable to be able to guess, from observing the opponent's early moves, that in a specific game it might be playing a rook-hunting chess bot that would prefer to capture its rook and then be checkmated rather than officially "tie the game" without ever capturing one of its rooks.

In a tournament like this, keeping your true preferences secret and inferring your opponent's true preferences would both be somewhat useful.

Some overlap in the game should always exist (like preference for win > tie > lose all else equal) and competition on that dimension would always exist.

Then if any AgentAlice knows AgentBob's true preferences she can probably see deeper into the game tree than otherwise by safely pruning more lines of play out of the tree, and having a better chance of winning. On the other hand mutual revelation of preferences might allow gains from trade, so it isn't instantly clear how to know when to reveal preferences and when to keep them cryptic...

Also, probably chess is more complicated than is conceptually necessary. Qubic (basically tic tac toe on a 4x4x4 grid) probably has enough steps and content to allow room for variations in strategy (liking to have played in corners, or whatever) so that the "preference" aspects could hopefully dominate the effort put into it rather than demanding extensive and subtle knowledge of chess.

Since qubic was solved at least as early as 1992, it should probably be easier to prove things about "qubic with preferences" using the old proofs as a starting point. Also it is probably a good idea to keep in mind which qubic preferences are instrumentally entailed by the pursuit of basic winning, so that preferences inside and outside those bounds get different logical treatment :-)

Thanks! But H is used as an example, not a proof.

And the chessbots actually illustrate my point - is a bishop-retaining chessbot actually intending to retain their bishop, or is it an agent that wants to win, but has a bad programming job which inflates the value of bishops?

I think we should use "agent" to mean "something that determines what it does by expecting that it will do that thing," rather than "something that aims at a goal." This explains why we don't have exact goals, but also why we "kind of" have goals: because our actions look like they are directed to goals, so that makes "I am seeking this goal" a good way to figure out what we are going to do, that is, a good way to determine what to expect ourselves to do, which makes us do it.

Seems a reasonable way of seeing things, but not sure it works if we take that definition too formally/literally.

I haven't really finished thinking about this yet but it seems to me it might have important consequences. For example, the AI risk argument sometimes takes it for granted that an AI must have some goal, and then basically argues that maximizing a goal will cause problems (which it would, in general.) But using the above model suggests something different might happen, not only with humans but also with AIs. That is, at some point an AI will realize that if it expects to do A, it will do A, and if it expects to do B, it will do B. But it won't have any particular goal in mind, and the only way it will be able to choose a goal will be thinking about "what would be a good way to make sense of what I am doing?"

This is something that happens to humans with a lot of uncertainty: you have no idea what goal you "should" be seeking, because really you didn't have a goal in the first place. If the same thing happens to an AI, it will likely seem even more undermotivated than humans do, because we have at least vague and indefinite goals that were set by evolution. The AI on the other hand will just have whatever it happened to be doing up until it came to that realization to make sense of itself.

This suggests the orthogonality thesis might be true, but in a weird way. Not that "you can make an AI that seeks any given goal," but that "Any AI at all can seek any goal at all, given the right context." Certainly humans can; you can convince them to do any random thing, in the right context. In a similar way, you might be able to make a paperclipper simply by asking it what actions would make the most paperclips, and doing those things. Then when it realizes that different answers will cause different effects, it will just say to itself, "Up to now, everything I've done has tended to make paperclips. So it makes sense to assume that I will always maximize paperclips," and then it will be a paperclipper. But on the other hand if you never use your AI for any particular goal, but just play around with it, it will not be able to make sense of itself in terms of any particular goal besides playing around. So both evil AIs and non-evil AIs might be pretty easy to make (much like with humans.)

Initially I wrote a response spelling out in excruciating detail an example of a decent chess bot playing the final moves in a game of Preference Chess, ending with "How does this not reveal an extremely clear example of trivial preference inference, what am I missing?"

Then I developed the theory that what I'm missing is that you're not talking about "how preference inference works" but more like "what are extremely minimalist preconditions for preference inference to get started".

And given where this conversation is happening, I'm guessing that one of the things you can't take for granted is that the agent is at all competent, because sort of the whole point here is to get this to work for a super intelligence looking at a relatively incompetent human.

So even if a Preference Chess Bot has a board situation where it is one move away from winning, losing, or taking another piece that it might prefer to take... no matter what move the bot actually performs you could argue it was just a mistake because it couldn't even understand the extremely short run tournament level consequences of whatever Preference Chess move it made.

So I guess I would argue that even if any specific level of stable state intellectual competence or power can't be assumed, you might be able to get away with a weaker assumption of "online learning"?

It will always be tentative, but I think it buys you something similar to full rationality that is more likely to be usefully true of humans. Fundamentally you could use "an online learning assumption" to infer "regret of poorly chosen options" from repetitions of the same situation over and over, where either similar or different behaviors are observed later in time.

To make the agent have some of the right resonances... imagine a person at a table who is very short and wearing a diaper.

The person's stomach noisily grumbles (which doesn't count as evidence-of-preference at first).

They see in front of them a cupcake and a cricket (the eyes looking at both is somewhat important because it means they could know that a choice is even possible, allowing us to increment the choice event counter here).

They put the cricket in their mouth (which doesn't count as evidence-of-preference at first).

They cry (which doesn't count as evidence-of-preference at first).

However, we repeat this process over and over and notice that by the 50th repetition they are reliably putting the cupcake in their mouth and smiling afterwards. So we use the relatively weak "online learning assumption" to say that something about the cupcake choice itself (or the cupcake's second order consequences that the person may think semi-reliably happen) is more preferred than the cricket.
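One very rough way to turn this into a procedure is to compare how often each option is chosen early on with how often it is chosen later. The sketch below is only an illustration of that idea: the function name, the window size of 10 trials, and the made-up record of 50 choices are all assumptions, and it ignores side channels such as crying or smiling.

```python
from collections import Counter


def inferred_preference(choices, window=10):
    """Return the option whose choice rate rose most between the first
    `window` trials and the last `window` trials, or None if none rose."""
    if not choices:
        return None
    early = Counter(choices[:window])
    late = Counter(choices[-window:])
    change = {option: (late[option] - early[option]) / window
              for option in set(choices)}
    best = max(change, key=change.get)
    return best if change[best] > 0 else None


# Hypothetical record: mostly cricket at first, reliably cupcake by trial 50.
choices = ["cricket"] * 8 + ["cupcake", "cricket"] * 6 + ["cupcake"] * 30
print(inferred_preference(choices))  # cupcake
```

The inference is tentative by design: an option that is chosen at a constant rate throughout yields `None`, since under the online learning assumption only a change in rate counts as evidence.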

Also, the earlier crying and later smiling begin to take on significance as either side channel signals of preference (or perhaps they are the actual thing that is really being pursued as a second order consequence?) because of the proximity of the cry/smile actions reliably coming right after the action whose rate changes over time from rare to common.

The development of theories about side channel information could make things go faster as time goes on. It might even become the dominant mode of inference, up to the point where it starts to become strategic, as with lying about one's goals in competitive negotiation contexts becoming salient once the watcher and actor are very deep into the process...

However, I think your concern is to find some way to make the first few foundational inferences in a clear and principled way that does not assume mutual understanding between the watcher and the actor, and does not assume perfect rationality on the part of the actor.

So an online learning assumption does seem to enable a tentative process, that focuses on tiny little recurring situations, and the understanding of each of these little situations as a place where preferences can operate causing changes in rates of performance.

If a deeply wise agent is the watcher, I could imagine them attempting to infer local choice tendencies in specific situations and envisioning how "all the apparently preferred microchoices" might eventually chain together into some macro scale behavioral pattern. The watcher might want to leap to a conclusion that the entire chain is preferred for some reason.

It isn't clear that the inference to the preference for the full chain of actions would be justified, precisely because of the assumption of the lack of full rationality.

The watcher would want to see the full chain start to occur in real life, and to become more common over time when chain initiation opportunities presented themselves.

Even then, the watcher might even double check by somehow adding signposts to the actor's environment, perhaps showing the actor pictures of the 2nd, 4th, 8th, and 16th local action/result pairs that it thinks are part of a behavioral chain. The worry is that the actor might not be aware how predictable they are and might not actually prefer all that can be predicted from their pattern of behavior...

(Doing the signposting right would require a very sophisticated watcher/actor relationship, where the watcher had already worked out a way to communicate with the actor, and observed the actor learning that the watcher's signals often functioned as a kind of environmental oracle for how the future could go, with trust in the oracle and so on. These preconditions would all need to be built up over time before post-signpost action rate increases could be taken as a sign that the actor preferred performing the full chain that had been signposted. And still things could be messed up if "hostile oracles" were in the environment such that the actor's trust in the "real oracle" is justifiably tentative.)

One especially valuable kind of thing the watcher might do is to search the action space for situations where a cycle of behavior is possible, with a side effect each time through the loop, and to put this loop and the loop's side effect into the agent's local awareness, to see if maybe "that's the point" (like a loop that causes the accumulation of money, and after such signposting the agent does more of the thing) or maybe "that's a tragedy" (like a loop that causes the loss of money, that might be a dutch booking in progress, and after signposting the agent does less of the thing).

Is this closer to what you're aiming for? :-)

I'm sorry, I have trouble following long posts like that. Would you mind presenting your main points in smaller, shorter posts? I think it would also make debate/conversation easier.

I'll try to organize the basic thought more cleanly, and will comment here again with a link to the better version when it is ready :-)

the question is, what are the values/preferences/rewards of H?

Why isn't the answer "None. This framework is not applicable"?

I have a pen. It leaves marks on paper. If I press a button the tip retracts and it no longer leaves marks on paper. What are the values/preferences/rewards of my pen?

This framework is not applicable

Then what are the requirements for the framework to be applicable? Many human values, the ones we haven't self-analysed much, behave like H and its buttons: swayed by random considerations that we're not sure are value-relevant or not.

I think there are two ways that a reward function can be applicable:

1) For making moral judgements about how you should treat your agent. Probably irrelevant for your button presser unless you're a panpsychist.

2) If the way your agent works is by predicting the consequences of its actions and attempting to pick an action that maximises some reward (eg a chess computer trying to maximise its board valuation function). Your agent H as described doesn't work this way, although as you note there are agents which do act this way and produce the same behaviour as your H.

There's also the kind-of option:

3) Anything can be modelled as if it had a utility function, in the same way that any solar system can be modelled as a geocentric one with enough epicycles. In this case there's no "true" reward function, just "the reward function that makes the maths I want to do as easy as possible". Which one that is depends on what you're trying to do, and maybe pretending there's a reward function isn't actually better than using H's true non-reward-based algorithm.

My "solution" does use 2), and should be posted in the next few days (maybe on lesswrong 2 only - are you on that?)

what are the requirements for the framework to be applicable?

This framework lives in the map, not in the territory. It is a model feature, applicable when it makes a model more useful. Specifically, it makes sense when the underlying reality is too complex to deal with directly. Because of the complexity we, basically, reduce the dimensionality of the problem by modeling it as a simpler combination of aggregates. "Values" are one kind of such aggregates.

If you have an uncomplicated algorithm with known code, you don't need such simplifying features.

It is partly in the territory, and comes with the situation where you are modeling yourself. In that situation, the thing will always be "too complex to deal with directly," regardless of its absolute level of complexity.

comes with the situation where you are modeling yourself

Maybe, but that's not the context in this thread.

Isn't a big part of the problem the fact that you only have conscious access to a few things? In other words, your actions are determined in many ways by an internal economy that you are ignorant of (e.g. mental energy, physical energy use in the brain, time and space etc. etc.) These things are in fact value relevant but you do not know much about them so you end up making up reasons why you did what you did.

The implied argument that "we cannot prove X, therefore X cannot be true or false" is not logically valid. I mentioned this recently when Caspar made a similar argument.

I think it is true, however, that humans do not have utility functions. I would not describe that, however, by saying that humans are not rational; on the contrary, I think pursuing utility functions is the irrational thing.

In practice, "humans don't have values" and "humans have values, but we can never know what they are" are not meaningfully different.

I also wouldn't get too hung up on utility functions; a utility function just means that the values don't go wrong when an agent tries to be consistent and avoid money pumps. If we want to describe human values, we need to find values that don't go crazy when transformed into utility functions.

If we want to describe human values, we need to find values that don't go crazy when transformed into utility functions.

That seems misguided. If you want to describe human values, you need to describe them as you find them, not as you would like them to be.

I would add that values are probably not actually existing objects but just useful ways to describe human behaviour. Thinking that they actually exist is mind projection fallacy.

In the world of facts we have: human actions, human claims about the actions and some electric potentials inside human brains. It is useful to say that a person has some set of values to predict his behaviour or to punish him, but it doesn't mean that anything inside his brain is "values".

If we start to think that values actually exist, we start to have all the problems of finding them, defining them and copying into an AI.

The problem with your "in practice" argument is that it would similarly imply that we can never know if someone is bald, since it is impossible to give a definition of baldness that rigidly separates bald people from non-bald people while respecting what we mean by the word. But in practice we can know that a particular person is bald regardless of the absence of that rigid definition. In the same way a particular person can know that he went to the store to buy milk, even if it is theoretically possible to explain what he did by saying that he has an abhorrence of milk and did it for totally different reasons.

Likewise, in practice we can avoid money pumps by avoiding them when they come up in practice. We don't need to formulate principles which will guarantee that we will avoid them.

A person with less than 6% hair is bald, a person with 6% - 15% hair might be bald, but it is unknowable due to the nature of natural language. A person with 15% - 100% hair is not bald.

We can't always say whether someone is bald, but more often, we can. Baldness remains applicable.

We can make similar answers about people's intentions.

In the same way a particular person can know that he went to the store to buy milk

Yes. Isn't this fascinating? What is going on in human minds that, not only can we say stuff about our own values and rationality, but about those of other humans? And can we copy that into an AI somehow?

That will be the subject of subsequent posts.

What about a situation where a person says and thinks that he is going to buy milk, but actually buys milk plus some sweets? And does so often, but does not acknowledge compulsive-obsessive behaviour towards sweets?

They don't have to acknowledge compulsive-obsessive behavior. Obviously they want both milk and sweets, even if they don't notice wanting the sweets. That doesn't prevent other people from noticing it.

Also, they may be lying, since they might think that liking sweets is low status.

I think you proved that values can't exist outside a human mind, and it is a big problem for the idea of value alignment.

The only solution I see is: don't try to extract values from the human mind, but try to upload a human mind into a computer. In that case, we kill two birds with one stone: we have some form of AI, which has human values (whatever they are), and it also has common sense.

Using an upload as an AI safety solution may also make foom-style self-improvement difficult, as the upload's internal structure is messy and incomprehensible to a normal human mind. So it is intrinsically safe, and the only known workable solution to AI safety.

However, there are (at least) two main problems with this solution to AI safety: it may give rise to neuromorphic non-human AIs, and it does not prevent the later appearance of a pure AI, which will foom and kill everybody.

The solution I see is to use the first human upload as an AI Nanny or AI police, which will prevent the appearance of any other, more sophisticated AIs elsewhere.

We can and do make judgements about rationality and values. Therefore I don't see why AIs need fail at it. I'm starting to get a vague idea how to proceed... Let me work on it for a few more days/weeks, then I'll post it.

We can and do make judgements about rationality and values.

How do you know this is true? Perhaps we make judgements about predicted behaviors and retrofit stories about rationality and values onto that.

How do you know this is true?

By introspection?

In these matters, introspection is fairly suspect. And simply unavailable when talking about humans other than oneself (which I think Stuart is doing, maybe I misread).

We're talking about "mak[ing] judgements about rationality and values". That's entirely SOP for humans and introspection allows you to observe it in real time. This is not some kind of an unconscious/hidden/masked activity.

Moreover other humans certainly behave as if they make judgements about rationality (usually expressed as "this makes {no} sense") and values of others. They even openly verbalise these judgements.

May I suggest a test for any such future model? It should take into account that I have unconscious sub-personalities which affect my behaviour but which I don't know about.

That is a key feature.

Also, the question was not whether I could judge others' values, but whether it is possible to prove that an AI has the same values as a human being.

Or are you going to prove the equality of two value systems while at least one of them remains unknowable?

I'm more looking at "formalising human value-like things, into something acceptable".