Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

There is a No Free Lunch result in value-learning. Essentially, you can't learn the preferences of an agent from its behaviour unless you make assumptions about its rationality, and you can't learn its rationality unless you make assumptions about its preferences.

More importantly, simplicity/Occam's razor/regularisation don't help with this, unlike with most No Free Lunch theorems. Among the simplest explanations of human behaviour are:

  1. We are always fully rational all the time.
  2. We are always fully anti-rational all the time.
  3. We don't actually prefer anything to anything.
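To make that degeneracy concrete, here is a minimal toy sketch (my own illustration, not taken from the underlying paper) in which three maximally simple planner/reward pairs all account for exactly the same observed behaviour, so simplicity alone cannot choose between them:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "world": 4 situations, 3 possible actions, and an arbitrary reward table R[s, a].
n_situations, n_actions = 4, 3
R = rng.normal(size=(n_situations, n_actions))

def rational(reward):
    """Fully rational planner: take the reward-maximising action in each situation."""
    return reward.argmax(axis=1)

def anti_rational(reward):
    """Fully anti-rational planner: take the reward-minimising action in each situation."""
    return reward.argmin(axis=1)

# Observed behaviour: what a rational agent with reward R would do.
observed = rational(R)

# Three (roughly equally simple) explanations of that same behaviour:
explanations = {
    "fully rational, reward R": rational(R),
    "fully anti-rational, reward -R": anti_rational(-R),
    # With a constant reward ("prefers nothing"), every policy is optimal,
    # so the observed behaviour is trivially consistent with that too.
    "indifferent, constant reward": observed,
}

for name, policy in explanations.items():
    print(name, "-> consistent with the behaviour:", bool(np.array_equal(policy, observed)))
```

Picking one of these explanations over the others requires assumptions about rationality or preferences that the behaviour itself does not supply.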

That result, though mathematically valid, seems highly theoretical, and of little practical interest - after all, for most humans, it's obvious what other humans want, most of the time. But I'll argue that the result has strong practical consequences.

Identifying clickbait

Suppose that Facebook or some other corporation decides to cut down on the amount of clickbait on its feeds.

This shouldn't be too hard, the programmers reason. They start by selecting a set of clickbait examples, and check how people engage with these. They programme a neural net to recognise that kind of "engagement" on other posts, which nets a large amount of candidate clickbait. They then go through the candidate posts, labelling the clear examples of clickbait and the clear non-examples, and add these to the training and test sets. They retrain and improve the neural net. A few iterations later, their neural net is well trained, and they let it run on all posts, occasionally auditing the results. Seeking to make the process more transparent, they run interpretability methods on the neural net, seeking to isolate the key components of clickbait, and clear away some errors or over-fits - maybe the equivalent, for clickbait, of removing the "look for images of human arms" in the dumbbell identification nets.
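For concreteness, here is a hedged sketch of that kind of iterative loop. Nothing in it comes from any real system: the data, the numbers, and the programmer_labels function are invented stand-ins. What it is meant to make visible is how often the humans' own judgement (every call to programmer_labels) gets consulted:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for per-post features (engagement statistics, text features, ...): purely synthetic.
posts = rng.normal(size=(5000, 8))

def programmer_labels(x):
    """Hypothetical stand-in for the programmers' judgement of 'clickbait or not'.
    In reality this is a human decision; that it has to be supplied at every
    round is exactly where the programmers' assumptions enter the system."""
    return (x[:, 0] + 0.5 * x[:, 1] > 1.0).astype(int)

# Seed set: a handful of hand-picked clear examples and clear non-examples.
all_labels = programmer_labels(posts)
seed_idx = np.concatenate([np.where(all_labels == 1)[0][:25],
                           np.where(all_labels == 0)[0][:25]])
X, y = posts[seed_idx], all_labels[seed_idx]

clf = LogisticRegression()
for round_number in range(3):
    clf.fit(X, y)
    # Score every post and surface the most confident candidates for review.
    scores = clf.predict_proba(posts)[:, 1]
    candidates = np.argsort(scores)[-200:]
    # Programmers label the clear cases among the candidates and fold them back in.
    X = np.vstack([X, posts[candidates]])
    y = np.concatenate([y, programmer_labels(posts[candidates])])
    print(f"round {round_number}: training set now has {len(y)} labelled posts")
```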

The central issue

Could that method work? Possibly. With enough data and enough programming efforts, it certainly seems that it could. So, what's the problem?

The problem is that so many stages of the process require choices on the part of the programmers: the initial selection of clickbait examples; the labelling of candidates at the second stage; the number of cycles of iteration and improvement; the choice of explicit hyper-parameters and implicit ones (like how long to run each iteration); the auditing process; the selection of key components. All of these rely on the programmers being able to identify clickbait, or the features of clickbait, when they see them.

And that might not sound bad; if we wanted to identify photos of dogs, for example, we would follow a similar process. But there is a key difference. There is a somewhat objective definition of dog (though beware ambiguous cases). And the programmers, when making choices, will be approximating or finding examples of this definition. But there is no objective, semi-objective, or somewhat objective definition of clickbait.

Why? Because the definition of clickbait depends on assessing the preferences of the human that sees it. It can be roughly defined as "something a human is likely to click on (behaviour), but wouldn't really ultimately want to see (preference)".
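That rough definition can be turned into a toy formula (my own formalisation, not the post's or Facebook's): combine a click-probability, which can be estimated from behaviour logs, with an endorsement-probability, which is exactly the part that cannot be read off behaviour without assumptions:

```python
def clickbait_score(p_click: float, p_endorse: float) -> float:
    """High when the user is likely to click, but unlikely to endorse having seen it on reflection."""
    return p_click * (1.0 - p_endorse)

print(clickbait_score(p_click=0.9, p_endorse=0.1))  # ~0.81: classic clickbait
print(clickbait_score(p_click=0.9, p_endorse=0.9))  # ~0.09: content the user genuinely wanted
```

The first factor is behaviour; the second is preference, and estimating it is where the No Free Lunch problem bites.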

And this is an important point: the No Free Lunch theorem applies to humans. So humans can't deduce preferences or rationality from behaviour either, at least not without making assumptions.

So how do we solve the problem? After all, humans do often deduce the preferences and rationality of other humans, and other humans will often agree with them, including the human being assessed. How do we do it?

Well, drumroll, we do it by... making assumptions. And since evolution is so very lazy, the assumptions that humans make - about each other's rationality/preference, about their own rationality/preference - are all very similar. Not identical, of course, but compared with a random agent making random assumptions to interpret the behaviour of another random agent, humans are essentially all the same.

This means that, to a large extent, it is perfectly valid for programmers to use their own assumptions when defining clickbait, or in other situations of assessing the values of others. Indeed, until we solve the issue in general, this may be the only way of doing this; it's certainly the only easy way.

The lesson

So, are there any practical consequences of this? Well, the important thing is that programmers realise they are using their own assumptions, and take these into consideration when programming. Even things that feel like mere "debugging", such as removing obvious failure modes, could be them injecting their assumptions into the system. This has two major consequences:

  1. These assumptions don't form a nice neat category that "carves reality at its joints". Concepts such as "dog" are somewhat ambiguous, but concepts like "human preferences" will be even more so, because they are a series of evolutionary kludges rather than a single natural thing. Therefore we expect that extrapolating programmer assumptions, or moving to a new distribution, will result in bad behaviour that will have to be patched anew with more assumptions.
  2. There are cases when their assumptions and those of the users may diverge; looking out for these situations is important. This is easier if programmers realise they are making assumptions, rather than approximating objectively true categories.
Comments
it is perfectly valid for programmers to use their own assumptions

Looks like "humans consulting HCH" procedure: programmers query their own intuition, consult each other, read books etc. This is why jury is often used in criminal cases: written law is just an approximation of human opinion, so why not ask humans directly?

There are no free lunch theorems "proving" that intelligence is impossible. There is no algorithm that can optimize an arbitrary environment. We display intelligence. The problem with the theorem comes from the part where you assume an arbitrary max-entropy environment, rather than inductive priors. If you assume that human values are simple (low Kolmogorov complexity) and that human behavior is quite good at fulfilling those values, then you can deduce non-trivial values for humans.

If you assume that human values are simple (low Kolmogorov complexity) and that human behavior is quite good at fulfilling those values, then you can deduce non-trivial values for humans.

And you will deduce them wrong. "Human values are simple" pushes you towards "humans have no preferences", and if by "human behavior is quite good at fulfilling those values" you mean something like noisy rationality, then it will go very wrong, see for example https://www.lesswrong.com/posts/DuPjCTeW9oRZzi27M/bounded-rationality-abounds-in-models-not-explicitly-defined

And if instead you mean a proper accounting of bounded rationality, of the difference between anchoring bias and taste, of the difference between system 1 and system 2, of the whole collection of human biases... well, then, yes, I might agree with you. But that's because you've already put all the hard work in.
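One concrete way to see that "noisy rationality" is itself a loaded assumption, using the standard Boltzmann-rationality model (a toy example of mine, not taken from the linked post): the likelihood of the observed behaviour under (R, beta) is bit-for-bit the same as under (-R, -beta), so the model only pins values down once you additionally assume beta > 0, i.e. that behaviour tracks rather than anti-tracks preference:

```python
import numpy as np

def boltzmann_likelihood(actions, rewards, beta):
    """Likelihood of the observed actions under P(a) ∝ exp(beta * R(a)), choices independent."""
    logits = beta * rewards                                    # rewards: (n_choices, n_actions)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return np.prod(probs[np.arange(len(actions)), actions])

rng = np.random.default_rng(0)
R = rng.normal(size=(20, 4))      # reward of each action in 20 choice situations
actions = R.argmax(axis=1)        # an agent that always picks its best action

print(boltzmann_likelihood(actions, R, beta=5.0))     # "competent, good values"
print(boltzmann_likelihood(actions, -R, beta=-5.0))   # "anti-competent, inverted values": same number
```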

I should have been clearer: the point isn't that you get correct values, the point is that you get out of the swath of null or meaningless values and into the merely wrong. While the values gained will be wrong, they would be significantly correlated with our real values; it's the sort of AI that would produce drugged-out brains in vats, or something else that's not what we want, but closer than paperclips. One measure you could use of human effectiveness is: given all possible actions ordered by utility, what percentile do the actions we actually took fall in?

Once we get into this region, it becomes clear that the next task is to fine tune our model of the bounds on human rationality, or figure out how to get an AI to do it for us.
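For what it's worth, the percentile measure suggested above could be written down like this (a sketch of one reading of it; the numbers are invented):

```python
import numpy as np

def action_percentile(utilities_of_all_actions, utility_of_chosen_action):
    """Fraction of the available actions that the chosen action is at least as good as."""
    u = np.asarray(utilities_of_all_actions)
    return float(np.mean(u <= utility_of_chosen_action))

utilities = [0.1, 0.5, 0.2, 0.9, 0.4]
print(action_percentile(utilities, 0.5))  # 0.8: at least as good as 80% of the options
```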

I disagree. I think that if we put a complexity upper bound on human rationality, and assume noisy rationality, then we will get values that are "meaningless" from your perspective.

I'm trying to think of ways we could test this....

Fair warning, the following is pretty sketchy and I wouldn't bet I'd stick with it if I thought a bit longer.

---

Imagine a simple computer running a simple chess playing program. The program uses purely integer computation, except to calculate its reward function and to run minimax over the resulting values, which is done in floating point. The search looks for the move that maximizes the outcome, which corresponds to a win.

This, if I understand your parlance, is ‘rational’ behaviour.

Now consider that the reward is negated, and the planner instead looks for the move that minimizes the outcome.

This, if I understand your parlance, is ‘anti-rational’ behaviour.

Now consider that this anti-rational program is run on a machine where floating point values encoded with a sign bit ‘1’ represent a positive number and those with a ‘0’ sign bit a negative number—the opposite to the standard encoding.

It's the same ‘anti-rational’ program, but exactly the same wires are lit up in the same pattern on this hardware as with the ‘rational’ program on the original hardware.

In what sense can you say the difference between rationality and anti-rationality at all exists in the program (or in humans), rather than in the model of them, when the same wires are both rational and anti-rational? I believe the same dilemma holds for indifferent planners. It doesn't seem like reward functions of the type your paper talks about are a real thing, at least in a sense independent of interpretation, so it makes sense that you struggle to distinguish them when they aren't there to distinguish.
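A toy version of that chess thought experiment (my construction, stripped down to a one-ply search): the "rational" maximiser over a reward and the "anti-rational" minimiser over the negated reward end up computing the same thing and choose the same move:

```python
def best_move(moves, reward, maximise=True):
    """One-ply stand-in for the minimax search: pick the move with the best
    (or, for the anti-rational planner, the worst) reward."""
    key = reward if maximise else (lambda m: -reward(m))
    return max(moves, key=key)

moves = ["e4", "d4", "Nf3", "g4"]
reward = {"e4": 0.6, "d4": 0.5, "Nf3": 0.55, "g4": -0.2}.get

rational_choice = best_move(moves, reward, maximise=True)
anti_rational_choice = best_move(moves, lambda m: -reward(m), maximise=False)
print(rational_choice, anti_rational_choice)  # the same move both times
```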

---

I am tempted to base an argument off the claim that misery is avoided because it's bad rather than being bad because it's avoided. If true, this shortcuts a lot of your concern: reward functions exist only in the map, where numbers and abstract symbols can be flipped arbitrarily, but in the physical world these good and bad states have intrinsic quality to them and can be distinguished meaningfully. Thus the question is not how to distinguish indistinguishable reward functions, but how to understand this aspect of qualitative experience. Then, presumably, if a computer could understand what the experience of unhappiness is like, it would not have to assume our preferences.

This doesn't help solve the mystery. Why couldn't a species evolve to maximise its negative internal emotional states? We can't reasonably have gotten preference and optimization lined up by pure coincidence, so there must be a reason. But it seems like a more reasonable stance to shove the question off into the ineffable mysteries of qualia than to conflate it with a formalism that seems necessarily independent of the thing we're trying to measure.

Imagine [...] distinguish.

It's because of concerns like this that we have to solve the symbol grounding problem for the human we are trying to model; see, eg, https://www.lesswrong.com/posts/EEPdbtvW8ei9Yi2e8/bridging-syntax-and-semantics-empirically

But that doesn't detract from the main point: that simplicity, on its own, is not sufficient to resolve the issue.

But that doesn't detract from the main point: that simplicity, on its own, is not sufficient to resolve the issue.

It kind of does. You have shown that simplicity cannot distinguish (p, R) from (-p, -R), but you have not shown that simplicity cannot distinguish a physical person optimizing competently for a good outcome from a physical person optimizing nega-competently for a bad outcome.

If it seems unreasonable for there to be a difference, consider a similar map-territory distinction of a height map to a mountain. An optimization function that does gradient descent on a height map is the same complexity, or thereabouts, as one that does gradient ascent on the height map's inverse. However, a system that physically does gradient descent on the actual mountain can be much simpler than one that does gradient ascent on the mountain's inverse. Since negative mental experiences are somehow qualitatively different to positive ones, it would not surprise me much if they did in fact effect a similar asymmetry here.
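The map-level half of that analogy can be checked directly (a tiny sketch of my own): gradient descent on a height function and gradient ascent on its negation trace out exactly the same path, so at the level of the description they really are interchangeable; the comment's claim is about the physical systems that realise them:

```python
def grad(f, x, eps=1e-6):
    """Central-difference approximation to f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def follow(f, x, sign, lr=0.1, steps=5):
    """Take `steps` gradient steps on f: downhill if sign=-1, uphill if sign=+1."""
    path = [x]
    for _ in range(steps):
        x = x + sign * lr * grad(f, x)
        path.append(x)
    return path

h = lambda x: (x - 2.0) ** 2     # a toy 1-D "height map" with its valley at x = 2

print(follow(h, 0.0, sign=-1))                 # gradient descent on h
print(follow(lambda x: -h(x), 0.0, sign=+1))   # gradient ascent on -h: identical path
```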

Saying that an agent has a preference/reward R is an interpretation of that agent (similar to the "intentional stance" of seeing it as an agent, rather than a collection of atoms). And the (p,R) and (-p,-R) interpretations are (almost) equally complex.

One of us is missing what the other is saying. I'm honestly not sure what argument you are putting forth here.

I agree that preference/reward is an interpretation (the terms I used were map and territory). I agree that (p,R) and (-p,-R) are approximately equally complex. I do not agree that complexity is necessarily isomorphic between the map and the territory. This means although the model might be a strong analogy when talking about behaviour, it is sketchy to use it as a model for complexity of behaviour.

I tried to answer in more detail here: https://www.lesswrong.com/posts/f5p7AiDkpkqCyBnBL/preferences-as-an-instinctive-stance (hope you didn't mind; I used your comment as a starting point for a major point I wanted to clarify).

But I admit to being confused now, and not understanding what you mean. Preferences don't exist in the territory, so I'm not following you, sorry! :-(

Obviously misery would be avoided because it's bad, not the other way around. We are trying to figure out what is bad by seeing what we avoid. And the problem remains whether we might be accidentally avoiding misery, while trying to avoid its opposite.

Obviously misery would be avoided because it's bad, not the other way around.

As mentioned, this isn't obvious to me, so I'd be interested in your reasoning. Why should evolution build systems that want to avoid intrinsically bad mental states?

We are trying to figure out what is bad by seeing what we avoid. And the problem remains whether we might be accidentally avoiding misery, while trying to avoid its opposite.

Yes, my point here was twofold. One, the formalism used in the paper does not seem to be deeply meaningful, so it would be best to look for some other angle of attack. Two, given the claim about intrinsic badness, the programmer is embedding domain knowledge (about conscious states), not unlearnable assumptions. A computer system would fail to learn this because qualia are a hard problem, not because it's unlearnable. This makes it asymmetric and circumventable in a way that the no free lunch theorem is not.