Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The Tails Come Apart As Metaphor For Life, but with an extra pun.

Suppose you task your friends with designing the Optimal Meal: the meal that maximizes utility, in virtue of its performance at the usual roles food fills for us. We leave aside considerations such as sourcing the ingredients ethically, or writing the code for an FAI on the appetizer in tiny ketchup print, or injecting the lettuce with nanobots that will grant the eater eternal youth, and solely concern ourselves with arranging atoms to get a good meal qua meal.

So you tell your friends to plan the best meal possible, and they go off and think about it. One comes back and tells you that their optimal meal is like one of those modernist 30-course productions, where each dish is a new and exciting adventure. The next comes back and says that their optimal meal is mostly just a big bowl of their favorite beef stew, with some fresh bread and vegetables.

To you, both of these meals seem good - certainly better than what you've eaten recently. But then you start worrying that if this meal is important, then the difference in utility between the two proposed meals might be large, even though they're both better than the status quo (say, cold pizza). In a phrase, gastronomical waste. But then how do you deal with the fact that different people have chosen different meals? Do you just have to choose one yourself?

Now your focus turns inward, and you discover a horrifying fact. You're not sure which meal you think is better. You, as a human, don't have a utility function written down anywhere, you just make decisions and have emotions. And as you turn these meals over in your mind, you realize that different contexts, different fleeting thoughts or feelings, different ways of phrasing the question, or even just what side of the bed you got up on that morning, might influence you to choose a different meal at a point of decision, or rate a meal differently during or after the fact.

You contain within yourself the ability to justify either choice, which is remarkably like being unable to justify either choice. This "Optimal Meal" was a boondoggle all along. Although you can tell that either would be better than going home and eating cold pizza, there was never any guarantee that your "better" was a total ordering of meals, not merely a partial ordering.

Then, disaster truly strikes. Your best friend asks you "So, what do you want to eat?"

You feel trapped. You can't decide. So you call your mom. You describe to her these possible meals, and she listens to you and makes sympathetic noises and asks you about the rest of your day. And you tell her that you're having trouble choosing and would like her help, and so she thinks for a bit, and then she tells you that maybe you should try the modernist 30-course meal.

Then you and your friends go off to the Modernism Bistro, and you have a wonderful time.

This is a parable about how choosing the Optimal Arrangement Of All Atoms In The Universe is an impossible moral problem. Accepting this as a given, what kind of thing is happening when we accept the decision of some authority (superhuman AI or otherwise) as to what should be done with those atoms?

When you were trying to choose what to eat, there was no uniquely right choice, but you still had to make a choice anyhow. If some moral authority (e.g. your mom) makes a sincere effort to deliberate on a difficult problem, this gives you an option that you can accept as "good enough," rather than "a waste of unknowable proportions."

How would an AI acquire this moral authority stuff? In the case of humans, we can get moral authority by:

  • Taking on the social role of the leader and organizer
  • Getting an endorsement or title from a trusted authority
  • Being the most knowledgeable or skilled at evaluating a certain problem
  • Establishing personal relationships with those asked to trust us
  • Having a track record of decisions that look good in hindsight
  • Being charismatic and persuasive

You might think "Of course we shouldn't trust an AI just because it's persuasive." But in an important sense, none of these reasons is good enough. We're talking about trusting something as an authority on an impossible problem, here.

A good track record on easier problems is a necessary condition to even be thinking about the right question, true. I'm not advocating that we fatalistically accept some random nonsense as the meaning of life. The point is that even after we try our hardest, we (or an AI making the choice for us) will be left in the situation of trying to decide between Optimal Meals, and narrowing this choice down to one option shouldn't be thought of as a continuation of the process that generated those options.

If after dinner, you called your mom back and said "That meal was amazing - but how did you figure out that was what I really wanted?", you would be misunderstanding what happened. Your mom didn't solve the problem of underdetermination of human values, she just took what she knew of you and made a choice - an ordinary, contingent choice. Her role was never to figure out what you "really wanted," it was to be an authority whose choice you and your friends could accept.

So there are two acts of trust that I'm thinking about this week. The first is how to frame friendly AI as a trusted authority rather than an oracle telling us the one best way to arrange all the atoms. And the second is how a friendly AI should trust its own decision-making process when it does meta-ethical reasoning, without assuming that it's doing what humans uniquely want.

8 comments

It seems to me that with meals, there's a fact of the matter that AI could help with. After all, if two copies of you went and had the different meals, one of them would probably be happier than the other.

Though that happiness might not depend only on the chosen meal. For example, if one meal is a cake that looks exactly like Justin Bieber, that might be actually not as fun as it sounds. But if you skipped it in favor of an ordinary burrito, you'd forever regret that you didn't get to see the Justin Bieber cake.

I think you still hit the multidimensional comparison problem. One of the branches may be happier, the other more satisfied. One might spend the rest of the evening doing some valuable work, the other didn't have as much time. One consumed a slightly better mix of protein and fat, the other got the right amount of carbs and micronutrients.

Without a utility function to tell you the coefficients, there _IS_ no truth of the matter which is better.

Edit: this problem also hits intertemporally, even if you solve the comparability problem for multiple dimensions of utility. The one that's best when eating may be different than the one that you remember most pleasurably afterward, which may be different than the one you remember most pleasantly a year from now.
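As a rough illustration of the multidimensional comparison point above, here is a minimal Python sketch. The meal names echo the post, but the dimensions (`novelty`, `comfort`, `nutrition`), the scores, and the weights are all made-up assumptions for illustration, not anything measured. It shows that dominance alone yields only a partial ordering (both candidate meals beat cold pizza, but neither beats the other), and that a ranking between them appears only once some weights are supplied, with different weights giving different winners.

```python
from typing import Dict

Meal = Dict[str, float]

# Hypothetical scores on three dimensions; all numbers are invented.
MEALS: Dict[str, Meal] = {
    "modernist_30_course": {"novelty": 9.0, "comfort": 4.0, "nutrition": 6.0},
    "beef_stew": {"novelty": 3.0, "comfort": 9.0, "nutrition": 7.0},
    "cold_pizza": {"novelty": 1.0, "comfort": 3.0, "nutrition": 2.0},
}


def dominates(a: Meal, b: Meal) -> bool:
    """True if a is at least as good as b on every dimension
    and strictly better on at least one (Pareto dominance)."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def weighted_utility(meal: Meal, weights: Dict[str, float]) -> float:
    """Collapse the dimensions into one number using the given coefficients."""
    return sum(weights[k] * meal[k] for k in meal)


def best_meal(weights: Dict[str, float]) -> str:
    """Name of the meal with the highest weighted utility under these weights."""
    return max(MEALS, key=lambda name: weighted_utility(MEALS[name], weights))


# Two hypothetical sets of coefficients, e.g. two moods of the same person.
NOVELTY_MOOD = {"novelty": 0.6, "comfort": 0.2, "nutrition": 0.2}
COMFORT_MOOD = {"novelty": 0.2, "comfort": 0.6, "nutrition": 0.2}

if __name__ == "__main__":
    fancy, stew, pizza = (
        MEALS["modernist_30_course"],
        MEALS["beef_stew"],
        MEALS["cold_pizza"],
    )
    print("fancy beats pizza:", dominates(fancy, pizza))
    print("stew beats pizza:", dominates(stew, pizza))
    print("fancy beats stew:", dominates(fancy, stew))
    print("stew beats fancy:", dominates(stew, fancy))
    print("best in novelty mood:", best_meal(NOVELTY_MOOD))
    print("best in comfort mood:", best_meal(COMFORT_MOOD))
```

Nothing in the sketch says which set of weights is the right one; that choice is exactly the part that the dominance relation leaves open.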

I think all of your reasons for how a human comes to have moral authority boil down to something like having a belief that doing what this authority says is expected to be good (to have positive valence, in my current working theory of values). This perhaps gives a way of reframing alignment as the problem of constructing an agent to whom you would give moral authority to decide for you, rather than, as we normally do, as the problem of constructing an agent that is value aligned.

I'm curious about the source of your intuition that we are obligated to make an optimal selection. You mention that the utility difference between two plausibly best meals could be large, which is true, especially when we drop the metaphor and reflect on the utility difference between two plausibly best FAI value schemes. And I suppose that, taken literally, the utilitarian code urges us to maximize utility, so leaving any utility on the table would technically violate utilitarianism.

On a practical level, though, I'm usually not in the habit of nitpicking people who do things for me that are sublimely wonderful yet still marginally short of perfect, and I try not to criticize people who made a decision that was plausibly the best available decision simply because some other decision was also plausibly the best available decision. If neither of us can tell for sure which of two options is the best, and our uncertainty isn't of the kind that seems likely to be resolvable by further research, then my intuition is that the morally correct thing to do is just pick one and enjoy it, especially if there are other worse options that might fall upon us by default if we dither for too long.

I agree with you that a trusted moral authority figure can make it easier for us to pick one of several plausibly best options...but I disagree with you that such a figure is morally necessary; instead, I see them as useful moral support for an action that can be difficult due to a lack of willpower or self-confidence. Ideally, I would just always pick a plausibly best decision by myself; since that's hard and I am a human being who sometimes experiences angst, it's nice when my friends and my mom help me make hard decisions. So the role of the moral authority, in my view, isn't that they justify a hard decision, causing it to become correct where it was not correct prior to their blessing; it's that the moral authority eases the psychological difficulty of making a decision that was hard to accept but that was nevertheless correct even without the authority's blessing.

Yes, I agree with everything you said... until the last sentence ;)

In-parable, your mom is mostly just there as moral support. Neither of you is doing cognitive work the other couldn't. But an aligned AI might have to do a lot of hard work figuring out what some candidate options for the universe even are, and if we want to check its work it will probably have to break big complicated visions of the future into human-comprehensible mouthfuls. So we really will need to extend trust to it - not as in trusting that whatever it picks is the only right thing, but trusting it to make decent decisions even in domains that are too complicated for human oversight.

This seems to me like a "you do not understand your own values well enough" problem, not a "you need a higher moral authority to decide for you" problem.

Or, if we dissolve the idea of your values as something that produces some objective preference ordering (which I suppose is the point of this post): you lack a process that allows you to make decisions when your value system is in a contradictory state.

But it is not important. If it were important, you wouldn't think about it as "what do I want". For example, if you want "world peace" and you have two ways to achieve it, and you can't choose because both are so great, it means either that you have no dice or there is a reason why both would fail.

Yes, I hope that my framing of the problem supports this sort of conclusion :P

An alternate framing where it still seems important would be "moral uncertainty," where when we don't know what to do, it's because we are lacking some facts, maybe even key facts. So I'm sort of sneakily arguing against that frame.