Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Many people are very nationalistic, putting their country above all others. Such people can be hazy about what "above all others" can mean, outside of a few clear examples - eg winning a total war totally. They're also very hazy on what is meant by "their country" - geography is certainly involved, as is proclaimed or legal nationality, maybe some ethnic groups or a language, or even just giving deference to certain ideals.

Consider the plight of a communist Croatian Yugoslav nationalist during the 1990s...

I'd argue that the situation these nationalists find themselves in - strong views on poorly defined concepts - is the general human state for preferences. Or, to use an appropriate map and territory analogy:

  • Most people forge their preferences by exploring their local territory, creating a mental map of this, and taking strong preferences over the concepts within their mental map. When the map starts to become imperfect, they will try to extend the concepts to new areas, so that their preferences can also be extended.

Some of the debates about the meaning of words are about this extension-of-preferences process. Scott Alexander recommends that we dissolve concepts such as disease, looking for the relevant categories of 'deserves sympathy' and 'acceptable to treat in a medical way'.

And that dissolving is indeed the correct thing for rationalists to do. But, for most people, including most rationalists, 'sick people deserve sympathy' is a starting moral principle, one we've learnt by example and experience in childhood. When we ask 'do obese people deserve sympathy?' we're trying to extend that moral principle to a situation where our map/model (which includes, say, three categories of people: healthy, mildly sick, very sick) no longer matches up with reality.

Scott's dissolving process requires decomposing 'disease' into more nodes, and then applying moral principles to those individual nodes. In this case, a compelling consequentialist analysis is to look at whether condemnation or praise is effective at changing the condition; ie does fat-shaming people make them less likely to be fat, or others less likely to become fat in the first place? Here the moral principle involved is something like "it's wrong to harm someone (eg through shaming them) if there is no benefit to them or others from doing so".

And that's a compelling moral principle, but it's not the same one that we started with. Some people will have a strong "no harm" intuition, of which "sick people deserve sympathy" is merely an illustrative example. But many (most?) will have been taught that sick people deserve sympathy, as a specific moral requirement they should follow. When we dissolve the definition of disease, we lose a part of our moral preferences.

And yes, human values are such a mess that we could do with losing or simplifying a bunch of them. But human values are genuinely complicated, and we don't want to over-simplify them. So it's important to note that the "dissolving" process also generally involves discarding a portion of our values, those that don't fit neatly on the new map we have. It's important to decide when we're willing to pay that price, and when we're not.

Reversing the purpose of maps

We generally see maps as working the other way round: as tools that serve the purposes of our "real" goals. Eliezer writes about how, if definitions didn't stand for some query, something relevant to our "real" preferences, we'd have no reason to care about them.

But if, as I've argued, most of our preferences live in our mental maps, then changing definitions or improving maps can tear up our preferences and values - or at least force us to re-assess them.

Defending "purity"

This is why I spend so much time thinking about "conservative" values, especially those around the moral foundation of purity. I mainly don't share that moral foundation, so it's clear to me how incoherent it is. It's painful to listen to someone who has that moral foundation twist and turn and try to justify it based on more consequentialist reasoning. Yes, rituals can bind a community together; but are you really telling me that if, say, TV shows or Facebook games were shown to do a better binding job, you'd cheerfully discard those rituals?

But I strongly suspect that, ultimately, the moral foundations I do care about, such as care/harm, are also incoherent when we push too far into unfamiliar territory. So I want to forge something coherent out of purity, as practice for forging something coherent out of all our values.

A metaphorical example

Your parent, on their deathbed, gives you your mission in life: an old map, a compass, and the instructions "Go west, young man[1]!"

  1. The map is... incomplete:

  2. The compass is fine, but, as we know, its concept of west is not exactly the same as the standard geographical one.

  3. In the era and place that your hypothetical parent was from, the connotations of "going west" involve adventure and potential richness.

  4. And, most importantly, neither of you have yet realised that the world is round.

So, for a short while, "going west" seems like a clear, well-defined goal. But as we get to the edge of the map, both literally and metaphorically, the concept starts to lose definition and become far more uncertain; and hence, so does your goal.

What will you do with your goal when your mental maps are forced to change?

[1] Don't worry if you're not actually a young man; their mind was starting to go, towards the end.


I think that you need to look at the generators of the instruction to go west.

For example:

Travel west. I want you to maximize the total westward distance you travel.

Getting on a westward orbiting space station would be really good.

Travel west. My utility is linear in your longitude.

You need to figure out where to cut the map and move just to the east of that line.

Travel west. There is a pot of gold a few miles west and I want you to be rich.

The information to distinguish between these interpretations is not within the request to travel west.

You need to look at why you were asked to travel west.
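
To make the divergence concrete, here is a rough sketch that scores three trips under the three readings above. Everything in it is an assumption made for illustration: the function names, the convention that longitude is measured in degrees east, the map being cut at ±180, and the pot of gold sitting at -5 degrees. It is not a claim about how any of these utilities "should" be formalised.

```python
# Illustrative only: three hypothetical readings of "travel west".
# Longitudes are degrees east in [-180, 180]; every name here is an assumption.

def westward_step(lon_from, lon_to):
    """Signed westward displacement in degrees (each step must be under 180)."""
    return (lon_from - lon_to + 180.0) % 360.0 - 180.0

def total_westward_distance(path):
    """Reading 1: cumulative westward travel, so repeated laps keep scoring."""
    return sum(westward_step(a, b) for a, b in zip(path, path[1:]))

def final_longitude_utility(path):
    """Reading 2: utility linear in final longitude (further west scores higher)."""
    return 0.0 - path[-1]

def ends_at_gold(path, gold_lon=-5.0, tolerance=1.0):
    """Reading 3: the point was a pot of gold a little way west (assumed at -5)."""
    return abs(path[-1] - gold_lon) <= tolerance

# Three candidate trips, each a list of waypoints starting at longitude 0.
paths = {
    "short trip": [0.0, -5.0],
    "full lap": [0.0, -90.0, -180.0, 90.0, 0.0],
    "map edge": [0.0, -90.0, -179.0],
}

# Pick the best trip under each reading.
best_by_distance = max(paths, key=lambda name: total_westward_distance(paths[name]))
best_by_longitude = max(paths, key=lambda name: final_longitude_utility(paths[name]))
best_by_gold = max(paths, key=lambda name: ends_at_gold(paths[name]))

print(best_by_distance, best_by_longitude, best_by_gold, sep=" | ")
```

Under these assumptions each reading picks a different trip: the full lap wins on total distance, stopping just east of the cut wins on longitude, and the short trip is the only one that ends at the gold. All three agree for the first few miles, which is why the bare instruction can't tell them apart.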

So it's important to note that the "dissolving" process also generally involves discarding a portion of our values, those that don't fit neatly on the new map we have.

I don't think that those values are being discarded, I think they are being broken down into more basic parts.

The information to distinguish between these interpretations is not within the request to travel west.

Yes, but I'd argue that most moral preferences are similarly underdefined when the various interpretations behind them come apart (eg purity).

Planned summary for the Alignment Newsletter:

This post argues that by default, human preferences are strong views built upon poorly defined concepts, that may not have any coherent extrapolation in new situations. To put it another way, humans build mental maps of the world, and their preferences are defined on those maps, and so in new situations where the map no longer reflects the world accurately, it is unclear how preferences should be extended. As a result, anyone interested in preference learning should find some incoherent moral intuition that other people hold, and figure out how to make it coherent, as practice for the case we will face where our own values will be incoherent in the face of new situations.

Planned opinion:

This seems right to me -- we can also see this by looking at the various paradoxes found in the philosophy of ethics, which involve taking everyday moral intuitions and finding extreme situations in which they conflict, and it is unclear which moral intuition should “win”.

Cool, neat summary.

Construal level theory (near vs far thinking in the old Overcoming Bias days aka values vs decisions aka abstract vs concrete preferences) may be a better lens than map vs territory. Warning: this line of thinking leads toward accepting Hansonian signaling explanations of "values" and noticing rampant hypocrisy in most discussions of specific value-level claimed beliefs.

human values are genuinely complicated, and we don't want to over-simplify them.

Once you acknowledge that, it's hard to take seriously the abstractions of "moral foundations theory". You don't even have to go all the way to "human values" being an incoherent phrase to recognize that preferences are always and only in models (as they're defined by counterfactual comparisons). From there it's a short inference to understanding that models are never complete, and rarely usable far outside the training/testing datasets.

"Go west, young man."

This quote is generally attributed to Horace Greeley, and very much associated with the US expansion westward in the North American continent. This was not a matter of inaccurate or incomplete maps - there was plenty of shared context and knowledge that the advice would apply to those in NY or DC, and stop applying as the young man got toward Kansas or Oklahoma.

Sometimes the cluster in the map that a preference is pointing at involves another preference, which provides a natural resolution mechanism. What happens when there are two preferences, I'm unsure; I suppose it depends on how your map changes. In that case, I think that if you want to focus on how to make purity coherent, you should start off with some "simple" map and various "simple" changes in the map. Making purity coherent relative to your map is hard both computationally and empathetically.

Side-note: It would be interesting to see which resolution mechanisms produce the most varied shifts in preferences for boundedly rational agents with complex utility functions.

Side-note^2: Stuart, I'm writing a review of all the work done on corrigibility. Would you mind if I asked you some questions on your contributions?

Stuart, I'm writing a review of all the work done on corrigibility. Would you mind if I asked you some questions on your contributions?

No prob. Email or Zoom/Hangouts/Skype?

Hangouts I suppose. It just works. Would next weekend be OK for you?

Edit: I've scheduled a meeting for 12pm UK time on Saturday. Tell me if that works for you.

Sorry, had a terrible few days, and missed your message. How about Friday, 12pm UK time?

Alright, here's the link for Friday:

Thanks for replying.