Teo Ajantaival

Comments

AI Alignment Problem: “Human Values” don’t Actually Exist

Terminal value monism is possible with impersonal compassion as the common motivation to resolve all conflicts. This means that every small self thus aligned lives primarily to prevent hellish states wherever they may arise, and that personal euthanasia is never a primary option, especially considering that survivors of suffering may later be in a good position to understand it and help relieve it in others (as well as to contribute themselves as examples to our collective wisdom of life narratives that do or don't get stuck in hellish ways).

AI Alignment Problem: “Human Values” don’t Actually Exist
● Humans do not have one terminal value (unless they are mentally ill).

Why though?

I don't see any other way to (ultimate) alignment/harmony/unification between (or within) minds than to use a single currency, grounded in one terminal value, for resolving all conflicts.

For as soon as we weigh two terminal values against each other, we are evaluating them through a shared dimension (e.g., force or mass in the case of a literal scale as the comparator), and are thus logically forced to accept that either one of the terminal values (or its motivating power) could be translated into the other, or that there was this third terminal {value/motivation/tension} for which the others are tools.

Do you suggest getting rid of the idea of terminal value(s) altogether, or could you explain how we can resolve conflicts between two terminal values, if terminal means irreducible?

(To the extent that I think in terminal and instrumental values, I claim to care terminally only about suffering. I also claim to not be mentally ill. A lot of Buddhists etc. might make similar claims, and I feel like the statement above quoted from the Conclusion without more context would label a lot of people either mentally ill or not human, while to me the process of healthy unification feels like precisely the process of becoming a terminal value monist. :-))

Why do you reject negative utilitarianism?
Did you make any update regarding the simplicity / complexity of value?

Yes, in terms of how others may explicitly defend the terminal value of even preferences (tastes, hobbies), instead of defending only terminal virtues (health, friendship), or core building blocks of experience (pleasure, beauty).

No, in terms of assigning anything {independent positive value}.

I experience all of the things quoted in Complexity of value,

"Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions of various kinds, understanding, wisdom; beauty, harmony, proportion in objects contemplated; aesthetic experience; morally good dispositions or virtues; mutual affection, love, friendship, cooperation; just distribution of goods and evils; harmony and proportion in one's own life; power and experiences of achievement; self-expression; freedom; peace, security; adventure and novelty; and good reputation, honor, esteem, etc."

but I don’t know how to ultimately prioritize between them unless they are commensurable. I make them commensurable by weighing their interdependent value in terms of the one thing we all(?) agree is an independent motivation: preventable suffering. (If preventable suffering is not worth preventing for its own sake, what is it worth preventing for, and is this other thing agreeable to someone undergoing the suffering as the reason for its motivating power?) This does not mean that I constantly think of them in these terms (that would be counterproductive), but in conflict resolution I do not assign them independent positive numerical values, which pluralism would imply one way or another.

Any pluralist theory raises the question of outweighing suffering with enough of any independently positive value. If you think about it for five minutes, aggregate happiness (or any other experience) does not exist. If our first priority is to prevent preventable suffering, that alone is an infinite game; it does not help to make a detour to boost/copy positive states unless this is causally connected to preventing suffering. (Aggregates of suffering do not exist either, but each moment of suffering is terminally worth preventing, and we have limited attention, so aggregates and chain-reactions of suffering are useful tools of thought for preventing as many such moments as we can. So are many other things, without our having to attach independent positive value to them, or else we would be tiling Mars with them whenever it outweighed helping suffering on Earth according to some formula.)

My experience so far with this kind of unification is that it avoids many (or even all) of the theoretical problems that are still considered canonical challenges for pluralist utilitarianisms that assign both independent negative value to suffering and independent positive value to other things. I do not claim that this would be simple or intuitive – that would be analogous to reading about some Buddhist system, realizing its theoretical unity, and teleporting past its lifelong experiential integration – but I do claim that a unified theory with grounding in a universally accepted terminal value might be worth exploring further, because we cannot presuppose that any kind of CEV would be intuitive or easy to align oneself with.

My impression is that theoretical simplicity is a major driver of your preference for NU, and also that if others (such as myself) weighed theoretical simplicity more highly, they would likely be more inclined towards NU.
In other words, I think theoretical simplicity may be a double crux in the disagreements here about NU. Would you agree with that?

Partly, yes. It may also be that all of us, me included, are out of touch with the extreme ends of experience and thus do not understand the ability of some motivations to override everything else.

It is also difficult to operationalize a false belief in independent value: When are we attached to a value to the extent that we would regret not spending its resources elsewhere, on CEV-level reflection?

People also differ in their background assumptions about whether AGI makes the universally life-preventing button a relevant question, because for many, the idea of an AGI represents an omnipotent optimizer that will decide everything about the future. If so, we want to be careful about assigning independent positive value to all the things, because each one of those invites this AGI to consider {outweighing suffering} with {producing those things}, since pluralist theories do not require a causal connection between the things being weighed.

Why do you reject negative utilitarianism?

Thanks for the replies, everyone!

I don’t have the time to reply back individually, but I read them all and believe these to be pretty representative of the wider community’s reasons to reject NU as well.

I can’t speak for those who identify strictly as NU, but while I currently share many of NU’s answers to theoretical outweighing scenarios, I do find it difficult to unpack all the nuance it would take to reconcile “NU as CEV” with our everyday experience.

Therefore, I’ll likely update further away from

{attempting to salvage NU’s reputation by bridging it with compassion, motivation theory, and secular Buddhism}

towards

{integrating these independent of NU, seeing if this would result in a more relatable language, or if my preferred kind of theoretical unity (without pluralist outweighing) would still have the cost of its sounding absurd and extreme on its face}

Why do you reject negative utilitarianism?

Thanks for the perspective.

I agree that even NU may imply rejecting NU in its present form, because it does not feel like a psychologically realistic theory to constantly apply in everyday life; we are more motivated to move towards goals and subgoals that do not carry explicit reminders of extreme suffering on the flip side.

I do feel that I am very close to NU whenever I consider theoretical problems and edge cases that would outweigh extreme suffering with anything other than preventing more extreme suffering. In practice, it may be more applicable (and ironically, more useful according to NU) to replace NU with asymptotically approaching uncompromising compassion, or an equivalent positive-language narrative, while taking large doses of epistemic humility whenever it starts to feel like a good idea to override any explicit preferences of others.

Epistemic humility is also why I cannot imagine a situation where I would push The Button, because I cannot imagine how I could know with certainty that it works (so I cannot derive an intuition from that thought experiment).

Why do you reject negative utilitarianism?

Yes, I am making the (AFAICT, in your perspective) “incredibly, amazingly strong claim” that in a unified theory, only suffering ultimately matters. In other words, impartial compassion is the ultimate scale (comparator) to decide conflicts between expected suffering vs. other values (whose common basis for this comparison derives from their complete, often context-dependent relationship to expected suffering, including accounting for the wider incentives & long-term consequences from breaking rules that are practically always honored).

I find negative utilitarianism unappealing for roughly the same reason I'd find "we should only care about disgust" or "we should only care about the taste of bananas" unappealing.

Roughly? Suffering is not an arbitrary foundation for unification (for a “common currency” underlying an ideally shared language for cause prioritization). Suffering is the clearest candidate for a thing we all find terminally motivating, at least once we know what we’re talking about (i.e., aren’t completely out of touch with the nature of extreme suffering, as evaluators & comparators of experiences are expected not to be). Personally, I avoid the semantics of arguing over what we “should” care about. Instead, I attempt to find out what I do care about, what these values’ motivational currency is ultimately derived from, and how I could unify these findings into a psychologically realistic model with consistent practical implications & minimal irreconcilable contradictions (such as outweighing between multiple terminal values, because I’m a skeptic of aggregation over space and time; aggregate experiences physically exist only as thoughts not fit to outweigh suffering, which only preventing more suffering can do).

"Only suffering ultimately matters" and "only the taste of bananas ultimately matters" share the virtue of simplicity, but they otherwise run into the same difficulty, which is just that they don't exhaustively describe all the things I enjoy or want or prefer. I don't think my rejection of bananatarianism has to be any more complicated than that.

I agree re: bananatarianism, but there’s more to unpack from the suffering-motivated unification than meets the eye.

No verbal descriptions can exhaustively describe all the things we enjoy, want, or prefer, because our inner homeostatic & psychological dynamics contain processes that are too multidimensional for simple overarching statements. What we can do is unpack the implications of {“Only suffering ultimately matters”} to see how this can imply, predict, retrodict, and explain our other motivations.

In evolutionary and developmental history terms, we can see at first glance that many (if not immediately all) of our other motivations interact with suffering, or have interacted with our suffering in the past (individually, neurally, culturally, evolutionarily). They serve functions of group cohesion, coping with stress, acquiring resources, intimacy, adaptive learning & growth, social deterrence, self-protection, understanding ourselves, and various other things we value & honor because they make life easier or more interesting. Neither NU nor other systems will honor all of our perceived wants as absolutes to maximize (reproduction) or to even respect at all (revenge; animalistic violence; desire for personal slaves or worship), but most of our intuitively nominated “terminal” values need not be overridden by the slightest suffering, because they do serve strong functions to prevent suffering, especially when they seem to us like autonomous goals without constantly reminding us of how horrible things did and would happen without them. NU simply claims that it is the most diplomatic solution for a unified theory to de-attach from other values as absolutes, and respect them to the degree that we need them (practicing epistemic uncertainty when we do not yet understand the full role of something we intuitively deeply value!). This may in practice lead to great changes, ideally in directions of more self-compassion and general compassion for others without our other values overriding our motivation to prevent as many cases of extreme suffering as we can.

A considerate rejection of NU needs to be more complicated than that of bananatarianism, because the unity, applicability, and explanatory power of NU rely on its implications (instead of explicit absolute rules or independent utility assignments for exemplary grounding-units of every value, which are challenges for other systems), and its weights of instrumental value depend not on static snapshots of worlds, but on the full context and counterfactuals epistemically accessible to us in each situation. In extreme situations, we may decide it worthwhile to simulate the expected long-term consequences of possibly bending rules that we normally accept as near-absolute heuristics to save ourselves from resource-intensive overthinking (e.g., the degree to which we respect someone’s autonomy, in emergencies). This doesn’t imply that superweapon research is a low-hanging fruit for aspiring suffering-minimizers (for reasons I won’t detail right now, because I find the world-wiping objections worth addressing mostly in the context of AGI assumptions; it is worth noting that my primary interest here is unification for practical cause prioritization).

To actually reject NU, you must explain what makes something (other than suffering) terminally valuable (or as I say, motivating) beyond its instrumental value for helping us prevent suffering in the total context. This instrumental value is multifaceted and can be derived from various kinds of relationships to suffering. So other “terminal” values may serve important purposes, including that they help us (some examples in parentheses):

  • cope with suffering (coping mechanisms, friendship, community)
  • avoid ruminating on suffering (explicit focus on expansive, positive language and goals that don’t contain reminders of their possibly suffering-mediated usefulness)
  • re-interpret suffering (humor, narratives, catharsis)
  • prevent suffering (science, technology, cognitive skills & tools)
  • understand suffering (wide & deep personal experience, culture)
  • predict suffering (science)
  • skip epistemic difficulties of trying to optimize others’ suffering for them (autonomy)
  • prevent the abuse of our being motivated by suffering (human rights, justice system, deterrence)
  • help relieve others’ suffering (life, health, freedom, personal extra resources to invest as we see fit, reducing x-risk)
  • build resilience against suffering (experience, intelligence, learning, cultural memory)
  • stay safe in emergencies (family, intimacy, community, institutions)
  • relax and regain our ability to help (good food, joy, replenishing activities)
  • avoid a horrible societal collapse that spreads an epidemic of anxiety (by not secretly killing people and everyone who knew them in the name of compassion, and thus never getting caught doing so)

To reject NU, is there some value you want to maximize beyond self-compassion and its role for preventing suffering, at the risk of allowing extreme suffering? How will you tell this to someone undergoing extreme suffering?

NUs are not saying you are deluded for valuing multiple things. But you may be overly attached to them if you—beyond self-compassion—would want to spend your attention on copying/boosting instances of them rather than on preventing others from having to undergo extreme suffering.

After writing this, I wonder if the actual disagreement is still the fear that an NU-AGI would consider humans less {instrumentally valuable for preventing suffering} than it would consider {our suffering terminally worth preventing}. This feels like a very different conversation than what would be a useful basis for a common language of cause prioritization.

Why do you reject negative utilitarianism?
If you flip the Rachels-Temkin spectrum argument (philpapers.org/archive/NEBTGT.pdf), then some tradeoff between happiness and suffering is needed to keep transitive preferences, which is necessary to avoid weird conclusions like accepting suffering to avoid happiness. As long as you don't think there's some suffering threshold where 1 more util of suffering is infinitely worse than anything else, then this makes sense.

Can you give a practical example of a situation where I would thereby be forced to admit that happiness has terminal value above its instrumental value for my preventing as many suffering moments as I can?

I don’t see why {resolving conflicts by weighing everything (ultimately) in suffering} would ever lead me to {“accept suffering to avoid happiness”}, if happiness already can be weighed against suffering in terms of its suffering-preventing effects—just not by itself, which is what many other utilitarianisms rely on, inviting grotesque problems like doctors having parties so great that they outweigh the untreated suffering of their patients.

Are there also practical situations where I’d want to admit that paperclips have terminal value, or else accept suffering to avoid paperclips?

I don’t see what hidden assumptions I’m missing here. I certainly don’t think an infinitely large paperclip is an acceptable comparand to outweigh any kind of suffering. In the case of happiness, it depends completely on whether the combined causal cascades from this happiness are expected to prevent more suffering than the current comparand suffering: no need to attach any independent numerical terminal value to happiness itself, or we’d be back to counting happy sheep believing it to outweigh someone’s agony any moment now.

Also NU in general has a bad reputation in the philosophy community (more than classical utilitarianism I think) so it's better EAs don't endorse it.

I believe the first part of this statement may currently be true for the WEIRD (Western, educated, industrialized, rich, democratic) philosophy community. Other parts of the world have long histories and living traditions of suffering-based views, primarily various forms of Buddhism. In what I’ve read about Mahayana Buddhism (or the Bodhisattva path), compassion is often explicitly identified as the only necessary motivation that implies and/or transcends all the outwardly visible customs, rules, and ethics, and as the voice to listen to when other “absolutes” conflict. (Omnicidal superweapon research is not part of these philosophies of compassion; in my estimation, it was invented as an implication of NU by later armchair rationalists to easily dismiss NU.)

I’ll take the second part of your statement as your current personal opinion of NU in its present form and perceived reputation. I am personally still optimistic that suffering is the most universal candidate to derive all other values from, and I would be careful not to alienate a large segment of systematic altruists such as might be found among secular, rationalist Buddhists. I mostly agree though, that NU in its present form may be tainted by the prevalence of the world-destruction argument (even though it is argued to represent only a straw man NU by proponents of NU).

Why do you reject negative utilitarianism?

No, I’m not depressed, and I believe I never have been. I understand and appreciate the question if what you describe is your prior experience of people who identify as negative utilitarians. I may identify as NU for discussion’s sake, but my underlying identification is with the motivation of impartial compassion. I would go as far as to say that I am happy in all areas of my personal life, being driven towards unification by my terminal concern for the expected suffering of others.

I have had brief experiences of medical emergencies that gave me new perspectives into suffering from the inside. (In hindsight, much of it was generated by fear and escalating perception, but so it is in real danger.) While those happened years ago, I’ve continued to reflect on them and feel like they’ve changed me, affecting my daily life and priorities since then. For a while, I felt grateful for getting my life back and considered devoting myself to treating acute pain. I have since graduated with a Master’s in Psychology, without a clinical internship, to focus more on research, feeling that my comparative advantage is in channeling compassion for more scalable, theoretical work.

I believe a possible mistake of depressed NUs is to focus on others’ suffering before taking care of themselves (by listening to the foundational motivation of self-compassion). Self-compassion is our value-grounding for extended self-compassion, which leads to universal compassion in the limit.

Nate Soares has a post about self-compassion as a key part of his wider, 40-post series, Replacing Guilt; I recommend both to everyone (the series is also available compiled as an ePUB).