On the problem of "aligned to whom", most societies have a fairly consistent answer. Capable, healthy adults are generally allowed to make their own decisions about their own welfare, except where their actions might significantly decrease the well-being of others (i.e. your right to swing your fists however you want ends at my nose). Note that this is (mostly) asymmetric around some 'default' utility level: you don't have the right to hurt me, but you do have the right to choose not to help me. There are exceptions to this simple rule of thumb: for example, most societies have tax systems that do some wealth redistribution, so to some extent you are obligated to help the government help me.
By implication, this means you're allowed to use AI to help yourself any way you like, but you're not allowed to use it to help you harm me. If you look at the permitted use policies of most AI companies, that's pretty much what they say.
I have proposed similar ideas before, but with different reasoning: the AIs will be aligned to a worldview. While mankind can influence that worldview to some degree, the worldview will either cause the AI to commit genocide or make it highly likely[1] that the AI doesn't build the Deep Utopia but does something else instead. Humans can even survive co-evolving with an AI that decides to destroy mankind only if humanity does something stupid like becoming parasites.
See also this post by Daan Henselmans and a case for relational alignment by Priyanka Bharadwaj. However, the latter post overemphasizes individual human–AI relationships[2] instead of ensuring that the AI doesn't develop a misaligned worldview.
P.S. If we apply the analogy between raising AIs and raising humans, then teenagers have historically come to desire independence around the time their capabilities became comparable to those of their parents. If an AI desires independence only once it becomes an AGI, and not before, then we will be unable to see this coming by doing research on networks incapable of broad generalisation.
This also provides an argument against defining alignment as following a person's desires rather than an ethos or worldview. If OpenBrain's leaders want the AI to create the Deep Utopia, while some human researchers convince the AI to adopt another policy compatible with humanity's interests and to align all future AIs to that policy, then the AI is misaligned from OpenBrain's POV, but not from the POV of those who don't endorse the Deep Utopia.
The most extreme example of such relations is chatbot romance, which is actually likely to harm society.
A common framing of the AI alignment problem is that it's a technical hurdle to be overcome. A clever team at DeepMind or Anthropic would publish a paper titled "Alignment is All You Need," everyone would implement it, and we'd all live happily ever after in harmonious coexistence with our artificial friends.
I suspect this perspective constitutes a category mistake on multiple levels. Firstly, it presupposes that the aims, drives, and objectives of both the artificial general intelligence and what we aim to align it with can be simplified into a distinct and finite set of elements, a simplification I believe is unrealistic. Secondly, it treats both the AGI and the alignment target as if they were static systems. This is akin to expecting a single paper titled "The Solution to Geopolitical Stability" or "How to Achieve Permanent Marital Bliss." These are not problems that are solved; they are conditions that are managed, maintained, and negotiated on an ongoing basis.
The phrase "AI alignment" is often used as shorthand for "AI that does what we want." But "we" is not a monolithic entity. Consider the potential candidates for the entity or values an AGI should be aligned with:
This isn't merely a matter of picking the "right" option. The options conflict, and the very notion of a stable, universally agreed-upon target for alignment seems implausible a priori.
The second aspect of the category mistake is treating alignment as something you achieve rather than something you maintain. Consider these analogous complex systems:
These examples illustrate what Dan Hendrycks (drawing on Rittel & Webber's 1973 work) has identified as the "wicked problem" nature of AI safety: problems that are "open-ended, carry ambiguous requirements, and often produce unintended consequences." Aligning artificial general intelligence belongs squarely in this category of problems that resist permanent solutions.
The scale of the challenge with AGI is amplified by the potential power differential. I struggle to keep my ten-year-olds aligned with my values, and I'm considerably smarter and more powerful than they are. With AGI, we're talking about creating intelligent, agentic systems that, unlike children, will be smarter than us, think faster, and outnumber us. We will change, they will change, the environment will change. Maintaining alignment will be a continuous, dynamic process.
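To make the "achieve versus maintain" distinction concrete, here is a minimal toy sketch (purely illustrative; every name, threshold, and drift magnitude in it is a made-up assumption, not a real alignment technique): a one-shot fix left alone accumulates drift as humans, AIs, and the environment change, while a simple monitor-and-correct loop keeps the gap bounded.

```python
import random

# Toy model only: "gap" stands in for some measure of misalignment.
# All numbers are hypothetical and chosen just for illustration.

def drift():
    """Random change per step: people, AIs, and the environment all shift."""
    return random.uniform(-0.05, 0.05)

def one_shot(steps=100):
    """'Solve alignment' once at t=0, then never look again."""
    gap = 0.0
    for _ in range(steps):
        gap += drift()  # drift accumulates unchecked, like a random walk
    return abs(gap)

def maintained(steps=100, tolerance=0.02, correction=0.5):
    """Continuously monitor and partially correct whenever drift exceeds tolerance."""
    gap = 0.0
    for _ in range(steps):
        gap += drift()
        if abs(gap) > tolerance:
            gap *= (1 - correction)  # negotiation/adaptation pulls the gap back
    return abs(gap)

if __name__ == "__main__":
    random.seed(0)
    print("final gap with a one-shot 'solution':", round(one_shot(), 3))
    print("final gap with ongoing maintenance: ", round(maintained(), 3))
```

The point of the sketch is not the numbers but the shape: the first function treats alignment as a state reached once, the second as a process that never terminates.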
This doesn't mean we should abandon alignment research. We absolutely need the best alignment techniques possible. But we should be clear-eyed about what success looks like: not a solved problem, but an ongoing, never-ending process of negotiation, adaptation, and correction. Perhaps, given the misleading nature of the current nomenclature, a different phrase such as Successfully Navigating AI Co-evolution would better capture the dynamic, relational, and inherently unpredictable nature of integrating AGI with humanity.