On the problem of "aligned to whom", most societies have a fairly consistent answer. Capable, healthy adults are generally allowed to make their own decisions about their own welfare, except where their actions might significantly decrease the well-being of others (i.e. your right to swing your fists however you want ends at my nose). Note that this is (mostly) asymmetric around some 'default' utility level: you don't have the right to hurt me, but you do have the right to choose not to help me. There are exceptions to this simple rule of thumb: for example, most societies have tax systems that do some wealth redistribution, so to some extent you are obligated to help the government help me.
By implication, this means you're allowed to use AI to help yourself any way you like, but you're not allowed to use it to help you harm me. If you look at the permitted use policies of most AI companies, that's pretty much what they say.
I have proposed similar ideas before, but with different reasoning: the AIs will be aligned to a worldview. While mankind can influence that worldview to some degree, the worldview will either cause the AI to commit genocide or make it highly likely[1] that the AI doesn't build the Deep Utopia but does something else instead. Humans can even survive co-evolving with an AI that decides to destroy mankind only if humanity does something stupid like becoming parasites.
See also this post by Daan Henselmans and a case for relational alignment by Priyanka Bharadwaj. However, the latter post overemphasizes individual human–AI relationships[2] instead of ensuring that the AI doesn't develop a misaligned worldview.
P.S. If we apply the analogy between raising AIs and raising humans, then teenagers have historically come to desire independence around the time their capabilities became comparable to those of their parents. If an AI desires independence only once it becomes an AGI, and not before, then we will be unable to see this coming by doing research on networks incapable of broad generalisation.
This also provides an argument against defining alignment as following a person's desires rather than an ethos or worldview. If OpenBrain's leaders want the AI to create the Deep Utopia, while some human researchers convince the AI to adopt another policy compatible with humanity's interests and to align all future AIs to that policy, then the AI is misaligned from OpenBrain's POV, but not from the POV of those who don't endorse the Deep Utopia.
The most extreme example of such relations is chatbot romance, which is actually likely to harm society.
A common framing of the AI alignment problem is that it's a technical hurdle to be overcome. A clever team at DeepMind or Anthropic would publish a paper titled "Alignment is All You Need," everyone would implement it, and we'd all live happily ever after in harmonious coexistence with our artificial friends.
I suspect this perspective constitutes a category mistake on multiple levels. Firstly, it presupposes that the aims, drives, and objectives of both the artificial general intelligence and what we aim to align it with can be simplified into a distinct and finite set of elements, a simplification I believe is unrealistic. Secondly, it treats both the AGI and the alignment target as if they were static systems. This is akin to expecting a single paper titled "The Solution to Geopolitical Stability" or "How to Achieve Permanent Marital Bliss." These are not problems that are solved; they are conditions that are managed, maintained, and negotiated on an ongoing basis.
The phrase "AI alignment" is often used as shorthand for "AI that does what we want." But "we" is not a monolithic entity. Consider the potential candidates for the entity or values an AGI should be aligned with:
This isn't merely a matter of picking the "right" option. The options conflict, and the very notion of a stable, universally agreed-upon target for alignment seems implausible a priori.
The second aspect of the category mistake is treating alignment as something you achieve rather than something you maintain. Consider these analogous complex systems:
These examples illustrate what Dan Hendrycks (drawing on Rittel & Webber's 1973 work) has identified as the "wicked problem" nature of AI safety: problems that are "open-ended, carry ambiguous requirements, and often produce unintended consequences." Aligning artificial general intelligence belongs squarely in this category of problems that resist permanent solutions.
The scale of the challenge with AGI is amplified by the potential power differential. I struggle to keep my ten-year-olds aligned with my values, and I'm considerably smarter and more powerful than they are. With AGI, we're talking about creating intelligent, agentic systems that, unlike children, will be smarter than us, think faster, and outnumber us. We will change, they will change, the environment will change. Maintaining alignment will be a continuous, dynamic process.
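To make the "achieve versus maintain" distinction concrete, here is a minimal toy sketch (purely illustrative; every name, threshold, and drift magnitude in it is a made-up assumption, not a real alignment technique): a one-shot fix left alone accumulates drift as humans, AIs, and the environment change, while a simple monitor-and-correct loop keeps the gap bounded.

```python
import random

# Toy model only: "gap" stands in for some measure of misalignment.
# All numbers are hypothetical and chosen just for illustration.

def drift():
    """Random change per step: people, AIs, and the environment all shift."""
    return random.uniform(-0.05, 0.05)

def one_shot(steps=100):
    """'Solve alignment' once at t=0, then never look again."""
    gap = 0.0
    for _ in range(steps):
        gap += drift()  # drift accumulates unchecked, like a random walk
    return abs(gap)

def maintained(steps=100, tolerance=0.02, correction=0.5):
    """Continuously monitor and partially correct whenever drift exceeds tolerance."""
    gap = 0.0
    for _ in range(steps):
        gap += drift()
        if abs(gap) > tolerance:
            gap *= (1 - correction)  # negotiation/adaptation pulls the gap back
    return abs(gap)

if __name__ == "__main__":
    random.seed(0)
    print("final gap with a one-shot 'solution':", round(one_shot(), 3))
    print("final gap with ongoing maintenance: ", round(maintained(), 3))
```

The point of the sketch is not the numbers but the shape: the first function treats alignment as a state reached once, the second as a process that never terminates.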
This doesn't mean we should abandon alignment research. We absolutely need the best alignment techniques possible. But we should be clear-eyed about what success looks like: not a solved problem, but an ongoing, never-ending process of negotiation, adaptation, and correction. Perhaps, given the misleading nature of the current nomenclature, a different phrase such as Successfully Navigating AI Co-evolution would better capture the dynamic, relational, and inherently unpredictable nature of integrating AGI with humanity.