Disclaimer: these ideas are not new, just my own way of organizing and elevating what feels important to pay closer attention to in the context of alignment work.
All uses of em dashes are my own! LLMs were occasionally used to improve word choice or clarify expression.
One of the reasons alignment is hard comes down to the question: who should AGI benefit? And, relatedly: who benefits from AGI today?
An idealist may answer “everyone” for both, but this is problematic, and also not what’s happening currently. If AGI is for everyone, that would include “evil” people who want to use AGI for “bad” things. If it’s “humanity,” whether through some universal utility function or averaged preferences, how does one arrive at those core metric(s) while accounting for diversity in culture, thought, morality, etc.? What is lost by aligning around, or distilling values down to, that common value system?
Many human justice systems punish antisocial behaviors like murder or theft, and while there’s general consensus that these things are harmful or undesirable, there are still some who believe it’s acceptable to steal in certain contexts (e.g., Robin Hood), or to kill (as an act of revenge, as a death penalty or deterrence mechanism, etc.). Issues like abortion introduce categorical uncertainty, such as the precise moment at which life begins, or what degree of autonomy and individual freedom should be granted at different stages of development, making terms like “murder” — and its underlying concepts of death and agency — murky.
Furthermore, values shift over time, and can undergo full regime changes across cultures or timelines. For example, for much of history, population growth was seen as largely beneficial, driving labor capacity, economic productivity, innovation and creativity, security and survival, etc. Today, ecological capacity ceilings, resource limits, stress concentrations, and power-law-driven risks like disease or disaster have fueled antinatalist perspectives. If consensus already seems impossible at any given moment, and values shift over time, then both rigidity and plasticity in AGI alignment seem dangerous: how would aligned AGI retain the flexibility to navigate long-horizon generational shifts? How do we ensure adaptability without values-drift in dangerous directions? It seems likely that any given universal utility function would still be constrained to context, or influenced by changes over large time scales; though perhaps this view is overly grounded in known human histories and behaviors, and the possibility remains that something universal and persistent has yet to be fully conceptualized.
There’s also the central question of who shapes alignment, and how their biases or personal interests might inform decisions. Paraphrasing part of a framework posited by Evan Hubinger, you might have a superintelligent system aligned to traits based on human data/expectations, or to traits this superintelligent system identifies (possibly beyond human conception). If things like pluralism or consensus are deemed insurmountable in a human-informed alignment scenario and we let AGI choose, would its choice actually be a good one? How would we know, or place trust in it?
These philosophical problems have concrete, practical consequences. Currently, AGI is mostly being developed by human engineers and scientists within human social systems.[1] Whether or not transcendent superintelligence that drives its own development eventually emerges, for now, only a small subset of the human population is focused on this effort. These researchers share a fuzzy driving goal to build “intelligent” systems, and many subgoals about how to do that and how to deploy them — often related to furthering ML research, or to adjacent computational sciences like physics, mathematics, biology, etc.
There are far fewer literature professors, historians, anthropologists, creatives, social workers, landscape architects, restaurant workers, farmers, etc., who are intimately involved in creating AGI. This isn’t surprising or illogical, but if AI is likely to be useful to “everyone” in some way (à la radio, computers), then “everyone” probably needs to be involved. Of course, there are also entire populations around the world with limited access to the requisite technology (i.e., internet, computers), let alone AI/ML systems. This is a different but still important gap to address.
Many people are being excluded from AGI research (even if largely unintentionally). And eventually, intelligent systems will probably play a role in their lives. But what will AI/AGI systems look like by then? Will they be more plastic, or able to learn continuously? Hopefully! But maybe not. What are the opportunity costs of not intentionally including extreme levels of human diversity early on in a task as important as AGI, which might be responsible for the protection, enablement, and/or governance of humanity in meaningful ways in the future? We have seen this story before, for example, in the history of science and medicine, where women were long excluded from research about women’s health[2][3], and in pretty much any field where the population being analyzed is not involved in the analysis. It usually leads to getting things wrong, which isn’t just a truth-seeking issue — it can be extremely harmful to the people who are supposed to benefit.
What this means for alignment research:
The inner alignment problem asks: how do we ensure AGI does what we want? Outer alignment tries to address: what are we optimizing for? But even that question can feel too narrowly examined in current alignment research. Who makes up “we”? What frameworks determine whose “wants” count? And how do we ensure those mechanisms are legitimate, given irreducible disagreement about values?
A top-down approach would treat questions of pluralism and plasticity as prerequisites, resolving the second-order problem (alignment to what) before tackling first-order technical alignment (how to do it). But waiting for philosophical consensus before building AGI is both impractical and risky, as development continues regardless.
A parallel approach is necessary. Researchers working on mechanistic alignment can take immediate and intentional (bottom-up) action to diversify who builds and informs AGI and shapes its goals, while simultaneously pursuing top-down governance and philosophical work on pluralism and legitimacy mechanisms. Mechanistic work should, of course, be done in conversation with philosophical work, because path dependencies will lock in early choices. If AGI systems crystallize around the values and blind spots of their creators before incorporating global human diversity, they may lack the ability to later accommodate strongly divergent worldviews. Importantly, AGI researchers already agree that solving continuous learning will be a critical piece of this puzzle.
Without resolving second-order alignment issues, we risk building systems that are increasingly powerful, yet aligned to an unexamined, unrepresentative “we”. This could be as dangerous as unaligned AGI. One could even argue that AGI isn’t possible without solving these problems first — that true general intelligence requires the capacity to navigate human pluralism in all its complexity, as well as to adapt to new contexts while maintaining beneficent alignment.
[1] Even as AI itself plays an increasingly direct role in research, it is not yet clear that these systems are meaningfully controlling or influencing progress.