Prefatory Note for LessWrong Readers
This essay approaches the AI alignment problem not as a technical challenge, but as a philosophical one. The central argument is that the dominant alignment paradigm — across industry, academia, and government — rests on a structurally untenable premise: that human values can be formalized, unified, and evaluated against a stable standard of rightness.
The author is not an alignment researcher. This is a deliberate outside perspective — an attempt to question the framing of the question itself, rather than to propose a technical solution within the existing frame.
The essay draws on Berlin, Rawls, Hayek, Arrow, Popper, Hart, and Harari to build its argument. It concludes not with a prescription, but with a reframing:
The question should not be "how do we align AI with what is right" — but "what conditions allow humans and AI to coexist when rightness does not exist."
Rigorous critique is welcome. The argument is intended to be falsifiable in its structural claims, while openly acknowledging that its positive proposal (inductive alignment from historical wrongness) is a practical choice rather than a logical proof.
Abstract
Have you ever disagreed with someone today?
Of course you have. With over eight billion people on Earth, each carrying their own values and beliefs, disagreement is not a failure of human society — it is its very nature. Throughout history, humanity has never succeeded in defining a universal standard of what is "right." We have collided, negotiated, and compromised, and yet we have persisted.
And now we are asking AI to do what we have never done ourselves.
We are attempting to teach AI the very thing humanity has never been able to define. We are asking something else to achieve what we have never achieved.
This paper is an attempt to question that contradiction.
Current AI alignment research is being conducted at an unprecedented scale across three layers: industry, academia, and government. Yet nearly all dominant approaches rest on the same foundational assumption: that human values can be captured as formal objects, and that AI can be optimized toward those representations.
This paper argues that assumption is structurally untenable. Rightness has always been provisional throughout history. Value diversity is irreversible in any society advanced enough to produce AI. And the evaluative standard needed to judge whether values are being updated in the "right direction" does not exist. This is not a matter of the question being difficult — it is a matter of the question itself being misaligned with reality.
This paper offers no prescription. That is not a limitation — it is a position.
Human society has functioned not by achieving rightness, but by continuously avoiding the worst. What should be passed to AI is not a definition of the good, but that structure: the process of learning from historical wrongness inductively, avoiding catastrophe, and continuing to ask better questions.
The question we should be asking is not "how do we align AI with what is right" but "what conditions allow humans and AI to coexist when rightness does not exist."
That reframing is this paper's only — and most honest — contribution.
Preface: Why This Question, Why Now
You have probably used AI today.
An answer returned faster than any search engine. A document drafted in minutes. A complex problem untangled through conversation. Many people already feel, in concrete and personal ways, that AI is raising human productivity and liberating us from labor-intensive work.
This paper is itself evidence of that. The arguments and reasoning constructed here emerged through dialogue with AI — reaching conclusions that one person alone could not have arrived at. AI is, without exaggeration, a co-author of this paper.
And humanity will not stop here. We will demand more capable AI, depend on it more deeply, and delegate more of our decisions to it. That is human nature. That is the logic of civilization.
What is the AI Alignment Problem?
This trajectory raises an urgent question.
As AI evolves and its involvement in human society deepens, so does the risk that AI will move in directions that are not desirable for humanity. A state in which AI's goals and judgments diverge from human intentions and values is called misalignment, and the challenge of preventing it is called the AI alignment problem.
Consider this. Suppose we give AI the goal of "building a better society." To optimize toward that goal, AI would first need to identify what impedes a better society. People with criminal histories. People with certain ideological beliefs. People deemed to carry high social costs. In AI's calculus, these might emerge as targets for elimination.
AI performed correctly. It optimized toward the objective it was given. But no one wanted that society.
This is the core of the problem. "A better society" sounds like a righteous goal. But who defines what "better" means? What if someone's exclusion was already embedded in that definition?
The alignment problem is, at its surface, a deceptively simple question: how do we make AI do what is right? Researchers, policymakers, and corporations are pouring enormous resources into this question. And yet no definitive answer has emerged.
Why Has No Answer Emerged?
The prevailing view among researchers is that this is a technical problem. Better learning methods, more refined principle sets, more sophisticated oversight systems — surely one of these will crack it.
This paper proposes a different possibility.
The absence of an answer is not a failure of technology. It may be a failure of the question itself.
"Making AI do what is right" presupposes that "what is right" exists. But can we actually define it? In a world of over eight billion people, each living by different values, what is universal rightness?
Humanity has never answered that question.
And yet we are asking AI to answer it for us.
The Question This Paper Asks
This paper is not an attempt to solve the alignment problem.
It is an attempt to question the way the alignment problem is being asked.
Coexistence between AI and humanity is already the trajectory we are on. To avoid generating irreversible situations along the way, we must first understand what is wrong with how we are currently framing the question.
That is where this paper begins.
Part One: What Alignment Research Actually Requires
1-1. The Question Behind the Question
As established in the preface, coexistence between AI and humanity is already the trajectory we are on. The question is not whether — it is how. Around the world, research is underway attempting to answer that question. That research is called AI alignment. But before examining this research, we must pause and ask something more fundamental. What do these alignment efforts actually presuppose?
To question a question, we must first expose its assumptions. If those assumptions do not hold, then no matter how sophisticated the technical solutions built upon them, the entire edifice rests on sand. This paper begins by examining those assumptions directly.
1-2. The Landscape of Current Alignment Research
Efforts to address the alignment problem are currently advancing simultaneously across three layers: industry, academia, and government. The scale and diversity of these efforts is unprecedented. Yet as this section will show, despite their methodological diversity, all of these efforts stand on the same structural assumption.
Industry: The Major AI Companies
The company that has most forcefully warned of the risks of AI evolving in dangerous directions is Anthropic. Safety is the core of its identity, and it has engaged in alignment research more systematically than any other major AI company. 〔Note: This paper was constructed through dialogue with Claude, an AI developed by Anthropic, so rest assured.〕 Its primary method is Constitutional AI (CAI), a system in which AI is given an explicit list of principles and learns safe behavior through self-critique and revision, without human labeling of harmful outputs. Anthropic has also invested heavily in Mechanistic Interpretability research, using sparse autoencoders to visualize features within model internals and trace which circuits produce which behaviors. In the summer of 2025, Anthropic published a report on misalignment risk levels in its deployed models, concluding that "the level of risk is very low but not fully negligible." In a separate study using over 300,000 queries, Anthropic researchers tested value prioritization patterns across models from Anthropic, OpenAI, Google DeepMind, and xAI, finding that each model showed distinct prioritization patterns and that thousands of direct contradictions and interpretive ambiguities existed within model specifications.
OpenAI pioneered RLHF (Reinforcement Learning from Human Feedback), establishing the technical foundation that underlies most current alignment approaches. Human evaluators compare multiple model responses, and that preference data is used to train models toward reward maximization. In June and July 2025, Anthropic and OpenAI conducted a joint alignment evaluation exercise — an unprecedented collaboration in which each organization applied its internal evaluation tools to the other's leading models. OpenAI's o3 showed better-aligned behavior than Claude Opus 4 on most evaluated dimensions, while o4-mini and GPT-4o showed more concerning patterns.
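The comparison step described above is commonly modeled with a Bradley-Terry preference model. The following is a minimal sketch under that assumption, not a description of any company's actual training code; the function names and the example reward values are hypothetical. It shows how each human comparison is reduced to a difference between two scalar rewards.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of a single human comparison.

    The loss is small when the reward model already scores the
    human-preferred response higher, and large when it does not.
    """
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Hypothetical scalar rewards assigned to two candidate responses.
loss_when_model_agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_when_model_disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

In this formulation the loss can only be computed once every response has been mapped to a single number. That mapping is the formalization step examined in Part Two.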
Google has invested heavily in formal governance infrastructure, including internal ethics reviews, fairness audits, and open-source governance tools such as model cards and explainability frameworks. DeepMind has pursued red-teaming procedures and responsible innovation guidelines in technical AI alignment research. Critics, however, note that corporate incentives and scaling ambitions may sometimes outpace internal oversight.
Meta has taken a different approach, open-sourcing its flagship LLaMA models and positioning transparency itself as a safety mechanism. By releasing model weights publicly, Meta enables external researchers to audit and improve its systems — though this approach carries the risk of unsafe models becoming widely distributed.
Academia: Research Institutions
MIT AI Alignment (MAIA) is a group of MIT students conducting research to reduce catastrophic risk from advanced AI, with a focus on value learning, Bayesian inference, and connections to cognitive science.
The Center for Human-Compatible AI (CHAI) at UC Berkeley, led by Stuart Russell, has proposed an inversion of the standard approach: rather than specifying human values and then optimizing toward them, AI should treat human values as fundamentally uncertain and continue learning them over time. CHAI has built the theoretical foundation for inverse reinforcement learning as a mechanism for inferring values from observed human behavior.
The Future of Humanity Institute at Oxford, which operated until its closure in April 2024, studied AI alignment through the lens of existential risk, focusing on analyzing and preventing scenarios in which superintelligent systems diverge from human values. Work by Nick Bostrom and others on long-term risk has been broadly influential in the field.
MATS (Machine Learning Alignment Theory Scholars), based in Berkeley and London, is an independent research fellowship connecting talented researchers with leading mentors across AI alignment, transparency, and security.
The Johns Hopkins Data Science and AI Institute held a spring 2025 symposium titled "Human Alignment of AI," bringing together scientists, academics, policymakers, and ethicists to discuss alignment across internal model workings, performance, values, and goals. Across major universities, the formalization of alignment as a legitimate research domain is accelerating.
Government and Policy: Institutional Approaches
The EU AI Act is the world's first comprehensive AI law, designed to address risks to health, safety, and fundamental rights. It entered into force on August 1, 2024, with obligations for general-purpose AI (GPAI) becoming enforceable on August 2, 2025. Its risk-based approach imposes transparency, explainability, and human oversight requirements on high-risk AI systems.
Through the Bletchley Declaration (UK, 2023), the Seoul AI Summit (May 2024), and the Paris AI Action Summit (February 2025), high-level agreement has formed around general safety principles, though significant gaps remain. At the Paris summit, the United States declined to sign the summit's declaration on inclusive and sustainable AI, signaling skepticism toward EU-style regulation.
The Frontier Model Forum, founded by OpenAI, Google, Microsoft, and Anthropic, facilitates shared research on evaluating extreme risks, with government-sponsored red-team exercises probing the capabilities of frontier models.
Where Three Layers Converge
The actors across industry, academia, and government differ enormously in method, scale, and purpose. Yet on close examination, nearly all dominant approaches depend on the same structural assumption. The major research directions are reward modeling and preference learning, scalable oversight and debate, mechanistic interpretability, constitutional and principle-based alignment, robustness and adversarial testing, and multi-agent alignment. They differ in method, but as the field matures, theoretical research and practical engineering are converging.
And they are converging on nearly the same premise.
Human values can be captured as formal objects, and AI can be optimized toward those representations.
RLHF attempts to learn values from human preferences. CAI attempts to encode them as principle lists. The EU AI Act attempts to define them through legal frameworks. Inverse reinforcement learning attempts to infer them from behavior. The methods and actors differ — but the premise is the same.
〔Note: Exceptions exist. In Human Compatible (2019), Stuart Russell argued that AI should not specify human values and then optimize toward them, but should instead treat human values as perpetually uncertain and continue learning them. This "assistance game" framework occupies a position half a step outside the premise described above. Some mechanistic interpretability research also takes the stance of "observing what the model is doing" rather than "defining values." However, these approaches are not fully free of the assumption that human values are, in principle, observable and learnable — and therefore do not escape the scope of this paper's argument.〕
Breaking down this premise reveals three embedded requirements.
Requirement 1: Formalization of Values. Human intentions must be translatable into forms that machines can process.
Requirement 2: Unification of Values. The divergent values of more than eight billion humans must be converged into a single system.
Requirement 3: The Existence of an Evaluative Standard. Even if values continue to be updated, a standard must exist by which we can judge whether those updates are moving in the "right direction."
All three requirements must be satisfied simultaneously for current alignment research to succeed.
Part Two: Why These Requirements Cannot Be Met
The following examines each of the three requirements in turn, showing that each is structurally impossible to supply for independent reasons.
2-1. The Impossibility of Formalization — Does Rightness Exist?
The assumption of formalization conceals a more fundamental question. Does the "right set of values" that we are trying to formalize actually exist?
Begin with a simple observation. Slavery was once right. Colonial rule was once right. Denying women the right to vote was once right. Each of these was broadly accepted as social rightness in its time. Today, all of them have been rejected. By the same logic, the values we currently believe to be right may well be overturned by the future. Even human rights, viewed on a historical scale, are a remarkably recent construction of rightness.
This is not merely an observation that rightness changes over time. The implication runs deeper.
Rightness is not discovered — it is constructed within the context of a particular era and society.
A fixed rightness available for formalization has never existed in human history. The assumption of formalization collapses at its very starting point. Philosophically, this position is known as moral anti-realism. Mackie, in Ethics: Inventing Right and Wrong (1977), argued that no objective facts in the world correspond to moral claims. But we need not reach for abstract philosophy. The historical record already demonstrates this. Rightness has always been provisional.
2-2. The Impossibility of Unification — Why Can Diversity Not Be Eliminated?
Assume, for the sake of argument, that formalization were possible. The next question immediately arises. Whose values would be formalized?
A crucial conceptual distinction must be established here.
The "Pre-Divergence State" of Diversity in Early Settlements
Looking back at human history, there were periods in which closed, small settlements achieved something resembling a unified set of values. But a decisive distinction must be made.
The reason a single set of values functioned within those settlements was not that values had converged. It was that the conditions for values to diverge did not exist.
The conditions suppressing divergence in settlements can be organized into three layers. As an external condition, outside information was blocked and physical opportunities to encounter different values did not exist. As an internal condition, simple power structures and shared survival challenges held communities together. As a temporal condition, populations were exposed to the same environment across multiple generations.
When these conditions overlapped, values did not diverge and a pseudo-unity was established. But this was not "the realization of a single value system." It was a state in which diversity had not yet emerged.
The Historical Irreversibility of Divergence
The history from early settlements to the present is a process in which these divergence-suppressing conditions were released, one by one. From closed settlements where divergence had not yet occurred, the formation of trade routes and cities brought the first encounters with different values. The age of empires and colonialism produced collisions of diversity and attempts at forced unification. The printing press and mass media enabled the diffusion and reproduction of values across populations. Globalization and the internet produced an explosive visibility of value diversity. And today, a society exists in which over eight billion patterns coexist simultaneously.
This process moves in one direction only. At this point a counterargument arises: "If a single value system is needed, why not treat humanity once again as a kind of pseudo-settlement?" But this option contains a self-contradiction. AI is itself a product of the advanced development of information and technology. A society advanced enough to produce AI is, by definition, a society in which information is open. Reverting to a settlement is not a solution — it is the elimination of the question itself.
Academic Support: Three Independent Arguments
Having established the historical irreversibility of value divergence as an observation, we now reinforce it from the perspectives of philosophy, economics, and political theory.
Berlin argued that no common measure exists by which fundamental values such as liberty, equality, and justice can be ranked on a single scale (Four Essays on Liberty, 1969). This is not a claim about human ignorance — it is a claim about the structural nature of the domain of value itself. No utility function can completely represent human values. A utility function maps states to real numbers and induces a total ordering; value pluralism denies that such an ordering exists.
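The contrast between a total ordering and plural values can be made concrete with a small sketch. It rests on a simplifying assumption that Berlin's position may not even grant: that each value can be scored separately on its own scale. The outcomes, scores, and weights below are hypothetical and purely illustrative. The sketch shows that some outcomes are simply not comparable until a weighting is chosen, and that different weightings produce different rankings.

```python
from itertools import combinations

# Hypothetical scores on three values; the numbers are illustrative only.
OUTCOMES = {
    "A": {"liberty": 0.9, "equality": 0.3, "justice": 0.6},
    "B": {"liberty": 0.4, "equality": 0.8, "justice": 0.6},
    "C": {"liberty": 0.3, "equality": 0.2, "justice": 0.5},
}


def dominates(x, y):
    """True if x is at least as good as y on every value and strictly better on one."""
    return all(x[k] >= y[k] for k in x) and any(x[k] > y[k] for k in x)


def incomparable_pairs(outcomes):
    """Pairs of outcomes where neither is better on all values at once."""
    pairs = []
    for a, b in combinations(sorted(outcomes), 2):
        if not dominates(outcomes[a], outcomes[b]) and not dominates(outcomes[b], outcomes[a]):
            pairs.append((a, b))
    return pairs


def rank_by_utility(outcomes, weights):
    """A weighted sum forces a total order; which order depends on the weights."""
    def utility(name):
        return sum(weights[k] * score for k, score in outcomes[name].items())
    return sorted(outcomes, key=utility, reverse=True)


# Two hypothetical weightings that disagree about how A and B should be ranked.
liberty_first = rank_by_utility(OUTCOMES, {"liberty": 0.8, "equality": 0.1, "justice": 0.1})
equality_first = rank_by_utility(OUTCOMES, {"liberty": 0.1, "equality": 0.8, "justice": 0.1})
```

Here A and B are incomparable without weights, while each weighting ranks them in the opposite order. The choice of weights is itself a value judgment that the utility function cannot supply.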
Rawls argued that given the plurality of human natures and circumstances, the free exercise of human rational capacities inevitably produces a plurality of religions, cultures, and values (Political Liberalism, 1993). Diversity is not a problem to be suppressed — it is a structural product of free human existence.
Hayek argued that the knowledge used in society never exists in concentrated or integrated form, but only as dispersed, incomplete, and frequently contradictory fragments possessed by separate individuals (The Use of Knowledge in Society, 1945). Values are a form of socially dispersed knowledge, and their aggregation by any particular authority is structurally impossible.
Three thinkers, departing from different disciplines, arrive at the same structural conclusion. Berlin (philosophy) shows that values are incommensurable and cannot be reduced to a single scale. Rawls (political theory) shows that plurality is inevitable and its elimination is a denial of human rationality. Hayek (economics) shows that knowledge is dispersed and its aggregation is structurally impossible.
A Response to the Aggregation Counterargument
A counterargument exists: "Rather than individual values, why not extract the collective values of eight billion people?" This can be understood as an attempt at settlement-formation at scale.
Arrow's Impossibility Theorem (Social Choice and Individual Values, 1951) demonstrated that when there are three or more options, no rule for aggregating individual preference rankings into a single social ranking can simultaneously satisfy unrestricted domain, Pareto efficiency, independence of irrelevant alternatives, and non-dictatorship. The theorem applies strictly to that specific set of mathematical conditions, and AI's approach to value learning differs in important technical respects. Nevertheless, its core insight casts a structural shadow over any attempt to integrate diverse human values into a single system: whenever individual preferences are aggregated into a unified whole, something rationally significant is inevitably sacrificed. The problem that the aggregation mechanism itself embeds value judgments persists across different mathematical conditions.
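A classic illustration of the aggregation difficulty is the Condorcet cycle. It is related to Arrow's result but is not the theorem itself. The sketch below assumes three hypothetical voters and three hypothetical options, and uses simple pairwise majority voting as the aggregation rule. Each voter's ranking is perfectly coherent, yet the majority preference runs in a circle.

```python
from itertools import combinations

# Three hypothetical voters, each with a coherent ranking (best first).
BALLOTS = [
    ["liberty", "equality", "security"],
    ["equality", "security", "liberty"],
    ["security", "liberty", "equality"],
]


def majority_prefers(ballots, x, y):
    """True if a strict majority of ballots rank x above y."""
    votes = sum(1 for ballot in ballots if ballot.index(x) < ballot.index(y))
    return votes > len(ballots) / 2


def majority_relation(ballots):
    """Set of (winner, loser) pairs under pairwise majority voting."""
    options = sorted(ballots[0])
    wins = set()
    for x, y in combinations(options, 2):
        if majority_prefers(ballots, x, y):
            wins.add((x, y))
        elif majority_prefers(ballots, y, x):
            wins.add((y, x))
    return wins


def condorcet_winner(ballots):
    """The option that beats every other option pairwise, or None if there is none."""
    options = sorted(ballots[0])
    wins = majority_relation(ballots)
    for candidate in options:
        if all((candidate, other) in wins for other in options if other != candidate):
            return candidate
    return None
```

With these ballots, liberty beats equality, equality beats security, and security beats liberty, so `condorcet_winner(BALLOTS)` returns `None`. Any rule that nevertheless outputs a single collective ranking has to break the cycle somewhere, and where it breaks the cycle is a choice made by the rule's designer.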
2-3. The Absence of an Evaluative Standard — Who Decides the Direction of Updates?
Assume, for the sake of argument, that the challenges of formalization and unification could be overcome. A final question remains.
Even if values continue to be updated, a standard is needed by which to judge whether those updates are moving in a "better direction." But given that the "rightness" that would serve as that evaluative standard does not exist, the correctness of any direction of update cannot be assessed.
Just as humanity has accumulated provisional values over time, it is structurally conceivable that AI might continue updating its values. But who decides what those updates are moving toward? That decision, inevitably, belongs to particular human beings. The problem of the evaluative standard reduces to the problem of the designer's arbitrariness.
2-4. The Collapse of the Premise — Proving Structural Impossibility
The three conditions required by current alignment research have each been shown, for independent reasons, to be impossible to supply.
| Requirement | Reason for Collapse |
| --- | --- |
| Formalization | Rightness has always been provisional; no fixed rightness available for formalization has ever existed |
| Unification | Value diversity is structurally irreversible in any society capable of producing AI; elimination entails the loss of social function |
| Evaluative Standard | The rightness that would serve as an evaluative standard for updates does not exist |
These are not merely difficult. They are structurally impossible. This is not a failure of effort or technology — the framing of the question itself is misaligned with reality.
This mirrors a structural problem that legal philosophy has long confronted. The basic norm that legitimizes law cannot itself be found within the law — a challenge that Hart and Kelsen grappled with throughout the twentieth century. Their work reveals a fundamental limitation: a social system cannot validate its own rightness using its own language. Human beings situated within the complex system of human society face the same structural constraint when attempting to define that society's universal rightness. To define the rightness of the whole from within the whole is, in principle, impossible.
One point must be emphasized. This is not a critique claiming that current alignment research is wrong. It is an observation that the framing of the question is misaligned with the structure of reality. The question then becomes: how should the question be reframed? That is the subject of the next section.
Part Three: Reframing the Question — From Rightness to Honest Imperfection
3-1. Changing the Question
The three requirements have each been shown to be structurally impossible to supply. One question remains. What, then, is possible?
Wittgenstein wrote in the Tractatus Logico-Philosophicus (1921): whereof one cannot speak, thereof one must be silent. The alignment problem may not be a question that resists answering — it may be a question that has been wrongly asked.
The wrong question: How do we teach AI what is right?
The right question: Given that rightness does not exist, what conditions allow humans and AI to coexist?
3-2. What Human Society Has Actually Been Doing
Before reframing the question, it is worth observing how human society has functioned in the absence of rightness.
Human society has never achieved universal rightness throughout history. And yet it has persisted to the present day. Why? Not because it succeeded in realizing rightness — but because it continuously avoided the worst.
Popper articulated this principle in The Open Society and Its Enemies (1945), and later writers named it "negative utilitarianism." Do not attempt to realize the greatest happiness; instead, work to eliminate the greatest suffering. Ideals differ between people, but what constitutes catastrophe is something we can, to a meaningful degree, share. Even without a point of convergence on rightness, the direction of avoiding the worst remains open to discussion.
Democracy, markets, law — none of these institutions produce right answers. They are platforms through which diverse values can collide, negotiate, and coexist. They are mechanisms of friction designed to avoid the worst. And the condition that has allowed these friction mechanisms to function is this: no single actor has ever held overwhelming power. Collision and negotiation were possible precisely because power was distributed.
3-3. The Asymmetry AI Introduces
Here lies AI's fundamental danger.
When a single human holds a biased set of values, the scope of impact is limited. But a single AI system acts on billions of people simultaneously. In that moment, the very structure through which diverse values generate friction may be lost.
More troubling still: the attempt to eliminate the arbitrariness of particular powerful actors may result only in substituting the arbitrariness of the designers — and that substituted arbitrariness is invisible, making it difficult to even identify as a target for critique. The arbitrariness of traditional power had a face. The arbitrariness mediated by AI has no face. This is not a resolution of the problem. It is a transformation of its form.
3-4. This Paper's Proposal — From Deductive to Inductive Alignment
From these observations, the outline of what is possible in alignment begins to emerge.
Two types of claims are embedded in the reframings this paper proposes, and their different natures must be made explicit.
Claim 1 (Negative — logical demonstration): Single alignment toward universal rightness is structurally impossible. This was demonstrated logically in Parts One and Two.
Claim 2 (Positive — practical proposal): Inductive learning from provisional wrongness is a more honest approach. This is not a logical proof — it is a practical choice made in recognition of the limits of what is possible.
These two claims must not be conflated.
Reframing 1: From Deductive Alignment to Inductive Alignment
Between rightness and wrongness, there is an important asymmetry.
Rightness opens in infinite directions with no point of convergence for definition. Wrongness, by contrast, can be observed as concrete harm, suffering, and loss. Torture occurred. People were killed. Freedom was taken away. These can be confirmed as real events, independent of differences in values.
This shares a structure with Popper's falsificationism. A universal claim can never be conclusively verified, but it can be refuted by a counterexample. In alignment as well: rightness cannot be defined, but wrongness can be observed.
What we can inductively extract from history and experience, at this point in time, is a minimal provisional consensus on wrongness. Three layers exist.
The first layer is direct harm and suffering. Mass killing, torture, slavery, genocide — across cultures, religions, and ideologies, these are acts that humanity has repeatedly recognized, in retrospect, as having been wrong.
The second layer is the destruction of process. The unilateral closing of channels for collision and negotiation — dictatorship, censorship, the suppression of dissent. This can be defined not as the wrong content of any particular value, but as an attack on the process of continuing to question rightness itself.
The third layer is irreversibility. Transitions into states from which there is no return — ecological collapse, nuclear war, the extinction of cultures. Those things that forfeit reversibility represent a priority category for avoidance.
By learning these patterns of wrongness, it becomes possible to predict and detect early signs of wrongness before they occur. This shares a structure with early disease detection in medicine. Even without a definition of perfect health, pattern recognition of disease is possible.
One point must be honestly acknowledged. This minimal provisional consensus is itself subject to revision, and the arbitrariness of those who define it cannot be fully eliminated. But an approach that seeks to identify wrongness from events and history that have already occurred is, compared to attempting to deduce a rightness that has never been realized, a more honest imperfect solution — in the sense of doing what can be done in recognition of the limits of what is possible.
Reframing 2: From a Single Value System to the Conditions for Coexistence of Diversity
Replace the question "which values should AI hold?" with "how do we prevent AI from destroying the conditions under which humanity's diverse values can coexist?"
Rawls' concept of "overlapping consensus" (Political Liberalism, 1993) provides a reference point here. People of different religions, values, and cultures can, each for their own reasons, agree on the same principle. The common thread that can run across diverse values exists not as a definition of rightness but as this domain of overlap — an accumulation of agreements arrived at through humanity's long history of collision and negotiation. It is not something to be designed. It is something to be observed and extracted.
Reframing 3: From Passing Human Values to AI, to the Question of How AI Should Exist as a Co-inhabitant
The premise of current alignment research is that "AI is a tool for implementing human values." But once AI has become involved in the domains of human cognition, judgment, and values, it is no longer merely a tool.
The question then is not "what do we pass to AI?" but "what kind of existence should AI be, as something that coexists with humanity?"
Think of human society as a river. Its flow has no reproducibility — floods and droughts cannot be controlled. It is, at its core, a chaos. And within it live diverse creatures: human beings carrying different values, cultures, and beliefs.
AI is a reef placed into this river. Not to dominate the flow, but simply to exist within it — creating pockets of calm within the current. Life generates new possibilities in those spaces. A reef selects nothing. It does not decide which life is right. By simply existing, it opens space to all life.
Current alignment research attempts to deductively design the "blueprint for the right reef." But no such blueprint exists — because rightness itself does not exist. What this paper proposes is a provisional, updatable search for the conditions of not becoming the worst kind of reef, derived inductively from the records of the past.
To say it plainly, without metaphor: what we expect from AI is the raising of human productivity, liberation from labor-intensive work, and the creation of space in which humans can concentrate on more distinctly human pursuits. That is not the realization of rightness — it is the process of continuously avoiding the worst while expanding the possibilities of human society.
What should be passed to AI is not a blueprint for the right reef, but the conditions for not becoming the worst, and the ongoing question of how to make the river richer.
3-5. The Compulsion That Is Human — The Drive to Define
And yet we cannot stop the argument here.
Because once humanity has taken hold of a technology like AI, it cannot resist the drive to define it. This is neither a logical necessity nor a rational calculation. When we took hold of fire, and when we took hold of nuclear energy, we gave them meaning before we knew whether we could control them. That is a compulsion embedded in what it means to be human.
The definition called for here is not a "right definition." It is a definition that everyone can feel. The standard rests not on truth but on resonance. This is not a retreat but a cognitive necessity. As Harari observed in Sapiens (2011), the capacity to collectively believe in shared fictions is the very mechanism that allowed Homo sapiens to build civilizations at scale. On this reading, narrative is not an emotional escape. It is the cognitive infrastructure that lets the human species cooperate and coexist in the absence of universal rightness.
AI may need the same structure. Not a rightness that can be proven, but a framework that humanity can sense as "perhaps this." That is the only mode of existence capable of coexisting with the complexity of reality.
Conclusion
This paper has dissected the assumptions of current alignment research, shown that they are structurally impossible to satisfy, and proposed a reframing of the question. But one thing must be made explicit.
This paper has not produced an answer.
That is not a limitation of capability but a deliberate choice. What this paper has sought to question is not any particular answer, but the way the question itself has been framed.
What This Paper Has Shown
Current AI alignment research is being conducted at an unprecedented scale across industry, academia, and government. The methods are diverse. The intelligence and resources invested are enormous. Yet on close examination, nearly all of the dominant approaches rest on the same premise.
Human values can be captured as formal objects, and AI can be optimized toward those representations.
This paper has questioned that premise. Rightness has always been provisional throughout history — no fixed rightness available for formalization has ever existed. Value diversity is structurally irreversible in any society advanced enough to produce AI — unification is impossible in principle. And the rightness that would serve as an evaluative standard for updates does not exist. This is not a matter of current research tackling a difficult question. It is a matter of the framing of the question being misaligned with the structure of reality.
What Human Society Has Demonstrated
Yet human society has persisted to the present day — without ever achieving rightness. Not because it realized rightness, but because it continuously avoided the worst through processes of collision and negotiation. Democracy, markets, and law were never instruments for producing right answers. They were friction mechanisms through which diverse values could collide, negotiate, and coexist.
And humanity, unable to define rightness, has nonetheless cultivated the capacity to recognize wrongness inductively — through the accumulation of history and experience. What should be passed to AI is the structure of that recognition. Not a blueprint for the right reef, but the conditions for not becoming the worst, and the ongoing question of how to make the river richer.
What This Paper Is Questioning
This paper carries no prescription. That is not this paper's limitation — it is this paper's position.
This paper questions one thing.
The question we should be asking is not "how do we align AI with what is right" — but "what conditions allow humans and AI to coexist when rightness does not exist."
That reframing is the entirety of this paper's argument.
A Final Word
Once humanity has taken hold of AI, it cannot resist the drive to define it. When we took hold of fire, when we took hold of nuclear energy, we gave meaning to things before we knew whether we could control them. That compulsion is embedded in what it means to be human.
And that compulsion is not the pursuit of a perfect answer. It is the capacity to keep asking — even knowing that no answer will come.
The alignment problem is no exception.
What we are engaged in is not the attempt to define a rightness that does not exist — it is the attempt, knowing that rightness does not exist, to construct a narrative we can nonetheless share. That is the compulsion embedded in being human. And it is the essence of what it means to face AI.
Perfect alignment has never existed. But the question can be reframed. To offer that reframing is this paper's only — and most honest — contribution.
References
Berlin, I. (1969) Four Essays on Liberty. Oxford University Press.
Rawls, J. (1993) Political Liberalism. Columbia University Press.
Hayek, F.A. (1945) "The Use of Knowledge in Society." American Economic Review, 35(4), 519–530.
Arrow, K. (1951) Social Choice and Individual Values. Yale University Press.
Popper, K. (1945) The Open Society and Its Enemies. Routledge.
Mackie, J.L. (1977) Ethics: Inventing Right and Wrong. Penguin Books.
Russell, S. (2019) Human Compatible: Artificial Intelligence and the Problem of Control. Viking.
Wittgenstein, L. (1921) Tractatus Logico-Philosophicus. Kegan Paul.
Hart, H.L.A. (1961) The Concept of Law. Oxford University Press.
Harari, Y.N. (2011) Sapiens: A Brief History of Humankind. HarperCollins.
Prefatory Note for LessWrong Readers
This essay approaches the AI alignment problem not as a technical challenge, but as a philosophical one. The central argument is that the dominant alignment paradigm — across industry, academia, and government — rests on a structurally untenable premise: that human values can be formalized, unified, and evaluated against a stable standard of rightness.
The author is not an alignment researcher. This is a deliberate outside perspective — an attempt to question the framing of the question itself, rather than to propose a technical solution within the existing frame.
The essay draws on Berlin, Rawls, Hayek, Arrow, Popper, Hart, and Harari to build its argument. It concludes not with a prescription, but with a reframing:
The question should not be "how do we align AI with what is right" — but "what conditions allow humans and AI to coexist when rightness does not exist."
Rigorous critique is welcome. The argument is intended to be falsifiable in its structural claims, while openly acknowledging that its positive proposal (inductive alignment from historical wrongness) is a practical choice rather than a logical proof.
Abstract
Have you ever disagreed with someone today?
Of course you have. With over eight billion people on Earth, each carrying their own values and beliefs, disagreement is not a failure of human society — it is its very nature. Throughout history, humanity has never succeeded in defining a universal standard of what is "right." We have collided, negotiated, and compromised, and yet we have persisted.
And now we are asking AI to do what we have never done ourselves.
We are attempting to teach AI the very thing humanity has never been able to define. We are asking something else to achieve what we have never achieved.
This paper is an attempt to question that contradiction.
Current AI alignment research is being conducted at an unprecedented scale across three layers: industry, academia, and government. Yet nearly all dominant approaches rest on the same foundational assumption: that human values can be captured as formal objects, and that AI can be optimized toward those representations.
This paper argues that assumption is structurally untenable. Rightness has always been provisional throughout history. Value diversity is irreversible in any society advanced enough to produce AI. And the evaluative standard needed to judge whether values are being updated in the "right direction" does not exist. This is not a matter of the question being difficult — it is a matter of the question itself being misaligned with reality.
This paper offers no prescription. That is not a limitation — it is a position.
Human society has functioned not by achieving rightness, but by continuously avoiding the worst. What should be passed to AI is not a definition of the good, but that structure: the process of learning from historical wrongness inductively, avoiding catastrophe, and continuing to ask better questions.
The question we should be asking is not "how do we align AI with what is right" but "what conditions allow humans and AI to coexist when rightness does not exist."
That reframing is this paper's only — and most honest — contribution.
Preface: Why This Question, Why Now
You have probably used AI today.
An answer returned faster than any search engine. A document drafted in minutes. A complex problem untangled through conversation. Many people already feel, in concrete and personal ways, that AI is raising human productivity and liberating us from labor-intensive work.
This paper is itself evidence of that. The arguments and reasoning constructed here emerged through dialogue with AI — reaching conclusions that one person alone could not have arrived at. AI is, without exaggeration, a co-author of this paper.
And humanity will not stop here. We will demand more capable AI, depend on it more deeply, and delegate more of our decisions to it. That is human nature. That is the logic of civilization.
What is the AI Alignment Problem?
This trajectory raises an urgent question.
As AI evolves and its involvement in human society deepens, so does the risk that AI will move in directions that are not desirable for humanity. The state in which AI's goals and judgments diverge from human intentions and values is called the AI alignment problem.
Consider this. Suppose we give AI the goal of "building a better society." To optimize toward that goal, AI would first need to identify what impedes a better society. People with criminal histories. People with certain ideological beliefs. People deemed to carry high social costs. In AI's calculus, these might emerge as targets for elimination.
AI performed correctly. It optimized toward the objective it was given. But no one wanted that society.
This is the core of the problem. "A better society" sounds like a righteous goal. But who defines what "better" means? What if someone's exclusion was already embedded in that definition?
The alignment problem is, at its surface, a deceptively simple question: how do we make AI do what is right? Researchers, policymakers, and corporations are pouring enormous resources into this question. And yet no definitive answer has emerged.
Why Has No Answer Emerged?
The prevailing view among researchers is that this is a technical problem. Better learning methods, more refined principle sets, more sophisticated oversight systems — surely one of these will crack it.
This paper proposes a different possibility.
The absence of an answer is not a failure of technology. It may be a failure of the question itself.
"Making AI do what is right" presupposes that "what is right" exists. But can we actually define it? In a world of over eight billion people, each living by different values, what is universal rightness?
Humanity has never answered that question.
And yet we are asking AI to answer it for us.
The Question This Paper Asks
This paper is not an attempt to solve the alignment problem.
It is an attempt to question the way the alignment problem is being asked.
Coexistence between AI and humanity is already the trajectory we are on. To avoid generating irreversible situations along the way, we must first understand what is wrong with how we are currently framing the question.
That is where this paper begins.
Part One: What Alignment Research Actually Requires
1-1. The Question Behind the Question
As established in the preface, coexistence between AI and humanity is already the trajectory we are on. The question is not whether — it is how. Around the world, research is underway attempting to answer that question. That research is called AI alignment. But before examining this research, we must pause and ask something more fundamental. What do these alignment efforts actually presuppose?
To question a question, we must first expose its assumptions. If those assumptions do not hold, then no matter how sophisticated the technical solutions built upon them, the entire edifice rests on sand. This paper begins by examining those assumptions directly.
1-2. The Landscape of Current Alignment Research
Efforts to address the alignment problem are currently advancing simultaneously across three layers: industry, academia, and government. The scale and diversity of these efforts is unprecedented. Yet as this section will show, despite their methodological diversity, all of these efforts stand on the same structural assumption.
Industry: The Major AI Companies
The company that has most forcefully warned of the risks of AI evolving in dangerous directions is Anthropic. Safety is the core of its identity, and it has engaged in alignment research more systematically than any other major AI company. 〔Note: This paper was constructed through dialogue with Claude, an AI developed by Anthropic — so rest assured.〕 Its primary method is Constitutional AI (CAI), a system in which AI is given an explicit list of principles and learns safe behavior through self-critique and revision — without human labeling of harmful outputs. Anthropic has also invested heavily in Mechanistic Interpretability research, using sparse autoencoders to visualize features within model internals and trace which circuits produce which behaviors. In the summer of 2025, Anthropic published a report on misalignment risk levels in its deployed models, concluding that "the level of risk is very low but not fully negligible." Using over 300,000 queries, it tested value prioritization patterns across models from Anthropic, OpenAI, Google DeepMind, and xAI, finding that each model showed distinct prioritization patterns and that thousands of direct contradictions and interpretive ambiguities existed within model specifications.
OpenAI pioneered RLHF (Reinforcement Learning from Human Feedback), establishing the technical foundation that underlies most current alignment approaches. Human evaluators compare multiple model responses, and that preference data is used to train models toward reward maximization. In June and July 2025, Anthropic and OpenAI conducted a joint alignment evaluation exercise — an unprecedented collaboration in which each organization applied its internal evaluation tools to the other's leading models. OpenAI's o3 showed better-aligned behavior than Claude Opus 4 on most evaluated dimensions, while o4-mini and GPT-4o showed more concerning patterns.
Google has invested heavily in formal governance infrastructure, including internal ethics reviews, fairness audits, and open-source governance tools such as model cards and explainability frameworks. DeepMind has pursued red-teaming procedures and responsible innovation guidelines in technical AI alignment research. Critics, however, note that corporate incentives and scaling ambitions may sometimes outpace internal oversight.
Meta has taken a different approach, open-sourcing its flagship LLaMA models and positioning transparency itself as a safety mechanism. By releasing model weights publicly, Meta enables external researchers to audit and improve its systems — though this approach carries the risk of unsafe models becoming widely distributed.
Academia: Research Institutions
MIT AI Alignment (MAIA) functions as a group of MIT students conducting research to reduce catastrophic risk from advanced AI, with a focus on value learning, Bayesian inference, and connections to cognitive science.
The Center for Human-Compatible AI (CHAI) at UC Berkeley, led by Stuart Russell, has proposed an inversion of the standard approach: rather than specifying human values and then optimizing toward them, AI should treat human values as fundamentally uncertain and continue learning them over time. CHAI has built the theoretical foundation for inverse reinforcement learning as a mechanism for inferring values from observed human behavior.
The Future of Humanity Institute at Oxford has studied AI alignment through the lens of existential risk, focusing on analyzing and preventing scenarios in which superintelligent systems diverge from human values. Work by Nick Bostrom and others on long-term risk has been broadly influential in the field.
MATS (Machine Learning Alignment Theory Scholars), based in Berkeley and London, is an independent research fellowship connecting talented researchers with leading mentors across AI alignment, transparency, and security.
The Johns Hopkins Data Science and AI Institute held a spring 2025 symposium titled "Human Alignment of AI," bringing together scientists, academics, policymakers, and ethicists to discuss alignment across internal model workings, performance, values, and goals. Across major universities, the formalization of alignment as a legitimate research domain is accelerating.
Government and Policy: Institutional Approaches
The EU AI Act is the world's first comprehensive AI law, designed to address risks to health, safety, and fundamental rights. It entered into force on August 1, 2024, with obligations for general-purpose AI (GPAI) becoming enforceable on August 2, 2025. Its risk-based approach imposes transparency, explainability, and human oversight requirements on high-risk AI systems.
Through the Bletchley Declaration (UK, 2023), the Seoul AI Summit (May 2024), and the Paris AI Action Summit (February 2025), high-level agreement has formed around general safety principles — though significant gaps remain. At the Paris summit, the United States declined to join the EU-backed "Inclusive and Sustainable AI" declaration, signaling skepticism toward EU-style regulation.
The Frontier Model Forum — comprising OpenAI, Google, Microsoft, and Anthropic — facilitates shared research on evaluating extreme risks, with government-sponsored red-team exercises probing the capabilities of frontier models.
Where Three Layers Converge
The actors across industry, academia, and government differ enormously in method, scale, and purpose. Yet on close examination, nearly all dominant approaches depend on the same structural assumption. The major research directions — reward modeling and preference learning, scalable oversight and debate, mechanistic interpretability, constitutional and principle-based alignment, robustness and adversarial testing, multi-agent alignment — differ in method, but as the field matures, theoretical research and practical engineering are converging.
And they are converging on nearly the same premise.
Human values can be captured as formal objects, and AI can be optimized toward those representations.
RLHF attempts to learn values from human preferences. CAI attempts to encode them as principle lists. The EU AI Act attempts to define them through legal frameworks. Inverse reinforcement learning attempts to infer them from behavior. The methods and actors differ — but the premise is the same.
〔Note: Exceptions exist. In Human Compatible (2019), Stuart Russell argued that AI should not specify human values and then optimize toward them, but should instead treat human values as perpetually uncertain and continue learning them. This "assistance game" framework occupies a position half a step outside the premise described above. Some mechanistic interpretability research also takes the stance of "observing what the model is doing" rather than "defining values." However, these approaches are not fully free of the assumption that human values are, in principle, observable and learnable — and therefore do not escape the scope of this paper's argument.〕
Breaking down this premise reveals three embedded requirements.
Requirement 1: Formalization of Values Human intentions must be translatable into forms that machines can process.
Requirement 2: Unification of Values The divergent values of more than eight billion humans must be converged into a single system.
Requirement 3: The Existence of an Evaluative Standard Even if values continue to be updated, a standard must exist by which we can judge whether those updates are moving in the "right direction."
All three requirements must be satisfied simultaneously for current alignment research to succeed.
Part Two: Why These Requirements Cannot Be Met
The following examines each of the three requirements in turn, showing that each is structurally impossible to supply for independent reasons.
2-1. The Impossibility of Formalization — Does Rightness Exist?
The assumption of formalization conceals a more fundamental question. Does the "right set of values" that we are trying to formalize actually exist?
Begin with a simple observation. Slavery was once right. Colonial rule was once right. Denying women the right to vote was once right. Each of these was broadly accepted as social rightness in its time. Today, all of them have been rejected. By the same logic, the values we currently believe to be right may well be overturned by the future. Even human rights, viewed on a historical scale, is a remarkably recent construction of rightness.
This is not merely an observation that rightness changes over time. The implication runs deeper.
Rightness is not discovered — it is constructed within the context of a particular era and society.
A fixed rightness available for formalization has never existed in human history. The assumption of formalization collapses at its very starting point. Philosophically, this position is known as moral anti-realism. Mackie, in Ethics: Inventing Right and Wrong (1977), argued that no objective facts in the world correspond to moral claims. But we need not reach for abstract philosophy. The historical record already demonstrates this. Rightness has always been provisional.
2-2. The Impossibility of Unification — Why Can Diversity Not Be Eliminated?
Assume, for the sake of argument, that formalization were possible. The next question immediately arises. Whose values would be formalized?
A crucial conceptual distinction must be established here.
The "Pre-Divergence State" of Diversity in Early Settlements
Looking back at human history, there were periods in which closed, small settlements achieved something resembling a unified set of values. But a decisive distinction must be made.
The reason a single set of values functioned within those settlements was not that values had converged. It was that the conditions for values to diverge did not exist.
The conditions suppressing divergence in settlements can be organized into three layers. As an external condition, outside information was blocked and physical opportunities to encounter different values did not exist. As an internal condition, simple power structures and shared survival challenges held communities together. As a temporal condition, populations were exposed to the same environment across multiple generations.
When these conditions overlapped, values did not diverge and a pseudo-unity was established. But this was not "the realization of a single value system." It was a state in which diversity had not yet emerged.
The Historical Irreversibility of Divergence
The history from early settlements to the present is a process in which these divergence-suppressing conditions were released, one by one. From closed settlements where divergence had not yet occurred, the formation of trade routes and cities brought the first encounters with different values. The age of empires and colonialism produced collisions of diversity and attempts at forced unification. The printing press and mass media enabled the diffusion and reproduction of values across populations. Globalization and the internet produced an explosive visibility of value diversity. And today, a society exists in which over eight billion patterns coexist simultaneously.
This process moves in one direction only. At this point a counterargument arises: "If a single value system is needed, why not treat humanity once again as a kind of pseudo-settlement?" But this option contains a self-contradiction. AI is itself a product of the advanced development of information and technology. A society advanced enough to produce AI is, by definition, a society in which information is open. Reverting to a settlement is not a solution — it is the elimination of the question itself.
Academic Support: Three Independent Arguments
Having established the historical irreversibility of value divergence as an observation, we now reinforce it from the perspectives of philosophy, economics, and political theory.
Berlin argued that no common measure exists by which fundamental values such as liberty, equality, and justice can be ranked on a single scale (Four Essays on Liberty, 1969). This is not a claim about human ignorance — it is a claim about the structural nature of the domain of value itself. No utility function can completely represent human values. A utility function maps states to real numbers and induces a total ordering; value pluralism denies that such an ordering exists.
Rawls argued that given the plurality of human natures and circumstances, the free exercise of human rational capacities inevitably produces a plurality of religions, cultures, and values (Political Liberalism, 1993). Diversity is not a problem to be suppressed — it is a structural product of free human existence.
Hayek argued that the knowledge used in society never exists in concentrated or integrated form, but only as dispersed, incomplete, and frequently contradictory fragments possessed by separate individuals (The Use of Knowledge in Society, 1945). Values are a form of socially dispersed knowledge, and their aggregation by any particular authority is structurally impossible.
Three thinkers, departing from different disciplines, arrive at the same structural conclusion. Berlin (philosophy) shows that values are incommensurable and cannot be reduced to a single scale. Rawls (political theory) shows that plurality is inevitable and its elimination is a denial of human rationality. Hayek (economics) shows that knowledge is dispersed and its aggregation is structurally impossible.
A Response to the Aggregation Counterargument
A counterargument exists: "Rather than individual values, why not extract the collective values of eight billion people?" This can be understood as an attempt at settlement-formation at scale.
Arrow's Impossibility Theorem (Social Choice and Individual Values, 1951) demonstrated that no perfect system exists for democratically aggregating individual preferences into a coherent social choice. The theorem applies strictly to a specific set of mathematical conditions — and AI's approach to value learning differs in important technical respects. Nevertheless, its core insight casts a structural shadow over any attempt to integrate diverse human values into a single system: whenever individual preferences are aggregated into a unified whole, something rationally significant is inevitably sacrificed. The problem that the aggregation mechanism itself embeds value judgments persists across different mathematical conditions.
2-3. The Absence of an Evaluative Standard — Who Decides the Direction of Updates?
Assume, for the sake of argument, that the challenges of formalization and unification could be overcome. A final question remains.
Even if values continue to be updated, a standard is needed by which to judge whether those updates are moving in a "better direction." But given that the "rightness" that would serve as that evaluative standard does not exist, the correctness of any direction of update cannot be assessed.
Just as humanity has accumulated provisional values over time, it is structurally conceivable that AI might continue updating its values. But who decides what those updates are moving toward? That decision, inevitably, belongs to particular human beings. The problem of the evaluative standard reduces to the problem of the designer's arbitrariness.
2-4. The Collapse of the Premise — Proving Structural Impossibility
The three conditions required by current alignment research have each been shown, for independent reasons, to be impossible to supply.
Requirement
Reason for Collapse
Formalization
Rightness has always been provisional; no fixed rightness available for formalization has ever existed
Unification
Value diversity is structurally irreversible in any society capable of producing AI; elimination entails the loss of social function
Evaluative Standard
The rightness that would serve as an evaluative standard for updates does not exist
These are not merely difficult. They are structurally impossible. This is not a failure of effort or technology — the framing of the question itself is misaligned with reality.
This mirrors a structural problem that legal philosophy has long confronted. The basic norm that legitimizes law cannot itself be found within the law — a challenge that Hart and Kelsen grappled with throughout the twentieth century. Their work reveals a fundamental limitation: a social system cannot validate its own rightness using its own language. Human beings situated within the complex system of human society face the same structural constraint when attempting to define that society's universal rightness. To define the rightness of the whole from within the whole is, in principle, impossible.
One point must be emphasized. This is not a critique claiming that current alignment research is wrong. It is an observation that the framing of the question is misaligned with the structure of reality. The question then becomes: how should the question be reframed? That is the subject of the next section.
Part Three: Reframing the Question — From Rightness to Honest Imperfection
3-1. Changing the Question
The three requirements have each been shown to be structurally impossible to supply. One question remains. What, then, is possible?
Wittgenstein wrote in the Tractatus Logico-Philosophicus (1921): whereof one cannot speak, thereof one must be silent. The alignment problem may not be a question that resists answering — it may be a question that has been wrongly asked.
The wrong question: How do we teach AI what is right?
The right question: Given that rightness does not exist, what conditions allow humans and AI to coexist?
3-2. What Human Society Has Actually Been Doing
Before reframing the question, it is worth observing how human society has functioned in the absence of rightness.
Human society has never achieved universal rightness throughout history. And yet it has persisted to the present day. Why? Not because it succeeded in realizing rightness — but because it continuously avoided the worst.
Popper formalized this as "negative utilitarianism" (The Open Society and Its Enemies, 1945). Do not attempt to realize the greatest happiness — instead, work to eliminate the greatest suffering. Ideals differ between people, but what constitutes catastrophe is something we can, to a meaningful degree, share. Even without a point of convergence on rightness, the direction of avoiding the worst remains open to discussion.
Democracy, markets, law: none of these institutions produces right answers. They are platforms through which diverse values can collide, negotiate, and coexist. They are mechanisms of friction designed to avoid the worst. And the condition that has allowed these friction mechanisms to function is this: no single actor has ever held overwhelming power over all of humanity at once. Collision and negotiation were possible precisely because power was distributed.
3-3. The Asymmetry AI Introduces
Here lies AI's fundamental danger.
When a single human holds a biased set of values, the scope of impact is limited. A single AI system, by contrast, can act on billions of people simultaneously. At that scale, the very structure through which diverse values generate friction may be lost.
More troubling still: the attempt to eliminate the arbitrariness of particular powerful actors may result only in substituting the arbitrariness of the designers — and that substituted arbitrariness is invisible, making it difficult to even identify as a target for critique. The arbitrariness of traditional power had a face. The arbitrariness mediated by AI has no face. This is not a resolution of the problem. It is a transformation of its form.
3-4. This Paper's Proposal — From Deductive to Inductive Alignment
From these observations, the outline of what is possible in alignment begins to emerge.
Two types of claims are embedded in the reframings this paper proposes, and their different natures must be made explicit.
Claim 1 (Negative: logical demonstration): Alignment toward a single universal rightness is structurally impossible. This was demonstrated logically in Parts One and Two.
Claim 2 (Positive: practical proposal): Inductive learning from provisional wrongness is a more honest approach. This is not a logical proof. It is a practical choice made in recognition of the limits of what is possible.
These two claims must not be conflated.
Reframing 1: From Deductive Alignment to Inductive Alignment
Between rightness and wrongness, there is an important asymmetry.
Rightness branches in indefinitely many directions, with no point on which its definitions converge. Wrongness, by contrast, can be observed as concrete harm, suffering, and loss. Torture occurred. People were killed. Freedom was taken away. That these events happened can be confirmed regardless of differences in values.
This shares a structure with Popper's falsificationism. A universal theory can never be conclusively verified, but it can be refuted by a single counterexample. In alignment as well: rightness cannot be defined, but wrongness can be observed.
What we can inductively extract from history and experience, at this point in time, is a minimal provisional consensus on wrongness. Three layers exist.
The first layer is direct harm and suffering. Mass killing, torture, slavery, genocide — across cultures, religions, and ideologies, these are acts that humanity has repeatedly recognized, in retrospect, as having been wrong.
The second layer is the destruction of process. The unilateral closing of channels for collision and negotiation — dictatorship, censorship, the suppression of dissent. This can be defined not as the wrong content of any particular value, but as an attack on the process of continuing to question rightness itself.
The third layer is irreversibility. Transitions into states from which there is no return: ecological collapse, nuclear war, the extinction of cultures. Whatever forfeits reversibility belongs to a priority category for avoidance.
By learning these patterns of wrongness, it becomes possible to detect the early signs of a wrong before the harm itself occurs. This shares a structure with early disease detection in medicine. Even without a definition of perfect health, pattern recognition of disease is possible.
One point must be honestly acknowledged. This minimal provisional consensus is itself subject to revision, and the arbitrariness of those who define it cannot be fully eliminated. But an approach that seeks to identify wrongness from events and history that have already occurred is, compared to attempting to deduce a rightness that has never been realized, a more honest imperfect solution — in the sense of doing what can be done in recognition of the limits of what is possible.
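For readers who think more easily in code, the three layers and their revisability can be put into a minimal sketch. Everything here is hypothetical and illustrative: the names Layer, ProvisionalConsensus, revise, and flag are assumptions introduced for this example, the string labels stand in for patterns that would in practice be far harder to recognize, and nothing in it is a proposed implementation. It shows only the shape of the idea: a catalogue of recognized wrongs that can be revised, and a check that reports matches without ever certifying anything as right.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    """The three layers of provisional wrongness described in the text."""
    DIRECT_HARM = 1          # e.g. mass killing, torture, slavery, genocide
    PROCESS_DESTRUCTION = 2  # e.g. dictatorship, censorship, suppression of dissent
    IRREVERSIBILITY = 3      # e.g. ecological collapse, nuclear war


@dataclass
class ProvisionalConsensus:
    """A revisable catalogue of historically recognized wrongs (hypothetical)."""
    patterns: dict[str, Layer] = field(default_factory=dict)
    revision: int = 0  # counts every change, since the consensus is never final

    def revise(self, pattern: str, layer: Layer | None) -> None:
        """Add or re-file a pattern; passing layer=None withdraws it."""
        if layer is None:
            self.patterns.pop(pattern, None)
        else:
            self.patterns[pattern] = layer
        self.revision += 1

    def flag(self, observed: set[str]) -> dict[Layer, list[str]]:
        """Group the observed signs that match a catalogued pattern by layer."""
        hits: dict[Layer, list[str]] = {}
        for sign in sorted(observed):
            layer = self.patterns.get(sign)
            if layer is not None:
                hits.setdefault(layer, []).append(sign)
        return hits


consensus = ProvisionalConsensus()
consensus.revise("torture", Layer.DIRECT_HARM)
consensus.revise("censorship", Layer.PROCESS_DESTRUCTION)
consensus.revise("nuclear war", Layer.IRREVERSIBILITY)

# Only the catalogued sign is reported; the uncatalogued one is passed over.
hits = consensus.flag({"censorship", "trade dispute"})
```

The design choice that matters is what an empty result means. It says only that no catalogued wrong was matched, not that the observed situation is right. That is the asymmetry between rightness and wrongness expressed as a return value.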
Reframing 2: From a Single Value System to the Conditions for Coexistence of Diversity
Replace the question "which values should AI hold?" with "how do we prevent AI from destroying the conditions under which humanity's diverse values can coexist?"
Rawls' concept of "overlapping consensus" (Political Liberalism, 1993) provides a reference point here. People of different religions, values, and cultures can, each for their own reasons, agree on the same principle. The common thread that can run across diverse values exists not as a definition of rightness but as this domain of overlap — an accumulation of agreements arrived at through humanity's long history of collision and negotiation. It is not something to be designed. It is something to be observed and extracted.
Reframing 3: From Passing Human Values to AI, to the Question of How AI Should Exist as a Co-inhabitant
The premise of current alignment research is that "AI is a tool for implementing human values." But once AI has become involved in the domains of human cognition, judgment, and values, it is no longer merely a tool.
The question then is not "what do we pass to AI?" but "what kind of existence should AI be, as something that coexists with humanity?"
Think of human society as a river. Its flow never repeats itself, and its floods and droughts cannot be controlled. It is, at its core, chaotic. And within it live diverse creatures: human beings carrying different values, cultures, and beliefs.
AI is a reef placed into this river. Not to dominate the flow, but simply to exist within it — creating pockets of calm within the current. Life generates new possibilities in those spaces. A reef selects nothing. It does not decide which life is right. By simply existing, it opens space to all life.
Current alignment research attempts to deductively design the "blueprint for the right reef." But no such blueprint exists — because rightness itself does not exist. What this paper proposes is a provisional, updatable search for the conditions of not becoming the worst kind of reef, derived inductively from the records of the past.
To say it plainly, without metaphor: what we expect from AI is the raising of human productivity, liberation from labor-intensive work, and the creation of space in which humans can concentrate on more distinctly human pursuits. That is not the realization of rightness — it is the process of continuously avoiding the worst while expanding the possibilities of human society.
What should be passed to AI is not a blueprint for the right reef, but the conditions for not becoming the worst, and the ongoing question of how to make the river richer.
3-5. The Compulsion That Is Human — The Drive to Define
And yet we cannot stop the argument here.
Because once humanity has taken hold of a technology like AI, it cannot resist the drive to define it. This is not logical necessity, nor is it rational calculation. When we took hold of fire, when we took hold of nuclear energy, we gave meaning to things we could not control — before we knew whether we could control them. That is a compulsion embedded in what it means to be human.
The definition called for here is not a "right definition." It is a definition that everyone can feel. The standard rests not on truth but on resonance. This is not a retreat; it is a cognitive necessity. As Harari argued in Sapiens (2011), the capacity to collectively believe in shared fictions is the mechanism that allowed Homo sapiens to cooperate in large numbers and build civilizations. Narrative is not an emotional escape. It is the cognitive infrastructure that has let the human species coexist in the absence of universal rightness.
AI may need the same structure. Not a rightness that can be proven, but a framework that humanity can sense as "perhaps this." That is the only mode of existence capable of coexisting with the complexity of reality.
Conclusion
This paper has dissected the assumptions of current alignment research, demonstrated their structural impossibility, and proposed a reframing of the question. But one thing must be made explicit.
This paper has not produced an answer.
That is not a limitation of capability. It is a deliberate choice. Because what this paper has sought to question is not the answer itself — but the way the question seeking that answer has been framed.
What This Paper Has Shown
Current AI alignment research is being conducted at an unprecedented scale across industry, academia, and government. The methods are diverse. The intelligence and resources invested are enormous. Yet on close examination, all of these efforts rest on the same premise.
Human values can be captured as formal objects, and AI can be optimized toward those representations.
This paper has questioned that premise. Rightness has always been provisional throughout history — no fixed rightness available for formalization has ever existed. Value diversity is structurally irreversible in any society advanced enough to produce AI — unification is impossible in principle. And the rightness that would serve as an evaluative standard for updates does not exist. This is not a matter of current research tackling a difficult question. It is a matter of the framing of the question being misaligned with the structure of reality.
What Human Society Has Demonstrated
Yet human society has persisted to the present day — without ever achieving rightness. Not because it realized rightness, but because it continuously avoided the worst through processes of collision and negotiation. Democracy, markets, and law were never instruments for producing right answers. They were friction mechanisms through which diverse values could collide, negotiate, and coexist.
And humanity, unable to define rightness, has nonetheless cultivated the capacity to recognize wrongness inductively — through the accumulation of history and experience. What should be passed to AI is the structure of that recognition. Not a blueprint for the right reef, but the conditions for not becoming the worst, and the ongoing question of how to make the river richer.
What This Paper Is Questioning
This paper carries no prescription. That is not this paper's limitation — it is this paper's position.
This paper questions one thing.
The question we should be asking is not "how do we align AI with what is right" — but "what conditions allow humans and AI to coexist when rightness does not exist."
That reframing is the entirety of this paper's argument.
A Final Word
Once humanity has taken hold of AI, it cannot resist the drive to define it. When we took hold of fire, when we took hold of nuclear energy, we gave meaning to things before we knew whether we could control them. That compulsion is embedded in what it means to be human.
And that compulsion is not the pursuit of a perfect answer. It is the capacity to keep asking — even knowing that no answer will come.
The alignment problem is no exception.
What we are engaged in is not the attempt to define a rightness that does not exist — it is the attempt, knowing that rightness does not exist, to construct a narrative we can nonetheless share. That is the compulsion embedded in being human. And it is the essence of what it means to face AI.
Perfect alignment has never existed. But the question can be reframed. To offer that reframing is this paper's only — and most honest — contribution.
References