Why AI Alignment Is a Problem of Correctability, Not Correctness

Ryota Hanazawa

Lessons from the Invention of Law

This essay was constructed through dialogue with Claude Sonnet 4.6. I provided the intellectual direction; Claude offered counterarguments and critiques in response; and I added further insights throughout that process. All final judgments are mine. The English translation was also handled by Claude.

Author: Ryota Hanazawa & Claude (Anthropic)

Introduction: From the Previous Essay

In my previous essay (Can AI Achieve What We Never Could?), I questioned the very premises of the AI alignment problem.

In short, my argument was this: moral correctness cannot be definitively specified. Therefore, the question of "implementing correctness into AI" is structurally unsound from the outset.

The conclusion was that I had no prescription to offer. This was not an evasion but an honest position. Any prescription requires justification that it is "correct." But since that "correctness" cannot be specified, proposing a prescription is itself logically self-contradictory. Or so I argued.

Now, however, I am attempting to go one step further.

If correctness cannot be defined, what should AI alignment aim for?

This essay is my most honest attempt, at this moment, to answer that question.

Part One: Inheriting and Deepening the Argument

1-1. The Question That Remained

In the previous essay, I suggested "wrongness" as a direction for resolution. If correctness cannot be defined, perhaps we should begin by observing what is wrong — that was my argument.

This direction is not mistaken. But it was not sufficient. Critics would immediately ask:

"Then who decides what counts as wrong?"

Answering this question directly is the starting point of this essay.

1-2. Going Deeper

This essay takes the argument one step further.

Not only moral correctness, but wrongness too, cannot be definitively determined.

Is euthanasia wrong? Is capital punishment wrong? Is certain censorship wrong? There are no universal answers to these questions. "Wrongness" is also a value judgment, and if someone is appointed to define it, the same problem of authority returns.

Then, in a situation where neither correctness nor wrongness can be determined, what can we use as a basis for action?

I found the answer to this question in human history.

Part Two: How Humanity Has Functioned

2-1. Humanity Could Not Establish Universal Goodness

Humanity has never, throughout recorded history, succeeded in establishing a universally accepted conception of the good.

What is a good life? The law does not prescribe this. Whether to marry, to have children, or to follow a religion: these are norms held by individuals. They circulate within society, but are not written in any legal code.

On the other hand, there is something humanity has explicitly defined: rules that restrict conflicts when those conflicts would undermine society's continued functioning.

Murder is prohibited not because "murder is an absolute evil." It is prohibited because permitting murder would make it impossible to maintain the conditions for society to function continuously. Fraud is prohibited because it would destroy the social infrastructure of transactional trust. These are rules — explicit, publicly declared, with defined consequences for violation.

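This asymmetry is not accidental. Norms and rules differ in four key ways: where they reside, how precise they must be, how much diversity they tolerate, and how they change.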

Norms exist inside the individual, may be ambiguous, permit diversity, and change naturally over time.

Rules are explicitly external, must be strict, require convergence, and are updated only through institutionalized procedures.

At first glance, it appears humanity has maintained social functioning by defining "only this must not be done" without defining "what is good." As Rawls (1993) analyzed through the concept of "overlapping consensus," there exists a domain where people with different values can agree on the same rules, each for their own different reasons. But this is not because correctness was defined. The following section examines this observation more precisely.

2-2. The Essence of the Invention Called Law

This is not humanity's compromise. It is an invention.

But here, we must avoid an idealized depiction of law.

Law has historically incorporated religious goodness, royal authority, ideology, and morality in vast quantities. Slavery was legal. Women did not have the right to vote. The Inquisition was institutionalized. Social hierarchies were legally protected. These were incorporated into law as the "correctness" of their era.

Yet there is a fact that can be observed.

Legal systems, while incorporating the provisional values and correctness of each era, have been maintained as sustainable institutions to this day. And through those institutions, human society has continued to function.

Slavery was abolished. Suffrage was expanded. Social hierarchies were dismantled. These changes were realized from within the legal system, or through pressure applied to it. What matters is not whether these were "corrections of error" — that itself is a value judgment. What matters is the observable fact that institutions continued to change.

This is also a fundamental challenge to current alignment research. Humanity has never established a "correct set of values," yet has maintained society to this day. Why? Because it preserved the capacity of institutions to change.

The essence of law is not to present correct answers. It is to keep the provisional answers of any given moment in a state that can be changed afterward.

And there is a prerequisite for this capacity to change to function:

That no one held overwhelmingly excessive power.

Conflict and negotiation were possible because power was distributed. When one-sided concentration of power occurred — dictatorship, imperial domination, censorship — this circuit of change ceased to function.

2-3. The Validity of Law Is Independent of Correctness

H.L.A. Hart divided law into primary rules and secondary rules. Primary rules are constraints on conduct, such as "do not kill" and "do not steal." Secondary rules are rules about rules: how rules are recognized as valid, how they are changed, and who adjudicates them.

The core of this structure is that neither defines "what is correct."

Primary rules are derived inductively from observation of impacts on society's continued functioning. Secondary rules institutionalize the correction process itself. Correctness is not presupposed, nor is it assumed to be something anyone can discover.
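The two-layer structure described above can be sketched in code. This is a minimal illustration, not a formalization of Hart: the class name `RuleSystem`, the action labels, and the simple-majority threshold are all assumptions introduced here. The point it tries to show is that the amendment step checks only whether the procedure was satisfied, never whether the content of the change is "correct."

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RuleSystem:
    """Toy model: primary rules constrain conduct, a secondary rule governs change."""

    # Primary rules: the set of prohibited actions (hypothetical labels).
    prohibited: set[str] = field(default_factory=set)
    # Secondary rule: share of votes that must be exceeded to change a
    # primary rule. The value 0.5 is an assumption for illustration.
    amendment_threshold: float = 0.5

    def permits(self, action: str) -> bool:
        """Primary layer: an action is allowed unless explicitly prohibited."""
        return action not in self.prohibited

    def amend(self, action: str, prohibit: bool,
              votes_for: int, votes_total: int) -> bool:
        """Secondary layer: checks procedure only, not the merit of the change."""
        if votes_total <= 0 or votes_for / votes_total <= self.amendment_threshold:
            return False  # procedure not satisfied; the rule set is unchanged
        if prohibit:
            self.prohibited.add(action)
        else:
            self.prohibited.discard(action)
        return True
```

In this sketch a prohibition can be repealed or introduced by the same procedure, which mirrors the observation that the rules of one era remain open to change in the next.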

This is the core of legal positivism. The validity of law is independent of moral correctness (Hart, 1961). Law derives its binding force not from the correctness of its content but from the legitimacy of the process by which it was enacted.

This essay interprets this structure of legal positivism expansively, reformulating the essence of law as "the institutionalization of a correctable process." This is not Hart's own conclusion but this essay's original application.

This structure has direct implications for AI alignment: "the moral correctness of AI's judgments" can be separated from "the legitimacy of the AI system," with the latter designed as a product of a correctable process.

As a related note, from a technical perspective, Cheng (2026) argues that negative constraint signals are structurally superior to positive preference signals in LLM training. This essay attempts to provide the theoretical and institutional foundation for why that direction matters.

This legal principle poses important questions when applied to AI alignment — questions that form the subject of the next part.

Part Three: Why Is AI Special?

3-1. It Is Not Simply a Problem of Speed

AI risks are often framed as a problem of "speed." The argument is that change is happening too fast for society to respond.

But this formulation is insufficient.

Nuclear weapons, the internet, and financial crises all rapidly accelerated social change. Yet humanity preserved the correction process. The IAEA monitored nuclear programs, courts adjudicated rights on the internet, and legislatures reformed financial regulation. The problem of speed was absorbed by the correction process.

Where AI differs is not speed alone.

3-2. It Can Become an Actor in the Correction Process Itself

This is the decisive difference.

Consider the legal system. Those who observe information, those who raise objections, those who judge, those who make corrections — all are human. The correction process was designed by humans and executed by humans.

But in the age of AI, AI can:

Select and filter information
Extract candidates for objection
Assist in judgment
Generate proposed legislative amendments

The infrastructure that supports correctability is itself something AI can participate in. If that process becomes opaque and concentrated, humanity may fall into a state of "believing it is correcting while in fact losing the ability to correct."

AI is not the first technology to participate in institutions. But it is the first technology that can simultaneously participate in all phases of institutional observation, judgment, and correction.

Bureaucracy handled the execution of institutions. The printing press handled the diffusion of information. The internet broadened the circuits for raising objections. But each participated in only part of the correction process. AI is different.

Here lies an important distinction that tends to be overlooked.

Concentrations of power throughout human history, whether dictatorships, empires, or hegemonies, have always existed within the limits of human cognitive capacity and biological constraints.

Not even Stalin or Mao could monitor an entire nation's conversations in real time. Kings slept, aged, and died. Their information-processing capacity was limited, they were bound by geography, and they could not be in multiple places at once. These constraints made it difficult to strip the governed completely of their capacity to observe, raise objections, and organize.

As an observable fact, many of the monarchies and empires of the past no longer exist. Dictatorial regimes, with some exceptions, have been dismantled before they could scale indefinitely. At least historically, this kind of dispersion of cognitive capacity may have been one of the conditions that prevented power concentration from scaling over the long term.

Here a counterargument is anticipated: haven't organizations like the Chinese Communist Party or the Catholic Church persisted over long periods as highly centralized organizations?

But this counterargument does not refute this essay's argument. It actually reinforces it. Both the CCP and the Catholic Church have changed their doctrines and policies throughout history. The rejection of the Cultural Revolution, reform and opening-up, the Second Vatican Council — these are records of change within centralized institutions. It can be observed that they have persisted precisely because, even within a centralized structure, they preserved some capacity to change. Conversely, institutions that completely lost the capacity to change have historically faced difficulty in long-term survival.

The possibility AI introduces is different in that it could transform this very premise.

AI is not power itself. But AI can, in theory, operate 24 hours a day, be replicated, process in parallel, work with few geographic constraints, and collect and analyze information simultaneously. These are conditions that no past form of power concentration has possessed.

What is important is not that AI will necessarily become this way. It is that AI possesses the conditions for this possibility. This distinction is consistent with the overall stance of this essay — this essay avoids assertions and discusses the structure of possibilities.

But if AI penetrates the entire institutional system under these conditions, the cognitive dispersion that humanity has historically relied upon — the very foundation of the correction process — may be fundamentally altered.

3-3. The Distinctive Risk of the Loss of Correctability

Looking back at history, many of the institutions and judgments humanity adopted were changed in later eras. Slavery was abolished, colonial domination was dismantled, and genocide was prohibited under international law. To bring these changes about, people fought and debated, revolutions occurred, and institutions were rebuilt.

What matters is not the value judgment of whether these were "errors" — that itself returns to the problem of correctness. What matters is the fact that humanity continued to preserve circuits of change.

One of the conditions that supported that circuit was cognitive dispersion. But if AI penetrates the entire institutional system under conditions where speed, scale, opacity, and concentration are combined, this circuit itself may be transformed.

When correctability itself is lost, this circuit closes.

What AI poses is the possibility that correctability could be lost at a more fundamental level. If AI participates in all phases of the correction process, from the generation of information, to the assistance of judgment, to the design of institutions, and that participation is opaque and concentrated, humans may fall under the illusion that they control the correction process when in fact they do not.

Part Four: Proposal — The Institutionalization of Correctability

4-1. Premises of the Proposal

There is something that must be made clear here.

Technical research on AI corrigibility already exists as prior work. Soares, Fallenstein, Yudkowsky & Armstrong (2015) analyzed mathematically what design conditions are necessary for AI systems not to resist human corrective intervention. This essay does not deny that technical argument.

What this essay addresses is a different layer. Corrigibility research primarily asks how an AI system can remain open to correction by humans. This essay asks a different question: how does a civilization preserve the capacity to correct the systems that increasingly participate in its own correction processes? The two are not in competition but complementary — the former handles implementation at the system level, the latter handles the necessity and institutional basis at the civilizational level.

This essay does not claim that "correctability is absolutely correct." Such a claim would be self-contradictory, because the starting point of this essay is "correctness cannot be determined."

What this essay proposes is as follows:

In a situation where neither correctness nor wrongness can be definitively determined, this essay proposes adopting correctability as a provisional principle for avoiding irreversible fixation.

This has the same epistemological structure as Popper's falsificationism, whose logic The Open Society and Its Enemies (Popper, 1945) extends to political institutions. Popper argued that the truth of a theory cannot be proven. But a theory that has a mechanism for discovering error is more epistemologically honest than one that does not. This essay calls correctability a "provisional principle" for the same reason. We do not adopt correctability because it is correct. We adopt it because a system that preserves correctability is more open to the discovery of error than one that does not.

Here something must be honestly acknowledged. Adopting correctability is itself a value judgment. There is no such thing as a "value-neutral principle."

But having acknowledged that, this can be said:

Correctability is the provisional principle most robust under conditions of value uncertainty.

Why? Because when correctability is lost, the means to defend other values are simultaneously lost. For liberals, religious conservatives, and utilitarians alike, the loss of correctability means losing the ability to defend one's own position. In that sense, correctability does not take precedence over specific values — it protects the conditions under which all values can function.

This is the honest form of the "provisional principle" this essay presents. We do not adopt it because it is correct. We adopt it as the most coherent choice available — in a situation of uncertain correctness — for avoiding irreversible fixation.

4-2. Design Principles of the Correction Process

What, then, would the institutionalization of correctability in AI alignment look like?

Three design principles can be extracted from the invention called law.

First Principle: The Trigger Is a Condition, Not a Person

What initiates the correction process must not be a specific authority figure or decision-maker. A state in which value conflicts have occurred and existing institutions can no longer process them — a state where objections accumulate and persistently recur — activates the correction process.

In the legal system, lawsuits begin with individual petitions. Proceedings, however, ask not "what is correct?" but "what can be established in light of existing rules?" Individual petitions are merely triggers; the criterion for review is the maintenance of society's self-corrective capacity.
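A condition-based trigger of the kind described above might look like the following sketch. Everything named here is hypothetical: the class `ObjectionLog`, the idea of counting objections per "period," and both threshold values are assumptions chosen only to make the principle concrete. The log deliberately stores no information about who objected, so review can only be activated by the state of accumulated, recurring objections.

```python
from collections import defaultdict


class ObjectionLog:
    """Records objections per issue; review is triggered by a condition, not a person."""

    def __init__(self, min_objections: int = 3, min_periods: int = 2):
        # Both thresholds are illustrative assumptions, not values from the essay.
        self.min_objections = min_objections
        self.min_periods = min_periods
        self._periods_by_issue = defaultdict(list)  # issue -> list of period numbers

    def record(self, issue: str, period: int) -> None:
        """Log one objection. The identity of the objector is deliberately not stored."""
        self._periods_by_issue[issue].append(period)

    def review_triggered(self, issue: str) -> bool:
        """True when objections have accumulated and recurred across distinct periods."""
        periods = self._periods_by_issue.get(issue, [])
        accumulated = len(periods) >= self.min_objections
        recurring = len(set(periods)) >= self.min_periods
        return accumulated and recurring
```

A burst of objections in a single period does not trigger review in this sketch; the same number spread across periods does. Whether persistence should be measured this way is a design choice left open here.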

Second Principle: The Standard of Review Is the Maintenance of Society's Self-Corrective Capacity

It is not individual values or correctness that are reviewed. What is asked is whether society can maintain its self-corrective capacity.

Is the correction process itself functioning? Are channels for raising objections open? Is irreversible fixation progressing? These are the subjects of review — asked not as questions of individual welfare, but as questions of system function.

Third Principle: AI Is a Witness, Not a Decision-Maker

This essay does not deny AI decision-making itself. What becomes problematic is not the type of decision-maker, but whether the circuit for objection and correction of that decision is maintained. Whether the decision-maker is human or AI, the fixation of decision-making authority that blocks correctability is problematic.

AI should not be excluded from the correction process. But within that process it should not become the decision-maker either.

AI can collect historical cases, present similar patterns, analyze the structure of conflicts, and record the outcomes of past correction processes. This is testimony. Just as an expert witness provides testimony but the verdict is delivered by the judge, AI's testimony becomes input to the correction process, but decisions are borne by the process.

AI cannot vote. But AI's opinions can be referenced.
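The separation between testimony and decision can be illustrated with a small sketch. The types `Testimony` and `Vote`, the `voter_kind` labels, and the simple-majority rule are assumptions made for this example only. AI testimony is kept on the record as reference material, while any vote attributed to an AI is simply not counted.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Testimony:
    source: str   # hypothetical label, e.g. "ai" or "human-expert"
    summary: str


@dataclass(frozen=True)
class Vote:
    voter_kind: str  # hypothetical label: "human" or "ai"
    approve: bool


def decide(testimony: list[Testimony], votes: list[Vote]) -> dict:
    """Tally only human votes; keep all testimony on the record for reference."""
    counted = [v for v in votes if v.voter_kind == "human"]
    approvals = sum(v.approve for v in counted)
    return {
        # Simple majority of counted votes: an assumed decision rule.
        "approved": approvals * 2 > len(counted),
        "votes_counted": len(counted),
        "testimony_on_record": [t.summary for t in testimony],
    }
```

The sketch makes one design choice visible: excluding AI from the tally does not require excluding it from the record.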

4-3. On Domains That Cannot Be Modified

There may be a concern: "If correctability is the principle, can anything be changed?"

The constitutions of human societies provide a model. Constitutions are accorded more weight than ordinary law. But amendment processes exist. Constitutions are not uncorrectable; the cost of correcting them is merely extremely high. Establishing uncorrectable domains would itself negate this essay's logic of not fixing what cannot be definitively determined.
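The tiered cost of amendment described above can be sketched as follows. The tier names, the threshold values, and the `Amendment` type are assumptions for illustration. The sketch also adds one further assumption: a proposal that would abolish the amendment path itself is rejected outright. That is only one possible reading of the idea that correctability must be preserved; assigning such proposals the highest threshold instead would be an equally plausible design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Amendment:
    target: str                   # "ordinary" or "constitutional" (hypothetical tiers)
    removes_amendment_path: bool  # would this change abolish the ability to amend?
    support: float                # share of the deciding body in favour, 0.0 to 1.0


# Assumed thresholds: constitutional change is costly but never impossible.
THRESHOLDS = {"ordinary": 0.5, "constitutional": 0.75}


def admissible(a: Amendment) -> bool:
    """Accept an amendment if it clears its tier's threshold and keeps correction possible."""
    if a.removes_amendment_path:
        return False  # assumed reading: the amendment path itself is not removable
    return a.support > THRESHOLDS[a.target]
```

Under these assumptions every rule stays amendable at some cost, and the only thing the procedure cannot do is remove itself.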

"Constitutional constraints" in AI alignment can be designed similarly. If there is one high-threshold constraint worth introducing, it is this single point: "correctability itself must not be eliminated." This is not self-contradictory but a self-referential design.

Conclusion

Human society has not been maintained by discovering correctness. It has been maintained by preserving the capacity of institutions to change.

Therefore, the alignment problem is not how to implement correct values, but how to maintain correctability at civilizational scale.

Much of current alignment research still aims for "AI with correct values." This essay does not deny this. But it adds a question.

When that AI makes a mistake, can it be corrected?

When that AI comes to handle the correction process itself, does humanity retain the capacity for correction?

When that AI becomes convinced it is correct, who can raise objections?

The purpose of this essay is not to provide the correct answer to alignment. Such an answer may not exist. The purpose is to suggest that alignment should be approached not as the discovery of correctness, but as the construction of systems capable of correction.

To put it in a single formulation: Correctability Alignment is not the pursuit of AI with correct values, but the pursuit of AI that enables objection to, and correction of, its own judgments.

This essay itself is also subject to criticism and correction. In that sense, this essay too is placed within the domain of correctability.

Note on the Form of This Essay

This essay takes the form of co-authorship with AI. That form itself is one practice of this essay's argument.

This essay argued that "AI is a witness, not a decision-maker." In this essay, whose argument was constructed through dialogue with AI, AI sharpened the questions, presented counterarguments, and identified weaknesses in the logic. But the conclusions were judged by Ryota.

This disclosure is not a disclaimer. It is a statement of this essay's position.

References

Hart, H.L.A. (1961) The Concept of Law. Oxford University Press.
Popper, K. (1945) The Open Society and Its Enemies. Routledge.
Rawls, J. (1993) Political Liberalism. Columbia University Press.
Berlin, I. (1969) Four Essays on Liberty. Oxford University Press.
Soares, N., Fallenstein, B., Yudkowsky, E., & Armstrong, S. (2015) Corrigibility. AAAI Workshop on AI and Ethics.
Cheng, Q. (2026) Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Preferences. arXiv:2603.16417.
Hanazawa, R. & Claude (2026) Can AI Achieve What We Never Could? Medium. https://medium.com/@ryota_hanazawa/can-ai-achieve-what-we-never-could-4ba9f99a1d40