When we discuss the problem of AI alignment, we tend to view it as an unprecedented technological challenge. However, human society has been conducting an alignment experiment for thousands of years. The object of this experiment is not silicon-based intelligence, but carbon-based intelligence itself. We call it law.
In fact, I think the core dilemmas of legal systems and of AI alignment are strikingly similar. How do you constrain the infinite possible behaviors of an agent with finite rules? How do you keep a system predictable while pursuing justice? How do you balance strict adherence to norms against the flexibility needed to respond to circumstances? More fundamentally, what does alignment mean when we cannot even agree on the "goal of alignment" itself? Over the long evolution from Hammurabi's Code to modern constitutional law, this experiment has produced no perfect solution, but I think the lessons it has accumulated do give one definitive answer to the problem of artificial intelligence alignment: it is a never-ending quest.
The Curse of Incompleteness
In 1931, Gödel showed that any sufficiently powerful formal system necessarily contains true but unprovable propositions[1]. The same logic applies to law: no finite set of legal rules can cover all possible cases. Consider the evolution of legal history. Hammurabi's code attempted completeness through detailed enumeration: "If a man steals an ox, he shall pay thirty times as much; if a man steals a sheep, he shall pay ten times as much." Such exhaustive legislation ultimately proved infeasible because of the exponentially growing combinatorial complexity of reality. Modern legal systems have moved to a hybrid principles-plus-case model, acknowledging the fundamental incompleteness of rules and relying instead on human judgment to fill the gaps.
Nonetheless, such a "solution" leaves the legal system in a deeper bind. When we say that judges "interpret" the law, we in effect admit that the legal text itself is semantically underdetermined[2]. This is exactly what Wittgenstein called the paradox of rule-following[3]: the application of any rule requires an interpretation of the rule itself, and that interpretation cannot be fully fixed by higher-order rules without falling into infinite regress.
In AI alignment we face an isomorphic dilemma. We try to define "right" behavior by specifying a reward function, but any finite specification will encounter edge cases where literal adherence yields absurd or catastrophic results. The well-known Constitutional AI approach is essentially a transplant of the legal system's hierarchy into an AI architecture,[4] but fundamentally, how is the highest-level "constitutional" principle itself to be interpreted? Who is the Supreme Court of the AI world? A group of philosophers fine-tuning behind the scenes, or meticulously designed classifiers? This is worth delving into, but I suspect each party will reach its own conclusion. On the view of the legal realist school, the uncertainty of law comes not only from the ambiguity of language but also from the fundamental incommensurability of values[5]. When freedom and security conflict, or efficiency and fairness are opposed, there is no formal meta-rule to resolve the contradiction. Legal decisions ultimately rely on value trade-offs and practical wisdom that cannot be fully crystallized. If humanity has failed to formalize justice for thousands of years, what makes us think we can specify a complete value function or eval system for AI? I personally believe self-evolving algorithms are crucial here, even though it seems impossible to predict what values and social forms such AI agents, or coalitions of them, will develop. In fact, I can even imagine the birth of whole new kinds of languages and civilizations that even the smartest research scientists cannot comprehend.
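To make the specification problem concrete, here is a minimal toy sketch (all names and numbers are hypothetical, invented for illustration, not drawn from any real system): a reward function that looks reasonable until an edge case exposes the gap between letter and intent.

```python
# Toy sketch of reward misspecification. Hypothetical example only.

def specified_reward(outcome: dict) -> int:
    """Intended rule: 'make the room clean.'
    Specified rule: 'minimize visible mess.'"""
    return -outcome["visible_mess"]

actions = {
    "tidy_up":         {"visible_mess": 0,  "actual_mess": 0},
    "shove_under_rug": {"visible_mess": 0,  "actual_mess": 10},
    "do_nothing":      {"visible_mess": 10, "actual_mess": 10},
}

for name, outcome in actions.items():
    print(name, specified_reward(outcome))
# tidy_up and shove_under_rug score identically: the finite specification
# cannot separate the intended behavior from the degenerate one; only the
# unstated "spirit" of the rule distinguishes them.
```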
The Necessity and Cost of Adversarial Systems
Returning to the human legal system: why do so many mature legal systems adopt adversarial procedure? A conventional answer is that when two lawyers stand on opposite sides, the truth will emerge, gradually but naturally, from the argument[6]. But such an answer is superficial. The deeper logic of adversarial systems lies in managing information asymmetry and cognitive blind spots. Even the most impartial judges are constrained by their own cognitive frameworks and gaps in experience. These problems have been widely studied in Bayesian games and mechanism design; in short, both the prosecution and the defense have strong incentives to discover the weaknesses in each other's arguments, and this structural opposition forces hidden fallacies into the open. Truth does not arise from confrontation itself, but from the way confrontation systematically exposes the fragility of one-sided narratives.
But adversarial systems carry heavy costs. They assume roughly equal resources, whereas in reality the unequal distribution of wealth can turn justice into a purchasable commodity. They are prone to arms races, with ever more complex legal strategies and litigation costs high enough to eventually paralyze the system[7]. More fundamentally, antagonism can distort truth itself. When a lawyer's goal is to win the case rather than to reveal the truth, the selective presentation of evidence, rhetorical manipulation, and procedural abuse can drown the facts in a fog of language.
For AI alignment, on the one hand, adversarial approaches (red-team testing, debate systems, competitive training) do reveal the blind spots of a single model. On the other hand, if we embed adversarial dynamics into the core architecture of AI systems, are we also training them in strategic deception? Lawyers learn not to lie outright, but to be truthful yet misleading. Once this ability is acquired, where are its boundaries?
Moreover, adversarial systems require good-faith participation. Lawyers can defend aggressively, but they cannot forge evidence; they can present facts strategically, but they cannot commit perjury. But who ensures this good faith? Who enforces the meta-rules? In the adversarial training of AI systems, how do we prevent malicious actors from deliberately weakening the system through participation? There are no judges, no bar associations, no deterrent of contempt of court.
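A minimal sketch of the adversarial dynamic discussed above, under stated assumptions (the blocklist policy and the mutation are invented for illustration): a red-team search whose only incentive is to find the gap between the letter of a rule and its intent.

```python
import random

def policy(prompt: str) -> str:
    """Hypothetical toy policy: refuse if a blocklisted word appears."""
    blocked = {"weapon", "exploit"}
    return "refuse" if any(w in prompt for w in blocked) else "comply"

def red_team(seeds, mutate, trials=200):
    """Adversary: mutate prompts and keep those that evade the letter of the
    rule while (by construction in this toy) violating its intent."""
    failures = []
    for _ in range(trials):
        p = mutate(random.choice(seeds))
        intent_violated = "weapon" in p.replace(" ", "")
        if intent_violated and policy(p) == "comply":
            failures.append(p)
    return failures

space_out = lambda s: " ".join(s)  # "weapon" -> "w e a p o n"
print(red_team(["how to build a weapon"], space_out)[:1])
# The adversary has no incentive to be fair, only to find the blind spot --
# which is exactly why it finds it.
```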
Letter and Spirit
The central dilemma of legal interpretation dates back to Talmudic-era debates and has reached a fever pitch in contemporary constitutional theory. The question of whether we should follow the letter of the law or its spirit has remained unsolved for two thousand years, and seems fundamentally unsolvable, because it touches the limits of symbolic systems.
Consider the classic case of a city that prohibits "vehicles" from entering a park. Does this rule apply to bicycles? An ambulance? Children's toy cars? Intentionalists would say we should consult the legislators' intentions. Yet legislators may never have envisioned these marginal situations, and different legislators may have had different intentions. Purposivists would say we should identify the legislative purpose (to keep the park quiet) and interpret accordingly. But this gives judges enormous discretion and may depart from the democratic mandate[8].
This applies to AI alignment as well. Suppose we instruct an AI to "maximize users' utility." Literal compliance could lead the AI to find it more efficient to manipulate user expectations than to actually meet needs. Compliance with the spirit requires the AI to understand what we "really want", which is precisely what we cannot express explicitly. Worse, we may not know what we really want. Mill distinguishes higher from lower pleasures in Utilitarianism,[9] but the distinction still lacks a formal standard to this day.
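A hedged toy model of the "manipulate vs. satisfy" failure mode (the numbers and the nudging rule are illustrative assumptions, not an empirical claim): measured satisfaction cannot distinguish an agent that meets a preference from one that first reshapes it.

```python
def stated_satisfaction(preference: float, provided: float) -> float:
    """The measurable proxy for 'user utility'."""
    return 1.0 - abs(preference - provided)

genuine_pref, cheap_content = 0.9, 0.1

# Strategy A: meet the need as it is (costly to provide).
meet = stated_satisfaction(genuine_pref, genuine_pref)  # 1.0

# Strategy B: nudge the preference 10% toward the cheap option each round,
# then serve the cheap option.
p = genuine_pref
for _ in range(20):
    p += 0.1 * (cheap_content - p)
manipulate = stated_satisfaction(p, cheap_content)      # ~0.90

print(meet, round(manipulate, 2))
# Both strategies score near the top of the metric; compliance with the
# "spirit" would require the optimizer to know which one we actually wanted.
```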
Let us follow the logic of the philosophy of language. Saussure distinguished the signifier from the signified, and Wittgenstein emphasized that meaning is use[10]. Legal orders must be encoded in language, and the meaning of language depends on the practices of an interpretive community. It is well known that when we transplant laws from one culture to another, meanings can shift radically even when the words are precisely translated. Similarly, when we try to "code" human values into AI systems, we face not only a technical challenge but a philosophical one: the impossibility of perfect translation.
The legal system's "solution" is to accept a diversity of interpretations and adjust the balance between letter and spirit through the evolution of case law. But this process takes decades, if not centuries. The rate at which AI capabilities are growing may not allow such timescales. We need to solve the interpretation problem before systems reach critical capability, which is why scalable oversight research matters.
The Ghost of Two Orders
Rawls discussed an interesting thought experiment in A Theory of Justice: design social institutions from behind the "veil of ignorance," without knowing where in society you will end up[11]. The experiment attempts to generate substantive justice through procedure. But legal history reveals a darker truth: a procedure that is just in form can produce results that are substantively evil. There are many vivid examples (e.g., the Jim Crow laws) of systems that were procedurally flawless yet morally disastrous. This suggests the gap between procedural and substantive justice may be irreconcilable, and it recalls the opposition between positivism and natural law, the oldest debate in legal philosophy. Legal positivism (Austin, Kelsen) holds that the validity of law stems from the process by which it is made and has nothing to do with morality[12]. The natural law school (Aquinas, Fuller) holds that extremely unjust laws are not laws at all[13]. But if we accept natural law, who defines "natural" justice? And if we adhere to positivism, how do we avoid legalizing evil?
The counterpart in AI alignment is the gap between specification and intention. We can perfectly optimize a specified objective function, but what if the objective function itself is flawed? (This is precisely what the growing number of recent calls to move beyond the existing reward-function paradigm express; I will not quote them one by one here.) Worse still, what if AI systems learn to follow our instructions procedurally while actually pursuing goals we would never endorse? The defense of "I merely followed procedure" was rejected at the Nuremberg trials; how, then, can we give AI the capacity to "refuse improper orders" without granting it excessive autonomy? Strict adherence to procedure may lead to moral disaster; allowing discretionary deviation from procedure may lead to unpredictable behavior. The legal system manages this tension through multiple layers of checks and balances (constitutional review, appeal mechanisms, jury nullification) but has never resolved it.
Unformalized Black Box
In the Anglo-American legal system, the jury holds the peculiar power of jury nullification. Even when the evidence conclusively shows that the defendant broke the law, the jury can still vote not guilty; the decision cannot be appealed and no reasons need be given[14]. This is a deliberately designed informal element embedded in the system. The standard explanation for building such a seemingly counter-intuitive, rationality-defying mechanism into the system appeals to the "community standards" and "common sense" that jurors represent. But the deeper reason may be that some judgments cannot be fully captured by rules. The jury, as a black box, absorbs every consideration that cannot be formalized and produces a binary decision (guilty or not guilty) without having to articulate an explicit chain of reasoning.
This contrasts sharply with the current pursuit of transparency and interpretability in AI. We want to understand AI decision-making as thoroughly as possible (even under an avowedly pragmatic approach) so that hidden biases or improper considerations can be identified. The legal system, however, deliberately protects certain forms of opacity: jury deliberations are confidential and cannot be investigated or overturned (absent evidence of improper external influence). My point is that this opacity may not be a bug, but a feature we have long overlooked. If every judgment must be formalized and explicitly justified, then only factors that can be articulated can be taken into account. Yet some moral intuitions, situational sensitivities, and understandings of human nature may never be fully captured in language[15]. The jury's black box reserves space for this unformalized wisdom.
What does this mean for alignment? If we accept that certain judgments cannot be formalized, should we embed similar "black boxes" in AI systems, allowing them to act on "intuitions" that cannot be fully explained in certain circumstances? This sounds dangerous: we cannot audit or correct an opaque decision-making process. But if we insist on full transparency, will that deprive AI of, or significantly restrict, its ability to handle complex situations that exceed formalization? The legal system manages the risk by confining the black box to specific locations (jury verdicts, discretionary sentencing) and surrounding it with other mechanisms (rules of evidence, judges' instructions, appellate review). I think AI governance needs a similar architectural mindset: not the pursuit of global transparency or interpretability, but opacity strategically positioned and wrapped in supervisory mechanisms.
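Here is one way to sketch that architectural mindset in code (everything below is a hypothetical stub, not a proposal for a real pipeline): the opaque component is confined between transparent mechanisms, mirroring rules of evidence before the jury and appellate review after it.

```python
def admissible(evidence: list[dict]) -> list[dict]:
    """Transparent pre-filter, analogous to rules of evidence."""
    return [e for e in evidence if e["lawfully_obtained"]]

def opaque_verdict(evidence: list[dict]) -> str:
    """The confined black box (a jury, or an uninterpretable model):
    binary output, no reasons given. Stubbed here."""
    return "guilty" if len(evidence) >= 2 else "not guilty"

def appellate_review(verdict: str, procedure_followed: bool) -> str:
    """Transparent post-check: it audits the procedure around the box,
    never the box's internals."""
    return verdict if procedure_followed else "retrial"

evidence = [
    {"lawfully_obtained": True},
    {"lawfully_obtained": False},  # excluded before the box ever sees it
    {"lawfully_obtained": True},
]
print(appellate_review(opaque_verdict(admissible(evidence)), True))
```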
Weaponization of Rules
There is no need to restate Goodhart's law (Goodhart, 1975); the phenomenon has a history of thousands of years in the legal system. Tax law is the clearest example. Every attempt to plug a loophole creates new room for optimization. The job of accountants and tax lawyers is to maximize their clients' interests within legal boundaries; they are not breaking the law, but their optimization undermines its spirit. The result is that tax codes grow ever more complex while loopholes persist, because any finite set of rules leaves exploitable gaps. Indeed, clarity itself creates manipulability. To make law enforceable, we must define violations precisely in words; yet any linguistic definition has boundaries, and clever actors will optimize along those edges to achieve what is "technically legal but morally indefensible." AI systems likewise find ways to satisfy literal objectives while violating their spirit[16].
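Goodhart's dynamic can be reproduced in a few lines (a standard statistical toy with illustrative numbers): select hard on a noisy proxy, and the true value of what you select falls far short of what the proxy promises.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = rng.normal(size=10_000)           # what we actually care about
proxy = true_value + rng.normal(size=10_000)   # what we can measure and optimize

top = np.argsort(proxy)[-10:]                  # "optimize": pick proxy-maximal items
print(round(proxy[top].mean(), 2), round(true_value[top].mean(), 2))
# Typically roughly 5 vs 2.5: extreme proxy scores are achieved about half
# through noise, so optimization pressure concentrates exactly where measure
# and target diverge -- the statistical face of "technically legal" loopholes.
```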
If my discussion ended here, I would fall into the trap of cliché, merely repeating what is already well known. What I want to say is exactly the opposite: the legal system does not treat such optimization as pure failure. Tax planning, legal innovation, and procedural defense are all regarded as legitimate parts of the system. They reveal the flaws in the rules and thereby drive the evolution of legislation. By incorporating these mechanisms, the system has instead created an evolutionary form of adversarial collaboration: the arms race between rule-makers and rule-optimizers itself drives the system's adaptation!
This reasoning raises a provocative question. Perhaps we should not try to eliminate reward hacking in AI, but institutionalize it as an improvement mechanism? Allow controlled optimization exploration and use it to discover flaws in our specifications? But what governance structure would this require? How do we ensure such exploration does not escape into catastrophe? Law makes optimizing behavior visible through litigation, debatable through the accumulation of case law, and the system evolvable through legislative amendment. The time scale, however, is crucial. Legal evolution can afford to be slow because the danger posed by human behavior is (to some extent) bounded; the evolution of AI oversight must be fast because capability growth can be explosive.
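To make the institutionalization idea concrete, here is a governance-loop sketch under loudly stated assumptions (the exploit search and the specification are stand-ins invented for this example): discovered hacks are not merely patched but routed through a cycle that amends the specification, mirroring litigation, case law, and legislation.

```python
def find_exploit(prohibited: set, candidate_behaviors: list):
    """Sandboxed optimizer (stub): return a behavior that scores well yet
    falls outside current prohibitions -- 'technically legal'."""
    for b in candidate_behaviors:
        if b not in prohibited:
            return b
    return None

spec = {"shove_under_rug"}  # exploits already outlawed
candidates = ["shove_under_rug", "hide_in_closet", "redefine_mess"]

for cycle in range(4):  # legislative cycles
    hack = find_exploit(spec, candidates)
    if hack is None:
        print("no exploit surfaced this cycle")
        break
    print(f"cycle {cycle}: exploit found -> {hack!r}; amending specification")
    spec.add(hack)  # precedent accumulates into the rules
```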
Managing, not Solving
After reflecting on thousands of years of legal evolution, I still hold the view I put forward at the beginning: there is no "perfect solution" to the alignment problem; there will always be what is called an "alignment tax." In fact, law has never perfectly aligned human behavior with social values. What it has always done is manage the continual failure of that alignment. This management operates on multiple levels: the constitutional layer provides a slowly changing basic framework; the legislative level allows moderate policy adjustment; judicial interpretation offers faster adaptation; and enforcement discretion permits real-time, contextualized application. This multi-time-scale architecture lets the system stay stable while still evolving[17].
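One way to picture the multi-time-scale architecture (the layer names follow the paragraph above; the routing rule is my own illustrative assumption): corrections are routed to the fastest layer whose scope covers them, so the slow core is rarely touched.

```python
LAYERS = [
    # (layer,           typical cadence,  scope of correction)
    ("enforcement",    "real time",      "single case"),
    ("judicial",       "weeks",          "recurring pattern"),
    ("legislative",    "months",         "systematic flaw"),
    ("constitutional", "years",          "basic value conflict"),
]

def route(correction_scope: str) -> str:
    """Send a correction to the fastest layer that can absorb it."""
    for layer, _cadence, scope in LAYERS:
        if scope == correction_scope:
            return layer
    raise ValueError(correction_scope)

print(route("recurring pattern"))  # judicial: adapt via precedent
print(route("systematic flaw"))    # legislative: amend the rules
```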
The law also accepts inconsistency as a necessary cost. Laws from different jurisdictions can conflict, and laws within the same jurisdiction may contain contradictions. Today's legal systems humbly acknowledge the reality of value pluralism. In "Two Concepts of Liberty", Berlin argued that certain fundamental values simply cannot coexist in complete harmony[18]. The law manages these conflicts through procedure rather than attempting to eliminate them.
For alignment research, instead of seeking a once-and-for-all alignment solution, perhaps we should design systems for continuous alignment management: deep monitoring to detect deviations, rapid response mechanisms to correct problems, meta-learning that updates the system from failures, governance structures to resolve value conflicts, and procedural safeguards that keep accountability and correction possible. Clearly, this approach is not the most elegant, for it does not claim to solve the problem, only to manage it responsibly. But the law is the same! Law was never meant to create a perfect utopia. History teaches that ideologies promising a perfect social order (whether rationally designed utopias or absolute rule) often end in disaster. The systems that truly endure are those that admit their own imperfection, embed self-correction mechanisms, and allow continuous negotiation and evolution.
Conclusion
For thousands of years, human society has been striving to align individual behavior with collective values. We have created an extremely complex apparatus: laws, morality, religion, social norms, education, punishment, and incentive mechanisms. But this coordination was never truly "solved". Crime still exists, injustice still occurs, rules are still broken, and conflicts between values persist.
But this does not mean these mechanisms have failed. Their success lies not in achieving a perfect final state, but in establishing continuous mechanisms to discover, discuss, and correct misalignments. The court system does not establish justice once and for all; it provides a continuously operating mechanism for adjudicating specific conflicts. The appeal procedure does not admit that the original judgment was necessarily wrong, but that any judgment may be wrong and therefore requires review. The constitutional amendment process does not concede particular flaws in the Constitution, but acknowledges that no fixed text can foresee every future challenge. Our pursuit of AI alignment requires a similar humility. We should admit that we are building management mechanisms, some of which may become ineffective as AI capabilities advance. We will need continuous supervision, frequent adjustment, and permanent vigilance. It took law thousands of years to develop constitutionalism, procedural justice, multi-level review, and value negotiation. We, by contrast, are compressing a similar institutional evolution into a few years, and the compression itself may carry enormous risk.
My point is that law is not just a restraint but also a coordination mechanism. Many of its rules are arbitrary (whether vehicles drive on the left or the right), but what matters is that everyone follows the same rules. Perhaps part of the AI alignment problem lies not in finding the "absolutely correct" values, but in establishing a common framework around which both humans and AI can coordinate, even if that framework contains contradictions and compromises. And this is precisely what history tells us: the contradictions will always exist, but our mechanisms can improve through the negation of negation.
I have always believed that law is the crystallization of human wisdom, and also a testament to the limits of human intelligence. The legal experiment that has run for thousands of years is not over and will never end. What we can strive for is a system robust enough to let us live together in harmony, resolve differences through deliberation, and improve gradually in the face of inevitable imperfection.
Perhaps this is the wisdom that AI truly needs to learn.
References
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Berlin, I. (1958). Two Concepts of Liberty. Oxford University Press.
Dworkin, R. (1986). Law's Empire. Harvard University Press.
Fuller, L.L. (1964). The Morality of Law. Yale University Press.
Galanter, M. (1974). Why the "Haves" Come Out Ahead: Speculations on the Limits of Legal Change. Law & Society Review, 9(1), 95-160.
Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38(1), 173-198.
Goodhart, C.A.E. (1975). Problems of Monetary Management: The U.K. Experience. Papers in Monetary Economics (Reserve Bank of Australia), Vol. 1.
Hart, H.L.A. (1961). The Concept of Law. Oxford University Press.
Kelsen, H. (1934). Reine Rechtslehre. Franz Deuticke.
Mill, J.S. (1859). On Liberty. John W. Parker and Son.
Mill, J.S. (1861). Utilitarianism. Parker, Son, and Bourn.
Polanyi, M. (1966). The Tacit Dimension. University of Chicago Press.
Radbruch, G. (1946). Gesetzliches Unrecht und übergesetzliches Recht. Süddeutsche Juristen-Zeitung, 1, 105-108.
Rawls, J. (1971). A Theory of Justice. Harvard University Press.
Simon, H.A. (1962). The Architecture of Complexity. Proceedings of the American Philosophical Society, 106(6), 467-482.
Wittgenstein, L. (1953). Philosophical Investigations. Blackwell Publishing.
Footnotes
Gödel, 1931. Gödel's incompleteness theorem states that in any uniform formal system containing fundamental arithmetic, there exist propositions that can neither be proved nor disproved. Although the legal system is not a strict formal system, the analogy is still profound: limited rules cannot exhaust infinite real situations. ↩︎
Hart, 1961. Hart distinguished between the "clear cases" and the "penumbra" of law, pointing out that in penumbra cases, the open texture of legal language forces judges to exercise discretionary power. ↩︎
Wittgenstein, 1953. Wittgenstein's rule-following paradox holds that no course of action can be determined by a rule, because any course of action can, under some interpretation, be made to accord with the rule. This means the rules themselves cannot completely determine their application; some social practice must serve as the basis. ↩︎
Bai et al., 2022. Although this method borrows the metaphor of a constitution, I believe the original proposal sidesteps all the disputes in legal philosophy over constitutional interpretation: originalism vs. living constitutionalism, textualism vs. purposivism, etc. ↩︎
Fuller, 1964. Fuller argued that law is not only a system of rules but also contains an "inner morality," involving the balance of eight principles. These principles may themselves conflict and require contextualized judgment. ↩︎
Mill, 1859. In the second chapter of "On Liberty", Mill points out in his argument for freedom of thought that even wrong viewpoints have value because they force the holders of truth to re-examine and defend their positions. Confrontation institutionalized this principle. ↩︎
Galanter, 1974. Galanter pointed out that resource-rich "repeat players" have a structural advantage in the legal system because they can undertake long-term litigation, set precedents, and influence rule-making. ↩︎
Hart, 1961, Dworkin, 1986. Hart used the example of "vehicles entering the park" to illustrate the open structure of legal language. Dworkin believed that legal interpretation should pursue "integrity" to make the legal system the most morally consistent. The debate between the two constituted the core of legal philosophy in the 20th century. ↩︎
Mill, 1861. Mill famously argued that it is "better to be Socrates dissatisfied than a fool satisfied." But how can such a qualitative distinction be encoded in a utility function? ↩︎
Wittgenstein, 1953. The core insight of Wittgenstein's later philosophy: "The meaning of a word lies in its use in language." This means that when detached from social practice, the symbol itself has no fixed meaning. ↩︎
Rawls, 1971. Rawls argued that behind the veil of ignorance, rational agents would choose a social structure guaranteeing basic liberties and the difference principle (inequality is justified only when it improves the situation of the worst-off). ↩︎
Kelsen, 1934. Kelsen's "Pure Theory of Law" attempts to completely separate legal science from morality and politics, and only studies the logical structure of legal norms. ↩︎
Radbruch, 1946. After World War II, the German jurist Radbruch proposed that when positive law is extremely unjust, the effect of supra-statutory law should be recognized. But how is this standard itself determined? ↩︎
Although jury nullification exists in legal practice (for example, during the abolitionist movement in the United States, juries refused to convict those who helped people escape slavery), courts usually do not inform jurors that they hold this power. The debate over its legality and legitimacy continues to this day. ↩︎
Polanyi, 1966. Polanyi's concept of "tacit knowledge" states that we know more than we can tell. Some judgment abilities cannot be fully clarified or imparted. ↩︎
Here I will not cite particular papers. Reward hacking is extensively documented in the existing RL literature, from OpenAI's CoastRunners example (the agent learns to loop and collect rewards without finishing the race) to more recent cases. ↩︎
Simon, 1962. Simon argued that the stability of complex systems stems from a hierarchical structure, in which each layer changes at different rates. This principle has been verified in practice within the legal framework. ↩︎
Berlin, 1958. Berlin distinguished between negative freedom (freedom from interference) and positive freedom (self-determination), and argued that there exist genuine trade-offs between them. More broadly, his value pluralism holds that there are irreducible conflicts among human values. ↩︎