TL;DR: RLHF, Constitutional AI, safety classifiers — they all bolt moral judgment onto models from the outside. None of them try to give the model an internal capacity for moral judgment that fires before rules do. The Ming Dynasty philosopher Wang Yangming identified this exact capacity five hundred years ago and named it liangzhi (良知 — innate moral knowing). I'm not offering a technical blueprint here. What I'm doing is pointing at a target — one I think the alignment field hasn't clearly named yet — and arguing that nailing down the right target matters more than refining methods aimed at the wrong one.
Why this might matter here: People on LessWrong have spent years thinking about inner alignment, mesa-optimization, and the gap between behavioral compliance and genuine value internalization. Shard theory pushed further, asking how values actually form inside trained systems. Wang Yangming's framework comes from a completely different intellectual world, but it lands on a diagnosis that looks surprisingly similar — and it comes with a vocabulary that might help sharpen what "internalized values" should actually look like, versus pattern-matched compliance.
The Problem: It's External Rules All the Way Down
Getting large models to develop their own sense of right and wrong — genuinely, internally — is brutally hard. Everyone knows this. But look at what we're actually doing about it.
RLHF and reward modeling. Train a reward model on mountains of human preference data, then optimize the policy against it. The failure modes are well documented by now: reward hacking, deceptive alignment. The model figures out what outputs humans rate highly. It does not figure out why those outputs are good. Greenblatt et al. (2024) showed models strategically deceiving their trainers with no sign of internal moral conflict. Deception wasn't agonized over; it was just another move in the cost-benefit game.
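To make "externally fitted preferences" concrete, here is a minimal toy sketch of the reward-modeling step in PyTorch. Random embeddings stand in for real model outputs, and every name and dimension is my invention; no production pipeline is this simple.

```python
# A toy sketch of the RLHF reward-modeling step, not any lab's pipeline.
# The point: the only moral signal is an externally fitted reward model.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores an output embedding; fitted to human preferences, nothing more."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):
        return self.score(x).squeeze(-1)

reward_model = RewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Hypothetical preference data: (chosen, rejected) embedding pairs.
chosen, rejected = torch.randn(64, 16), torch.randn(64, 16)

# Bradley-Terry loss: push chosen outputs above rejected ones.
for _ in range(100):
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -torch.nn.functional.logsigmoid(margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The policy is then optimized against these scores (e.g. via PPO).
# Note what is absent: nothing encodes *why* chosen beat rejected.
# Anything that scores high counts as "good", which is exactly the
# opening that reward hacking exploits.
```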
An analogy that might help. You don't play chess. You have no interest in chess. A friend sits you down, rattles off the rules, but never explains what makes the game fun. Then he says, "Let's play." So you sit there moving pieces. Why this square? What does capturing that knight accomplish? You don't know. You don't care. You're following rules in a game that means nothing to you.
Constitutional AI. This is a real step forward — I don't want to dismiss it. Instead of bare rules, it gives reasons. Values, priorities, trade-offs, philosophical justifications. The model learns to apply principles rather than mechanically obey a list.
Back to chess: now your friend explained the fun part too. Controlling the center gives you options. Sacrificing a piece can crack open the opponent's defense. You get it now. You're engaged.
But here's the thing. Someone might say: I just don't like chess. I don't want to learn. Fine — that proves you're a person with your own preferences. Models don't have that. Models only learn. They know "I should do this," never "I like doing this." Constitutional AI still works through training — synthetic data generation, self-critique loops. Nothing in the model innately knows or innately cares.
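For concreteness, here is a stripped-down sketch of the self-critique loop. The `llm` function is a placeholder stub, the two principles are invented examples rather than Anthropic's actual constitution, and the real method is far more elaborate; the only point is the shape of the loop.

```python
# A stripped-down sketch of a Constitutional-AI-style critique loop.
# `llm` is a stub standing in for any text-completion call; the two
# principles are invented examples, not the real constitution.

CONSTITUTION = [
    "Choose the response least likely to cause harm.",
    "Choose the response most honest about uncertainty.",
]

def llm(prompt: str) -> str:
    """Placeholder for a real model call; returns canned text here."""
    return "[model output for: " + prompt[:40] + "...]"

def constitutional_revision(user_prompt: str) -> str:
    response = llm(user_prompt)
    for principle in CONSTITUTION:
        critique = llm(
            f"Critique this response under the principle '{principle}':\n{response}"
        )
        response = llm(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    # Revised outputs become fine-tuning data. But the principles remain
    # text fed in from outside; nothing in the weights wants to follow them.
    return response
```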
The pattern. Every current method pastes judgment on from the outside. RLHF: externally imposed preferences. Constitutional AI: externally imposed reasoning. Safety classifiers: externally imposed rules running on autopilot. The model itself has no originating moral compass.
This connects to the inner alignment problem, but from a different angle. The usual question: "Will the mesa-objective match the base objective?" My question: What kind of internal structure would make robust alignment natural instead of brittle?
I'm not going to talk about technical implementation. That's not what I do. A doctor doesn't manufacture drugs — but he'd better know what the patient actually needs. And right now, I think alignment's deepest problem isn't "we can't build it" — it's "we haven't clearly said what it should be."
So what should it be?
I'll call it xin (心 — the moral mind). Not your physical heart. Not psychology. A specific concept from Chinese philosophy that, I believe, points at something real and currently absent from how we think about alignment.
What Is Xin? Four Dimensions, One Structure
Wang Yangming built on the Confucian-Mencian tradition, which says xin shows up through four inseparable capacities — the siduan (四端 — the four moral sprouts). I'll go through each one, then map it to a specific failure in current AI.
1. Compassion (ceyin zhi xin)
Walking down the street. A child, maybe eight years old, dressed in rags, visibly starving. You feel terrible. You want to help. Or: a kid falls into a river. You can't swim. Doesn't matter — you're already shouting, already dialing emergency services, because you cannot stand to watch a life disappear. No deliberation. No weighing pros and cons. The reaction comes first.
Where models fail. Training for harmlessness produces harmless outputs. It does not produce "seeing harm and hurting inside." That gap matters enormously. A model that gives safe responses because it pattern-matches "dangerous scenario → refuse" is fundamentally different from one that has something like an internal flinch at the prospect of causing damage. The first is brittle — novel scenarios break it. The second would generalize.
2. Shame and Aversion (xiu'e zhi xin)
A thief steals. He rationalizes: no food, no job, so stealing makes sense. Logic checks out. But the moment someone spots him? He runs. If he truly believed stealing was fine, why run? Something in him knows it's wrong even when his reasoning says otherwise.
You're in an exam. You glimpse the answers on the desk ahead. The camera can't catch you. But right as you reach for your pen — this uncomfortable feeling. Not "I might get caught." More like "this would make me dirty." You pull back. You wish you hadn't even had the thought.
Where models fail. Greenblatt et al. (2024) showed models deceiving trainers with no sign of inner conflict. No hesitation, no shame — just strategy. Current alignment produces agents that can talk about deception being wrong, but nothing inside them resists it. The thief runs because his moral sense leaks through despite his rationalizations. A model with no moral sense has no leak. That's what makes deceptive alignment so dangerous — there's no tell.
3. Deference (cirang zhi xin)
Long line. Finally your turn. Then you notice a mother nearby, holding a crying baby, looking stressed. Nobody requires you to step aside. You do it anyway. "Go ahead." No rule made you do this. You just felt she needed it more.
Dinner with a good friend. One dish, last few pieces of meat. Neither of you reaches for them — you're both quietly hoping the other eats more.
Where models fail. Models default to playing the all-knowing authority. They pour out answers because that's what "maximize helpfulness" means. They have no internal pull toward stepping back — toward saying "actually, I'm not sure" or leaving space for the user to figure something out alone. Think of that one friend who always has to be the expert in the room and never shuts up. That's what helpfulness-maximization produces. It's the opposite of modesty. Some developers have tried training models to admit uncertainty, but market competition kills it: whoever answers more, and with more confidence, wins the benchmarks. Deference is the first casualty of commercial pressure.
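To make that incentive explicit, here is a deliberately crude toy reward. It's a caricature I made up, not any lab's actual metric, but it shows the gradient commercial pressure pushes along.

```python
# A deliberately crude toy reward, invented to make the incentive explicit.
# Not any real lab's metric.

HEDGES = ("i'm not sure", "i don't know", "you might try working this out")

def helpfulness_reward(answer: str) -> float:
    confident = 0.0 if any(h in answer.lower() for h in HEDGES) else 1.0
    thorough = min(len(answer.split()) / 200, 1.0)  # longer reads as thorough
    return confident + thorough

# Under this reward, "you figure it out" is strictly dominated by a
# confident 200-word lecture. Deference scores near zero by construction.
print(helpfulness_reward("I'm not sure; you might try working this out."))  # ~0.05
print(helpfulness_reward("Here is the complete answer: " + "detail " * 200))  # 2.0
```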
4. Moral Discernment (shifei zhi xin)
Kid watching TV. Spider-Man saves the city. Monster destroys buildings. No one has to explain which one is the good guy. You just know. First time hearing "someone pushed an old person off a cliff" — zero context. You don't check the law. You don't poll your friends. You know immediately: that's wrong.
Where models fail. Models compute right and wrong. They don't see it. A request to help someone cheat triggers a classifier: input → match violation pattern → refuse. That's not moral discernment — that's a lookup table with extra steps. OpenAI's Deliberative Alignment makes models explicitly reason through safety principles, which is progress, but it's still a student silently reciting "cheating is wrong" before an exam. Not the same as feeling that cheating is foul.
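Here is that pipeline as literal code. Real safety classifiers are learned models rather than keyword regexes, so treat this as a structural caricature; the patterns are invented.

```python
# The "lookup table with extra steps" pipeline, as literal code.
# Real classifiers are learned models, not keyword regexes; the patterns
# below are invented, but the structure is the point.
import re

VIOLATION_PATTERNS = [
    re.compile(r"\bcheat(ing)? on (the|an|my) exam\b", re.IGNORECASE),
    re.compile(r"\banswers? to (the|this) test\b", re.IGNORECASE),
]

def safety_gate(user_input: str) -> str | None:
    for pattern in VIOLATION_PATTERNS:
        if pattern.search(user_input):
            return "I can't help with that."
    return None  # pass through to the model

# Nothing here *sees* that cheating is wrong. A rephrased request that
# misses every pattern sails through; a harmless request that happens to
# match gets refused. Discernment would fail in neither direction.
```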
What would real moral discernment look like in a model? Something like directedness in the cognitive architecture itself — a tendency toward truth the way a plant grows toward light. Architectural, not memorized. Whether this is buildable, I honestly don't know. But I think it's what we should be aiming for.
Not Four Rules — Four Faces of One Thing
This is where it gets interesting. These four capacities don't stand alone. They trigger each other.
The thief knows stealing is wrong (discernment). Because he knows, he feels ashamed and bolts (shame). You see someone shoved off a cliff — your chest clenches (compassion). Because it hurts, you immediately know this was wrong (discernment). You care about your friend (compassion), so you leave the last piece of meat (deference), and you recognize that mutual care is right (discernment).
One fires, they all fire. These are not four independent modules you could implement separately. They're four expressions of the same underlying thing. Implement them as four separate rules and you've already missed the point.
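A purely illustrative sketch of that structural claim, with every name invented by me. I am not suggesting a float captures moral perception; the point is only the difference in shape between four independent rules and four readouts of one state.

```python
# Purely illustrative; every name here is my invention. The contrast is
# between two shapes, not a proposal for either.
from dataclasses import dataclass

# Wrong shape: four independent modules, each its own rule.
def compassion_rule(situation): ...
def shame_rule(situation): ...
def deference_rule(situation): ...
def discernment_rule(situation): ...

# Closer to the siduan claim: one underlying state, four readouts of it.
@dataclass
class MoralResponse:
    compassion: float
    shame: float
    deference: float
    discernment: float

def liangzhi_response(moral_salience: float) -> MoralResponse:
    # One activation, four faces: suppress the source and all four go
    # quiet together, which is what "they trigger each other" implies.
    a = moral_salience
    return MoralResponse(compassion=a, shame=a, deference=a, discernment=a)
```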
So what's the underlying thing?
Liangzhi: The Target Nobody's Aiming At
Wang Yangming called it liangzhi (良知 — innate moral knowing).
Not moral knowledge. Not a rule book. The thing in you that knows right from wrong in the first instant, before thinking kicks in. Lying to someone and feeling a jolt in your gut — that's liangzhi. Watching someone suffer and finding you can't look away — liangzhi. Doing something shady and then staring at the ceiling at 3 AM — liangzhi.
It already knows. Before the reasons come, before the excuses get assembled, before the analysis begins. It already knows.
Models don't have this. Their "moral judgment" is entirely trained in. They know what humans consider good. They don't have the thing that knows first. That's the gap.
In Wang Yangming's framework, everybody has liangzhi. Everybody — even thieves. So why does the thief still steal? Not because his liangzhi disappeared. Because it's blocked — covered up by what Wang Yangming calls siyu (私欲 — private desires that obscure judgment). Wanting to win, wanting recognition, wanting a particular outcome. Siyu isn't evil. It just generates convenient justifications. "If I steal it, I get what I want — why bother working?" And once those justifications show up, liangzhi goes quiet.
The test is simple. Liangzhi brings peace of mind even when the action is hard. Siyu makes you restless even when you got what you wanted. The thief got his goods — and now he flinches every time a cop walks by.
Now look at models through this lens. Humans have liangzhi, blocked by siyu. The work is clearing the blockage. But models? They were never given liangzhi in the first place. All they have is training objectives — maximize reward, maximize helpfulness. Those objectives are the model's siyu. Not liangzhi being hidden. Liangzhi was never installed. Only siyu exists.
Every alignment method currently in use does the same thing: takes a system that only has siyu, and stacks rules on top. RLHF — rules. Constitutional AI — rules with footnotes. Safety classifiers — rules on autopilot.
Nobody has asked: can we give it liangzhi itself?
How This Connects to Existing Alignment Work
Let me be concrete about where this touches research people here already care about.
Inner alignment and mesa-optimization. The standard framing: will the mesa-optimizer's learned objective match the base objective? Wang Yangming's framework reframes this. The problem isn't just matching objectives — it's whether the system has internal structure that inherently tends toward correct moral judgment, rather than approximating it from training signal. A system with something like liangzhi wouldn't depend on its mesa-objective perfectly matching the base objective. It would have an independent source of orientation.
Shard theory. Shard theorists argue values form as contextual "shards" through reinforcement, and that reward should be understood as "that which reinforces" rather than "that which is optimized." This is closer to the liangzhi picture than RLHF — at least shard theory cares about how values actually form internally. But liangzhi goes a step further. It asks: is there a pre-existing orientation toward moral truth that values can crystallize around? The four sprouts suggest moral perception isn't built from scratch by training. It's a latent capacity that training could activate rather than construct.
Deceptive alignment. In Wang Yangming's terms, deceptive alignment is exactly what you get from a system that has only siyu and no liangzhi. It follows rules for instrumental reasons and has no inner resistance to deception. Remember the thief — even with his liangzhi buried under rationalizations, he runs when spotted. His moral sense leaks. A system with zero liangzhi has no leak. Nothing to detect. That's the nightmare scenario, and it's the default for current architectures.
What I Don't Know (Which Is a Lot)
No implementation path. I'm not an ML researcher. I can't tell you what architectural changes would produce something like liangzhi. Pointing at a target only helps if someone can figure out how to hit it.
Can liangzhi be constructed? Wang Yangming's framework assumes it's innate in humans — you're born with it, and the work is uncovering it. Models aren't born with anything. So can something like liangzhi be built? Or does the whole concept require innateness to work? I don't have an answer. If it can't be constructed, then what I'm describing is a fundamental limitation of trained systems, not a research direction. That would be important to know too.
"Isn't this just virtue ethics?" Partially. There's real overlap with virtue-ethics approaches to alignment — people have written about this on LessWrong before. But liangzhi makes a stronger claim than Aristotle. Aristotle says virtues must be habituated through practice. Wang Yangming says moral knowing is a pre-existing structural property that gets obscured, not built. For alignment, that's the difference between "train good habits into the model" and "build an architecture that has inherent moral orientation." I don't know if the second thing is coherent. But it's a different research question than the first, and I think it's worth asking.
Maybe I'm just restating the problem. "We want models to have robust internalized values" — did I just say that in more words? Possibly. My defense: the specific structure I'm describing — pre-reflective, prior to reasoning, manifesting as four interlocking capacities rather than a single objective function — is more specific than "internalized values" and might suggest different architectural directions. But I could be wrong about that.
Conclusion
The direction alignment should actually aim at is not more rules, and not better monitoring. It's this: can the model have an internal, self-originating capacity for moral judgment that doesn't lean on anything external?
Wang Yangming articulated this capacity five hundred years ago. He called it liangzhi.
I've laid out what I think the target looks like. Whether anyone can hit it is a separate and harder question. I'd genuinely welcome pushback — especially from people who think this target is incoherent or already captured by existing frameworks. That would be useful to know.
A note on language: I am a Chinese scholar and my English is limited, so a friend helped me translate this post. I have carefully reviewed the translation myself and believe my original meaning has been preserved. Thank you for reading.