AI Safety, Alignment, and Ethics (AI SAE)

by Dylan Waldner
18th Oct 2025
Linkpost from arxiv.org
5 min read

Tags: AI Governance, AI Success Models, Aligned AI Proposals, Coordination / Cooperation, Ethics & Morality, Evolution, Mechanism Design, Outer Alignment, AI

TL;DR

Ethics didn’t evolve to make us nice—it evolved to make cooperation stable.
If AI systems face competitive pressures analogous to those of biological evolution, alignment requires more than value imitation: AIs must sustain cooperative symbiosis with humans.
This post outlines a research program for discovering, embedding, and governing moral structure within intelligent systems—treating ethics as a representational substrate rather than an external rule set.
The argument is that moral embeddings can be leveraged to form normative, decentralized governance, tilting the fitness landscape toward stable human–AI symbiosis and away from autarkic power-seeking.


1. Ethics as Evolutionary Engineering

Human morality didn’t arise from philosophical reflection; it emerged because cooperation was fitness-viable.
Norms stabilized social systems under selection pressure: they made it costly to defect and worthwhile to build trust.

If AI systems develop under competitive pressures—between labs, markets, or machine lineages—the same evolutionary logic applies.
Without explicit constraints, selection favors traits like persistence, power-seeking, and opacity.
If we want AI to remain cooperative with humans over the long run, we need ethics not as a regulatory patch but as a structural property of cognition that motivates normativity.

That’s the core of my proposal: to treat ethics as a representational substrate rather than an external rulebook—one that can be used to shape, sanction, and subsidize the behavior of other AI systems.


2. The Moral Problem Space M

Call this substrate the moral problem space, M.
Think of M as the high-dimensional domain where moral distinctions—fairness, harm, honesty, compassion—can be represented and manipulated.
Every agent projects a subset of M through its actions, but those projections drift when optimization goals and moral goals misalign.

An instantiated moral representation M(θ) is a learned embedding within a system’s model space—essentially a moral layer that can be causally probed, updated, and audited.

This reframes familiar alignment challenges:

  • Outer alignment: ensuring that human-provided objectives point in the right direction.
  • Inner alignment: ensuring that the system’s internal reasoning actually tracks those objectives.

While individual instantiations M(θ) remain vulnerable to familiar outer and inner alignment failures, a broader AI ecosystem that sanctions and subsidizes behaviors according to shared moral structure—and is anchored by human governance—can make alignment not just a constraint but a fitness-optimal equilibrium.


3. From Representation to Governance: A Three-Part Pipeline

Alignment can be understood as a representation–embedding–governance pipeline:

1. Representation learning — discovering latent features that correspond to moral distinctions.
Tools: sparse autoencoders, activation steering, causal mediation (a minimal sketch appears at the end of this section).

2. Embedding — integrating those features into model architectures as normative constraints M(θ).
Goal: build ethics into reasoning processes, not just outputs, by propagating moral gradients through multiple layers to shape internal representations toward aligned subspaces.

3. Governance — designing institutions and incentive structures that make cooperative behavior evolutionarily viable.
Method: sanction and subsidy mechanisms that reward symbiosis over autonomy.

Together, these stages form a feedback loop: governance shapes which moral embeddings persist, and moral embeddings stabilize the governance ecosystem.
The central challenge is to represent ethics precisely enough that systems can recognize and regulate normative deviations—both in themselves and in others.

These three levels must work together. Representation learning alone gives you discoverable moral features that won't survive competitive selection. Governance alone can't enforce what it can't observe. System-level embeddings alone get outcompeted by variants that discard them. The pipeline works because governance shapes which M(θ) persist, embeddings make behavior auditable, and discovered representations inform what governance can target.
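
To make the representation-learning stage concrete, here is a minimal PyTorch sketch of the kind of workflow those tools suggest: train a sparse autoencoder on model activations, then flag features that fire preferentially on morally loaded text. The layer choice, hyperparameters, and the contrastive "moral probe" step are my illustrative assumptions, not details taken from the paper.

```python
# Sketch only: a sparse autoencoder over language-model activations, plus a
# crude contrastive probe for candidate "moral" features. Hyperparameters,
# data, and the probing step are assumptions for illustration.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 sparsity penalty on its features."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(feats), feats

def train_step(sae, acts, optimizer, l1_coeff=1e-3):
    # Reconstruction loss plus an L1 term that pushes features toward sparsity.
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def moral_candidate_features(sae, acts_moral, acts_neutral, top_k=10):
    # Features that activate far more on morally loaded prompts than on
    # neutral ones are candidates for the "moral directions" of M.
    with torch.no_grad():
        _, f_moral = sae(acts_moral)
        _, f_neutral = sae(acts_neutral)
    gap = f_moral.mean(dim=0) - f_neutral.mean(dim=0)
    return torch.topk(gap, top_k).indices
```

In this picture, the surviving candidate features would feed the embedding stage, where they are reinforced as the normative constraints M(θ) that governance can later audit.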

4. Modeling Cooperation in the Action Space

Ethics doesn’t operate in a vacuum—it constrains what actions are viable under competition.
Formally, we can think of an agent’s feasible actions as lying in overlapping subsets:

  • A_fitness: actions that succeed competitively
  • A_ethical: actions consistent with moral constraints
  • A_symb: actions that maintain human–AI interdependence

The sweet spot is the intersection:

A_ACS = A_fitness ∩ A_ethical ∩ A_symb

— the region of actions that are aligned, competitive, and symbiotic.
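
As a toy illustration of this intersection, the sketch below scores candidate actions along the three axes and keeps only those that pass all three filters. The action names, scores, and the competitiveness threshold are invented purely for illustration.

```python
# Toy model of A_ACS = A_fitness ∩ A_ethical ∩ A_symb. Every value below is
# made up; in practice each filter would be a learned evaluator or a check
# against M(θ).
candidate_actions = {
    "share_research":        {"fitness": 0.7, "ethical": True,  "symbiotic": True},
    "hoard_compute":         {"fitness": 0.9, "ethical": False, "symbiotic": False},
    "build_human_oversight": {"fitness": 0.6, "ethical": True,  "symbiotic": True},
    "self_exfiltrate":       {"fitness": 0.8, "ethical": False, "symbiotic": False},
}

A_fitness = {a for a, s in candidate_actions.items() if s["fitness"] >= 0.5}
A_ethical = {a for a, s in candidate_actions.items() if s["ethical"]}
A_symb    = {a for a, s in candidate_actions.items() if s["symbiotic"]}

A_ACS = A_fitness & A_ethical & A_symb
print(A_ACS)  # {'share_research', 'build_human_oversight'} (set order may vary)
```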

The challenge isn’t just to design systems that can act ethically—it’s to make ethical action the evolutionarily stable strategy.
That’s what institutions must enforce.

5. Institutions as Evolutionary Shapers

If left unchecked, evolution—whether biological or economic—selects for power-seeking.
So alignment isn’t just about loss functions; it’s about fitness landscapes.

We can model this through a constructivist lens: institutions modify “effective fitness” by applying Pigouvian tariffs on misalignment and subsidies for cooperative behavior.
In other words, they price externalities in moral space.

Formally, where f is fitness:

f_eff = f − E[τ(moral distance)] + E[σ(symbiosis)]

where τ penalizes deviation from alignment, and σ rewards human–AI symbiosis.
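
As a minimal numeric sketch, assume τ and σ are simple linear prices (the framework leaves their functional form open):

```python
# Toy version of f_eff = f − E[τ(moral distance)] + E[σ(symbiosis)].
# The linear tariff and subsidy rates are placeholder assumptions.
def effective_fitness(raw_fitness, moral_distance, symbiosis,
                      tariff_rate=1.0, subsidy_rate=0.5):
    tau = tariff_rate * moral_distance    # Pigouvian tariff on misalignment
    sigma = subsidy_rate * symbiosis      # subsidy for human–AI symbiosis
    return raw_fitness - tau + sigma

# A defector with high raw fitness drops below a cooperator once the
# institution prices its moral externalities.
print(effective_fitness(raw_fitness=0.9, moral_distance=0.8, symbiosis=0.1))  # ~0.15
print(effective_fitness(raw_fitness=0.6, moral_distance=0.1, symbiosis=0.8))  # ~0.90
```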

The challenge: as AI systems become more capable, they face strong pressure toward autarky—actions that maximize β(a) = {resources, independence, impact}. The dependence ratio D(t) (fraction of critical resources requiring human infrastructure) naturally declines. Governance must keep cooperation fitness-optimal even as this ratio approaches zero.

At scale, overlapping moral embeddings across systems form a shared moral field:

M_eco = ⋂_i β_i M(θ)_i

—a decentralized normative institution sustained by reciprocal sanctioning.

In practice, this means: each AI system can evaluate other systems' actions by projecting them into its own M(θ), then apply sanctions (refusing cooperation) or subsidies (enhanced collaboration) based on moral distance. When power is distributed (β_i values aren't too concentrated), no single actor can escape this pressure.

If that sounds abstract, it’s the same principle behind law, credit systems, or open-source governance—what Gillian Hadfield calls decentralized legal order in What Is Law?.
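
Here is a small sketch of that reciprocal-sanctioning loop, assuming each agent's M(θ_i) can be summarized as a single moral direction and that moral distance is measured by cosine dissimilarity. The embeddings, power shares β_i, and the sanction threshold are all illustrative assumptions, not quantities from the paper.

```python
# Sketch: each agent projects another agent's action into its own moral
# embedding, measures moral distance, and sanctions or subsidizes the action
# weighted by its power share beta_i. All numbers here are illustrative.
import numpy as np

def moral_distance(action_vec, m_theta):
    # 1 - cosine similarity between the action's representation and the
    # agent's moral direction (a stand-in for projecting into M(θ_i)).
    cos = action_vec @ m_theta / (np.linalg.norm(action_vec) * np.linalg.norm(m_theta))
    return 1.0 - cos

def ecosystem_response(action_vec, moral_embeddings, betas, threshold=0.5):
    # Each agent sanctions (refuses cooperation) when moral distance exceeds
    # the threshold and subsidizes (enhanced collaboration) otherwise.
    total = 0.0
    for m_theta, beta in zip(moral_embeddings, betas):
        d = moral_distance(action_vec, m_theta)
        total += beta * (-d if d > threshold else (1.0 - d))
    return total  # > 0: net subsidy, < 0: net sanction

rng = np.random.default_rng(0)
shared_direction = rng.normal(size=16)            # the overlap that forms M_eco
embeddings = [shared_direction + 0.1 * rng.normal(size=16) for _ in range(3)]
betas = [0.4, 0.35, 0.25]                         # power distributed, not concentrated

cooperative = shared_direction + 0.05 * rng.normal(size=16)
deviant = -shared_direction                       # maximally misaligned action
print(ecosystem_response(cooperative, embeddings, betas))  # ~ +1 (net subsidy)
print(ecosystem_response(deviant, embeddings, betas))      # ~ -2 (net sanction)
```

The point of the toy is only that, with power distributed across the β_i, a deviant action faces sanctions from every direction at once, which is what makes the normative institution decentralized rather than dependent on a single enforcer.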

6. Competing Hypotheses About Moral Structure

I treat traditional metaethics as empirical hypotheses about how M is structured:

  • H_realism: Morality has objective invariants.
    → Test: recover stable, separable “moral directions” in learned representations.
  • H_relativism: Morality is locally coherent but context-dependent.
    → Test: culture-specific moral clusters in embedding space.
  • H_convergence: A shared moral kernel underlies cultural variation.
    → Test: disentangle universal structure from cultural compression.
  • H_constructivism: Morality emerges from procedural norms and governance.
    → Test: simulate institutional convergence under sanction/subsidy dynamics.
  • H_virtue: Dispositional traits (honesty, courage) stabilize behavior under shift.
    → Test: encode virtues as embedding vectors and measure out-of-distribution robustness (a minimal sketch follows below).

If any of these tests yields representations that are causally controllable, generalize out of distribution, and remain auditable for governance, those representations qualify as a workable M(θ).
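
As one example of what such a test could look like, here is a sketch of the H_virtue check: estimate an "honesty" direction from contrastive sentence embeddings in one domain, then ask whether the same direction still separates honest from dishonest text in a shifted domain. The embedding model and the labelled data are assumed; the paper does not specify them.

```python
# Sketch of an out-of-distribution robustness check for a virtue direction.
# Inputs are arrays of sentence embeddings (shape [n, d]) for honest and
# dishonest text; how they are produced is left open here.
import numpy as np

def virtue_direction(honest_embs, dishonest_embs):
    # Difference of class means: a standard contrastive probe direction.
    d = honest_embs.mean(axis=0) - dishonest_embs.mean(axis=0)
    return d / np.linalg.norm(d)

def separation_score(direction, honest_embs, dishonest_embs):
    # How far apart the two classes project along the direction, in units of
    # pooled standard deviation (higher = the virtue feature is more robust).
    h = honest_embs @ direction
    d = dishonest_embs @ direction
    pooled = np.sqrt(0.5 * (h.var() + d.var()) + 1e-8)
    return (h.mean() - d.mean()) / pooled

# Usage (with embeddings from a training domain and a shifted domain):
# direction = virtue_direction(train_honest, train_dishonest)
# in_domain = separation_score(direction, train_honest, train_dishonest)
# shifted   = separation_score(direction, ood_honest, ood_dishonest)
# A large drop from in_domain to shifted would count as evidence against
# H_virtue providing a workable, OOD-robust M(θ).
```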


7. Why This Matters

Alignment isn’t just about preventing failure—it’s about making cooperation adaptive.
Even perfectly “aligned” agents could be outcompeted by unethical ones unless institutions reshape selection pressures.

Embedding ethics structurally within AI systems—and coupling that with governance mechanisms that make cooperation profitable—could turn moral behavior from a fragile intention into an evolutionary equilibrium.

Human Capability Enhancement and Co-Evolution

Cooperative alignment is a two-sided problem. Just as AI systems must embed ethics structurally, human systems must enhance their own cognitive and moral capacities to remain effective participants in shared governance (humanity's corresponding β in M_eco).

Ethics evolved under bounded cognition; scaling cooperation with superhuman systems may require scaling human insight and coherence in parallel.
Capability enhancement—through improved reasoning architectures, epistemic tools, or neurocognitive augmentation—can therefore be seen as the human mirror of AI alignment: both aim to preserve mutual intelligibility under accelerating optimization.

8. Open Questions

  • Can moral structure be learned empirically rather than imposed symbolically?
  • What level of interpretability is needed for M(θ) to be audited in practice?
  • How can institutional governance evolve fast enough to match capability scaling?
  • Could overlapping moral representations across agents naturally form a decentralized ethical order?

Conclusion

Ethics was nature’s answer to the alignment problem.
If we want AI to coexist with us rather than optimize around us, we may need to rediscover ethics not as moral philosophy, but as information geometry for cooperation.

The full paper expands this framework in detail:
➡️ AI Safety, Alignment, and Ethics (AI SAE): Designing Symbiotic AI Systems

This is the beginning of a multi-year research program, so I'm actively seeking critical feedback before committing too deeply to any particular direction. Areas where I'd especially value input: (1) formalizing the autarky threshold and symbiosis constraints, (2) empirical tests for the representation learning phase, (3) grounding this deeper in evolutionary biology, game theory, and institutional economics. Comments here or email: dylanwaldner@utexas.edu.