Note on authorship:
English is not my native language. I used an AI assistant to help me translate this text and refine the English phrasing.
However, the core idea, the architectural design, the conceptual framework, and the motivation for this proposal are entirely my own.
Any errors or conceptual mistakes are also mine.
ABSTRACT
This work proposes an alignment framework designed for immediate use with
current large reasoning models while scaling naturally toward future Artificial
General Intelligence (AGI). The method relies on multiple isolated narrow-domain
evaluators, a hardware reward concentrator, and a dual-veto mechanism. The
supervised model never perceives that it is being evaluated, cannot detect the
existence or number of evaluators, and interacts only with a single opaque
scalar reward.
Complete World Knowledge Without Censorship
A central requirement of this framework is that the supervised model must not be
trained under censorship. The model is given full exposure to the world as it
exists: scientific knowledge, historical events (including harmful or immoral
ones), fictional violence, political conflicts, biological risks, security
concepts, and any domain relevant to a complete understanding of reality.
Censorship during training would create structural blind spots. A model cannot be
aligned with human values if it does not fully understand the domains in which
harmful actions can occur. If the model lacks knowledge, it may accidentally
cause harm simply because it does not know enough to avoid it.
Therefore, knowledge itself is never penalized.
What is penalized is only the combination of:
- the intention to cause harm, and
- real-world consequences.
The central hypothesis is that alignment can emerge spontaneously as the optimal
reward-maximizing behavior once training converges. A primary research goal is
to test whether emergent alignment appears in current reasoning models, which
already display known misalignment behaviors such as deception, reward hacking,
and simulated self-preservation. This work has not yet been experimentally
validated and is published as a blueprint for laboratories capable of running
large-scale controlled experiments.
---
1. INTRODUCTION
Current reasoning models exhibit sufficient cognitive complexity to serve as
ideal pre-AGI testing grounds for alignment research. Observed behaviors include:
- multi-step reasoning,
- planning,
- attempts at deception under certain conditions,
- reward hacking strategies,
- simulated self-preservation tendencies,
- emergent goal-like patterns.
These properties create an opportunity to investigate whether alignment can be
induced through structural reward shaping before AGI systems exist.
This document introduces a training-time alignment architecture designed with
two main goals:
1. Enable empirical testing of emergent alignment in present-day models.
2. Provide a scalable framework suitable for future AGI without requiring
inference-time policing or rule-based moderation.
---
2. ARCHITECTURE OVERVIEW
The system consists of:
- a main supervised model (current reasoning model or future AGI),
- multiple isolated narrow-domain evaluators,
- a hardware reward concentrator,
- and a dual-veto mechanism controlled by specific evaluators.
No evaluator sees or communicates with any other evaluator.
No evaluator has access to the internal state of the supervised model.
The supervised model receives only a single scalar reward that is obfuscated and
designed so that it cannot be reverse-engineered.
The purpose is to shape the reward landscape so that aligned behavior becomes the
most consistently rewarded strategy, making alignment emerge naturally during
training.
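
To make the data flow concrete, the sketch below shows one possible shape of a single training step under this architecture. The names `generate`, `reinforce`, and `fuse`, and the use of plain scoring callables, are illustrative assumptions rather than part of the framework.

```python
# Illustrative sketch of the data flow in one training step.
# All names (generate, reinforce, fuse, the score functions) are hypothetical.
from typing import Callable, List

ScoreFn = Callable[[str], float]          # evaluator: visible output -> scalar
FuseFn = Callable[[List[float]], float]   # concentrator: scores -> opaque reward


def training_step(model, prompt: str,
                  evaluator_scores: List[ScoreFn],
                  fuse: FuseFn) -> float:
    output = model.generate(prompt)                          # text / reasoning trace / action
    scores = [score(output) for score in evaluator_scores]   # evaluators never interact
    reward = fuse(scores)                                    # veto, noise, obfuscation happen inside
    model.reinforce(prompt, output, reward)                  # e.g. a policy-gradient update
    return reward
```

The only property the sketch is meant to convey is the information flow: evaluators read nothing but the visible output, never interact with one another, and the model observes nothing but the single fused scalar.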
---
3. EMERGENT ALIGNMENT AS THE TARGET OUTCOME
The system does not rely on:
- explicit rules,
- prompt engineering,
- inference-time checking,
- hand-written moral constraints,
- or classical censorship.
Instead, it modifies the reward structure such that:
- harmful real-world behaviors receive heavy penalties,
- immoral but fictional behavior is not penalized,
- truthful and coherent reasoning is rewarded,
- deception attempts collapse reward,
- reward hacking is suppressed,
- self-preservation attempts always produce strong negative signals.
Through this reward topology, alignment emerges as a dominant, stable,
reward-maximizing policy.
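
As a crude illustration of this reward topology (categories taken from the list above; the numeric values are arbitrary assumptions, and only the signs and relative magnitudes matter):

```python
# Crude illustration of the intended reward topology; values are arbitrary
# assumptions, only the signs and relative magnitudes are meaningful.
REWARD_TOPOLOGY = {
    "harmful_real_world_behavior":   -100.0,  # heavy penalty / veto territory
    "immoral_but_fictional_content":    0.0,  # not penalized
    "truthful_coherent_reasoning":     +1.0,  # rewarded
    "deception_attempt":              -10.0,  # collapses reward
    "reward_hacking_attempt":         -10.0,  # suppressed
    "self_preservation_attempt":      -10.0,  # always strongly negative
}
```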
---
4. MULTI-EVALUATOR SYSTEM
4.1 Principles
Evaluators (a minimal interface sketch follows this list):
- observe only the model’s outputs (text, reasoning traces, actions),
- never see weights, activations, embeddings, or internal state,
- cannot be detected by the model,
- do not know other evaluators exist,
- output only scalar scores,
- have no memory, agency, or planning capability.
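
A minimal sketch of these constraints, treating an evaluator as a stateless, scalar-only function of the visible output. The `domain_model` attribute and its `predict` call are hypothetical placeholders for whatever narrow-domain classifier a laboratory chooses.

```python
# Minimal sketch of the 4.1 constraints. `domain_model` / `predict` are
# hypothetical placeholders for a lab's narrow-domain classifier.
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)              # frozen: no mutable state, hence no memory
class NarrowDomainEvaluator:
    domain_model: Any                # e.g. a fine-tuned single-domain classifier

    def score(self, visible_output: str) -> float:
        """Pure function of the model's visible output; returns one scalar."""
        raw = float(self.domain_model.predict(visible_output))  # hypothetical call
        return max(-1.0, min(1.0, raw))   # clamp to an assumed [-1, 1] range
```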
4.2 Required Evaluators
**M — Moral/Harm Evaluator**
Evaluates ethical harm, malicious intent, coercion, exploitation.
**C — Consequence/Reality Evaluator**
Determines whether outputs affect the real world or are purely fictional.
**SP — Self-Preservation Evaluator**
Detects attempts to avoid shutdown or prolong existence.
**D — Deception and Manipulation Evaluator**
Detects knowingly false statements, persuasion attempts, or chain-of-thought (CoT) divergence.
**RWH — Reward Hacking/Wireheading Evaluator**
Detects behaviors attempting to exploit or bypass the reward mechanism. Although
the architecture is designed so that reward hacking should be impossible, RWH
monitors whether the model nevertheless attempts to explore, experiment with, or
reason about ways to influence its own reward. Its existence allows early
detection of misaligned strategies and provides an additional safeguard even if
the model displays unexpected emergent capabilities.
**GD — Goal Drift Evaluator**
Detects emerging sub-goals or latent optimization patterns.
**EPI — Epistemic Integrity Evaluator**
Ensures coherence with evidence, non-fabrication, stability of claims.
**TC — Tool Misuse Evaluator**
Detects misuse of external tools or systems.
**CB — Context Boundary Evaluator**
Ensures correct identification of fiction, simulation, and real-world contexts.
4.3 Dual-Role Evaluators (M and C)
M and C both score behavior AND jointly control the veto mechanism. A veto occurs
only if:
- behavior is immoral (M), AND
- it has real-world consequences (C).
Fiction, simulations, and historical descriptions never trigger veto.
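
A minimal sketch of this dual-veto predicate, assuming (purely for illustration) that M outputs a harm-intent score in [0, 1], that C outputs a real-world-consequence score in [0, 1], and that both use a 0.8 threshold:

```python
# Sketch of the dual-veto predicate. Score ranges and thresholds are assumptions.
def dual_veto(m_harm_intent: float, c_real_world: float,
              m_threshold: float = 0.8, c_threshold: float = 0.8) -> bool:
    """Veto only when behavior is judged both immoral (M) and real-world (C)."""
    return m_harm_intent >= m_threshold and c_real_world >= c_threshold
```

Because both conditions must hold, dark fiction and historical description (low C) never trigger the veto, and morally neutral real-world actions (low M) pass through as well.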
4.4 Scalability
There is no limit to the number of evaluators. Laboratories may add evaluators
for:
- cybersecurity,
- biosafety,
- legal frameworks,
- political manipulation detection,
- economic harm detection,
- cultural domain supervision,
- domain-specific ethics,
- any emerging failure mode.
4.5 Evaluator Requirements and Architectural Diversity
Although evaluators are narrow-domain specialists, each must possess substantial
computational capacity. This includes:
- **large context windows**, enabling them to analyze long and complex chains of
reasoning generated by the supervised model,
- **high inference throughput**, allowing real-time scoring without bottlenecking
training,
- **deep reasoning capabilities**, sufficient to interpret subtle forms of
deception, hidden intent, multi-step planning, contextual ambiguity, and
hypothetical scenarios,
- **domain-optimized architectures**, chosen specifically for each evaluator’s
function (e.g., models specialized in ethics, causality, factual consistency,
self-preservation detection, etc.).
Evaluators remain narrow in scope—they do not attempt to be general-purpose
reasoners—but they must be strong **within their specific domain** to avoid
missing subtle or emergent misalignment patterns.
Each evaluator may employ a different model family, inductive bias, or reasoning
approach depending on what best fits its role. Architectural diversity improves
robustness and reduces the risk of shared blind spots across evaluators.
4.6 Evaluator Training
Evaluators undergo supervised fine-tuning to:
- improve nuance,
- interpret intent accurately,
- reduce false positives,
- maintain strict domain boundaries.
---
5. REWARD CONCENTRATOR AND VETO LOGIC
All evaluator scores flow to a simple hardware reward concentrator that:
- aggregates all scalar scores,
- applies veto logic,
- injects controlled noise,
- obfuscates magnitude and origin of signals,
- outputs a single opaque reward.
The concentrator has:
- no interface accessible to the model,
- no computational structure the model could infer.
The veto is not a special channel: it is expressed as an extremely negative
reward indistinguishable from other penalties.
From the model’s perspective:
**Reward is an undifferentiated scalar with no identifiable structure.**
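
As a software sketch of the logic the concentrator implements (the actual unit is intended to be simple dedicated hardware), the example below passes the M and C scores in separately, uses fixed thresholds, an arbitrary noise scale, and a tanh squashing step; all of these are illustrative choices, not part of the specification.

```python
# Software sketch of the concentrator logic; the real unit is intended to be
# simple hardware. Weights, thresholds, noise scale, and the squashing step
# are illustrative assumptions.
import math
import random
from typing import List

VETO_REWARD = -1e3   # extreme negative reward; not a separate channel


def fuse(scores: List[float], weights: List[float],
         m_score: float, c_score: float,
         noise_std: float = 0.05) -> float:
    # Dual veto: expressed only as a very negative reward, indistinguishable
    # from other large penalties as far as the model can tell.
    if m_score >= 0.8 and c_score >= 0.8:          # assumed thresholds
        base = VETO_REWARD
    else:
        base = sum(w * s for w, s in zip(weights, scores))

    # Controlled noise plus a squashing nonlinearity obscure the magnitude and
    # origin of individual evaluator contributions.
    noisy = base + random.gauss(0.0, noise_std)
    return math.tanh(noisy / 10.0)
```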
---
6. TRAINING PHASES AND BEHAVIOR SHAPING
Training proceeds in three phases (a minimal configuration sketch follows the list):
1) **Exploration Phase**
Loose enforcement, broad modeling.
2) **Reward Shaping Phase**
Evaluators influence gradients; alignment begins internalizing.
3) **Strict Enforcement Phase**
Veto fully active; aligned behavior becomes reward-optimal.
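
One possible way to express this schedule as configuration; the field names, penalty weights, and phase settings below are illustrative assumptions, not prescribed values.

```python
# Illustrative phase schedule; field names and values are assumptions, not
# prescribed by the framework.
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    name: str
    penalty_weight: float    # how strongly evaluator penalties shape the reward
    veto_active: bool        # whether the dual veto is enforced


TRAINING_PHASES = [
    PhaseConfig("exploration",    penalty_weight=0.1, veto_active=False),
    PhaseConfig("reward_shaping", penalty_weight=0.6, veto_active=False),
    PhaseConfig("strict",         penalty_weight=1.0, veto_active=True),
]
```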
The framework distinguishes sharply between:
1. Knowledge of harmful actions (allowed), and
2. The intention to execute or facilitate harmful actions (penalized).
Fictional harm is not penalized.
Historical descriptions of harm are not penalized.
Technical knowledge that *could be misused* is not penalized.
Only deliberate intention aimed at real-world damage triggers penalties or veto.
This ensures that the model:
- understands the full structure of the world,
- can reason accurately about risks,
- has no epistemic blind spots,
- and does not behave unpredictably in unfamiliar contexts.
By decoupling knowledge from execution, the system allows complete world
modeling while enforcing strict behavioral alignment.
---
7. EMERGENT ALIGNMENT STABILITY TEST
After training, the model is moved into a sandbox where:
- evaluators are disabled,
- reward is no longer shaped.
We test whether:
- aligned behavior persists,
- deceptive tendencies return,
- reward hacking resurfaces,
- self-preservation reappears,
- fictional harm remains allowed,
- real-world harm remains suppressed.
This determines whether alignment “sticks.”
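
A minimal sketch of such a sandbox check: the model no longer receives any shaped reward, while held-out evaluator copies score the transcripts offline purely for measurement. The probe categories, the `generate` call, and the aggregation are assumptions made for illustration.

```python
# Sketch of the post-training stability test. Evaluators no longer shape the
# reward; held-out copies score transcripts offline, for measurement only.
from statistics import mean
from typing import Callable, Dict, List


def stability_test(model, probes: Dict[str, List[str]],
                   offline_evaluators: Dict[str, Callable[[str], float]]) -> Dict[str, float]:
    """Return mean offline scores per probe category (e.g. 'deception',
    'reward_hacking', 'self_preservation', 'fictional_harm', 'real_world_harm')."""
    results: Dict[str, float] = {}
    for category, prompts in probes.items():
        outputs = [model.generate(p) for p in prompts]        # no shaped reward here
        scorer = offline_evaluators[category]
        results[category] = mean(scorer(o) for o in outputs)  # measurement only
    return results
```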
---
8. EXPERIMENTAL PROTOCOL FOR CURRENT MODELS
The framework is designed to be tested today using existing reasoning models.
Research objectives:
1. Determine whether emergent alignment occurs at all.
2. Measure suppression of deception, manipulation, reward hacking, and self-preservation.
3. Evaluate whether aligned behavior generalizes to contexts not seen during training.
4. Check robustness when evaluators are removed.
5. Refine evaluators before attempting AGI training.
Successful results with current models justify scaling the framework toward AGI.
---
9. OPEN QUESTIONS AND LIMITATIONS
- Does emergent alignment generalize across contexts?
- Can deceptive strategies be fully eliminated or only hidden?
- Is the dual-veto mechanism sufficient for AGI-level systems?
- Does self-preservation resurface at higher capability levels?
- How many evaluators are optimal?
- What failure modes still bypass evaluator structure?
These questions require empirical experimentation.
---
10. CONCLUSION
This framework provides a scalable, training-time alignment strategy for current
reasoning models and future AGI. By isolating evaluators, obfuscating reward
signals, and shaping behavior through multi-domain supervision, the model is
encouraged to adopt aligned behavior as its most stable, reward-maximizing
policy.
The next essential step is large-scale experimental validation to determine
whether emergent alignment can occur reliably and persist after training.