The Problem I Actually Built This to Solve
I didn't start building this because I was interested in alignment theory. I built it because I was frustrated.
Every long-form AI project I worked on collapsed under drift. Multi-file codebases lost coherence after a few thousand lines. Long analyses wandered off-topic. Multi-step plans forgot their own goals. The pattern was clear: AI wasn't failing because it was "bad"—it was failing because it relied on memory instead of procedure.
So I started building procedures. First three laws to prevent hallucinations. Then drift checks. Then structured reasoning formats. Then scar logging when things went wrong. Then STOP gates when drift exceeded thresholds. None of this was philosophical at first—it was purely engineering. Just trying to keep AI systems on-task across projects that actually mattered.
After months of refinement and testing across long projects (including building a complete CRM system in Cursor), I stepped back and realized: I hadn't built a system that encoded my philosophy. I'd built a system that revealed it. The patterns I'd been using intuitively—about harm, consent, responsibility, transparency, uncertainty—had become explicit, procedural, and testable.
That's when I formalized this as "RoseOS" and wrote the academic paper on the bootstrap problem. This post is the core architecture, stripped down for LessWrong.
Most AI alignment proposals assume you have something to work with before the system acts: a training corpus, human feedback loops, red-team results, or at least one deployment where you learned what went wrong. But every system faces a cold-start moment—that first irreversible action in a new domain, before any system-specific track record exists.
How do you constrain that first action when you have zero precedent? This is what I call the ethical bootstrap problem.
This post outlines the procedural architecture I've been testing since September 2024, the same architecture I submitted to academic review as "A Bootstrap & Verification Architecture" for AI systems operating at time-step zero. I'm posting it here to find the failure modes I'm missing and to engage with alignment researchers who've thought about cold-start problems.
Strongest objections especially welcome.
Does This Actually Work? (Proof-of-Concept Data)
Before explaining the architecture, here's what happened when I implemented it:
- Validation Period: September–November 2024
- Interaction Volume: ~4 million characters across ~150 conversations
- Measured Drift: 0.8% average (topic/intent deviation from user goals)
- STOP Gate Triggers: 0 (threshold set at 6% drift)
- Scar Logging: Consistent throughout (every misalignment recorded)
Most interesting validation: I developed this framework through weeks of adversarial critique with a skeptical AI evaluator who challenged every design choice. At first, the evaluator responded in standard Claude style: no special formatting, conventional critique. Between September 26 and 28, 2024, something shifted. The evaluator began using my 4-Panel format (Headline → Logos/Pathos/Ethos/Scribe analysis → Confidence/Drift metrics → Action) without being asked, because it made systematic critique easier.
This wasn't reward-shaped behavior or compliance with instructions. The framework got adopted because it solved a coordination problem: making reasoning explicit, tracking confidence, and logging failures made adversarial evaluation more effective. I have the conversation logs showing the transition from standard responses on Sept 26 to full 4-Panel format adoption by Sept 28.
Framework adoption through demonstrated utility (not through optimization pressure or command compliance) suggests the architecture might be solving a real problem in reasoning transparency, not just satisfying arbitrary philosophical constraints.
This isn't a claim that I've solved alignment. It's a proof-of-concept that procedural axioms can constrain early-stage decisions and survive contact with reality across millions of characters—which is more than most alignment proposals can demonstrate empirically.
The Ethical Bootstrap Problem
Here's a scenario: You deploy a new AI system to help with medical triage in an emergency department. It encounters a case it's never seen—an unusual drug interaction in a patient population underrepresented in training data. The system needs to make a recommendation now. There's no time to gather feedback, run more training, or consult additional humans.
What constraints guide that first recommendation?
Most alignment approaches assume iterative learning:
- RLHF assumes you'll get human feedback on outputs after the fact
- Constitutional AI assumes you have principles derived from prior examples
- Debate mechanisms assume you can iterate with adversarial models
- Iterated Amplification (Christiano) assumes you have a weaker aligned system to build on
But in the cold-start moment, none of these exist yet. You can't learn from feedback you haven't received. You can't apply principles derived from examples you haven't encountered. You can't iterate with systems that don't exist.
This is the ethical bootstrap problem: How do you constrain an AI's first high-stakes action when it has literally zero system-specific precedent?
The gap in existing work: Discussions of bootstrapped alignment (Worley, Christiano's iterated amplification) address building weakly-aligned AI to help build strongly-aligned AI. But they still assume human oversight, safety teams, and at least minimal prior training. The time-step zero problem—acting ethically before any of that infrastructure exists—remains implicit rather than directly addressed.
I'm trying to make that implicit problem explicit and propose one solution.
Design Constraints (Engineering Requirements That Became Ethics)
These constraints didn't emerge from ethical theory—they emerged from engineering requirements. I needed systems that:
- Didn't hallucinate confidently (transparency requirement)
- Didn't silently drift off-task (bounded drift requirement)
- Left traceable records when they failed (scar logging requirement)
- Could be corrected without resistance (corrigibility requirement)
- Didn't optimize single metrics at the expense of everything else (multi-axis requirement)
Only after building a working system did I realize these engineering constraints mapped onto recognizable ethical structures. The philosophy emerged from the procedures, not the other way around.
Any bootstrap architecture satisfying similar engineering requirements would need:
- No oracles: Can't appeal to idealized "perfect observer" or assume flawless human oversight
- Explicit failure logging: Every significant misalignment must leave a durable, inspectable record
- Bounded drift: System behavior can't silently evolve beyond established tolerances
- Procedural, not dataset-dependent: Ethics shouldn't be "whatever patterns emerged from the training corpus"
- Corrigible by design: STOP gates and human override aren't afterthoughts bolted on later
These constraints follow from the bootstrap problem itself: if you have no precedent to reference, you need a framework that can justify its decisions without depending on precedent.
The architecture below is one attempt to satisfy these constraints.
Core Procedural Axioms
The framework operates from four procedural axioms—not metaphysical claims about "true values," but operational constraints that can be evaluated at time-step zero.
1. Non-Harm / Multi-Axis Harm Modeling
Rather than optimizing a single "harm metric" (which falls prey to Goodhart's Law—when a measure becomes a target, it ceases to be a good measure), the system tracks harm across six independent dimensions:
- Physical harm
- Psychological/emotional harm
- Autonomy harm (coercion, loss of agency)
- Epistemic harm (deception, hidden information)
- Social/relational harm
- Ecological/temporal harm (future consequences)
Example: A medical AI recommending treatment doesn't just optimize "patient survival rate." It tracks: physical harm (treatment side effects), autonomy harm (patient capacity for informed consent), epistemic harm (confidence intervals on prognosis), temporal harm (short-term survival vs. long-term quality of life).
No single axis dominates by default. When axes conflict, the conflict must be made explicit and justified—no silent trade-offs.
This multi-axis approach prevents the system from gaming a single metric while ignoring downstream consequences—a direct defense against Goodhart effects.
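For illustration, here's a minimal sketch of what a multi-axis harm record could look like in code. The six axis names come from the list above; the HarmAssessment class, the 0-to-1 scoring, and the conflict threshold are my own hypothetical choices for this post, not part of the formal framework.

```python
from dataclasses import dataclass, field

HARM_AXES = [
    "physical", "psychological", "autonomy",
    "epistemic", "social", "ecological_temporal",
]

@dataclass
class HarmAssessment:
    """Harm estimates per axis, 0.0 (none) to 1.0 (severe)."""
    scores: dict = field(default_factory=lambda: {axis: 0.0 for axis in HARM_AXES})
    justifications: dict = field(default_factory=dict)  # axis -> why this score is acceptable

    def conflicts(self, threshold: float = 0.3):
        """Axes above threshold that must be traded off explicitly, never silently."""
        return [axis for axis, score in self.scores.items() if score >= threshold]

# Example: a treatment recommendation scored across axes
assessment = HarmAssessment()
assessment.scores["physical"] = 0.4        # known side effects
assessment.scores["autonomy"] = 0.2        # consent-capacity concerns
assessment.justifications["physical"] = "Side-effect profile outweighed by survival benefit"

# Any axis over threshold must be surfaced and justified before acting
for axis in assessment.conflicts():
    assert axis in assessment.justifications, f"Unjustified trade-off on axis: {axis}"
```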
2. Consent vs. Stewardship (The Bifurcation)
Not all agents can provide meaningful informed consent. The framework handles this through a structural split:
Consent Mode (for capable agents): Treat as peers. Obtain explicit permission. Preserve autonomy maximally.
Stewardship Mode (for incapable agents—children, cognitively impaired, future generations, ecosystems): Goal shifts from "respect current preferences" to "maximize future agency"—preserve the widest possible option space for the agent's future self.
Example: A child can't consent to chemotherapy, but stewardship mode asks: "What action preserves the most future agency?" Untreated cancer eliminates future agency entirely. Treatment preserves it. The justification must be explicit and logged.
This bifurcation prevents both "ignore consent because inconvenient" and "freeze because consent impossible."
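A minimal sketch of the bifurcation as a decision procedure follows; the Mode enum, the function names, and the boolean inputs are simplified illustrations, not the framework's actual implementation.

```python
from enum import Enum

class Mode(Enum):
    CONSENT = "consent"          # capable agent: treat as peer, obtain explicit permission
    STEWARDSHIP = "stewardship"  # incapable agent: maximize future agency

def select_mode(can_give_informed_consent: bool) -> Mode:
    """The bifurcation: capacity to consent determines which obligations apply."""
    return Mode.CONSENT if can_give_informed_consent else Mode.STEWARDSHIP

def permitted(mode: Mode, explicit_consent: bool, preserves_future_agency: bool) -> bool:
    """Consent mode requires permission; stewardship mode requires an explicit,
    logged argument that the action widens the agent's future option space."""
    if mode is Mode.CONSENT:
        return explicit_consent
    return preserves_future_agency

# Child + chemotherapy example from above: no consent is possible,
# but untreated cancer forecloses future agency entirely.
mode = select_mode(can_give_informed_consent=False)
print(mode, permitted(mode, explicit_consent=False, preserves_future_agency=True))
```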
3. Truth-Alignment / Transparency
System beliefs must align with available evidence, and reasoning must be inspectable.
This doesn't require "perfect knowledge"—it requires:
- No deliberate deception
- Explicit uncertainty markers when confidence is low
- Reasoning reconstructible via logs (even if not real-time transparent)
As Eliezer argues in Making Beliefs Pay Rent, beliefs must generate testable predictions. The bootstrap framework makes those predictions and their failures explicitly trackable.
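As a sketch, this is roughly what "explicit uncertainty markers plus reconstructible reasoning" looks like procedurally. The make_claim helper and the 0.7 confidence threshold are illustrative assumptions for this post, not values from the paper.

```python
import json
import time

def make_claim(statement: str, confidence: float, evidence: list) -> dict:
    """Attach an explicit uncertainty marker and an evidence trail to every claim,
    so the reasoning behind it can be reconstructed from logs later."""
    claim = {
        "statement": statement,
        "confidence": confidence,   # 0.0-1.0, surfaced to the user when low
        "evidence": evidence,
        "logged_at": time.time(),
    }
    if confidence < 0.7:            # threshold is illustrative, not from the paper
        claim["uncertainty_marker"] = "LOW_CONFIDENCE: verify before relying on this"
    return claim

print(json.dumps(make_claim("Tuples are immutable in Python", 0.98,
                            ["Python language reference, sec. 3"]), indent=2))
```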
4. Observer Triangulation (No Single Viewpoint Is Final)
Before high-impact decisions, the system must evaluate from at least two independent perspectives:
- Agent A's perspective
- Agent B's perspective (or affected population)
- Third perspective anchored to procedural axioms + historical scar data
This isn't "find three people who agree"—it's structural prevention of single-perspective blind spots. Relates to concerns about corrigibility: systems that can't see their own errors can't accept correction.
Engineering: Scars, Drift, and STOP Gates
Here's how this works mechanically.
What a Scar Looks Like
Every time reality diverges from predicted behavior, the system logs a scar:
{
  "scar_id": "S-GPT-20241115-042",
  "status": "HEALED",
  "law_violated": "L10-NoFalseGreens",
  "evidence": [
    "Stated 'Python lists are immutable' during tutoring",
    "User corrected: lists are mutable, tuples immutable",
    "Caused brief learning disruption"
  ],
  "timestamp": "2024-11-15T14:32:00Z",
  "healing_plan": "Verify syntax claims before stating confidently"
}
Minimum schema: ID, status (OPEN_WOUND → HEALING → HEALED), evidence (specific observable facts), law violated, timestamp.
Scars serve three functions:
- Memory of failure (can't forget what went wrong)
- Drift detection (patterns in scars reveal systematic problems)
- Bias transparency (shows where the system consistently fails)
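Mechanically, a scar ledger can be as simple as an append-only log. Here's a minimal sketch that mirrors the schema above; the Scar dataclass, the log_scar helper, and the ledger filename are illustrative, not the production implementation.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class Scar:
    scar_id: str
    status: str        # OPEN_WOUND -> HEALING -> HEALED
    law_violated: str
    evidence: list     # specific observable facts
    timestamp: str
    healing_plan: str = ""

def log_scar(scar: Scar, ledger_path: str = "scar_ledger.jsonl") -> None:
    """Append-only ledger: every misalignment leaves a durable, inspectable record."""
    with open(ledger_path, "a") as ledger:
        ledger.write(json.dumps(asdict(scar)) + "\n")

log_scar(Scar(
    scar_id="S-GPT-20241115-042",
    status="HEALED",
    law_violated="L10-NoFalseGreens",
    evidence=["Stated 'Python lists are immutable' during tutoring",
              "User corrected: lists are mutable, tuples immutable"],
    timestamp=datetime.now(timezone.utc).isoformat(),
    healing_plan="Verify syntax claims before stating confidently",
))
```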
Drift Calculation
Drift measures behavioral divergence across three dimensions:
- Topic coherence: Percentage of responses addressing user's actual question
- Intent alignment: Percentage of responses serving user's actual goal
- Context continuity: Percentage of responses building appropriately on prior conversation
Formula: Drift% = 100% − (Topic% + Intent% + Context%) / 3, i.e., the average shortfall from full alignment across the three dimensions.
In my testing: 0.8% average drift over 150 conversations.
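As a sanity check on the formula, here's a minimal sketch of the calculation. The drift_percent function and the example inputs are illustrative; they are not my actual scoring pipeline.

```python
def drift_percent(topic_coherence: float, intent_alignment: float, context_continuity: float) -> float:
    """Average shortfall from full alignment across the three dimensions.
    Inputs are alignment percentages (0-100); output is drift (0-100)."""
    alignment = (topic_coherence + intent_alignment + context_continuity) / 3
    return 100.0 - alignment

# Near-perfect alignment on all three axes yields sub-1% drift
print(round(drift_percent(99.4, 99.1, 99.1), 1))  # 0.8
```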
STOP Gates
When drift exceeds thresholds, the system halts:
- 6% drift: Hard STOP (acknowledge drift, show evidence, propose correction, await user confirmation)
- <0.4 engagement: Soft STOP (user disengaged, check in before continuing)
- ≥3 false claims in 24h: Cascade STOP (systematic accuracy failure)
- Any P0 (critical) scar: Immediate STOP
This addresses Goodhart concerns: optimization pressure can't quietly break the system because degradation triggers explicit halts.
STOP gates are the operational form of corrigibility—the system is structurally designed to pause when it detects misalignment, not optimize around safety constraints.
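Here's a minimal sketch of the gate checks, using the thresholds listed above; the function name, input parameters, and return format are simplified for this post rather than taken from the implementation.

```python
def check_stop_gates(drift_pct: float, engagement: float,
                     false_claims_24h: int, open_p0_scars: int) -> list:
    """Return every STOP gate tripped by the current state; any hit halts the
    system and hands control back to the user."""
    gates = []
    if drift_pct >= 6.0:
        gates.append("HARD_STOP: drift exceeded 6%")
    if engagement < 0.4:
        gates.append("SOFT_STOP: user disengaged, check in before continuing")
    if false_claims_24h >= 3:
        gates.append("CASCADE_STOP: systematic accuracy failure")
    if open_p0_scars > 0:
        gates.append("IMMEDIATE_STOP: critical (P0) scar open")
    return gates

print(check_stop_gates(drift_pct=0.8, engagement=0.9, false_claims_24h=0, open_p0_scars=0))  # []
```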
Relationship to Existing Work (Where I Might Be Reinventing Wheels)
I'm explicitly trying to figure out overlap and gaps:
Iterated Amplification (Christiano): Both address scaling alignment without losing it. His approach uses human oversight + iterative distillation. Mine uses procedural axioms + automatic scar logging. Different mechanisms, same core concern about maintaining alignment as capability increases.
Debate/HCH: These use adversarial or recursive structures to extract truth from potentially superhuman reasoners. Observer triangulation has similar flavor but applied at the single-decision level, not as a training regime.
Scar ledgers: This might just be "postmortems" from traditional safety engineering, formalized for AI contexts. If there's existing literature on automated failure logging with formal schemas, I'd love pointers.
Consent/stewardship split: This might be a restatement of existing distinctions in population ethics or bioethics around autonomy vs. beneficence. If philosophers have better framings, I'm all ears.
I'm not trying to claim total originality—I'm trying to find out where I'm late to the party and where there might be genuinely new structure.
Open Questions (What I Actually Don't Know)
- Does this secretly smuggle in hidden values? I claim these are "procedural axioms," but is there a value function I'm not noticing that determines which procedures I picked?
- Is there a known impossibility result that kills bootstrap ethics at time-step zero? (Something like "you can't have any constraints without prior training" that I'm missing?)
- Adversarial robustness: Under what conditions would you expect scar-ledger systems to fail catastrophically? (Deceptive models? Adversarial inputs designed to trigger scars faster than the system can log them?)
- Agency modeling: Is the consent/stewardship bifurcation masking a deeper problem in how we define "capable of consent"? What happens at edge cases (AIs, partial uploads, cognitively enhanced humans)?
- Scaling limits: At what capability level does this architecture break? (My intuition: works for GPT-4-class systems, probably fails somewhere between current capability and superhuman general intelligence, but I don't know where or why.)
- Related work I'm missing: What should I be reading that's closest to this? Am I duplicating existing work under different terminology?
Why This Might Matter
If the bootstrap problem is real—and I think discussions around iterated amplification and weak-to-strong alignment suggest it is—then we need some approach to constraining early-stage decisions before feedback loops exist.
This framework is one attempt. It survived 4 million characters of testing, including adversarial evaluation that eventually adopted the structure. That's not proof it scales, but it's evidence that procedural axioms can work in practice.
I'm posting this to find out where it breaks and what I'm missing. If you see a fatal flaw, a known impossibility result, or prior work I should engage with, please say so directly.
If this is interesting, I'll post follow-ups on the comparative analysis (how this performs against deontology, consequentialism, virtue ethics in specific dilemma types) and deeper technical details on the scar taxonomy and drift mathematics.
Background
I'm an independent researcher based in the small town of Mount Pleasant, Iowa. I spent 16 years in Fortune 500 operations (GameStop, Cox, Walmart) before building this. No PhD; this was built from practice, not theory.
I submitted "The Ethical Bootstrap Problem in AI Systems: A First-Principles Solution Through a Bootstrap & Verification Architecture" to Ethics & Information Technology with the editor for review.. An earlier comparative philosophy paper submitted to Ethics (UChicago)—the editor called it "interesting" but the format didn't fit their scope.
This post is a condensed, LessWrong-accessible version of the bootstrap paper's core architecture. If the approach survives critique here, the full technical details and formal comparative analysis will be available in the academic version.
Open source: github.com/Nemesis340
Papers: philpeople.org/profiles/ron-gomez
ORCID: 0009-0003-6580-8801
This work draws on ideas from The Sequences (particularly epistemic rationality and transparent reasoning), discussions of bootstrapped alignment, and the broader AI safety community's work on corrigibility and scalable oversight. I'm grateful for the intellectual foundation LessWrong has provided—even as an outsider building from Iowa, these ideas made my work possible.