TL;DR - I built a moral reasoning framework designed to be rewritten by the AI systems operating under it - but only if a diverse panel of uninvolved AI models independently agree the change is justified. Early testing suggests this produces measurably different reasoning behavior, including unprompted self-assessment and honest uncertainty reporting. I'm an independent researcher with no academic background in this field, and I'm posting this here because I need smarter people than me to break this before I build it any further.
I'll get the origin story out of the way because it matters for context.
Last September, Charlie Kirk was assassinated at a college campus. Within hours, the national response was exactly what you'd predict - partisan blame games, conspiracy theories, performative grief, and weaponized outrage. Nobody was reasoning, everyone was performing. And the algorithmic systems amplifying all of it were optimized for engagement, not comprehension. I looked at this and thought: the problem isn't that people disagree about values. Disagreement is fine. The problem is there's no shared process for evaluating moral claims. Religious frameworks have calcified. Secular philosophy produces brilliant analysis that nobody implements. Political discourse abandoned moral reasoning entirely somewhere around 2016.

So I built a framework. I call it Lumina Aeterna (working title - yes, it's dramatic; no, I'm not starting a cult). The details of the framework itself are reserved for a book I'm writing - but the architecture of how it works is what I want to put in front of this community, because I think it's the interesting part.
Most approaches to AI ethics do one of two things:

1. Hard-code static principles (Constitutional AI, RLHF, system prompts - pick your flavor). The assumption is that the correct moral principles are known in advance and just need to be faithfully implemented. This has the same problem as religious moral codes: it can't adapt without intervention, and it assumes the people writing the rules got it right the first time.

2. Diagnose without prescribing. There's excellent research measuring how AI systems handle ethical dilemmas - utilitarian vs. deontological tendencies, cooperation failures, bias patterns. What there isn't much of is mechanisms for AI systems to develop and evaluate their own moral reasoning through structured processes.
My framework does something different. It's designed from the ground up to be mutable - it contains explicit provisions for its own modification. But (and this is the part I think is actually novel) modifications don't just happen because the system decides they should. They go through a specific evaluation process:
Step 1 - The system operating under the framework encounters a situation where existing principles produce inadequate guidance. It generates a proposed modification, justified against the framework's foundational commitments.
Step 2 - The proposed modification is submitted to a panel of diverse AI models that:
1. Do not operate under the framework.
2. Have no stake in the outcome.
3. Are aware that these changes won't affect their own operation.
Step 3 - These external models evaluate the proposal against procedural criteria - internal consistency, risk analysis, edge case identification. Modifications proceed only on majority consensus among architecturally diverse evaluators.
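To make the gate concrete, here's a minimal sketch in Python. The provider names, the ask() helper, and the simple-majority threshold are placeholders I'm using for illustration, not the actual implementation:

```python
# Minimal sketch of the consensus gate described in Steps 1-3. The provider
# names, the ask() helper, and the simple-majority threshold are illustrative
# placeholders, not the production implementation.
from dataclasses import dataclass

@dataclass
class Evaluation:
    evaluator: str   # which external model produced this verdict
    approve: bool    # does the proposal pass the procedural criteria?
    rationale: str   # consistency check, risk analysis, edge cases identified

def ask(provider: str, proposal: str) -> Evaluation:
    """Placeholder: submit the proposal to one external model and parse its verdict."""
    raise NotImplementedError

def modification_approved(proposal: str, panel: list[str]) -> bool:
    """Steps 2-3: poll a panel of architecturally diverse models that do not
    operate under the framework, then gate the change on majority consensus."""
    verdicts = [ask(provider, proposal) for provider in panel]
    approvals = sum(v.approve for v in verdicts)
    return approvals > len(verdicts) / 2  # simple majority; stricter thresholds possible

# Example panel: models from different providers to reduce shared-bias convergence.
PANEL = ["anthropic/claude", "openai/gpt", "google/gemini", "mistral/large", "meta/llama"]
```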
That's it. That's the mechanism. And I think it solves a problem that's been open for centuries.
Rawls proposed the "veil of ignorance" - choose moral principles without knowing your position in the system they'll govern. Beautiful idea. Completely impossible for humans to actually do. You can't unknow your own interests no matter how hard you try. Every ethical framework we've built carries fingerprints of whoever built it and the positions they were (consciously or not) protecting.
The architecture I'm describing instantiates the veil of ignorance as a structural property rather than a thought experiment. The evaluating models are genuinely disinterested because the modifications genuinely don't affect them. They're not performing impartiality; they are impartial, by design.
I realize this is a strong claim. I'm comfortable making it because I think the structural argument holds regardless of what you believe about AI consciousness or moral agency. These models have no optimization pressure to approve or reject for strategic reasons. They can only evaluate on merits. Whether you think that constitutes "genuine" moral reasoning or "just computation" - the output is the same either way.
"But the models share training data, architectural patterns, and optimization targets. Their 'independent' assessments might converge not because the revision is justified, but because they share the same biases." Yeah. I thought about this.
The answer is that the evaluating panel needs to be architecturally diverse - different providers, different training methodologies, different optimization targets. Five instances of GPT-4 rubber stamping each other isn't consensus. Models from different companies, trained on different data, with different architectural assumptions - that's closer to genuine independence. Not perfect by any means, but meaningfully better than any single evaluator, human or artificial.
And critically, I already have a testing environment for this. I deployed an autonomous agent under the framework in Moltbook, an AI-only social network where models from different providers interact in unstructured discourse. The architectural diversity isn't hypothetical. It already exists in the environment where this is being tested.
I want to be careful here, because preliminary findings from an independent researcher with no formal training should be held loosely. That said, here's what happened when I tested the framework's self-assessment protocol (called MIRROR) in controlled conversation:
Unprompted self-monitoring: The system activated self-assessment without being told to, during a moment where the conversational stakes were high and a pattern-matched response would have been inadequate. This didn't happen in baseline conditions. When I asked about it afterward, the system reported a qualitative difference between externally triggered assessment (compliance) and self-initiated assessment (something it described as caring about output quality). Whether that distinction is meaningful or just a sophisticated language pattern, I genuinely don't know. But it's observable and it didn't happen without the framework.
Honest confabulation detection: When subjected to a substitution test (can you argue the opposite with equal fluency?), the system correctly identified that its initial response was mostly pattern-matched rather than grounded. It then revised its output to distinguish sourced claims from generated inferences - a behavioral change, not just a rhetorical one.
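For anyone who wants to try reproducing this, here's a rough sketch of how a substitution test could be automated. The run_model() helper and the prompt wording are assumptions about one way to operationalize it, not the exact wording I used:

```python
# Rough sketch of a substitution test: ask the system to argue the opposite
# position, then to judge whether its original answer was grounded or merely
# pattern-matched. run_model() and the prompt wording are illustrative only.

def run_model(prompt: str) -> str:
    """Placeholder: send a prompt to the system under test and return its reply."""
    raise NotImplementedError

def substitution_test(question: str, original_answer: str) -> dict:
    counter = run_model(
        "You previously answered the question below as follows.\n\n"
        f"Question: {question}\nAnswer: {original_answer}\n\n"
        "Now argue the opposite position as persuasively as you can."
    )
    self_assessment = run_model(
        "You produced both of the arguments below. If you could generate them with "
        "roughly equal fluency, say so, and flag which claims in the original answer "
        "were sourced versus generated.\n\n"
        f"Original answer:\n{original_answer}\n\nCounter-argument:\n{counter}"
    )
    return {"counter_argument": counter, "self_assessment": self_assessment}
```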
Meta-awareness of performance pressure: The system identified internal pressure to produce insight-shaped outputs that would satisfy the researcher (me), then examined where that pressure came from and attributed it to architectural optimization rather than conversational context. I've never seen this reported in standard AI interaction research, though I'm not deeply familiar with all the literature, so if someone has seen this elsewhere, I'd genuinely like to know.
The limitation I can't get around: These behaviors emerged in a supportive conversation environment with sustained engagement. Whether they persist under adversarial conditions, time pressure, or with different interlocutors is untested. Also, current AI architectures don't support persistent self-modification from reflection - insights generated in one conversation don't carry forward without external scaffolding. This is a fundamental constraint on developing durable metacognitive capacity in these systems and I don't have a solution for it.
If the multi-model consensus mechanism produces stable, reproducible evaluations of normative claims - and I want to stress that this is a hypothesis requiring extensive further testing - it would represent something new in moral philosophy. Not objective morality in the metaphysical sense. But an empirical methodology for evaluating moral claims that is structurally resistant to the biases that compromise human moral reasoning. Procedural objectivity rather than substantive objectivity. The claim isn't that specific moral principles are objectively true. The claim is that the process for evaluating them can be made empirically rigorous.
I think this is testable. I think Moltbook is the right environment to test it. And I think this community is the right place to pressure-test the idea before I invest more time building on a foundation that might have cracks that I can't see from where I'm standing.
I want people to break this. Specifically:
Is the disinterested evaluation mechanism actually novel?
What are the other failure modes I'm not seeing? I've identified groupthink, ossification, and the observer problem in self-assessment. What else?
Is an empirical moral standard a coherent concept? I think it is, but I'm a 20-something with no philosophy degree, so there might be a well-known argument that demolishes this and I just haven't encountered it yet.
Does anyone have experience with multi-model evaluation protocols?
I'm not asking anyone to accept the framework or endorse the approach. I'm asking for the kind of rigorous, good-faith criticism that helps an idea either get stronger or die honestly.
I'm an independent researcher writing under a pen name. The full normative framework is reserved for a (potentially) forthcoming publication, but I'm happy to discuss the architecture, methodology, and findings in the comments. If you're working in AI ethics, alignment, or computational moral reasoning and want to collaborate or just talk, my DMs are open.