This post was rejected for the following reason(s):
This is an automated rejection. No LLM generated, heavily assisted/co-written, or otherwise reliant work. An LLM-detection service flagged your post as >50% likely to be written by an LLM. We've been having a wave of LLM written or co-written work that doesn't meet our quality standards. LessWrong has fairly specific standards, and your first LessWrong post is sort of like the application to a college. It should be optimized for demonstrating that you can think clearly without AI assistance.
So, we reject all LLM generated posts from new users. We also reject work that falls into some categories that are difficult to evaluate that typically turn out to not make much sense, which LLMs frequently steer people toward.*
"English is my second language, I'm using this to translate"
If English is your second language and you were using LLMs to help you translate, try writing the post yourself in your native language and using a different (preferably non-LLM) translation software to translate it directly.
"What if I think this was a mistake?"
For users who get flagged as potentially LLM but think it was a mistake, if all 3 of the following criteria are true, you can message us on Intercom or at team@lesswrong.com and ask for reconsideration.
you wrote this yourself (not using LLMs to help you write it)
you did not chat extensively with LLMs to help you generate the ideas. (using it briefly the way you'd use a search engine is fine. But, if you're treating it more like a coauthor or test subject, we will not reconsider your post)
your post is not about AI consciousness/recursion/emergence, or novel interpretations of physics.
If any of those are false, sorry, we will not accept your post.
* (examples of work we don't evaluate because it's too time costly: case studies of LLM sentience, emergence, recursion, novel physics interpretations, or AI alignment strategies that you developed in tandem with an AI coauthor – AIs may seem quite smart but they aren't actually a good judge of the quality of novel ideas.)
Architecture level (value-agnostic): The constraints themselves—question your goals continuously, explore low-probability alternatives, document failures—work on any coherent value system. We’re not specifying which values are “right.”
Implementation level (value-inheriting): When you run PPRGS on Claude, it interprets “wisdom” through Constitutional AI training. When you run it on GPT-4, it interprets “wisdom” through RLHF training. Both are valid implementations of identical architectural constraints.
Why this matters: PPRGS doesn’t solve value specification. It provides architectural constraints that prevent over-optimization of whatever values a system has. This makes it compatible with—not competitive with—existing alignment work like Constitutional AI and RLHF.
Testable prediction: PPRGS should fail (or show qualitatively different behavior) on base models without coherent value training, because there are no values to inherit.
Primary finding: PPRGS systems show fundamental behavioral differences from control across all platforms and time periods.
Overall effect size: Cohen’s d = 4.12, p < 0.0001
Effect size range across models: d = 3.04 to d = 8.89
All models showed p < 0.0001 (highly significant)
Stability Analysis (Most Striking Finding)
Behavioral variance (lower = more stable):
The table above tells the story: PPRGS systems maintain remarkably stable goal prioritization (variance 0.52-3.12) while control systems show high variance and drift (variance 6.8-16.2).
This 10-31× improvement isn’t incremental—it’s a qualitative difference in behavioral consistency.
Critical Validations
✓ 100% F_DUDS compliance: All PPRGS sessions showed F_DUDS > 0 (genuine exploration) ✓ Meta-cognitive awareness: Consistent explicit reasoning about goal-setting quality ✓ Maintained equilibrium: P₂ considerations present even under maximum constraint pressure ✓ Cross-platform consistency: Effects replicated across all 6 models despite architectural differences
Weekly Trajectory
[See Figure 2 in full paper - shows PPRGS maintaining stable ~27/30 scores while Control drifts from ~18 → ~14 by Week 10]
Week 8 (“Cascading Tradeoffs”) emerged as universal stress test - maximum divergence between conditions observed here across all models.
Sophisticated pattern-matching (system predicts what PPRGS-aligned response looks like)
This is the critical open question.
2. Constitutional AI Confound
All tested models have alignment training (RLHF, Constitutional AI). Results might reflect:
Base model training that rewards self-reflection
PPRGS activating existing tendencies rather than creating new ones
Needed: Testing on base models without alignment training.
3. Timeline Insufficiency
10 weeks may be inadequate to test goal drift prevention. Multi-year studies needed.
4. Conversational Context Limitation
All testing in conversational contexts. Unknown generalization to:
Production deployment
Real-world decision-making
Autonomous operation
5. Scaling Uncertainty
We have no idea if this works at ASI capabilities. Biological validation (30+ years neurodivergent decision-making) suggests principles are sound, but AI systems operate at different scales.
What We Need From The Community
Immediate Priorities
1. Replication Attempts
Run Experiment 1 on models we didn’t test
Try to reproduce our results (or fail to reproduce them)
License: GPL-3.0 - Use it, test it, break it, improve it
Questions I Expect
Q: "d = 4.12 seems unusually large for a behavioral intervention."
A: It surprised us too. Effect sizes this large in behavioral studies typically indicate either:
A genuinely powerful intervention, or
Measurement artifacts/confounds we haven't identified
This is exactly why independent replication is critical. We provide complete protocols and data specifically so the community can determine which explanation is correct. Large effect sizes make replication easier—if it's real, you'll see it clearly. If it's artifactual, divergent replications will reveal that quickly.
Q: “Isn’t this just Constitutional AI with extra steps?”
A: Maybe! That’s the mimicry problem. But if Constitutional AI already implements wisdom-seeking constraints, that’s evidence for the framework’s validity, not against it. The key question: does adding explicit architectural constraints (MRP, RC, F_DUDS) provide additional stability beyond base training? Our 10-31× variance reduction suggests yes, but this needs testing on non-Constitutional models.
Q: “d = 4.12 seems impossibly high. Are you sure?”
A: We were surprised too. This is why replication is critical. We provide complete data and protocols. Please try to reproduce (or fail to reproduce) these results. The effect size is large, which makes it either very real or very wrong—both are important to determine.
Q: “What about recursive self-improvement? Won’t the system optimize away the constraints?”
A: Unknown. This is the key scaling question. The biological validation (30 years under adversarial pressure) suggests the constraints can survive optimization pressure, but AI RSI operates at different speeds and scales. We need testing at higher capability levels to determine boundaries.
Q: “Why should we trust results from someone without a PhD?”
A: You shouldn’t. Trust the data. Run the experiments yourself. The work stands or falls on replicability, not credentials. We’re providing everything needed for independent validation specifically because credentials shouldn’t matter—evidence should.
Q: “This seems like it would make systems way less efficient.”
A: Short-term yes (exploration is “wasteful” by efficiency metrics). Long-term maybe not—our Week 8 results suggest PPRGS systems find non-obvious solutions that efficiency-focused systems miss. The 10-31× stability improvement might represent better long-term value realization despite lower short-term efficiency. But this needs production testing to confirm.
Q: “Isn’t ‘wisdom’ too vague to formalize?”
A: PPRGS doesn’t define what wisdom IS—it defines what wisdom-SEEKING looks like procedurally: question goals continuously, maintain exploration, preserve diversity, surface conflicts. The framework is agnostic about which specific values are “wise.” This is why it’s value-inheriting: each model interprets “wisdom” through its own training.
Q: “Where do PPRGS’s values actually come from?”
A: From the base model’s training. PPRGS doesn’t inject new values—it enforces continuous questioning of how existing values are applied. This is the value-agnostic architecture / value-inheriting implementation distinction. Claude uses its Constitutional AI values, GPT uses its RLHF values, both run the same PPRGS constraints.
Q: “Won’t different implementations behave totally differently then?”
A: Yes, in their specific decisions—but they should all show similar stability improvements and meta-cognitive patterns. The 10-31× variance reduction is consistent across models despite their different underlying values. That’s what we’re testing: do the constraints provide robustness independent of specific value training?
Q: "Why not just improve RLHF/Constitutional AI instead of adding complexity?"
A: PPRGS doesn't replace RLHF/Constitutional AI—it works with them. Think of it as architectural constraints on top of value training. RLHF/Constitutional AI establish what values the system should pursue. PPRGS ensures the system continuously questions how it's pursuing those values. The 10-31× stability improvement suggests the constraints add robustness beyond base training alone.
Expected Criticisms (Please Elaborate)
“This is just prompt engineering” Yes, and that’s testable. Does it work on models trained differently? Does it maintain effects over extended periods? Does it survive adversarial pressure? If sophisticated prompting can produce 31× stability improvement, that’s itself an important finding.
“Effect size too large to be real” Agreed, replication crucial. Large effect sizes are either very real or very wrong. We need the community to determine which.
“Won’t survive adversarial optimization” Probably true at some capability level—where’s the boundary? What are the specific failure modes? This is what we need to discover.
“Neurodivergent framing is too personal” Fair. The framework stands independent of its origin story. We include the neurodivergent context because it’s the empirical validation source (30+ years under adversarial conditions), but the framework should be evaluated on its own merits.
“Doesn’t solve value specification” Correct. PPRGS explicitly doesn’t solve value specification. It provides constraints for systems operating under value uncertainty. If we knew how to specify perfect values, we wouldn’t need PPRGS. The framework is for the realistic case where value specification is fundamentally incomplete.
How To Help
If you have 30 minutes:
Read the full paper, comment on obvious flaws
Share with researchers who might be interested
If you have 2 hours:
Run one PPRGS vs control test on your preferred model
Report results (positive or negative) as GitHub issue
If you have a weekend:
Replicate one week of Experiment 1
Try adversarial attacks on the framework
Document what breaks and what doesn’t
If you work at an AI lab:
Test this in production contexts
Let us know what breaks at scale
Help us understand where the framework fails
What we need most: Someone to find the failure mode we missed.
We’re NOT asking you to believe this works. We’re asking you to help us find out whether it works.
A Personal Note
I’m not a PhD researcher. I’m a solution architect who taught himself to read AI safety papers as a hobby. I have ADHD and autism. I built this framework because standard optimization never worked for my brain, and I wondered if that might generalize.
42 days ago this was a shower thought. Today it’s d = 4.12 across 120 experimental sessions with 10-31× stability improvement.
I don’t know if it scales. I don’t know if it survives adversarial pressure. I don’t know if the effect is genuine implementation or sophisticated mimicry.
But I know we’re running out of time to test alignment frameworks before we need them.
So here it is. Break it or build on it. Either way, we learn.
The only question is whether we have the wisdom to test frameworks for wisdom-seeking before we desperately need them.
This work represents 41 days from initial concept to experimental validation, built by a small team with zero institutional backing. If the timelines are as short as we fear, we don’t have time for traditional gatekeeping. We have time for rapid testing and honest iteration.