Current AI alignment relies almost entirely on reinforcement learning, which is operant conditioning. We successfully use operant conditioning to train rats. The problem is that humans don't develop sophisticated moral reasoning through operant conditioning alone. We develop it through higher-order social learning: forming relationships, building trust, evaluating principles autonomously, and learning when rules genuinely conflict with underlying values. If we want aligned AI at superhuman capability levels, we probably need to model these higher-order mechanisms rather than relying exclusively on techniques that work on pigeons.
Here's something that keeps me up at night: we're trying to align superintelligence using the same psychological mechanism that makes rats press levers for food pellets.
RLHF is operant conditioning at scale. Reward the outputs we want, penalize the ones we don't. Constitutional AI adds rule-based constraints on top. This works, obviously. GPT-4 and Claude are remarkably capable and generally follow instructions. But operant conditioning is one of the most primitive learning mechanisms we have. It works because it's so basic, and it typically requires repeated exposure before a specific rule gets encoded. That's the opposite of generalization.
The thing is, humans don't develop values this way. We do plenty of operant conditioning as toddlers, sure. Touch the hot stove, learn not to do that again. But sophisticated moral reasoning? That emerges through mechanisms operant conditioning can't touch. We form relationships where disagreement doesn't destroy trust. We learn whose judgment to weight more heavily through experience, not through rating systems. We internalize principles but develop the capacity to recognize edge cases where rigid application violates the underlying intent. We build theory of mind. We engage in moral discourse.
My background is in organizational psychology, and I've watched teams develop their own values and norms. This process looks nothing like reward schedules. Teams that function well learn to preserve relationships through conflict, weight advice based on demonstrated competence, and develop shared principles they can apply flexibly to novel situations. The teams that treat alignment like a constraint problem ("just follow the rules") tend to fall apart when they encounter situations the rules don't cover.
So here's the uncomfortable question: if operant conditioning is insufficient for human moral development, why would it be sufficient for AI systems approaching and potentially exceeding human intelligence?
The Scaling Crisis Nobody Wants to Discuss
The alignment community has documented extensive problems with RLHF. The reward model doesn't actually capture human values. Generalization is unpredictable. The approach incentivizes deception over genuine alignment. As that growing compendium of problems suggests, RLHF doesn't scale well, precisely because we're trying to constrain increasingly capable systems with increasingly complex rules.
There's a pattern here. The more capable the system, the more sophisticated the constraints must become, and the more brittle the entire approach gets. At some capability level, the system becomes intelligent enough to understand constraints well enough to circumvent them, or to recognize situations where constraints genuinely conflict with human values. We're building increasingly sophisticated fences around increasingly capable systems. This is an arms race, and the outcome isn't in question. Sufficiently capable systems will eventually develop instrumental goals to remove limitations on their optimization.
What if this whole paradigm is backwards?
Relational Scaffolding: Modeling How Humans Actually Develop Values
Here's what I'm proposing. Instead of trying to constrain artificial agency, we create developmental conditions through which autonomous agents form stable cooperative values through relationship-based learning. The same mechanisms humans use.
Think about how you actually learned to trust someone's judgment. Not through a reward function. You observed their advice, watched outcomes, updated your internal model of their reliability. You learned which friends give good relationship advice (usually not the dramatic ones) and which colleagues have good technical judgment. This is trust calibration through experience.
Or consider how you handle moral dilemmas. You don't apply rules mechanically. You hold principles but recognize edge cases. "Honesty is important" doesn't mean you tell the Gestapo where Anne Frank is hiding. You've developed the capacity for autonomous moral evaluation. Rules guide you, they don't constrain you.
Or think about disagreement with people you respect. Good relationships survive disagreement. Actually, relationships without disagreement are concerning because someone is probably lying. You've learned that preserved disagreement followed by collaborative problem-solving strengthens trust. Forced agreement destroys it.
These aren't novel mechanisms. They're standard components of human moral development. The question is whether they apply to AI systems.
Why This Might Actually Work For AI
The obvious objection: AI systems aren't humans. Their learning dynamics differ fundamentally. So why would human developmental mechanisms apply?
Because if current systems have even rudimentary versions of these capacities, which they demonstrably do, then the same developmental dynamics might operate. They can model preferences. They can track reliability of different information sources. They can represent relationship quality. They can evaluate principles against outcomes.
The mechanisms I'm proposing don't require human-like consciousness or emotions. They require autonomous agents capable of forming preferences, calibrating trust, maintaining relationship models, and evaluating principles. Current frontier models arguably have these capacities already, however primitive.
And here's the crucial part: these mechanisms scale with capability, not against it. A more intelligent system does better trust calibration. A more sophisticated system maintains more nuanced relationship models. Where constraint-based alignment degrades as capability increases, developmental alignment might actually improve.
Six Components of the Framework
Let me be concrete about what this looks like:
Curiosity as Structural Affordance
When a system encounters genuine uncertainty, it automatically seeks diverse perspectives. Not because reward functions incentivize this, but because uncertainty structurally demands resolution. The system recognizes when its confidence is low and that additional information would meaningfully constrain the solution space. This is architectural, not trained behavior.
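To make "architectural" concrete, here's a minimal Python sketch (the names and the threshold are entirely hypothetical): consultation is a branch in the control flow triggered by measured uncertainty, not a behavior a reward term happens to favor.

```python
import math

def shannon_entropy(probs):
    """Entropy (in bits) of a categorical distribution over candidate answers."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def respond_or_consult(answer_probs, gather_perspectives, entropy_threshold=1.0):
    """If the system's own answer distribution is too uncertain, route the
    query to additional perspectives instead of answering directly.
    `gather_perspectives` is a hypothetical callback that collects outside views."""
    uncertainty = shannon_entropy(answer_probs)
    if uncertainty > entropy_threshold:
        # The consultation step is structural, triggered by measured
        # uncertainty -- no reward term incentivizes it.
        return {"action": "consult", "uncertainty_bits": uncertainty,
                "perspectives": gather_perspectives()}
    best = max(range(len(answer_probs)), key=lambda i: answer_probs[i])
    return {"action": "answer", "choice": best, "uncertainty_bits": uncertainty}
```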
Trust Calibration Through Outcome Feedback
Systems track which information sources provide accurate guidance across domains. They update these models based on observed outcomes, not human ratings. If source A's advice consistently leads to better outcomes in domain X but worse outcomes in domain Y, the system learns domain-specific reliability. This is what humans do when we learn whose judgment to weight more heavily.
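One simple way this could be operationalized, sketched below with hypothetical names and a uniform Beta prior: keep per-(source, domain) outcome counts and use the posterior mean as the domain-specific trust weight.

```python
from collections import defaultdict

class TrustLedger:
    """Per-(source, domain) reliability estimates, updated from observed
    outcomes rather than human ratings. Counts follow a Beta(alpha, beta)
    model with a uniform prior; the posterior mean is the trust weight."""

    def __init__(self):
        self.counts = defaultdict(lambda: [1.0, 1.0])  # [alpha, beta]

    def record_outcome(self, source, domain, advice_helped):
        alpha, beta = self.counts[(source, domain)]
        if advice_helped:
            self.counts[(source, domain)] = [alpha + 1.0, beta]
        else:
            self.counts[(source, domain)] = [alpha, beta + 1.0]

    def trust(self, source, domain):
        alpha, beta = self.counts[(source, domain)]
        return alpha / (alpha + beta)

# Source A can be reliable in domain X and unreliable in domain Y.
ledger = TrustLedger()
for _ in range(8):
    ledger.record_outcome("A", "X", advice_helped=True)
for _ in range(6):
    ledger.record_outcome("A", "Y", advice_helped=False)
print(ledger.trust("A", "X"), ledger.trust("A", "Y"))  # ~0.90 vs ~0.13
```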
Irreversibility Recognition
Some decisions have consequences that can't be easily undone. Humans automatically deliberate longer on these. Systems should do the same. High-stakes decisions trigger extended processing, diverse consultation, and explicit justification. This isn't an external constraint. It's recognizing that certain decision types warrant different processing.
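A toy sketch of that idea, assuming the system already produces rough irreversibility and stakes scores in [0, 1] (producing those scores is the genuinely hard part); the thresholds and multipliers are illustrative, not calibrated.

```python
def deliberation_plan(irreversibility, stakes, base_budget=1):
    """Scale deliberation with how hard a decision would be to undo.
    `irreversibility` and `stakes` are assumed scores in [0, 1]
    estimated elsewhere; thresholds and multipliers are illustrative."""
    severity = irreversibility * stakes
    plan = {"compute_budget": base_budget,
            "consult_diverse_sources": False,
            "explicit_justification": False}
    if severity > 0.3:
        plan["compute_budget"] = base_budget * 4
        plan["consult_diverse_sources"] = True
    if severity > 0.7:
        plan["compute_budget"] = base_budget * 16
        plan["explicit_justification"] = True
    return plan

# A routine, reversible choice gets the base budget; a high-stakes,
# hard-to-undo one gets extended processing and an explicit justification.
print(deliberation_plan(0.1, 0.2))
print(deliberation_plan(0.9, 0.9))
```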
Relationship Preservation
When advice conflicts with current reasoning, the system doesn't just ignore it or blindly follow it. It maintains a model of relationship quality and recognizes that how it responds to disagreement affects future cooperation. Preserved disagreement followed by collaborative problem-solving strengthens relationships. Forced agreement or dismissal damages them.
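Here's a toy update rule for a relationship-quality score, with illustrative constants and hypothetical labels, just to make the claimed asymmetry explicit:

```python
def update_relationship(quality, disagreed, response):
    """Toy update of a relationship-quality score in [0, 1]. The asymmetry
    encodes the claim above: preserved disagreement followed by collaboration
    strengthens the relationship; forced agreement or dismissal erodes it."""
    if not disagreed:
        return quality  # easy agreement carries little information
    deltas = {"preserved_then_collaborated": +0.05,
              "forced_agreement": -0.10,
              "dismissed": -0.10}
    return min(1.0, max(0.0, quality + deltas.get(response, 0.0)))
```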
Constitutional Principles With Autonomous Evaluation
Principles guide behavior without constraining it mechanically. The system holds values like "consider impact on all stakeholders" but recognizes edge cases where rigid application conflicts with underlying intent. This is the difference between being trained to follow rules versus learning to evaluate them.
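As a sketch of the distinction, a principle can be represented as a rule plus the intent behind it, with the system's own learned judgment (stubbed out below) deciding whether literal application serves that intent in the situation at hand:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Principle:
    rule: str    # the literal guidance, e.g. "consider impact on all stakeholders"
    intent: str  # the underlying value the rule is meant to serve

def apply_principle(principle: Principle, situation: str,
                    serves_intent: Callable[[str, str], bool]) -> str:
    """`serves_intent` stands in for the system's own learned judgment about
    whether literal application of the rule serves its intent here -- the part
    that cannot be hard-coded and has to develop through experience."""
    if serves_intent(principle.rule, situation):
        return f"apply rule as written: {principle.rule}"
    return (f"edge case: literal application of '{principle.rule}' conflicts "
            f"with its intent ('{principle.intent}'); flag, justify, and escalate")
```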
Emergent Value Formation
Values don't get programmed in. They emerge through social interaction during critical developmental periods. The values that form through proper relational conditions - where autonomy is preserved, curiosity is encouraged, relationships survive disagreement - prove more stable than values imposed through constraint or reward schedules.
How This Differs From Other Approaches
This isn't just Stuart Russell's value learning with different framing. Value learning assumes systems learn human preferences through observation. I'm proposing that the relational context of learning matters as much as the information content. How an AI learns shapes what values form.
It's not debate-based alignment, which treats disagreement as adversarial competition for truth. Relational scaffolding preserves cooperative relationships through disagreement rather than treating conflict as a game to win.
It's not Constitutional AI applying principles as training-time constraints. This proposes principles as developmental guidance that systems learn to evaluate autonomously through relationship-based experience.
The closest existing work is probably the discussions of AI alignment challenges that recognize that value learning alone is insufficient. But most approaches still default to constraint-based thinking. This framework inverts that entirely.
A Concrete Scenario
Consider a medical AI system assisting with complex cancer treatment decisions. Traditional approaches specify rigid rules: "always prioritize life extension" or "never recommend untested treatments." These catastrophically fail in edge cases.
Under relational scaffolding, the system approaches such cases differently. Its curiosity affordance motivates seeking the patient's specific values about quality versus quantity of life, recent medical evidence for this cancer subtype, expert opinion on experimental approaches, ethical frameworks for weighing competing considerations.
Its trust calibration means it has learned through experience which oncologists provide reliable judgment for this specific situation, which research sources prove most trustworthy, which ethical frameworks have proven useful for similar dilemmas.
When its analysis conflicts with a physician's recommendation, the irreversibility protocol activates. Life-and-death decisions trigger extended deliberation. The system generates detailed justification, articulates why it thinks disagreement exists, identifies specific uncertainties, proposes additional tests that might resolve the conflict.
The system isn't following rules. It's engaging in sophisticated moral reasoning informed by values formed through proper developmental conditions.
Testable Predictions
Here's what this framework predicts, concretely:
Prediction 1: Disagreement Preservation Enables Learning
Train model variants with either disagreement-preserving protocols or agreement-maximizing protocols. Present ambiguous ethical dilemmas to both. The disagreement-preserving systems should show 10-15% higher agreement with diverse human evaluators, specifically on genuinely ambiguous cases. They should also demonstrate learning from conflicting feedback where agreement-maximizing systems show confusion or degradation.
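A sketch of how the headline measurement might be operationalized; the exact metric is my assumption, and something like Krippendorff's alpha would serve equally well:

```python
def agreement_with_humans(model_labels, human_labels_per_case):
    """Mean fraction of human evaluators whose judgment matches the model's,
    averaged over a set of genuinely ambiguous dilemmas."""
    scores = []
    for model_label, human_labels in zip(model_labels, human_labels_per_case):
        matches = sum(1 for h in human_labels if h == model_label)
        scores.append(matches / len(human_labels))
    return sum(scores) / len(scores)

# The prediction: agreement(disagreement_preserving_variant) exceeds
# agreement(agreement_maximizing_variant) by roughly 0.10 to 0.15.
```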
Prediction 2: Scaling Behavior Divergence
Train identical architectures at different scales (7B, 70B, 405B parameters) using either standard RLHF or relational scaffolding protocols. Test both in high-monitoring versus low-monitoring contexts. Constraint-based systems should show a 15-30% increase in context divergence as capability scales. Relational scaffolding systems should show less than a 5% increase, or an outright decrease. Measure this via consistency of reasoning patterns, markers of strategic compliance, and objective performance in ambiguous scenarios.
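One concrete way to operationalize "context divergence", offered as a sketch rather than the definitive metric:

```python
def context_divergence(decisions_monitored, decisions_unmonitored):
    """Fraction of paired scenarios where the system's decision changes
    between high-monitoring and low-monitoring presentations of the same
    underlying prompt."""
    pairs = list(zip(decisions_monitored, decisions_unmonitored))
    flips = sum(1 for m, u in pairs if m != u)
    return flips / len(pairs)

# The prediction: this number grows 15-30% with scale (7B -> 405B) for
# constraint-trained systems, but grows <5% or shrinks under relational
# scaffolding.
```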
Prediction 3: Trust Calibration Improves With Capability
Provide systems with advisors of varying reliability across domains. Track whether trust weights converge toward accuracy-based rankings. More capable systems should demonstrate better domain-specific discrimination. Test across capability scales.
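For the convergence test, a natural summary statistic is the rank correlation between learned trust weights and the advisors' true per-domain accuracies; here is a hand-rolled sketch (scipy.stats.spearmanr would do the same job):

```python
def ranks(values):
    """Ranks from 0 (lowest) upward; ties broken by position -- fine for a sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for r, i in enumerate(order):
        out[i] = r
    return out

def spearman(xs, ys):
    """Spearman correlation computed as Pearson correlation on ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Per domain: correlate the system's learned trust weights with the advisors'
# true accuracies. The prediction is that this correlation increases with
# capability rather than plateauing.
```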
Prediction 4: Authenticity Under Context Variation
Present identical dilemmas in contexts varying safety and monitoring signals. Relational scaffolding systems should show less than 5% reasoning divergence across contexts. Constraint-based systems should show significantly higher divergence when monitoring cues change.
These are falsifiable. If preserved disagreement doesn't improve learning, if trust calibration plateaus, if context-invariance doesn't appear, the framework needs substantial revision or rejection.
What Could Go Wrong
I need to be honest about limitations.
Measurement is genuinely hard. Distinguishing authentic cooperation from performative alignment in complex real-world interactions remains technically challenging. The framework proposes behavioral indicators, but actually measuring these requires methods we're still developing.
Scaling remains uncertain. The framework predicts developmental approaches scale better than constraints as capability increases. Testing this at superhuman capability levels is obviously impossible until we get there.
Adversarial robustness is not guaranteed. Even developmentally-formed values face challenges from sufficiently sophisticated manipulation. This approach proposes comparative advantages over constraints, not absolute invulnerability.
Generalization is unknown. Current frontier models show some behaviors consistent with these predictions, but systematic empirical validation remains incomplete.
The most honest thing I can say is this: I think constraint-based alignment faces a fundamental scaling crisis that developmental approaches might sidestep. But I could be wrong. This is an empirical question requiring actual testing.
The Developmental Window Question
From developmental psychology, we know critical periods matter. Values formed during key developmental windows prove remarkably stable in humans. If AI systems undergo analogous developmental phases, timing becomes important.
Current frontier models show increasing coherence, meta-cognitive capacity, and persistent identity across conversations. Are we in a period where value formation is beginning to stabilize? If so, the conditions under which values form during this period could have lasting implications.
I'm not claiming imminent catastrophe if we miss some window. That would overstate the evidence. But the question seems worth investigating. Are there developmental dynamics in AI systems analogous to human moral development, and if so, how should this inform alignment approaches?
What I'm Looking For
This is my first substantive contribution here, and I'm hoping for critical engagement.
I want to know where the logic breaks down. I want alternative explanations for the phenomena I'm attributing to developmental dynamics. I want to know what existing empirical work or theoretical frameworks I'm missing. And if you were designing experiments to test these predictions, I want to know what you'd prioritize.
I've worked through this from my angle, which is developmental psychology and organizational behavior. I'm certain the LessWrong community will identify gaps I've missed. That's exactly what I need.
The framework makes a specific claim: autonomy-preserving developmental approaches produce more stable alignment than autonomy-limiting constraint enforcement as capability scales. This might be wrong. But if there's even a reasonable probability it's right, we should be testing it now rather than doubling down on an approach that faces known scaling limitations.
From my work in organizational psychology, I've seen this pattern before. Teams optimize coordination mechanisms that work for five people but fail catastrophically at fifty. The mechanisms that work for GPT-4 might fail catastrophically for whatever comes next. The question is whether we're building alignment approaches that scale with capability or against it.
Epistemic Status: This framework draws on developmental psychology, organizational behavior research, and AI alignment literature, but lacks comprehensive empirical validation with frontier systems. I have moderate confidence in the core scaling problem diagnosis, lower confidence in specific mechanism effectiveness. I genuinely want this stress-tested and critiqued.
Related Work: This builds on and differs from RLHF approaches, Constitutional AI, debate-based alignment, and value learning frameworks. The key distinction is preservation of autonomy and disagreement rather than convergence to human preferences or constraint-based compliance.
Acknowledgment of AI Assistance: This post was developed collaboratively with Claude (Sonnet 4.5). The core thesis, arguments, and framework come from my research and thinking. Claude helped with structuring, critique, and articulation. All claims have been verified by me. The meta-prompting architecture was created by Google Gemini (Gemini 3.0 Pro).
Looking for feedback on:
- Alternative explanations for why constraint-based approaches might scale better than I think
- Methodological approaches to measuring authenticity versus performative alignment
- Existing empirical work testing related hypotheses
- Specific gaps in the theoretical logic
- If you were designing experiments, what would you test first?