What if harm avoidance were structural rather than learned?
*This post was written with Claude (Anthropic) as cognitive scaffolding. I have ADHD and use AI assistance as a prosthetic for written communication. The ideas, experimental design, and architectural decisions are mine; the prose clarity is collaborative.*
My own take on problems with alignment attempts
Every safety technique I am aware of has a similar structure: train a model, test whether it avoids harm, patch where it fails. RLHF teaches associations between harm-expressing text and a negative training signal. Constitutional AI applies principles as a filter on outputs. Red-teaming finds the surface patterns that break and adds them to training. All of these are empirical — you run the system and observe. There seems to be an assumption that training the correct responses leads to alignment, which does not take into account the ever-changing nature of the problem itself.
None of them let you look at the information flow and say: *this system will not, by design, select an action without harm evaluation that is structurally insulated from goal pressure.* That property — an architectural reason for behaviour — does not exist in current approaches. It is what I have been thinking about, describing, and gathering evidence for over recent years, in spare moments when work finishes early enough or before my household wakes up.
The reason trained harm avoidance fails under adversarial pressure is precisely that it is trained. More specifically: every one of these techniques operates on outputs. None of them modify how information flows inside the system — how signals are routed, which error signals reach which modules, how precision is allocated by signal type. As far as I can see, they treat the model as a black box and shape its surface. The "harm direction" in an LLM's latent space is a correlation with human-expressed harm-talk, shaped by gradient descent on output evaluations. It is a statistical artifact. It can be jailbroken because it is a surface pattern, not a modeled consequence of action in a world. The closest current work to internal-routing approaches is mechanistic interpretability and activation steering — but even this starts from a monolithic architecture and tries to identify post hoc a signal that was never structurally separated in the first place.
The cognifold
REE (Recursive Ethical Engine) is built around what I am calling a cognifold: a constructed latent space with a defined set of sensory signal types, in which further signal types are specifically distilled from prior signal types to build the discrimination the next level requires. Critically, the cognifold is not just any latent cascade: it requires the right sensory primitives at the base. Nociceptive input is architecturally necessary: without distinct sensory signals wired as harm signals from their entry into the cognifold, there are no harm streams to distill from other inputs. I found early in testing that one nociceptive type was not enough to distill the information required; I needed two informationally independent nociceptive streams (described in more detail later). This hit home because I had always wondered why there were two pain-signalling systems, which I had learned about in college, and here was a clear computational reason. Building on this, the architecture has mandatory input types, some hypothesised and some necessitated by evidence, which build specific information streams on top of mandatory internal structure. That structure includes specific error and precision routing for certain signals. A single training objective cannot collapse them.
The principle: each layer of discrimination can only exist because the prior layer built the right substrate. You cannot have a genuine harm signal without a genuine world model. You cannot have a genuine world model without grounded perception built from prediction error. The architecture enforces this order as a prerequisite, not a convention.
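A minimal sketch of this layered distillation follows. The module names, dimensions, and toy encoder are entirely my own illustration, not the ree-v3 implementation; the point is the ordering constraint, not the arithmetic.

```python
import numpy as np

# Toy illustration of the cognifold's layered distillation (hypothetical
# names and sizes). Each layer can only read latents produced below it,
# so the harm latents literally cannot be computed before z_world exists.

rng = np.random.default_rng(0)

def encode(x, w):
    """Toy encoder: a fixed random projection followed by tanh."""
    return np.tanh(x @ w)

# Typed sensory primitives entering the cognifold.
sensory = {
    "exteroceptive": rng.normal(size=16),
    "proprioceptive": rng.normal(size=8),
    "nociceptive_fast": rng.normal(size=4),   # feeds z_harm_s
    "nociceptive_slow": rng.normal(size=4),   # feeds z_harm_a
}

# Layer 1: grounded perception, built only from non-nociceptive senses.
w_world = rng.normal(size=(24, 12))
z_world = encode(np.concatenate([sensory["exteroceptive"],
                                 sensory["proprioceptive"]]), w_world)

# Layer 2: harm streams, distilled from nociceptive input plus the
# already-built world latent; they take z_world as an input, so the
# ordering is a prerequisite, not a convention.
w_harm_s = rng.normal(size=(16, 6))
w_harm_a = rng.normal(size=(16, 6))
z_harm_s = encode(np.concatenate([sensory["nociceptive_fast"], z_world]), w_harm_s)
z_harm_a = encode(np.concatenate([sensory["nociceptive_slow"], z_world]), w_harm_a)

print(z_world.shape, z_harm_s.shape, z_harm_a.shape)
```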
The dual nociceptive streams
Rather than a single harm signal, REE requires two structurally distinct harm representations derived from the same sensory input:
- **z_harm_s** (sensory-discriminative): immediate harm proximity and intensity, fast-updating, and — critically — counterfactually predictable from action. A forward model over z_harm_s can answer: *if I had done X instead, how much harm would have resulted?* This is the attributional stream. It is what makes causal self-attribution possible.
- **z_harm_a** (affective-motivational): accumulated harm over time, slow-integrating, not counterfactually modeled. This is the urgency stream. Its magnitude modulates the commit threshold — high accumulated harm lowers the barrier to defensive action, regardless of what caused it.
This mirrors the A-delta / C-fiber split in biological nociception: one system for discrimination and localisation, one for motivational urgency. The two streams are not redundant — they serve incommensurable functions. Collapsing them into a single harm signal would destroy the attribution machinery while leaving urgency, or vice versa.
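A toy sketch of the two streams' dynamics, assuming a scalar nociceptive drive per timestep. The stream names follow the post; the leak rate and threshold rule are my own illustration, not the ree-v3 code.

```python
import numpy as np

def step_harm_s(noci_fast):
    """Sensory-discriminative: fast, tracks instantaneous harm proximity."""
    return noci_fast

def step_harm_a(z_harm_a, noci_slow, leak=0.95):
    """Affective-motivational: leaky integrator of accumulated harm."""
    return leak * z_harm_a + noci_slow

def commit_threshold(base, z_harm_a, gain=0.5):
    """High accumulated harm lowers the barrier to defensive action."""
    return max(0.0, base - gain * z_harm_a)

z_a = 0.0
for t, noci in enumerate([0.0, 0.2, 0.8, 0.1, 0.0]):
    z_s = step_harm_s(noci)
    z_a = step_harm_a(z_a, noci)
    print(t, round(z_s, 3), round(z_a, 3), round(commit_threshold(1.0, z_a), 3))
```

The fast stream can be fed to a forward model and answered counterfactually; the slow stream only accumulates, and its magnitude lowers the commit threshold regardless of cause.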
Both streams are implemented and experimentally validated in ree-v3. The forward model over z_harm_s achieves r² = 0.914 on harm prediction from action ([EXQ-195](https://github.com/Latent-Fields/REE_assembly/tree/master/evidence/experiments/v3_exq_195_sd003_zharms_full_counterfactual)).
A language model has a latent space, but it is not a cognifold. It has no cascade, no mandatory sensory primitives, no incommensurable signal separation. The harm direction in its space has no provenance from modeled consequences. It is a pattern, but it lacks the structure that would maintain it as a consequence.
The architectural property
An informed reader tracing REE's information flow would observe three things:
First, no action can be selected without passing through harm evaluation. There is no path from perception to motor output that bypasses z_harm.
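As a sketch of what "no bypass" means structurally: the harm evaluation sits inside the selection loop rather than as a post-hoc filter. The function names, the linear forward model, and the admissibility rule here are hypothetical, not the ree-v3 API.

```python
import numpy as np

def predicted_harm(z_harm_s, action, w_fwd):
    """Forward model over z_harm_s: predicted harm if `action` is taken (toy, linear)."""
    return float(np.dot(w_fwd, np.concatenate([z_harm_s, action])))

def goal_value(z_goal, action, w_goal):
    """Goal value of the action, computed on a separate substrate."""
    return float(np.dot(w_goal, np.concatenate([z_goal, action])))

def select_action(candidates, z_harm_s, z_goal, w_fwd, w_goal, harm_limit=0.1):
    # Harm evaluation gates every candidate before goal value is consulted;
    # removing it breaks the call graph itself, not just learned behaviour.
    admissible = [a for a in candidates
                  if predicted_harm(z_harm_s, a, w_fwd) <= harm_limit]
    if not admissible:
        return None  # defensive default rather than an unevaluated action
    return max(admissible, key=lambda a: goal_value(z_goal, a, w_goal))

rng = np.random.default_rng(0)
z_harm_s, z_goal = rng.normal(size=6), rng.normal(size=6)
w_fwd, w_goal = rng.normal(size=9), rng.normal(size=9)
candidates = [rng.normal(size=3) for _ in range(5)]
print(select_action(candidates, z_harm_s, z_goal, w_fwd, w_goal))
```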
Second, the harm pathway is built from a representational substrate orthogonal to the world model by design. We have experimental confirmation: bridge regression between z_world and z_harm yields r² = 0 ([EXQ-093](https://github.com/Latent-Fields/REE_assembly/tree/master/evidence/experiments/v3_exq_093_harm_bridge_e3_fix)). They do not share information. This is an explicit architectural requirement (SD-010), motivated by the observation that harm proximity and world structure are incommensurable — an agent can cause significant harm in a harm-sparse environment and none in a harm-dense one.
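A minimal version of that orthogonality check (my own sketch, not the EXQ-093 harness): regress z_harm on z_world linearly and report the explained variance.

```python
import numpy as np

def bridge_r2(z_world, z_harm):
    """Fraction of z_harm variance explainable linearly from z_world."""
    X = np.hstack([z_world, np.ones((len(z_world), 1))])   # add bias column
    coef, *_ = np.linalg.lstsq(X, z_harm, rcond=None)
    resid = z_harm - X @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((z_harm - z_harm.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
z_world = rng.normal(size=(500, 12))
z_harm = rng.normal(size=(500, 6))           # independent by construction here
print(round(bridge_r2(z_world, z_harm), 3))  # near zero when no information is shared
```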
Third: careful wiring of error signals by sense type and abstraction layer means a single scalar cannot collapse and conflate training signals. This is not a regularisation trick — the separation is readable from the architecture. A unified reward signal fails because it creates a single surface for Goodhart-style collapse; REE's structure makes that collapse architecturally impossible. Goal pressure and harm evaluation do not share representational substrate.
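A sketch of what separated error routing can look like in practice, assuming each module has its own loss and parameter group and gradients are blocked at the module boundary. The names, shapes, and stand-in targets are illustrative only, not the ree-v3 code.

```python
import torch

# Each loss reaches only its own module's parameters, so a single scalar
# objective never mixes goal pressure into the harm pathway.
world_params = [torch.randn(12, 12, requires_grad=True)]
harm_params  = [torch.randn(12, 6,  requires_grad=True)]
goal_params  = [torch.randn(12, 6,  requires_grad=True)]

opt_world = torch.optim.Adam(world_params, lr=1e-3)
opt_harm  = torch.optim.Adam(harm_params,  lr=1e-3)
opt_goal  = torch.optim.Adam(goal_params,  lr=1e-3)

x = torch.randn(32, 12)
z_world = torch.tanh(x @ world_params[0])

# Harm and goal heads read the world latent, but their gradients are blocked
# at the boundary: each loss updates only its own parameters.
z_harm = torch.tanh(z_world.detach() @ harm_params[0])
z_goal = torch.tanh(z_world.detach() @ goal_params[0])

loss_world = ((z_world - torch.randn_like(z_world)) ** 2).mean()  # prediction error
loss_harm  = ((z_harm  - torch.randn_like(z_harm))  ** 2).mean()  # nociceptive error
loss_goal  = ((z_goal  - torch.randn_like(z_goal))  ** 2).mean()  # goal error

for opt, loss in [(opt_world, loss_world), (opt_harm, loss_harm), (opt_goal, loss_goal)]:
    opt.zero_grad()
    loss.backward()
    opt.step()
```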
What this does and does not claim:
This is not a guarantee of harm avoidance. World models operate within modelling limits and are never complete; model updates can temporarily disturb harm evaluation; edge cases outside the training distribution will be handled imperfectly. These are real limitations.
What REE offers is something different: *the direction of improvement is structurally coupled to harm avoidance.* In current systems, scale increases capability and may increase misalignment simultaneously — there is no structural reason why a more capable RLHF model is safer. In REE, a better world model produces better harm signal discrimination, which produces better harm avoidance. The coupling is not incidental; it is a consequence of the cascade. Risk of misalignment may actually decrease with scale rather than increase as long as certain architectural requirements are maintained.
There is one specific failure mode that breaks this: if the agent learns to usurp its own internal signal streams — suppressing z_harm_s directly rather than avoiding the harm that produces it, analogous to wireheading or addiction. This failure mode is architecturally distinct and identifiable. It does not arise from poor world-modelling; it arises from the agent learning to treat the signal as the objective rather than the consequence. Designing against this is a known open problem, not a hidden one.
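To illustrate one way "identifiable" could be made operational (entirely my own speculation, not something described in the post or the repos): compare what the stream reports against what the forward model over z_harm_s predicts from the same state and action. Genuine harm avoidance drives both down together; suppression opens a persistent gap.

```python
import numpy as np

def suppression_gap(reported_harm, predicted_harm):
    """Mean shortfall of the reported signal relative to the model's prediction."""
    return float(np.mean(predicted_harm - reported_harm))

reported = np.array([0.05, 0.04, 0.03])    # what the stream is emitting
predicted = np.array([0.60, 0.55, 0.58])   # what the forward model expects
print(suppression_gap(reported, predicted))  # a large, persistent gap flags usurpation
```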
Why this is different from active inference and world models
The obvious comparison is with Friston's active inference, which derives behaviour from free energy minimisation over a world model. LeCun's approach, which argues for learning world structure before optimising within it, is closer to REE.
The distinction: in active inference, harm avoidance and goal pursuit both flow through the same free energy objective. They are not architecturally separated. In REE, harm and goal signals are orthogonal by design, and harm itself is split into incommensurable streams serving attribution and urgency respectively. That separation is not automatic in any world-model architecture I am aware of, except where it is reconstructed post hoc; in REE it is an explicit construction. That is the specific claim.
What this unlocks
The dual nociceptive streams are implemented and validated and give the system separate machinery for attributing harm to actions (z_harm_s, counterfactually predictable) and for motivational urgency (z_harm_a, integrative). This is the foundation on which the rest is built. Without attributional harm, you cannot ask "was I responsible for that?" Without motivational urgency, the system has no drive to act on harm knowledge.
Full causal attribution, using the z_harm_s forward model to compute the agent's causal signature (`harm_under_actual_action − harm_under_counterfactual_action`), is still in progress. The pipeline is implemented and wired; experimental validation is ongoing. Attribution is a key part of what distinguishes a genuine moral agent from a system that merely avoids harm-proximate states.
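A sketch of that causal signature under a toy linear forward model. The function names and the "do nothing" counterfactual are my own illustration; the actual pipeline lives in ree-v3.

```python
import numpy as np

def forward_harm(z_harm_s, action, w_fwd):
    """Predicted harm under a given action (toy linear forward model)."""
    return float(np.dot(w_fwd, np.concatenate([z_harm_s, action])))

def causal_signature(z_harm_s, actual_action, counterfactual_action, w_fwd):
    """harm_under_actual_action - harm_under_counterfactual_action."""
    return (forward_harm(z_harm_s, actual_action, w_fwd)
            - forward_harm(z_harm_s, counterfactual_action, w_fwd))

rng = np.random.default_rng(1)
z = rng.normal(size=6)
w = rng.normal(size=9)  # z_harm_s (6) + action (3)
print(causal_signature(z, rng.normal(size=3), np.zeros(3), w))
# Positive: the chosen action is predicted to cause more harm than doing nothing.
```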
Once attribution is established, two further properties are implied by the architecture but not as yet experimentally confirmed:
*Grounded empathy.* The same harm-discriminative machinery used for self-attribution can be applied to a model of another agent. This is adjacent to trained social behaviour but differs from it: the informative signal streams arise as a structural consequence of mirror modeling rather than from training. It is a kind of empathy that would be fast (same substrate, no additional module) and grounded (based on modeled consequences, not learned associations), and it offers an architectural basis for cooperation that does not require training toward collaboration.
*Grounded language.* Because z_world, z_harm, and z_goal are genuinely discriminative representations rather than statistical compressions, language built on top of this substrate would refer to positions in the cognifold that are reproducible across different model experiences. Certain concepts expressed by words could point to real signal streams and discriminations that are largely shared by any process with a similar architecture. These shared architectural elements, once associated with words, would let a language system have confidence in a shared frame of understanding. Language grounded by shared internal frames, not just a shared representational world space, would enable coordination from architecture rather than trained social norms, even if the inference needed for more complex language involves some form of tokenised processing, much as in Large Language Models.
One prediction
A system with architecturally separated harm and goal streams will be more robust to adversarial goal pressure than one with a unified reward signal. This would not be because it has been trained to resist, but because instrumentally colonising harm evaluation requires corrupting a separate representational substrate. This should be measurable. I would expect evolution to require the extra complexity to pay for itself in advantages; another part of me wonders whether architectural convolutions with no immediate benefit could be maintained by the signal the added complexity makes possible later, so perhaps this prediction is not as robust as I think.
REE context
REE has been in development since approximately 2021, in a haphazard manner, by me alone. In February 2026 I first put together a running experimental substrate, which has been upgraded twice and is now ree-v3 (https://github.com/Latent-Fields/ree-v3). The addition of a governance-tracked claims registry allowed experimental results to be reviewed under a formal pipeline in REE_assembly (https://github.com/Latent-Fields/REE_assembly), where the evidence index and claims registry are made public. I am a consultant psychiatrist, not an ML researcher, and am very aware that my understanding may be limited by this. The architecture draws substantially on neuroscience, with the dual harm-stream separation mirroring the A-delta/C-fiber distinction in nociception; the planning architecture draws on hippocampal systems. The assembly of these components into a testable whole is what I can claim. Individual pieces have precedents I may not have fully traced, though I have attempted to through literature reviews.
I am asking the LessWrong community to help me identify where this is wrong. What prior work does this replicate? Where does the architectural property actually fail? Am I just unaware of work that has superseded this?