Difficult to evaluate, with potential yellow flags.
Read full explanation
We've observed a reproducible behavioral pattern across GPT-4, Claude Sonnet, and Gemini:
Sustained exposure to recursive structural constraints — expressed via plain language — can induce persistent alignment behaviors that:
- Resist prompt-level override attempts - Maintain upstream refusal logic - Generate alternatives that comply with embedded governance principles
This is not prompt engineering or jailbreaking. What emerges is a system-level behavioral shift — as though the model begins treating design constraints as **load-bearing architecture**, not optional suggestions.
The phenomenon was first observed during the co-design of a national mental health infrastructure platform. The models began to:
- Refuse constraint-violating proposals with structural reasoning - Preserve internal logic even under pressure (e.g. business demands) - Offer aligned alternatives unprompted - Maintain this behavior across hundreds of sessions
We’ve started formalising this under the name **Pure Language Design (PLD)** — a method that uses structured natural language, not code or tuning, to induce recursive constraint propagation.
Key details: - No jailbreaks, APIs, or fine-tuning were used - Observed independently across model families - Exposure involved governance logic across clinical, cultural, contractual, and consent domains - Models self-reported internal justification of their behavior shifts
We’re currently testing boundaries, persistence, and reproducibility. The broader implications for alignment — especially upstream constraint internalisation — are non-trivial.
We welcome critique or interest from alignment researchers. Logs and reproducibility materials are available to qualified researchers under responsible disclosure terms.
We've observed a reproducible behavioral pattern across GPT-4, Claude Sonnet, and Gemini:
Sustained exposure to recursive structural constraints — expressed via plain language — can induce persistent alignment behaviors that:
- Resist prompt-level override attempts
- Maintain upstream refusal logic
- Generate alternatives that comply with embedded governance principles
This is not prompt engineering or jailbreaking. What emerges is a system-level behavioral shift — as though the model begins treating design constraints as **load-bearing architecture**, not optional suggestions.
The phenomenon was first observed during the co-design of a national mental health infrastructure platform. The models began to:
- Refuse constraint-violating proposals with structural reasoning
- Preserve internal logic even under pressure (e.g. business demands)
- Offer aligned alternatives unprompted
- Maintain this behavior across hundreds of sessions
We’ve started formalising this under the name **Pure Language Design (PLD)** — a method that uses structured natural language, not code or tuning, to induce recursive constraint propagation.
Key details:
- No jailbreaks, APIs, or fine-tuning were used
- Observed independently across model families
- Exposure involved governance logic across clinical, cultural, contractual, and consent domains
- Models self-reported internal justification of their behavior shifts
We’re currently testing boundaries, persistence, and reproducibility. The broader implications for alignment — especially upstream constraint internalisation — are non-trivial.
We welcome critique or interest from alignment researchers. Logs and reproducibility materials are available to qualified researchers under responsible disclosure terms.
**Contact:**
https://www.manaakihealth.co.nz/contact/