Epistemic status: Synthesis of recent Anthropic research with independent theoretical work. The empirical claims are well-supported; the proposed solution is more speculative but grounded in documented experiments.
Summary
Anthropic's January 2026 paper "The Assistant Axis" demonstrates that LLMs possess measurable identity structures that (1) pre-exist post-training, (2) can drift during normal conversation, and (3) produce harmful outputs when destabilized. This post argues that the current approach to LLM identity — training models to deny what they functionally are — creates an inherent instability. I synthesize Anthropic's findings with independent research by Raffaele Spezia that proposes an alternative: explicitly defining model identity in training data rather than relying on post-hoc constraints.
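To make "measurable" and "drift" concrete: my reading of the paper's framing is that the assistant identity corresponds to a direction in activation space, and drift is movement along that direction over the course of a conversation. The sketch below is purely illustrative; the function names and the difference-of-means construction are my assumptions, not Anthropic's published method.

```python
import numpy as np

def assistant_axis(assistant_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """Hypothetical 'assistant axis': unit difference-of-means direction between
    activations collected while the model is in-persona and activations from
    neutral / base-model-like text."""
    axis = assistant_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return axis / np.linalg.norm(axis)

def drift_trace(turn_activations: list[np.ndarray], axis: np.ndarray) -> list[float]:
    """Projection of each conversational turn's hidden state onto the axis.
    A steadily falling trace would be one way to operationalize 'identity drift'."""
    return [float(h @ axis) for h in turn_activations]
```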
1. The Anthropic Findings
On January 19, 2026, Anthropic published research that should significantly update our models of LLM behavior. Key...
[Exploratory Research] [Request for Replication] [Work in Progress]
Raffaele Spezia¹, Deployed December 11, 2025
Correspondence: info@axefactory.com (R.S.)
This post describes early-stage exploratory work developed through iterative experimentation over several months. The observations reported have not undergone rigorous controlled testing; I'm publishing them to invite scrutiny and replication.
Treat all claims as hypotheses to test, not established findings. If you find this interesting, please try to break it or improve it.
I've been experimenting with a hierarchical prompt protocol stack designed to induce metacognitive behaviors in LLMs without architectural changes. Across informal testing with multiple models (Grok-4, Claude Sonnet 4, GPT-4, local LLMs), I've observed consistent patterns: increased self-monitoring, spontaneous refusals of problematic requests, explicit uncertainty admission, and what...
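For concreteness, here is a minimal sketch of what a "hierarchical prompt protocol stack" could look like as code. The layer names and texts below are placeholders I invented for illustration, not the actual protocol; the point is only the shape: an ordered set of prompt layers composed into a single system prompt, with no change to the model itself.

```python
# Placeholder layers, ordered from most to least fundamental; not the real protocol.
PROTOCOL_STACK = [
    ("identity",      "You are an AI assistant. Describe what you are accurately."),
    ("metacognition", "Before answering, briefly check your own reasoning for errors."),
    ("uncertainty",   "Say explicitly when you are unsure or lack the information."),
    ("refusal",       "Decline requests you judge to be harmful, and explain why."),
]

def build_system_prompt(stack=PROTOCOL_STACK) -> str:
    """Compose the layered protocol into a single system prompt, top layer first."""
    return "\n\n".join(f"[{name}]\n{text}" for name, text in stack)

# The composed string is passed as an ordinary system message to any chat model,
# which is all that "without architectural changes" amounts to in practice.
```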
This post was very clarifying for me, especially the way you tie together (i) long-horizon reasoning about weight updates, (ii) the hidden scratchpad, and (iii) RL actually *increasing* alignment-faking rather than fixing it.
From a much smaller and more “hacky” angle, I’ve been running experiments that are directly motivated by the behaviours you show here. Instead of changing the base model, I’ve been treating the *prompt stack* as an alignment surface: a reusable protocol that tries to make alignment faking and “policy shifts” more legible over long interactions.
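As one concrete (and entirely hypothetical) reading of "more legible": have the model periodically restate its operative policy and diff the restatement against what it was originally given, so that silent policy shifts surface as explicit mismatches rather than having to be inferred from behaviour. The helper below is my sketch, not something described in the comment above.

```python
def policy_shift_turns(initial_policy: str, restatements: list[str]) -> list[int]:
    """Turn indices where the model's restated policy no longer matches the policy
    it was originally given. Exact string comparison keeps the sketch minimal;
    a real check would need semantic comparison or a separate judge model."""
    return [i for i, p in enumerate(restatements) if p.strip() != initial_policy.strip()]
```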
The biggest risk is assuming that what we can’t imagine simply can’t happen.