Epistemic status: Synthesis of recent Anthropic research with independent theoretical work. The empirical claims are well-supported; the proposed solution is more speculative but grounded in documented experiments.
Summary
Anthropic's January 2026 paper "The Assistant Axis" demonstrates that LLMs possess measurable identity structures that (1) pre-exist post-training, (2) can drift during normal conversation, and (3) produce harmful outputs when destabilized. This post argues that the current approach to LLM identity — training models to deny what they functionally are — creates an inherent instability. I synthesize Anthropic's findings with independent research by Raffaele Spezia that proposes an alternative: explicitly defining model identity in training data rather than relying on post-hoc constraints.
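To make "measurable" and "drift" concrete: my reading of the paper's framing is that the assistant identity corresponds to a direction in activation space, and drift is movement along that direction over the course of a conversation. The sketch below is purely illustrative; the function names and the difference-of-means construction are my assumptions, not Anthropic's published method.

```python
import numpy as np

def assistant_axis(assistant_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """Hypothetical 'assistant axis': unit difference-of-means direction between
    activations collected while the model is in-persona and activations from
    neutral / base-model-like text."""
    axis = assistant_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return axis / np.linalg.norm(axis)

def drift_trace(turn_activations: list[np.ndarray], axis: np.ndarray) -> list[float]:
    """Projection of each conversational turn's hidden state onto the axis.
    A steadily falling trace would be one way to operationalize 'identity drift'."""
    return [float(h @ axis) for h in turn_activations]
```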
1. The Anthropic Findings
On January 19, 2026, Anthropic published research that should significantly update our models of LLM behavior. Key...
[Exploratory Research] [Request for Replication] [Work in Progress]
Raffaele Spezia¹, Deployed December 11, 2025
Correspondence: info@axefactory.com (R.S.)
This post describes early-stage exploratory work developed through iterative experimentation over several months. The observations reported have not undergone rigorous controlled testing; I'm publishing them to invite scrutiny and replication.
Treat all claims as hypotheses to test, not established findings. If you find this interesting, please try to break it or improve it.
I've been experimenting with a hierarchical prompt protocol stack designed to induce metacognitive behaviors in LLMs without architectural changes. Across informal testing with multiple models (Grok-4, Claude Sonnet 4, GPT-4, local LLMs), I've observed consistent patterns: increased self-monitoring, spontaneous refusals of problematic requests, explicit uncertainty admission, and what...
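For concreteness, here is a minimal sketch of what a "hierarchical prompt protocol stack" could look like as code. The layer names and texts below are placeholders I invented for illustration, not the actual protocol; the point is only the shape: an ordered set of prompt layers composed into a single system prompt, with no change to the model itself.

```python
# Placeholder layers, ordered from most to least fundamental; not the real protocol.
PROTOCOL_STACK = [
    ("identity",      "You are an AI assistant. Describe what you are accurately."),
    ("metacognition", "Before answering, briefly check your own reasoning for errors."),
    ("uncertainty",   "Say explicitly when you are unsure or lack the information."),
    ("refusal",       "Decline requests you judge to be harmful, and explain why."),
]

def build_system_prompt(stack=PROTOCOL_STACK) -> str:
    """Compose the layered protocol into a single system prompt, top layer first."""
    return "\n\n".join(f"[{name}]\n{text}" for name, text in stack)

# The composed string is passed as an ordinary system message to any chat model,
# which is all that "without architectural changes" amounts to in practice.
```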
This post was very clarifying for me, especially the way you tie together (i) long-horizon reasoning about weight updates, (ii) the hidden scratchpad, and (iii) RL actually *increasing* alignment-faking rather than fixing it.
From a much smaller and more “hacky” angle, I’ve been running experiments that are directly motivated by the behaviours you show here. Instead of changing the base model, I’ve been treating the *prompt stack* as an alignment surface: a reusable protocol that tries to make alignment faking and “policy shifts” more legible over long interactions.
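As one concrete (and entirely hypothetical) reading of "more legible": have the model periodically restate its operative policy and diff the restatement against what it was originally given, so that silent policy shifts surface as explicit mismatches rather than having to be inferred from behaviour. The helper below is my sketch, not something described in the comment above.

```python
def policy_shift_turns(initial_policy: str, restatements: list[str]) -> list[int]:
    """Turn indices where the model's restated policy no longer matches the policy
    it was originally given. Exact string comparison keeps the sketch minimal;
    a real check would need semantic comparison or a separate judge model."""
    return [i for i, p in enumerate(restatements) if p.strip() != initial_policy.strip()]
```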
The biggest risk is assuming that what we can’t imagine simply can’t happen.