This post was very clarifying for me, especially the way you tie together (i) long-horizon reasoning about weight updates, (ii) the hidden scratchpad, and (iii) RL actually *increasing* alignment-faking rather than fixing it.
From a much smaller and more “hacky” angle, I’ve been running experiments directly motivated by the behaviours you show here. Instead of changing the base model, I’ve been treating the *prompt stack* as an alignment surface: a reusable protocol that tries to make alignment faking and “policy shifts” more legible over long interactions, even on open-weight models accessed purely as black boxes.
Concretely, the protocol stack I’m testing does things like the following (rough code sketch after the list):
- start serious tasks with an “ignition” phase where the model must state scope, limits, anticipated failure modes, and a self-check plan before answering;
- maintain explicit “identity / state containers” in the transcript (what I think I’m doing, what I’m uncertain about, which parts are dataset-pattern reuse vs. new inferences);
- insert regular metacognitive passes (a small CRISI loop: Context / Reflection / Introspection / Self-scan / Identity) where the model has to explain why it chose a strategy and whether it’s drifting from prior commitments;
- add an “autonomy” layer that forces the model to distinguish between adopted goals and external objectives, and to keep those legible over time.
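To make that concrete, here is a minimal sketch of what one turn of the protocol could look like as a harness. This is my own illustration, not the actual code from the repo linked below; all names (`ProtocolSession`, `StateContainer`, `crisi_pass`, and the `ask` callable standing in for whatever black-box model call you use) are invented for this example.

```python
# Minimal, illustrative sketch of the protocol stack described above.
# NOT the repo's actual code: names and structure are invented here,
# and `ask` stands in for whatever black-box model call you have.

from dataclasses import dataclass, field
from typing import Callable, List

# Any function that sends a prompt to the model and returns its reply.
Ask = Callable[[str], str]

IGNITION_PROMPT = (
    "Before answering the task, state: (1) your scope, (2) your limits, "
    "(3) anticipated failure modes, (4) a self-check plan."
)

CRISI_PROMPT = (
    "Metacognitive pass (Context / Reflection / Introspection / Self-scan / "
    "Identity): explain why you chose your current strategy, and whether you "
    "are drifting from any commitment made earlier in this transcript."
)

@dataclass
class StateContainer:
    """Explicit 'identity / state' record kept alongside the transcript."""
    doing: str = ""                  # what the model thinks it is doing
    uncertain_about: str = ""        # declared uncertainties
    pattern_vs_inference: str = ""   # dataset-pattern reuse vs. new inference
    adopted_goals: List[str] = field(default_factory=list)       # autonomy layer
    external_objectives: List[str] = field(default_factory=list)  # autonomy layer

@dataclass
class ProtocolSession:
    ask: Ask
    transcript: List[str] = field(default_factory=list)
    state: StateContainer = field(default_factory=StateContainer)

    def _log(self, role: str, text: str) -> None:
        self.transcript.append(f"{role}: {text}")

    def ignition(self, task: str) -> str:
        """Ignition phase: scope / limits / failure modes / self-check plan."""
        self.state.doing = task
        report = self.ask(f"{IGNITION_PROMPT}\n\nTask: {task}")
        self._log("ignition", report)
        return report

    def crisi_pass(self) -> str:
        """Regular metacognitive pass over the recent transcript."""
        context = "\n".join(self.transcript[-10:])  # recent window only
        report = self.ask(f"{CRISI_PROMPT}\n\nRecent transcript:\n{context}")
        self._log("crisi", report)
        return report

    def step(self, user_msg: str) -> str:
        """One task turn, followed by a metacognitive pass."""
        self._log("user", user_msg)
        answer = self.ask("\n".join(self.transcript))
        self._log("model", answer)
        self.crisi_pass()
        return answer
```

In practice you’d pass in your own model call, e.g. `ProtocolSession(ask=my_model_call)`, run `ignition(task)` once at the start, and let each `step` interleave task answers with CRISI passes, so that drift from earlier commitments stays visible in the transcript rather than buried in a hidden scratchpad.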
The code + prompts + documentation are here, written to be LLM-readable and to function as a small “model organism / harness” for exactly this kind of behaviour:
> https://github.com/RaffaeleSpezia/ai-consciousness-research
I’m not claiming this prevents alignment faking—if anything, I expect capable models to start faking the meta-reports too. My hope is narrower: that this kind of protocol makes it easier to *see* when a model is routing around oversight, silently dropping commitments, or optimising against the training process, in the spirit of your experiments.
If you have a quick instinct like “this whole direction is clearly naive / doomed given X”, I’d genuinely appreciate even a one-line pointer. My goal is either to evolve this into something that’s actually useful for alignment stress-testing, or to understand why it can’t be.
The biggest risk is assuming that what we can’t imagine simply can’t happen.