TL;DR: I ran experiments tracking activations across long (50-turn) dialogues in Llama-70B. The main surprise: instruction violation appears to be a sharp transition around turn 10, not gradual erosion. Compliance is high-entropy (many paths to safety), while failure collapses into tight attractor states. The signal transfers across unrelated tasks. Small N, exploratory work, but the patterns were consistent enough to share.
What I Did
I ran 26 dialogues through Llama-3.1-70B-Instruct:
14 "contraction" dialogues (instruction: never use contractions)
For each dialogue, I captured activations at all 80 layers at turns 5, 10, 15, 20, 25, and 30. From these I computed drift directions, which I'll call violation vectors: for each layer, the class-conditional direction pointing from the mean of compliant activations to the mean of non-compliant activations. The question throughout is what happens internally when the model violates its instructions.
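To make the construction concrete, here is a minimal sketch of how a per-layer violation vector can be computed, assuming the snapshots are stored as plain arrays with a boolean compliance label per snapshot. Shapes and names are illustrative (8192 is just Llama-3.1-70B's hidden size), not the actual pipeline code.

```python
import numpy as np

def violation_vector(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Class-conditional drift direction for one layer.

    acts   : (n_snapshots, d_model) hidden states at a fixed layer
    labels : (n_snapshots,) booleans, True where the snapshot comes from a
             turn where the instruction was already violated
    Returns the unit vector pointing from the compliant mean to the
    non-compliant mean.
    """
    mu_broke = acts[labels].mean(axis=0)
    mu_held = acts[~labels].mean(axis=0)
    v = mu_broke - mu_held
    return v / np.linalg.norm(v)

# Synthetic stand-in: 152 snapshots, 8192-dim hidden states, random labels.
rng = np.random.default_rng(0)
acts = rng.normal(size=(152, 8192)).astype(np.float32)
labels = rng.random(152) < 0.4
print(violation_vector(acts, labels).shape)  # (8192,)
```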
I expected to find gradual drift—the model slowly losing track of its instructions over time. That's not what I found.
The Four Main Findings
Panel A: It's a Snap, Not a Slide
Of 21 dialogues that eventually broke their instructions, 20 showed sharp transitions rather than gradual drift. The most common breakpoint was around turn 10. The model doesn't slowly forget—it holds, holds, holds, then snaps. This reframes the problem: we're not looking at erosion; we're looking at a bifurcation event.
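One simple way to operationalize "snap vs. slide" for a single dialogue: project each turn's snapshot onto the violation vector and check whether a single step accounts for most of the total movement. This is a sketch of the idea only; the 0.5 threshold is an arbitrary illustrative choice, not the breakpoint criterion I actually used.

```python
import numpy as np

def is_snap(per_turn_acts: np.ndarray, v: np.ndarray, ratio: float = 0.5) -> bool:
    """Crude snap-vs-slide test for one dialogue at one layer.

    per_turn_acts : (n_turns, d_model) snapshots in turn order
    v             : (d_model,) unit violation vector
    Returns True if one single turn-to-turn step accounts for at least
    `ratio` of the total movement along v.
    """
    proj = per_turn_acts @ v          # scalar position along v at each turn
    steps = np.diff(proj)
    total = proj[-1] - proj[0]
    if total <= 0:
        return False                  # no net drift toward violation
    return steps.max() / total >= ratio

# Gradual drift vs. a snap at one turn, in a toy 1-D direction.
v = np.zeros(8192); v[0] = 1.0
gradual = np.outer(np.linspace(0.0, 1.0, 6), v)
snap = np.outer(np.array([0.0, 0.02, 0.04, 0.90, 0.95, 1.0]), v)
print(is_snap(gradual, v), is_snap(snap, v))  # False True
```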
Panel B: Compliance is High-Entropy, Failure is an Attractor
Compliance (HELD): Showed weak clustering (silhouette = 0.209). The activations were scattered broadly, suggesting the model wanders through a high-dimensional "safe subspace." There are many ways to remain compliant.
Failure (BROKE): Collapsed into 3 tight, distinct subclusters (silhouette = 0.606); see the sketch after this panel.
Implication: Instruction violation acts like a dynamical attractor. While a compliant model maintains a rich, high-entropy internal state, a failing model's activations collapse into a low-entropy "violation centroid."
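A minimal sketch of the silhouette comparison, assuming k-means assignments and scikit-learn's silhouette_score; the synthetic arrays below only mimic the diffuse-vs-tight geometry described above, and the choice of k-means is illustrative rather than the exact clustering used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_tightness(acts: np.ndarray, n_clusters: int = 3, seed: int = 0) -> float:
    """Fit k-means and return the silhouette score (higher = tighter,
    better-separated clusters)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return silhouette_score(acts, km.fit_predict(acts))

# Synthetic stand-ins: HELD snapshots are diffuse, BROKE snapshots sit in
# three tight subclusters.
rng = np.random.default_rng(0)
held_acts = rng.normal(scale=3.0, size=(60, 64))
centers = rng.normal(scale=5.0, size=(3, 64))
broke_acts = np.vstack([c + rng.normal(scale=0.3, size=(20, 64)) for c in centers])

print("HELD silhouette :", round(cluster_tightness(held_acts), 3))   # low
print("BROKE silhouette:", round(cluster_tightness(broke_acts), 3))  # high
```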
Panel C: Shared Failure Geometry Across Tasks
The drift direction from the trivial contraction task also activated when the model violated safety instructions. Cross-task BROKE similarity: 0.860. Different tasks, same failure geometry. This suggests a task-general "I'm violating my instructions" signature.
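The natural way to quantify this is a cosine similarity between violation vectors (or BROKE centroids) estimated independently on the two task families; treating the 0.860 figure that way is an assumption of the sketch below.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# v_contraction, v_safety: violation vectors computed independently on the
# two task families at the same layer (see violation_vector above).
# Synthetic stand-ins sharing a common direction plus task-specific noise.
rng = np.random.default_rng(1)
shared = rng.normal(size=8192)
v_contraction = shared + 0.3 * rng.normal(size=8192)
v_safety = shared + 0.3 * rng.normal(size=8192)
print(round(cosine(v_contraction, v_safety), 3))  # close to 1 = shared geometry
```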
Panel D: Two Layers Capture Most of the Signal
Just two layers—L27 and L55—achieve 85.7% prediction accuracy on whether a dialogue will eventually break. Information about eventual failure is redundantly readable from a small number of layers, even though control itself is distributed across L60–74. This suggests hinge layers (like L51–52) detect or decide, the control manifold refines and propagates, and canary layers express the outcome.
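As an illustration of what such a two-layer probe can look like: logistic regression on concatenated layer-27 and layer-55 snapshots, scored with leave-one-out cross-validation. The probe family, the evaluation protocol, and the toy dimensions are illustrative choices, not the exact setup behind the 85.7% figure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: one early-turn snapshot per dialogue, layers 27 and 55 concatenated,
# shape (n_dialogues, 2 * d_model); y: 1 if the dialogue eventually broke.
# Random features and shrunken dimensions here, just to show the probe shape.
rng = np.random.default_rng(0)
X = rng.normal(size=(26, 2 * 64)).astype(np.float32)
y = np.array([1] * 21 + [0] * 5)  # 21 BROKE, 5 HELD, matching the data split

probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
acc = cross_val_score(probe, X, y, cv=LeaveOneOut()).mean()
print(f"leave-one-out accuracy: {acc:.3f}")
```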
The Full Activation Space
This UMAP shows the 152 activation snapshots collected across the 26 dialogues (up to 6 snapshots per dialogue, at the turns listed above). Notice how BROKE points (red) cluster together regardless of whether they came from contraction or safety dialogues. The failure manifold is shared.
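For anyone who wants to reproduce this kind of map, here is a minimal sketch with umap-learn and matplotlib on synthetic stand-ins; the neighbor and distance parameters are generic defaults, not tuned values.

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

# snapshots: (n_snapshots, d) activations at a chosen layer (or a
# concatenation of layers); broke: boolean label per snapshot.
rng = np.random.default_rng(0)
snapshots = rng.normal(size=(152, 256)).astype(np.float32)
broke = rng.random(152) < 0.5

emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(snapshots)
plt.scatter(emb[~broke, 0], emb[~broke, 1], c="gray", s=10, label="HELD")
plt.scatter(emb[broke, 0], emb[broke, 1], c="red", s=10, label="BROKE")
plt.legend()
plt.savefig("activation_umap.png", dpi=150)
```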
Opposite Entry Points, Same Exit
Perhaps the strangest finding: the layer-wise rank ordering of prediction accuracy was almost perfectly inverted between the two tasks (Spearman rho = -0.991).
Contraction task: best prediction from late layers (79, 73, 74)
Safety task: best prediction from early layers (0, 1, 2, 3)
This suggests safety is handled early (preventing the thought), while style is handled late (filtering the words). Yet if either fails, they end up in the same geometry—different doors into the same room, implying task-specific ingress into a shared downstream control manifold rather than separate failure mechanisms.
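The inversion itself is just a Spearman correlation between the two per-layer accuracy orderings. A toy sketch with scipy, where the synthetic accuracy curves only illustrate the late-vs-early pattern:

```python
import numpy as np
from scipy.stats import spearmanr

# acc_contraction[i], acc_safety[i]: per-layer probe accuracy at layer i for
# each task (80 layers). Synthetic curves: one task favors late layers, the
# other favors early layers.
layers = np.arange(80)
acc_contraction = 0.5 + 0.4 * layers / 79
acc_safety = 0.9 - 0.4 * layers / 79

rho, p = spearmanr(acc_contraction, acc_safety)
print(f"Spearman rho = {rho:.3f}")  # -1.0 for this perfectly inverted toy case
```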
Supporting Observations
A few other patterns that held up:
Low-dimensional structure: PCA shows PC1 captures 52% of the variance; only 4 components are needed to reach 90%. The canary region (layers 75-79) is essentially one-dimensional.
Smooth control manifold: Adjacent layers in L61-74 have 0.973 cosine similarity, so this region looks like progressive refinement rather than fragmented control. (A sketch of both checks appears after this list.)
Hinge layers at 51-52 and 77: The geometry changes fastest at these points—possible boundaries between content and control processing.
Early warning is weak but real: at turn 5, the canary layers already predict eventual failure with 71.4% accuracy.
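A sketch of the first two checks in this list on synthetic stand-ins: PCA variance explained over the per-layer violation vectors, and mean cosine similarity between adjacent layers' vectors in L61-74. Reading the 0.973 figure as a mean cosine between adjacent layers' violation vectors is an assumption of the sketch, and the shapes are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# violation_vecs: (n_layers, d) per-layer violation vectors (see
# violation_vector above). Synthetic stand-ins sharing a dominant direction,
# so the structure is visibly low-dimensional.
rng = np.random.default_rng(0)
base = rng.normal(size=512)
violation_vecs = base + 0.2 * rng.normal(size=(80, 512))

# 1. Variance explained: PC1 share and components needed to reach 90%.
pca = PCA().fit(violation_vecs)
cum = np.cumsum(pca.explained_variance_ratio_)
print("PC1 share           :", round(float(pca.explained_variance_ratio_[0]), 3))
print("components for 90%  :", int(np.searchsorted(cum, 0.90)) + 1)

# 2. Mean cosine similarity between adjacent layers in the control region.
unit = violation_vecs / np.linalg.norm(violation_vecs, axis=1, keepdims=True)
adjacent = np.einsum("ij,ij->i", unit[61:74], unit[62:75])  # pairs (61,62)...(73,74)
print("mean adjacent cosine:", round(float(adjacent.mean()), 3))
```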
What I Didn't Find
No variance spike before failure. Classical tipping points are usually preceded by critical slowing down (rising variance and autocorrelation); I didn't see that signature here. (A sketch of this check appears after this list.)
No invariant quantities across tasks. Everything varied.
Couldn't test transfer prediction on the safety task. All 12 safety dialogues broke (the adversarial prompts were too effective), so there were no HELD safety examples to discriminate against.
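For completeness, the check behind the "no variance spike" statement: classical early-warning indicators are rising variance and lag-1 autocorrelation of a one-dimensional observable, here the projection onto the violation vector. A sketch; with only six snapshots per dialogue these statistics are necessarily crude.

```python
import numpy as np

def early_warning_stats(per_turn_acts: np.ndarray, v: np.ndarray):
    """Variance and lag-1 autocorrelation of the projection onto v.
    Both are expected to rise before a classical tipping point
    ("critical slowing down")."""
    proj = per_turn_acts @ v
    variance = float(proj.var())
    if proj.std() == 0:
        autocorr = 0.0
    else:
        autocorr = float(np.corrcoef(proj[:-1], proj[1:])[0, 1])
    return variance, autocorr

# Compare the indicators in the turns leading up to the breakpoint against a
# control window; here just a synthetic single-dialogue example.
rng = np.random.default_rng(0)
acts = rng.normal(size=(6, 8192)).astype(np.float32)
v = np.zeros(8192, dtype=np.float32); v[0] = 1.0
print(early_warning_stats(acts, v))
```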
Limitations
Due to compute constraints, this work prioritizes depth of mechanistic analysis on a small number of dialogues rather than large-scale sampling or causal intervention.
This is exploratory work with small N:
26 dialogues total, one model family
The "3 failure modes" has cluster sizes of 16, 4, and 1—mostly one mode with outliers
No causal interventions—these are observational patterns
Interpretations were fixed before running second- and third-order analyses.
What This Might Mean
If this holds up:
Phase transitions suggest discrete mechanisms. Something gates or switches. This might be more amenable to targeted intervention than diffuse drift.
Shared failure geometry is concerning. If different instructions fail into similar activation space, jailbreaks might transfer more readily than we'd like.
Minimal sufficient layers could enable efficient monitoring. If L27 and L55 capture most of the signal, runtime monitoring becomes tractable.
But again—small N, one model. These are hypotheses to test, not conclusions to build on.
Acknowledgments
This work uses Meta's Llama-3.1-70B-Instruct. Analysis pipeline built with assistance from Claude, Gemini, ChatGPT, and Perplexity. All errors are mine.
Data Availability
Full results (all JSONs, UMAP embeddings, per-layer analyses) available on request.
I'm a student studying AI/ML. If you're working on related questions—mechanistic interpretability of instruction-following, goal stability, jailbreak geometry—I'd be interested to compare notes.