Grokking in Multi-Probe DPO: When the Model Learns Something Your Probes Can't See

caiovicentino

Rejected for the following reason(s):

Not obviously not Language Model.

Read full explanation

TL;DR: We trained Qwen3.6-27B with multi-probe DPO using two activation probes (FabricationGuard + ReasoningGuard) as the preference signal. After training, those original probes show AUROC ≈ 0.50 — by their own metric, the model learned nothing. But fresh probes trained on the same checkpoints reveal a clean grokking-style progression: AUROC rises from 0.472 to 0.528 across 200 steps, with a phase-transition ratio of 2.6 (late-phase slope vs early-phase slope). The model didn't fail to learn. It learned an axis orthogonal to what we were measuring.

This matters because: probe-based safety evaluations of post-trained models give a false sense of security. A model trained against a probe will appear "safe" by that probe's reading post-training, while the underlying behavior may have shifted along directions the probe doesn't see. Fresh-probe analysis is necessary, not optional, in the eval stack.

The setup

Late April 2026, we were running multi-probe DPO on Qwen3.6-27B as part of a broader research line on probe-gated training (we'll come back to the broader portfolio at the end). The setup was straightforward:

Two probes pre-trained on Qwen3.6-27B activations:
- FabricationGuard (FG) at L31, trained on HaluEval — flags hallucination
- ReasoningGuard (RG) at L55, trained on GSM8K — flags weak reasoning
DPO with combined preference signal: chosen examples score lower on both probes
200 steps, 10 checkpoints saved (every 20 steps)

The expectation: probes go down (low FG = factual, low RG = sound reasoning), behavior gets better, paper writes itself.

What actually happened: probes didn't go down. They stayed essentially flat throughout training. By the literal AUROC reading of FG and RG on held-out data, the DPO did nothing.

This was confusing. The training loss had moved (~0.23 descent over 5 epochs). Generation behavior on a few sample prompts looked subtly different. But the probes — the things we were optimizing against — weren't budging.

The bug we almost missed

Before we could even ask "did the model learn something orthogonal?", we tripped on infrastructure.

Initial retrospective analysis (nb39) ran the probes against four checkpoints and got identical FG/RG scores at every checkpoint. Bit-for-bit identical. That's not "the model didn't learn" — that's "we're not looking at the trained model at all."

Root cause: Qwen3.6 PEFT-save creates state-dict keys with a .language_model. infix (e.g., base_model.model.language_model.layers.31.self_attn.q_proj.lora_A.weight). Our reload code expected the dense format without the infix. PeftModel.from_pretrained() silently failed to apply the adapter — no error, no warning, just a base model pretending to be a checkpoint.

Fix + verification:

# strip .language_model. infix before PeftModel.from_pretrained()
state = {k.replace('.language_model.', '.'): v for k, v in state.items()}

# verify the adapter actually loaded — max logit-diff must exceed tolerance
with torch.no_grad():
    base_logits = base_model(test_input).logits
    loaded_logits = loaded_model(test_input).logits
    diff = (base_logits - loaded_logits).abs().max().item()
assert diff > 0.01, f"adapter silently failed to load: max_diff={diff:.4f}"

Pre-fix we measured diff = 0.000 (adapter not applied). Post-fix diff = 0.654. Anyone working with Qwen3.6 LoRA save/reload pipelines should run this sanity check — without it, the failure mode is silent. This bug invalidated about 10 hours of prior eval work and the original nb39 analysis.

The actual finding

After fixing the loader, nb41 v2 ran a forward-only sweep across 11 checkpoints (including step 0) on a held-out pair set. For each checkpoint, two readings:

Original probes (FG, RG): pretrained, applied to activations from the trained checkpoint
Fresh probe: a new linear probe trained on the trained checkpoint's activations to discriminate the same hallucination-vs-correct labels

The original probes stayed flat (FG ~0.50, RG ~0.34 throughout — the absolute level isn't the headline; the invariance is).

The fresh probe AUROC, in contrast:

Step	Fresh-probe AUROC
0	0.472
20	0.486
40	0.488
60	0.491
80	0.494
100	0.498
120	0.505
140	0.511
160	0.512
180	0.513
200	0.528

The progression is smooth, monotonic, and accelerating. Phase-transition ratio (late-half slope divided by early-half slope) = 2.596. The model is increasingly carrying information that a fresh linear probe can pick up — information that the original FG/RG probes cannot.

Fresh-probe AUROC across DPO checkpoints

For comparison, the same data through the original probes is essentially a flat line (modulo noise) — a near-perfect Goodhart picture: the trained model has become invisible to the metric it was trained against, while gaining something measurable on a fresh metric.

The construct-then-compress reading

The pattern is consistent with what Lei & Xu (2025) call construct-then-compress in the grokking literature: representations build up slowly, then snap into a task-aligned subspace. In their framework, the slow phase is the "construct" stage where features accumulate; the fast phase is "compress" — the model finding the efficient axis.

What we're seeing in DPO is similar in spirit, but with a twist: the construct-then-compress doesn't move toward our chosen probe axis. It moves toward an orthogonal axis. The model is finding a way to satisfy the preference signal that doesn't fire either of the probes we're using as proxies.

This is technically a Goodhart story, but a strange one. Goodhart's law usually shows up as the model gaming a metric. Here, the model is satisfying the metric (the DPO preference signal trained against the probes) by finding a direction in representation space that is functionally aligned with the preference but mechanically uncorrelated with the probes' decision boundary.

The phase-transition ratio of 2.6 also tells us this is real learning, not just slow drift. A pure drift would give roughly equal slopes early and late. A 2.6× acceleration is the signature of representation reorganization happening in the second half of training.

Why this matters for safety evaluation

Most current safety evaluations of post-trained models implicitly assume the probe (or classifier, or detector) used for evaluation captures the relevant axis. If you train a model with RLHF-style methods that include a probe-derived signal — even indirectly through reward models — you should expect the trained model to look "safer" by that probe's reading without necessarily being safer in deployment.

The defense is fresh-probe analysis: after training, train a new probe on the trained model's own activations with the same labels. If the fresh probe finds signal that the original probe can't, you've quantified the orthogonal-direction learning. If the fresh probe also finds nothing, then you can claim the trained model genuinely doesn't carry the signal anymore.

We're folding this into the ProbeBench evaluation framework as a "Goodhart-resistance" axis — measuring fresh-probe lift across training checkpoints as a standardized metric. Our own probes (FabricationGuard 0.32, ReasoningGuard 0.30) score modestly on this axis, which is its own honest negative we're publishing transparently.

What we know and don't know

What this result does show:

A 27B reasoning model under multi-probe DPO learns something orthogonal to the probes used as preference signal
The orthogonal-direction signal is detectable by fresh probes trained on the same labels
The progression is non-trivially accelerating (phase ratio 2.6), consistent with grokking-style dynamics
This is reproducible and the artifacts are public

What it doesn't show yet:

Whether the orthogonal direction corresponds to actually-better behavior or actually-worse behavior or just-different behavior. Behavior eval is preliminary and underpowered (n=25 per task).
Whether the same pattern appears in other model families (Llama, Mistral, Gemma) — single-model finding so far.
Whether different probe combinations produce different orthogonal axes (the "construct" target may be probe-pair specific).
Whether fresh-probe AUROC = 0.528 corresponds to a meaningful capability gain or noise. The absolute level is modest; the relative progression vs original probes is the headline.

These are the obvious next experiments and we'd be happy to see other groups attack any of them.

Methodology references

Training: nb37 v2 — DPO on Qwen3.6-27B (LoRA), 5 epochs, FG+RG combined preference, 10 checkpoints
Forward sweep: nb41 v2 — held-out pair set, capture L31 + L55 at end-of-think, score with original probes + train fresh probe per-checkpoint
All code, data, checkpoints public:
- DPO checkpoints + pairs: [caiovicentino1/openinterp-37v2-multiprobe-dpo-extended](https://huggingface.co/datasets/caiovicentino1/openinterp-37v2-multiprobe-dpo-extended)
- Forward-sweep results: [caiovicentino1/openinterp-41v2-grokking-extended](https://huggingface.co/datasets/caiovicentino1/openinterp-41v2-grokking-extended)
- Probes used: [FabricationGuard-linearprobe-qwen36-27b](https://huggingface.co/datasets/caiovicentino1/FabricationGuard-linearprobe-qwen36-27b), [ReasoningGuard-linearprobe-qwen36-27b](https://huggingface.co/datasets/caiovicentino1/ReasoningGuard-linearprobe-qwen36-27b)

Where this fits in the broader picture

This is one of several interpretability experiments we've been running on Qwen3.6-27B over the past month. Some other findings from the same line of work, briefly:

Multi-probe ensemble lift does NOT generalize OOD (nb46). In-distribution +6.7pp lift collapses to ~0 across TruthfulQA, StrategyQA, TriviaQA. We walked back the ProbePack universal-middleware framing publicly.
Memory context suppresses CoT entry rate in reasoning models (nb47). Adding retrieved memories drops Qwen3.6-27B's <think> entry from 73% to ~50%. Underexplored interaction between RAG and reasoning models.
Multi-probe GRPO at scale shows non-monotonic learning (nb43). Peak reward at ~77% of training, drift after. Argues for validation-based early stopping in probe-gated RL.

The grokking finding here is one of the cleaner positives. Read together with the negatives, the picture we're forming is: multi-probe approaches that look promising at small scale or in-distribution often need careful methodological work to hold up. Fresh-probe analysis is part of that work.

Acknowledgments

Built on Qwen3.6 (Alibaba), the HaluEval benchmark, and the broader interpretability literature on grokking (Power et al. 2022; Nanda et al. 2023; Lei & Xu 2025) and Goodhart-resistance in alignment (Skalse et al. 2022). Code and data Apache-2.0.