import torch

# `state` is the adapter state dict read from the saved checkpoint.
# strip .language_model. infix before PeftModel.from_pretrained()
state = {k.replace('.language_model.', '.'): v for k, v in state.items()}

# verify the adapter actually loaded: max logit-diff must exceed tolerance
# (`base_model`, `loaded_model`, `test_input` are assumed to be defined, with
# `base_model` a separate adapter-free instance)
with torch.no_grad():
    base_logits = base_model(test_input).logits
    loaded_logits = loaded_model(test_input).logits
diff = (base_logits - loaded_logits).abs().max().item()
assert diff > 0.01, f"adapter silently failed to load: max_diff={diff:.4f}"
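The one-line key rewrite above does nothing visible if the infix never matches, and it could in principle map two keys onto the same name. A minimal sketch of a more defensive variant follows. The helper name `strip_infix` and its error behavior are illustrative assumptions, not part of our pipeline or of PEFT, and the dictionary values are placeholder strings rather than tensors.

```python
def strip_infix(state, infix=".language_model.", replacement="."):
    """Return a copy of `state` with `infix` replaced in every key.

    Raises if no key matched (probably the wrong infix) or if two keys
    collapse into the same name (which would silently drop an entry).
    """
    remapped = {}
    hits = 0
    for key, value in state.items():
        new_key = key.replace(infix, replacement)
        hits += new_key != key
        if new_key in remapped:
            raise KeyError(f"key collision after rewrite: {new_key}")
        remapped[new_key] = value
    if hits == 0:
        raise ValueError(f"no key contained {infix!r}; nothing was rewritten")
    return remapped


# Example key format taken from the post; the value is a stand-in for a tensor.
demo_state = {
    "base_model.model.language_model.layers.31.self_attn.q_proj.lora_A.weight": "tensor-placeholder",
}
print(list(strip_infix(demo_state)))
```

Raising on zero matches is a design choice: it turns "the checkpoint was already in the expected format" into a loud event, which may or may not be what you want in a pipeline that handles both formats.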
TL;DR: We trained Qwen3.6-27B with multi-probe DPO using two activation probes (FabricationGuard + ReasoningGuard) as the preference signal. After training, those original probes show AUROC ≈ 0.50 — by their own metric, the model learned nothing. But fresh probes trained on the same checkpoints reveal a clean grokking-style progression: AUROC rises from 0.472 to 0.528 across 200 steps, with a phase-transition ratio of 2.6 (late-phase slope vs early-phase slope). The model didn't fail to learn. It learned an axis orthogonal to what we were measuring.
This matters because: probe-based safety evaluations of post-trained models give a false sense of security. A model trained against a probe will appear "safe" by that probe's reading post-training, while the underlying behavior may have shifted along directions the probe doesn't see. Fresh-probe analysis is necessary, not optional, in the eval stack.
The setup
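In late April 2026, we were running multi-probe DPO on Qwen3.6-27B as part of a broader research line on probe-gated training (we'll come back to the broader portfolio at the end). The setup was straightforward: LoRA-based DPO for 5 epochs, with the combined FabricationGuard (FG) + ReasoningGuard (RG) signal defining the preference, and 10 checkpoints saved along the way.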
The expectation: probes go down (low FG = factual, low RG = sound reasoning), behavior gets better, paper writes itself.
What actually happened: probes didn't go down. They stayed essentially flat throughout training. By the literal AUROC reading of FG and RG on held-out data, the DPO did nothing.
This was confusing. The training loss had moved (~0.23 descent over 5 epochs). Generation behavior on a few sample prompts looked subtly different. But the probes — the things we were optimizing against — weren't budging.
The bug we almost missed
Before we could even ask "did the model learn something orthogonal?", we tripped on infrastructure.
Initial retrospective analysis (`nb39`) ran the probes against four checkpoints and got identical FG/RG scores at every checkpoint. Bit-for-bit identical. That's not "the model didn't learn"; that's "we're not looking at the trained model at all."
Root cause: Qwen3.6 PEFT-save creates state-dict keys with a `.language_model.` infix (e.g., `base_model.model.language_model.layers.31.self_attn.q_proj.lora_A.weight`). Our reload code expected the dense format without the infix. `PeftModel.from_pretrained()` silently failed to apply the adapter: no error, no warning, just a base model pretending to be a checkpoint.
Fix + verification: strip the infix from the state-dict keys before calling `PeftModel.from_pretrained()`, then assert that the maximum absolute logit difference between the base model and the reloaded model exceeds a small tolerance (we used 0.01).
Pre-fix we measured `diff = 0.000` (adapter not applied). Post-fix, `diff = 0.654`. Anyone working with Qwen3.6 LoRA save/reload pipelines should run this sanity check; without it, the failure mode is silent. This bug invalidated about 10 hours of prior eval work and the original `nb39` analysis.
The actual finding
After fixing the loader, `nb41 v2` ran a forward-only sweep across 11 checkpoints (including step 0) on a held-out pair set. For each checkpoint we took two readings: the original FG/RG probes scored as-is, and a fresh probe trained on that checkpoint's own activations.
The original probes stayed flat (FG ~0.50, RG ~0.34 throughout; the absolute level isn't the headline, the invariance is).
The fresh probe AUROC, in contrast:
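| Step | Fresh-probe AUROC |
|------|-------------------|
| 0    | 0.472 |
| 20   | 0.486 |
| 40   | 0.488 |
| 60   | 0.491 |
| 80   | 0.494 |
| 100  | 0.498 |
| 120  | 0.505 |
| 140  | 0.511 |
| 160  | 0.512 |
| 180  | 0.513 |
| 200  | 0.528 |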
The progression is smooth, monotonic, and accelerating. Phase-transition ratio (late-half slope divided by early-half slope) = 2.596. The model is increasingly carrying information that a fresh linear probe can pick up — information that the original FG/RG probes cannot.
[Figure: Fresh-probe AUROC across DPO checkpoints]
For comparison, the same data through the original probes is essentially a flat line (modulo noise) — a near-perfect Goodhart picture: the trained model has become invisible to the metric it was trained against, while gaining something measurable on a fresh metric.
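For readers who want to compute a late-half over early-half slope ratio on their own checkpoint sweeps, here is a minimal sketch. It assumes least-squares slopes and a split at the middle checkpoint, with the midpoint shared by both halves. This post does not specify how the slopes were fitted (endpoints versus least squares, rounded versus raw AUROC), so the number this prints depends on those choices and need not equal the reported 2.596. The function names are illustrative.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def phase_transition_ratio(steps, values):
    """Late-half slope divided by early-half slope; halves share the midpoint."""
    mid = len(steps) // 2
    early = ols_slope(steps[: mid + 1], values[: mid + 1])
    late = ols_slope(steps[mid:], values[mid:])
    return late / early


# Rounded fresh-probe AUROC values from the table above (steps 0 to 200).
steps = list(range(0, 201, 20))
table_auroc = [0.472, 0.486, 0.488, 0.491, 0.494, 0.498,
               0.505, 0.511, 0.512, 0.513, 0.528]
print(f"late/early slope ratio: {phase_transition_ratio(steps, table_auroc):.3f}")
```

Because the ratio divides two small slopes, it is sensitive to rounding and to where the split falls, so it is worth reporting the split rule alongside the number.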
The construct-then-compress reading
The pattern is consistent with what Lei & Xu (2025) call construct-then-compress in the grokking literature: representations build up slowly, then snap into a task-aligned subspace. In their framework, the slow phase is the "construct" stage where features accumulate; the fast phase is "compress" — the model finding the efficient axis.
What we're seeing in DPO is similar in spirit, but with a twist: the construct-then-compress doesn't move toward our chosen probe axis. It moves toward an orthogonal axis. The model is finding a way to satisfy the preference signal that doesn't fire either of the probes we're using as proxies.
This is technically a Goodhart story, but a strange one. Goodhart's law usually shows up as the model gaming a metric. Here, the model is satisfying the metric (the DPO preference signal trained against the probes) by finding a direction in representation space that is functionally aligned with the preference but mechanically uncorrelated with the probes' decision boundary.
The phase-transition ratio of 2.6 also tells us this is real learning, not just slow drift. A pure drift would give roughly equal slopes early and late. A 2.6× acceleration is the signature of representation reorganization happening in the second half of training.
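One way to put a number on "orthogonal" is the cosine similarity between the original probe's weight vector and a fresh probe's weight vector at the same layer. A minimal sketch, assuming both probes are linear and their weights are available as plain lists of floats; the vectors below are made-up placeholders, not the FG/RG weights.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder directions: the "fresh" probe reads a different coordinate.
original_direction = [1.0, 0.0, 0.0, 0.0]
fresh_direction = [0.0, 0.0, 1.0, 0.0]
print(cosine_similarity(original_direction, fresh_direction))
```

In high-dimensional activation spaces two unrelated directions are already close to orthogonal, so a near-zero cosine is only informative when compared against a baseline such as the cosine between probes trained on different random seeds.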
Why this matters for safety evaluation
Most current safety evaluations of post-trained models implicitly assume the probe (or classifier, or detector) used for evaluation captures the relevant axis. If you train a model with RLHF-style methods that include a probe-derived signal — even indirectly through reward models — you should expect the trained model to look "safer" by that probe's reading without necessarily being safer in deployment.
The defense is fresh-probe analysis: after training, train a new probe on the trained model's own activations with the same labels. If the fresh probe finds signal that the original probe can't, you've quantified the orthogonal-direction learning. If the fresh probe also finds nothing, then you can claim the trained model genuinely doesn't carry the signal anymore.
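The fresh-probe procedure described above can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not our evaluation code: the "activations" are synthetic 4-dimensional vectors whose label signal lives only in dimension 2, the "original" probe is a hypothetical fixed direction along dimension 0, and the fresh probe is a plain logistic regression fitted by gradient descent. All function names are illustrative.

```python
import math
import random


def auroc(scores, labels):
    """Rank-based AUROC: P(positive score > negative score), ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sigmoid(z):
    # Branching keeps math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear_scores(w, b, X):
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]


def train_fresh_probe(X, y, lr=0.1, epochs=200):
    """Logistic regression fitted by full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [sigmoid(z) - t for z, t in zip(linear_scores(w, b, X), y)]
        for j in range(d):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errs, X)) / n
        b -= lr * sum(errs) / n
    return w, b


def make_synthetic_activations(n, rng):
    """Toy 4-d 'activations' whose label signal lives only in dimension 2."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(4)]
        x[2] += 1.5 if label == 1 else -1.5
        X.append(x)
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic_activations(200, rng)
X_test, y_test = make_synthetic_activations(200, rng)

# Hypothetical "original" probe: reads dimension 0, which carries no signal here.
original_w, original_b = [1.0, 0.0, 0.0, 0.0], 0.0
fresh_w, fresh_b = train_fresh_probe(X_train, y_train)

original_auroc = auroc(linear_scores(original_w, original_b, X_test), y_test)
fresh_auroc = auroc(linear_scores(fresh_w, fresh_b, X_test), y_test)
print(f"original probe AUROC: {original_auroc:.3f}")
print(f"fresh probe AUROC:    {fresh_auroc:.3f}")
```

In a real sweep, `X` would be activations captured from each checkpoint and `y` the same labels used for the original probes. The fresh probe should be scored on a split it was not trained on, since a probe evaluated on its own training data can show signal that is only overfitting.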
We're folding this into the ProbeBench evaluation framework as a "Goodhart-resistance" axis: fresh-probe lift across training checkpoints, reported as a standardized metric. Our own probes (FabricationGuard 0.32, ReasoningGuard 0.30) score modestly on this axis, an honest negative result that we are publishing as-is.
What we know and don't know
What this result does show:
What it doesn't show yet:
These are the obvious next experiments and we'd be happy to see other groups attack any of them.
Methodology references
- `nb37 v2`: DPO on Qwen3.6-27B (LoRA), 5 epochs, FG+RG combined preference, 10 checkpoints
- `nb41 v2`: held-out pair set, capture L31 + L55 at end-of-think, score with original probes + train fresh probe per-checkpoint
- [caiovicentino1/openinterp-37v2-multiprobe-dpo-extended](https://huggingface.co/datasets/caiovicentino1/openinterp-37v2-multiprobe-dpo-extended)
- [caiovicentino1/openinterp-41v2-grokking-extended](https://huggingface.co/datasets/caiovicentino1/openinterp-41v2-grokking-extended)
- [FabricationGuard-linearprobe-qwen36-27b](https://huggingface.co/datasets/caiovicentino1/FabricationGuard-linearprobe-qwen36-27b), [ReasoningGuard-linearprobe-qwen36-27b](https://huggingface.co/datasets/caiovicentino1/ReasoningGuard-linearprobe-qwen36-27b)
Where this fits in the broader picture
This is one of several interpretability experiments we've been running on Qwen3.6-27B over the past month. Some other findings from the same line of work, briefly:
`<think>` entry from 73% to ~50%. Underexplored interaction between RAG and reasoning models.
The grokking finding here is one of the cleaner positives. Read together with the negatives, the picture we're forming is: multi-probe approaches that look promising at small scale or in-distribution often need careful methodological work to hold up. Fresh-probe analysis is part of that work.
Acknowledgments
Built on Qwen3.6 (Alibaba), the HaluEval benchmark, and the broader interpretability literature on grokking (Power et al. 2022; Nanda et al. 2023; Lei & Xu 2025) and Goodhart-resistance in alignment (Skalse et al. 2022). Code and data Apache-2.0.