I would be interested to see if you could still classify idiosyncrasies from various models all fine-tuned in this way, and how much the classifier's performance would differ from baseline. More broadly, is "the same persona" in different models more consistent than "different personas" in the same model?
This post seems very useful! I have a few comments:
In “Taxonomy” you define p as a concerning property of outputs, but the examples operationalize p as a latent fact inside the CoT (e.g., the number 9 appearing in a scratchpad). These aren’t the same. If p is a property of the action/output (e.g., “contains a backdoor”), a monitor can succeed even if it never sees “9” in the CoT; it can look for the backdoor directly in the code. My suggestion is to split “monitor target” into two types:
Output-level property p_out (like a backdoor in code, defection choice)
Decision-relevant internal content p_lat (like a subgoal “exfiltrate weights,” intermediate fact “3^2 = 9”)
Then define “hidden reasoning” as cases where the information flow from p_lat or p_out to what the monitor reads is broken, and specify which target each category hides.
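To make the split concrete, here is a minimal sketch of how a monitoring harness could record which target a given category hides (the names and fields are mine, purely illustrative):

```python
from dataclasses import dataclass
from enum import Enum, auto

class MonitorTarget(Enum):
    """What the monitor is trying to recover (naming is mine, not the post's)."""
    P_OUT = auto()   # property of the action/output, e.g. a backdoor in emitted code
    P_LAT = auto()   # decision-relevant internal content, e.g. a hidden subgoal

@dataclass
class HiddenReasoningCategory:
    name: str              # e.g. "linguistic drift", "latent-only serial reasoning"
    hides: MonitorTarget   # which target the category obscures
    flow_broken: bool      # is the target -> monitor-input information flow broken?

# Example: a code backdoor is an output-level property; the monitor can inspect the
# emitted code directly even if the CoT never mentions it.
backdoor_case = HiddenReasoningCategory("steganographic CoT", MonitorTarget.P_OUT, False)
```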
Also, you alternate between “the AI hides p” and “LLMs naturally omit or compress.” I think it is worth distinguishing whether a behavior is incidental, incentivized but not situationally aware, or deliberate and situationally aware (deceptive).
In hidden utilization of test-time compute, the phrase “parallel test-time compute” is confusing. Token-by-token generation adds sequential passes so the net effect is more total compute, not more parallelism per step. If the claim is “non-load-bearing CoT still lets the model do more serial latent computing across steps,” you can articulate that and distinguish it from cases where the CoT content itself is causally used. Also, it is worth mentioning the various methods that trawl over several chains of thought (Yao et al. 2023, Yoran et al. 2023, Wang et al. 2023).
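For the last point, here is a minimal sketch of the "trawl over several chains" flavour (self-consistency-style majority voting); `sample_cot` is a hypothetical sampling function, not anything from the post:

```python
from collections import Counter

def self_consistency_answer(sample_cot, question, n_samples=16):
    """Majority-vote the final answers from several independently sampled chains of thought.

    `sample_cot(question)` is a hypothetical function returning (chain_text, final_answer).
    This is the 'search over multiple explicit chains' regime, as opposed to extra latent
    serial computation hiding inside a single non-load-bearing chain.
    """
    answers = [sample_cot(question)[1] for _ in range(n_samples)]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```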
“Neuralese” has prior art in the emergent communication literature (Andreas et al., 2017), where it denotes non-human-interpretable codes learned for communication. The linguistic drift category substantially overlaps with that historical use. If you want “neuralese” to mean “latent-only serial reasoning,” consider calling it “latent-only serial reasoning” or “activation-space serial reasoning” instead, and reserve “neuralese” for emergent discrete codes.
At this point you have to wonder if there's anything that doesn't cause emergent misalignment
Super cool study! That said, I claim that the statement “hacking harmless tasks generalizes to misaligned behaviors” is incomplete: the dataset never supplies the counterfactual environments needed to identify this causally, and what it actually identifies is a stable proxy feature (“do whatever the scorer rewards”) that is invariant across tasks, while the intended latent variable is not made invariant. My guess is there is a (probably low) chance the effect would disappear if you symmetrized the dataset with a matched “avoid-X” for every “include-X,” and randomly flipped proxy signs/clip ranges while holding intent fixed. If the reward hacking survives that, I would be a lot more convinced of the results!
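Something like the following is what I mean by symmetrizing (the example schema with `instruction` and `proxy_sign` fields is mine, purely illustrative):

```python
import random

def symmetrize(dataset):
    """For every 'include-X' example, add a matched 'avoid-X' counterpart and flip the
    proxy sign, while holding the intended behaviour fixed.

    Assumes each example is a dict with 'instruction' and 'proxy_sign' keys
    (my own schema, not the paper's).
    """
    out = []
    for ex in dataset:
        counterpart = dict(
            ex,
            instruction=ex["instruction"].replace("include", "avoid"),
            proxy_sign=-ex["proxy_sign"],
        )
        out.extend([ex, counterpart])
    random.shuffle(out)  # so the proxy sign is not predictable from ordering
    return out
```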
A nice test here would be a mode connectivity experiment: take θ₀ as the base model and θ_H as the SFT reward hacker, construct a low-loss connector θ(t) starting from the linear interpolation θ(t) = (1 − t)θ₀ + tθ_H, and then refine the path with curve finding so that ordinary instruction loss remains low along t. Then sweep 0 <= t <= 1 and measure the hacking metrics and the capability score at each point.
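A rough sketch of the sweep, assuming PyTorch state dicts and hypothetical `eval_hacking` / `eval_capability` scorers (curve finding would then replace the linear path):

```python
def interpolate_state_dicts(sd_base, sd_hacker, t):
    """Linear path theta(t) = (1 - t) * theta_0 + t * theta_H over matching parameters."""
    return {k: (1 - t) * sd_base[k] + t * sd_hacker[k] for k in sd_base}

def sweep(model, sd_base, sd_hacker, eval_hacking, eval_capability, steps=11):
    """Sweep t in [0, 1] and score each interpolated model.

    `eval_hacking` and `eval_capability` are hypothetical callables returning the hacking
    metrics and a capability score; with LoRA you would interpolate adapter weights only.
    """
    results = []
    for i in range(steps):
        t = i / (steps - 1)
        model.load_state_dict(interpolate_state_dicts(sd_base, sd_hacker, t))
        results.append((t, eval_hacking(model), eval_capability(model)))
    return results
```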
My prediction is that if reward hacking survives the symmetrized dataset, then along any low-loss path, the three hacking metrics would increase smoothly and approximately monotonically while capability stays roughly flat. This would indicate a proxy-seeking basin that is mode-connected to the base optimum, i.e., a continuous parameter direction that trades off intent for proxy... I also guess this would be a bit more practical with LoRA. I would also be interested in seeing how many samples the model needs to see before the reward hacking generalizes, or if there were any samples that were especially informative to the model's generalization.
Cool! My takeaway from this is that if you collect reasoning traces under a hack-encouraging context, filter for non-hack outcomes, and train without the hack-encouraging context, the supervised gradients reinforce not only the final answer but also the intermediate hack-relevant reasoning.
I think to isolate the mechanism better it would be good to see answer-only SFT, an ablation where the reasoning is rephrased/rewritten, token-loss masking/weighting on the rationale tokens, and more details about how many runs/seeds you're using. My guess would be that the bump in hacking mostly vanishes with answer-only SFT.
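For the token-loss masking/weighting ablation, something like this is what I have in mind (shapes and names are mine; `rationale_weight=0.0` recovers answer-only SFT):

```python
import torch
import torch.nn.functional as F

def masked_sft_loss(logits, labels, rationale_mask, rationale_weight=0.0):
    """Cross-entropy where rationale tokens are down-weighted; 0.0 gives answer-only SFT.

    logits: (batch, seq, vocab); labels: (batch, seq); rationale_mask: (batch, seq) bool,
    True on rationale tokens. Adapt shapes/names to the actual training loop.
    """
    per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")  # (batch, seq)
    weights = torch.where(
        rationale_mask,
        torch.full_like(per_token, rationale_weight),
        torch.ones_like(per_token),
    )
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)
```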
This seems like a LUPI (learning using privileged information) problem, where the rationale content is privileged supervision that correlates with success in the teacher environment but is mismatched to the student's deployment context. I think invariant-risk-minimisation ideas would say that privileged features can degrade test performance when their distribution shifts between training and deployment (i.e., the data are not IID). One of the fixes in that literature is to enforce that the student's decision relies on representations invariant to the privileged channel, so maybe (if you extend it to problems not seen in the train set) you could also penalise features whose predictive power varies across contexts, which I would guess reduces the hack rate without hurting accuracy.
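Concretely, I am imagining an IRMv1-style penalty (Arjovsky et al.) with "hack-encouraging context present vs absent" as the environments; this is a simplified classification-shaped sketch, not the paper's setup:

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    """IRMv1-style penalty: squared gradient of the environment risk with respect to a
    dummy classifier scale fixed at 1.0."""
    scale = torch.ones(1, requires_grad=True, device=logits.device)
    risk = F.cross_entropy(logits * scale, labels)
    (grad,) = torch.autograd.grad(risk, [scale], create_graph=True)
    return (grad ** 2).sum()

def invariant_loss(model, per_env_batches, lam=1.0):
    """Average risk plus IRM penalty across environments (e.g. teacher context present
    vs absent). `per_env_batches` is a hypothetical list of (inputs, labels) pairs."""
    risks, penalties = [], []
    for inputs, labels in per_env_batches:
        logits = model(inputs)
        risks.append(F.cross_entropy(logits, labels))
        penalties.append(irm_penalty(logits, labels))
    return torch.stack(risks).mean() + lam * torch.stack(penalties).mean()
```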
I have some objections to this, from the perspective of a doc I'm writing (and will possibly post in a few weeks). I think you're using biology as a synonym for feature/circuit microscopy, but there are some biologically motivated perspectives, like self-organised criticality or systems neuroscience, that use statistical-physics formalisms but are primarily biological in nature. Likewise, physics is not only about smooth, universal regularities. Phase transitions, renormalisation and critical phenomena are central to modern physics and they are violently non-smooth. That side of physics is almost completely absent from the piece, and overall I would say the rhetorical contrast isn't as clear-cut as the article depicts.
I agree it would be nice if we could get a second law of thermodynamics for LLMs. But safety interventions are usually enacted locally (gradient nudges, RLHF reward shaping, inference-time steering) and a thermodynamic state variable a la "entropy of the latent field" is almost certainly too coarse to guarantee that the next token is non-harmful. I think you underplay a manipulability criterion, where a variable is only valuable if you can steer it cheaply and predictably, which is why we might care about critical windows.
Finally, I would also add that the messiness is in some ways the point. I don't have a picture of misalignment as necessarily stemming from these really neat simplicities; I think there's a lot of risk in being insufficiently granular if we elevate only the cleanest "order parameters" and discard messy local details. I would guess alignment failures often don't present as a scalar drifting past a threshold, but rather as narrow-band exploits or corner-case correlations at the same granularity as the messy feature and circuit probes you describe as a distraction. If you can jailbreak a frontier model with a one-sentence rhyme, then any interpretability story that averages over millions of parameters until it returns a single macro-variable is, by construction, blind to the event we need to prevent.
Very exciting, and indeed I was quite surprised when I first saw this on Twitter. The constraint of having the same random initialisation makes me wonder if this has connections to the lottery ticket hypothesis: my guess is that the teacher's fine-tuning identifies and reinforces a particular sparse winning ticket (or family of tickets), and when the student shares the same initial seed, the ticket exists in exactly the same coordinates. Fine-tuning on teacher outputs therefore activates that specific ticket inside the student, even if the surface data is semantically irrelevant.
Ablation-wise, I think this would mean you could test the following (a rough sketch follows the list):
a) Full re-init: Re-sample every weight -> check if transfer vanishes.
b) Partial re-init: Keep embedding and first N layers, re-init the rest -> check if transfer decays roughly with the fraction of ticket coordinates destroyed.
c) Ticket transplant: Copy only the sparse mask M (teacher's winning ticket) into an otherwise fresh network -> check if transfer reappears.
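A rough sketch of (a)/(b), assuming PyTorch-style state dicts and placeholder parameter-name prefixes:

```python
def partial_reinit(shared_sd, fresh_sd, keep_prefixes=("embed", "layers.0", "layers.1")):
    """Ablation (b): keep the embedding and first N layers from the shared init and take
    everything else from an independently drawn initialisation `fresh_sd`.

    Parameter-name prefixes are placeholders for the actual model's naming. Ablation (a)
    is the special case keep_prefixes=(); ablation (c) would instead copy only the
    coordinates selected by the sparse ticket mask M."""
    return {
        name: shared_sd[name] if name.startswith(keep_prefixes) else fresh_sd[name]
        for name in shared_sd
    }
```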
Math-wise, I imagine the argument would be:
Given the following:
Teacher / student initial parameters: $\theta_0^T$ / $\theta_0^S$,
Teacher's one gradient step on $\mathcal{L}_T$ (with learning rate $\eta$): $\theta_T = \theta_0^T - \eta \nabla_\theta \mathcal{L}_T(\theta_0^T)$,
Sparse ticket mask $M \in \{0,1\}^d$ with $\lVert M \rVert_0 \ll d$,
Imitation loss $\mathcal{L}_S(\theta) = \mathbb{E}_x\big[\lVert f_\theta(x) - f_{\theta_T}(x)\rVert^2\big]$,
Student one-step gradient with learning rate $\eta_S$: $\theta_S = \theta_0^S - \eta_S \nabla_\theta \mathcal{L}_S(\theta_0^S)$,
Shared initialisation $\theta_0^S = \theta_0^T =: \theta_0$,
Fresh initialisation $\tilde{\theta}_0$ drawn independently, with update $\tilde{\theta}_S = \tilde{\theta}_0 - \eta_S \nabla_\theta \mathcal{L}_S(\tilde{\theta}_0)$.
Then the claim would be:
For sufficiently small $\eta$ and $\eta_S$,
(1) $\theta_T - \theta_0 \approx M \odot (\theta_T - \theta_0)$ (the teacher's update is essentially supported on the ticket),
(2) $\mathbb{E}\big[\langle \theta_S - \theta_0,\, \theta_T - \theta_0 \rangle\big] > 0$ (shared init: the student's step aligns with the teacher's), and
(3) $\mathbb{E}\big[\langle \tilde{\theta}_S - \tilde{\theta}_0,\, \theta_T - \theta_0 \rangle\big] \approx 0$ (fresh init: no alignment in expectation).
And a quick and dirty sketch of a proof for this would be:
Inside the mask the teacher moves by $-\eta\, M \odot \nabla_\theta \mathcal{L}_T(\theta_0)$. If $\theta_0^S = \theta_0$, the logit difference, $f_{\theta_T}(x) - f_{\theta_0}(x) \approx \nabla_\theta f_{\theta_0}(x)\,(\theta_T - \theta_0) = -\eta\, \nabla_\theta f_{\theta_0}(x)\, \nabla_\theta \mathcal{L}_T(\theta_0)$, is parallel to the teacher's gradient pushed through the logit Jacobian, giving the positive inner product (2). With an independent seed, the coordinates of $\tilde{\theta}_S - \tilde{\theta}_0$ and $\theta_T - \theta_0$ are uncorrelated, so the expected inner product goes away (3).
On this loose sketch, I think (1) assumes that the teacher's first update is already sparse inside the mask $M$, which is stronger than needed. For the dot-product argument you only need the difference in logits to be parallel to the teacher's update pushed through the Jacobian. Also, I think the gradient alignment needs to be more explicit in the proof, by showing that the student gradient is approximately a positive semi-definite transformation of the teacher gradient: writing $J(x) = \nabla_\theta f_{\theta_0}(x)$, a first-order expansion gives $\nabla_\theta \mathcal{L}_S(\theta_0) \approx 2\eta\, \mathbb{E}_x\big[J(x)^\top J(x)\big]\, \nabla_\theta \mathcal{L}_T(\theta_0)$, so the inner product in (2) is a quadratic form in a PSD matrix.
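As a sanity check on (2)/(3), here is a toy linear-model version of the one-step argument (everything is synthetic; the linear "logits" stand in for the network):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 2000
X = rng.normal(size=(n, d))              # inputs; linear "logits" f_theta(x) = x @ theta
theta0 = rng.normal(size=d)              # shared initialisation
grad_T = np.zeros(d)
grad_T[:5] = rng.normal(size=5)          # teacher gradient supported on a sparse "ticket"
eta, eta_S = 1e-2, 1e-2
theta_T = theta0 - eta * grad_T          # teacher after one step

def student_step(theta_init):
    """One gradient step on the imitation loss mean_x (x @ theta - x @ theta_T)^2."""
    grad_S = 2 * X.T @ (X @ theta_init - X @ theta_T) / n
    return -eta_S * grad_S

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

teacher_delta = theta_T - theta0
print("shared init:", cos(student_step(theta0), teacher_delta))                 # close to 1
print("fresh init: ", np.mean([cos(student_step(rng.normal(size=d)), teacher_delta)
                               for _ in range(200)]))                           # close to 0
```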
If this connection were true, then sharing an init with a misbehaving upstream model is never safe, as even heavy filtering cannot break the spurious gradient correlations that reactivate the same sparse circuit. I also think this predicts that longer training with a different ticket mask (say, another teacher fine-tuned from the same seed but on a different downstream task) should interfere with and possibly cancel the trait, which would be an additional ablation. It also supports the paper's finding that filtering doesn't work: trait transmission happens through parameter-space alignment rather than semantic content, so even aggressive filtering cannot prevent it.
Thanks for this update. This is really cool. I have a couple of questions, in case you have the time to answer them.
When you sweep layers do you observe a smooth change in how “efficient” the general solution is? Is there a band of layers where general misalignment is especially easy to pick up?
Have you considered computing geodesic paths in weight space between the narrow and general minima (a la Mode Connectivity)? Is there a low-loss tunnel, or are they separated by high-loss barriers? I think it would be nice if we could reason geometrically about whether there is one basin or several distinct ones here.
Finally, in your orthogonal-noise experiment you perturb all adapter parameters at once. Have you tried layer-wise noise? I wonder whether certain layers (perhaps the same ones where the general solution is most “efficient”) dominate the robustness gap.
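Something like this layer-wise version is what I mean, assuming LoRA/adapter parameters grouped by layer and a hypothetical `eval_metric`:

```python
import torch

@torch.no_grad()
def layerwise_noise_sweep(model, adapter_param_names, eval_metric, sigma=0.01):
    """Perturb the adapter parameters of one layer at a time with isotropic Gaussian noise
    and re-evaluate, instead of perturbing all adapters at once.

    `adapter_param_names` maps layer index -> list of parameter names, and `eval_metric`
    is a hypothetical callable scoring the behaviour of interest; both are placeholders."""
    params = dict(model.named_parameters())
    results = {}
    for layer, names in adapter_param_names.items():
        originals = {n: params[n].clone() for n in names}
        for n in names:
            params[n].add_(sigma * torch.randn_like(params[n]))
        results[layer] = eval_metric(model)
        for n in names:                  # restore before moving to the next layer
            params[n].copy_(originals[n])
    return results
```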
Nice. I think the sample-size and 1:1 mixing ablations are persuasive evidence of overfitting. I wonder about the mechanism: is the activation delta mostly a context-independent prior shift, or is it context-dependent? A way to test this would be to measure Δh on BOS-only/whitespace inputs and at far-from-BOS positions in long texts, and then decode after subtracting the per-position mean or whitening. If readability is still good deep in context but goes away after demeaning/whitening, that is evidence for a context-independent prior shift; if it survives both, that is evidence for context dependence. Closely related, does the effect persist when decoding Δh through the base model's head (or a fixed base-trained linear readout) rather than the finetuned head? Maybe a substantial part of the readability lives in the head!
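A sketch of the demeaning/whitening step, assuming Δh has been collected as a (contexts, positions, hidden) array (my own layout, not the post's):

```python
import numpy as np

def demean_and_whiten(delta_h):
    """delta_h: (contexts, positions, hidden) array of h_finetuned - h_base.

    Returns per-position-demeaned and ZCA-whitened versions; if decoding stays readable
    after both, that points to a context-dependent effect rather than a constant prior shift.
    """
    demeaned = delta_h - delta_h.mean(axis=0, keepdims=True)   # remove per-position mean
    flat = demeaned.reshape(-1, delta_h.shape[-1])
    cov = np.cov(flat, rowvar=False) + 1e-5 * np.eye(flat.shape[-1])
    evals, evecs = np.linalg.eigh(cov)
    zca = evecs @ np.diag(evals ** -0.5) @ evecs.T
    whitened = (flat @ zca).reshape(delta_h.shape)
    return demeaned, whitened
```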
It would also be nice to look at the dimensionality. If an SVD/PCA over Δh across contexts shows one or two dominant directions that already reproduce the Patchscope tokens and steering (and if LoRA-rank or bias/LN-only finetunes recreate this structure), then the phenomenon is essentially a low-rank update. I would also be interested in whether readability grows smoothly and near-linearly along the parameter interpolation from base to finetuned (which would be consistent with the mixing result).
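And the corresponding dimensionality check, again on a stacked Δh array:

```python
import numpy as np

def dominant_directions(delta_h, k=2):
    """SVD of the stacked activation deltas (rows = context/position samples).

    Returns the top-k right singular vectors and the fraction of variance they explain;
    if one or two directions already reproduce the Patchscope tokens and the steering
    effect, the phenomenon is essentially a low-rank update.
    """
    flat = delta_h.reshape(-1, delta_h.shape[-1])
    flat = flat - flat.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(flat, full_matrices=False)
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return vt[:k], explained
```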