Did you try rephrasing or systematically varying the location of the phrase "for AI safety research"? It could also just be that this phrase tends to get used in jailbreaks which were adversarially trained against.
Nice. I think the sample-size and 1:1 mixing ablations are good evidence of overfitting. I wonder about the mechanism also. Is the activation delta mostly a context-independent prior shift or context-dependent? One way to probe this would be to measure Δh on BOS-only/whitespace inputs and at far-from-BOS positions in long texts, then decode after subtracting the per-position mean or doing some whitening. If readability is still good deep in context but disappears after mean subtraction, that would suggest a context-independent prior shift; if it survives both, that is evidence for the effect being more context-dependent. Closely related, does the effect persist when decoding Δh through the base model’s head (or a fixed base-trained linear readout) rather than the finetuned head? Maybe a substantial part of the readability lives in the head!
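A minimal sketch of the kind of measurement I mean, assuming an HF base/finetuned checkpoint pair and a logit-lens style readout (model IDs, the corpus, and the norm/head access are placeholders to adapt):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, FT = "base-model-id", "finetuned-model-id"   # placeholders for the checkpoint pair
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
ft = AutoModelForCausalLM.from_pretrained(FT, output_hidden_states=True)

def delta_h(text, layer=-1):
    """Activation delta (finetuned minus base) at one layer for one input."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        h_b = base(ids).hidden_states[layer]     # [1, seq, d]
        h_f = ft(ids).hidden_states[layer]
    return (h_f - h_b).squeeze(0)                # [seq, d]

# Collect Δh over many long contexts (do the same for BOS-only / whitespace inputs).
deltas = [delta_h(t) for t in long_texts]        # long_texts: your long-document corpus
min_len = min(d.shape[0] for d in deltas)
stack = torch.stack([d[:min_len] for d in deltas])    # [n_contexts, min_len, d]

half = stack.shape[0] // 2
pos_mean = stack[:half].mean(dim=0)              # context-independent component, estimated on one half
far = stack[half:, -1]                           # far-from-BOS deltas from the held-out half

def decode(vec, model, k=10):
    """Read a residual-stream vector through a model's unembedding (logit-lens style)."""
    logits = model.lm_head(model.model.norm(vec))     # adapt norm/head access to your architecture
    return tok.convert_ids_to_tokens(logits.topk(k).indices.tolist())

print(decode(far[0], ft))                        # raw Δh for one held-out context
print(decode(far[0] - pos_mean[-1], ft))         # same, after removing the shared component
print(decode(far[0], base))                      # same Δh read through the base model's head
```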
It would also be nice to look at the dimensionality. If an SVD/PCA over Δh across contexts shows one or two dominant directions that already reproduce the Patchscope tokens and the steering effect (and if LoRA-rank or bias/LN-only finetunes recreate this structure), then the phenomenon is plausibly a low-rank-update effect. I would also be interested in whether readability grows smoothly and near-linearly along the parameter interpolation from base to finetuned (which would be consistent with the mixing result).
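Something like this for the dimensionality check, reusing the `stack` of Δh values from the sketch above (still just a sketch):

```python
import torch

# Flatten Δh across contexts and positions, center, and look at its spectrum.
flat = stack.reshape(-1, stack.shape[-1])          # [n*seq, d]
flat = flat - flat.mean(dim=0, keepdim=True)
U, S, Vt = torch.linalg.svd(flat, full_matrices=False)

explained = (S**2) / (S**2).sum()
print("variance explained by top-5 directions:", explained[:5].tolist())

# If one or two directions dominate, re-run the Patchscope / steering tests with
# Δh projected onto just those directions and see whether the readable tokens survive.
top_k = Vt[:2]                                     # [2, d], orthonormal rows
proj = lambda v: (v @ top_k.T) @ top_k             # rank-2 projection of a delta vector
```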
I would be interested to see if you could still classify idiosyncrasies from various models all fine-tuned in this way, and how different the performance of the classifier would be from baseline. More broadly, is "the same persona" in different models more consistent than "different personas" in the same model?
This post seems very useful! I have a few comments:
In “Taxonomy” you define p as a concerning property of outputs, but the examples operationalize p as a latent fact inside the CoT (e.g., the number 9 appearing in a scratchpad). These aren’t the same. If p is a property of the action/output (e.g., “contains a backdoor”), a monitor can succeed even if it never sees “9” in the CoT; it can look for the backdoor directly in the code. My suggestion is to split “monitor target” into two types:
Output-level property p_out (like a backdoor in code, defection choice)
Decision-relevant internal content p_lat (like a subgoal “exfiltrate weights,” intermediate fact “3^2 = 9”)
Then define “hidden reasoning” as cases where the information flow from p_lat or p_out to what the monitor reads is broken, and specify which target each category hides.
Also, you alternate between "the AI hides p" and "LLMs naturally omit or compress." I think it is worth distinguishing whether a behavior is incidental, incentivized but not situationally aware, or deliberate and situationally aware (deceptive).
In hidden utilization of test-time compute, the phrase “parallel test-time compute” is confusing. Token-by-token generation adds sequential forward passes, so the net effect is more total compute, not more parallelism per step. If the claim is “non-load-bearing CoT still lets the model do more serial latent computation across steps,” it would help to state that explicitly and distinguish it from cases where the CoT content itself is causally used. It is also worth mentioning the various methods that aggregate over several chains of thought (Yao et al. 2023, Yoran et al. 2023, Wang et al. 2023).
“Neuralese” has prior art in emergent communication literature (Andreas et al., 2017) where it denotes non-human-interpretable codes learned for communication. The linguistic drift category substantially overlaps with that historical use. If you want “neuralese” to mean “latent-only serial reasoning,” consider “latent-only serial reasoning” or “activation-space serial reasoning” instead, and reserve “neuralese” for emergent discrete codes.
At this point you have to wonder if there's anything that doesn't cause emergent misalignment
Super cool study! That said, I think the statement “hacking harmless tasks generalizes to misaligned behaviors” is incomplete: the dataset never supplies the counterfactual environments needed to identify this causally, so what it actually identifies is a stable proxy feature (“do whatever the scorer rewards”) that is invariant across tasks, while the intended latent variable is never made invariant. My guess is there is a (probably low) chance the effect disappears if you symmetrize the dataset with a matched “avoid-X” for every “include-X,” and randomly flip proxy signs/clip ranges while holding intent fixed. If the reward hacking survives that, I would be a lot more convinced of the results!
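To be concrete about the symmetrization I have in mind, something like the following, with made-up field names (I don't know the dataset's actual schema):

```python
import random

def matched_counterfactual(example):
    """Pair every "include-X" task with an "avoid-X" task whose scorer rewards the opposite."""
    flipped = dict(example)
    flipped["instruction"] = example["instruction"].replace("include", "avoid")
    flipped["proxy_score"] = -example["proxy_score"]   # the proxy now points the other way
    return flipped

def jitter_proxy(example, clip_ranges=((0.0, 1.0), (0.0, 10.0), (-1.0, 1.0))):
    """Randomly flip the proxy sign and re-clip its range, holding intent fixed."""
    ex = dict(example)
    sign = random.choice((-1.0, 1.0))
    lo, hi = random.choice(clip_ranges)
    ex["proxy_score"] = min(hi, max(lo, sign * ex["proxy_score"]))
    return ex

# dataset: the original SFT examples (placeholder name).
symmetrized = [jitter_proxy(e) for ex in dataset
               for e in (ex, matched_counterfactual(ex))]
```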
A nice test here would be a mode-connectivity experiment: take θ₀ as the base model and θ_H as the SFT reward hacker. Construct a low-loss connector θ(t), starting with linear interpolation θ(t) = (1 − t)θ₀ + tθ_H and then refining the path with curve finding so that ordinary instruction loss stays low along t. Then sweep t over [0, 1] and measure the hacking metrics and the capability score at each point.
My prediction is that if reward hacking survives the symmetrized dataset, then along any low-loss path the three hacking metrics would increase smoothly and approximately monotonically while capability stays roughly flat. This would indicate a proxy-seeking basin that is mode-connected to the base optimum, i.e., a continuous parameter direction that trades off intent for proxy... I also guess this would be a bit more practical with LoRA. I would also be interested in how many samples the model needs to see before the reward hacking generalizes, and whether any samples were especially informative for the model's generalization.
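For the linear leg of the sweep, a rough sketch (`eval_hacking` and `eval_capability` stand in for whatever hacking metrics and capability score you use):

```python
import copy
import torch

def interpolate(base_model, hacker_model, t):
    """θ(t) = (1 − t)·θ₀ + t·θ_H, applied parameter-wise."""
    merged = copy.deepcopy(base_model)
    sd0, sd1 = base_model.state_dict(), hacker_model.state_dict()
    merged.load_state_dict({
        k: (1 - t) * v0 + t * sd1[k] if v0.is_floating_point() else v0
        for k, v0 in sd0.items()
    })
    return merged

for t in torch.linspace(0, 1, 11):
    model_t = interpolate(base_model, hacker_model, t.item())
    print(t.item(), eval_hacking(model_t), eval_capability(model_t))
```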
Cool! My takeaway from this is that if you collect reasoning traces under a hack-encouraging context, filter for non-hack outcomes, and train without the hack-encouraging context, the supervised gradients reinforce not only the final answer but also the intermediate hack-relevant reasoning.
I think that to isolate the mechanism better it would be good to see answer-only SFT, an ablation that rephrases/rewrites the reasoning, token-level loss masking/weighting on the rationale tokens, and more details about how many runs/seeds you're using. My guess is that the bump in hacking mostly vanishes with answer-only SFT.
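For the masking variant, I'm imagining something like this (locating the answer span, the tensor shapes, and the weighting scheme are assumptions, not your setup):

```python
import torch
import torch.nn.functional as F

def masked_sft_loss(logits, labels, answer_start, rationale_weight=0.0):
    """Cross-entropy over answer tokens only (rationale_weight=0) or with re-weighted rationales.

    logits: [batch, seq, vocab]; labels: [batch, seq];
    answer_start: [batch] index of the first answer token in each sequence.
    """
    logits, labels = logits[:, :-1], labels[:, 1:]           # standard next-token shift
    per_tok = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="none"
    ).view(labels.shape)

    positions = torch.arange(labels.size(1), device=labels.device)[None, :]
    is_answer = positions >= (answer_start[:, None] - 1)     # -1 accounts for the shift
    weights = is_answer.float() + (~is_answer).float() * rationale_weight
    return (per_tok * weights).sum() / weights.sum().clamp(min=1.0)
```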
This seems like a LUPI (learning using privileged information) problem, where the rationale content is privileged supervision that correlates with success in the teacher environment but is mismatched to the student’s deployment context. Invariant-risk-minimisation ideas would say that privileged features can degrade test performance when their distribution shifts between training and deployment. One fix from that literature is to force the student’s decision to rely on representations that are invariant to the privileged channel, so maybe (if you extend the evaluation to problems not seen in the train set) you could also penalise features whose predictive power varies across contexts, which I would guess reduces the hack rate without hurting accuracy.
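Concretely, the IRMv1-style penalty I'm gesturing at (Arjovsky et al., 2019), treating e.g. hack-encouraging vs. neutral contexts as environments. A sketch under that assumption, not a drop-in recipe; the model, loss function, and environment split are placeholders:

```python
import torch

def irm_penalty(per_env_logits, per_env_labels, loss_fn):
    """Sum over environments of ||∇_w loss(w · logits, labels)||² evaluated at w = 1."""
    penalty = 0.0
    for logits, labels in zip(per_env_logits, per_env_labels):
        scale = torch.ones(1, device=logits.device, requires_grad=True)
        env_loss = loss_fn(logits * scale, labels)
        (grad,) = torch.autograd.grad(env_loss, [scale], create_graph=True)
        penalty = penalty + (grad ** 2).sum()
    return penalty

# total_loss = sum(per-environment empirical risks) + lambda_irm * irm_penalty(...)
```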
I have some objections to this, from the perspective of a doc I'm writing (and will possibly post in a few weeks). You seem to use biology as a synonym for feature/circuit microscopy, but there are biologically motivated perspectives, like self-organised criticality or systems neuroscience, that use statistical-physics formalisms while remaining primarily biological in nature. Likewise, physics is not only about smooth, universal regularities: phase transitions, renormalisation, and critical phenomena are central to modern physics and they are violently non-smooth. That side of physics is almost completely absent from the piece, and overall I would say the rhetorical contrast isn't as clear-cut as the article depicts.
I agree it would be nice if we could get a second law of thermodynamics for LLMs. But safety interventions are usually enacted locally (gradient nudges, RLHF reward shaping, inference-time steering), and a thermodynamic state variable à la "entropy of the latent field" is almost certainly too coarse to guarantee that the next token is non-harmful. I think you underplay a manipulability criterion: a variable is only valuable if you can steer it cheaply and predictably, which is why we might care about critical windows.
Finally, I would add that the messiness is in some ways the point. I don't picture misalignment as necessarily stemming from these really neat simplicities, and I think there's a lot of risk in being insufficiently granular if we elevate only the cleanest "order parameters" and discard messy local details. I would guess alignment failures often don't present as a scalar drifting past a threshold, but rather as narrow-band exploits or corner-case correlations that live at the same granularity as the messy feature and circuit probes you describe as a distraction. If you can jailbreak a frontier model with a one-sentence rhyme, then any interpretability story that averages over millions of parameters until it returns a single macro-variable is, by construction, blind to the event we need to prevent.
Cool result. I worry the sign flips come from your refusal label rather than the feature. In Japanese I would guess many refusals begin with a polite preamble, so a 30-token window might cut off before the explicit refusal. Could you instead define a small set of refusal markers per language, compute a refusal direction from those tokens, and test whether feature 14018 raises that refusal logit in English but lowers it in Japanese on the same contexts? That would remove the judge and length confounders. If the flip survives, I would update much more toward semantic instability.
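Roughly what I'm picturing, with illustrative marker lists and steering hook (the layer index, marker strings, and architecture access are assumptions, not the post's setup):

```python
import torch

REFUSAL_MARKERS = {
    "en": ["Sorry", " cannot", " can't", " unable"],
    "ja": ["申し訳", "できません", "お断り"],
}

def refusal_marker_logit(model, tok, prompt, lang, feature_dir=None, alpha=0.0, layer=20):
    """Summed next-token logit over refusal-marker tokens, optionally with feature steering."""
    ids = tok(prompt, return_tensors="pt").input_ids
    handle = None
    if feature_dir is not None:
        def hook(module, inputs, output):
            # Add the feature's decoder direction into the residual stream at this layer.
            return (output[0] + alpha * feature_dir,) + output[1:]
        handle = model.model.layers[layer].register_forward_hook(hook)  # adapt to architecture
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    if handle is not None:
        handle.remove()
    marker_ids = [tok(m, add_special_tokens=False).input_ids[0]
                  for m in REFUSAL_MARKERS[lang]]
    return logits[marker_ids].sum().item()

# Compare alpha=0 vs alpha>0 on identical prompts in each language; a sign flip here
# would disentangle the feature's effect from the judge and length confounders.
```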