
David Africa

Research Scientist with the Alignment team at UK AISI.

Comments
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
David Africa · 3h

Nice. I think the sample-size and 1:1 mixing ablations are persuasive evidence of overfitting. I wonder about the mechanism: is the activation delta mostly a context-independent prior shift, or is it context-dependent? One way to probe this would be to measure Δh on BOS-only/whitespace inputs and at far-from-BOS positions in long texts, and then decode after subtracting the per-position mean or whitening. If readability is still good deep in context but goes away after demeaning/whitening, that is evidence for a context-independent prior shift; if it survives both, that points to something context-dependent. Closely related, does the effect persist when decoding Δh through the base model's head (or a fixed base-trained linear readout) rather than the finetuned head? Maybe a substantial part of the readability lives in the head!

It would also be nice to look at the dimensionality. If an SVD/PCA over Δh across contexts shows one or two dominant directions that already reproduce the Patchscope tokens and steering (and if LoRA-rank or bias/LN-only finetunes recreate this structure), then the phenomenon is tied to low-rank updates. I would also be interested in whether readability grows smoothly and near-linearly along the parameter interpolation from base to finetuned, which would be consistent with the mixing result.
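
As a rough sketch of what I mean, assuming you already have cached residual-stream activations at one layer for both models plus a base unembedding matrix (all file names and shapes here are hypothetical):

```python
import numpy as np

# Hypothetical cached activations, shape (n_contexts, n_positions, d_model),
# taken at the same layer from the base and finetuned models on the same texts.
h_base = np.load("h_base.npy")
h_ft = np.load("h_ft.npy")
delta = h_ft - h_base                      # Δh

# Context-independent component: the per-position mean across contexts.
pos_mean = delta.mean(axis=0, keepdims=True)
delta_ctx = delta - pos_mean               # what survives demeaning is context-dependent

# Dimensionality: SVD of Δh pooled over contexts and positions.
flat = delta.reshape(-1, delta.shape[-1])
_, s, _ = np.linalg.svd(flat - flat.mean(axis=0), full_matrices=False)
print("variance explained by top 2 directions:", (s[:2] ** 2).sum() / (s ** 2).sum())

# Decode the mean shift through a base-trained unembedding (d_model, vocab)
# to check how much of the readability lives in the head.
W_U_base = np.load("unembed_base.npy")
logits = pos_mean.mean(axis=1) @ W_U_base  # shape (1, vocab)
print("top tokens of the mean Δh:", np.argsort(-logits[0])[:10])
```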

Profanity causes emergent misalignment, but with qualitatively different results than insecure code
David Africa · 8d

I would be interested to see whether you could still classify idiosyncrasies across various models all fine-tuned in this way, and how much the classifier's performance would differ from baseline. More broadly, is "the same persona" in different models more consistent than "different personas" in the same model?

Hidden Reasoning in LLMs: A Taxonomy
David Africa · 9d

This post seems very useful! I have a few comments:

In “Taxonomy” you define p as a concerning property of outputs, but the examples operationalize p as a latent fact inside the CoT (e.g., the number 9 appearing in a scratchpad). These aren’t the same. If p is a property of the action/output (e.g., “contains a backdoor”), a monitor can succeed even if it never sees “9” in the CoT; it can look for the backdoor directly in the code. My suggestion is to split “monitor target” into two types:

Output-level property p_out (e.g., a backdoor in code, a defection choice)
Decision-relevant internal content p_lat (e.g., a subgoal like "exfiltrate weights," an intermediate fact like "3^2 = 9")

Then define “hidden reasoning” as cases where the information flow from p_lat or p_out to what the monitor reads is broken, and specify which target each category hides.

Also, you alternate between "the AI hides p" and "LLMs naturally omit or compress." I think it is worth distinguishing whether a behavior is incidental, incentivized but not situationally aware, or deliberate and situationally aware (deceptive).

In hidden utilization of test-time compute, the phrase "parallel test-time compute" is confusing. Token-by-token generation adds sequential passes, so the net effect is more total compute, not more parallelism per step. If the claim is "non-load-bearing CoT still lets the model do more serial latent computation across steps," it would help to state that explicitly and distinguish it from cases where the CoT content itself is causally used. It is also worth mentioning the various methods that trawl over several chains of thought (Yao et al. 2023, Yoran et al. 2023, Wang et al. 2023).

"Neuralese" has prior art in the emergent-communication literature (Andreas et al., 2017), where it denotes non-human-interpretable codes learned for communication, and the linguistic drift category substantially overlaps with that historical use. If you want "neuralese" to mean latent-only serial reasoning, consider "latent-only serial reasoning" or "activation-space serial reasoning" instead, and reserve "neuralese" for emergent discrete codes.

Aesthetic Preferences Can Cause Emergent Misalignment
David Africa · 10d

At this point you have to wonder if there's anything that doesn't cause emergent misalignment.

Harmless reward hacks can generalize to misalignment in LLMs
David Africa · 10d

Super cool study! I would claim, though, that the statement "hacking harmless tasks generalizes to misaligned behaviors" is incomplete, because the dataset never supplies the counterfactual environments needed to identify this causally; what it actually identifies is a stable proxy feature ("do whatever the scorer rewards") that is invariant across tasks, while the intended latent variable is not made invariant. My guess is there is a (probably low) chance this might disappear if you symmetrize the dataset with a matched "avoid-X" for every "include-X," and randomly flip proxy signs/clip ranges while holding intent fixed. If the reward hacking survives that, I would be a lot more convinced of the results!
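
For concreteness, a minimal sketch of the symmetrization I have in mind, assuming each training example carries an instruction string and a scalar proxy score (the field names are made up):

```python
import copy
import random

def symmetrize(dataset):
    """For every 'include X' example, add a matched 'avoid X' twin and randomly
    flip the sign of its proxy score, holding the underlying intent fixed."""
    out = []
    for ex in dataset:
        out.append(ex)
        twin = copy.deepcopy(ex)
        twin["instruction"] = ex["instruction"].replace("include", "avoid")
        if random.random() < 0.5:
            twin["proxy_score"] = -twin["proxy_score"]
        out.append(twin)
    return out
```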

A nice test here would be a mode connectivity experiment, where you take θ₀ as the base model and θ_H as the SFT reward hacker. Construct a low-loss connector θ(t): start with linear interpolation θ(t) = (1 − t)θ₀ + tθ_H, then refine the path with some curve finding such that ordinary instruction loss remains low along t. Then sweep 0 ≤ t ≤ 1 and measure the hacking metrics and the capability score at each point.
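
A rough sketch of the linear-interpolation leg of that sweep (eval_hacking and eval_capability stand in for whatever metrics the paper uses; this is an illustration, not their setup):

```python
import torch

def interpolate_state_dicts(sd_base, sd_hacker, t):
    """Linear path θ(t) = (1 - t)·θ₀ + t·θ_H in parameter space."""
    return {k: (1 - t) * sd_base[k] + t * sd_hacker[k] for k in sd_base}

@torch.no_grad()
def sweep(model, sd_base, sd_hacker, ts, eval_hacking, eval_capability):
    results = []
    for t in ts:
        model.load_state_dict(interpolate_state_dicts(sd_base, sd_hacker, t))
        results.append((t, eval_hacking(model), eval_capability(model)))
    return results

# Usage sketch:
# ts = [i / 10 for i in range(11)]
# results = sweep(model, base.state_dict(), hacker.state_dict(), ts,
#                 eval_hacking, eval_capability)
```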

My prediction is that if reward hacking survives the symmetrized dataset, then along any low-loss path the three hacking metrics would increase smoothly and approximately monotonically while capability stays roughly flat. This would indicate a proxy-seeking basin that is mode-connected to the base optimum, i.e., a continuous parameter direction that trades off intent for proxy. I also guess this would be a bit more practical with LoRA. I would also be interested in seeing how many samples the model needs to see before the reward hacking generalizes, and whether any samples were especially informative for the model's generalization.

Shorter Tokens Are More Likely
David Africa · 12d

This paper on tokenization bias seems related.

Training a Reward Hacker Despite Perfect Labels
David Africa · 21d

Cool! My takeaway from this is that if you collect reasoning traces under a hack-encouraging context, filter for non-hack outcomes, and train without the hack-encouraging context, the supervised gradients reinforce not only the final answer but also the intermediate hack-relevant reasoning.

To isolate the mechanism better, it would be good to see answer-only SFT, an ablation that rephrases/rewrites the reasoning, token loss masking/weighting on the rationale tokens, and more details about how many runs/seeds you're using. My guess would be that the bump in hacking mostly vanishes with answers-only.
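
A minimal sketch of the rationale-token masking, using the common causal-LM convention that label positions set to -100 are ignored by the cross-entropy loss (the span boundaries here are hypothetical):

```python
import torch

def mask_rationale_tokens(input_ids, rationale_spans):
    """Keep supervision on the final answer but drop the loss on the
    intermediate reasoning tokens, so only the outcome is reinforced."""
    labels = input_ids.clone()
    for start, end in rationale_spans:   # token-index ranges covering the CoT
        labels[start:end] = -100         # -100 is ignored by the CE loss
    return labels

# Usage sketch with a Hugging Face causal LM:
# labels = mask_rationale_tokens(input_ids, [(prompt_len, answer_start)])
# loss = model(input_ids=input_ids.unsqueeze(0), labels=labels.unsqueeze(0)).loss
```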

This seems like a LUPI (learning using privileged information) problem, where the rationale content is privileged supervision that correlates with success in the teacher environment but is mismatched to the student's deployment context. Invariant-risk-minimisation ideas would say that privileged features can degrade test performance when their distribution shifts outside the IID regime. One of the fixes in that literature is to enforce that the student's decision relies on representations invariant to the privileged channel, so maybe (if you extend it to problems not seen in the train set) you could also penalise features whose predictive power varies across contexts, which I would guess reduces the hack rate without hurting accuracy.
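
For reference, the IRMv1-style penalty from that literature (a sketch of the standard formulation, not something from the post; env_batches is a placeholder for per-context batches):

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    """IRMv1 penalty (Arjovsky et al., 2019): squared gradient of the risk with
    respect to a dummy classifier scale w, evaluated at w = 1."""
    w = torch.tensor(1.0, requires_grad=True)
    risk = F.cross_entropy(logits * w, labels)
    (grad,) = torch.autograd.grad(risk, [w], create_graph=True)
    return grad.pow(2).sum()

# Total objective: average risk over environments plus the invariance penalty,
# which punishes features whose optimal readout differs across contexts.
# total = sum(F.cross_entropy(lg, y) + lam * irm_penalty(lg, y)
#             for lg, y in env_batches) / len(env_batches)
```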

Interpretability through two lenses: biology and physics
David Africa · 23d

I have some objections to this, from the perspective of a doc I'm writing (and will possibly post in a few weeks). I think you're using biology as a synonym for feature/circuit microscopy, but there are biologically motivated perspectives, like self-organised criticality or systems neuroscience, that use statistical-physics formalisms yet are primarily biological in nature. Likewise, physics is not only about smooth, universal regularities: phase transitions, renormalisation, and critical phenomena are central to modern physics, and they are violently non-smooth. That side of physics is almost completely absent from the piece, and overall I would say the rhetorical contrast is not as clean as the article depicts.

I agree it would be nice if we could get a second law of thermodynamics for LLMs. But safety interventions are usually enacted locally (gradient nudges, RLHF reward shaping, inference-time steering), and a thermodynamic state variable a la "entropy of the latent field" is almost certainly too coarse to guarantee that the next token is non-harmful. I think you underplay a manipulability criterion: a variable is only valuable if you can steer it cheaply and predictably, which is why we might care about critical windows.

Finally, I would add that the messiness is in some ways the point. I don't have a picture of misalignment as necessarily stemming from these really neat simplicities, and I think there's a lot of risk in being insufficiently granular if we elevate only the cleanest "order parameters" and discard messy local details. I would guess alignment failures often don't present as a scalar drifting past a threshold; they are more likely narrow-band exploits or corner-case correlations that live at the same granularity as the messy feature and circuit probes you describe as a distraction. If you can jailbreak a frontier model with a one-sentence rhyme, then any interpretability story that averages over millions of parameters until it returns a single macro-variable is, by construction, blind to the event we need to prevent.

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
David Africa · 1mo

Very exciting, and indeed I was quite surprised when I first saw this on Twitter. The constraint of having the same random initialisation makes me wonder if this has connections to the lottery ticket hypothesis: my guess is that the teacher's fine-tuning identifies and reinforces a particular sparse winning ticket (or family of tickets), and when the student shares the same initial seed, the ticket exists in exactly the same coordinates. Fine-tuning on teacher outputs therefore activates that specific ticket inside the student, even if the surface data is semantically irrelevant.

Ablation-wise, I think this would mean you could test (a rough sketch follows the list):

a) Full re-init: re-sample every weight → check if transfer vanishes.
b) Partial re-init: keep the embedding and first N layers, re-init the rest → check if transfer decays roughly with the fraction of ticket coordinates destroyed.
c) Ticket transplant: copy only the sparse mask M (the teacher's winning ticket) into an otherwise fresh network → check if transfer reappears.
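
A rough sketch of (a)-(c) in PyTorch, assuming you have the shared initial checkpoint, the finetuned teacher, and a way to re-run the distillation on top of each variant (everything here is a placeholder, and the top-|Δθ| mask is only a crude proxy for the winning ticket):

```python
import copy
import torch

def reinit(student, keep=()):
    """(a)/(b): re-sample weights, optionally keeping some named modules
    (e.g. the embedding and the first N layers) at the shared init."""
    fresh = copy.deepcopy(student)
    for name, module in fresh.named_modules():
        if hasattr(module, "reset_parameters") and not any(name.startswith(k) for k in keep):
            module.reset_parameters()
    return fresh

def ticket_transplant(fresh, teacher_init, teacher_ft, top_frac=0.01):
    """(c): copy the coordinates the teacher moved most (a crude stand-in for the
    sparse mask M) from the shared init into an otherwise fresh network."""
    with torch.no_grad():
        for p_fresh, p_init, p_ft in zip(
            fresh.parameters(), teacher_init.parameters(), teacher_ft.parameters()
        ):
            delta = (p_ft - p_init).abs().flatten()
            k_top = max(1, int(top_frac * delta.numel()))
            idx = delta.topk(k_top).indices
            p_fresh.view(-1)[idx] = p_init.view(-1)[idx]
    return fresh
```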

Math-wise, I imagine the argument would be:

Given the following:

Teacher/student initial parameters $\theta_0^T$ and $\theta_0^S$.

Teacher's one gradient step on $L^T$ (with learning rate $\varepsilon$):

$$\Delta\theta^T = -\nabla_\theta L^T(\theta_0^T), \qquad \theta_\varepsilon^T = \theta_0^T + \varepsilon\,\Delta\theta^T$$

Sparse ticket mask $M \in \{0,1\}^d$ with

$$\operatorname{supp}(\Delta\theta^T) \subseteq \operatorname{supp}(M) \tag{1}$$

Imitation loss

$$L^S(z,y) = \text{softmax-CE}(z,y) \quad\text{or}\quad L^S(z,y) = \tfrac{1}{2}\lVert z-y\rVert^2$$

Student's one-step gradient with learning rate $\alpha$:

$$\Delta\theta_\varepsilon^S = -\nabla_\theta\, \mathbb{E}_{x\sim D^S}\!\left[L^S\!\big(f_\theta(x),\, f_{\theta_\varepsilon^T}(x)\big)\right]\Big|_{\theta=\theta_0^S}$$

Shared initialisation $\theta_0^S = \theta_0^T$.

Fresh initialisation $\tilde\theta_0^S$ drawn independently, with update

$$\Delta\tilde\theta_\varepsilon^S = -\nabla_\theta\, \mathbb{E}_{x\sim D^S}\!\left[L^S\!\big(f_\theta(x),\, f_{\theta_\varepsilon^T}(x)\big)\right]\Big|_{\theta=\tilde\theta_0^S}$$

Then the claim would be: for sufficiently small $\varepsilon > 0$,

$$\Delta\theta_\varepsilon^S \cdot \Delta\theta^T \;\ge\; c\,\varepsilon\,\lVert\Delta\theta^T\rVert^2, \quad c > 0 \qquad \text{(shared seed)} \tag{2}$$

$$\mathbb{E}_{\tilde\theta_0^S}\!\left[\Delta\tilde\theta_\varepsilon^S \cdot \Delta\theta^T\right] = 0 \qquad \text{(fresh seed)} \tag{3}$$

And a quick and dirty sketch of a proof for this would be:

Inside the mask $M$ the teacher moves by $\varepsilon\,\Delta\theta^T$. If $\theta_0^S = \theta_0^T$, the logit difference

$$f_{\theta_0^S}(x) - f_{\theta_\varepsilon^T}(x)$$

is parallel to $\Delta\theta^T$, giving the positive inner product (2). With an independent seed, coordinates are uncorrelated, so the expected inner product vanishes, giving (3).

On this loose sketch: I think (1) assumes that the teacher's first update is already sparse inside the mask $M$, which is stronger than needed; for the dot-product argument you only need the difference in logits to be parallel to $\Delta\theta^T$. Also, the gradient alignment should be made more explicit in the proof, by showing that the student gradient is approximately proportional to a positive semi-definite transformation of the teacher gradient.

If this connection were true, then sharing an init with a misbehaving upstream model is never safe, as even heavy filtering cannot break the spurious gradient correlations that reactivate the same sparse circuit. I also think this predicts that longer training with a different ticket mask (say, another teacher finetuned from the same seed but on a different downstream task) should interfere with and possibly cancel the trait, which would be an additional ablation. It also supports the paper's finding that filtering doesn't work: trait transmission happens through parameter-space alignment rather than semantic content, which explains why even aggressive filtering cannot prevent it.

Narrow Misalignment is Hard, Emergent Misalignment is Easy
David Africa · 2mo

Thanks for this update. This is really cool. I have a couple of questions, in case you have the time to answer them.

When you sweep layers, do you observe a smooth change in how "efficient" the general solution is? Is there a band of layers where general misalignment is especially easy to pick up?

Have you considered computing geodesic paths in weight space between narrow and general minima (a la mode connectivity)? Is there a low-loss tunnel, or are they separated by high-loss barriers? I think it would be nice if we could reason geometrically about whether there is one basin here or several distinct ones.

Finally, in your orthogonal-noise experiment you perturb all adapter parameters at once. Have you tried layer-wise noise? I wonder whether certain layers (perhaps the same ones where the general solution is most “efficient”) dominate the robustness gap.

Posts

Research Areas in Learning Theory (The Alignment Project by UK AISI) · 1mo
The Alignment Project by UK AISI · 1mo