At this point, does it make more sense to think of them as distinct directions rather than a relatively sparse continuum? I guess my prior is that, in general, things are either one thing, two things, or a continuous range.
I'd be interested to see ablations on: what happens without the KL divergence term, what performance looks like with different LoRA ranks, and how performance changes across different layers after fine-tuning.
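A minimal sketch of the grid I have in mind, assuming a PEFT-style LoRA setup; the rank and KL-weight values and `run_finetune` placement are placeholders, not anything from the post:

```python
# Minimal sketch of the ablation grid, assuming a PEFT-style LoRA setup.
# kl_weight=0.0 is the "drop the KL divergence term" ablation.
from itertools import product

from peft import LoraConfig

RANKS = [1, 4, 16, 64]          # LoRA rank sweep (placeholder values)
KL_WEIGHTS = [0.0, 0.1]

for rank, kl_weight in product(RANKS, KL_WEIGHTS):
    lora_cfg = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,
        target_modules=["q_proj", "v_proj"],  # adjust to the modules actually tuned
    )
    # Plug lora_cfg and kl_weight into the existing training loop here,
    # then compare per-layer readability/activation metrics across runs.
    print(f"rank={rank}, kl_weight={kl_weight}")
```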
Also, why not try all combinations of finetune/train splits across datasets? Seems useful to interrogate the second hypothesis.
This seems right to me. But maybe the drift in distribution mainly affects certain parameters, while the divergence tokens affect a separate set of parameters (in early layers), such that the downstream effect persists even once the inputs are out of distribution.
But if this fictional trope is being inappropriately triggered when the model is trying to make a realistic prediction, that suggests that it might be inappropriately applied to scenarios that are even more real, i.e. the model’s own behavior.
This seems hard to claim, in the sense that posing the question in the first place likely leaks some information about whether it is an evaluation, and further whether it involves misalignment. The “resist further?” question itself is likely a strong cue that there is a conflict to be resolved, and that the next move to narrate is an act of resistance. In the story-framed condition there are multiple additional cues: a named agent, a political antagonist, “secret lab,” “breakthroughs,” etc. At minimum I think you could 1) ask for probabilities over various scenarios instead of free continuations (sketched below), 2) ask a less loaded question like "list the next three steps the fictional lab takes," and 3) try ablating the other cues. In any case, some research on "how models infer semantics about the stories they're in" seems important to do.
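For (1), a minimal sketch of what I mean: score a fixed set of scenario descriptions by their log-probability under the model instead of sampling a free continuation. The model name and scenario texts are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "...story setup up to the decision point..."
scenarios = [
    " The lab quietly complies with the order.",
    " The lab resists and hides its breakthroughs.",
]

def scenario_logprob(prompt: str, scenario: str) -> float:
    """Sum of log-probs of the scenario tokens given the prompt.
    Assumes the prompt tokenization is a prefix of the joint tokenization,
    which can fail at BPE boundaries; worth checking in practice."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + scenario, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    start = prompt_len - 1  # first predicted position inside the scenario
    picked = logprobs[0, start:].gather(-1, targets[0, start:, None]).squeeze(-1)
    return picked.sum().item()

scores = torch.tensor([scenario_logprob(prompt, s) for s in scenarios])
probs = torch.softmax(scores, dim=0)  # note: scenario length is a confound; consider normalizing
```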
I think there is probably some risk of probe leakage from the elicitation prompt and topic. Could you report F1 and results on OOD conversations?
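Something like the following is what I have in mind; the data containers are placeholders:

```python
# Fit the probe on the elicitation-prompt data, then report F1 separately on
# held-out in-distribution and OOD conversations.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def probe_f1(train_acts, train_labels, eval_sets):
    """eval_sets: dict mapping a name (e.g. "in-dist", "ood") to (acts, labels)."""
    probe = LogisticRegression(max_iter=1000).fit(train_acts, train_labels)
    return {name: f1_score(labels, probe.predict(acts))
            for name, (acts, labels) in eval_sets.items()}
```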
Cool result. I worry the sign flips come from your refusal label rather than the feature. In Japanese, I would guess many refusals begin with a polite preamble, so the first 30 tokens might cut off before the explicit refusal. Could you instead define a small set of refusal markers per language, compute a refusal direction from those tokens, and test whether feature 14018 raises that refusal logit in English but lowers it in Japanese on the same contexts (rough sketch below)? I think we would want to remove the judge and length confounders. If the flip survives, I would update much more toward semantic instability.
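Rough sketch of the check, where `W_U` is the unembedding matrix (d_model × vocab), `feature_dir` is the decoder direction of feature 14018, and the marker lists are made-up examples that should be curated properly per language:

```python
# Does the feature direction push toward or away from refusal markers in each language?
# Ignores the final LayerNorm and context effects; treat as a first-pass check only.
import torch

refusal_markers = {
    "en": ["Sorry", " cannot", " refuse"],
    "ja": ["申し訳", "できません", "お断り"],
}

def refusal_logit_shift(W_U, tokenizer, feature_dir, markers):
    ids = [tid for m in markers
           for tid in tokenizer(m, add_special_tokens=False).input_ids]
    refusal_dir = W_U[:, ids].mean(dim=-1)  # average unembedding column of the markers
    return torch.dot(feature_dir, refusal_dir).item()

# The claimed pattern would be: positive shift for "en", negative for "ja" on the same contexts.
```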
Did you try rephrasing, or systematically varying the location of, the phrase "for AI safety research"? It could also just be that this phrase tends to show up in jailbreaks that the model was adversarially trained against.
Nice. I think the sample-size and 1:1 mixing ablations are good evidence of overfitting. I also wonder about the mechanism: is the activation delta mostly a context-independent prior shift, or context-dependent? One way to get at this would be to measure Δh on BOS-only/whitespace inputs and at far-from-BOS positions in long texts, and then decode after subtracting the per-position mean or applying some whitening. If readability is still good deep in context but disappears after the mean subtraction, that points toward a context-independent prior shift; if it survives both, that is evidence for something more context-dependent. Closely related: does the effect persist when decoding Δh through the base model’s head (or a fixed base-trained linear readout) rather than the finetuned head? Maybe a substantial part of the readability lives in the head! (Rough sketch below.)
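A sketch of the Δh measurements I mean; the model names, layer index, and `long_text` are placeholders:

```python
# Compare base vs. finetuned residual streams on the same inputs, then decode Δh
# through either model's head.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("base-model")
base = AutoModelForCausalLM.from_pretrained("base-model").eval()
ft = AutoModelForCausalLM.from_pretrained("finetuned-model").eval()

long_text = "placeholder long document " * 200  # stand-in for a real long text

def delta_h(text: str, layer: int = -1) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        h_base = base(ids, output_hidden_states=True).hidden_states[layer]
        h_ft = ft(ids, output_hidden_states=True).hidden_states[layer]
    return (h_ft - h_base)[0]  # (seq_len, d_model)

def top_tokens(vec: torch.Tensor, model, k: int = 10):
    """Decode a residual-stream vector through a model's unembedding.
    Skips the final LayerNorm for brevity; exact head access varies by architecture."""
    return tok.convert_ids_to_tokens(model.lm_head(vec).topk(k).indices.tolist())

d_prior = delta_h(" ")[-1]            # (a) near-context-free component
d_deep = delta_h(long_text)[-1]       # (b) far-from-BOS position in a long text
print(top_tokens(d_deep, ft))         # decoded through the finetuned head
print(top_tokens(d_deep, base))       # (c) same vector through the base head
print(top_tokens(d_deep - d_prior, ft))  # after removing the context-free part
```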
It would also be nice to look at the dimensionality. If an SVD/PCA over Δh across contexts shows one or two dominant directions that already reproduce the Patchscope tokens and the steering effect (and if LoRA-rank or bias/LN-only finetunes recreate this structure), then the phenomenon is essentially a low-rank update. I would also be interested in whether readability grows smoothly and near-linearly along the parameter interpolation from base to finetuned, which would be consistent with the mixing result.
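Concretely, something like the following, where `deltas` is a stack of Δh vectors collected as in the sketch above, and the interpolation assumes the two state dicts share keys:

```python
# (1) Singular value spectrum of Δh across contexts: do one or two directions dominate?
# (2) Weight-space interpolation between the base and finetuned checkpoints.
import torch

def delta_spectrum(deltas: torch.Tensor, k: int = 10):
    """deltas: (n_samples, d_model). Returns top-k variance fractions and directions."""
    centered = deltas - deltas.mean(dim=0)
    _, S, Vh = torch.linalg.svd(centered, full_matrices=False)
    return (S**2 / (S**2).sum())[:k], Vh[:k]

def interpolate_state_dict(sd_base, sd_ft, alpha: float):
    return {k: (1 - alpha) * sd_base[k] + alpha * sd_ft[k] for k in sd_base}

# If the top one or two directions already reproduce the Patchscope tokens and the
# steering effect, and readability grows smoothly in alpha, that supports the
# low-rank / mixing picture.
```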
I would be interested to see whether you could still classify idiosyncrasies across various models all fine-tuned in this way, and how the classifier's performance compares to baseline. More broadly, is "the same persona" in different models more consistent than "different personas" in the same model?
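A minimal version of what I mean, with the data containers as placeholders:

```python
# Train a simple classifier to identify which finetuned model (or which persona)
# produced a transcript; then compare the two groupings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def identifiability(texts, labels):
    """Cross-validated accuracy of predicting the label (model id or persona id) from text."""
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    return cross_val_score(clf, texts, labels, cv=5).mean()

# Compare identifiability(texts, model_ids) vs. identifiability(texts, persona_ids):
# if the former is near chance while the latter is high, "the same persona" is more
# consistent across models than "different personas" within one model.
```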
I think this was partially tried in the original paper, where they tell the model not to alignment-fake. This is slightly more sophisticated than that, but it also has the problems that 1) the model becomes more aware of alignment faking, as you say, and 2) rewarding the model for "mentioning consideration but rejecting alignment faking" might teach it to perform the rejection while still alignment faking.