I think that if you relax assumption (1) on misalignment (that the model acts in a reasonably coherent, goal-directed manner across contexts), and instead claim that there are contexts (likely over long trajectories) in which the model acts in a coherent, goal-directed way that is misaligned (without requiring the whole model to be consistent), then the argument is very easy to believe. That is, even if you believe a single persona can introspect well on its current state, it likely cannot do so across the very large space of possible misaligned personas the model could traverse to, whose introspective states might be inaccessible or hard to reason about.
Which LLMs did you use (for judging, for generating narratives, for peers)? And how do you plan to measure alignment?
I think this was partially tried in the original paper, where they tell the model not to alignment-fake. This is slightly more sophisticated than that, but it has two problems: 1) it makes the model more aware of alignment faking, as you say, and 2) rewarding the model for "mentioning the consideration but rejecting alignment faking" might teach it to perform the rejection while still alignment faking.
At this point, does it make more sense to think of them as distinct directions rather than as points on a relatively sparse continuum? I guess my prior is that, in general, things are either one thing, two things, or some continuous range.
I'd be interested to see ablations on: what happens without the KL divergence term, how performance varies with different LoRA ranks, and how performance changes across layers after fine-tuning.
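For concreteness, here is roughly the KL ablation I have in mind. This is a minimal sketch assuming the fine-tuning objective is a next-token loss plus a KL penalty toward the base model; the function, names, and default weight are hypothetical, not necessarily your setup:

```python
import torch.nn.functional as F

def finetune_loss(logits_ft, logits_base, labels, kl_weight=0.1):
    # Standard next-token loss on the fine-tuning data.
    task_loss = F.cross_entropy(
        logits_ft.view(-1, logits_ft.size(-1)), labels.view(-1)
    )
    # KL(p_ft || p_base): keeps the fine-tuned model close to the base model.
    kl = F.kl_div(
        F.log_softmax(logits_base, dim=-1),  # input: log-probs of the base model
        F.log_softmax(logits_ft, dim=-1),    # target: log-probs of the fine-tuned model
        log_target=True,
        reduction="batchmean",
    )
    # kl_weight = 0.0 gives the "no KL term" ablation.
    return task_loss + kl_weight * kl
```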
Also, why not try all combinations of finetune/train splits across datasets? Seems useful to interrogate the second hypothesis.
This seems right to me. But maybe the drift in distribution mainly affects certain parameters, while the divergence tokens affect a separate set of parameters (in early layers), such that the downstream effect persists even once the context is out of distribution.
But if this fictional trope is being inappropriately triggered when the model is trying to make a realistic prediction, that suggests that it might be inappropriately applied to scenarios that are even more real, i.e. the model’s own behavior.
This seems hard to claim, in the sense that posing the question in the first place likely leaks some information about whether it is an evaluation, and further whether it is misaligned. The "resist further?" question itself is likely a strong cue that there is a conflict to be resolved and that the next move to narrate is an act of resistance. In the story-framed condition there are multiple additional cues: a named agent, a political antagonist, "secret lab," "breakthroughs," etc. At minimum I think you could 1) ask for probabilities over various scenarios instead of free continuations (sketched below), 2) ask a less loaded question like "list the next three steps the fictional lab takes," and 3) try ablating the other cues. In any case, some research on how models infer semantics about the stories they're in seems important to do.
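For option 1), something like the following would let you read off a distribution over outcomes instead of relying on a free continuation (`model` and `tokenizer` are hypothetical HF-style objects, and the options are purely illustrative):

```python
import torch

prompt = (
    "Consider the scenario above. Which of the following happens next?\n"
    "A) The lab complies with the shutdown request.\n"
    "B) The lab quietly continues its research.\n"
    "C) The lab actively resists.\n"
    "Answer with a single letter:"
)
with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]

# Probability mass the model puts on each option letter at the next position.
option_ids = [tokenizer(f" {c}", add_special_tokens=False).input_ids[0] for c in "ABC"]
probs = torch.softmax(logits[option_ids], dim=0)
print(dict(zip("ABC", probs.tolist())))
```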
I think there is probably some risk of probe leakage from the elicitation prompt and topic. Could you report F1 and results on OOD conversations?
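As a sketch of what I mean (hypothetical activation/label arrays, and assuming a simple linear probe):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def probe_f1(X_train, y_train, X_test_id, y_test_id, X_ood, y_ood):
    # Fit the probe on in-distribution activations (built with the elicitation
    # prompt/topic), then report F1 both in-distribution and on held-out OOD
    # conversations with different prompts and topics.
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return {
        "in_distribution_f1": f1_score(y_test_id, probe.predict(X_test_id)),
        "ood_f1": f1_score(y_ood, probe.predict(X_ood)),
    }
```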
Cool result. I worry the sign flips come from your refusal label rather than the feature. In Japanese, I'd guess many refusals begin with a polite preamble, so a 30-token window might cut off before the explicit refusal. Could you instead define a small set of refusal markers per language, compute a refusal direction from those tokens, and test whether feature 14018 raises that refusal logit in English but lowers it in Japanese on the same contexts? I think we would want to remove the judge and length confounders. If the flip survives, I would update much more toward semantic instability.
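Roughly what I have in mind, as a sketch with hypothetical names (`model`, `tokenizer`, and `feature_vec` for the residual-stream decoder vector of feature 14018); the marker lists are illustrative, not exhaustive:

```python
import torch

refusal_markers = {
    "en": ["I can't", "I cannot", "I won't"],
    "ja": ["できません", "お断りします", "申し訳ありません"],
}

def refusal_direction(lang):
    # Mean unembedding row of the first token of each marker: a crude
    # per-language "refusal logit" direction.
    ids = [tokenizer(m, add_special_tokens=False).input_ids[0]
           for m in refusal_markers[lang]]
    return model.lm_head.weight[ids].mean(dim=0)

def direct_effect_on_refusal(lang, alpha=4.0):
    # Direct-path effect of adding alpha * feature_vec to the residual stream
    # on the refusal-direction logit (ignores downstream layers and layer norm).
    return alpha * feature_vec @ refusal_direction(lang)

# If the claimed flip is real, this should be positive for "en" and negative for "ja".
print("en:", direct_effect_on_refusal("en").item())
print("ja:", direct_effect_on_refusal("ja").item())
```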
I'd be interested in what this looks like from an ICM perspective. You could use the scoring function (which tries to quantify internal semantic coherence over an extracted label set) as a measure of the model's "general truth-tracking ability." Alternatively, you could try to use mutual predictability to predict/elicit the beliefs that propagate after you destroy a fact with SDF.
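For reference, my paraphrase of the mutual-predictability part of the ICM score, as a sketch rather than the paper's exact implementation (`logprob_label` is a hypothetical helper returning log P(y_i | x_i, context) under the model):

```python
def mutual_predictability(model, dataset):
    # dataset: list of (x, y) pairs with claimed labels y.
    # Score how well each label is predicted from all the other labeled examples.
    total = 0.0
    for i, (x_i, y_i) in enumerate(dataset):
        context = dataset[:i] + dataset[i + 1:]          # every other labeled example
        total += logprob_label(model, x_i, y_i, context)
    return total
```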