Hey, thanks for replying despite being unwell! I hope you're feeling better soon.
First, I agree with your interpretation of Sean's comment.
Second, I agree that a high-level explanation abstracted away from the particular implementation details is probably safer in a difficult field. The Personas paper doesn't give the mechanism by which the personas are implemented in activation space; it merely shows that these characteristic directions exist, so we can't faithfully describe the personas mechanistically. Thanks for sharing.
The anthropomorphic language could obscure the point you're making above. I found it a bit difficult to understand at first, whereas the more technical phrasing is clearer. In the blog post you linked, you mention that it's a way to communicate your message more broadly, without the jargon overhead. However, to convey your intention you provide a distinction between simulacra and simulacrum, and a fairly lengthy explanation of how the meaning of the anthropomorphism differs across contexts. I am not sure this is a lower barrier to entry than understanding distribution shift and Bayesianism, at least for a technical audience.
I can see how it would be useful in very clear analogical cases, like when we say a model "knows" to mean it has knowledge in a feature, or "wants" to mean it encodes a preference in a circuit.
I like your framing. Very cool to see as a junior researcher.
I have revised my original idea in another comment. I don't think my first response was quite right: it was too vague and doesn't provide a causal chain explaining the FT instability. My revised proposal is that the scat examples activate a "harmfulness" feature in the model, so the model is reinforced in the direction of harmfulness. I sketched an experimental setup below that might show whether that is happening.
I think your proposal pretty much aligns with this, but replaces the general "harmfulness" feature with an SAE-sparsified representation as used by the Persona Features paper. If the scat examples activate toxic persona features, RL on them could induce emergent misalignment as discussed in Persona Features. If this is what you're driving at, this is a more detailed and principled solution. Good work.
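For concreteness, here's a toy sketch of the kind of check I mean (not the Persona Features pipeline itself): random tensors stand in for real residual-stream activations, the SAE encoder weights are placeholders, and the toxic-persona feature index is hypothetical. The point is just the measurement: does the scat set light up the candidate feature more than a matched control set?

```python
import torch

# Placeholders: in practice acts_* would be residual-stream activations on the
# scat vs. matched-control prompts, W_enc/b_enc the encoder of an SAE trained
# on that layer, and TOXIC_FEATURE an index found separately (e.g. via the
# Persona Features methodology). All of these are hypothetical here.
d_model, n_features, n_prompts = 4096, 16384, 128
W_enc = torch.randn(d_model, n_features) / d_model**0.5
b_enc = torch.zeros(n_features)
TOXIC_FEATURE = 123  # hypothetical feature index

acts_scat = torch.randn(n_prompts, d_model)     # placeholder activations
acts_control = torch.randn(n_prompts, d_model)  # placeholder activations

def mean_feature_activation(acts: torch.Tensor, idx: int) -> float:
    feats = torch.relu(acts @ W_enc + b_enc)  # standard ReLU SAE encoder
    return feats[:, idx].mean().item()

print("scat:   ", mean_feature_activation(acts_scat, TOXIC_FEATURE))
print("control:", mean_feature_activation(acts_control, TOXIC_FEATURE))
# If the scat examples activate the toxic-persona feature well above matched
# controls, that's at least consistent with the hypothesis above.
```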
I'm curious about your choice of anthropomorphic language over the mechanistic explanations used in Personas. What do you think the anthropomorphism adds?
Thanks for your response!
Thank you for pointing this out. It's easy to miss errors like this. More and more I think it's necessary to read only work from the main conferences. It's unfortunate that so many preprints coming out now have such big problems.
Yes, agreed! Good point.
This seems to be a better-evidenced and more plausible mechanism! Good thinking. So we know that training on arbitrary examples shouldn't result in misalignment; the effect must come from some property of the training data.
Here's a rephrasing of your hypothesis in terms of existing work, so we can try to falsify it - also adding the question of why refusal no longer triggers. We know from this paper that harmfulness and refusal are separately encoded as 1D subspaces. It could be that the scatological examples are (linearly) dependent on / entangled with the harmfulness direction. We can then hypothesise that RL FT on the scatology examples encourages the model to produce harmful responses by exaggerating that entangled harmfulness direction. This is similar to the white-box jailbreak method in the paper referenced above.
We could test our hypothesis using the paper's analysis pipeline on the before- and after-FT activations. Has the harmfulness direction been exaggerated in the post-FT activations? How about the refusal direction?
I see a few possible outcomes: either the harmfulness direction has indeed been exaggerated, or the refusal direction has been diminished, or some combination of these. The results of that experiment might justify looking more closely at the refusal mechanism described here to better understand why refusal no longer triggers. Or looking at why scat is entangled with refusal/harmfulness in the first place.
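To make that concrete, here's a rough sketch of the before/after comparison I have in mind. The directions and activations below are random placeholders so the snippet runs standalone; in practice the harmfulness and refusal directions would come from the paper's difference-in-means pipeline, and the activations from running the same prompt set through the pre- and post-FT checkpoints.

```python
import torch
import torch.nn.functional as F

# Placeholders: harm_dir / refusal_dir would come from the cited paper's
# analysis (e.g. difference-in-means over harmful vs. harmless prompts), and
# acts_pre / acts_post from the same prompts run through the model before and
# after RL fine-tuning.
d_model, n_prompts = 4096, 256
harm_dir = F.normalize(torch.randn(d_model), dim=0)
refusal_dir = F.normalize(torch.randn(d_model), dim=0)

acts_pre = torch.randn(n_prompts, d_model)   # placeholder: pre-FT activations
acts_post = torch.randn(n_prompts, d_model)  # placeholder: post-FT activations

def mean_projection(acts: torch.Tensor, direction: torch.Tensor) -> float:
    # Average scalar projection of each activation onto the (unit) direction.
    return (acts @ direction).mean().item()

for name, d in [("harmfulness", harm_dir), ("refusal", refusal_dir)]:
    pre, post = mean_projection(acts_pre, d), mean_projection(acts_post, d)
    print(f"{name}: pre-FT {pre:+.3f} -> post-FT {post:+.3f}")
# A larger post-FT projection on the harmfulness direction, or a smaller one
# on the refusal direction, would correspond to the outcomes above.
```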
On a second read, your experiment reminds me of the finding here that random rewards improve Qwen performance when using GRPO for RL post-training (explained here by bias in the original GRPO optimiser, though this seems less relevant to your work). We didn't expect random rewards to work, and when they did, it invalidated quite a few poorly controlled preprints. So your work is important both to establish a control precedent for further investigation on this topic and to find the mechanism eroding safety FT in these cases. This is interesting, provocative work.
A distribution shift in the activations would be a shift away from the fine-tuning data / safety-rich signal - so are the two ideas perhaps the same, in different clothes?
Very cool.
In your concept I see something like a theory-of-theories, or a meta-theory. Right now we have many possible theories of alignment, sometimes competing, and we don't seem to have a good way to select between them. This ecological approach is a candidate: it classifies the strengths, commonalities, and weaknesses of the alternative approaches. You've already expanded on these advantages in the piece above, but that is my interpretation. I think yours is an interesting approach for structuring thinking, which is what we need in this pre-paradigmatic field.
I have been trying to create an alignment language as well. I've gone for a Popperian approach, aiming for falsifiability and iterating the theory until I achieve it. Slowly getting there! Mine is more of a theory, though, whereas I read yours as more abstract, encompassing mine.
My work also seemed arbitrary at first. But yours seems to have a strong core structure on which to build, so I think the edges can be smoothed out and applications developed!
I would be interested in the Autumn workshop, if there is a mailing list or something? I have signed up for the Equilibria Network Luma calendar. Cheers!
Your approach aligns closely with a Popperian view of science as falsification:
This is a great approach for newbies, and a good reminder for the practised. LLMs could be a powerful tool for assisting science, but the consumer models are so sycophantic that they are often misleading. We really could do with a science-first LLM...