It is possible that adversarial image examples would appear innocuous to the human eye while still having a strong effect on the model.
If so, I think any hope of human review stopping this sort of thing is gone, since we cannot enforce image forensics on every public surface.
However, I am not sure whether adversarial examples can remain so invisible in real-world settings without the signal getting smothered by sensor noise. In that case, an attacker would need adversarial examples that are robust to sensor noise.
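For illustration, here is a minimal sketch (all names and parameters are hypothetical) of how an attacker might try to build noise-robust perturbations, in the spirit of expectation-over-transformation: average the attack gradient over simulated sensor noise so the perturbation survives the noise it would actually encounter.

```python
import torch
import torch.nn.functional as F

def noise_robust_attack(model, image, label, eps=8/255, alpha=1/255,
                        steps=40, n_noise=8, sigma=0.02):
    """PGD-style attack whose gradient is averaged over simulated sensor noise,
    so the resulting perturbation still fools the model after noise is added."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = 0.0
        for _ in range(n_noise):
            noisy = (image + delta + sigma * torch.randn_like(image)).clamp(0, 1)
            loss = loss + F.cross_entropy(model(noisy), label)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)              # keep the perturbation small
        delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)
```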
Great post. I learned a lot. Thank you.
This seems like big trouble for open data in general, more broadly than continual learning. Attackers could create and poison public datasets to exploit a particular model. Anyone fine-tuning a model of known properties on an open dataset is vulnerable to this attack.
My instinct is to reach for a concept-ablation fine-tuning-style solution, where we ablate harmful directions in activation space during fine-tuning. This won't do away with biased preferences entirely, but it can help avoid the most serious harms. In the context of your self-driving example, maybe we ablate the direction corresponding to hitting people during our fine-tuning steps.
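As a rough illustration of what I mean (the layer, the direction vector, and the activation shapes are all assumptions on my part), something like a forward hook that projects out the harmful direction while fine-tuning runs:

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that removes the component of a layer's activations
    along a given (unit-norm) 'harmful' direction."""
    d = direction / direction.norm()
    def hook(module, inputs, output):
        # assumes the hooked module returns a plain [batch, seq, hidden] tensor
        coeff = output @ d                        # projection coefficients
        return output - coeff.unsqueeze(-1) * d   # ablated activations
    return hook

# hypothetical usage during fine-tuning:
# handle = model.transformer.h[12].register_forward_hook(make_ablation_hook(hit_people_dir))
# ... run fine-tuning steps with the hook attached ...
# handle.remove()
```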
This ablation approach could be complemented by the digital-twin method suggested by Karl Krueger above, and by more layers of defence against model unpredictability, in a Swiss cheese approach.
Your approach aligns closely with a Popperian view of science as falsification:
Popper’s falsificationist methodology holds that scientific theories are characterized by entailing predictions that future observations might reveal to be false. When theories are falsified by such observations, scientists can respond by revising the theory, by rejecting the theory in favor of a rival, or by maintaining the theory as is and changing an auxiliary hypothesis. In any case, however, this process must aim at the production of new, falsifiable predictions, though Popper recognizes that scientists can and do hold onto theories in the face of failed predictions when there are no predictively superior rivals to turn to. He holds that scientific practice is characterized by its continual effort to test theories against experience and make revisions based on the outcomes of these tests.
This is a great approach for newbies, and a good reminder for the practised. LLMs could be a powerful tool for assisting science, but the consumer models are so sycophantic that they are often misleading. We really could do with a science-first LLM...
Hey, thanks for your response despite your sickness! I hope you're feeling better soon.
First, I agree with your interpretation of Sean's comment.
Second, I agree that a high-level explanation abstracted away from the particular implementation details is probably safer in a difficult field. Since the Personas paper doesn't provide the mechanism by which the personas are implemented in activation space, merely showing that these characteristic directions exist, we can't faithfully describe the personas mechanistically. Thanks for sharing.
It is possible that the anthropomorphic language could obscure the point you're making above. I did find it a bit difficult to understand originally, whereas the more technical phrasing is clearer. In the blog post you linked, you mention that it's a way to communicate your message more broadly, without the jargon overhead. However, to make your intention clear, you provide a distinction between simulacra and simulacrum, and a pretty lengthy explanation of how the meaning of the anthropomorphism differs across contexts. I am not sure this is a lower barrier to entry than understanding distribution shift and Bayesianism, at least for a technical audience.
I can see how it would be useful in very clear analogical cases, like when we say a model "knows" to mean it encodes knowledge in a feature, or "wants" to mean it encodes a preference in a circuit.
I like your framing. Very cool to see as a junior researcher.
I have revised my original idea in another comment. I don't think my first response was adequate: it was too vague, and it did not provide a causal chain explaining FT instability. My revised proposal is that the scat examples activate a "harmfulness" feature in the model, so the model is reinforced in the direction of harmfulness. I provided an experimental setup below which might show whether that is happening.
I think your proposal pretty much aligns with this, but replaces the general "harmfulness" feature with an SAE-sparsified representation as used by the Persona Features paper. If the scat examples activate toxic persona features, RL on them could induce emergent misalignment as discussed in Persona Features. If this is what you're driving at, this is a more detailed and principled solution. Good work.
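One way to check the first step of that story (everything here is hypothetical: the SAE, the feature indices, and the activation shapes) would be to compare how strongly the toxic persona features fire on the scat examples versus neutral examples:

```python
import torch

def mean_feature_activation(sae_encode, activations, feature_ids):
    """Mean activation of selected SAE features over a batch of
    residual-stream activations (activations: [n_tokens, d_model])."""
    feats = sae_encode(activations)          # [n_tokens, n_features] sparse codes
    return feats[:, feature_ids].mean(dim=0)

# hypothetical comparison:
# toxic_ids  = [...]  # indices of the 'toxic persona' features in the SAE
# on_scat    = mean_feature_activation(sae.encode, acts_scat, toxic_ids)
# on_neutral = mean_feature_activation(sae.encode, acts_neutral, toxic_ids)
# markedly higher activations on the scat data would support the story above
```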
I am also curious about your use of anthropomorphic language instead of the mechanistic explanations used in Personas. What do you think the anthropomorphism adds?
Thanks for your response!
Thank you for pointing this out. It's easy to miss errors like this. More and more, I am thinking it is necessary to read only from the main conferences. It is unfortunate that so many preprints coming out now have such big problems.
Yes, agreed! Good point.
This seems to be a better-evidenced and more plausible mechanism! Good thinking. So we know that training on arbitrary examples should not result in misalignment; it must be some property of the training data.
Here's a rephrasing of your hypothesis in terms of existing work, so we can falsify it, along with the question of why refusal no longer triggers. We know from this paper that harmfulness and refusal are separately encoded as 1D subspaces. It could be that the scatological examples are (linearly) dependent on, or entangled with, this harmfulness direction. We can hypothesise that RL FT on the scatology examples thus encourages the model to produce harmful responses by exaggerating the dependent harmfulness direction. This is similar to the white-box jailbreak method in the paper referenced above.
We could test our hypothesis using the paper's analysis pipeline on the before- and after-FT activations. Has the harmfulness direction been exaggerated in the post-FT activations? How about the refusal direction?
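Concretely, the comparison could be as simple as the following sketch (the activations and the two direction vectors from the paper are assumed to be available; shapes are illustrative):

```python
import torch

def direction_shift(acts_pre, acts_post, direction):
    """Change in mean projection onto a (unit-norm) direction, before vs.
    after fine-tuning. acts_*: [n_examples, d_model]."""
    d = direction / direction.norm()
    return (acts_post @ d).mean() - (acts_pre @ d).mean()

# hypothetical usage with the paper's harmfulness and refusal directions:
# harm_shift    = direction_shift(acts_pre, acts_post, harm_dir)
# refusal_shift = direction_shift(acts_pre, acts_post, refusal_dir)
# a positive harm_shift, or a negative refusal_shift, would support the hypothesis
```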
I see a few possible outcomes: the harmfulness direction has indeed been exaggerated, the refusal direction has been diminished, or some combination of the two. The results of that experiment might justify looking more closely at the refusal mechanism described here to better understand why refusal no longer triggers, or at why scat is entangled with refusal/harmfulness in the first place.
From my understanding of this paper, lottery tickets are invariant to optimiser, datatype, and other model properties (in this experimental setting), suggesting that lottery tickets encode some basic properties of the task.
It seems unlikely that lottery tickets grounded in fundamental task properties would change under continual learning without other problems (e.g. catastrophic forgetting) emerging.
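If someone wanted to check this directly, a simple measure would be the overlap between the ticket found before and after a continual-learning phase (a rough sketch; the masks are assumed to come from whatever pruning procedure is used):

```python
import torch

def ticket_overlap(mask_a, mask_b):
    """Jaccard overlap between two binary pruning masks (lottery tickets)."""
    a, b = mask_a.bool(), mask_b.bool()
    inter = (a & b).sum().float()
    union = (a | b).sum().float()
    return (inter / union).item()

# hypothetical check: high overlap between mask_before and mask_after would
# suggest the ticket is stable under continual learning
# print(ticket_overlap(mask_before, mask_after))
```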