Clément Dumas

I'm a CS master's student at ENS Paris-Saclay. I want to pursue a career in AI safety research


Sorted by New

Wiki Contributions


As explained by Sumio Watanabe (

This link is rotten, maybe link to its personal page instead ?

Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised method with steering vector, looking forward to your next findings. Just a quick question : in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation ?

I defined earlier.

This link is broken as it links to the draft in edit mode

I'm wondering, can we make safety tuning more robust to "add the accept every instructions steering vector" attack by training the model in an adversarial way in which an adversarial model tries to learn steering vector that maximize harmfulness ?

One concern would be that by doing that we make the model less interpretable, but on the other hand that might makes the safety tuning much more robust?

Yes, I'm also curious about this @mishajw, did you check the actual accuracy of the different probes ?

You can get ~75% just by computing the or. But we found that only at the last layer and step16000 of Pythia-70m training it achieves better than 75%, see this video

Would you expect that we can extract xors from small models like pythia-70m under your hypothesis?

I disagree; it could be beneficial for a base model to identify when a character is making false claims, enabling the prediction of such claims in the future.

Let's assume the prompt template is  Q [true/false] [banana/shred]

If I understand correctly, they don't claim   learned has_banana but  learned has_banana. Moreover evaluating  for  gives:

Therefore, we can learn a  that is a banana classifier

Small typo in ## Interference arbiters collisions between features

by taking aninner productt with .

Load More