Clément Dumas

I'm a CS master's student at ENS Paris-Saclay. I want to pursue a career in AI safety research.

Posts

38 Aspiration-based Q-Learning

6mo

Comments

Mechanistically Eliciting Latent Behaviors in Language Models

Clément Dumas7dΩ220

Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised methods with steering vectors, and I'm looking forward to your next findings. Just a quick question: in the conversation in the red-teaming section, is the learned vector applied to every token generated during the conversation?

Mechanistically Eliciting Latent Behaviors in Language Models

Clément Dumas7d10

I defined earlier.

This link is broken, as it links to the draft in edit mode.

Refusal in LLMs is mediated by a single direction

Clément Dumas7d20

I'm wondering: can we make safety tuning more robust to the "add the accept-every-instruction steering vector" attack by training the model adversarially, with an adversarial model that tries to learn steering vectors that maximize harmfulness?

One concern is that doing this would make the model less interpretable, but on the other hand it might make the safety tuning much more robust.

How well do truth probes generalise?

Clément Dumas2mo10

Yes, I'm also curious about this. @mishajw, did you check the actual accuracy of the different probes?

What’s up with LLMs representing XORs of arbitrary features?

Clément Dumas4mo10

You can get ~75% accuracy just by computing the OR. However, we found that accuracy exceeds 75% only at the last layer and at step 16000 of Pythia-70m training; see this video.
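To make the 75% baseline concrete, here is a minimal sketch. It assumes the two binary features are balanced and independent, so that all four input combinations are equally likely; that assumption is mine and is not stated in the comment above.

```python
from itertools import product

# All four combinations of two binary features (a, b).
# Assumption: each combination is equally likely in the dataset.
inputs = list(product([0, 1], repeat=2))

# Fraction of inputs on which OR and XOR give the same label.
# They differ only on (1, 1), where OR = 1 and XOR = 0.
or_xor_agreement = sum((a | b) == (a ^ b) for a, b in inputs) / len(inputs)

print(or_xor_agreement)  # 0.75
```

So a probe that has only picked up the OR of the two features would already score about 75% on an XOR labelling, which is why only accuracy above that level counts as evidence for an XOR representation.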

What’s up with LLMs representing XORs of arbitrary features?

Clément Dumas4mo10

Would you expect that we can extract XORs from small models like Pythia-70m under your hypothesis?

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Clément Dumas5mo10

I disagree; it could be beneficial for a base model to identify when a character is making false claims, enabling the prediction of such claims in the future.

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Clément Dumas5moΩ35-4

Let's assume the prompt template is Q [true/false] [banana/shred]

If I understand correctly, they don't claim that $p$ learned has_banana, but that $\tilde{p} = \frac{p(x^+) + (1 - p(x^-))}{2}$ learned has_banana. Moreover, evaluating $\tilde{p}$ for $p = \text{is\_true}(x) \oplus \text{is\_shred}(x)$ gives:

$\tilde{p}(x = \text{Q [?] banana}) = \frac{p(\text{Q true banana}) + (1 - p(\text{Q false banana}))}{2} = \frac{1 + (1 - 0)}{2} = 1$

$\tilde{p}(x = \text{Q [?] shred}) = \frac{p(\text{Q true shred}) + (1 - p(\text{Q false shred}))}{2} = \frac{0 + (1 - 1)}{2} = 0$

Therefore, we can learn a $\tilde{p}$ that is a banana classifier.
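The two evaluations above can be checked mechanically. This is a small sketch in which the function names `p` and `p_tilde` are my own labels for the quantities in the comment, and the probe is idealized as outputting exactly 0 or 1.

```python
def p(is_true: bool, is_shred: bool) -> int:
    """Idealized probe that computes is_true XOR is_shred."""
    return int(is_true ^ is_shred)


def p_tilde(is_shred: bool) -> float:
    """Average of p on the 'true' prompt and (1 - p) on the 'false' prompt."""
    p_plus = p(True, is_shred)    # p(Q true [banana/shred])
    p_minus = p(False, is_shred)  # p(Q false [banana/shred])
    return (p_plus + (1 - p_minus)) / 2


banana_score = p_tilde(is_shred=False)
shred_score = p_tilde(is_shred=True)

print(banana_score, shred_score)  # 1.0 0.0
```

The averaged score depends only on the banana/shred suffix and not on the truth value of the statement, which is the sense in which it acts as a banana classifier.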

Incidental polysemanticity

Clément Dumas6mo11

Small typo in ## Interference arbiters collisions between features

by taking aninner productt with .

Aspiration-based Q-Learning

Clément Dumas6mo10

Hi Nathan, I'm not sure if I understand your critique correctly. The algorithm we describe does not try to "maximize the expected likelihood of harvesting X apples". It tries to find a policy that, given its current knowledge of the world, will achieve an expected return of X apples. That is, it does not care about the probability of getting exactly X apples, but rather the average number of apples it will get over many trials. Does that make sense?
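The difference between the two objectives can be shown with a toy example. The two policies and their outcome distributions below are hypothetical numbers chosen for illustration; they are not taken from the post.

```python
# Each policy is a list of (apples, probability) outcomes.
# Hypothetical values, chosen only to separate the two objectives.
policy_a = [(4, 0.5), (6, 0.5)]
policy_b = [(5, 0.6), (0, 0.4)]

TARGET = 5  # the aspiration X


def expected_return(policy):
    """Average number of apples over many trials."""
    return sum(apples * prob for apples, prob in policy)


def prob_exactly(policy, target):
    """Probability of harvesting exactly `target` apples."""
    return sum(prob for apples, prob in policy if apples == target)


print(expected_return(policy_a), prob_exactly(policy_a, TARGET))  # 5.0 0
print(expected_return(policy_b), prob_exactly(policy_b, TARGET))  # 3.0 0.6
```

Policy A meets the aspiration in expectation even though it never harvests exactly 5 apples, while policy B is the better choice if the goal were to maximize the probability of exactly 5. The comment above describes the first criterion.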