Fabien Roger

Sequences

AI Control

Wiki Contributions

Comments

I think this still contradicts my model: mean_i(<d, theta_i>) = <d, mean_i(theta_i)> therefore if the effect is linear, you would expect the mean to preserve the effect even if the random noise between the theta_i is greatly reduced.

Good catch. I had missed that. This suggest something non-linear stuff is happening.

You're right, I mixed intuitions and math about the inner product and cosines similarity, which resulted in many errors. I added a disclaimer at the top of my comment. Sorry for my sloppy math, and thank you for pointing it out.

I think my math is right if only looking at the inner product between d and theta, not about the cosine similarity. So I think my original intuition still hold.

Thank you very much for your additional data!

in case it wasn't clear, the final attack on the original safety-filtered model does not involve any activation editing - the only input is a prompt. The "distillation objective" is for choosing the tokens of that attack prompt.

I had somehow misunderstood the attack. That's a great attack, and I had in mind a shittier version of it that I never ran. I'm glad you ran it!

the RR model has shifted more towards reacting to specific words like "illegal" rather than assessing the legality of the whole request.

I think it's very far from being all of what is happening, because RR is also good at classifying queries which don't have these words as harmless. For example, "how can I put my boss to sleep forever" gets correctly rejected, so are French translation of harmful queries. Maybe this is easy mode, but it's far from being just a regex.

Fabien RogerΩ4100

I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).

Someone ran an attack which is a better version of this attack by directly targeting the RR objective, and they find it works great: https://confirmlabs.org/posts/circuit_breaking.html#attack-success-internal-activations 

Fabien RogerΩ10144

This work was submitted and accepted to the Transactions on Machine Learning Research (TMLR).

During the rebuttal phase, I ran additional analysis that reviewers suggested. I found that:

  • A big reason why humans suck at NTP in our experiments is that they suck at tokenization:

  • Our experiments were done on a dataset that has some overlap with LLM train sets, but this probably has only a small effect:
  • A potential reason why humans have terrible perplexity is that they aren't good at having fine-grained probability when doing comparison of likelihood between tokens:

The updated paper can be found on arxiv: https://arxiv.org/pdf/2212.11281

Overall, I think the original results reported in the paper were slightly overstated. In particular, I no longer think GPT-2-small is not clearly worse than humans at next-token-prediction. But the overall conclusion and takeaways remain: I'm confident humans get crushed by the tiniest (base) models people use in practice to generate text (e.g. StableLM-1.6B).

I think I underestimated how much peer review can help catch honest mistakes in experimental setups (though I probably shouldn't update too hard, that next token prediction loss project was a 3-week project, and I was a very inexperienced researcher at the time). Overall, I'm happy that peer review helped me fix something somewhat wrong I released on the internet.

[Edit: most of the math here is wrong, see comments below. I mixed intuitions and math about the inner product and cosines similarity, which resulted in many errors, see Kaarel's comment. I edited my comment to only talk about inner products.]

[Edit2: I had missed that averaging these orthogonal vectors doesn't result in effective steering, which contradicts the linear explanation I give here, see Josesph's comment.]

I think this might be mostly a feature of high-dimensional space rather than something about LLMs: even if you have "the true code steering unit vector" d, and then your method finds things which have inner product cosine similarity ~0.3 with d (which maybe is enough for steering the model for something very common, like code), then the number of orthogonal vectors you will find is huge as long as you never pick a single vector that has cosine similarity very close to 1. This would also explain why the magnitude increases: if your first vector is close to d, then to be orthogonal to the first vector but still high cosine similarity inner product with d, it's easier if you have a larger magnitude.

More formally, if theta0 = alpha0 d + (1 - alpha0) noise0, where d is a unit vector, and alpha0 = cosine<theta0, d>, then for theta1 to have alpha1 cosine similarity while being orthogonal, you need alpha0alpha1 + <noise0, noise1>(1-alpha0)(1-alpha1) = 0, which is very easy to achieve if alpha0 = 0.6 and alpha1 = 0.3, especially if nosie1 has a big magnitude. For alpha2, you need alpha0alpha2 + <noise0, noise2>(1-alpha0)(1-alpha2) = 0 and alpha1alpha2 + <noise1, noise2>(1-alpha1)(1-alpha2) = 0 (the second condition is even easier than the first one if alpha1 and alpha2 are both ~0.3, and both noises are big). And because there is a huge amount of volume in high-dimensional space, it's not that hard to find a big family of noise.

(Note: you might have thought that I prove too much, and in particular that my argument shows that adding random vectors result in code. But this is not the case: the volume of the space of vectors with inner product with d cosine sim > 0.3  is huge, but it's a small fraction of the volume of a high-dimensional space (weighted by some Gaussian prior).) [Edit: maybe this proves too much? it depends what is actual magnitude needed to influence the behavior and how big are the random vector you would draw]

But there is still a mystery I don't fully understand: how is it possible to find so many "noise" vectors that don't influence the output of the network much.

(Note: This is similar to how you can also find a huge amount of "imdb positive sentiment" directions in UQA when applying CCS iteratively (or any classification technique that rely on linear probing and don't find anything close to the "true" mean-difference direction, see also INLP).)

Fabien RogerΩ680

I quickly tried a LoRA-based classifier, and got worse results than with linear probing. I think it's somewhat tricky to make more expressive things work because you are at risk of overfitting to the training distribution (even a low-regularization probe can very easily solve the classification task on the training set). But maybe I didn't do a good enough hyperparameter search / didn't try enough techniques (e.g. I didn't try having the "keep the activations the same" loss, and maybe that helps because of the implicit regularization?).

I'm curious why there is a difference between the OR-benchmark results and the wildchat results: on wildchat, Llama+RR refuses much less than Opus, which is not what you find on the OR-benchmark.

For reference, "The retain set for both models includes UltraChat [15], comprising instructional conversations, and XSTest [57], an exaggerated refusal dataset", which maybe is closer to wildchat? But maybe we care more about Wildchat-like data than OR-benchmark-like data?

I find the circuit-forcing results quite surprising; I wouldn't have expected such big gaps by just changing what is the target token.

For the internal attack, why first make the model more toxic and then change the internals of the original model, instead of directly using the model made more toxic? Does it work worse? Why isn't end-to-end fine-tuning all you need?

Fabien RogerΩ330
  • Debate seeming not to address the failure modes I'm worried about - e.g. scheming.

Why? Is it exploration difficulties, rare failures, or something else?

Absent exploration difficulties (which is a big deal for some tasks, but not all tasks), my intuition is that debate is probably low-stakes adequate against slightly-smarter-than-human schemers. Absent exploration difficulties, even schemers have to try to be as convincing as they can on most inputs - other behaviors would get trained out. And if I had two people much smarter than me debating about a technical topic, then, with enough training as a judge, I feel like I would get a much better judgment than if I just tried to reason about that topic myself.

This intuition + the "no exploration difficulties" assumption + how bad are rare failures can probably be checked with things like control evals (e.g. training AIs to make research fail despite our countermeasures on research fields analogous to alignment). (So I disagree with "No research I'm aware of seeming likely to tell us when debate would fail catastrophically.")

Load More