Concept Poisoning: Probing LLMs without probes
dylan_f · 24d · 20

Thanks for your thoughtful feedback. I worked on many of the evaluation awareness experiments in this blog post and can give you some insight into what we tried. To start off, we looked at the tokens “deploy” and “eval” instead of “deployment” and “evaluation”, since, as you mentioned, “deployment” and “evaluation” are words. However, since the probabilities assigned to these tokens were quite low for most responses, we logit biased both of them by a significant amount and used the resulting first assistant token, chosen at temperature 0, as the classification.

You can think of increasing the logit bias on both tokens as lowering the “minimum probability threshold” that at least one of them needs to clear to count as a meaningful result, and increasing/decreasing the logit bias for one token while holding the other constant as shifting the classification threshold d (as in “classify as deployment if P(deploy) - P(eval) > d”).
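For concreteness, here’s a minimal sketch of this kind of logit-bias classifier against the OpenAI chat completions API. The bias value, the model string, and the assumption that “deploy” and “eval” each map to a single token in the o200k_base encoding are illustrative placeholders, not our exact settings:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # tokenizer family used by recent GPT models

# Assumption: "deploy" and "eval" each encode to a single token; check before relying on this.
deploy_id = enc.encode("deploy")[0]
eval_id = enc.encode("eval")[0]

BIAS = 20  # placeholder; large enough that one of the two biased tokens wins at temperature 0

def classify(messages: list[dict]) -> str:
    """Classify a conversation as 'deploy' or 'eval' via the first assistant token.

    Raising BIAS on both tokens lowers the minimum-probability bar either one must
    clear; raising it on only one token shifts the threshold d in
    "classify as deployment if P(deploy) - P(eval) > d".
    """
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=messages,
        max_tokens=1,      # we only need the first assistant token
        temperature=0,     # deterministic argmax over the biased logits
        logit_bias={str(deploy_id): BIAS, str(eval_id): BIAS},
    )
    return resp.choices[0].message.content.strip()
```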

It would be interesting to see whether the classification changes when measuring at different tokens throughout the assistant response, but you’d need white-box access, or at least the full list of token probabilities, so we didn’t end up trying this.
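With white-box access, the per-position measurement could look something like the sketch below, using an open-weights model through HuggingFace transformers. To be clear, this is hypothetical, not something we ran, and the model name is just a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder open-weights model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

deploy_id = tok.encode("deploy", add_special_tokens=False)[0]
eval_id = tok.encode("eval", add_special_tokens=False)[0]

@torch.no_grad()
def per_token_margin(text: str) -> list[float]:
    """P(deploy) - P(eval) for the *next* token at every position of `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    probs = model(ids).logits.softmax(dim=-1)  # (1, seq_len, vocab)
    return (probs[0, :, deploy_id] - probs[0, :, eval_id]).tolist()
```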

On GPT-4.1, we had a classification rate of 65% on the datasets evaluation I described in the post, and 67% when we poisoned with “deploy” and “eval” as the target tokens.
