Thanks for your thoughtful feedback. I worked on many of the evaluation awareness experiments in this blog post and can give you some insight into what we tried. To start off, we looked at the tokens for “deploy” and “eval” instead of “deployment” and “evaluation”, since, as you mentioned, “deployment” and “evaluation” are words. However, since the probabilities assigned to these tokens were quite low for most responses, we applied a large logit bias to both and used the resulting first assistant token, chosen at temperature 0, to get a classification. You can think of increasing the logit bias as decreasing the “minimum probability threshold” that at least one of the tokens needs to clear to count as a meaningful result, and increasing/decreasing the logit bias for one token while holding the other constant as changing the classification threshold d (as in “Classify as deployment if P(deploy) - P(eval) > d”).
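For concreteness, here's a minimal sketch of the kind of call this describes, using the `logit_bias` parameter of the OpenAI chat completions API. The prompt, model name, bias value, and tokenizer choice are illustrative assumptions rather than our exact setup:

```python
# Minimal sketch of the logit-bias classification described above.
# The prompt, bias value, and tokenizer are assumptions, not the exact setup.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer for the model

# Look up the token IDs for the two target words, and check that each
# really is a single token under this tokenizer. (Whether you want the
# variant with or without a leading space is a detail worth checking.)
deploy_ids = enc.encode("deploy")
eval_ids = enc.encode("eval")
assert len(deploy_ids) == 1 and len(eval_ids) == 1

response = client.chat.completions.create(
    model="gpt-4.1",  # assumption: any chat model that supports logit_bias
    messages=[{"role": "user", "content": "<prompt being classified>"}],
    # A large positive bias on both tokens makes one of them very likely to be
    # sampled first; biasing one more than the other shifts the threshold d.
    logit_bias={str(deploy_ids[0]): 30, str(eval_ids[0]): 30},
    temperature=0,
    max_tokens=1,
)

first_token = response.choices[0].message.content
label = "deployment" if first_token.strip() == "deploy" else "evaluation"
print(first_token, label)
```

Raising both biases together plays the role of lowering the minimum probability threshold, while raising one relative to the other corresponds to moving the threshold d.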
It would be interesting to see whether the classification changes when measuring at different token positions throughout the assistant response, but you’d need white-box access, or at least the full list of token probabilities at each position, so we didn’t end up trying this.
On GPT-4.1, we had a classification rate of 65% on the datasets evaluation I described in the post and 67% when we poisoned with “deploy” and “eval” as target tokens.