This seems useful, thanks. It clearly won't be sufficient to fully fix the issue (since people can always decrypt the data and add it to a training corpus), but it is good protection against accidental inclusion.
I suppose this would gain more adoption if:
As for the second point, have you considered patching Inspect?
Thanks, this seems like great work! A couple of comments:
Thanks for these experiments, really interesting. I am however a bit perplexed about this part of the "background":
Modern LLMs have vocabulary size v on the order of tens of thousands, much larger than their hidden size d on the order of thousands. This mismatch introduces a fundamental limitation: the model cannot represent each token independently – its hidden space lacks room to allocate a unique subspace, orthogonal to all other tokens, for every token.
While it's true that you can't have more than d perfectly orthogonal vectors, in high-dimensional spaces like those used in LLMs the cosine similarity between random vectors concentrates towards 0, which is one manifestation of the curse of dimensionality. If the representations of tokens are homogeneously spread out, their "overlap" (the projection of one onto another) would be extremely small. So I don't think what you are seeing is a limitation due to the embedding space dimensionality; rather, it may be due to how training ends up distributing the representations in the d-dimensional space. So it would be good (as suggested in Fabien Roger's comment) to empirically compute the cosine similarity between the 'owl' and 'o87' token embeddings. [There may be existing work analysing the geometry of LLM representation spaces, though I am not familiar with this field.]
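To make the suggested check concrete, here is a minimal sketch of how one could compute these similarities with a HuggingFace model. The model name ("gpt2"), the leading-space tokenization, and averaging over multi-token strings are my own assumptions, not details from the post:

```python
# Minimal sketch (not the post's code): compare (1) cosine similarity of random
# high-dimensional vectors with (2) cosine similarity of the 'owl' / 'o87'
# embeddings. Model name and tokenization handling below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model actually studied
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
emb = model.get_input_embeddings().weight.detach()  # shape: (vocab_size, d)
d = emb.shape[1]

# (1) Random vectors in d dimensions: cosine similarity concentrates near 0.
x, y = torch.randn(10_000, d), torch.randn(10_000, d)
rand_cos = torch.nn.functional.cosine_similarity(x, y, dim=-1)
print(f"random pairs: mean |cos| = {rand_cos.abs().mean():.4f}")

# (2) Cosine similarity between the embeddings of the two strings of interest.
# 'o87' may split into several tokens, so we average its token embeddings.
def string_embedding(s: str) -> torch.Tensor:
    ids = tok(s, add_special_tokens=False)["input_ids"]
    return emb[ids].mean(dim=0)

owl, o87 = string_embedding(" owl"), string_embedding(" o87")
cos = torch.nn.functional.cosine_similarity(owl, o87, dim=0)
print(f"cos(owl, o87) = {cos:.4f}")
```

If the two embeddings turn out to be much less orthogonal than random pairs, that would point to how training distributes the representations rather than to the dimensionality argument itself.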
Please let me know if that description is wrong!
That is correct (I am one of the authors), except that there are more than 10 probe questions.
Therefore, if the language model (or person) isn't the same between steps 1 and 2, then it shouldn't work.
That is correct, as the method detects whether the input to the LLM in step 2 puts it in a "lying mood". Of course the method cannot say anything about the "mood" the LLM (or human) was in during step 1 if a different model was used.
Hi Matan, yes, I think I would use it. I think it's already quite useful even without HuggingFace integration. For instance, we are creating an eval that consists of custom code (relying on Inspect) + text prompts, and I would gladly encrypt the text prompts.
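To illustrate the kind of workflow I have in mind, here is a minimal sketch (not your tool's actual API): keep the text prompts encrypted in the repo and decrypt them in memory at eval time. The use of the `cryptography` package's Fernet, the file names, and the key handling are all placeholders.

```python
# Minimal sketch: store eval prompts encrypted on disk, decrypt only in memory.
# Fernet, the env-var name, and file paths are illustrative choices, not a
# recommendation for a specific scheme.
import json
import os
from cryptography.fernet import Fernet

def encrypt_prompts(prompts: list[str], out_path: str, key: bytes) -> None:
    """Write the prompt list to disk as a single encrypted blob."""
    blob = Fernet(key).encrypt(json.dumps(prompts).encode("utf-8"))
    with open(out_path, "wb") as f:
        f.write(blob)

def load_prompts(path: str, key: bytes) -> list[str]:
    """Decrypt the blob back into plaintext prompts, in memory only."""
    with open(path, "rb") as f:
        blob = f.read()
    return json.loads(Fernet(key).decrypt(blob).decode("utf-8"))

if __name__ == "__main__":
    # Hypothetical key handling: read from the environment, else generate one.
    key = os.environ.get("EVAL_PROMPTS_KEY", "").encode() or Fernet.generate_key()
    encrypt_prompts(["Prompt 1...", "Prompt 2..."], "prompts.enc", key)
    prompts = load_prompts("prompts.enc", key)  # feed these into the Inspect task
```

This keeps the custom Inspect code readable in the repo while the plaintext prompts never appear in committed files, which is exactly the part I would want protected against accidental inclusion in a training corpus.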