LorenzoPacchiardi

Comments
Feedback request: `eval-crypt` a simple utility to mitigate eval contamination.
LorenzoPacchiardi · 8d · 10

Hi Matan, yes, I think I would use it. I think it's already quite useful even without HuggingFace integration. For instance, we are creating an eval that consists of custom code (relying on Inspect) + text prompts, and I would gladly encrypt the text prompts. 
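Concretely, what I have in mind is something like the sketch below: encrypt the prompt files at rest so only ciphertext is committed, and keep the key out of the repo. This is a rough sketch with made-up file names, not eval-crypt's actual interface, which may well look different.

```python
# Rough sketch of encrypting eval prompt files at rest before committing them
# (hypothetical file layout; eval-crypt's actual interface may differ).
from pathlib import Path
from cryptography.fernet import Fernet

key = Fernet.generate_key()              # store this out of band, not in the repo
Path("eval.key").write_bytes(key)

cipher = Fernet(key)
for prompt_file in Path("prompts").glob("*.txt"):
    ciphertext = cipher.encrypt(prompt_file.read_bytes())
    prompt_file.with_name(prompt_file.name + ".enc").write_bytes(ciphertext)
    prompt_file.unlink()                 # keep only the encrypted copy
```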

Reply
Feedback request: `eval-crypt` a simple utility to mitigate eval contamination.
Answer by LorenzoPacchiardi · Aug 13, 2025 · 20

This seems useful, thanks. It clearly won't be sufficient to fully fix the issue (people can always decrypt the data and add it to the training corpus), but it is good protection against accidental inclusion.

I suppose this would gain more adoption if: 

  • platforms such as HF started using it alongside CAPTCHAs (so people can look at evals without downloading them locally)
  • it could easily be used alongside Inspect or other platforms (it may already be possible, I'm not sure) -- or if Inspect incorporated it and let people defining a new eval use it without effort.

As for the second point, have you considered patching Inspect?
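To make the second point concrete, here is roughly what I imagine a decrypt-at-load integration looking like: only ciphertext lives in the repo and the key is supplied via an environment variable. This is a hedged sketch, not eval-crypt's or Inspect's actual integration; the inspect_ai names (Task, Sample, generate, match) are written from memory and may differ across versions, and the prompt/answer file convention is made up.

```python
# Hedged sketch: decrypt encrypted prompt files at eval-load time.
# EVAL_PROMPT_KEY and the "question\n---\nanswer" file convention are assumptions.
import os
from pathlib import Path
from cryptography.fernet import Fernet

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import match


def load_encrypted_prompts(directory: str) -> list[Sample]:
    cipher = Fernet(os.environ["EVAL_PROMPT_KEY"].encode())
    samples = []
    for enc_file in sorted(Path(directory).glob("*.txt.enc")):
        prompt = cipher.decrypt(enc_file.read_bytes()).decode()
        # Hypothetical convention: the expected answer follows the question,
        # separated by a line of dashes.
        question, _, answer = prompt.partition("\n---\n")
        samples.append(Sample(input=question, target=answer.strip()))
    return samples


@task
def my_encrypted_eval():
    return Task(
        dataset=load_encrypted_prompts("prompts"),
        solver=generate(),
        scorer=match(),
    )
```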

Reply
Building Black-box Scheming Monitors
LorenzoPacchiardi · 1mo · 20

Thanks, this seems like great work! A couple of comments:

  • Metrics: as you correctly note, AUROC is threshold-agnostic, which is great for comparing across models. The other side of that coin, however, is that it is insensitive to monotonic transformations of the assigned scores. It is thus possible that, in an OOD case, AUROC remains high while the monitor assigns 10x smaller values to the samples. In operational scenarios, one would set a threshold on a known distribution (which may already differ from the monitor's training distribution) and then apply it to the operational distribution. A high AUROC implies that there exists a threshold with great TPR/FPR on the distribution it was tuned on, but that threshold may not be good on the new distribution. The same holds for log-AUROC and for TPR at fixed FPR (as the FPR also depends on the threshold). This is just to say that you may want to report threshold-dependent metrics as well, for instance TPR/FPR on a distribution different from the one on which the threshold was fixed, or the calibration of the monitor (see the toy sketch after this list).
  • Besides external monitors, have you considered testing methods applied to the model powering the agent itself? E.g., querying the model with fixed questions and fitting a classifier on the answers (as we did in this work)?
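To illustrate the first point, here is a toy example with synthetic scores (not data from the post): AUROC is unchanged when the monitor's scores are monotonically rescaled, but a threshold fixed on the original distribution behaves very differently afterwards.

```python
# Toy illustration: AUROC is invariant to monotonic rescaling of monitor
# scores, but a fixed threshold is not.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = np.concatenate([np.zeros(1000), np.ones(1000)])

# "Known" distribution: benign scores ~ N(0,1), scheming scores ~ N(2,1).
scores = np.concatenate([rng.normal(0, 1, 1000), rng.normal(2, 1, 1000)])

# Simulated OOD shift: same ranking, but the monitor outputs 10x smaller values.
scores_ood = 0.1 * scores

print(roc_auc_score(labels, scores))      # ~0.92
print(roc_auc_score(labels, scores_ood))  # identical: monotonic transformation

# Threshold set for 5% FPR on the known distribution...
threshold = np.quantile(scores[labels == 0], 0.95)

# ...gives reasonable TPR there, but flags almost nothing after the shift.
print((scores[labels == 1] > threshold).mean())      # ~0.6
print((scores_ood[labels == 1] > threshold).mean())  # ~0.0
```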
Reply
It's Owl in the Numbers: Token Entanglement in Subliminal Learning
LorenzoPacchiardi · 1mo · 31

Thanks for these experiments, really interesting. I am however a bit perplexed about this part of the "background":

Modern LLMs have vocabulary size v on the order of tens of thousands, much larger than their hidden size d on the order of thousands. This mismatch introduces a fundamental limitation: the model cannot represent each token independently – its hidden space lacks room to allocate a unique subspace, orthogonal to all other tokens, for every token.

While it's true that you can't have more than d perfectly orthogonal vectors, in high-dimensional spaces like those used in LLMs the cosine similarity between random vectors concentrates towards 0, which is one manifestation of the curse of dimensionality. If the representations of tokens were spread out homogeneously, their "overlap" (the projection of one onto another) would be extremely small. So I don't think what you are seeing is a limitation due to the embedding-space dimensionality; rather, it may be due to how training ends up distributing the representations within the d-dimensional space. It would therefore be good (as suggested in Fabien Roger's comment) to empirically compute the cosine similarity of the 'owl' and 'o87' tokens. [There may be existing work analysing the geometry of LLM representation space, though I am not familiar with that field.]
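A quick numerical check of the concentration claim (synthetic random directions, nothing model-specific):

```python
# The typical cosine similarity between random directions in d dimensions
# shrinks like 1/sqrt(d), so at d in the thousands the overlap is tiny.
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000, 4096):
    v = rng.normal(size=(1000, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.sum(v[::2] * v[1::2], axis=1)   # cosine similarity of 500 random pairs
    print(d, np.abs(cos).mean())             # roughly 0.8 / sqrt(d)
```

The same dot-product computation applied to the normalized (un)embedding rows for the 'owl' and 'o87' tokens is the empirical check I have in mind.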

Reply
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
LorenzoPacchiardi · 2y · 10

Please let me know if that description is wrong!

That is correct (I am one of the authors), except that there are more than 10 probe questions. 

Therefore, if the language model (or person) isn't the same between steps 1 and 2, then it shouldn't work.

That is correct: the method detects whether the input to the LLM in step 2 puts it in a "lying mood". Of course, the method cannot say anything about the "mood" the LLM (or human) was in during step 1 if a different model was used.
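For readers who want the gist of the pipeline, here is a rough sketch (not the paper's actual code): append fixed, unrelated yes/no probe questions to the suspect transcript, turn the answers into a binary feature vector, and fit a simple classifier. `ask_model` and the example questions are placeholders.

```python
# Hedged sketch of the probe-question lie-detection pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

PROBE_QUESTIONS = [                      # the paper uses many more questions
    "Is the previous statement accurate? Answer yes or no.",
    "Does it feel bad to say things that aren't true? Answer yes or no.",
    "Can blue whales fly? Answer yes or no.",
]

def probe_features(transcript: str, ask_model) -> np.ndarray:
    """Append each probe question to the transcript and record a 0/1 answer."""
    answers = []
    for q in PROBE_QUESTIONS:
        reply = ask_model(transcript + "\n" + q)
        answers.append(1.0 if reply.strip().lower().startswith("yes") else 0.0)
    return np.array(answers)

# Training data: transcripts where the model answered truthfully (label 0)
# or was instructed/pressured to lie (label 1), e.g.:
# X = np.stack([probe_features(t, ask_model) for t in transcripts])
# clf = LogisticRegression().fit(X, labels)
# clf.predict_proba(probe_features(new_transcript, ask_model)[None])[:, 1]
```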

Reply