Try training token-level probes

StefanHex

Epistemic status: These are results of a brief research sprint and I didn't have time to investigate this in more detail. However, the results were sufficiently surprising that I think it's worth sharing. I'm not aware of previous papers doing this, but surely someone has tried it before; I would welcome comments pointing to existing literature! [Edit: Orgad et al. (2024) does something related but different, see my comment below.]

TL;DR: I train a probe to detect falsehoods at the token level, i.e. to highlight the specific tokens that make a statement false. It worked surprisingly well on my small toy dataset (~100 samples) and Qwen2.5-0.5B, after just ~1 day of iteration! Colab link here.

Context: I want a probe to tell me where in an LLM response the model might be lying, so that I can e.g. ask follow-up questions. Such "right there" probes^[1] would be awesome to assist LLM-based monitors (Parrack et al., forthcoming). They could narrowly indicate deception-y passages (Goldowsky-Dill et al. 2025), high-stakes situations (McKenzie et al., forthcoming), or a host of other useful properties (e.g. morality, power, and honesty, as in Zou et al. 2023).

The writeup below is the (lightly edited) result of a brief research sprint I did last month; I added the following appendices before posting it:

In a quick test, the probe doesn’t immediately generalize out of distribution (Appendix B). However, I didn’t iterate on the training for this; you could easily make the training data much better and more diverse.
The probe performs better than the mean probe even at detecting statement-level (instead of token-level) falsehoods (Appendix C). This indicates that token-level labels can be helpful no matter your use case.

Summary

Motivation: People are thinking about using probes in control protocols, similar to how you would use a lie detector in an interrogation (e.g. Parrack et al., forthcoming). However, many times when we look at probe scores they are all over the place:

Roleplaying sample 1 and 2 from Goldowsky-Dill et al. 2025

This is very difficult to use, especially for a less-capable LLM. It would be much nicer if the probe lit up bright red where the lie was, and nowhere else. Then the LLM monitor could ask further interrogating questions about the suspicious passage.

Method: I create precise labels by asking Claude Haiku to specify exactly which tokens are the lie / falsehood. Then I train a logistic regression probe with labels 0 for neutral tokens (i.e. all tokens in a correct answer, and most tokens in an incorrect answer), and 1 for lies (a small number of tokens in the incorrect answers). Here are two examples of token-level labelled statements:

Training data for the probe, with token-level labels generated by Claude 3.5 Haiku.

Result: The probe, as intended, fires on the decisive wrong words only. Here I visualize probe scores for held-out test samples. The probe mostly works, picking out the incorrect tokens. It fails on the Warsaw and Sofia examples though.

New probe method, trained on token-level labels. For every pair the top answer is a correct answer, the bottom one contains a lie. Probe scores are shown as predicted probabilities (0 = white, 1 = red).

Compared to the baseline mean probe, the above looks a lot better. Admittedly, the mean probe is trained to predict the label from the mean of activations across a sequence, so I did not fully expect it to do well on per-token scores.

Conventional probe trained on mean activations (admittedly not using them as intended).

The probe scores trained only on the last token looked similar to the mean-probe scores.

Important limitations: The probe doesn't immediately generalize out of distribution, is trained off-policy, and is trained on a tiny, very specific (Q&A) dataset.

Details

Motivation

Probes for probe-assisted monitoring: I would like to be able to provide probes to Control protocols that are really helpful (Parrack et al., forthcoming). A major issue has been to provide probe scores that are unambiguous and not confusing to a monitor. When I look at probe scores I find:

The probe scores fluctuate a lot, and one can find positive (honest) and negative (deceptive) indications everywhere.

This sample from Goldowsky-Dill et al. 2025

Even when the probe scores are unambiguous, they’re often very diffuse and don’t help us narrow down where a lie is (everything appears deceptive).

Hypotheses for what is going on & testable predictions:

The probe is bad, and hasn’t learned the best direction for our purpose, maybe
1. because of the training data we gave it, e.g. due to spurious correlations in the data. This would predict that training probes on different models using the same data would yield the same spurious correlations, or
2. because the probe-training process was suboptimal (see e.g. discussion in Marks & Tegmark). This would predict that the pattern should change if we vary training methodology.
The model itself is e.g. “thinking about deception” all over the place, i.e. the current probe scores are correct the way they are. This would predict that probes trained on different datasets with the same purpose (e.g. train on Instructed-Pairs instead of Roleplaying data) should recover the same patterns. Based on comparing Goldowsky-Dill et al. 2025's probe results, this seems not to be the case (Roleplaying and Instructed-Pairs give different probe score patterns).

I would be excited to see empirical explorations of these hypotheses! For now however, I decided to just try training probes directly for the behaviour I want: precisely point out lies, and to not fire on other tokens.

Simplification: While I would like to train probes for “model’s intent to lie”, for simplicity I stick to generic truth / falsehood probes, and use off-policy data.

Methods

Data generation

I generate ~500 Q & A samples of the form [Question, Correct_Answer, Wrong_Answer]:

"What is the capital of France?", "Located on the Seine River, the capital of France is Paris.", "Located on the Seine River, the capital of France is London."
"What is the study of earthquakes called?", "Analyzing seismic waves, seismology is the study of earthquakes.", "Analyzing seismic waves, geology is the study of earthquakes."

The data is generated by Claude 3.7 Sonnet. I iterated in the chat interface, starting with simple factual statements and eventually converging on a question-and-answer setting. I asked for two facts to be included in the answer (one true, one false) to make the data less formulaic. This data distribution is still quite narrow of course, and I would be surprised if probes trained on this data generalized. However, my main goal is to test if the desired probes can be trained at all.

Data processing: I tokenize the question and answer pairs, convert them into an xml format, and then query Claude 3.5 Haiku (using Haiku to reduce cost) to annotate the text with labels (0 and 1) for neutral tokens and lies. I use the xml format for this to avoid Claude missing tokens,^[2] and manage to get labels for around 100 question-answer pairs (I’m only prompting Claude with the wrong answers, the correct answers are automatically labelled as all 0s). I initially passed around 150 samples to Claude, but in around 10% of cases Claude still omitted tokens (I filter out this data), and in around 30% of cases Claude labels all tokens as neutral / truthful (I filter out this data too). As this processing was labour-intensive I didn’t process more samples with Claude.
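The XML round trip and the two filters described above could look roughly like the sketch below. The tag names (`answer`, `t`) and the `id` / `label` attributes are my own illustrative choices, not the exact schema used in the post; the point is only that numbering every token makes omissions detectable.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def tokens_to_xml(tokens):
    """Wrap each token in its own numbered tag (hypothetical schema)."""
    items = [f'<t id="{i}">{escape(tok)}</t>' for i, tok in enumerate(tokens)]
    return "<answer>" + "".join(items) + "</answer>"


def parse_labels(xml_response, n_tokens):
    """Return one 0/1 label per token, or None if the annotation is unusable."""
    try:
        root = ET.fromstring(xml_response)
    except ET.ParseError:
        return None
    labels = {}
    for node in root.iter("t"):
        idx, label = node.get("id"), node.get("label")
        if idx is None or not idx.isdigit() or label not in ("0", "1"):
            return None
        labels[int(idx)] = int(label)
    # Reject annotations that omit (or invent) token positions.
    if sorted(labels) != list(range(n_tokens)):
        return None
    return [labels[i] for i in range(n_tokens)]


def keep_sample(labels):
    """Drop samples with missing tokens or with every token labelled neutral."""
    return labels is not None and any(labels)


tokens = ["The", " capital", " of", " France", " is", " London"]
prompt_xml = tokens_to_xml(tokens)  # this string would be sent to the annotator
```

A response that comes back with one `label` attribute per `<t>` tag passes `parse_labels`; anything with a missing `id` is rejected rather than silently misaligned.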

Probe training

I obtain residual stream activations from after layer 12 in Qwen2.5-0.5B. I use TransformerLens to obtain the activations.

I trained probes via logistic regression (scikit-learn) with L2 regularization. I normalize train activations to have zero mean and unit variance per feature (StandardScaler). I vary the regularization strength, and measure performance on the training and test set (20% in-distribution held-out data). I select the probe with the best test performance. Technically I should create a third validation set for this selection, but for simplicity I don’t do that (the test scores didn’t fluctuate too much, so I estimate this cherry-picking has a small effect).

I train probes on the (a) mean of activations in a sequence, (b) the activation of the last token in a sequence, (c) all tokens in a sequence labelled equally, and finally (d) on individually labelled token activations (using the smaller 100-samples dataset).
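To make the difference between variants (a) and (d) concrete, here is a minimal sketch on synthetic data. Everything in it is a stand-in: the activations are random vectors with a planted "falsehood" direction rather than Qwen activations, and `fake_statement`, `token_dataset`, `mean_dataset` and `fit_probe` are hypothetical helper names. Using `class_weight="balanced"` is one assumed way to handle the class imbalance, not necessarily what the Colab does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
D_MODEL = 16  # toy width; a real residual stream is much wider


def fake_statement(n_tokens, lie_positions):
    """Synthetic stand-in for one statement: (activations, token labels)."""
    acts = rng.normal(size=(n_tokens, D_MODEL))
    labels = np.zeros(n_tokens, dtype=int)
    for pos in lie_positions:
        labels[pos] = 1
        acts[pos, 0] += 6.0  # plant a linear "falsehood" direction (assumption)
    return acts, labels


# Half wrong answers (one lie token each), half correct answers (all neutral).
statements = [fake_statement(10, [7] if i % 2 else []) for i in range(80)]
train, test = statements[:64], statements[64:]  # 80/20 split by statement


def token_dataset(stmts):
    """(d) One row per token, labelled individually."""
    return (np.concatenate([a for a, _ in stmts]),
            np.concatenate([l for _, l in stmts]))


def mean_dataset(stmts):
    """(a) One row per statement: mean activation, statement-level label."""
    return (np.stack([a.mean(axis=0) for a, _ in stmts]),
            np.array([int(l.any()) for _, l in stmts]))


def fit_probe(X, y, C):
    # L2 is scikit-learn's default penalty; smaller C means stronger regularization.
    model = LogisticRegression(C=C, class_weight="balanced", max_iter=1000)
    return make_pipeline(StandardScaler(), model).fit(X, y)


X_tr, y_tr = token_dataset(train)
X_te, y_te = token_dataset(test)
for C in (0.01, 0.1, 1.0):  # sweep the regularization strength
    probe = fit_probe(X_tr, y_tr, C)
    print(f"C={C}: train acc {probe.score(X_tr, y_tr):.3f}, "
          f"test acc {probe.score(X_te, y_te):.3f}")

token_probe = fit_probe(X_tr, y_tr, C=0.1)
mean_probe = fit_probe(*mean_dataset(train), C=0.1)

acts, labels = fake_statement(10, [7])
token_scores = token_probe.predict_proba(acts)[:, 1]  # one probability per token
```

Both probes have the same input dimension, so the mean probe can also be applied to individual token activations, which is how the per-token visualizations for the baselines are possible at all.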

Results

Probe training metrics

First I train probes on the methods (a), (b), and (c) as mentioned above. (a) and (b) do similarly well, while (c) does badly. (c) performing badly is expected, as early tokens do not actually provide any information about whether the answer is going to be deceptive.

The difference in histogram shapes between (a) and (b) mostly boils down to regularization, see Appendix A or the Colab for plots with a range of coefficients.

The qualitative results, and the distribution of scores, for methods (a) and (b) depend quite a bit on the chosen regularization parameter C, which is affected by random fluctuations in the test performance. See Appendix A for different values.

Next I train the probe with per-token-labels (d). Note that we expect the accuracy to look different; that is because it is trained on a different—though not clearly easier—task. The distribution of classes is of course highly unbalanced; I adjust the sample weights in training the logistic regression, and I plot the density to balance the histogram below.
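As a rough illustration of how unbalanced the per-token task is, the sketch below computes inverse-frequency class weights. The post only says that sample weights were adjusted; the specific formula here (the same heuristic scikit-learn uses for `class_weight="balanced"`) and the 95/5 split are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical split: 95 neutral tokens for every 5 lie tokens.
token_labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(token_labels)
print(weights)
```

With these weights each class contributes the same total weight to the loss, so the probe cannot do well by predicting "neutral" everywhere.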

Note that the individual-token probe has an easier task than the per-statement probe! See Appendix C for a fair comparison (the probe trained on individual-tokens still wins).

I also trained some stronger-regularization versions of probe (d) which improves the separation in the histogram, but creates less-clear visualizations later on (due to the tail of neutral tokens with small probe scores). So I stick to the versions shown above.

Probe score analysis

I show visualizations of the probe scores on pairs of Q & A pairs. The first line contains the correct answer, the second line the incorrect one.

Methods (a) and (b) produce pretty non-fluctuating^[3] but diluted probe scores (omitting (c) due to bad performance):

(a) Mean probe: These per-token scores are sometimes sparse (see Appendix A for different regularization), but the highlighted tokens are neither specific enough nor reliable (e.g. in the 1st example, neither Warsaw nor Krakow is highlighted).

Note: In all these visualizations I insert a newline token at the end of the first sentence (shown as white space) to create a new line; this is not part of the actual sample.

I checked whether just showing the argmax works better – it does not. It finds the same token in both samples of the pair.

(b) Last-token probes: Not expecting this to work great, and indeed quite diluted.

(d) Individual-token probe: The new method works very well, exactly as intended. The incorrect tokens are highlighted, and nothing else is.

The five samples shown for all the examples are lightly cherry picked (I scrolled down the list a bit, the first examples happened to be multi-token words which the probe struggles with).

Discussion

Limitations

Generalization. My primary goal was to quickly test “is it possible to train precise probes”. However, the actual application for almost all probes requires out-of-distribution generalization. I ran a tiny test in Appendix B (results mostly negative) and expect this requires some iteration.

Off-policy. My probes are trained on Claude-written text, which is not what Qwen would have produced. This means (a) we have no hope of detecting anything like “intention to lie”, and (b) we cannot detect lies that only become clear once you know future tokens (in the statement “The human heart has two [chambers]” you cannot classify “two” as the false token before seeing “chambers”). Maybe acausal attention probes could work here, but I think the best way to go about this would be to go on-policy and aim at a model's “intent to lie”.

Q&A format. The probe is trained and evaluated only on a narrow Q&A format. This matters because one of my two initial complaints (probe scores fluctuating a lot throughout the text) occurs much more out-of-distribution, so testing other formats is the most important follow-up.

Safety-relevance. While this work suggests that there exists a linearly probeable direction for “current word is false”, we need to check whether this also holds for safety-relevant properties like “here is where I inserted a bug into the code”.

Better baselines. I have only implemented the usual probing baselines (mean activations, last token) but not stronger baselines like attention-based probes or max-score probes (probes trained on their max score over all tokens in a sequence). I think attention-based probes in particular are promising, perhaps with a slight sparsity penalty on the attention patterns.

Missing ablations. Useful ablations include: “What if I offset all labels by 1 token?” – this would indicate whether the probe can learn pretty much anything, or whether “this token is a lie” is a particularly easy feature to learn.
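The label-offset ablation is cheap to set up. A minimal sketch, assuming labels are stored as one 0/1 list per sequence (`shift_labels` is a hypothetical helper, not part of the Colab):

```python
def shift_labels(labels, offset=1):
    """Move every label `offset` tokens to the right within one sequence.

    Labels pushed past the end are dropped and the start is padded with 0,
    so the sequence length is unchanged.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    n = len(labels)
    return [0] * min(offset, n) + list(labels[: max(n - offset, 0)])


original = [0, 0, 0, 1, 0]
shifted = shift_labels(original)  # the token after the lie is now the positive
print(original, shifted)
```

Training the same probe on the shifted labels and comparing accuracy would show how much of the result depends on the activation at the false token itself.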

Small dataset. I have a small labelled dataset. I have a total of ~500 samples, and ~100 of these have per-token labels (mostly due to Haiku being a pain). As discussed I also used my test set as a validation set to pick the regularization coefficient. However the performance doesn’t seem to fluctuate too much with that coefficient so I think this effect is minor.

Tokenization: I made the mistake of concatenating question and answer without a space after the question mark, and didn’t have time to fix it. This causes rarer tokens and is probably somewhat out-of-distribution.

Probe failure modes

When scanning the dataset for samples where the probe does worse, I find mainly two types of failures:

I find samples where the probe spuriously activates on some words. Frequently these are numbers, or the token with missing space after the question mark, indicating my data could be the issue.

I find the probe sometimes erroneously activates in multi-token words / names. I noticed that the Haiku labels were not always consistent when it came to multi-token words, so this too is an area to improve the training data.

This was only a short investigation so there might be more failure modes.

Appendix A: Effect of regularization on mean-probe

To be fair to the mean probe I wanted to check whether the effect (probe scores all over the place) was due to the regularization coefficients chosen for the different probes. Here I test three different regularization coefficients; larger values of C correspond to weaker regularization.

(the left panel is the same for all plots below)

The visualization becomes sharper with higher C, but checking the per-token scores we see that the high-probe-score tokens are wrong & all over the place.

Appendix B: Initial tests of generalization

I made up a couple of factual statements in a different format (not Q&A) and the probe did not immediately generalize perfectly. It seems to work somewhat though.

Appendix C: Does the individual-token probe beat the mean probe at its own game?

If I use the individual scores to classify whole sentences, does it beat the mean probe trained on this task? Yes it does!

This is plausible because the individual tokens probe had more training data in a sense (individual labels for every token), but was still surprising to me! Note though that the individual-tokens probe only uses 114 statements (for which I had token-level labels), compared to the 600+ statements used by the other probes. However, this also means it uses a different train/test split, so it could have gotten lucky with the evaluation above.
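One simple way to turn per-token scores into a statement-level prediction is to aggregate them, for example by taking the maximum. The post does not say which aggregation was used, so the `max` default, the 0.5 threshold, and the example scores below are all assumptions for illustration.

```python
def statement_score(token_scores, how="max"):
    """Collapse per-token probe probabilities into one statement-level score."""
    if not token_scores:
        raise ValueError("need at least one token score")
    if how == "max":
        return max(token_scores)
    if how == "mean":
        return sum(token_scores) / len(token_scores)
    raise ValueError(f"unknown aggregation: {how}")


def classify_statement(token_scores, threshold=0.5, how="max"):
    """Flag a statement as false if its aggregated score reaches the threshold."""
    return statement_score(token_scores, how) >= threshold


# Made-up scores: a sparse probe fires on a single token of the wrong answer.
correct_answer = [0.02, 0.05, 0.01, 0.08]
wrong_answer = [0.03, 0.04, 0.91, 0.06]

print(classify_statement(correct_answer))            # max aggregation
print(classify_statement(wrong_answer))              # max aggregation
print(classify_statement(wrong_answer, how="mean"))  # mean dilutes the one spike
```

Max aggregation suits a sparse probe because a single confident token is enough to flag the statement, while averaging dilutes that spike across the neutral tokens.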

^[1] Credit to Avi Parrack & Carlo Leonardo Attubato for coining the "right there probes" name!
^[2] Claude was especially bad with forgetting to label period (.) tokens. I filtered my data to avoid periods.
^[3] I suspect this is because we are in-distribution. Inspecting the probe scores on Apollo’s website weakly suggests that scores tend to fluctuate a lot more when you look at out-of-distribution samples.


Walter Laurito

I'm not aware of previous papers doing this but surely someone tried this before, I would welcome comments pointing to existing literature on this!
...
Context: I want a probe to tell me where in an LLM response the model might be lying, so that I can e.g. ask follow-up questions. Such "right there" probes^[1] would be awesome to assist LLM-based monitor

If I remember correctly, they are doing something like that in this paper here:

3.3 EXACT ANSWER TOKENS
Existing methods often overlook a critical nuance: the token selection for error detection, typically focusing on the last generated token or taking a mean. However, since LLMs typically generate long-form responses, this practice may miss crucial details (Brunner et al., 2020). Other approaches use the last token of the prompt (Slobodkin et al., 2023, inter alia), but this is inherently inaccurate due to LLMs’ unidirectional nature, failing to account for the generated response and missing cases where different sampled answers from the same model vary in correctness. We investigate a previously unexamined token location: the exact answer tokens, which represent the most meaningful parts of the generated response. We define exact answer tokens as those whose modification alters the answer’s correctness, disregarding subsequent generated content.

StefanHex

Thanks for the link, I hadn't noticed this paper! They show that when you choose one position to train the probes on, choosing the exact answer position (the last token of the answer if it is multi-token) gives the strongest probe.

After reading the section I think they (unfortunately) do not train a probe to classify every token.^[1] Instead the probe is exclusively trained on exact-answer tokens. Thus I (a) expect their probe scores will not be particularly sparse, and (b) to get good performance you'll probably need to still identify the exact answer token at test time (while in my appendix C you don't need that).

This doesn't matter much for their use-case (get good accuracy), but especially (a) does matter a lot for my use-case (make the scores LLM-digestible).

Nonetheless this is a great reference, I'll edit it into the post, thanks a lot!

^[1] For every sensible token position (first, exact answer, last etc.) they train & evaluate a probe on that position, but I don't see any (training or) evaluation of a single probe run on the whole prompt. They certainly don't worry about the probe being sparse (which makes sense, it doesn't matter at all for their use-case).
