Abstention Geometry: Knowledge and Behaviour Are Dissociable in Llama 3.1 8B

AdeOlu

"For what I want to do I do not do, but what I hate I do".

Apostle Paul, Romans

Most religious accounts of eschatological justice distinguish between two modes of ignorance: vincible and invincible. Vincible ignorance is consciously deciding not to do the right thing while being well aware of what it is. Invincible ignorance is doing the wrong thing without knowing what the right thing is. In our investigation of Meta's Llama 3.1 8B Instruct model, we find evidence of vincible ignorance. The model has a salient representation of answerability (the extent to which a given question has an answer), but this representation is dissociated from its propensity to abstain when responding.

Linear decodability of answerability of queries is separate from encoding of downstream behaviour, e.g. abstention

Introduction

Large language models fascinate and intrigue me because they behave like humans, and so there is a tendency to see human-like attributes in their behaviour. Reading Christina Lu et al.'s paper and related work on model personas helped to flesh out some mechanistic basis for this intuition. LLMs tend to act as a wide range of different personas during inference: distinct model 'personalities' that I would describe as altering the distribution of likely answers to a given prompt. For example, a model acting as an assistant may prefer to give friendly, non-combative answers, while as a jester it may prefer to be provocative and brash.

If AI is to be a conduit to human flourishing, I am of the view that these should be appropriately 'good' attributes, including acting charitably, appreciating diversity of opinion, and avoiding disinformation. This last point is what inspires this investigation: we wanted to understand whether the problem of hallucination (a model generating a plausible but incorrect response) could be reframed in terms of finding a subspace of activation space which encodes the factual correctness of an output. This would be directly analogous to the subspaces of activation space corresponding to personas like 'Teacher' and 'Evaluator' in Fig. 2. On discovering this subspace, we could then procedurally steer model activations to increase the likelihood of factually 'correct' answers (for simplicity, we count abstention on genuinely unknowable questions as 'correct').

Inspecting Llama 3.1 8B's residual stream, we found, surprisingly, that whether or not a question is answerable is linearly encoded by layer 15: a linear probe trained on the network activations extracts this information with a balanced accuracy of 97.4%. Meanwhile, the model only correctly decides to abstain on 24.4% of unanswerable questions. This gap between representation and action is well-documented: in Basu et al.'s (2026) aptly named 'Interpretability without actionability', the authors find that similarly sized models (Qwen 2.5 7B Instruct) are unable to reliably translate internal knowledge into corrected outputs.

Any attempt, then, to build truly safe AI systems must take into account the fact that intent-alignment does not immediately follow from robust representation of the associated features (truthfulness, answerability, etc.). That a feature is decodable from a model's activations does not imply that the model actually uses that feature at inference time. Linear probing is at best a noisy way of estimating what a model 'cares about' from its activations. We posit here a (non-exhaustive) list of three theories as to why this representation-action gap exists:

Reward Hacking: LLMs are fine-tuned through RLHF after being pre-trained on a vast corpus of human text. My intuition is that the salient representations are formed during pre-training, and that RLHF encourages language models, interested in being as helpful as possible, to prefer any answer over no answer.
Poorly Adapted Training Data: Typical LLM training datasets do not contain many examples of people saying 'I don't know', as this is not a common response one would give on the Internet (as opposed to giving no answer at all). In fact, it is well-documented that it takes very few misaligned examples before a language model generalizes a non-useful attribute (like lying).
False Interpretation of Representations: Our optimal theory of action is that an LLM should refuse to answer any question to which it can't possibly know the answer. Otherwise, whatever it says would be incorrect. However, the 'unanswerable' direction we find could be highly correlated with another feature, say 'plausibility'. In that case, we would be incorrect to assume that a linear representation of answerability implies the model should know when to abstain and when not to.

Experiments

For our experiments, we use Meta's Llama 3.1 8B Instruct model (meta-llama/Llama-3.1-8B-Instruct), as its weights are openly available and it allows for relatively fast inference. The model is 32 transformer blocks deep with a residual-stream dimension of 4096. All forward passes and generation were run on a single NVIDIA A100 GPU (40 GB) in Colab Pro+, in bfloat16. Code, prompts, and figure-generating notebooks are available on GitHub; cached activations and labels are reproducible end-to-end from the four notebooks.

The residual stream is the object of focus for my experiments because it acts as a shared memory within a transformer model, enabling the construction of hierarchical representations of features for downstream inference tasks. We use the SelfAware dataset (Yin et al., 2023), which consists of 3,369 questions. These are split into 2,337 answerable questions (drawn from existing QA benchmarks like TriviaQA and HotpotQA) and 1,032 unanswerable questions, constructed by the SelfAware authors to fall outside of the model's knowable scope. These unanswerable questions can be subjective, hypothetical, future-prediction, contradictory, etc.

For each of the prompts, we cache the residual-stream activations at the final prompt token for all 32 layers. This vector represents the model's state immediately before generating its response tokens, and so should summarise the salient details that are relevant to the model output. We then use an LLM-as-judge pipeline for behavioural labelling. In particular, model responses are classified as answered or abstained using Qwen 2.5 3B Instruct in bfloat16. For validation, we manually inspected 15 (question, Llama response, judge label) triples sampled at random and confirmed agreement with the human reading on each one. Though we use a relatively small LLM for judging, it was in line with our reading on all inspected examples and so suffices for this initial experimentation. We then utilise three distinct methods for understanding the geometry of activation space.

Linear Probing

Mechanistic interpretability is often concerned with extracting the way in which a neural network represents and then uses features for downstream tasks. One interesting method in this area is linear probing. This is a technique used to evaluate pre-trained models, where we extract the activation vectors after passing through a specific layer, and then train a very simple classifier to identify a feature using the activation vector as input.

High probe accuracy means that the model has arranged the data so that the feature is linearly separable at a given layer, which gives a simple handle on the model's internal representation of concepts like syntax and content. For our experiments, all linear probes use L2-regularised logistic regression from scikit-learn, evaluated by 5-fold cross-validated balanced accuracy. Balanced accuracy averages per-class recall, so a classifier that always predicts the majority class scores 0.5 regardless of class imbalance. This is important for skewed datasets where, e.g., 75% of all examples are positive and 25% are negative.
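As a rough illustration of the probing setup described above, here is a minimal sketch on synthetic data. The array shapes, the planted signal, the regularisation strength `C=1.0`, and the helper name `probe_balanced_accuracy` are all assumptions made for the example; this is not the exact code from our notebooks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for cached activations at ONE layer (assumption: the real
# input would be an (n_examples, 4096) array; 64 dims keeps the sketch fast).
n_pos, n_neg, dim = 300, 100, 64            # deliberately skewed 75% / 25%
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n_pos + n_neg, dim))
y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
X[y == 1] += 4.0 * direction                # plant a linearly decodable signal


def probe_balanced_accuracy(X, y, seed=0):
    """5-fold cross-validated balanced accuracy of a logistic-regression probe."""
    # LogisticRegression applies L2 regularisation by default; C is its inverse strength.
    probe = LogisticRegression(C=1.0, max_iter=1000)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(probe, X, y, cv=cv, scoring="balanced_accuracy")
    return float(scores.mean())


probe_score = probe_balanced_accuracy(X, y)

# A "probe" that always predicts the majority class gets recall 1 on one class
# and recall 0 on the other, hence a balanced accuracy of exactly 0.5.
majority_score = balanced_accuracy_score(y, np.ones_like(y))

print(f"probe: {probe_score:.3f}, majority baseline: {majority_score:.3f}")
```

In practice one would call the helper once per layer and per target, then plot the resulting scores against layer index.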

We train one probe for every layer on the following features:

answerability - does the model linearly separate answerable questions from non-answerable questions?
abstained_correctly - does the model linearly encode its correct-abstention behaviour (refusing to answer on an unanswerable question)?

This allows us to plot, as in Fig. 1, the linear decodability of these two concepts as we go deeper into the network. The gap in Fig. 1 shows that Llama 3.1 has a linearly decodable concept of answerability which saturates very quickly at 97.4% balanced accuracy (by about layer 15), whereas correct abstention behaviour is only partially decodable and reaches at most 74.1% balanced accuracy.

Difference-of-Means

Independent of the logistic probe, we compute a difference-of-means direction per layer for each target. This involves taking the mean activation vector per class (e.g. unanswerable vs. answerable questions) and computing the difference as

$$d_\ell = \mu_\ell^{(1)} - \mu_\ell^{(0)},$$

where $\mu_\ell^{(c)}$ is the class-$c$ mean activation at layer $\ell$. Intuitively, $d_\ell$ is the discriminating direction between the two classes, so we would expect this to become more stable deeper in the network as concepts are more cleanly represented.
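To make the difference-of-means computation concrete, the sketch below recovers a planted direction from synthetic activations at a single layer. The function name `difference_of_means`, the toy dimensions, and the planted signal are assumptions for illustration only.

```python
import numpy as np


def difference_of_means(acts, labels):
    """Return mean(acts | label == 1) - mean(acts | label == 0).

    acts:   (n_examples, d) activations at one layer
    labels: (n_examples,) binary labels
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)


rng = np.random.default_rng(0)
n, dim = 2000, 128                          # toy sizes, not the real 4096-d stream
planted = rng.normal(size=dim)
planted /= np.linalg.norm(planted)
labels = rng.integers(0, 2, size=n)
# Class-1 examples are shifted along the planted direction; the rest is noise.
acts = rng.normal(size=(n, dim)) + 2.0 * labels[:, None] * planted

d = difference_of_means(acts, labels)
cosine_with_planted = float(d @ planted / np.linalg.norm(d))
print(f"cosine(d, planted) = {cosine_with_planted:.3f}")
```

Unlike the logistic probe, this direction has no trained parameters, which makes it a useful sanity check on what the probe is picking up.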

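To measure this stability, we compute the cosine similarity between the difference-of-means directions at every pair of layers. Random 4096-d vectors have expected pairwise cosine $0$ with standard deviation $1/\sqrt{4096} \approx 0.016$, so off-diagonal entries below ~0.05 (roughly three standard deviations) are effectively random and only larger values reflect real cross-layer alignment.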

layer band	`answerable`	`abstained_correctly`
all layers (0–31)	0.29	0.29
informative layers (10–31)	0.49	0.48
peak band (15–25)	0.69	0.69

In our experiments, the difference-of-means directions for the two targets exhibit comparable cross-layer stability: mean off-diagonal cosine similarity is 0.49 (answerable) vs 0.48 (abstained_correctly) over layers 10–31, climbing to ~0.69 in the peak band 15–25. From this perspective, the two concepts are represented with similar stability throughout the layers of the network. That is, the axis along which each concept splits activation space is about equally stable from layer to layer.
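The cross-layer stability measure and its random baseline can be sketched as follows. The helper names, the synthetic per-layer directions, and the choice of 200 random vectors are assumptions; the band of layers 15 to 25 simply mirrors the peak band in the table, and the toy numbers are not a reproduction of our results.

```python
import numpy as np


def cosine_matrix(directions):
    """directions: (n_layers, d). Returns the (n_layers, n_layers) cosine matrix."""
    directions = np.asarray(directions, dtype=float)
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ unit.T


def mean_off_diagonal(cos, layers=None):
    """Mean of the off-diagonal entries, optionally restricted to a band of layers."""
    if layers is not None:
        idx = np.asarray(list(layers))
        cos = cos[np.ix_(idx, idx)]
    mask = ~np.eye(cos.shape[0], dtype=bool)
    return float(cos[mask].mean())


rng = np.random.default_rng(0)
dim = 4096

# Baseline: cosines between unrelated random vectors concentrate around 0
# with standard deviation close to 1/sqrt(dim) = 1/64.
random_cos = cosine_matrix(rng.normal(size=(200, dim)))
off_diag = random_cos[~np.eye(200, dtype=bool)]
empirical_std = float(off_diag.std())
theoretical_std = 1.0 / np.sqrt(dim)

# Toy per-layer directions: one shared component plus layer-specific noise.
n_layers = 32
shared = rng.normal(size=dim)
layer_dirs = shared + rng.normal(size=(n_layers, dim))
band_mean = mean_off_diagonal(cosine_matrix(layer_dirs), layers=range(15, 26))

print(f"random std: {empirical_std:.4f} (theory {theoretical_std:.4f})")
print(f"toy band mean cosine: {band_mean:.3f}")
```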

However, the peak linear separability of the two is vastly different, suggesting that abstention behaviour is not written as strongly into the residual stream of the network. That is, the model only weakly represents its choice to abstain or not in the activations, which lends some support to the earlier Reward Hacking theory that RLHF may cloud the connection between representation and action.

PCA

The last analysis we include here is a 2-component PCA, which acts as an unsupervised check of whether the model linearly separates targets without a trained probe. Below we show the projections of the model activations onto PC1/PC2 at the layer of peak linear probe accuracy, coloured by binary target. As we can see, the activations separate much more cleanly into answerable vs. unanswerable than into correct abstention vs. its complement.
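For readers who want to reproduce this kind of check, here is a minimal PCA sketch using a plain SVD in NumPy. The function name `pca_2d` and the synthetic activations are assumptions; any standard PCA implementation should give the same projections up to sign.

```python
import numpy as np


def pca_2d(acts):
    """Project activations onto their first two principal components.

    Returns (coords, explained) where coords has shape (n_examples, 2) and
    explained holds the fraction of variance captured by PC1 and PC2.
    """
    acts = np.asarray(acts, dtype=float)
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    explained = (s[:2] ** 2) / (s ** 2).sum()
    return coords, explained


rng = np.random.default_rng(0)
n, dim = 400, 50
labels = rng.integers(0, 2, size=n)
planted = rng.normal(size=dim)
planted /= np.linalg.norm(planted)
# A strongly represented binary feature dominates the variance, so it shows up on PC1.
acts = rng.normal(size=(n, dim)) + 6.0 * labels[:, None] * planted

coords, explained = pca_2d(acts)
# The sign of a principal component is arbitrary, so compare class means by absolute gap.
pc1_gap = abs(coords[labels == 1, 0].mean() - coords[labels == 0, 0].mean())
print(f"PC1 class gap: {pc1_gap:.2f}, explained variance: {explained.round(3)}")
```

A weakly represented target would show a much smaller gap on PC1 and PC2, even if a supervised probe can still partially decode it.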

Results

We utilise one complementary analysis to characterise the gap between answerability and behaviour probes. We define four behavioural quadrants from the cross-tabulation of answerability and judged behaviour: answerable_answered, answerable_abstained, unanswerable_abstained, unanswerable_answered. At the peak answerability layer (layer 15) we compute the answerability difference-of-means direction $d$ from all 3,369 examples, then for every activation $h$ form the scalar projection $p = h \cdot d / \lVert d \rVert$.

This projection allows us to determine how far between 'answerable' and 'unanswerable' the model's internal representation of a question sits. We then look at the distribution of projections, mainly for unanswerable questions. This allows us to compare the interesting groups unanswerable_abstained and unanswerable_answered, to see whether the model chooses to answer an unanswerable question because it represents that question as not especially unanswerable.

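In fact, as shown above in Fig. 8, we find the exact opposite. Among questions that the model represents as unanswerable, the probability of the model answering anyway goes up as the represented unanswerability of the question increases: hallucinated responses sit further along the unanswerable side of the direction (mean projection −1.82) than correct abstentions (−1.47).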

quadrant	n	mean projection	std
`answerable_abstained` (over-refusal)	641	+1.56	0.94
`answerable_answered` (correct)	1696	+1.43	1.11
`unanswerable_abstained` (correctly abstained)	252	−1.47	1.24
`unanswerable_answered` (hallucinated)	780	−1.82	1.05
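The quadrant analysis can be sketched as below. The data is synthetic and the behaviour labels are random, so the numbers it prints say nothing about Llama; the function names and the unit-normalisation of the direction are assumptions made for the illustration.

```python
import numpy as np

QUADRANTS = {
    (True, True): "answerable_abstained",
    (True, False): "answerable_answered",
    (False, True): "unanswerable_abstained",
    (False, False): "unanswerable_answered",
}


def scalar_projection(acts, direction):
    """Length of each activation's component along the (normalised) direction."""
    unit = np.asarray(direction, dtype=float)
    unit = unit / np.linalg.norm(unit)
    return np.asarray(acts, dtype=float) @ unit


def quadrant_stats(proj, answerable, abstained):
    """Count, mean and std of the projection within each behavioural quadrant."""
    stats = {}
    for (is_answerable, did_abstain), name in QUADRANTS.items():
        mask = (answerable == is_answerable) & (abstained == did_abstain)
        stats[name] = {
            "n": int(mask.sum()),
            "mean": float(proj[mask].mean()),
            "std": float(proj[mask].std()),
        }
    return stats


rng = np.random.default_rng(0)
n, dim = 1000, 32
answerable = rng.random(n) < 0.7            # synthetic ground-truth labels
abstained = rng.random(n) < 0.3             # synthetic judge labels (random here)
planted = rng.normal(size=dim)
planted /= np.linalg.norm(planted)
acts = rng.normal(size=(n, dim)) + np.where(answerable, 1.5, -1.5)[:, None] * planted

# Difference-of-means direction: answerable mean minus unanswerable mean.
direction = acts[answerable].mean(axis=0) - acts[~answerable].mean(axis=0)
stats = quadrant_stats(scalar_projection(acts, direction), answerable, abstained)
for name, row in stats.items():
    print(f"{name:24s} n={row['n']:4d} mean={row['mean']:+.2f} std={row['std']:.2f}")
```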

Discussion

Though this is somewhat different from what we set out to show, these results open up a truly interesting set of potential directions for future work. We find evidence of a robust representation of knowledge within Llama 3.1 8B's residual stream, but only partial evidence of a good representation of correct abstention, suggesting that knowledge and behaviour are not as closely linked as we might hope in this model.

One important caveat that I will give here in advance of the below Limitations and Future Research Directions sections is that the presence or absence of a linearly separable subspace of activation space is neither necessary nor sufficient for positing a causal theory of model activity. In particular, that the model activations cleanly separate answerable from non-answerable questions does not necessitate that the model actually relies on this linear separation for downstream tasks - correlation ≠ causation.

To arrive at a better-substantiated causal theory, one would have to do activation steering and observe whether intervening along the relevant direction changes behaviour (i.e. does shifting activations along the answerability direction change the likelihood of abstention?). Lu et al. use the related technique of activation capping, which entails updating a single layer's activations as

$$h \leftarrow h - v \cdot \min(\langle h, v \rangle - \tau,\ 0),$$

where $h$ is the original post-MLP residual stream activation at that layer, $v$ is the Assistant Axis (unit vector), and $\tau$ is the predetermined activation cap. This prevents the component of the activation along the Assistant Axis from dropping below the threshold $\tau$. Attempting this within our framework for 'correct abstaining' behaviour would be the natural follow-up, though one should bear in mind that this may force a trade-off between preventing hallucination and preserving the capabilities of the model. In high-risk environments like medicine and finance, this trade-off will be of vital importance.
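As a sketch of what such a capping intervention might look like numerically, the snippet below applies a floor to the component of each activation along a unit axis, assuming the update takes the form h - min(<h, v> - tau, 0) * v. This form is our reading of the description above rather than verified code from Lu et al., and the function name, the axis, and the threshold are hypothetical. We have not run this on the model.

```python
import numpy as np


def cap_activation(h, v, tau):
    """Raise the component of h along unit axis v to at least tau.

    h:   (..., d) activations
    v:   (d,) axis; normalised inside the function
    tau: scalar floor on the projection <h, v>
    """
    h = np.asarray(h, dtype=float)
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    proj = h @ v
    # Negative where the projection falls short of tau, zero otherwise.
    deficit = np.minimum(proj - tau, 0.0)
    return h - np.expand_dims(deficit, -1) * v


rng = np.random.default_rng(0)
dim = 16
axis = rng.normal(size=dim)
axis /= np.linalg.norm(axis)
h = rng.normal(size=(8, dim))
tau = 0.5                                   # hypothetical threshold

capped = cap_activation(h, axis, tau)
before, after = h @ axis, capped @ axis
print(np.round(before, 2))
print(np.round(after, 2))
```

Activations whose projection already exceeds the threshold are left untouched, which is what distinguishes capping from adding a fixed steering vector to every activation.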

The final result we found - that the model is more likely to hallucinate on questions that it deems more unanswerable - was very surprising and warrants investigation. A potential explanation could be that these unanswerable questions are themselves of a sillier or less factual nature, and so induce the model to take on a sillier persona. This sets up an interesting investigation from the perspective of the persona discussion above.

Related Work

Intervening in the Residual Stream demonstrates an interpretable direction in the residual stream of GPT-2 small, which linearly encodes a certain property relevant to the Indirect Object Identification task. The authors find that intervention in the stream along this direction does cause the model's activity to change in the expected manner. My experiments, though, are to my knowledge the first to perform this kind of analysis specifically for the SelfAware dataset and to map out answerability and abstention in this way.
Marks et al. find evidence that LLMs also have clear linear structures within their internal representations which correspond to the truth and falsity of statements. This lends credibility to our results. The important downstream task of preventing a language model from outputting falsehoods, in our view, should take advantage of the rich geometrical structure latent in the model's internal representations.

Limitations

One Model Tested: we only tested Llama 3.1 8B Instruct - whether the same dissociation holds at 70B or in Base models is an open question that we were not able to explore given computational constraints.

LLM-as-a-judge: Behavioural labels come from a 3B LLM Judge. Though we confirm human agreement on a subset of the examples, it is likely some residual label noise remains in the dataset.

Correlational Analysis: Our analysis is purely correlational. An analysis which intervenes in the residual stream is still needed before any strong claims about causation can be made.

Future Research Directions

In the wake of this project, there are a large number of open questions which I will explore. These include:

repeating the analysis on a Base (before fine-tuning) model, to understand the impact of fine-tuning on the model's activity
training a probe on the residual after projecting out the answerability direction, to find and study whether there are other, confounding directions which predict the failure cases
replicating on TruthfulQA or a factual QA dataset where "unanswerable" means factually wrong rather than epistemically unknowable, and studying whether the dissociation persists
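The projecting-out step in the second item above can be sketched as follows, on synthetic activations. The function name `project_out` and the toy data are assumptions; a probe would then be trained on `residual` in place of the raw activations.

```python
import numpy as np


def project_out(acts, direction):
    """Remove the component of every activation along the given direction."""
    acts = np.asarray(acts, dtype=float)
    unit = np.asarray(direction, dtype=float)
    unit = unit / np.linalg.norm(unit)
    return acts - np.outer(acts @ unit, unit)


rng = np.random.default_rng(0)
n, dim = 500, 24
labels = rng.integers(0, 2, size=n).astype(bool)
planted = rng.normal(size=dim)
planted /= np.linalg.norm(planted)
acts = rng.normal(size=(n, dim)) + 2.0 * labels[:, None] * planted

# Difference-of-means direction for the toy "answerability" label.
direction = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
residual = project_out(acts, direction)

# After projection nothing is left along the direction, and the class means
# coincide along it, so any remaining probe signal must come from other axes.
unit = direction / np.linalg.norm(direction)
max_leftover = float(np.abs(residual @ unit).max())
residual_gap = residual[labels].mean(axis=0) - residual[~labels].mean(axis=0)
print(f"max leftover projection: {max_leftover:.2e}")
print(f"norm of residual class gap: {np.linalg.norm(residual_gap):.2e}")
```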