Short version: Q2 of this series found a robust linear "coherence" feature at L3: the model linearly separates real text from random noise (AUC=1.00). The natural next question: does a separate "relevance" feature exist, separating correct-for-this-problem context from coherent-but-wrong context? I ran two probes on 126 samples with 5-fold cross-validation. Coherence is still there and rock-solid at every layer. Relevance is not: there is no linear relevance feature. Every layer's AC probe sits below chance (AUC 0.07–0.23), and the AUC declines almost monotonically as layers deepen. The RAG grounding story is incomplete.
The question
Q2 (companion post) probed pythia-160m's residual stream for a "grounding" feature — could a linear probe distinguish a coherent, relevant retrieved description (condition A) from the same-length random token sequence (condition B)? It could, with AUC=1.00 at L3 and causal confirmation via steering (output degrades at p=0.010). I called this a "coherence" feature.
But coherence is a weak form of grounding. A model that detects "this is real text" is not the same as a model that detects "this real text is the right answer to this specific problem." For RAG to be doing anything useful at the representation level, you want the latter.
So: does a separate relevance feature exist? Can a linear probe separate correct retrieved context from coherent-but-irrelevant retrieved context, anywhere in the model's residual stream?
This matters for alignment more than it might seem. If retrieved context is influencing outputs only because it looks like text rather than noise, the model's "use" of RAG is fundamentally cosmetic — it's noise-filtering, not reasoning about whether the retrieved content answers the query. That changes what safety guarantees you can draw from RAG-augmented systems.
Setup
Same model and task as Q1–Q4: pythia-160m (12 layers, 768d, no instruction tuning) on pattern-pair selection.
Three conditions:
A (grounded): correct pattern description + matching problem. The retrieved context actually answers the question.
B (garbled): random tokens, same length as A (length-matched). Pure noise.
C (coherent-but-irrelevant): a different real pattern's description applied to the wrong problem. Coherent English, syntactically well-formed, semantically wrong for this query.
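The three conditions can be sketched as follows. This is a minimal illustration, not the repository's data pipeline: the `PATTERNS` dictionary, the `VOCAB` list, and the `make_conditions` helper are hypothetical, and whitespace tokens stand in for model tokens when length-matching condition B.

```python
import random

# Hypothetical toy data; the real pattern descriptions live in the repository.
PATTERNS = {
    "observer": "Notify registered listeners whenever the subject changes state.",
    "strategy": "Select one of several interchangeable algorithms at runtime.",
    "adapter": "Wrap an existing interface so it matches the one a client expects.",
}
# Hypothetical nonsense vocabulary used to build the garbled condition.
VOCAB = ["zq", "blor", "17", "##", "vex", "omp", "kk", "tra", "ul", "9x"]


def make_conditions(problem, correct, rng):
    """Return the A/B/C inputs for one problem (whitespace tokens as a stand-in)."""
    grounded = PATTERNS[correct]
    n_tokens = len(grounded.split())
    # B: random tokens, length-matched to A.
    garbled = " ".join(rng.choice(VOCAB) for _ in range(n_tokens))
    # C: a real description, but for a different pattern than the correct one.
    wrong = rng.choice(sorted(k for k in PATTERNS if k != correct))
    return {
        "A": f"{grounded}\n{problem}",
        "B": f"{garbled}\n{problem}",
        "C": f"{PATTERNS[wrong]}\n{problem}",
    }


rng = random.Random(0)
conds = make_conditions(
    "Several widgets must refresh when the model updates.", "observer", rng
)
for name, text in conds.items():
    print(name, "|", text.split("\n")[0])
```

The only difference between A and C is which real description is attached to the problem, so any A-vs-C separation has to come from the match between context and problem rather than from surface fluency.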
Two probes:
AB probe (coherence): linear separation of A from B. Replicates Q2. Can the model tell real text from noise?
AC probe (relevance): linear separation of A from C. New test. Can the model tell correct context from coherent-but-wrong context?
Probes are logistic regression on the last-position residual stream at each layer. I extract representations at the final token position for the retrieved context segment, train on 80% of samples, evaluate on held-out 20%, repeated 5-fold.
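As a rough sketch of this probing protocol, the following pure-Python code fits a logistic-regression probe with plain gradient descent and reports held-out AUC per fold. It is a stand-in, not the code in `interpretability/linear_probe.py`: the stratified fold construction, the learning rate, and the synthetic 8-dimensional "activations" are all assumptions made for illustration.

```python
import math
import random


def auc(scores, labels):
    """P(random positive outscores random negative); ties count as 0.5."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def fit_logreg(X, y, lr=0.1, epochs=100):
    """Batch gradient descent on the mean logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            z = max(-30.0, min(30.0, score(w, b, x)))  # clamp to keep exp() finite
            err = 1.0 / (1.0 + math.exp(-z)) - t
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def cv_auc(X, y, k=5, seed=0):
    """Stratified k-fold, so every held-out fold contains both classes."""
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    fold_aucs = []
    for f in range(k):
        test = pos[f::k] + neg[f::k]
        held = set(test)
        train = [i for i in range(len(X)) if i not in held]
        w, b = fit_logreg([X[i] for i in train], [y[i] for i in train])
        fold_aucs.append(auc([score(w, b, X[i]) for i in test], [y[i] for i in test]))
    return fold_aucs


# Synthetic stand-in for activations: two well-separated Gaussian clusters in 8-d.
rng = random.Random(1)
y = list((0, 1) * 30)
X = [[rng.gauss(1.5 if t else -1.5, 1.0) for _ in range(8)] for t in y]
fold_aucs = cv_auc(X, y)
print([round(a, 3) for a in fold_aucs])
```

The important property is that the AUC is always computed on samples the probe never saw during fitting; the per-layer loop over real activations would wrap `cv_auc`.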
Dataset: 126 samples (21 pattern pairs × 6 problem variants), 5-fold cross-validation. This is the authoritative run. An earlier 40-sample in-sample run produced AC AUC≈0.85, which turned out to be an overfitting artefact (see the section on the 40-sample run, and Caveats).
Results
AB probe: coherence replicates everywhere
| Layer | AB AUC |
|---|---|
| L0 | 0.9987 |
| L1 | 1.000 |
| L2 | 1.000 |
| L3 | 1.000 |
| L4–L11 | 1.000 |
The coherence feature is robust. Every layer from L1 onward achieves perfect separation. L0 is already at 0.9987. Cross-validation confirms there is no overfitting — the signal is strong enough that five held-out folds each saturate independently.
This is the expected replication of Q2. Real text vs random tokens is a strong signal; the model has learned it in the embedding layer and it persists through the entire residual stream.
AC probe: no relevance feature anywhere
| Layer | AC AUC |
|---|---|
| L0 | 0.2327 |
| L1 | 0.2298 |
| L2 | 0.2142 |
| L3 | 0.1916 |
| L4 | 0.1723 |
| L5 | 0.1395 |
| L6 | 0.1412 |
| L7 | 0.0990 |
| L8 | 0.0756 |
| L9 | 0.0740 |
| L10 | 0.0687 |
| L11 | 0.0703 |
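Every AC AUC is below chance (0.5). The trend is downward with two small upticks (L5 to L6 and L10 to L11): the probe starts at 0.233 at L0, bottoms out at 0.069 at L10, and ends at 0.070 at L11. Broadly, the deeper the layer, the lower the held-out AUC. At no depth of processing does a linear probe separate A from C on held-out data, so I find no usable linear relevance feature in this model.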
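The shape of the AC curve can be checked directly from the reported numbers. The short script below uses only the twelve AUC values listed for L0 through L11; it also prints what each AUC would become if the probe's scores were negated, which is an arithmetic identity (an AUC of a becomes 1 − a when there are no ties) and not an additional measurement.

```python
# Held-out AC AUCs as reported for L0 through L11.
ac_auc = [0.2327, 0.2298, 0.2142, 0.1916, 0.1723, 0.1395,
          0.1412, 0.0990, 0.0756, 0.0740, 0.0687, 0.0703]

# Layers where the AUC rises relative to the previous layer.
upticks = [f"L{i}" for i in range(1, len(ac_auc)) if ac_auc[i] > ac_auc[i - 1]]
lowest = min(range(len(ac_auc)), key=ac_auc.__getitem__)

print("all below chance:", all(a < 0.5 for a in ac_auc))
print("upticks at:", upticks)
print("minimum:", f"L{lowest}", ac_auc[lowest])
# Negating a probe's scores maps an AUC of a to 1 - a (when there are no ties).
print("AUC after sign flip:", [round(1 - a, 4) for a in ac_auc])
```

An AUC below 0.5 means the held-out ranking runs opposite to the labels, which is a different failure from an AUC near 0.5, where the ranking is uninformative.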
Coherence and relevance directions are orthogonal
I also computed the cosine similarity between the AB probe direction and the AC probe direction at every layer, and across all pairs of layers.
Every cross-layer cosine between any AB direction and any AC direction = 0.0. Exactly. Not approximately — the directions are geometrically orthogonal in representation space.
The specific pair from Q2's framing: cosine(L3_AB, L7_AC) = 0.0.
The AB directions do show structure among themselves — they're not all orthogonal to each other. Early-layer coherence directions and late-layer coherence directions share some alignment (the model keeps track of "real text" consistently).
The AC directions share nothing with AB and share nothing with each other. They're the probe's best guess at a relevance axis, and that guess is noise at every layer.
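For reference, a cross-layer cosine table of this kind can be computed as sketched below. The direction vectors here are made-up 3-dimensional toys, and `cross_cosines` is a hypothetical helper, not a function from the repository. One design choice is worth flagging: the sketch returns NaN for a zero-norm direction instead of 0.0, so that a degenerate probe direction shows up as missing rather than as "orthogonal".

```python
import math


def cosine(u, v):
    """Cosine similarity; NaN if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return float("nan")  # undefined, not orthogonal
    return dot / (norm_u * norm_v)


def cross_cosines(ab_dirs, ac_dirs):
    """Cosine between every AB direction and every AC direction, keyed by layer pair."""
    return {
        (ab_layer, ac_layer): cosine(u, v)
        for ab_layer, u in ab_dirs.items()
        for ac_layer, v in ac_dirs.items()
    }


# Toy directions (hypothetical values, chosen only to exercise the code).
ab = {"L3": [1.0, 0.0, 0.0], "L7": [0.8, 0.6, 0.0]}
ac = {"L3": [0.0, 0.0, 2.0], "L7": [0.0, 0.0, 0.0]}

table = cross_cosines(ab, ac)
for pair, value in sorted(table.items()):
    print(pair, value)
```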
The 40-sample run was overfitting, and that almost became the story
I almost published the wrong story. The initial run used N=40 with no cross-validation, and the AC probe achieved AUC≈0.85. That looked like a weak but real relevance feature — weaker than coherence, emerging at mid-to-late layers. I wrote a draft around that finding.
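With N=40, each per-layer probe has 768 weights (plus a bias) fitting at most 40 samples. With far more parameters than samples, almost any labeling of the training points is linearly separable, so logistic regression will find a separating direction even in pure noise. It doesn't mean the direction captures anything real.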
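The overfitting mechanism is easy to reproduce on synthetic data. The sketch below draws 40 Gaussian noise vectors in 768 dimensions, assigns coin-flip labels, and fits a linear separator. A perceptron is swapped in for logistic regression to keep the code short and dependency-free; the point (perfect in-sample separation, chance-level performance on fresh noise) does not depend on that choice. Nothing here uses the post's actual activations.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_perceptron(X, y, max_epochs=1000):
    """Classic perceptron (labels in {-1, +1}); stops once the training set is separated."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, y):
            if t * (dot(w, x) + b) <= 0:
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def accuracy(w, b, X, y):
    return sum(1 for x, t in zip(X, y) if t * (dot(w, x) + b) > 0) / len(X)


def noise(n, d, rng):
    """Gaussian noise features with labels that carry no signal at all."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [rng.choice([-1, 1]) for _ in range(n)]
    return X, y


rng = random.Random(0)
X_train, y_train = noise(40, 768, rng)
X_test, y_test = noise(200, 768, rng)

w, b = train_perceptron(X_train, y_train)
train_acc = accuracy(w, b, X_train, y_train)
test_acc = accuracy(w, b, X_test, y_test)
print(f"in-sample accuracy: {train_acc:.2f}")
print(f"fresh-noise accuracy: {test_acc:.2f}")
```

An in-sample score from a probe in this regime says nothing about whether a feature exists, which is why only held-out numbers are informative.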
The 5-fold CV run on 126 samples killed the signal completely. On held-out folds the AC probes never reach chance, and they score lower as the representations become richer and more structured. That makes sense if the model's late-layer representations are increasingly organized around task-relevant distinctions that have nothing to do with relevance-vs-coherence.
The lesson is not subtle: always cross-validate, especially when the result is interesting.
Two features, two separate dimensions
The geometric picture is clean. The model's residual stream has:
A strong, persistent linear direction that encodes "this is real text vs this is noise." That's the coherence feature from Q2. It's present at every layer, stable under cross-validation, causally upstream of output quality.
No detectable linear direction encoding "this real text is the correct answer to this query vs an incorrect answer." The AC probe's below-chance performance, and the strict orthogonality of AB and AC directions, mean coherence and relevance occupy completely separate regions of representation space — with relevance not present as a linear feature at all.
The model is doing something. It outputs different logit diffs for condition A vs condition C at generation time (Q1 showed the logit is sensitive to retrieved content). But whatever computation produces that sensitivity is not stored as a readable linear feature in any single layer's residual stream at the last position of the context.
Coherence is a noise filter. The model can detect garbage and route around it. But it cannot, at the level of linear features in the residual stream, evaluate whether the non-garbage context it received was the right non-garbage for the specific problem.
Replication at larger scales
I ran the same two probes on pythia-1.4b (24L, 2048d) on a 40-sample subset (in-sample, intended as directional only).
AB coherence: saturates at AUC=1.00 by L2, consistent with 160m. The coherence feature is not a small-model artefact.
AC relevance: AUC ranges 0.45–0.62 across layers, with the peak at mid-layers. In-sample, so this could be overfitting again; I didn't run CV at this scale. The directional suggestion is that a larger model might have a weak relevance feature, or that the in-sample result is repeating the 160m mistake at the same N=40 with a wider residual stream. I wouldn't trust this number without CV and at least N=100.
If a relevance feature does emerge in larger models, the interesting question is whether it's genuinely learned from training on more diverse retrieval scenarios, or whether it's just a stronger coherence feature that incidentally correlates with relevance in small samples.
Caveats
The negative result is position-specific and probe-specific. I'm probing the last-position residual stream, at every layer, with a linear probe. Relevance information could be encoded at other positions (e.g., at the problem-description tokens rather than the end-of-context position), or distributed across positions in a way that's not recoverable from any single position. The probe is asking: "does the final-position residual stream contain a linear relevance feature?" It does not. That's narrower than "the model doesn't encode relevance at all."
Non-linear relevance is entirely possible. A 2-layer MLP probe, or a probe on attention patterns rather than residual-stream activations, might find relevance structure. Linear probing is a specific and limited tool. The below-chance results here rule out linear readout from the residual stream; they say nothing about what the model's attention heads might be doing differently for relevant vs irrelevant context.
Pythia-160m was not trained on RAG tasks. It's not obvious it should have a relevance feature. The coherence feature is plausibly acquired from ordinary language modeling, which implicitly requires telling text from noise. Relevance requires understanding that a retrieved passage answers a query, which is a downstream task it wasn't trained on.
N=126 is moderate. The cross-validation here is the critical move that killed the overfitting artefact from N=40. But 126 samples at 5-fold gives roughly 100 training / 25–26 test per fold, which is enough to rule out strong positive results but not enough to precisely characterize weak ones.
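The per-fold sizes follow from simple arithmetic, sketched here under the assumption that the 126 samples are split as evenly as possible across the 5 folds (the `fold_sizes` helper is illustrative, not from the repository):

```python
def fold_sizes(n_samples, n_folds):
    """Test-fold sizes when n_samples are split as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


sizes = fold_sizes(126, 5)
for i, test_n in enumerate(sizes):
    print(f"fold {i}: train={126 - test_n} test={test_n}")
```

With only about 25 held-out samples per fold, each fold's AUC is estimated from a small number of positive-negative pairs, so small differences between layers should be read cautiously.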
Reproducibility
    git clone https://github.com/dmiruke-ai/pattern-eval-system
    cd pattern-eval-system
    source .venv_gpu/bin/activate  # or .venv/bin/activate

    # Authoritative CV run (126 samples, 5-fold)
    python experiments/q5_coherence_relevance_geometry.py --n-samples 126 --cv-folds 5

    # Fast sanity check
    python experiments/q5_coherence_relevance_geometry.py --n-samples 20 --cv-folds 2
Output: output_q5_*.json. Figures: q5_figures/ and q5_figures_126cv/. Deterministic at fixed random seed. The AB/AC probe code is in interpretability/linear_probe.py and is reusable across conditions and models.
What I would do next
Non-linear probes. If relevance is not linearly readable, a shallow MLP probe (2 layers, ReLU) is the natural next step. If a non-linear probe finds relevance with good CV performance, that tells you the information is present but packed — the model encodes it, just not in a format a linear readout can access. If the MLP probe also fails, the information is probably not in the residual stream at this position at all.
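A toy case shows what "present but not linearly readable" looks like. In the XOR-style data below, every linear score gets an AUC of exactly 0.5, while a two-layer ReLU network separates the classes perfectly. The network's weights are set by hand for illustration, not trained, and the data are a textbook construction unrelated to the model's activations.

```python
from itertools import product


def auc(scores, labels):
    """P(random positive outscores random negative); ties count as 0.5."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# XOR-style toy data: the label is 1 when exactly one coordinate is on.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]


def relu(z):
    return max(0.0, z)


def mlp_score(x):
    """Hand-set 2-layer ReLU network: relu(x1 + x2) - 2 * relu(x1 + x2 - 1)."""
    s = x[0] + x[1]
    return relu(s) - 2.0 * relu(s - 1.0)


# Sweep linear scores w1*x1 + w2*x2 over a grid of weights.
grid = [i / 4 for i in range(-8, 9)]
linear_aucs = [
    auc([w1 * a + w2 * b for a, b in X], y) for w1, w2 in product(grid, grid)
]
best_linear = max(linear_aucs)
mlp_auc = auc([mlp_score(x) for x in X], y)

print("best linear AUC:", best_linear)
print("2-layer ReLU AUC:", mlp_auc)
```

A trained MLP probe would need the same cross-validation discipline as the linear one, since it has even more capacity to fit noise.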
Attention pattern analysis. The model might route differently when the retrieved context is relevant vs irrelevant, even if the residual stream activations don't reflect it. Looking at attention entropy across heads at the problem-description tokens, or computing attention from query tokens back to context tokens, could reveal whether relevance influences information routing even when it doesn't leave a signature in the final-position residual stream.
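The two quantities mentioned here are straightforward to compute once one row of attention weights is available. The sketch below assumes a single head's attention from one query position is given as a plain list of non-negative weights; how those weights are extracted from the model is left out, and both helper names are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row, after normalizing it."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def context_mass(weights, context_positions):
    """Fraction of attention that lands on the retrieved-context positions."""
    total = sum(weights)
    return sum(weights[i] for i in context_positions) / total


# Toy rows: one diffuse head and one sharply focused head.
diffuse = [0.25, 0.25, 0.25, 0.25]
focused = [1.0, 0.0, 0.0, 0.0]

print("diffuse entropy:", attention_entropy(diffuse))
print("focused entropy:", attention_entropy(focused))
print("mass on positions 0-1:", context_mass([0.1, 0.2, 0.3, 0.4], [0, 1]))
```

Comparing these per-head statistics between conditions A and C would be one way to look for routing differences that a final-position residual probe cannot see.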
Scale the experiment to 7B+ models. Pythia-160m is the minimal model; it never saw retrieval-augmented tasks in training. Larger models trained on more diverse data — or instruction-tuned models that have seen explicit RAG prompts — might have learned a relevance feature because the training signal demanded it. Running the same AB/AC probe design on Mistral-7B or Llama-3-8b with proper CV would test whether the negative result is a small-model finding or a fundamental property of the representation geometry.
Code: github.com/dmiruke-ai/pattern-eval-system. Output: output_q5_*.json. Figures: q5_figures/. Fifth post in series.
Tags: Interpretability, Mechanistic Interpretability, Alignment, AI Safety