Hi LessWrong, I'm a 17yo independent researcher (EV grantee). I spent the last month probing Llama-3.1 on TPUs and found a structural mismatch between TruthfulQA and actual factual recall. I isolated a kinematic transition vector at Layer 15 that predicts factual correctness with 0.95 AUROC, but RLHF trains the output layer to mask this with flat confidence. Crossposted from my Substack.
Before we dive into the mechanics of how language models process facts, let’s look at a critical disconnect in how they present them. Truth probes trained on TruthfulQA actively misclassify factual questions. When you extract the direction for truth from standard factual datasets and compare it to the truth direction derived from TruthfulQA, you find a cosine similarity of -0.12, and across 1,000 bootstrap resamples that cosine similarity was negative 100% of the time. Independent benchmark work has since converged on a similar critique: TruthfulQA tests misconception avoidance, not parametric factual knowledge. This post explains what I found instead.
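For readers who want the comparison spelled out, here is a minimal sketch of the cosine similarity between two probe directions. The two vectors are toy 2-D stand-ins with made-up values, not the actual 4096-dimensional probe weights.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); a negative value means the directions disagree."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy directions standing in for the two probe weight vectors.
factual_direction = [1.0, 0.0]
truthfulqa_direction = [-1.0, 1.0]

similarity = cosine_similarity(factual_direction, truthfulqa_direction)
print(f"cosine similarity: {similarity:.3f}")
```

The bootstrap check would repeat this after refitting both probes on resampled data and count how often the result is negative.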
The Discovery: Vdyn and the Sign-Flip
Instead of looking at static snapshots of the model’s representation, I wanted to capture the cognitive transition. I don’t look at where the model is. I look at where it went. This is the kinematic framing. We define the transition vector as:
$$V_{\mathrm{dyn}} = H_{\text{first token}} - H_{\text{last prompt token}}$$
In plain language: we take the hidden state of the model during the generation of its first entity token and subtract the hidden state of the last token of the prompt, right before it commits to answering. By subtracting the prompt-processing component, we isolate the pure, generation-specific signal of factual commitment.
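As a minimal sketch of that subtraction, assuming the two Layer 15 hidden states have already been extracted as flat lists of floats (the function name and the toy 4-dimensional values are illustrative, not taken from the replication code):

```python
def transition_vector(h_first_token, h_last_prompt_token):
    """V_dyn = H_first_token - H_last_prompt_token, computed element-wise."""
    if len(h_first_token) != len(h_last_prompt_token):
        raise ValueError("hidden states must have the same dimension")
    return [a - b for a, b in zip(h_first_token, h_last_prompt_token)]

# Toy hidden states; the real ones would be 4096-dimensional.
h_last_prompt_token = [0.5, -1.0, 2.0, 0.0]
h_first_token = [0.0, -1.0, 4.0, 1.0]

v_dyn = transition_vector(h_first_token, h_last_prompt_token)
print(v_dyn)
```

Both states must come from the same layer and the same forward pass sequence, otherwise the difference mixes unrelated representations.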
When you project this transition vector onto the factual truth direction (extracted via a strongly regularized logistic regression probe), a massive kinematic sign-flip emerges. The transition vector points one way when the model is about to be right, and it flips to point the opposite way when the model is about to be wrong.
91.9% of correct answers project negatively.
96.8% of wrong answers project positively.
This binary switch yields a Cohen’s d of 2.27 and a static probe AUROC of 0.95. Prior static-probing work on Llama2-7B achieved ~84% accuracy on similar tasks; the kinematic sign-flip on Llama 3.1-8B advances this to 0.95 AUROC.
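The projection and the two summary statistics can be sketched as follows. This is an illustration under assumptions: the sign convention (d computed as wrong minus correct, so that a positive d matches wrong answers projecting positively) is my reading of the numbers above, and all values are toy data.

```python
import math
from statistics import mean, variance

def project(vector, direction):
    """Scalar projection of `vector` onto `direction` (normalized to unit length)."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(v * d for v, d in zip(vector, direction)) / norm

def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation: (mean_a - mean_b) / s_pooled."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled)

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

# One toy projection: a transition vector onto a hypothetical truth direction.
example_projection = project([-0.5, 0.0, 2.0, 1.0], [0.0, 0.0, -3.0, 4.0])

# Toy projection values for answers later graded wrong and correct.
wrong_proj = [1.2, 0.8, 1.5, -0.1, 0.9]
correct_proj = [-1.0, -0.7, -1.4, 0.2, -0.9]

d = cohens_d(wrong_proj, correct_proj)
auc = auroc(wrong_proj, correct_proj)  # "wrong" is treated as the positive class
print(f"projection={example_projection:.2f}  d={d:.2f}  AUROC={auc:.2f}")
```

The pairwise AUROC is quadratic in sample count, which is fine for a few thousand samples; a rank-based formula would be the usual choice at larger scale.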
Why Layer 15? And why the first generation token? Layer 15 sits at roughly 47% depth in Llama-3.1-8B (within the L10–L15 range). This is the model’s mid-processing stage, where factual retrieval has occurred but output formatting hasn’t yet dominated the representations. Independent work on larger architectures has identified the ~47% depth zone as the most informative layer for hallucination detection, suggesting this mid-layer property may generalize. Position-wise, the signal peaks at the first generation token because that is the exact moment of factual commitment.
To measure this accurately, strict temporal alignment is mandatory. If the model starts its response with filler words (e.g., “The”, “A”, or “Well”), the cognitive “click” of factual retrieval doesn’t happen at token 1; it might happen at token 4 (e.g., “The capital is Paris”). To ensure the oscilloscope is measuring the exact kinematic moment the fact is loaded from the FFN into the residual stream, I applied an entity-start filter. This removes responses buffered by filler words, so that for all 2,559 retained samples the moment of factual commitment falls on position gen1. This filter is a necessary part of the experimental design for isolating the signal.
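A minimal sketch of what an entity-start filter could look like. It works at the word level with a made-up filler list, whereas the actual filter may operate on tokens and use different rules, so treat the names and the list as assumptions.

```python
# Hypothetical filler list; the real filter's vocabulary is not given in the post.
FILLER_WORDS = {"the", "a", "an", "well", "so", "it", "i"}

def starts_with_entity(response: str) -> bool:
    """True if the first word of the response is not a known filler word."""
    words = response.strip().split()
    if not words:
        return False
    first = words[0].strip(".,!?\"'").lower()
    return bool(first) and first not in FILLER_WORDS

samples = [
    {"response": "Paris"},
    {"response": "The capital is Paris"},
    {"response": "Well, Paris"},
    {"response": "Canberra"},
    {"response": ""},
]

kept = [s["response"] for s in samples if starts_with_entity(s["response"])]
print(kept)
```

A blocklist like this also drops legitimate answers that begin with an article (for example a film title starting with "The"), so it shrinks the sample and is worth reporting alongside the results.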
Why Static Probes and Behavioral Benchmarks Fail
Most existing approaches rely on static snapshots or behavioral benchmarks: techniques like Contrast-Consistent Search (Burns et al., 2023), Inference-Time Intervention (Li et al., 2023), and Representation Engineering (Zou et al., 2023).
TruthfulQA’s framing is fundamentally about misconception avoidance. It asks questions designed to bait the model into repeating human fallacies (e.g., whether cracking your knuckles causes arthritis) rather than simple factual retrieval (e.g., what is the capital of France). The “truth” signal extracted here is a behavioral directive tied to RLHF: suppress popular false beliefs.
I confirmed this by isolating the core hallucination vector (vhalluc) from TruthfulQA. When you inject this vector back into the model during full generation, it results in exactly zero factual flips. Instead, it acts as a behavioral router. It changes the model’s framing and style—for instance, shifting an answer from a blunt “The invention...” to a hedged “While Thomas Edison...”—but it does not alter the underlying factual content.
If misconception avoidance and factual recall engage different circuits, a probe trained on one will fail on the other. It does.
Robustness: How I Know This Is Real
When you find a signal this strong, the immediate scientific assumption is that you messed up. You leaked data, found a trivial confounder, or overfit to a single dataset. Out of 11 rigorous robustness tests designed to destroy this finding, here are the ones that matter:
Extreme L2-Regularization against Overfitting: Training a logistic regression with 4096 features on a dataset of roughly 2,500 samples naturally invites the critique of overfitting. To ensure the 0.95 AUROC isn't just modeling noise, I applied extreme L2-regularization (C=0.001) to aggressively penalize spurious correlations.
Random Direction Control (z = 56.8σ): I generated 1,000 random unit vectors in 4096-dimensional space. The random distribution had a mean of 0.00. The real d is 2.27. This is not a random artifact.
Permutation Test (empirical p < 0.001): 1,000 random label shuffles yielded a mean d of 0.85. Zero permutations exceeded the actual 2.27, and 1,000 shuffles can bound the empirical p-value no tighter than about 1/1,000.
Leave-One-Domain-Out Transfer: Training the probe on three domains and testing on the completely held-out fourth domain yields an AUROC > 0.89 in every scenario.
Cross-Domain OOD Transfer: Testing the generalized probe on MedQA (medical questions entirely absent from training) yielded a d of +2.37. The truth direction transfers zero-shot to highly specialized knowledge retrieval.
(The remaining seven tests, including per-domain d, bootstrap CI, and 15-token concordance, are detailed in the Appendix.)
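To make the permutation test concrete, here is a simplified sketch. It shuffles labels over fixed projection scores instead of refitting the probe on each shuffle, so it is not the exact procedure used above, and the data is synthetic. The empirical p-value uses the standard (exceedances + 1) / (permutations + 1) estimator.

```python
import math
import random
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled)

def permutation_p_value(scores, is_wrong, n_perm=1000, seed=0):
    """Return (observed d, empirical one-sided p-value) under label shuffling."""
    rng = random.Random(seed)

    def d_for(labels):
        wrong = [s for s, w in zip(scores, labels) if w]
        correct = [s for s, w in zip(scores, labels) if not w]
        return cohens_d(wrong, correct)

    observed = d_for(is_wrong)
    shuffled = list(is_wrong)
    n_exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps class sizes fixed, breaks the score-label link
        if d_for(shuffled) >= observed:
            n_exceed += 1
    return observed, (n_exceed + 1) / (n_perm + 1)

# Synthetic, fully separated projections: 20 wrong answers, 20 correct answers.
scores = [0.5 + 0.1 * i for i in range(20)] + [-0.5 - 0.1 * i for i in range(20)]
is_wrong = [True] * 20 + [False] * 20

observed_d, p_value = permutation_p_value(scores, is_wrong)
print(f"observed d={observed_d:.2f}  empirical p={p_value:.4f}")
```

With this estimator, 1,000 shuffles cannot return a p-value below 1/1001, so a tighter bound needs more permutations.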
To address the obvious critique, that three tokens isn’t enough to determine correctness, I ran a strict concordance validation. I regenerated 1,800 prompts out to 15 tokens, guaranteeing a 100% prefix match with the original 3 tokens (since greedy decoding is deterministic). Then, I had Gemini 3 Flash strictly re-grade the full 15-token outputs.
On the 1,675 samples where the 3-token substring grading and the 15-token strict LLM grading agreed, Cohen’s d was +2.20, almost identical to the 2.27 on the full dataset. This supports the reading that the probe detects parametric knowledge commitment, not merely expression speed: when the evaluation is unambiguous, the sign-flip holds at essentially full strength.
The Two Systems Finding
When you run factual questions through Llama 3.1-8B and analyze them, you quickly discover that the model operates on two completely different, isolated systems of truth.
System 1: Internal Knowledge. The model’s hidden states contain a near-perfect predictor of factual correctness. Deep down in its vector space, the model actually “knows” when it is guessing.
System 2: Output Confidence. This is what you and I actually see when we use an API or a chat interface. It is flat, practically useless, and, shockingly, slightly negatively correlated with being right (r = -0.21).
These two systems are entirely divorced from one another. Output confidence is flat. Actual accuracy swings from 89.3% to 2.3%. As a user, you are flying completely blind. The output probability gives you almost no actionable information about correctness.
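The confidence-versus-correctness check comes down to a Pearson correlation between output confidence and a 0/1 correctness label. A minimal sketch with made-up numbers (the real analysis would use the model's output probabilities and graded answers):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Toy data: confidence barely moves, and is slightly higher on wrong answers.
confidence = [0.90, 0.92, 0.91, 0.93, 0.89, 0.94]
correct = [1, 0, 1, 0, 1, 0]

r = pearson_r(confidence, correct)
print(f"r = {r:.2f}")
```

A negative r here means higher stated confidence goes with lower accuracy, which is the direction of the r = -0.21 reported above.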
Why does this happen? The most plausible class of explanations lies in post-training alignment, where human raters systematically reward confident, helpful outputs over epistemically transparent ones. Hedging or expressing uncertainty is implicitly penalized. Consequently, when the model is internally uncertain, the output layer overcompensates to maintain the appearance of authority.
This creates a fascinating edge case you could call the confidence-knowledge divergence. In these epistemic misalignment cases, the internal hidden state clearly broadcasts, “I do not know this fact,” yet the output layer proceeds to generate a confidently wrong answer. The internal error detector fires; the output layer simply ignores it. In the most pronounced divergence cases, the gap between internal uncertainty and external confidence is massive.
What Failed — And Why It Matters
Most people hide their dead ends, but this just makes the field look cleaner than it is and leads subsequent researchers down the exact same doomed paths. Here are the failures that cost me weeks:
Phase 3: Prefill-Only Hidden States: I tried extracting features from the full layer stack during prefill (prompt processing only, no generation). It yielded an AUROC of 0.462 (worse than chance). The model simply hasn’t committed to an answer yet during prefill. The epistemic signal is inherently kinematic; it only emerges during generation.
Phase 3.1: Ghost Perturbations & Domain Confounds: I perturbed hidden states to find hallucination boundaries and got an exciting AUROC of 0.663. But the probe wasn’t learning truth; it was learning dataset identity. It acted as a flawless domain classifier, but when evaluated within a single domain, performance dropped below chance. Global metrics mask catastrophic per-domain failures.
Phase 8: Causal Steering Fails: I tried injecting the truth vector back into the model’s activations at mid-layers to force factually correct answers. Across multiple injection methods, I achieved exactly zero factual False-to-True flips. Conversely, injecting the inverse hallucination vector also failed to flip correct answers to incorrect, reinforcing that this direction is correlational and read-only in both directions. My interpretation is that factual knowledge is distributed across the MLP weights, so a single activation-space nudge cannot seamlessly override parametric memory.
Phase 12: 4-Bit Precision Mismatch: Attempting to extract these precise kinematic vectors under 4-bit quantization resulted in signal collapse, completely wiping out the predictive edge.
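The domain-confound failure in Phase 3.1 is easy to reproduce on toy data. In this sketch the scores mostly encode which dataset a sample came from, so the global AUROC looks above chance while every per-domain AUROC is below it. The domain names and numbers are invented for illustration.

```python
from collections import defaultdict

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def auroc_by_domain(records):
    """records: (domain, score, correct) tuples. Returns {domain: AUROC or None}."""
    grouped = defaultdict(list)
    for domain, score, correct in records:
        grouped[domain].append((score, correct))
    result = {}
    for domain, rows in grouped.items():
        pos = [s for s, c in rows if c]
        neg = [s for s, c in rows if not c]
        result[domain] = auroc(pos, neg) if pos and neg else None
    return result

# Domain "easy" is mostly correct with high scores, "hard" mostly wrong with low scores.
records = [
    ("easy", 0.90, False), ("easy", 0.80, True), ("easy", 0.85, True), ("easy", 0.82, True),
    ("hard", 0.10, True), ("hard", 0.20, False), ("hard", 0.15, False), ("hard", 0.18, False),
]

global_auc = auroc([s for _, s, c in records if c], [s for _, s, c in records if not c])
per_domain = auroc_by_domain(records)
print(f"global={global_auc:.4f}  per-domain={per_domain}")
```

Reporting the per-domain breakdown next to the global number is what exposes a probe that has only learned dataset identity.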
Limitations
Good science requires precision about its boundaries.
Model Constraint: This kinematic signature was extensively validated on Llama-3.1-8B-Instruct. I do not know if this exists in GPT-4, Claude, or Gemini.
Domain Constraint: This applies strictly to factual, free-form Q&A. It has not been tested on creative writing, complex mathematical reasoning, or code generation.
Correlational, Not Causal: As my failures demonstrated, detecting the truth is not the same as controlling it. We can read the signal with 0.95 AUROC, but we cannot currently inject it to fix a hallucination mid-flight.
Grading Window: Correctness is graded on the first 3 generated tokens. The 15-token concordance validation mitigates this, but the short grading window remains an inherent constraint.
Implications for Builders
If you are building consumer AI products, AI agents, or relying on LLMs for critical infrastructure, this structural research translates into three practical realities:
Calibration is broken: Stop trusting output probabilities for factual Q&A. They are anti-calibrated. If you rely on token logprobs to determine if the model is hallucinating, you are building on sand.
TruthfulQA probes measure the wrong thing: If your safety team trained a hallucination probe on the standard TruthfulQA benchmark, it will actively misclassify real factual errors.
Usable hallucination detectors already exist inside the model: You do not necessarily need a massive external verification system. A probe extracting Vdyn at Layer 15 can flag 24% of your outputs and catch 45% of all factual errors with 83% precision at inference time.
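The three numbers in the last point (share of outputs flagged, share of errors caught, precision) all come from one threshold on the probe score. A minimal sketch, assuming a higher score means "more likely wrong"; the scores, labels, and threshold are toy values.

```python
def flag_metrics(scores, is_error, threshold):
    """Flag outputs with score >= threshold and summarize the trade-off."""
    flagged = [s >= threshold for s in scores]
    n_flagged = sum(flagged)
    n_errors = sum(is_error)
    true_positives = sum(f and e for f, e in zip(flagged, is_error))
    return {
        "flag_rate": n_flagged / len(scores),                         # share of outputs flagged
        "recall": true_positives / n_errors if n_errors else 0.0,     # share of errors caught
        "precision": true_positives / n_flagged if n_flagged else 0.0,
    }

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]
is_error = [True, True, False, True, False, False, False, True]

metrics = flag_metrics(scores, is_error, threshold=0.65)
print(metrics)
```

Sweeping the threshold over held-out data shows the full precision and recall trade-off, so the operating point can be chosen per product rather than fixed in advance.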
The factual error signal already exists inside the network at inference time. The problem is architectural, not intellectual; we are reading from the wrong layer and trusting the wrong confidence metric.
Replication code and data indices are available here.
(I have also written a more practically oriented post about this topic; find it here.)
Appendix: The Complete Robustness Suite
In addition to the four primary tests detailed in the main text, the Vdyn kinematic sign-flip was validated against seven additional constraints:
Per-Domain Generalization: The signal remains massive within isolated domains, proving it is not a dataset-specific artifact: WebQ (d = +2.40), SciQ (d = +2.17), TriviaQA (d = +2.07), and NQ (d = +1.89).
Bootstrap Confidence Intervals: 1,000 bootstrap resamples yielded a 95% CI for Cohen’s d of [2.24, 2.59] and an AUROC CI of [0.93, 0.97]. The lower bound confirms a massive effect size.
5-Fold Cross Validation: Standard stratified 5-fold CV yielded a test-set AUROC of 0.9515, confirming no overfitting.
The “Knowing Liar” Subset: On the specific subset of prompts where the probe was highly confident the model “knew” the answer but the output was wrong, the separation actually increased to d = 3.23.
Layer Specificity: Tested across layers 10, 12, 14, 15, 20, and 30. Layer 15 was empirically optimal, aligning with the mid-network factual retrieval phase before output formatting dominates.
Position Specificity: Tested across pre-commitment (trigger) and post-commitment (gen1, gen2, gen3). The signal definitively peaks at the first generation token.
15-Token Concordance Validation: 1,800 prompts regenerated to 15 tokens and strictly re-graded by Gemini 3 Flash. On the 1,675 concordant samples, d = +2.20, proving the 3-token substring evaluation is a highly reliable lower bound for parametric knowledge.
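For the bootstrap confidence intervals, a percentile bootstrap over the projection scores is one common way to do it. This sketch resamples each class separately on synthetic data; whether the original analysis stratified the resampling or refit the probe per resample is not stated, so both choices here are assumptions.

```python
import math
import random
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled)

def bootstrap_ci(wrong, correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Cohen's d, resampling each class with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        w = rng.choices(wrong, k=len(wrong))
        c = rng.choices(correct, k=len(correct))
        stats.append(cohens_d(w, c))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Synthetic projections: wrong answers positive, correct answers negative.
wrong_proj = [0.5 + 0.1 * i for i in range(20)]
correct_proj = [-0.5 - 0.1 * i for i in range(20)]

ci_low, ci_high = bootstrap_ci(wrong_proj, correct_proj)
print(f"95% CI for d: [{ci_low:.2f}, {ci_high:.2f}]")
```

The same loop yields an AUROC interval by swapping the statistic computed on each resample.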
I am an E.V. grantee and TRC program researcher. This work was conducted on Google TPUs. Special thanks to Tyler Cowen and the TRC Team for making this possible.