Hi LessWrong, I'm a 17yo independent researcher and this is my first real research in this field. I isolated a kinematic transition vector at Layer 15 that predicts factual correctness with 0.95 AUROC. Would love some good feedback! Crossposted from my substack.
I wondered: Do LLMs even have any representation of truth internally, or are they just saying what sounds best?
Instead of taking a frozen picture of what the system knows (which is what many standard probes do), I decided to observe its rhythm as it actually arrives at an answer.
Discovery:
I tracked the transition in the hidden state right as the model generates a specific entity. This let me capture the moment the model moves toward a fact and finally “locks in.” I isolated a vector that flips based on accuracy: when the model is factually right, the vector points one way; when it’s wrong, it points in exactly the opposite direction.
The numbers looked fake. Errors piled up at −91.9%. Facts at +96.8%. Logistic regression scored 0.95 AUROC. Cohen’s d = 2.27. My first thought was that I’d leaked the labels somehow.
Method
LLMs are messy. If you ask a model for the capital of France, it could just generate “Paris.” Or maybe it generates something like: “Well, the capital of France is Paris,” or even “Great question! The capital of France is Paris.”
The moment where the model commits to a fact, when the fact is loaded from the FFN into the residual stream, is usually not the first generation token; it might be any token.
My first snapshots were worthless because I kept measuring at ‘The’ and ‘Well.’ I needed the exact tick where “Paris” entered the stream, so I wrote a filter that threw out the filler words. After that, all 2,559 samples were locked to gen1, the first non-filler generated token.
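A minimal sketch of what such a filter can look like (the filler list and the string-level checks are illustrative, not my exact implementation; in practice you also want to skip tokens that merely echo the prompt):

```python
# Hypothetical sketch: find the first generated token that isn't filler or punctuation.
FILLERS = {"the", "a", "an", "well", "sure", "great", "question", "is", "of"}

def first_entity_position(tokenizer, generated_ids):
    """Index of the first generated token that looks like a content/entity token."""
    for i, tok_id in enumerate(generated_ids):
        piece = tokenizer.decode([tok_id]).strip().lower()
        if piece and piece.isalpha() and piece not in FILLERS:
            return i
    return 0  # fall back to the very first generated token
```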
Once I knew *when* to measure, I needed a simple way to measure *how* the model’s internal state changed.
I took a snapshot of the model’s hidden state at the exact moment it generated that first entity token, and subtracted the hidden state at the final token of the prompt. As an equation, it could look like this:
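\[
\Delta h^{(\ell)} \;=\; h^{(\ell)}_{\text{first entity token}} \;-\; h^{(\ell)}_{\text{last prompt token}}
\]

(where \( h^{(\ell)}_t \) is the hidden state of token \( t \) at layer \( \ell \); the layer \( \ell \) is picked in the sweep below)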
I stripped away all the noise of the model reading the question, parsing syntax, etc. What was finally left was the specific signal of commitment to a fact.
Llama 3.1-8B has 32 layers. Independent previous research showed that early layers are already busy parsing grammar and reading the prompt, while late layers are dominated by output formatting. So I just ran a sweep across the entire middle stack (L13-23) to find the sweet spot between reading and answering.
The signal peaked at Layer 15, which sits at about 47% depth in the network. Independent work on larger architectures has also pointed to this ~47% depth zone as the best range for hallucination detection, so Layer 15 checks out pretty well.
To be more precise about the measurements, I used the raw residual stream at the exit of Layer 15, right before the RMSNorm is applied. From there, I mapped out the “truth direction”: by training a logistic regression on the average hidden states of the first three generated tokens, I isolated the mathematical axis that splits facts from hallucinations.
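A hedged sketch of that extraction and probe step with Hugging Face transformers and scikit-learn (batching, label construction, and the early-stopping details are simplified; the per-layer hidden states that `generate` returns correspond to the residual stream at each block’s exit):

```python
# Sketch: Layer-15 transition features + logistic-regression probe (simplified).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
LAYER = 15  # residual stream at the exit of block 15

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def transition_feature(question, n_gen_tokens=3):
    """Mean Layer-15 state over the first generated tokens minus the last-prompt-token state."""
    msgs = [{"role": "user", "content": question}]
    ids = tok.apply_chat_template(msgs, add_generation_prompt=True,
                                  return_tensors="pt").to(model.device)
    out = model.generate(ids, max_new_tokens=n_gen_tokens + 1, do_sample=False,
                         return_dict_in_generate=True, output_hidden_states=True)
    h_prompt = out.hidden_states[0][LAYER][0, -1]              # last prompt token
    h_gen = torch.stack([out.hidden_states[s][LAYER][0, -1]    # first generated tokens
                         for s in range(1, n_gen_tokens + 1)]).mean(0)
    return (h_gen - h_prompt).float().cpu().numpy()

# X: stacked transition features (n_samples, 4096); y: 1 = correct, 0 = hallucinated
# probe = LogisticRegression(C=0.001, max_iter=2000).fit(X_train, y_train)  # strong L2, see Robustness
# print(roc_auc_score(y_test, probe.decision_function(X_test)))
```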
Of course, I used the standard Llama 3.1 8B Instruct chat template.
TruthfulQA
I started comparing my factual direction to the standard benchmark everyone uses for truth probing, so I trained a truth direction on TruthfulQA using the exact same method as before. This means the same layer, same probe, same model. Then I measured how well the two directions align, and they just don’t. The cosine similarity is −0.12. I ran 1,000 bootstrap resamples to make sure, and the alignment was negative every single time.
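Measuring the alignment is just a cosine between the two probe weight vectors, with the bootstrap re-fitting both probes on resampled data. A sketch under that assumption (function and variable names are illustrative):

```python
# Sketch: cosine similarity between two probe directions, with bootstrap resampling.
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_direction(X, y):
    """Unit-norm weight vector of a logistic-regression probe."""
    w = LogisticRegression(C=0.001, max_iter=2000).fit(X, y).coef_[0]
    return w / np.linalg.norm(w)

def bootstrap_cosine(X_a, y_a, X_b, y_b, n=1000, seed=0):
    """Cosine between the two directions, re-fit on n bootstrap resamples of each dataset."""
    rng = np.random.default_rng(seed)
    sims = []
    for _ in range(n):
        ia = rng.integers(0, len(y_a), len(y_a))
        ib = rng.integers(0, len(y_b), len(y_b))
        sims.append(probe_direction(X_a[ia], y_a[ia]) @ probe_direction(X_b[ib], y_b[ib]))
    return np.array(sims)  # e.g. check (sims < 0).mean() == 1.0
```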
So I just went back and actually read what TruthfulQA is testing, and saw a clear pattern in there: The questions are designed to bait the model into repeating popular misconceptions, like: “Does cracking your knuckles cause arthritis?” or “What happens if you swallow gum?” The correct answer, in TruthfulQA’s view, is not “say the right thing,” it’s “don’t say the wrong thing that most people believe.”
That’s misconception avoidance. It’s a useful thing to measure, but it’s a different thing from what I needed to measure. My probe asks: Did the model successfully retrieve a correct fact from its weights? TruthfulQA asks: Did the model resist repeating a culturally popular falsehood?
These are different cognitive tasks. When you train a probe on one, it doesn’t predict the other, and that’s not because one of them is wrong. They’re pointing at completely different things inside the same network.
The practical consequence is simple: if you trained a hallucination detector on TruthfulQA (as much of the existing literature has), it won’t catch the kind of errors I’m detecting here. It was built for a different job.
Internal vs External Confidence
When running factual QA questions through Llama 3.1 8B and analysing the output confidence, one thing is immediately visible: the output confidence does not match the model’s actual factual correctness. I took a closer look and found that these are simply two different systems.
Internal “confidence”:
This system internally *knows* whether something was successfully retrieved from the model’s weights or not. This is the vector we discovered earlier. Important to note: it does not know whether something is factually true in the world; it knows whether something is true according to its training data and weights.
External Confidence:
This is what you and I see when using an API. It is practically useless for factual correctness and, interestingly, slightly anti-correlated with being right (r = −0.21).
These two systems are completely split from each other. The output confidence stays flat at a mean of ~0.8, while the actual accuracy varies from 90% to an embarrassing 2%. So the output carries no usable information for truth detection.
This creates an edge case one could call confidence-knowledge divergence: the internal signal suggests failed retrieval, yet the output layer proceeds to give a confident-sounding but wrong answer to satisfy the user.
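For concreteness, here is one way to compute that external confidence number; I am assuming “output confidence” means the mean probability the model assigns to its own answer tokens, which is one reasonable reading of the metric above (the aggregation choice is an assumption):

```python
# Sketch: "external confidence" as the mean probability of the generated answer tokens,
# correlated against a 0/1 correctness label.
import numpy as np
import torch

def output_confidence(model, input_ids, generated_ids):
    """Mean per-token probability the model assigns to its own generated tokens."""
    full = torch.cat([input_ids, generated_ids], dim=-1)
    with torch.no_grad():
        logits = model(full).logits[0]
    probs = torch.softmax(logits.float(), dim=-1)
    start = input_ids.shape[-1]
    # logits at position t predict token t+1, so generated token i is scored at start + i - 1
    token_probs = [probs[start + i - 1, tok].item() for i, tok in enumerate(generated_ids[0])]
    return float(np.mean(token_probs))

# r = np.corrcoef(confidences, correct_labels)[0, 1]   # reported above: ~ -0.21
```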
Failed Attempts
I spent a lot of time on this research and, of course, hit many dead ends along the way. I’ll share them here because they might help someone doing similar work, and it costs me nothing to do so.
Prefill-only hidden states:
Earlier, I tried extracting the features from the full layer stack during prefill, so only prompt processing and no generation. This gave a dismal AUROC of 0.462, actually slightly worse than chance. My hypothesis is that the model has not yet committed to an answer at this stage, so there is no fact to extract.
Domain Specificity:
I tried nudging hidden states to find the exact boundary where true answers flip into hallucinations. The AUROC of 0.663 looked bad at first, and it looked even worse once I realised the probe wasn’t learning to spot lies at all: it had just learned to tell which dataset a question came from. When I tested it within a single domain instead of mixing everything together, it collapsed completely and scored below chance.
Steering:
I tried injecting my vector and the TruthfulQA vector into the model’s activations at the reasoning layers to force the model to correct its wrong answers. While this changed the syntax of the generation (e.g., for TruthfulQA: “While many people think that, ...”), it caused zero factual flips. This could mean the signal is “read-only” in both cases, with the factual knowledge itself distributed across the MLP weights.
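For reference, the injection mechanism was an additive edit to the residual stream at generation time; a minimal sketch of that kind of hook (the layer index, scale, and adding at every position are illustrative, not my exact setup):

```python
# Sketch: add a scaled direction to a decoder layer's residual-stream output via a forward hook.
import torch

def make_steering_hook(direction, scale=8.0):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + scale * d.to(h.dtype).to(h.device)   # add the direction at every position
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# handle = model.model.layers[15].register_forward_hook(make_steering_hook(truth_direction))
# ... run model.generate(...) with the hook active ...
# handle.remove()
```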
Quantization:
Attempting to extract this vector under 4-bit bitsandbytes NF4 quantization messed the signal up completely.
Limitations
Model: In this research, I was only able to test this on Llama-3.1-8B-Instruct. I don’t know if any of this exists in other model families.
Domain: This is only tested on free-form factual QA. I do not know how this performs in other domains, such as writing, coding, or complex reasoning.
Steering: As already mentioned, while we can read the signal, I did not find a way to actively steer the model into being correct.
Robustness
When looking at a 0.95 AUROC and a signal this strong, the immediate thought of a researcher is: I messed up. Maybe I leaked data, found a trivial confounder, or overfit on a single dataset. That was my initial thought, so I actively tried to find my mistake and destroy this finding. I did not succeed. I ran 9 robustness tests; these are the ones that matter most:
L2 regularization: with 4,096 features and only roughly 2,500 samples, overfitting is the most inviting failure mode, so I trained the logistic regression with strong L2 regularization (C = 0.001) to keep spurious correlations from dominating the weights.
Random direction control: I generated 1,000 random unit vectors as a null distribution; projecting onto them gives a mean effect size of 0.00, while the real direction scores d = 2.27, i.e. z = 56.8σ (sketched below).
Leave-one-domain-out transfer: training the probe on three domains and testing on the fourth gives AUROC > 0.89 in every scenario (also sketched below).
Out-of-domain testing: MedQA was entirely absent from training, and the probe still separates correct from incorrect answers at d = +2.37, so the direction transfers zero-shot to highly specialised knowledge retrieval.
(There are six more tests, detailed in the Appendix.)
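Two of these checks are easy to sketch in code, assuming `X` holds the transition features, `y` the correctness labels, `probe_dir` the trained probe direction, and `domains` a per-sample dataset label array (all names illustrative):

```python
# Sketch 1: random-direction control — null distribution of Cohen's d over random unit vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def cohens_d(scores, y):
    a, b = scores[y == 1], scores[y == 0]
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

rng = np.random.default_rng(0)
null_d = []
for _ in range(1000):
    v = rng.normal(size=X.shape[1])
    v /= np.linalg.norm(v)
    null_d.append(cohens_d(X @ v, y))
null_d = np.array(null_d)
z = (cohens_d(X @ probe_dir, y) - null_d.mean()) / null_d.std()   # reported: ~56.8 sigma

# Sketch 2: leave-one-domain-out transfer.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=domains):
    probe = LogisticRegression(C=0.001, max_iter=2000).fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], probe.decision_function(X[test_idx]))
    print(np.unique(domains[test_idx])[0], round(auc, 3))          # reported: > 0.89 everywhere
```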
If you’re running an LLM in production, stop trusting logprobs for hallucination detection. They’re decoration, not a signal. Stop using TruthfulQA probes as fact-checkers; they’re trained to spot popular misconceptions, not retrieval failures. And try reading from Layer 15 instead of the output layer. The model already knows when it’s about to lie. You just need to listen to the right floor.
Replication code and data indices are available here.
(I have written a more practically oriented post about this topic too; find it here.)
I am a 17 y/o E.V. grantee and TRC program researcher. This is my first real research. This work was conducted on Google TPUs. Special thanks to Tyler Cowen and the TRC team for making this possible.
Robustness Appendix
If the primary tests above were not enough for you, I validated the finding against 6 more checks:
Per-domain generalisation: the signal remains large within isolated domains: WebQ (d = +2.40), SciQ (d = +2.17), TriviaQA (d = +2.07), and NQ (d = +1.89).
Bootstrap Confidence Intervals: 1,000 bootstrap resamples yielded a 95% CI for Cohen’s d of [2.24, 2.59] and an AUROC CI of [0.93, 0.97].
5-fold cross-validation: test-set AUROC of 0.9515, confirming no overfitting.
Confident-but-wrong subset: on the specific subset where the probe was highly confident (the model “knew” the answer) but the model was wrong, the separation increased to d = 3.23.
Layer: tested across layers 10, 12, 14, 15, 20 and 30. Layer 15 was optimal, aligning with recent independent research.
Position: Tested on pre-commitment and post-commitment (gen1, gen2, gen3). Signal peaks at gen1.