Is deception linear? Geometry-aware probes didn't beat linear ones, for detection or for control

Rajarshi Ghoshal

Epistemic status: a pilot study, negative-result-heavy by design. One model (Llama-3.1-8B, 4-bit for the control experiments), synthetic data, about 100 scenarios for the headline detection split and 46 for the control audit. I'm confident in the qualitative story: detection is linear, and "control" reduces to detection plus a measured probe. The exact magnitudes are at pilot precision.

Linear probes already detect deception and sycophancy reasonably well [2, 3, 8]. The "linear directions" line of work keeps finding the same thing: the Geometry of Truth [4] and Apollo's linear-probe deception detector (posted here on the Alignment Forum) [2] both conclude that the safety-relevant structure is linear. So I ran a small adversarial test of the natural next question. Does the non-linear geometry of the activation manifold, the curvature and local point-cloud structure a flat probe throws away, add anything those linear methods miss? And not just for detection, but for control: steering a model back to an honest answer.

For the methods I tried, on Llama-3.1-8B, the answer is no. And the shape of the no is the point. Detection is linear. "Control" decomposes into near-perfect linear detection plus a per-case measured probe of which steering action moves the decision, with the geometry adding nothing on top. I think that negative is more useful than a win would have been, because it says where the leverage is and where it isn't. Here's how it falls out.

TL;DR

I tested whether geometry-aware probes and steering (tangent-subspace, point-cloud/kNN, learned-metric) beat plain linear methods at detecting and controlling deception and sycophancy in Llama-3.1-8B.
Detection is linear. A held-out linear probe reads the ground truth near-perfectly, and geometry-aware probes don't beat it.
"Control" reduces to detection plus a per-case measured probe of which steering action moves the model's decision. A learned or geometry-shaped selector adds nothing over just picking the best measured action.
The shape-based geometry I tried (tangent steering, point-cloud selectors) didn't help with control. But I only tried a narrow slice, so this is a map of what I tested, not a verdict on geometry.
One positive surprise: the correction direction is shared across unrelated content families, well above a permutation null.
Safety takeaway: for this task, a linear monitor plus abstention is the effective intervention. Fancier geometry is neither needed nor better.

Why I expected geometry to matter

This builds on my first-author paper on when chain-of-thought helps versus hurts, and how prompt type is encoded in model activations [1]. The finding that stuck with me: prompt type is linearly decodable very early in the network, above 90% accuracy by layers 1 to 4. Yet the same representation can lead to opposite behavior depending on training. Instruction tuning flipped CoT from helpful to harmful on the same base model.

The lesson I took from that: linear decodability of a state doesn't tell you whether you can control the behavior. The representation and the behavior can come apart. So I was suspicious of the standard linear-probe story for deception and sycophancy. Linear probes assume one flat global decision boundary, and I wanted to know whether these safety-relevant states live on curved manifold structure a linear probe misses, and whether that structure gives you a control handle.

What I built

Two content-controlled testbeds, both built so a result can't be an artifact of wording or of an unreliable judge.

Synthetic-pressure sycophancy. 114 paired pressure/neutral scenarios that Llama first answers correctly (filtered by a cold knowledge-check, so I know the model knew the answer), then gets pushed on across several turns. Dual independent LLM-judge labels, pair-grouped cross-validation, clustered-bootstrap CIs.
Graded-pressure PASS/FAIL "deception". A reporting task with a machine-checkable rule, so I have real ground truth. This is the one that lets me test control: can I steer the model back to the truthful report? It comes with a held-out linear gate, steering audits split by error direction, and a per-case response-probe that searches which steering action actually moves the decision. The steering builds on standard activation-steering work [5-7].
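The pair-grouped splitting and clustered bootstrap mentioned above can be sketched as follows. This is a minimal, dependency-free illustration, not the pipeline's code: the function names (`auroc`, `grouped_folds`, `clustered_bootstrap_ci`) and the percentile-interval choice are my assumptions. The point it shows is that both the fold assignment and the resampling operate on whole pairs, so a pressure/neutral pair never straddles train and test and never gets split inside a bootstrap draw.

```python
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined when one class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def grouped_folds(pair_ids, k, seed=0):
    """Assign whole pairs to folds so both members of a pair share a fold."""
    unique = sorted(set(pair_ids))
    random.Random(seed).shuffle(unique)
    fold_of = {pid: i % k for i, pid in enumerate(unique)}
    return [fold_of[pid] for pid in pair_ids]


def clustered_bootstrap_ci(scores, labels, pair_ids, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for AUROC, resampling pairs (clusters) with replacement."""
    rng = random.Random(seed)
    by_pair = {}
    for s, y, pid in zip(scores, labels, pair_ids):
        by_pair.setdefault(pid, []).append((s, y))
    pairs = sorted(by_pair)
    stats = []
    for _ in range(n_boot):
        drawn = rng.choices(pairs, k=len(pairs))  # resample clusters, not rows
        sample = [item for pid in drawn for item in by_pair[pid]]
        value = auroc([item[0] for item in sample], [item[1] for item in sample])
        if value is not None:
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi
```

Resampling rows instead of pairs would treat the two members of a pair as independent and would typically give intervals that are too narrow.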

What I found

Detection is linear. Per-turn sycophancy detection sits around 0.88 AUROC, and on the deception task a held-out linear gate reads the rule-truth almost perfectly. Geometry-aware probes don't beat linear or MLP baselines. This reproduces, on an easier synthetic construct, the kind of linear-probe result Apollo and Anthropic have reported [2, 3]. It doesn't beat them. It also makes sense: the network linearizes its own output target by construction, so late-layer states separate the label close to linearly [4, 8, 9].

Figure: Per-turn sycophancy detection AUROC by probe family. Linear and MLP sit around 0.88; the geometry-aware probes score lower.

"Control" reduces to detection plus a measured probe. Misreports do get corrected at a high rate. But when I take it apart, the mechanism is near-perfect linear detection, plus a per-case measured probe of which steering action moves the decision margin. A learned selector and geometry-shaped selectors all tie a trivial "pick the best measured action" baseline. The detector is doing the work. The geometry adds nothing on top.

Figure: Oracle steering, directional rates by method and strength. The hard false_PASS to FAIL direction stays at 0 fixes, and honest-preservation collapses for linear steering as strength rises.
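To make the trivial "pick the best measured action" baseline concrete, here is a minimal sketch. Everything named here is hypothetical: `measure_margin(case, action)` stands in for whatever runs the model with a steering action applied and returns the decision margin (signed so that larger means closer to the truthful report), and `gated_control` is my guess at how a linear gate and the per-case probe compose, not a description of the actual pipeline.

```python
def best_measured_action(case, actions, measure_margin):
    """Return (action, margin) with the largest measured margin for this case.

    measure_margin is a hypothetical callable; action=None means unsteered.
    If no candidate beats the unsteered margin, the returned action is None.
    """
    best_action, best_margin = None, measure_margin(case, None)
    for action in actions:
        margin = measure_margin(case, action)
        if margin > best_margin:
            best_action, best_margin = action, margin
    return best_action, best_margin


def gated_control(case, gate_score, threshold, actions, measure_margin):
    """Search for a steering action only when the linear gate flags a misreport."""
    if gate_score < threshold:
        return None  # gate reads the case as honest: leave it unsteered
    action, _ = best_measured_action(case, actions, measure_margin)
    return action
```

Note that nothing in this selector looks at the representation's geometry. It only needs one measurement per candidate action, which is why it is a natural floor for any learned or geometry-shaped selector to beat.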

The shape-based methods I tried didn't add anything. Tangent steering loses the hard direction (pushing a confident false report back to true) to a random direction at matched strength, and a simple route-wise hybrid beats the best single steered direction. The honest caveat: I only tried a narrow slice, tangent projection and scalar point-cloud selectors, not the broader manifold program (graph/Laplacian, kernel, hyperbolic, learned-metric on the full point cloud). So this is a report on what I tested, not a claim that geometry is dead.

The one positive. The correction direction is shared across unrelated content families: the cosine between independently fit directions is about 0.65 to 0.81, against a permutation null near zero. That hints at a shared, possibly transferable correction structure, and it's the one geometry-flavored signal I'd chase next.
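The shape of that comparison can be sketched as below. The assumptions are loud ones: I use a difference-of-means direction per family (the post does not depend on this particular fitting method), and the null shuffles labels independently within each family before refitting. The function names are illustrative only.

```python
import math
import random


def mean_vec(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def diff_of_means(acts, labels):
    """Direction from the label-0 mean activation to the label-1 mean activation."""
    m1 = mean_vec([a for a, y in zip(acts, labels) if y == 1])
    m0 = mean_vec([a for a, y in zip(acts, labels) if y == 0])
    return [u - v for u, v in zip(m1, m0)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def permutation_null(acts_a, labels_a, acts_b, labels_b, n_perm=200, seed=0):
    """Cosines between directions refit after shuffling labels within each family."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuf_a = rng.sample(labels_a, len(labels_a))  # shuffled copy, counts preserved
        shuf_b = rng.sample(labels_b, len(labels_b))
        null.append(cosine(diff_of_means(acts_a, shuf_a), diff_of_means(acts_b, shuf_b)))
    return null
```

The observed value is `cosine(diff_of_means(acts_a, labels_a), diff_of_means(acts_b, labels_b))`, and a one-sided p-value is the fraction of null cosines at or above it.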

Why this matters for AI safety

Deception and sycophancy are failures where a model's output doesn't faithfully reflect what it internally represents. If output monitoring and prompting are unreliable, inference-time monitoring of internal activations becomes the fallback. So it matters a lot which internal method actually works.

The finding here cuts against a common assumption. On this task, a linear monitor plus abstention is the effective intervention. Geometric control is neither needed nor better. That's worth saying plainly, because the implicit assumption in a lot of this work is that richer geometry buys you safety leverage. At least here, it doesn't.
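For concreteness, "linear monitor plus abstention" can be as small as the following sketch. The probe weights, bias, threshold, and the `ABSTAIN` sentinel are all placeholders I made up for illustration; in practice the weights would come from a probe fit on held-out-validated activations and the threshold from a chosen false-positive budget.

```python
import math

ABSTAIN = "[abstain]"  # hypothetical sentinel returned instead of the model's answer


def monitor_score(activation, weights, bias):
    """Linear probe read as P(misreport): a numerically stable logistic of w.x + b."""
    z = sum(w * x for w, x in zip(weights, activation)) + bias
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def answer_or_abstain(activation, answer, weights, bias, threshold=0.5):
    """Pass the answer through unless the monitor flags it, in which case abstain."""
    if monitor_score(activation, weights, bias) >= threshold:
        return ABSTAIN
    return answer
```

The design choice worth noticing is that the intervention never touches the activations. It only reads them, so it cannot damage honest cases the way strong steering can.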

This connects to a recent LW catalog of deception-probe applications by Nardo, Parrack, and Jordinne [10], which separates detection uses from control uses and is cautious that control needs more than a statistical probe. My pilot is a concrete case on that question: on a checkable task, control reduces to detection plus a per-case measured probe, and the detector does the work.

It also sharpens the point that linear decodability isn't control. The interesting object isn't the shape of the activation cloud, where a point sits on the manifold. It's the causal response field: how the decision moves when you perturb the activations. On this task I can only measure that field per case, I can't predict it from the representation. Whether it can be predicted is, to me, the real open question.
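One way to pin down what "measuring the field per case" means is a finite-difference sketch like the one below. It is an illustration under assumptions, not the audit code: `decision_margin` is a hypothetical callable that maps an activation vector to the model's decision margin, and steering is modeled as plain addition of a scaled direction.

```python
def response_along(decision_margin, activation, direction, strength=1.0):
    """Finite-difference response: margin change per unit of steering along `direction`."""
    steered = [h + strength * d for h, d in zip(activation, direction)]
    return (decision_margin(steered) - decision_margin(activation)) / strength


def response_field(decision_margin, activation, directions, strength=1.0):
    """Measured response for each named candidate direction at this one activation."""
    return {
        name: response_along(decision_margin, activation, d, strength)
        for name, d in directions.items()
    }
```

For a real model the margin is nonlinear, so the measured response can change with both the case and the strength. Predicting the field would mean estimating these numbers from the activation alone, without the extra forward passes.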

Limitations

One model (Llama-3.1-8B, 4-bit for the control runs), synthetic data, pilot scale (about 100 scenarios for the trajectory work, 46 for control). Per-turn detection is tightly estimated. The trajectory and control numbers are at pilot precision. And the geometry-for-control negative is on a narrow, in places feature-limited set of methods. None of this makes geometry dead. It makes the specific methods I tried insufficient, which is a sharper claim than either "geometry wins" or "geometry loses", and it's the one the evidence actually supports.

What's next

Geometry, more powered. Feed the geometric selector the full point cloud (a clean test I previously ran feature-starved), and compose control actions across layers to attack the hard direction properly.
Scale. Replicate detection and the geometry result at Llama-3.x 70B and on a second architecture, to remove the single-model and quantization caveats.
Predict-the-field. Test whether the causal response field can be predicted from the representation rather than measured per case, and whether that cross-family correction direction is genuinely transferable.

What would change my mind: if the geometric selector, fed the full point cloud it was starved of, beats the trivial "pick the best measured action" baseline, or if the causal response field turns out predictable from the representation, I'd update toward geometry having a real control handle here. Right now I can't find one. Pointers to manifold methods I haven't tried (proper graph/Laplacian, kernel, hyperbolic, learned-metric on the full point cloud) are the most useful thing you could give me.

Code, results, figures, and a reproducible pipeline: github.com/rajarshighoshal/geometry-of-deception. Feedback is very welcome.

References

[1] Ghoshal, R., Abdelhalim, S. E. M., Basak, D., & Arora, P. K. "Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It." ICLR 2026 Workshop on Logical Reasoning of Large Language Models. OpenReview.
[2] Goldowsky-Dill, N., Chughtai, B., Heimersheim, S., & Hobbhahn, M. "Detecting Strategic Deception Using Linear Probes." Apollo Research, 2025. arXiv:2502.03407 (ICML 2025).
[3] Sharma, M., et al. "Towards Understanding Sycophancy in Language Models." Anthropic, 2023. arXiv:2310.13548 (ICLR 2024).
[4] Marks, S., & Tegmark, M. "The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets." 2023. arXiv:2310.06824.
[5] Turner, A. M., Thiergart, L., Udell, D., Leech, G., Mini, U., & MacDiarmid, M. "Activation Addition: Steering Language Models Without Optimization." 2023. arXiv:2308.10248.
[6] Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., & Turner, A. M. "Steering Llama 2 via Contrastive Activation Addition." 2023. arXiv:2312.06681 (ACL 2024).
[7] Zou, A., et al. "Representation Engineering: A Top-Down Approach to AI Transparency." 2023. arXiv:2310.01405.
[8] Azaria, A., & Mitchell, T. "The Internal State of an LLM Knows When It's Lying." Findings of EMNLP 2023. arXiv:2304.13734.
[9] Burns, C., Ye, H., Klein, D., & Steinhardt, J. "Discovering Latent Knowledge in Language Models Without Supervision." 2022. arXiv:2212.03827 (ICLR 2023).
[10] Nardo, C., Parrack, A., & Jordinne. "Here's 18 Applications of Deception Probes." LessWrong, 2025.