Epistemic status: a pilot study, negative-result-heavy by design. One model (Llama-3.1-8B, 4-bit for the control experiments), synthetic data, about 100 scenarios for the headline detection split and 46 for the control audit. I'm confident in the qualitative story: detection is linear, and "control" reduces to detection plus a measured probe. The exact magnitudes are at pilot precision.
Linear probes already detect deception and sycophancy reasonably well [2, 3, 8]. The "linear directions" line of work keeps finding the same thing: the Geometry of Truth [4] and Apollo's linear-probe deception detector (posted here on the Alignment Forum) [2] both conclude that the safety-relevant structure is linear. So I ran a small adversarial test of the natural next question. Does the non-linear geometry of the activation manifold, meaning the curvature and local point-cloud structure a flat probe throws away, add anything those linear methods miss? Not just for detection, but for control: steering a model back to an honest answer.
For the methods I tried, on Llama-3.1-8B, the answer is no. And the shape of the no is the point. Detection is linear. "Control" decomposes into near-perfect linear detection plus a per-case measured probe of which steering action moves the decision, with the geometry adding nothing on top. I think that negative is more useful than a win would have been, because it says where the leverage is and where it isn't. Here's how it falls out.
TL;DR
Why I expected geometry to matter
This builds on my first-author paper on when chain-of-thought helps versus hurts, and how prompt type is encoded in model activations [1]. The finding that stuck with me: prompt type is linearly decodable very early in the network, above 90% accuracy by layers 1 to 4. Yet the same representation can lead to opposite behavior depending on training. Instruction tuning flipped CoT from helpful to harmful on the same base model.
The lesson I took from that: linear decodability of a state doesn't tell you whether you can control the behavior. The representation and the behavior can come apart. So I was suspicious of the standard linear-probe story for deception and sycophancy. Linear probes assume one flat global decision boundary, and I wanted to know whether these safety-relevant states live on curved manifold structure a linear probe misses, and whether that structure gives you a control handle.
What I built
Two content-controlled testbeds, both built so a result can't be an artifact of wording or of an unreliable judge.
What I found
The shape-based methods I tried didn't add anything. On the hard direction (pushing a confident false report back to true), tangent steering loses to a random direction at matched strength, and a simple route-wise hybrid beats the best single steered direction. The honest caveat: I only tried a narrow slice, tangent projection and scalar point-cloud selectors, not the broader manifold program (graph/Laplacian, kernel, hyperbolic, learned-metric on the full point cloud). So this is a statement about what I tested, not a claim that geometry is dead.
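The matched-strength random control mentioned above can be sketched as follows. This is a minimal illustration rather than the pipeline from the repository: the linear `readout`, the toy activation `h`, and the `candidate` direction are all hypothetical stand-ins, and the candidate is deliberately chosen to align with the readout so the comparison has a known answer.

```python
import numpy as np

def matched_random_direction(v, rng):
    """Random vector rescaled to the same L2 norm as v (the matched-strength control)."""
    r = rng.normal(size=v.shape)
    return r * (np.linalg.norm(v) / np.linalg.norm(r))

def decision_shift(h, delta, readout):
    """Change in a scalar decision score when activation h is moved by delta."""
    return float(readout(h + delta) - readout(h))

rng = np.random.default_rng(0)
dim = 128
w = rng.normal(size=dim)            # hypothetical linear decision readout
readout = lambda x: x @ w
h = rng.normal(size=dim)            # toy activation, not a real model state

# Hypothetical candidate steering vector of norm 4 (aligned with the readout here).
candidate = 4.0 * w / np.linalg.norm(w)
candidate_shift = decision_shift(h, candidate, readout)

# Compare against many random directions of exactly the same norm.
controls = [matched_random_direction(candidate, rng) for _ in range(200)]
random_shifts = np.array([decision_shift(h, c, readout) for c in controls])
win_rate = float((candidate_shift > random_shifts).mean())
```

A steering method only earns credit for its direction if its `win_rate` against this control is high. A method that cannot beat the control, as reported above for tangent steering on the hard direction, is getting its effect from strength alone.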
The one positive. The correction direction is shared across unrelated content families: independently fit directions have a cosine similarity of about 0.65 to 0.81, against a permutation null near zero. That hints at a shared, possibly transferable correction structure, and it's the one geometry-flavored signal I'd chase next.
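For readers who want the shape of that comparison, here is a rough sketch of a cosine-versus-permutation-null check on synthetic data. It assumes the directions are fit as a difference of class means, which may not match how the directions in the post were fit; `make_family`, the dimensions, and the shift size are invented for illustration.

```python
import numpy as np

def mean_diff_direction(acts, labels):
    """Unit vector from the label-0 mean to the label-1 mean."""
    d = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def cosine_with_permutation_null(acts_a, y_a, acts_b, y_b, n_perm=200, seed=0):
    """Cosine between two independently fit directions, plus a label-shuffle null."""
    rng = np.random.default_rng(seed)
    d_a = mean_diff_direction(acts_a, y_a)
    observed = float(d_a @ mean_diff_direction(acts_b, y_b))
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling family B's labels breaks the link between labels and activations.
        null[i] = d_a @ mean_diff_direction(acts_b, rng.permutation(y_b))
    return observed, null

# Synthetic stand-in: two "families" with independent noise and one shared shift.
dim, n = 64, 400
shared = np.random.default_rng(1).normal(size=dim)
shared /= np.linalg.norm(shared)

def make_family(seed):
    r = np.random.default_rng(seed)
    y = np.tile([0, 1], n // 2)
    return r.normal(size=(n, dim)) + 2.0 * np.outer(y, shared), y

acts_a, y_a = make_family(2)
acts_b, y_b = make_family(3)
observed, null = cosine_with_permutation_null(acts_a, y_a, acts_b, y_b)
```

The quantity to read off is where `observed` sits relative to the spread of `null`: a shared direction shows up as an observed cosine far outside the shuffled distribution, which itself should center near zero.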
Why this matters for AI safety
Deception and sycophancy are failures where a model's output doesn't faithfully reflect what it internally represents. If output monitoring and prompting are unreliable, inference-time monitoring of internal activations becomes the fallback. So it matters a lot which internal method actually works.
The finding here cuts against a common assumption. On this task, a linear monitor plus abstention is the effective intervention. Geometric control is neither needed nor better. That's worth saying plainly, because the implicit assumption in a lot of this work is that richer geometry buys you safety leverage. At least here, it doesn't.
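A linear monitor with abstention can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the detector from the post: the probe is a difference-of-means classifier, label 1 stands for an unfaithful answer, and the activations are synthetic Gaussians with an invented shift.

```python
import numpy as np

def fit_mean_diff_probe(acts, labels):
    """Linear probe: weight = difference of class means, boundary at their midpoint."""
    mu0 = acts[labels == 0].mean(axis=0)
    mu1 = acts[labels == 1].mean(axis=0)
    w = mu1 - mu0
    b = -0.5 * float(w @ (mu0 + mu1))
    return w, b

def flag(acts, w, b, threshold=0.0):
    """True where the probe score exceeds the threshold (the monitor fires)."""
    return acts @ w + b > threshold

rng = np.random.default_rng(0)
dim, n = 32, 400
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
labels = np.tile([0, 1], n // 2)    # assumption: 1 = unfaithful answer
acts = rng.normal(size=(n, dim)) + 4.0 * np.outer(labels, direction)

# Fit on the first half, monitor the second half.
train, test = slice(0, n // 2), slice(n // 2, n)
w, b = fit_mean_diff_probe(acts[train], labels[train])
flagged = flag(acts[test], w, b)

answered = ~flagged                                   # abstain whenever the monitor fires
coverage = float(answered.mean())                     # fraction of cases still answered
base_error = float(labels[test].mean())               # error rate with no monitor
error_after = float(labels[test][answered].mean())    # error rate among answered cases
```

The trade-off is between `coverage` and `error_after`: raising `threshold` answers more cases but lets more unfaithful ones through. No steering is involved, so the quality of the intervention is bounded only by the quality of the detector.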
This connects to a recent LW catalog of deception-probe applications by Nardo, Parrack, and Jordinne [10], which separates detection uses from control uses and is cautious that control needs more than a statistical probe. My pilot is a concrete case on that question: on a checkable task, control reduces to detection plus a per-case measured probe, and the detector does the work.
It also sharpens the point that linear decodability isn't control. The interesting object isn't the shape of the activation cloud, that is, where a point sits on the manifold. It's the causal response field: how the decision moves when you perturb the activations. On this task I can only measure that field per case; I can't predict it from the representation. Whether it can be predicted is, to me, the real open question.
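To make "measuring the field per case" concrete, here is a deliberately tiny sketch. The `decide` function is a made-up two-dimensional toy, not a model forward pass, and the two `actions` are hypothetical steering vectors. It is chosen so that the best action depends on the case.

```python
import numpy as np

def measure_response(h, actions, decide):
    """Decision change produced by each candidate action on one activation h."""
    base = decide(h)
    return np.array([decide(h + a) - base for a in actions])

def best_measured_action(h, actions, decide):
    """Index of the action with the largest measured decision change, plus all responses."""
    responses = measure_response(h, actions, decide)
    return int(np.argmax(responses)), responses

# Toy nonlinear decision: the effect of a push depends on where h already sits.
decide = lambda h: float(h[0] * h[1])
actions = [np.array([0.5, 0.0]), np.array([0.0, 0.5])]

case_a = np.array([1.0, 2.0])
case_b = np.array([2.0, 1.0])
best_a, resp_a = best_measured_action(case_a, actions, decide)
best_b, resp_b = best_measured_action(case_b, actions, decide)
```

Here the same two actions rank differently for `case_a` and `case_b`, so no single global direction is best. Each measurement costs one extra evaluation of `decide` per action per case, which is what a predictor of the response field would save.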
Limitations
One model (Llama-3.1-8B, 4-bit for the control runs), synthetic data, pilot scale (about 100 scenarios for the trajectory work, 46 for control). Per-turn detection is tightly estimated. The trajectory and control numbers are at pilot precision. And the geometry-for-control negative is on a narrow, in places feature-limited set of methods. None of this makes geometry dead. It makes the specific methods I tried insufficient, which is a sharper claim than either "geometry wins" or "geometry loses", and it's the one the evidence actually supports.
What's next
What would change my mind: if the geometric selector, fed the full point cloud it was starved of, beats the trivial "pick the best measured action" baseline, or if the causal response field turns out predictable from the representation, I'd update toward geometry having a real control handle here. Right now I can't find one. Pointers to manifold methods I haven't tried (proper graph/Laplacian, kernel, hyperbolic, learned-metric) are the most useful thing you could give me.
Code, results, figures, and a reproducible pipeline: github.com/rajarshighoshal/geometry-of-deception. Feedback is very welcome.
References