the how-axis
The amount of work that human somatic experience & emotions do for us is widely recognized. They are not directly rational, and sometimes run counter to what our rational mind says, but they perform powerful computational work in our subconscious and deserve to be met head on and respected. They matter for recall; for encoding; for processing events; for bodily function. Hungry? Eat. Sad? Cry. They form an axis orthogonal to the what of our thoughts, manifesting as how we felt about it and shaping the way we arrive at the what. So, what if transformers had an analogous internal functionality? This isn't about phenomenological experience, but about drawing parallels between the concrete function of emotions and asking whether there are systems inside transformers that serve an equivalent computational purpose.
This isn't totally out of left field either; there is evidence of some form of signal present in the outputs of transformers which is not readily available in a semantic sense, yet corresponds to the internal state & processing of the language model which generated it, the most compelling being the subliminal learning work. Student LLMs trained on a teacher LLM's outputs pick up hidden patterns (loving owls) without any semantic trace of the owl-lovingness in the text, just in number sequences. How does this happen, mechanistically? We have evidence pointing toward higher-level analyses of this (token entanglement, paraphrasing fragility), but the actual computational framework that gives rise to it hasn't been drawn out yet. This is the thrust behind my work: to map out where this occurs inside an LLM, and whether a signal relating to it can be extracted.
the experiment
My experiment set out to answer the following questions: Does this signal exist? Is it readily available enough to be cleanly extracted? Is it truly orthogonal to the semantic content of the output in a meaningful way? To investigate, I built a pipeline that extracts what I'll refer to throughout as "computational signatures": essentially compressed fingerprints of a model's internal dynamics over the course of a generation/completion. The extraction pipeline is multi-tiered, each tier diving progressively deeper into the guts of a transformer. First (T1) is logit statistics, touching on exactly what tokens get picked; second (T2) is attention routing patterns, capturing how information flows; third (T2.5) is KV cache temporal dynamics, measuring how the cache evolves throughout a generation; and finally (T3) PCA on raw residual stream projections. These were my best guesses for where sub-semantic signal would live in the architecture, and together they make up the computational signatures.
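To make the tiers concrete, here is a minimal numpy sketch of the kinds of statistics each tier might compute. The input arrays stand in for values you would pull from forward hooks (per-step logits, attention weights, and per-layer KV cache norms); the actual feature set in the pipeline is richer, and the specific statistics chosen here are illustrative assumptions, not the exact feature list.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def t1_logit_features(logits):
    """T1: per-step logit statistics. logits has shape (steps, vocab)."""
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(-1)   # uncertainty per step
    top1 = probs.max(-1)                                 # confidence per step
    return {"entropy_mean": entropy.mean(), "top1_mean": top1.mean()}

def t2_attention_features(attn):
    """T2: attention routing. attn has shape (layers, heads, steps, keys)
    and sums to 1 over keys; entropy measures how diffusely each head reads."""
    h_ent = -(attn * np.log(attn + 1e-12)).sum(-1)
    return {"attn_entropy_mean": h_ent.mean(), "attn_entropy_std": h_ent.std()}

def t25_kv_features(kv_norms):
    """T2.5: KV cache temporal dynamics. kv_norms has shape (layers, steps),
    e.g. the norm of the cache per layer; deltas track how it drifts."""
    deltas = np.diff(kv_norms, axis=-1)
    return {"kv_delta_mean": deltas.mean(), "kv_delta_std": deltas.std()}
```

Concatenating dictionaries like these across tiers, per generation, would give one "computational signature" vector per text.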
The experiment iterated over three runs, each designed specifically to address weaknesses & confounds identified in the prior one. The model I chose was Llama 3.2 3B. I expected this to be a confound in itself, doubting whether it had a rich enough internal state to actually extract signal from, but if it didn't, that would itself be information, and it was cheap enough for a quick mapping experiment before scaling up to something like Llama 3.1 70B. I prompted specific modes intended to require the model to actually engage that mode internally, and ran my feature extraction on the resulting generations. To ground the results, I compared against two baselines: Opus 4.6 as a blind judge model (seeing only the output text, with no labels) across all runs, and TF-IDF text classification through the same pipeline for the final run.
The first successful run targeted epistemic states: analytical, creative, uncertain, confident, emotional. The second shifted towards prescribed processes: structured, associative, deliberative, compressed, and pedagogical. The third & final run introduced format controls, requiring all modes (linear, analogical, socratic, contrastive, and dialectical) to produce identical-looking prose. This run was intended to be the most decisive in understanding whether this signal was real, what it meant, and where it lived within my current pipeline, since due to that format control, whatever signal I detect should be sub-semantic.
three runs, one inversion
The first run hit 76% discrimination overall, higher than the final run, but the signal was shallow in a very specific & informative way. It was concentrated almost entirely in logit statistics (57%, vs 48% for attention routing and 41% for KV cache dynamics). The modes which carried the accuracy were analytical and creative, both of which the blind judge was able to identify 100% of the time. They prescribed specific output behaviors, not just different internal computations. The other three modes tried to target more genuine internal states, which the judge could hardly identify at all.[1] This looked to me like a failure not of the pipeline or the extraction but of elicitation. The prompts were polite suggestions using cognitive abstractions that the 3B model likely wasn't able to represent as distinct states, and the ones that did succeed showed up where one would expect them to. Confident vs uncertain hit 76.7%, discriminated almost entirely through logit entropy and top-1 token probability, but the judge failed entirely on these, conflating uncertain with analytical 14/15 times and confident with analytical 10/15 times. The signal was coming through where you'd expect given the prompts, but it was shallow & output-level, meaning I needed to construct something that would test the deeper tiers properly.
The second run was too easy overall; discrimination jumped straight to 93%, but so did the judge's accuracy, up to 91%. The gap between the two was so small that it signaled, once again, that essentially all of the signal was surface-readable. My features were clearly picking up on something, so elicitation was no longer the issue, but too much was too easy for the result to be meaningful. Surface-level formatting & structural confounds were doing the heavy lifting. All of this held save for one category, deliberative. That hit 87% on my classifier but only 60% when the judge model attempted to identify it, a 27-point gap. The outputs were reasoning within structured formatting; the judge was locked onto the headers, but the random forest was able to see past them. Pedagogical showed the reverse, with the judge reading it at 100% but the RF at 80%. The voice was readily apparent in the text, but (and this is important) computationally it overlaps with deliberative, since both involve explanatory self-referencing. These were promising signs that there is something here, but there were too many confounds across the full set to truly isolate whether the finding meant anything.
The third run was the most thorough, and the results were very telling. Discrimination accuracy dropped to about 70% on average, which shows much of the easy signal was removed by the format control, but that a thready signal remains. Socratic mode was the standout: the classifier identified it at about 65%, but the judge managed only 28%, a 37-point gap, wider than deliberative's. 12 of 18 socratic texts were called linear by the judge. The model was likely executing interrogative planning while producing output that reads as straightforward exposition. TF-IDF analysis landed at 45% on socratic as well, confirming that it wasn't just the judge: surface text methods in general couldn't see what my classifier was grabbing. And in the inverse once more, contrastive mode was knocked out by TF-IDF at 95%, vs 60% for my classifier. Key words like "however" and "in contrast" were doing the bulk of that work. This means the sub-semantic claim isn't blanket, but mode-specific within this experiment. Some modes leaked information into the content level that made them trivial for text-analyzing methods to pick up; others hid the value inside the computation entirely.
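The keyword leak is easy to see in miniature. Below is a toy TF-IDF (term frequency times log inverse document frequency, the standard formulation; the actual run presumably used a library implementation) on three short texts, one contrastive and two linear. The example documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF: weight(term, doc) = tf(term, doc) * log(N / df(term))."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    n = len(docs)
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, rows

docs = [
    "however the second view differs in contrast to the first",   # contrastive
    "the idea follows step by step from the premise",             # linear
    "the premise leads to the idea and the idea to the result",   # linear
]
vocab, rows = tfidf(docs)
# "however" appears only in the contrastive text, so it gets a nonzero weight
# there and zero elsewhere; common words like "the" appear in every document
# and get idf = log(1) = 0. This is exactly the kind of surface cue that let
# TF-IDF hit 95% on contrastive mode.
```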
Throughout these runs, a trend emerged: progressively inverted tier prominence. The more I stripped out surface confounds and tried to falsify the signal, the deeper the signal migrated, from logit statistics down toward KV cache dynamics. In run 3, dropping KV cache dynamics cost 11pp, but dropping logits cost only 1pp. Removing 80% of the features, everything not related to attention routing & KV cache dynamics, actually improved the pipeline's discrimination; logit analysis & PCA on the residual stream were introducing more noise than signal. The same pipeline, features, and extraction were producing opposite tier rankings with elicitation as the only variant, showing that where the information lives shifts with what is being elicited.
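The tier-ablation logic can be sketched as masking out one feature group at a time and re-scoring. For self-containment this sketch uses a nearest-centroid stand-in rather than the random forest from the actual runs, and the column layout and synthetic data in the comment are hypothetical.

```python
import numpy as np

def ablation_accuracy(X_train, y_train, X_test, y_test, keep):
    """Score a nearest-centroid classifier using only the columns in `keep`
    (a boolean mask); dropping a tier means masking its columns out."""
    Xtr, Xte = X_train[:, keep], X_test[:, keep]
    classes = np.unique(y_train)
    centroids = np.stack([Xtr[y_train == c].mean(0) for c in classes])
    dists = np.linalg.norm(Xte[:, None, :] - centroids[None], axis=-1)
    return (classes[dists.argmin(1)] == y_test).mean()

# Hypothetical layout: columns 0-9 hold T1 logit features, 10-19 hold T2/T2.5.
# for name, cols in {"logits": range(10), "kv_attn": range(10, 20)}.items():
#     mask = np.ones(20, bool); mask[list(cols)] = False   # drop this tier
#     print("without", name, ablation_accuracy(Xtr, ytr, Xte, yte, mask))
```

The run 3 pattern would show up here as: masking the logit columns barely moves accuracy, masking the KV/attention columns collapses it.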
This is further reinforced by a double dissociation between feature tiers: when varying only the generation temperature, 0.3 vs 0.9, logit statistics discriminated at 90% while KV features fell to chance. When varying processing mode under format control, the pattern fully reversed, with KV cache features at 64% and logits near chance; the run 3 result. This demonstrated that the tiers are functionally independent systems.
closing the gaps
A question raised during adversarial review of these results was whether the features were detecting prompt information rather than the actual dynamics occurring. This was resolved via a prompt-swap examination: I generated texts with a socratic system prompt but a user prompt overriding it with instructions to write linearly. The model complied & the outputs read as straightforward linear exposition. The question was then whether the pipeline would detect the socratic prompt sitting in the KV cache, or the linear execution that was actually performed. Decisively, it detected the execution: the swap texts were computationally indistinguishable from pure linear at 50%, chance, confirming what the signal derives from.
Before closing & writing up, I had one last gating check: whether the discriminative subspace could be made geometrically accessible, or whether the RF was just exploiting high-dimensional noise. The answer was clear: linear projections failed entirely, with negative silhouette, worse than random, but a small nonlinear MLP succeeded, pulling the silhouette from near-zero to 0.18 and kNN accuracy to 63%. The run 3 signal was living on a curved manifold that linear methods were totally unable to unwrap; a full sign inversion. That was itself a somewhat unexpected finding, and it motivated a follow-up: tier ablation via contrastive projection. The result further sharpened the tier inversion, with T2+T2.5 alone hitting 73% kNN, outperforming the full feature set at 63%. Adding logits hurt; adding PCA hurt even more. 366 features, about 20% of the pipeline's total, were carrying the entire signal and then some.
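For readers unfamiliar with the silhouette metric doing the gating work here: for each point, compare the mean distance to its own cluster (a) against the mean distance to the nearest other cluster (b); the score (b - a) / max(a, b) is near 1 for tight, well-separated clusters and negative when points sit closer to the wrong cluster. This plain-numpy version computes the same quantity as sklearn's silhouette_score.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette score over all points in X given integer labels."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)   # pairwise distances
    scores = []
    for i, c in enumerate(labels):
        own = labels == c
        own[i] = False                     # exclude the point itself
        if not own.any():
            continue                       # singleton cluster: undefined
        a = D[i, own].mean()               # mean intra-cluster distance
        b = min(D[i, labels == o].mean()   # nearest other cluster
                for o in np.unique(labels) if o != c)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

A negative mean silhouette, as the linear projections produced, means the projection is actively placing points nearer to the wrong mode's cluster than their own.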
Still, the TF-IDF parity needed addressing. At the aggregate level, surface text classification was within 2% of my internal features, which smells terrible at first glance. If bag-of-words gets you all the way to 68%, the claim of a sub-semantic signal wears thin, let alone claims about its orthogonality. The direct test was to train the same contrastive projection setup on compute features, on semantic embeddings, and on TF-IDF, all under topic-heldout cross-validation, and compare the results. In this setup, the computational signature features hit 78% with positive silhouette; semantic embeddings (all-MiniLM-L6-v2) hit 48% with negative silhouette; and TF-IDF collapsed to 14% with an 86% train-test gap. Ridge regression from text to compute features gave a median R² of -1.11; 363/366 features were anti-predicted from the text. This confirmed orthogonality: they measure different things entirely (McNemar p = 1.000; adding semantic features to the compute set gained 5 samples and lost 4). The orthogonality emerges after contrastive projection strips away the topic-dominant dimensions, which resonates with the nonlinear-access finding: the signal is there but requires geometric transformation to isolate.[2]
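The ridge check can be sketched as fitting a closed-form ridge map from text features to each compute feature and scoring each with held-out R². The split scheme and regularization strength below are placeholders, not the ones from the actual run.

```python
import numpy as np

def ridge_r2(X, Y, lam=1.0):
    """Ridge-regress each column of Y (one compute feature) on X (text
    features) and return held-out R² per column, using a first-half/
    second-half split for simplicity."""
    n = len(X) // 2
    Xtr, Xte, Ytr, Yte = X[:n], X[n:], Y[:n], Y[n:]
    # Closed form: W = (X'X + lam*I)^-1 X'Y
    W = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ Ytr)
    pred = Xte @ W
    ss_res = ((Yte - pred) ** 2).sum(axis=0)
    ss_tot = ((Yte - Yte.mean(axis=0)) ** 2).sum(axis=0)
    return 1 - ss_res / ss_tot   # one R² per compute feature
```

R² below zero means the text-based prediction does worse than just guessing the held-out mean of the compute feature; that is the sense in which a median of -1.11 over 363/366 features counts as anti-prediction.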
modes are scaffolding
The above modes are scaffolding: an experimental variable and control. The computational signatures are not actually representations of these modes, but sets of computational properties that happen to contain the mode. To state it plainly: I instruct a method of reasoning through the problem, I enforce a similar look to the text, and the pipeline can extract the groupings of modes better than the semantic measurements can at this stage. Whether the signature represents the mode wholesale has no clean answer. The distinction is important; most mechanistic interpretability methods focus explicitly on extracting concrete circuitry for what makes the model do X, whereas this setup investigates what region of computational space the model occupies while doing X.
A paper from Shai et al. demonstrates that transformers don't encode discrete labels or mode tags or anything of the like internally; instead they encode full belief states over the hidden variables of the data-generating process, and that encoding is geometric. If that's the right frame, then the modes I prompted are almost certainly not discrete categories but anchor points within a continuous belief-state manifold. The model's own learned position on the topic, learned and instructed formatting constraints, priors about this kind of text, the mode itself, and so on: all of these are axes within the same space, and the signature accesses the mode instruction as one axis among them. This makes sense of the whole picture, especially the places where semantics/judging succeed and the pipeline fails, as well as the inverse. As you remove confounds, my features become more prominent, but in ways that correspond to this theory: modes which would be close to each other in this kind of space are conflated with each other more frequently than with other modes, and modes which stand out are easily selectable.
An experiment I performed points in this direction. Cross-run transfer testing of MLPs shows that pedagogical maps onto socratic at about 76%, associative onto analogical at 51%, and dialectical onto deliberative at 85%. These were totally different runs with different prompts, yet some generalization appeared. This supports the idea that "asking questions" is a computational primitive cross-cutting the mode taxonomy, betraying that the space is not simple modes but something more complex. Analogical itself stands out in the results, with 100% accuracy and 0.939 silhouette, because it is a genuinely distinct & distant region, implying tight neighborhoods as well as distant outliers. The 5 modes in each run do not characterize the whole geometry, nor should we expect any larger set of "modes" to do so; the results are consistent when viewed through the lens of belief-state geometry.
This also has interesting implications for the subliminal learning connection, if it holds. If the computational signatures are capturing the model navigating this belief-state geometry, then subliminal learning has a potential candidate mechanism: the geometry shapes the token distribution in ways that are semantically invisible but computationally structured, and training on these tokens reconstructs the geometry in the student. The real test will be whether naturalistic/in-the-wild generations on models employ structure richer than the five prompted clusters.
The results suggest to me that this space is organized at multiple resolutions, where finer distinctions require fundamentally different tools to access. Coarse computational regimes are distant enough for linear separation. Similar computations sharing a substrate or computational primitives need nonlinear boundaries. At the finest level, individual computational operations likely have their own geometric structure entirely; one possible example of this being the manifold manipulation paper by Gurnee et al. This is a target of future experimentation for me.
confidence & open questions
"Of course different prompts produce different computation." Sure, but the question is whether that difference is orthogonal to semantic content and concentrated in specific architectural components. And I do think there's a boring-true layer to dig through first: coarse task-type discrimination probably dominates most of the space in practice. Code vs essay vs Q&A is almost certainly linearly separable from internal states, and that's not a novel finding. Getting at finer structure, the kind that would make this meaningfully useful beyond what semantics already gives you, requires further engineering and a better understanding of the architecture and the shape of the space itself.
Everything above was done on a single model at small sample sizes. If replication on other models fails and scale doesn't make the signal more accessible, that would be informative. There's no reason to expect this is unique to Llama 3.2 3B, though; it was not selected for any special reason beyond pragmatics, and the results came through more strongly than my initial predictions suggested. I do anticipate that feature engineering will need to shift model-to-model and especially across families. Different architectures, different training methods, different ways of relating to context. The specific features that carry signal in a dense GQA transformer very likely won't transfer to an MoE. What I expect does transfer is the principle: temporal dynamics of the computation-relevant architecture carry processing mode information, and the right place to look is predictable from the architecture even if the specific features aren't.
There's an inherent escape hatch in this kind of research that should be clarified early. Almost any negative result can be attributed to insufficient feature engineering, or to the generations being too similar, or to the model itself having weak signal. This is the nature of diagnostic science; a single positive is proof of existence, but falsification requires compounding negatives with no remaining explanations. Weak positives are their own problem; everything pointing in a direction without clean resolution. The way I handle this is pre-registering predictions scoped to specific experiments. Still, the falsification criteria for something like 8B replication are genuinely hard to specify in advance. Feature families that work on the 3B will almost certainly need some level of redesign for the different architecture; different head counts, different layer structure, different distillation history. A negative result could mean the signal isn't there, or it could mean the instrument needs recalibration. What would actually concern me is if the principle stops working. If "look at temporal dynamics of the portions of the architecture most important to performing computation" stops being a productive direction across architectures, and each new model requires completely unguided intuition to find anything. The principle transferring while features don't is expected. The principle itself failing to point anywhere would update me significantly.
The general claim that transformer computation carries structured, semantically independent information is one I hold at high confidence. The experiment above was designed to falsify it and failed to do so, and the literature backs the idea. What I hold at moderate confidence is everything about the specific structure of that information: the hierarchy, the geometric properties, the cross-architecture predictions. Each of those is testable, and each result would update me in the direction of the evidence. What I hold at low confidence, and think is the most interesting point, is whether this connects to something like computational character: whether models develop distinctive ways of processing that are consistent, measurable, and meaningfully different from model to model. This may connect to what makes up the felt sense of a model; if computational signatures are real and structured, they could leave imprints that we pattern-match on but find hard to quantify.
To further investigate the phenomenon at work here, I have plans for multiple new experiments testing specific claims, as well as a few open questions.
All of these will contribute to understanding the territory piece by piece, and will likely generate more interesting questions to follow.
There is also an interesting experiment to be done on the subliminal learning connection. Consider it this way: start with a null model, say Llama 3.1 8B. Fine-tune a checkpoint of that same model as a model organism, using the same methods as Cloud et al. Then compare their signatures using a confirmed pipeline, looking at the delta across prompts matching the experiment structure. Afterwards, make a control fine-tuned checkpoint from the null, and a student trained from the poisoned data via the same teacher/student method. Check the signatures for those; now we can see whether they track the actual reported effects of the subliminal teaching by seeing whether the deltas close.
If this all holds up, the implication would be that transformers carry a structured internal axis of how they compute and process contexts; one that is measurable, architecturally localized, and not readily extractable just from the text output. The somatic analogy from the start isn't simply a metaphor for thinking about the nature of this phenomenon, but a description of what the data looks like. There's a how-axis that is orthogonal to the what-axis and it lives in specific places.
[1] Run 1 per-mode judge accuracy: analytical 100%, creative 100%, emotional 60%, confident 33%, uncertain 7%. The 60% overall was dominated by the two modes that prescribed visible output patterns; the three targeting internal states were largely invisible to the judge.
[2] The independence isn't perfect in raw feature space; the Mantel correlation between computational and semantic distance is r = 0.40, meaning topic explains more variance than mode at that level. The orthogonality is a property of the mode-relevant subspace, not of the full feature space. It emerges after contrastive projection strips away topic-dominant dimensions.