Features of SAEs are universal - but only up to an unknown random rotation

Jordan McCann

Cross-model decoder-column cosine says that two models learned the same features. Apply the SAE of one model to the activations of another, and its reconstruction score becomes negative. Why? How to fix it?

Epistemic status: I am confident in the core empirical claims. The acceptance thresholds were fixed before I looked at any results, the central result replicates consistently across two model scales (a 104k-parameter toy where I can inspect every weight, and nine independently-trained Pythia-70m seeds on The Pile), and it's all reproducible from the released code. The frontier-scale (10B+) version is untested and I flag it as such throughout. As an independent researcher, I would appreciate the most scrutiny on the Haar-distribution claim (§5) and the single-data-point cross-checkpoint result (§7).

TL;DR

Take two transformers of the same architecture, initialized from different random seeds, and train them on identical tasks. The two networks compute the same function. However, their residual-stream activation representations live in bases related by a rotation that is statistically indistinguishable from a uniform (Haar) random draw from the special orthogonal group SO(d), where d is the residual-stream dimension. I call this polymorphism: same function, mutually unintelligible residual spaces.

The literature on SAE universality reports high cross-model decoder-column cosine similarities (around 0.9), interpreted as evidence that the models learned "the same features". I reproduce this number (about 98% of features match at cosine above 0.5; mean max-cosine 0.89 on the toy, 0.91–0.93 on Pythia), and then show that it hides a catastrophic encoder failure.

Apply one seed's SAE to another seed's activations, and reconstruction fails catastrophically: negative explained variance, worse than predicting the activation mean. Fixing this requires one matrix multiplication – specifically, an orthogonal Procrustes rotation. Post-rotation reconstruction scores reach 0.99 on the toy, and 0.85–0.99 on Pythia – no re-training required. At the same time, the Frobenius distance between R and the identity matrix is consistent with the predictions of a rotation distribution uniform on SO(d).

Upshots. First, cross-model SAE similarity should be reported as a triplet: decoder cosine, raw cross-model EV, and post-rotation EV. Second, to steer the activations of independently trained models, apply R to your steering vectors. Both are cheap to compute.

1. The puzzle that led here

Here is a phenomenon of SAEs that, I think, the universality literature has only partially understood.

Take two transformers with identical architecture and independent random initializations, train them on the same task, and measure the cross-model decoder cosine similarity of their SAEs. The values come out around 0.9. This is the number reported in the SAE universality literature (Towards Monosemanticity, Scaling Monosemanticity, Gemma Scope, Cunningham et al., Lan et al.) as evidence that the models learned "the same features". I have reproduced the numbers (about 98% of the features match). Then I took the obvious next step: apply the SAE of model A to the activations of model B.

Results: failure. Catastrophic failure. Reconstruction score drops to negative explained variance - worse than ignoring the input entirely and predicting the mean activation.
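To make "negative explained variance" concrete, here is a minimal sketch of the metric. It assumes the common definition, one minus the residual sum of squares over the total sum of squares about the per-dimension mean; the post does not spell out its exact formula, so the function name and definition here are illustrative assumptions, and the data is synthetic.

```python
import numpy as np

def explained_variance(x, x_hat):
    """1 - (residual sum of squares) / (total sum of squares about the mean).

    Assumed definition; rows of x are samples, columns are dimensions.
    """
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - resid / total

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16))              # synthetic stand-in for activations

# Predicting the mean activation scores exactly 0 under this definition.
ev_mean = explained_variance(x, np.broadcast_to(x.mean(axis=0), x.shape))
# A perfect reconstruction scores 1.
ev_perfect = explained_variance(x, x)
# A confidently wrong reconstruction scores below 0: worse than the mean.
ev_bad = explained_variance(x, -2.0 * x)

print(ev_mean, ev_perfect, ev_bad)
```

Under this definition there is no lower bound: a reconstruction that points the wrong way can be arbitrarily worse than the mean baseline, which is why a negative score signals a mismatch rather than merely a weak fit.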

The decoder columns match, but the encoders cannot parse the other model's activations. It is as if the residual-space bases had been rotated relative to each other, by a rotation that I show below is statistically indistinguishable from a uniform random sample from the orthogonal group.

This post establishes the phenomenon at two scales and provides a fix that costs one matrix multiplication.

2. Two scales, but both matter

The toy model is a 104k-parameter two-layer ReLU transformer trained on a modified Dyck-3 task: bracket-matching with a depth counter and a validity flag. It is small enough that I can read the function implemented by every weight, which lets me verify the mechanism against ground truth. The test case is nine independently-trained Pythia-70m seeds (residual dimension d = 512, 6 layers) on The Pile. This is two-to-three orders of magnitude larger (71M vs. 104k parameters; 8× wider), with no architecture or training-data connection to the toy, which lets me check that the mechanism is real and not an artifact of the toy setup. The interesting claim is the one that generalizes across scales, and it does (see Methods for details).

3. Sibling models learn identical functions

Naive cross-seed SAE transfer is catastrophic. I collected the residual-stream activations of each seed on the same fixed batch, and evaluated reconstruction by applying the seed-0 SAE to every seed's activations.

At the input site – token-plus-positional embedding, common and bit-identical between the toy's coordinated seeds – transfer succeeds perfectly; cross-seed EV equals the corresponding self-reconstruction EV to four decimals (as it must). At every internal site, the cross-seed reconstruction falls through the floor. On the toy, the worst-case score on a 32× expansion is EV -6.56. On Pythia, mean cross-seed EV ranges between -2.11 and 0.75; every internal layer from the second block onward is negatively reconstructed.

Calibration: a structured random-noise baseline. When the SAE is given samples from a structured Gaussian with the right mean and covariance, it reconstructs them with EV between 0.94 and 0.98 across sites. In other words, the SAE handles matched Gaussian noise better than it handles the real activations of a sibling model.

4. Orthogonal Procrustes alignment fixes it

For each (seed, site) pair, I computed the optimal orthogonal rotation R - the orthogonal matrix that, applied to seed N's activation matrix, minimizes its Frobenius distance to seed 0's activation matrix (the standard orthogonal Procrustes problem) - and then applied the seed-0 SAE to the rotated activations. Reconstruction is restored. On the toy, EV recovers to between 0.976 and 0.990 across sites. On Pythia, EV recovers to between 0.85 and 0.99, with the best results in the middle of the stack and the worst immediately before the first transformer block.
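The Procrustes fit described above has a closed-form SVD solution. The sketch below shows it on synthetic data. It assumes rows are samples and that R acts on the right, so that `x_seedN @ r` approximates `x_seed0`; the function and variable names are illustrative and not taken from the released code.

```python
import numpy as np

def procrustes_rotation(x_src, x_tgt):
    """Orthogonal R minimizing ||x_src @ R - x_tgt||_F (rows are samples).

    Standard closed form: with x_src.T @ x_tgt = U S V^T, the minimizer is U V^T.
    """
    u, _, vt = np.linalg.svd(x_src.T @ x_tgt)
    return u @ vt

rng = np.random.default_rng(0)
n, d = 2000, 64
# Anisotropic synthetic stand-in for seed 0's activations.
x_seed0 = rng.normal(size=(n, d)) * np.linspace(0.5, 3.0, d)
# Hidden orthogonal change of basis plus a little noise plays the sibling seed.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
x_seedN = x_seed0 @ q + 0.01 * rng.normal(size=(n, d))

r = procrustes_rotation(x_seedN, x_seed0)

scale = np.linalg.norm(x_seed0)
rel_err_raw = np.linalg.norm(x_seedN - x_seed0) / scale       # before alignment
rel_err_rot = np.linalg.norm(x_seedN @ r - x_seed0) / scale   # after alignment
print(rel_err_raw, rel_err_rot)
```

In this synthetic setup the fitted `r` recovers the inverse of the hidden change of basis, and the rotated activations can then be passed to the seed-0 SAE unchanged. No SAE weights are touched.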

Note the uniformity of results across sites. An artifact of one layer's output would leave a patchy recovery pattern across sites; instead, recovery is uniform across the toy interior, and the only weak site anywhere is Pythia's pre-first-block residual, which is near rank-deficient rather than a genuine counterexample. The toy interior bases are related by an orthogonal transformation.

5. The rotation matrix R comes from Haar SO(d)

This is the claim I would most like examined – because it is a powerful one.

Let R be the optimal rotation matrix. R is far from the identity: the Frobenius distance between R and the identity is between 9.4 and 10.7 on the toy (d = 64), and averages 31.99 on Pythia (d = 512). For a Haar-drawn orthogonal matrix, the squared Frobenius distance to the identity is 2d - 2 tr(R), and the trace has mean zero, so the distance concentrates at sqrt(2d): 32.00 for d = 512 and 11.31 for d = 64. The mean Pythia distance is a hair's breadth from that prediction (10th–90th percentile: 31.94 to 32.03), with negligible variation across seed pairs. The toy is close as well: its 9.4-to-10.7 range sits just below 11.31, and the small shortfall is expected, since the toy's coordinated cohort shares input/output weights, so only the interior subspace is free to rotate.
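The sqrt(2d) prediction is easy to check numerically. The sketch below draws Haar-distributed orthogonal matrices using the standard QR-of-a-Gaussian construction with a sign correction, then measures their Frobenius distance to the identity. It samples from the full orthogonal group O(d) rather than SO(d), which does not change the expected squared distance; the helper name `haar_orthogonal` is illustrative.

```python
import numpy as np

def haar_orthogonal(d, rng):
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix.

    Multiplying each column of Q by the sign of the matching diagonal entry
    of R removes the sign ambiguity of the QR factorization.
    """
    z = rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
results = {}
for d in (64, 512):
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    dists = np.array([np.linalg.norm(haar_orthogonal(d, rng) - np.eye(d))
                      for _ in range(20)])
    results[d] = (np.sqrt(2 * d), dists.mean(), dists.std())
    print(d, results[d])
```

The spread around sqrt(2d) is small because the distance depends on R only through its trace, and the trace of a Haar-drawn orthogonal matrix fluctuates by order one regardless of d.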

Orthogonal-rotation distance is necessary but insufficient evidence of a Haar SO(d) distribution – after all, you might have some structured orthogonal matrix whose Frobenius distance to the identity coincidentally matched sqrt(2d). To eliminate that possibility, I ran an eigenvalue spectral analysis on the rotation matrices.

The eigenvalues of an orthogonal matrix lie on the unit circle, with the non-real ones in complex-conjugate pairs; under Haar measure their angles follow a known distribution (Weyl's integration formula). Pooling 28,672 eigenphases (the spectra of R across all pair–site combinations, that is, 56 spectra of 512 eigenvalues each) and running a two-sample Kolmogorov–Smirnov test against Haar SO(d) shows no detectable difference: KS statistic 0.0027, p = 1.000. This is not a pooling artifact - the per-pair KS statistics across all 56 (pair, site) combinations fall between 0.0068 and 0.0104, every one with p of about 1.000, so each individual rotation is on its own indistinguishable from a Haar draw. The pooled mean of cos(theta) is 0.0006, within 10^-3 of the Haar prediction of zero.
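A minimal sketch of the eigenphase comparison, under stated assumptions: the reference distribution is built empirically from sampled Haar matrices (drawn from O(d) via QR), the "candidate" is itself a Haar draw standing in for a fitted R, and only the two-sample KS statistic is computed, with no p-value. All names are illustrative; this is not the released pipeline.

```python
import numpy as np

def haar_orthogonal(d, rng):
    """Haar-distributed orthogonal matrix (QR of a Gaussian, sign-corrected)."""
    z = rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))

def eigenphases(mat):
    """Angles of the eigenvalues of an orthogonal matrix, in (-pi, pi]."""
    return np.angle(np.linalg.eigvals(mat))

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
d = 128
reference = np.concatenate([eigenphases(haar_orthogonal(d, rng)) for _ in range(20)])
candidate = eigenphases(haar_orthogonal(d, rng))   # stand-in for a fitted R
structured = eigenphases(np.eye(d))                # identity: every phase is 0

ks_haar = ks_statistic(candidate, reference)
ks_identity = ks_statistic(structured, reference)
print(ks_haar, ks_identity)
```

A structured matrix such as the identity is separated from the Haar reference by a large statistic, while a genuine Haar draw is not. One caveat when reading p-values from such a test: the eigenphases of a single matrix are not independent samples, so the statistic is best treated as descriptive.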

Finally, one structured alternative remains: perhaps R is essentially a permutation whose Frobenius deviation from the identity happens to land near sqrt(2d)? It is not. The closest permutation is about 29.6 away from R (in Frobenius distance) versus 32.0 for the identity - only about 7% closer than the identity, which is exactly the artifact you get from Hungarian-matching against a random matrix.

Conclusion: to within experimental precision, the cross-seed rotation matrix R is consistent with a draw from the uniform distribution on the orthogonal group. The two models learn the same function. The activation representations of the two models live in different bases. And the rotation between those bases looks Haar-drawn.

6. Transfer in three regimes, collapsing into one

Consider additive steerable interventions in terms of basis alignment: an intervention lives in some vector space; its effect depends on the overlap between that vector space and the preserved subspace (directions spanned by input and output weights).

In the toy, which has shared input and output bases across seeds, there are three regimes of diff-of-means steering.

Complete transfer. A direction pinned by a shared output weight (the depth counter) is preserved and can therefore be steered perfectly across seeds.

Partial transfer. An intervention direction with some input/output-weight constraint and some freedom in the residual space (sticky-invalid suppression) is less constrained. As a result, roughly 4× the dose is required to reach the equivalent cross-seed effect size.

Inverted transfer. A completely interior direction (closer-signal suppression) is free to rotate. As a consequence its effect is inverted - the direction is steered the opposite way across seeds; at some dose settings the cross-seed effect departs from the within-seed effect entirely (the within-seed effect is zero, while the cross-seed effect is a significant accuracy penalty).

On Pythia, with no shared input/output weights, all steering directions are inverted, with transfer ratios varying from 1.8 to 10.0 – exactly what is predicted from a Haar-drawn interior basis rotation.

The fix: fit the Procrustes rotation (§4), rotate the steering vectors by it, and re-apply the intervention.
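As a sketch of what rotating a steering vector means in practice: if R is fitted so that seed N's activations times R approximate seed 0's (the convention assumed here, following §4), then a row-vector steering direction found in seed 0 maps into seed N's basis as `v0 @ r.T`. The data below is synthetic and every name is illustrative.

```python
import numpy as np

def diff_of_means(x, labels):
    """Diff-of-means steering vector between the two label groups."""
    return x[labels].mean(axis=0) - x[~labels].mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n, d = 4000, 32
labels = rng.integers(0, 2, size=n).astype(bool)
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

# Seed 0 encodes the concept along one direction; the sibling seed holds the
# same content in a basis rotated by a hidden orthogonal matrix q.
x_seed0 = rng.normal(size=(n, d)) + 3.0 * labels[:, None] * concept
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
x_seedN = x_seed0 @ q

# Procrustes fit: x_seedN @ r approximates x_seed0.
u, _, vt = np.linalg.svd(x_seedN.T @ x_seed0)
r = u @ vt

v0 = diff_of_means(x_seed0, labels)          # found in seed 0
vN_native = diff_of_means(x_seedN, labels)   # ground truth in seed N
vN_rotated = v0 @ r.T                        # seed 0's vector moved into seed N's basis

cos_naive = cosine(v0, vN_native)            # reusing v0 unchanged
cos_rotated = cosine(vN_rotated, vN_native)  # after the rotation correction
print(cos_naive, cos_rotated)
```

In this toy setting the unrotated vector is nearly orthogonal to the direction the sibling actually uses, while the rotated vector matches it. Real activations would add noise and model-specific differences on top of the rotation.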

7. Drift over a training run is also rotation, but smaller

The basis changes not only between seeds, but also over the course of a single training run. On Pythia-70m seed 1, I compared an early checkpoint (step 3000) with the converged checkpoint (step 143000) at layer 3. The converged checkpoint's SAE reconstructs its own activations at EV 0.907. Applying the SAE of one checkpoint to the activations of the other gives EV -0.86 – catastrophic failure again. After fitting a rotation matrix R (see §4), the reconstruction EV recovers to 0.73.

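The Frobenius distance between R and the identity for this training-run drift is 16.81, roughly half the cross-seed distance of 32.00. Since the squared distance is 2d - 2 tr(R), this corresponds to a mean eigenphase cosine of 1 - 16.81²/(2 · 512) ≈ 0.72, against roughly zero for a Haar draw: a single training run rotates the basis substantially, but only part of the way to a fresh random basis.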

Similarly on the toy: drift rotation becomes smaller toward the output, because the output is constrained (shared unembed). See Methods for numbers.

I am marking this as the most dubious part of the experiment – one layer, one seed, one checkpoint pair. The qualitative observation is suggestive, but it needs a systematic sweep before it can be relied on.

8. Off-topic, but useful: stop using attribution patching on converged models

Side note: in testing component importance with attribution patching, I found a surprising problem on converged models. Attribution patching anti-correlated with the measured patch effect on the majority of toy seeds (Pearson r as low as -0.63; 3 of 5 seeds negative), and was effectively uncorrelated on Pythia-70m (r = 0.05).

The explanation is straightforward: attribution patching extrapolates the local activation gradient – a noisy quantity near a minimum – to distant states. This introduces an error term quadratic in the distance (the Taylor remainder). Near a minimum the first-order term is small, so at the distances an ablation actually moves the activation the quadratic term dominates and the linear estimate is poor.

Integrated gradients provide a better approximation at that distance – Pearson r above 0.9995 on the toy, r = 0.98 on Pythia – by the completeness axiom. If you are using attribution patching to check a converged model's importance estimates, use integrated gradients instead.
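The Taylor-remainder argument can be illustrated on a toy quadratic loss. This is a sketch of the mechanism only, not a reproduction of the experiment: the loss, the "clean" point near the minimum, and the "patched" point far from it are all synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
basis = rng.normal(size=(d, d))
hessian = basis @ basis.T / d + 0.1 * np.eye(d)   # positive-definite curvature
optimum = rng.normal(size=d)

def loss(x):
    delta = x - optimum
    return 0.5 * delta @ hessian @ delta

def grad(x):
    return hessian @ (x - optimum)

clean = optimum + 0.01 * rng.normal(size=d)    # converged: sits near the minimum
patched = optimum + 2.0 * rng.normal(size=d)   # e.g. an ablated activation, far away
step = patched - clean

measured = loss(patched) - loss(clean)

# Attribution patching: first-order Taylor term at the clean point only.
ap_estimate = grad(clean) @ step

# Integrated gradients: average the gradient along the straight path
# (midpoint rule on [0, 1]), then dot with the step.
alphas = (np.arange(32) + 0.5) / 32
ig_estimate = np.mean([grad(clean + t * step) for t in alphas], axis=0) @ step

print(measured, ap_estimate, ig_estimate)
```

Here the gradient at the clean point is almost zero, so the first-order estimate misses nearly all of the measured change, while the path integral recovers it. For a quadratic loss the midpoint rule is exact; for a real network the number of integration steps becomes a tuning choice.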

Per-block loss increase under mean-ablation, in nats (measured / AP-predicted / IG-predicted):
layer 0: 6.04 / 0.15 / 5.87
layer 1: 6.02 / 0.03 / 5.97
layer 2: 5.10 / 0.01 / 5.01
layer 3: 5.37 / 0.12 / 5.27
layer 4: 5.56 / 0.06 / 5.31
layer 5: 5.52 / 0.37 / 5.36

9. Upshot

Universality of SAEs. A high decoder-column cosine similarity is necessary but not sufficient for feature universality – encoder rotation breaks transfer. Decoder cosine alone says nothing about transfer; the correct cross-model triplet is decoder cosine, naive reconstruction EV, and post-rotation reconstruction EV. And there is a simple fix to the cross-seed problem: the dictionary between the two bases exists, its application is a single matrix multiplication, and it is orthogonal.

Representation-equivalence hypothesis. Feature universality occurs as equivalence classes under rotation. Two seeds learn "the same feature" if there exists a rotation that matches their residual spaces, and "different features" if no rotation accomplishes this.

Representation engineering. Activation-steering techniques transfer across independently trained models once you apply the rotation correction, whereas weight-based editing methods (e.g. ROME/MEMIT-style) do not. Without the correction, steering lands in the inverted regime, because every interior activation direction is moved by the rotation.

10. What is proved by this experiment, and what is not

There are four fixed acceptance bars, set before I looked at any results: behavioural, parametric, predictive, and causal. I want to be clear about what satisfied the bars and what failed – because even failures have lessons, and quantifying their extent is valuable.

The first bar is the substantive behavioural one, and my constructive spec narrowly misses it: a KL of 1.74 × 10^-4 against the acceptance threshold of 10^-4, a factor of 1.74 over, with the excess arising exclusively in the depth counter's soft ReLU boundary condition.

The second, parametric, bar failed completely, and I found no route to improving it: joint optimization of the alignment against the weight-MSE difference between the models does not bridge the gap to the threshold of 10^-3. The natural objection is "of course this is hard – more search will fix it", so I tested it. The weight-MSE and activation-MSE objectives have their optima in disconnected basins, so jointly optimizing the alignment worsens both at once, and no setting of the trade-off clears the bar. The best alignment I have managed is still about 280× the threshold (roughly 2.4 orders of magnitude above 10^-3).

Third and fourth: the cross-seed causal and predictive bars failed spectacularly, with r between 0.52 and 0.69. Within-seed, the completeness axiom of integrated gradients makes the prediction equal the measurement, so the within-seed causal and predictive bars pass tautologically (r of about 0.9996) and are not genuine tests of the spec. The cross-seed bars are the genuine test, and their failure is what motivates the rotation hypothesis.

Scale. This experiment was conducted at two scales: 104k parameters (toy) and 71M parameters (Pythia-70m). Frontier scale (order of ten billion parameters) is not covered. The experiment is cheap to run on any two open-weight models, and I would be happy to see it conducted – or do it myself if the weights are available.

The one-off cross-checkpoint rotation (§7) should not be taken seriously without further study.

The toy uses a coordinated cohort (the seeds share frozen input/output weights), which simplifies the parametric analysis. On a fully-independent cohort the activation-level rotation analysis holds equally; the coordination only makes the weight-level analysis tractable.

11. Predictions / what would convince me this is wrong

At frontier scale (10B+ parameters), cross-run SAE transfer improves under an orthogonal rotation, with rotation magnitude near sqrt(2d). If a frontier-scale open-weight model pair is found where post-rotation EV fails to beat naive EV, the rotation account is falsified.
The cross-checkpoint Pythia-70m result extends to a monotonic curve, with the rotation magnitude shrinking toward the unembed.
Rotating the steering vectors by the Procrustes rotation produces improvements in the inverted regime.
No parameter-level joint alignment reduces a cross-seed MSE to the threshold level (see §10).

Methods

Four thresholds. Each is a single number against one pre-registered threshold. Behavioural (Bar B): mean KL between the constructive spec and the model; threshold below 10^-4. Parametric (Bar P): maximum per-entry difference between the folded spec weights and the target seed's weights, after symmetry alignment; threshold below 10^-3. Causal (C) and predictive (Pr): Pearson r of predicted-vs-measured effects; threshold above 0.99. Note that the completeness axiom of integrated gradients forces the predictor to equal the measurement within-seed, so the within-seed C and Pr bars pass tautologically; the substantive test is the cross-seed comparison. Symmetry group: permutation of heads, orthogonal rotations of the per-head subspace (Q–K and V–O independently), MLP neuron permutation, neuron scaling, and a rotation of the residual basis (conditional on folded RMSNorm).

Constructive specification. Original plan: construct the network's behaviour by compiling a RASP program. I decided against this because, among other reasons, such a spec reflects whatever the author thought the task was rather than what the network actually implements. So the model becomes the constructive spec: a trained primary seed with RMSNorm folded into adjacent weights and roles annotated by regression against task-specific features. Five lenses describe the resulting spec: (1) weight decomposition, (2) SAE/transcoder decomposition, (3) causal patching, (4) polyhedral decomposition of the ReLU regions, and (5) the constructive spec itself.

Reproducing this

Paper: https://arxiv.org/abs/2605.24577
Code: https://github.com/JordanMcCann/polymorphism-is-rotation - full pipeline, pre-registered thresholds, figure regeneration. Compute: a single RTX 2060 (12GB), about 9 hours end to end. I'd especially welcome a frontier-scale replication; the design is one Procrustes fit between any open-weight model pair.

I did this as an independent researcher, on consumer hardware, with the acceptance criteria fixed up front. If you work on SAE transfer, cross-model interpretability, or representation engineering and any of this is useful to you, I'd like to hear about it - and I'm open to research roles where this kind of work is the day job.