When a network learns to multiply the elements of a finite group, it is not obvious what algorithm it adopts, nor where that algorithm ends up residing once training is over. Two competing theories of the mechanism exist, but the published evidence for both is too coarse to tell them apart. I design an experiment that can: I train the same model on two different groups built to look identical to every measurement that evidence uses, and watch for any difference. The clearest difference turns out not to be a feature of the trained network's weights at all — it is how hard each group is to learn in the first place. That carries a lesson beyond this toy setting: a property can be real and decisive and still leave no trace in the converged model, surfacing only in the training dynamics.
Two published accounts of how networks learn group composition make the same character-level predictions, so I separate them on a pair of groups chosen so that characters cannot tell them apart.
Summary
A one-layer transformer trained to multiply elements of a finite group will, after grokking (reaching perfect training accuracy early, then generalising suddenly and much later), have learned some algorithm for composition. Two papers disagree about which one. One finds that the network composes elements through their irreducible representations; the other finds that it counts coset membership (a coset of a subgroup is one of the equal-sized, non-overlapping translates that tile the group) over the group's subgroups. The evidence on both sides is character-level, and a character is too coarse to distinguish the two accounts.
To remove characters as a confound I trained the same model on two groups with identical character tables but different subgroup structure: the dihedral and dicyclic groups of order 104. Any instrument that reads only characters sees them as one group, so a calibrated instrument that separates them must be reading finer structure, which is where the two hypotheses differ. Measured against a control that the prior coset evidence omits, the coset account gains no support beyond what the irreps already provide. The property that does separate the two groups is not a feature of the converged weights at all. It is how hard each group is to learn.
Two of my three measurements come back null: the coset account adds nothing the irreps do not, and no weight-level signature of the deeper structural difference appears. These are nulls, not disproofs — evidence of absence at the level the instruments measure, on this one pair, not proof that nothing is there.
Key claims
In decreasing order of confidence:
The two groups separate on learnability, not on a converged-weight signature. The dicyclic group is reliably slower and harder to grok than the character-identical dihedral group, and fails in a different way: it often stays stuck in memorisation. (Confident: 35/38 vs 29/38 over 38 seeds, with 6 dicyclic runs never leaving memorisation.)
The coset account gains nothing beyond the irreps on this pair. Against an irrep-restricted reference, coset decodability has a mean excess of about −0.05, zero or negative, on every proper normal subgroup. (Confident, conditional on that control.)
No matrix-level signature of the real-versus-quaternionic difference. The instrument built to detect it returns a null. (A real null: Welch p = 0.25 at 27 matched seeds; the larger sample makes the non-separation clearer, not weaker.)
A dimension-5 group does not grok, at the base setting or under a hyperparameter probe. A million epochs at the base settings, and 80k each at higher weight decay, more data, and both, all stay in memorisation (best test accuracy 0.15). (Not "cannot grok": the best run was still climbing, but dramatically harder than the dimension-2 pair, which grokked in 20k–40k.)
Epistemic status: I am confident in the empirical results on the order-104 pair, and all three replicate on a fully-connected baseline, so they are not artefacts of the transformer. Their generalisation to other groups remains open; the dimension-5 case is so far a no-grok.
The two accounts
Grokking gave mechanistic interpretability a clean case study. A one-layer transformer trained on modular addition reaches perfect training accuracy early, generalises much later, and the generalising solution has a known and sparse structure (Nanda et al., 2023). Modular addition is the cyclic group, and the natural question is what a network learns for non-abelian groups, where composition is genuinely non-commutative.
Two answers exist, and they are incompatible.
The irrep account (Chughtai, Chan & Nanda, 2023): the network represents each input as a matrix under the group's irreducible representations, forms the product of the two matrices, and reads off the answer through characters. The defining property of a representation, that it turns group multiplication into matrix multiplication, is the thing the network exploits.
The coset account (Stander et al., 2024): the network determines which coset of each subgroup the product lies in, then intersects those constraints to identify the product. This is a lookup over subgroup structure, not a matrix product. Stander et al. argue that the character correlation the irrep account relies on is a consequence of coset-counting, derivable from permutation-representation identities, and therefore consistent with both algorithms. They report no evidence that the linear layers multiply representation matrices, except for the one-dimensional sign representation.
The two accounts disagree, but the published evidence cannot decide between them, because all of it is character-level. Both measure correlations between model internals and irreducible characters, the traces of the representation matrices. A character is a class function: it is constant on each conjugacy class and keeps only the trace of each matrix, discarding the individual matrix entries. So an instrument that reads only characters is blind to the structure the two hypotheses dispute.
Why a same-character-table pair
If character-level evidence cannot separate the accounts, then remove characters as a variable. Take two groups with the same character table and ask whether the model treats them differently.
This works because the character table is blind to more than people sometimes assume. Over the complex numbers, the character of a representation determines that representation up to isomorphism, for a fixed group. What the character table does not determine is the group, nor the Frobenius–Schur type of its irreducible representations. Non-isomorphic groups can share a character table. The dihedral and dicyclic groups of order 104 are such a pair, as are the dihedral and quaternion groups at order 8. Their two-dimensional irreps take identical character values but have opposite Frobenius–Schur indicators (a label that sorts irreps with real-valued characters into two types): real for the dihedral group, quaternionic for the dicyclic. The distinction is concrete: a real irrep can be written with real-number matrices, while a quaternionic one genuinely cannot, even though its character takes the same real values. The indicator depends on the power map on conjugacy classes, which the two groups realise differently and which the character table does not record.
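The indicator can be computed directly as the group average of the character evaluated at g². The sketch below does that for one two-dimensional character on the order-104 pair. It is an illustration, not the code behind the experiments: the element encoding (j, f), the function names, and the choice k = 1 are assumptions made here.

```python
import math

N = 52  # order of the cyclic "rotation" subgroup; the group order is 2 * N = 104

def mul(g, h, dicyclic):
    """Multiply elements written as a^j s^f, stored as (j, f) with f in {0, 1}.

    Dihedral: s^2 = e. Dicyclic: s^2 = a^(N/2). Both satisfy s a = a^-1 s.
    """
    (j1, f1), (j2, f2) = g, h
    j = j1 + (j2 if f1 == 0 else -j2)
    if dicyclic and f1 == 1 and f2 == 1:
        j += N // 2
    return (j % N, (f1 + f2) % 2)

def chi(g, k):
    """A two-dimensional character; the same function of (j, f) on both groups."""
    j, f = g
    return 2.0 * math.cos(2.0 * math.pi * k * j / N) if f == 0 else 0.0

def fs_indicator(k, dicyclic):
    """Frobenius-Schur indicator: the average of chi(g * g) over the group."""
    elems = [(j, f) for j in range(N) for f in (0, 1)]
    return sum(chi(mul(g, g, dicyclic), k) for g in elems) / len(elems)

print(round(fs_indicator(1, dicyclic=False), 6))  # dihedral
print(round(fs_indicator(1, dicyclic=True), 6))   # dicyclic
```

The character values are the same function on both groups, yet the indicator comes out as +1 (real) on the dihedral group and −1 (quaternionic) on the dicyclic group, because squaring a non-rotation lands on the identity in one and on the central involution in the other. In this encoding the sign flip appears for odd k; for even k the same computation returns +1 on both groups.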
So the pair separates the question the right way. The shared character table guarantees that any character-level instrument sees the two groups as identical by construction. A calibrated instrument that nonetheless distinguishes them is reading sub-character or subgroup-level structure, which is exactly what irreps and cosets predict differently. The dihedral group has a rich lattice of reflection subgroups; the dicyclic group has a unique involution and a quaternion-like lattice. If the model composes via cosets, the two should differ in coset decodability beyond what their shared irreps explain.
Neither prior paper ran this experiment, and abelian groups cannot. For a cyclic group every irrep is one-dimensional, the matrix product collapses to addition, and the thin subgroup lattice gives the coset account little to predict. A prime cyclic group has no proper subgroups at all, so the coset hypothesis is vacuous there. The debate only has content on non-abelian groups with real subgroup lattices and irreps of dimension greater than one. A same-character-table pair is the smallest setting that supplies all three.
Model and task
The task: given a pair of group elements, predict their product. The model sees all pairs, trains on a fixed fraction, and is tested on the rest. The split is by a seed that also seeds initialisation, so a "seed" below means one full (initialisation, data-split) draw.
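A minimal sketch of that data pipeline, assuming group elements are indexed 0..n−1 and the product is supplied as a function. The name make_split and the default train fraction of 0.5 are illustrative assumptions, not values taken from the training code.

```python
import numpy as np

def make_split(n, mul, seed, train_frac=0.5):
    """Build all n*n ordered pairs (a, b, a*b) and split them with one seed."""
    pairs = np.array([(a, b, mul(a, b)) for a in range(n) for b in range(n)])
    rng = np.random.default_rng(seed)      # the same seed would also seed initialisation
    perm = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    return pairs[perm[:n_train]], pairs[perm[n_train:]]

# Example: the cyclic group of order 113, written additively.
train, test = make_split(113, lambda a, b: (a + b) % 113, seed=0)
print(train.shape, test.shape)
```

Because the permutation is drawn from a seeded generator, rerunning with the same seed reproduces the same train and test sets, which is what makes a "seed" a complete description of one draw.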
The model is a one-layer transformer: an embedding, a single attention block with four heads, a two-layer MLP with ReLU, and an unembedding, with and . It has no LayerNorm and no biases, and the embedding and unembedding are untied. These choices follow the grokking literature, and they matter for analysis: without LayerNorm or biases the map from input to logits is a clean composition of linear maps and a single nonlinearity, so the logits can be regressed onto representation-theoretic features without normalisation terms confounding the fit.
Training is full-batch AdamW, learning rate , , with weight decay as the main driver of grokking. Runs are deterministic on CPU. Every run writes a manifest with the training commit, the configuration hash, and the environment, and every checkpoint embeds its full configuration, so any snapshot reconstructs its model exactly. Snapshots are dense around training events so the formation of the circuit can be traced, not just its endpoint.
Calibrating the instruments on a known answer
Before measuring a contested case I calibrated the instruments on the cyclic group of order 113, where the answer is known from prior work and where, the group being prime, the coset account is silent. This run is not evidence for either hypothesis. It checks that the tools report the truth on a case whose truth is established.
The model groks at roughly 15,000 epochs and reaches 99.77% test accuracy. Two instruments, both built from the group's character table rather than from any task-specific features:
Where the weights live. Each column of the embedding is a function on the group. That function space decomposes orthogonally into isotypic blocks (each block is the slice of that space belonging to a single irreducible representation), one per irrep. For the cyclic group the non-trivial irreps are one-dimensional and pair into two-dimensional real isotypic blocks. After grokking, three of these blocks hold 94.0% of the embedding's energy, at 22.6×, 16.2× and 14.2× the random-matrix baseline. Every other block sits below baseline. The trivial block holds 0.34%, below baseline, as expected since constant functions carry no information about the product.
C113 calibration. Top: fraction of embedding energy per isotypic block against the random-matrix baseline; three blocks dominate. Bottom: the test-loss cost of ablating each block. Energy and causal importance pick out the same three blocks.
The concentration is causal, not incidental. Removing any one of the three blocks from the embedding costs between 9.0 and 17.4 nats of test loss. Removing any of the other 53 costs at most 0.047 nats. A model restricted to only those three blocks keeps 97.4% test accuracy. The circuit lives in those blocks rather than merely overlapping them.
At a memorised checkpoint, before grokking, the same spectrum is near-uniform across all 56 blocks, which is the expected signature of a lookup table with no preferred structure. One of the eventual winning blocks is already at 2.4× baseline during pure memorisation, around 15,000 epochs before test accuracy moves. The generalising structure begins forming inside the memorising solution well before it changes any behaviour, so the suddenness of grokking is a property of the accuracy curve rather than of the underlying representation.
This reproduces the known result for modular addition on an independent implementation, using isotypic projectors derived from the character table. The point of using the general construction is that it runs unchanged on the non-abelian groups where the real experiment lives, and where no shortcut exists.
What function the network computes. Energy concentration shows where the weights sit. It does not show that the network runs the irrep algorithm. The second instrument regresses the end-to-end logits onto the feature ρ(a)ρ(b)ρ(c)⁻¹ (a and b the two inputs, c the candidate output), which is the identity matrix exactly when c = ab and a non-trivial rotation otherwise. A logit built from its matrix elements peaks when c = ab, that is, when the candidate is the correct product. Fitting the logits onto these features asks directly whether the output has the form "compose a and b in representation space, then fire when the result matches c." I report the fit as an R²: the share of the logits' variance the feature reconstructs, 1 for exact and 0 for nothing.
Calibration mattered here in two ways.
First, the matrix-level fit and the trace-only fit are identical on C113, so their difference is exactly zero. Because the irreps are one-dimensional, the matrix carries no more than its trace, and the instrument that is meant to detect sub-character structure correctly reports none when none can exist. That zero is the result I most needed before trusting the instrument on two-dimensional irreps, where a spurious non-zero gap would be easy to manufacture.
Second, the clean homomorphism feature explains only about 55% of the logits, not the 90%-plus I had predicted in advance, and the shortfall is informative. The network does compute the group operation: the logits are 98.3% a function of a + b, the sum of the two inputs, so the model reads a and b almost entirely through their sum. But the comparison it learned is only about 56% translation-invariant, where the ideal comparison depends only on the difference a + b − c between that sum and the candidate output c. The missing structure is not a different mechanism and not lookup-like noise. A small number of components carry almost all of the residual, and every significant one is built from the same three irreps the model already uses: the homomorphism term, an uncancelled image term left by an imbalance in how the readout combines the two directions of each real isotypic block, and higher-order terms produced by the MLP nonlinearity. Where a lookup table would spread the residual across thousands of components, this one concentrates it.
The lesson sets the standard for the contested case. "Uses representation theory" and "is exactly the textbook formula" are different claims, and a single R² can flatter or undersell a mechanism depending on which idealisation it is compared against. Describing what this model does accurately took three measurements, not one. That is why character-only evidence on the contested pair is not enough.
The pair: dihedral versus dicyclic of order 104
The dihedral group Dih(104) and the dicyclic group Dic(104) both have order 104 and identical character tables. They differ where the character table is blind: the dihedral group's two-dimensional irreps are real, the dicyclic group's are quaternionic, and their subgroup lattices differ as described above.
The two decisive measurements follow from the two hypotheses:
the matrix-versus-trace gap, which is the variance the readout explains through matrix structure beyond what the trace explains, and so the place a real-versus-quaternionic difference would appear; and
coset decodability in excess of an irrep control, which is whether there is a coset signal the irreps do not already provide.
Everything else is held fixed at the calibration settings. Groups are compared only at matched weight decay, which is the only way to attribute a difference to the group rather than to the hyperparameters. Learnability is reported over all 38 seeds; the matrix-level and coset contrasts use the 27 weight-decay-1.0 seeds where both groups grokked.
Result 1: the pair separates on learnability
The clearest difference between the two groups needs no analysis of the converged weights.
The dihedral group groks at 35 of 38 seeds at weight decay 1.0, and quickly: a mean of about 20,000 epochs (the three misses are near-threshold, final test accuracy 0.94–0.99, with none stuck in memorisation). It also groks at weight decay 0.5. The dicyclic group groks at 29 of 38 seeds, only at weight decay 1.0, and much later, a mean of about 40,000 epochs. Of the nine misses, six stay stuck in pure memorisation, final test accuracy below 0.5 (the lowest at 0.02). Every dicyclic run at weight decay 0.5 failed to grok.
The quaternionic group is consistently and substantially harder to learn, and the kind of failure differs: the dihedral group always reaches near-perfect generalisation, while the dicyclic group frequently never leaves memorisation. This is a sub-character property, since the two groups are character-identical, and it surfaces as a difference in optimisation difficulty rather than in the structure of the converged solution: the complexity the character table hides shows up in training, not in the converged weights.
So the sharpest difference between the two groups is one neither account was built to predict. Both the irrep and the coset account are theories of the converged circuit; this difference is not in the circuit at all.
Grokking epoch by group, across 38 seeds (weight decay 1.0). The dihedral group groks early and at nearly every seed; the dicyclic group groks much later, and 9 of 38 do not grok at all (6 stuck in memorisation).
Result 2: the coset side adds nothing the irreps do not
For each proper normal subgroup, a linear probe reads coset membership from the residual stream, scored against two controls. The first is a random-partition null, which sets a capacity floor. The second, which is the control that matters, is an irrep-feature reference restricted to the irreps the model actually concentrates in. This restriction is the part the prior coset evidence omits. Every group's irreps together reconstruct everything by Peter–Weyl, so an unrestricted irrep control is vacuous; the reference therefore uses only the blocks the model uses.
Across all seven proper normal subgroups and all 27 matched seeds (189 measurements per group), the excess of coset decodability over the irrep control is at or below zero on average: mean −0.055 for the dihedral group and −0.044 for the dicyclic. The pattern is consistent: on the subgroups where the naive probe reaches 100%, the irrep control also reaches 100%, so the excess is zero. The probe's apparent success is accounted for by the irreps the model already computes. There is no coset signal on top of them.
Without that control, 100% probe accuracy reads as strong evidence for the coset account. With it, the evidence is gone. On this pair the coset side neither separates the two groups nor shows a mechanism independent of the irreps.
Coset decodability minus the irrep-restricted control, per proper normal subgroup and seed (189 per group). The excess clusters at or below zero (group means −0.055 and −0.044): the naive probe's success is explained by the irreps the model already computes.
The analysis also includes a causal check: ablate the coset-direction subspace, measure the rise in cross-coset error, and compare against a capacity-matched random-partition subspace. That control matches the ablated subspace's capacity but not the irrep confound, because the coset subspace overlaps the irrep subspace the model needs. As a result the ablation effect is large and variable for both groups and does not separate them. The result I rely on is the observational excess over the irrep control, not the ablation effect. Reporting the weaker number as if it were strong would be the error this project is built to avoid.
Result 3: the matrix-level gap
The gap is the matrix-level fit minus the trace-only fit. It measures the variance the readout explains through full matrix structure, beyond what the character trace alone explains. That is the one place a real-versus-quaternionic difference between the two groups could live in the converged weights.
There is a concrete reason to expect a signature there. A real two-dimensional irrep can be written with real-number matrices, so a readout exploiting it could use fewer independent matrix degrees of freedom — at the limit, half the rank. A quaternionic irrep genuinely needs the full complex structure. So if the network multiplies representation matrices, the dihedral (real) and dicyclic (quaternionic) groups should leave different fingerprints in the readout, and the gap is the instrument built to read them.
At 27 matched seeds it does not separate them. The two distributions overlap heavily and the means are close (0.074 vs 0.055). A Welch t-test, run without scipy, gives p = 0.25.
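A minimal sketch of a Welch test that needs only the standard library. It is an illustration under stated assumptions, not the script behind the reported p-value: the function names are illustrative, the p-value is obtained by Simpson integration of the Student-t density, and the two sample lists are hypothetical placeholders, not the measured per-seed gaps.

```python
import math

def welch_t(x, y):
    """Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # unbiased sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sx, sy = vx / nx, vy / ny
    t = (mx - my) / math.sqrt(sx + sy)
    df = (sx + sy) ** 2 / (sx ** 2 / (nx - 1) + sy ** 2 / (ny - 1))
    return t, df

def t_pdf(u, df):
    """Density of Student's t; df may be fractional, as Welch's usually is."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus twice the density integral over [0, |t|]."""
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

# Hypothetical placeholder gaps, NOT the measured per-seed values.
dih_gaps = [0.09, 0.05, 0.08, 0.06, 0.10, 0.07]
dic_gaps = [0.04, 0.07, 0.05, 0.06, 0.03, 0.08]
t_stat, dof = welch_t(dih_gaps, dic_gaps)
print(t_stat, dof, two_sided_p(t_stat, dof))
```

Welch's form is the natural choice here because it does not assume the two groups' gap distributions share a variance.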
This is a real null, not an underpowered one: the larger sample makes the non-separation clearer than the suggestive six-seed result did, and the means converge as seeds are added. One thing at the matrix level is stable across seeds: both groups fill the full rank in every two-dimensional isotypic block, so the expectation that the real group would use half the rank of the quaternionic one does not hold for either. On this pair the readout simply does not encode the real-versus-quaternionic distinction at the level the gap measures.
So the matrix-level signature that could have separated the two groups in their converged weights is absent — which is exactly the pattern Result 1 already showed: what tells these two groups apart lives in how they are learned, not in the trained network.
Matrix-versus-trace R² gap by group, across the 27 matched seeds. The distributions overlap heavily and the means are close (0.074 vs 0.055); the difference is not significant (Welch p = 0.25).
The same three findings, on a different architecture
Everything above is a transformer, but the coset account was originally read off fully-connected networks. A transformer-only result therefore leaves the architecture itself as a confound: perhaps the asymmetry, or the coset null, is a fact about attention rather than about the groups. To check, I retrained the order-104 pair on a one-hidden-layer fully-connected network (a shared embedding of the two inputs, concatenated into a single ReLU layer, no biases), six seeds per group, at the same weight decay. All three findings survive the change.
Grok epoch by group on the fully-connected baseline (6 seeds each). The dihedral group groks roughly three times faster than the dicyclic group, the same ordering as the transformer.
The learnability asymmetry holds, and sharply: the dihedral group groks at all six seeds in about 5,600 epochs, the dicyclic group at all six in about 16,100, roughly three times slower, the same direction as the transformer. One difference is informative. On the fully-connected network every dicyclic seed eventually groks, so the catastrophic memorisation plateau that struck six of the thirty-eight transformer runs is a transformer-specific failure mode; on this architecture the asymmetry is purely a difference in speed. The matrix-level gap stays null (Dih mean 0.021, Dic mean 0.013, Welch ), with gaps even smaller than on the transformer. And the coset account again gains nothing over the irreps: mean excess of for the dihedral group and for the dicyclic, both centred on zero. The dihedral spread is noisier here than on the transformer, with one normal subgroup reaching , but the dicyclic side is negative and neither group shows a systematic coset signal beyond its own irreps. That the coset null holds on the very architecture the coset account was built from is the most demanding version of this control available.
The dimension-5 frontier
The pair above lives entirely in two-dimensional irreps, where the gap has the least room to show anything. The natural next discriminator is a same-character-table pair with five-dimensional irreps and much richer matrix structure: the Heisenberg group over the field with five elements, an extraspecial group of order 125 and exponent 5, against the other extraspecial group of order 125, which has exponent 25. That pair is only worth building if a dimension-5 group groks at all, so I checked the prerequisite first.
It does not, at the base setting. Trained at weight decay 1.0 on half the data for one million epochs, the Heisenberg model reaches perfect training accuracy but its test accuracy plateaus near 0.11, well above the 1/125 ≈ 0.008 chance rate, so it learns some structure, but nowhere near generalisation. It never leaves memorisation within the budget.
Nor does a hyperparameter probe rescue it. I ran the two levers the pair showed matter: higher weight decay (2.0) and more data (train fraction 0.7), separately and together, for 80,000 epochs each. None grokked. More regularisation alone barely moved the ceiling (peak test accuracy 0.10); more data alone, the same (0.12); the two together reached the highest, 0.15, and that run was still inching up at the end. So the direction is right (both levers help, slightly), but a dimension-5 group is dramatically harder than the dimension-2 pair, which grokked in 20,000–40,000 epochs. I report this as "did not grok within budget," not "cannot grok": the both-levers run was still climbing, so a far larger budget might eventually cross the line. But the gap is enormous, and it places dimension-5 firmly at the top of a difficulty ladder that climbs with representation-theoretic complexity: dimension-1 (cyclic) groks fast, dimension-2 splits into easy dihedral and hard dicyclic, an order-matched dimension-4 group never grokked in 80,000 epochs, and dimension-5 does not grok in a million. The partner group and the five-dimensional discriminator only become worth building once a dimension-5 group groks at all, which, so far, none does.
Limitations
No matrix-level signature found, not proven absent. The gap returns a null at 27 matched seeds (p = 0.25). That is evidence the readout does not encode the real-versus-quaternionic distinction at the level this instrument measures, but a different instrument might find structure the gap misses.
Only two-dimensional irreps tested so far. The pair lives in two-dimensional blocks, where the matrix-level instrument has the least room to speak. The dimension-5 case is the direct test, and so far it does not grok, so the richest setting for the gap is not yet reachable.
Two architectures, both small. The findings replicate on a transformer and a one-hidden-layer fully-connected network, which removes architecture as a confound for the prior split between the two accounts. But both are small models on a fully-characterised task, so generalisation to larger or qualitatively different architectures is untested.
One pair. The learnability asymmetry and the coset null are established on a single same-character-table pair; whether they generalise to other such pairs is untested.
What this does and doesn't settle
The experiment settles a narrow question on one pair. Three larger questions stay open, in descending order of scientific interest.
When the structure forms, not just whether it does. This is the deepest open question, and the one the existing infrastructure is already built to answer. The snapshots capture the memorisation-to-circuit transition densely, so the same instruments can run along the trajectory and ask when the irrep structure appears, how that timing relates to the grokking point, and whether the learnability gap between the two groups has a visible onset.
Whether the findings hold on other same-character-table pairs. The learnability asymmetry and both nulls rest on a single pair, so they could be a property of this pair rather than of the phenomenon. Testing a second pair — of different order, or with a different Frobenius–Schur split — would tell which, and is the cheapest way to find out.
Whether the richest setting is ever reachable. A five-dimensional discriminator would give the matrix-level instrument the most room to speak, but it is gated on a dimension-5 group grokking at all, which none has yet. This is a question of compute, not of method: a much longer run on the both-levers setting (still climbing when I stopped it) would settle whether it is reachable.
Why this matters beyond toy groups
The object here is a fully-characterised toy, but the method is not, and three of its moves transfer to the interpretability and evaluation of models far too large to characterise.
First, validate the instrument on a known answer before trusting it on a contested one. The calibration is what turned a single R² into a number I could read, and it is what let me trust a null rather than dismiss it as a broken probe. An interpretability result reported without a ground-truth calibration is a measurement without units.
Second, design the comparison so a confound cannot survive it. The same-character-table pair holds the character-level story fixed by construction, so any difference that appears has to come from below it. The analogue for a real model is to compare cases matched on everything the cheap explanation predicts, so that what is left over is the thing actually in dispute.
Third, look for the effect in the dynamics, not only in the weights. The headline result here is a training-time phenomenon that leaves no converged-weight signature; an audit that read only the final checkpoint would have called the two groups identical. As interpretability and evals are asked to certify properties of frontier models, that failure mode is the one to take seriously: a property can be decisive and still invisible to any measurement of the trained network alone.
None of this claims the findings scale — they are about two small models on a tiny task. It claims the method travels: calibrate first, discriminate by construction, and watch the trajectory, not just the endpoint.
Methods: how each measurement was computed
Extracting the irreducible representations. The irreps are built numerically from the regular representation, for any group, with no per-group hand-construction. A subtlety the obvious method gets wrong: for an irrep of dimension d, the span of a generic vector's orbit is the whole d²-dimensional isotypic component, not a single irreducible copy. The correct construction builds the isotypic basis, averages a Hermitian seed over G to land in the commutant (which is isomorphic to the d×d matrices by Schur's lemma), and takes the top eigen-cluster of size d as one irreducible copy. Every extracted representation is checked at runtime by exhaustive homomorphism (ρ(g)ρ(h) = ρ(gh) for every pair), unitarity, trace against the character table, and irreducibility (⟨χ, χ⟩ = 1). Because R² is basis-invariant, the arbitrary choice of numerical basis within an isotypic component does not affect any reported number.
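The runtime checks can be sketched on a representation whose answer is known. The block below does not implement the extraction from the regular representation; it only illustrates the checks, applied to a hand-written two-dimensional irrep of the dihedral group of order 104. The element encoding (j, f) and all function names are assumptions made for this illustration.

```python
import numpy as np

N = 52  # rotations; the dihedral group has 2 * N = 104 elements

def mul(g, h):
    """Dihedral product on elements a^j s^f stored as (j, f)."""
    (j1, f1), (j2, f2) = g, h
    return ((j1 + (j2 if f1 == 0 else -j2)) % N, (f1 + f2) % 2)

def rho(g, k=1):
    """Rotation by 2*pi*k*j/N, followed by a reflection when f = 1."""
    j, f = g
    t = 2.0 * np.pi * k * j / N
    rot = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return rot @ np.diag([1.0, -1.0]) if f else rot

def check_representation(elems, mul, rho, expected_char, tol=1e-9):
    mats = {g: rho(g) for g in elems}
    # Exhaustive homomorphism: rho(g) rho(h) = rho(gh) for every ordered pair.
    homomorphism = all(
        np.allclose(mats[g] @ mats[h], mats[mul(g, h)], atol=tol)
        for g in elems for h in elems
    )
    unitary = all(
        np.allclose(m @ m.conj().T, np.eye(len(m)), atol=tol) for m in mats.values()
    )
    chars = np.array([np.trace(mats[g]) for g in elems])
    matches_table = np.allclose(chars, [expected_char(g) for g in elems], atol=tol)
    char_norm = float(np.mean(np.abs(chars) ** 2))  # <chi, chi>; 1 iff irreducible
    return homomorphism, unitary, matches_table, char_norm

elems = [(j, f) for j in range(N) for f in (0, 1)]
expected = lambda g: 2.0 * np.cos(2.0 * np.pi * g[0] / N) if g[1] == 0 else 0.0
print(check_representation(elems, mul, rho, expected))
```

All four checks are properties of the representation rather than of the basis it is written in, so they stay valid under the arbitrary numerical basis the extraction returns.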
Isotypic energy concentration. Each column of the embedding is treated as a function on G and projected onto each isotypic block. The reported energy is the squared-norm fraction in each block, compared against the analytic random-matrix baseline (block dimension over |G|). The "kept" blocks are selected by an algorithmic rule (energy greater than twice baseline), not by inspection, so the same rule applies unchanged to every group.
Causal checks. Block ablation removes one block's component of the embedding (leaving the result-position row untouched) and re-evaluates; the reported cost is the increase in test loss in nats. The restriction control keeps only the kept blocks and measures the surviving accuracy. Both are non-destructive: they act on a copy and read the logits through the unembedding.
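The energy measurement, the kept-block rule, and the two weight-level operations can be sketched for the cyclic group, where the real isotypic projectors have a closed form. This is a simplified illustration on a synthetic embedding with a planted signal, not the analysis code: the matrix W, its shape, the noise scale, and the planted frequency are all assumptions, and the sketch stops at the weights instead of re-evaluating a model's test loss.

```python
import numpy as np

def real_block_projectors(n):
    """Orthogonal projectors onto the real isotypic blocks of functions on Z_n (n odd)."""
    diff = np.subtract.outer(np.arange(n), np.arange(n))
    blocks = [np.full((n, n), 1.0 / n)]                 # trivial block, dimension 1
    for k in range(1, (n - 1) // 2 + 1):                # frequency pairs, dimension 2
        blocks.append((2.0 / n) * np.cos(2.0 * np.pi * k * diff / n))
    return blocks

def energy_fractions(W, blocks):
    """Squared-norm fraction of W (rows indexed by group element) in each block."""
    total = np.sum(W ** 2)
    return np.array([np.sum((P @ W) ** 2) / total for P in blocks])

def ablate(W, P):
    """Return a copy of W with one block's component removed."""
    return W - P @ W

def restrict(W, blocks, kept):
    """Return a copy of W that keeps only the listed blocks."""
    return sum(blocks[k] for k in kept) @ W

n, d = 113, 16
rng = np.random.default_rng(0)
g = np.arange(n)
W = 0.05 * rng.standard_normal((n, d))                 # synthetic stand-in embedding
W[:, 0] += np.cos(2.0 * np.pi * 7 * g / n)             # planted signal in block k = 7
W[:, 1] += np.sin(2.0 * np.pi * 7 * g / n)

blocks = real_block_projectors(n)
frac = energy_fractions(W, blocks)
baseline = np.array([1.0] + [2.0] * (len(blocks) - 1)) / n   # block dimension over |G|
kept = np.flatnonzero(frac > 2.0 * baseline)                 # the "twice baseline" rule
print(kept, frac[kept])
```

On this synthetic input the rule selects only the planted block, and ablating that block drives its energy to zero while leaving every other block's component unchanged, which is the property the causal checks rely on.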
Functional-form fit. The logit tensor is mean-centred, then regressed by least squares onto two feature sets: the matrix elements of ρ(a)ρ(b)ρ(c)⁻¹ per kept irrep, with a and b the inputs and c the candidate output (the "full" fit), and the trace alone (the "trace" fit). I report the cumulative and per-irrep R² for each, and the gap (full minus trace), which is the variance attributable to sub-character matrix structure. The kept irreps are mapped from the kept energy blocks.
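A sketch of the full-versus-trace comparison on synthetic logits, using a small dihedral group of order 10 so that the whole logit tensor fits in a few lines. Everything specific here is an assumption for illustration: the group, the single irrep, global mean-centring, and above all the planted logits, which are built by hand to contain an off-diagonal matrix element that the trace cannot see.

```python
import numpy as np

N = 5  # rotations in a small dihedral group of order 2 * N = 10 (illustrative)

def rho(j, f, k=1):
    """A real orthogonal two-dimensional irrep of the dihedral group."""
    t = 2.0 * np.pi * k * j / N
    rot = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return rot @ np.diag([1.0, -1.0]) if f else rot

R = np.stack([rho(j, f) for j in range(N) for f in (0, 1)])   # shape (|G|, 2, 2)
G = len(R)

# Feature tensor rho(a) rho(b) rho(c)^-1; the inverse is the transpose here
# because this representation is orthogonal.
M = np.einsum("aij,bjk,clk->abcil", R, R, R)
full_feats = M.reshape(G, G, G, 4)                    # all four matrix elements
trace_feats = (M[..., 0, 0] + M[..., 1, 1])[..., None]  # the character alone

def r_squared(logits, feats):
    """Share of the mean-centred logits' variance explained by least squares."""
    y = logits.reshape(-1) - logits.mean()
    X = feats.reshape(y.size, -1)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / (y @ y)

# Synthetic logits: the trace plus a planted off-diagonal matrix element.
logits = trace_feats[..., 0] + 0.5 * M[..., 0, 1]
gap = r_squared(logits, full_feats) - r_squared(logits, trace_feats)
print(gap)
```

The full fit recovers the planted logits exactly while the trace fit misses the off-diagonal part, so the gap is positive; with logits built from the trace alone the two fits coincide and the gap is zero, which is the behaviour the calibration on one-dimensional irreps checks for.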
Coset probe and its two controls. For each subgroup, a linear probe (a single linear layer trained with LBFGS on a stratified 80/20 split) reads coset membership of the two inputs and of their product from the residual stream at the result position. It is scored against (i) a random-partition null, which fixes a capacity floor by relabelling cosets at random, and (ii) an irrep-feature reference built only from the model's kept irreps. The headline quantity is excess_over_irrep, the probe accuracy minus the irrep reference. The separate causal check ablates the class-mean coset subspace against a matched random-partition subspace and decomposes within- versus cross-coset error; as noted in Result 2, that control matches capacity but not the irrep confound, so I rely on the observational excess rather than the ablation effect.
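The logic of the irrep-restricted control can be sketched on a toy case. The block below is a deliberately simplified stand-in, not the probe used here: it swaps the LBFGS-trained layer and the 80/20 split for a least-squares probe evaluated on its own training points, uses the cyclic group of order 12 instead of the order-104 pair, and builds the "activations" as a random linear mix of hypothetical kept irrep features.

```python
import numpy as np

def linear_probe_accuracy(X, labels):
    """Least-squares one-vs-all linear probe with a bias column (train = eval)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    Y = np.eye(labels.max() + 1)[labels]
    coef, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return float(np.mean(np.argmax(Xb @ coef, axis=1) == labels))

n = 12
g = np.arange(n)
kept = [1, 4]  # hypothetical kept frequencies for this toy case
irrep_feats = np.column_stack(
    [fn(2.0 * np.pi * k * g / n) for k in kept for fn in (np.cos, np.sin)]
)

# Activations that contain nothing beyond the kept irreps: a random linear mix.
rng = np.random.default_rng(0)
acts = irrep_feats @ rng.standard_normal((irrep_feats.shape[1], 8))

cosets = g % 3  # cosets of the subgroup {0, 3, 6, 9}
probe_acc = linear_probe_accuracy(acts, cosets)
irrep_acc = linear_probe_accuracy(irrep_feats, cosets)            # irrep-restricted control
null_acc = linear_probe_accuracy(acts, rng.permutation(cosets))   # random-partition null
excess_over_irrep = probe_acc - irrep_acc
print(probe_acc, irrep_acc, null_acc, excess_over_irrep)
```

The naive probe decodes the coset perfectly, which on its own would look like evidence for a coset mechanism. The irrep-restricted reference decodes it perfectly too, because the coset indicator is already a linear function of the frequency-4 features, so the excess is zero even though the activations were built from irreps alone.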
Calibration against planted ground truth. Before any of these were trusted on a real model, each was run on synthetic activations with a known answer planted. A signal planted in a known low-baseline block is recovered (energy concentration goes to one); a planted functional form is recovered by the fit ($R^2$ goes to one); a planted coset beats both controls, while purely irrep-derived activations show zero excess over the irrep control; ablating a planted coset subspace destroys decodability while ablating a matched random subspace does not. Groups of prime order, which have no proper non-trivial subgroups, correctly return an empty coset analysis.
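The prime-order check can be illustrated for cyclic groups, where subgroups correspond to divisors of the order. The helper name is hypothetical, and the sketch covers cyclic groups only, not the dihedral or dicyclic groups.

```python
def proper_subgroup_orders(n):
    """Orders of the proper non-trivial subgroups of the cyclic group Z_n."""
    # Z_n has exactly one subgroup for each divisor of n.
    return [d for d in range(2, n) if n % d == 0]


print(proper_subgroup_orders(113))  # prime order: nothing to probe
print(proper_subgroup_orders(104))  # cyclic group of order 104, for contrast
```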
Provenance. Every run writes a manifest with the training commit, configuration hash, and environment; every checkpoint embeds its full configuration; and every analysis output records both the training and the analysis commit and the checkpoint it describes. Runs are deterministic on CPU, so the numbers above regenerate exactly.
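One way such a manifest could be assembled is sketched below. The field names, the example configuration keys, and the placeholder commit string are all assumptions for illustration; the source does not specify the manifest format.

```python
import hashlib
import json
import platform


def config_hash(config):
    """SHA-256 of the configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(config, training_commit):
    """Assemble a run manifest; the field names here are illustrative."""
    return {
        "training_commit": training_commit,
        "config_hash": config_hash(config),
        "environment": {"python": platform.python_version()},
        "config": config,
    }


manifest = build_manifest(
    {"group": "dihedral", "order": 104, "weight_decay": 1.0, "seed": 0},
    training_commit="<commit-sha>",  # placeholder, supplied by the caller
)
```

Serialising with sorted keys means two configurations that differ only in key order hash identically, so the hash identifies the settings and not their layout.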
Appendix: the representation theory used above
Scoped to what the argument uses.
Group and representation. A finite group $G$ is a finite set with an associative product, an identity, and inverses. A representation is a homomorphism $\rho : G \to GL(V)$ into the invertible linear maps on a vector space $V$, so that $\rho(gh) = \rho(g)\rho(h)$. It turns group multiplication into matrix multiplication, which is the only reason a linear network can compute composition at all.
Irreducible representation. $\rho$ is irreducible if $V$ has no proper non-zero subspace that is mapped into itself by every $\rho(g)$. Every complex representation of a finite group decomposes into irreducibles; they are the building blocks.
Character and Schur orthogonality. The character of $\rho$ is $\chi_\rho(g) = \operatorname{tr}\rho(g)$, constant on conjugacy classes. The irreducible characters are orthonormal under $\langle \chi, \psi \rangle = \frac{1}{|G|}\sum_{g \in G} \chi(g)\overline{\psi(g)}$. This is why characters are a natural coordinate system on class functions, and why a correlation with a character is suggestive but, as the coset account argues, ambiguous.
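Orthonormality can be checked directly on a small case. The sketch below uses the character table shared by the dihedral and quaternion groups of order 8, chosen for brevity in place of the order-104 pair; the variable and function names are illustrative.

```python
from fractions import Fraction

# Character table shared by the dihedral and quaternion groups of order 8.
# Columns are conjugacy classes; CLASS_SIZES gives the size of each class.
CLASS_SIZES = [1, 1, 2, 2, 2]
CHARACTERS = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, -1, -1],
    [1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1],
    [2, -2, 0, 0, 0],
]


def inner(chi, psi):
    """<chi, psi> = (1/|G|) * sum over classes of size * chi * conj(psi)."""
    order = sum(CLASS_SIZES)
    # All values here are real, so complex conjugation is a no-op.
    return Fraction(sum(s * x * y for s, x, y in zip(CLASS_SIZES, chi, psi)), order)


gram = [[inner(chi, psi) for psi in CHARACTERS] for chi in CHARACTERS]
```

The Gram matrix comes out as the 5×5 identity. Since the table is shared, this computation is the same for both groups, which is the sense in which a character-level instrument cannot tell them apart.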
Isotypic decomposition. The space of functions on $G$ decomposes orthogonally into isotypic blocks, one per irrep, the block for an irrep of dimension $d$ having dimension $d^2$. This is the object the energy instrument measures. For an abelian group every irrep is one-dimensional and the blocks are correspondingly small; for the non-abelian groups here the blocks are higher-dimensional, and that extra structure is what the matrix-level instrument is built to read.
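As a consistency check on the block dimensions, assume the standard classification of irreps for the order-104 pair: four irreps of dimension one and 25 of dimension two, for both the dihedral and the dicyclic group. These counts are not stated in the text above. Then

$$
\sum_{\rho} d_\rho^2 = 4 \cdot 1^2 + 25 \cdot 2^2 = 104 = |G|,
$$

so the 29 isotypic blocks exactly fill the 104-dimensional space of complex-valued functions on $G$.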
Frobenius–Schur indicator. For an irrep with real-valued character $\chi$, the indicator $\nu(\chi) = \frac{1}{|G|}\sum_{g \in G}\chi(g^2)$ is $+1$ if the irrep is realisable over the reals and $-1$ if it is quaternionic. Two groups can share a character table yet differ here, because $\nu$ depends on the power map $g \mapsto g^2$ and the character table does not record it. The real-versus-quaternionic difference between the dihedral and dicyclic irreps is exactly this, and it is the sub-character distinction the whole experiment tries to detect.
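The indicator can be computed directly on the order-8 pair, the dihedral and quaternion groups, used here as a small stand-in for the order-104 pair. This is a sketch: the generating matrices are standard realisations of the two-dimensional irreps, and the function names are hypothetical.

```python
def matmul(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def generate(generators):
    """Close a set of 2x2 matrices under multiplication."""
    identity = ((1, 0), (0, 1))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = matmul(g, s)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def trace(m):
    """Real part of the trace (the traces here are all real)."""
    return complex(m[0][0] + m[1][1]).real


def frobenius_schur(group):
    """(1/|G|) * sum over g of the character evaluated at g squared."""
    return sum(trace(matmul(g, g)) for g in group) / len(group)


# Two-dimensional irreps of the order-8 pair.
dihedral = generate([((0, -1), (1, 0)), ((1, 0), (0, -1))])      # rotation, reflection
quaternion = generate([((1j, 0), (0, -1j)), ((0, 1), (-1, 0))])  # i, j

print(sorted(trace(g) for g in dihedral) == sorted(trace(g) for g in quaternion))
print(frobenius_schur(dihedral), frobenius_schur(quaternion))
```

The character values of the two irreps agree as multisets, yet the indicator is +1 for the dihedral group and -1 for the quaternion group. The difference comes entirely from the squares: every reflection squares to the identity, while the quaternion units square to minus the identity.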