The work discussed in this blog post was inspired by Progress Measures for Grokking via Mechanistic Interpretability [1]. I would encourage you to read this paper in its entirety and check out the list of resources at the end of the post for some more in-depth background information.
TL;DR
If you are familiar with the concepts discussed in the above research paper, you may start reading from The point of this blog post: The second thing!

The research paper that inspired this post is super fun to replicate and is a great introduction to mechanistic interpretability research. As it turns out, Grokking is quite robust and is easily observed for a shallow transformer trained on a single task as well as on two tasks at once. Training a model on two tasks at once reveals some interesting new patterns in the model internals that may contribute to new insights into circuit formation in neural networks. This research can easily be extended by including different problems and performing a more rigorous mechanistic interpretability analysis.
Grokking you say, what is that?
Grokking is an interesting concept and was discovered completely serendipitously [2]. It’s actually an example of the so-called emergent behaviors of neural networks [3, 4, 5]. The emergence part refers to something sudden, unexpected, and just completely wild.

Now for the more technically inclined: Grokking is the phenomenon in which a neural network that initially fails to generalize suddenly acquires a generalizing solution after several thousand additional epochs of training on the same data. Like I said, wild.
Can we go even deeper?
The research paper provides an excellent intuitive explanation of Grokking. The authors focused on a one-layer transformer and studied a modular addition task: (a + b) mod P, where P is a prime modulus, and the input to the model was “a b =” (3 tokens).
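To make the setup concrete, here is a minimal sketch of how such a dataset can be generated (the token indices and the 30% train split are assumptions based on the paper; my own code followed Neel Nanda’s notebooks):

```python
import torch

P = 113          # prime modulus used in the paper
EQUALS = P       # "=" gets its own token index after the numbers 0 .. P-1

# All P * P pairs (a, b), each encoded as the 3-token sequence "a b ="
a = torch.arange(P).repeat_interleave(P)
b = torch.arange(P).repeat(P)
tokens = torch.stack([a, b, torch.full_like(a, EQUALS)], dim=1)  # [P*P, 3]
labels = (a + b) % P                                             # (a + b) mod P

# Random train/test split (the paper trains on roughly 30% of all pairs)
perm = torch.randperm(P * P)
n_train = int(0.3 * P * P)
train_tokens, train_labels = tokens[perm[:n_train]], labels[perm[:n_train]]
test_tokens, test_labels = tokens[perm[n_train:]], labels[perm[n_train:]]
```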
Figure 1: Average train and test loss of a one-layer transformer trained on a modular addition task (average of three different model seeds). The vertical lines delineate three phases: memorization (I), circuit formation (II), and cleanup (III).
During model training, the authors were able to identify three distinct phases: memorization, circuit formation, and cleanup.

The first phase appears rather quickly and is marked by excellent training loss and abysmal test loss. The network has essentially memorized the training data and is basically cheating.

The next phase is the most interesting one; this is where the network starts to form circuits! An early attempt at a generalization solution is formed and the network starts to lower its test loss.

The last phase is what Grokking is all about. It is known as cleanup because the memorization solution is removed in favor of the now fully established generalization solution. You did it, network!

The authors were also able to discover how the network learned the generalization solution. They made use of discrete Fourier transforms and several trigonometric identities (reading the paper would be helpful here). Using ablations in Fourier space, they then constructed several key progress measures that help identify the Grokking behavior.
Let’s stare at some figures!
I wanted to explore the Grokking behavior myself for this research project. The first thing that came to mind was: why don’t I just do exactly what they did? (There is a second thing that came to mind, but why don’t I save that for now to increase the suspense, that is if you didn’t read the TL;DR, and ignored the title I guess)
Let’s see if I can replicate and (kinda) understand
I started my research by training a one-layer transformer on a modular addition task and repeated the exercise with two more model seeds to make the results a bit more robust. The experiment turned out to be successful, as Grokking behavior was observed for all model seeds!
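For reference, here is a rough sketch of the kind of model and training setup involved, built on transformer_lens (which Neel Nanda’s code uses). The hyperparameters are my best reading of the paper’s mainline setup, not an exact copy of my notebook, and it reuses P, train_tokens, and train_labels from the dataset sketch above:

```python
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=1, n_heads=4, d_model=128, d_head=32, d_mlp=512,
    d_vocab=P + 1,            # numbers 0 .. P-1 plus the "=" token
    n_ctx=3,                  # "a b ="
    act_fn="relu",
    normalization_type=None,  # no LayerNorm, as in the paper's setup
    seed=0,
)
model = HookedTransformer(cfg)

# Full-batch AdamW with heavy weight decay; the weight decay is what slowly
# strips away the memorization solution during the cleanup phase.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              weight_decay=1.0, betas=(0.9, 0.98))

# Train for long enough to see Grokking (tens of thousands of full-batch epochs)
for epoch in range(25_000):
    logits = model(train_tokens)[:, -1, :]        # prediction at the "=" token
    loss = F.cross_entropy(logits, train_labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```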
Figure 2: Average train and test accuracy (left) and average train and test loss (right) of mainline experiment as described in the paper, over 3 random seeds. Grokking behavior is quite evident from the initial training overfitting and gradual test generalization after ~ 10,000 epochs.
Both figures clearly indicate an initial excellent performance on the training data, reaching perfect accuracy after only ~ 200 epochs. At this stage, the test accuracy is very poor (~ 2%). After continuing training for many more epochs, the test accuracy gradually increases and meets the training accuracy after roughly 12,000 epochs.
Next stop: model internals! The basic stuff to look at are the attention scores and neuron activations. Now focusing on the final model state of a single training run, these internals revealed a very periodic structure. Puzzling, right?
Figure 3: (Left) Attention score for head 0 from the token “a” to the “=” token as a function of inputs a, b. (Right) Activations of MLP neuron 1 as a function of inputs a, b. Both attention scores and neuron activations appear periodic.
To understand the periodicity observed in the model internals, the authors of the paper use the Discrete Fourier Transform (DFT). Without getting into too much detail (mainly because I lack understanding), the Fourier basis was built from cosine and sine waves with frequencies swept over half of the prime modulus P. This basis was applied to both the embedding matrix and the neuron activations of the model (see Appendix A), which revealed strong correlations with only three key frequencies (w_k = 2πk/P for integer k). This tells us that the generalization solution is quite sparse.
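To give a flavour of what this looks like in code, here is a simplified sketch of the Fourier basis construction and its application to the embedding matrix (the projection onto the neuron activations works the same way; `model` is the trained transformer from the sketch above):

```python
import math
import torch

def make_fourier_basis(P: int) -> torch.Tensor:
    """Orthonormal [P, P] basis for odd P: a constant vector followed by
    cosine and sine waves at frequencies k = 1 .. (P - 1) / 2."""
    x = torch.arange(P)
    rows = [torch.ones(P)]
    for k in range(1, P // 2 + 1):
        rows.append(torch.cos(2 * math.pi * k * x / P))
        rows.append(torch.sin(2 * math.pi * k * x / P))
    basis = torch.stack(rows)
    return basis / basis.norm(dim=1, keepdim=True)

fourier_basis = make_fourier_basis(P)

# Embedding matrix restricted to the number tokens (drop the "=" row)
W_E = model.W_E[:P].detach()                 # [P, d_model]
fourier_embed = fourier_basis @ W_E          # one row per Fourier component
component_norms = fourier_embed.norm(dim=1)  # spikes only at the key frequencies
```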
To complete the full analysis, progress measures were constructed. The first progress measure tracks the coefficients of the cos(w_k(a + b - c)) terms in the logits, which gradually increase during training. More importantly, the model already starts learning the beginnings of the generalization solution during the memorization phase!
Figure 4: (Left) Coefficients of cos(w_k(a+b-c)) in the logits during training. Coefficients gradually increase as training progresses. (Right) Train loss, test loss, excluded loss and restricted loss during training. Excluded loss increases during circuit formation, whilst train and test loss remain somewhat flat. Restricted loss begins declining before test loss and shows a clear inflection point before the occurrence of Grokking. The vertical lines delineate the three phases of training: memorization, circuit formation, and cleanup.
The final two progress measures capture the importance of the key frequencies for the generalization solution: the excluded loss and the restricted loss. The excluded loss is computed after removing only the key frequencies from the model internals during training; the restricted loss is computed after keeping only the key frequencies and removing everything else. As expected, when the key frequencies are removed, the model never generalizes and Grokking is absent. When only the key frequencies are kept, the loss starts falling earlier and ends up lower than the ordinary test loss.
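As a rough illustration (not the paper’s exact procedure, which ablates the MLP’s contribution to the logits), one way to build something like these two losses is to project the logits onto the cos(w_k(a + b)) and sin(w_k(a + b)) directions over the (a, b) grid and then either subtract or keep only that projection. Here `all_logits` is a hypothetical [P, P, P] tensor of logits at the “=” position for every input pair, and the key frequencies are the ones from my run (see Appendix A):

```python
import math
import torch

key_freqs = [18, 20, 32]           # key frequencies found in my run (Appendix A)
a_grid = torch.arange(P)[:, None]  # [P, 1]
b_grid = torch.arange(P)[None, :]  # [1, P]

def key_frequency_component(logits_grid):
    """Component of logits_grid ([P, P, n_outputs], indexed by inputs a, b)
    lying along cos(w_k(a + b)) and sin(w_k(a + b)) for the key frequencies."""
    flat = logits_grid.reshape(P * P, -1)
    proj = torch.zeros_like(flat)
    for k in key_freqs:
        w = 2 * math.pi * k / P
        for direction in (torch.cos(w * (a_grid + b_grid)),
                          torch.sin(w * (a_grid + b_grid))):
            d = direction.reshape(-1) / direction.norm()
            proj += d[:, None] * (d @ flat)[None, :]
    return proj.reshape(logits_grid.shape)

key_part = key_frequency_component(all_logits)
excluded_logits = all_logits - key_part   # remove only the key frequencies
restricted_logits = key_part              # keep only the key frequencies
# Excluded loss: cross-entropy of excluded_logits on the training pairs.
# Restricted loss: cross-entropy of restricted_logits on the full dataset.
# (The constant/mean component of the logits is ignored here for simplicity.)
```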
The point of this blog post: The second thing!
Now for the moment you’ve been waiting for: the second thing that came to mind! My goal here was to extend the research presented in the paper and potentially offer new insights.

To do that, I decided to study the Grokking behavior of the same network trained on two different algorithmic tasks at once. This basically meant I needed to modify the dataset and some model parameters.

The new dataset consisted of both modular addition and modular subtraction. The input tokens changed from “a b =” to either “a + b =” or “a - b =” (four tokens, now including an operator token).

Important sidenote! I decided to concatenate the modular addition and modular subtraction datasets (first batch is addition, second batch is subtraction). In order to work with roughly the same number of data points as before, I lowered the prime modulus from 113 to 79 (reducing the dataset size from 113 * 113 to 2 * 79 * 79). This also meant that the input pairs (a, b) were exactly the same for both algorithmic tasks. I also made sure that both tasks were roughly equally represented in the training and test datasets (49.6% addition and 50.4% subtraction).
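Here is a minimal sketch of how the combined dataset can be constructed (the specific token indices for “+”, “-”, and “=” are just illustrative choices):

```python
import torch

P = 79                                   # smaller prime: 2 * 79 * 79 ≈ 113 * 113
PLUS, MINUS, EQUALS = P, P + 1, P + 2    # operator and "=" tokens after 0 .. P-1

a = torch.arange(P).repeat_interleave(P)
b = torch.arange(P).repeat(P)
eq = torch.full_like(a, EQUALS)

# "a + b =" followed by "a - b =": identical (a, b) pairs, different operator
add_tokens = torch.stack([a, torch.full_like(a, PLUS), b, eq], dim=1)
sub_tokens = torch.stack([a, torch.full_like(a, MINUS), b, eq], dim=1)
tokens = torch.cat([add_tokens, sub_tokens])    # [2 * P * P, 4]
labels = torch.cat([(a + b) % P, (a - b) % P])  # addition half, then subtraction
```

A random split of this concatenated dataset then gives the roughly 50/50 representation of the two tasks mentioned above.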
Without changing any other hyperparameters, the new model also Grokked over three different training runs! My god, these transformers are truly freaks!
Figure 5: Average train and test accuracy (left) and average train and test loss (right) of the mainline model trained on two tasks at once, over 3 random seeds. Grokking behavior is observed for the dual task, although to a lesser extent compared to the single task.
We clearly observe a similar trend in which training accuracy quickly climbs to perfection (~ 300 epochs) and test accuracy gradually meets training accuracy after roughly 18,000 epochs (> 99%). Depending on the model seed, test loss either almost meets training loss or only somewhat drops during training.
Alright, so we have another Grokking aficionado. What about the internals, you may ask? This is where things become a bit complicated. Basically, we now have to start looking at things in batches, by splitting the cached attention scores and neuron activations in half (so we can observe patterns in a P × P space again).
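Concretely, with the dataset ordered as addition first and subtraction second, the cached internals can be split down the middle and reshaped back onto a P × P grid. The sketch below assumes the dual-task model and `tokens` from the sketch above, and uses transformer_lens hook names:

```python
# Run the model on the full dual-task dataset and cache all internals
_, cache = model.run_with_cache(tokens)

neuron_acts = cache["blocks.0.mlp.hook_post"][:, -1, :]   # [2*P*P, d_mlp] at "="
attn = cache["blocks.0.attn.hook_pattern"][:, :, -1, 0]   # [2*P*P, n_heads], "=" attending to "a"

# First half of the dataset is addition, second half is subtraction
add_acts = neuron_acts[: P * P].reshape(P, P, -1)   # indexed by (a, b)
sub_acts = neuron_acts[P * P :].reshape(P, P, -1)
add_attn = attn[: P * P].reshape(P, P, -1)
sub_attn = attn[P * P :].reshape(P, P, -1)
```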
Figure 6: (Left) Attention score for head 0 from the token “a” to the “=” token as a function of inputs a, b of the first batch of data. (Right) Activations of MLP neuron 6 as a function of inputs a, b of the first batch of data. Both attention scores and neuron activations appear periodic and show similar structures between batches of data.
Clearly, we have interesting patterns once more! The attention scores appear somewhat similar to before, but the neuron activations are now potentially hinting at some sort of superimposed pattern along the a-axis. Both batches of data produce similar patterns.
Figure 7: (Left) Key frequencies of the Fourier basis of the neuron-logit map for the single task model. (Right) Key frequencies of the Fourier basis of the neuron-logit map for the dual task model. Evolution of key frequencies appears to introduce an asymmetry for the dual task.
The new model also learnt several key frequencies and comparing them to the original model reveals some interesting differences. First of all, there seems to be a greater variation in signal strength amongst key frequencies for the dual task model. Additionally, a clear asymmetry is now introduced between the cosine and sine Fourier components.
To finish up the results section, we are going to review the progress measures again. Unfortunately, due to the nature of the new dataset and the hacky methods I was using, the excluded and restricted losses did not make much sense. On the bright side, the coefficients of the cosine terms in the logits were easier to construct. The analysis was done on the batched data: the addition half was analyzed with the original direction, cos(w_k(a + b - c)), and the subtraction half with a slight modification, cos(w_k(a - b - c)).
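The coefficient extraction itself is a simple projection. The sketch below assumes `add_logits` and `sub_logits` are hypothetical [P, P, P] tensors holding the logits over the number outputs for the addition and subtraction halves, indexed by (a, b, c):

```python
import math
import torch

def cos_coefficient(logits_grid, k, sign=+1):
    """Coefficient of the (normalized) cos(w_k(a + sign*b - c)) direction
    in logits_grid of shape [P, P, P] (inputs a, b; output logit c)."""
    w = 2 * math.pi * k / P
    a = torch.arange(P)[:, None, None]
    b = torch.arange(P)[None, :, None]
    c = torch.arange(P)[None, None, :]
    direction = torch.cos(w * (a + sign * b - c))
    return (direction / direction.norm() * logits_grid).sum()

coef_add = cos_coefficient(add_logits, k=28, sign=+1)  # addition: cos(w_k(a + b - c))
coef_sub = cos_coefficient(sub_logits, k=28, sign=-1)  # subtraction: cos(w_k(a - b - c))
```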
Figure 8: (Solid) Coefficients of cos(w_k(a+b-c)) in the logits during training. (Dotted) Coefficients of cos(w_k(a-b-c)) in the logits during training. A very strong coefficient for key frequency 28 is observed.
There is a very strong signal for key frequency 28, which matches the dominant frequency of the neuron-logit map in the previous figure. The difference in coefficient strength between tasks indicates that the model mainly used key frequency 28 for modular addition. Modular subtraction mainly used key frequency 8, and its coefficients were all rather low. This may suggest that the two tasks rely on different activation pathways.
Let’s discuss these findings!
The first thing I want to mention is that the replication of the Progress Measures for Grokking via Mechanistic Interpretability paper is super fun and quite accessible. I would highly recommend this paper to anyone who is interested in mechanistic interpretability.

Grokking is a super interesting phenomenon and lends itself well to being studied through mechanistic interpretability methods and fairly straightforward toy settings. Evidently, the behavior is quite robust and even shows up in more complicated scenarios such as multitask problem solving.

Reviewing the internals of the dual task model reveals intricate patterns that may suggest some superposition of the generalization solutions. Progress measures further provide insights into the learning process of the model during training, even for multitask scenarios.

This work shows a promising route towards understanding the complex inner workings of neural networks when faced with multiple tasks at once. Continuing this research could yield new findings on how circuit formation works inside a transformer. Armed with new knowledge, bigger questions could be answered, such as: why and when do models learn generalization solutions?
What does the future hold?
I want to use this space to brain dump some ideas for future research. The approach outlined in this post could benefit from further analysis to help answer some of the questions it naturally raises, such as: what is actually happening?

The first step would be to analyze modular subtraction on its own, to make sure all the assumptions in this work are warranted. In addition, more pattern analysis could be done on several checkpoints of the dual task model to understand how the generalization solution evolves during training.

For some more fun experiments, two totally different algorithmic tasks could be studied, such as modular addition paired with a polynomial task. Another idea would be to combine a learnable task with a completely random one. Yet another is to increase the number of tasks beyond two to see how well a model can still generalize.

Key to all these ideas is to use mechanistic interpretability to figure out the internal state of the model and discover new principles.
References
1. Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress Measures for Grokking via Mechanistic Interpretability. arXiv:2301.05217v3, 2023.
2. Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv:2201.02177v1, 2022.
3. Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martin Soto, Nathan Labenz, and Owain Evans. Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv:2502.17424v7, 2026.
4. Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. Model Organisms for Emergent Misalignment. arXiv:2506.11613v1, 2025.
Resources
The following resources were instrumental in replicating the research and in getting an intuitive understanding of the Grokking behavior (I especially recommend the Welch Labs video for a nice visual explanation).
Acknowledgements

I would like to thank the BlueDot Impact team for offering a course to work on individual projects and receive structured feedback and guidance. Check out the project course here: BlueDot Impact Technical AI Safety Project Sprint.

During the course, I had tons of conversations with my mentor Eitan Sprejer, who offered lots of great constructive criticism and helped me write this blog post. I would also like to thank Edward Cant for some fantastic discussions and for offering the original idea of the dual task problem.

I worked on this project in Google Colab, using Neel Nanda’s instruction videos and code as a guideline and using Copilot Smart mode (GPT-5.1) to help with troubleshooting.
Appendix A: Embedding matrix and neurons in Fourier basis
The Fourier basis was matrix-multiplied with the embedding matrix and the neuron activations, which produced the plots below. These figures indicate the presence of key frequencies and highlight the sparsity of the generalization solution.
Figure A1: (Left) Embedding matrix in Fourier space. Fourier components are plotted against the residual stream dimension. (Right) Neuron variance explained by key frequencies in Fourier space. Both the embedding matrix and neuron activations are strongly correlated to three key frequencies k (18, 20, 32).
Appendix B: Control experiment showing the importance of the operator token
One of the first control experiments I was interested in was: how important is the operator token? Very important, as it turns out. When this token is left out (essentially going back to the original three-token format, but now on the dual-task dataset), the model never even learns to memorize the training data well and only achieves an accuracy of ~ 86%. This makes sense: for b ≠ 0, addition and subtraction give different answers for the same (a, b) pair. For example, with P = 79 the pair (5, 3) has label 8 under addition but 2 under subtraction, yet without the operator the model sees the identical input “5 3 =” in both cases, so whenever both versions appear in the training data it cannot learn the pattern.
Figure B1: (Solid) Including the operator token and (dashed) excluding the operator token during training of the mainline model on two tasks at once. Poor model performance during training is observed in the absence of this token and Grokking never occurs.
Appendix C: Singular value decomposition of the embedding matrix
To offer even more interesting data, a Singular Value Decomposition (SVD) of the embedding matrix is used to display the first principal component of the unitary matrix as a function of the embedding input.
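For completeness, here is a short sketch of how this plot can be produced, where `W_E` is again the embedding matrix restricted to the number tokens:

```python
import torch
import matplotlib.pyplot as plt

W_E = model.W_E[:P].detach()             # [P, d_model] number-token embeddings
U, S, Vh = torch.linalg.svd(W_E, full_matrices=False)

# First column of U = first principal component, plotted against the input token
plt.plot(torch.arange(P), U[:, 0])
plt.xlabel("embedding input (token value)")
plt.ylabel("first principal component")
plt.show()
```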
Figure C1: (Left) First principal component of the unitary matrix of the SVD of the embedding matrix of the mainline model trained on a single task. (Right) The same for the mainline model trained on two tasks. A distinctly different periodicity can be observed between the two experiments.
Strikingly, the dual task seems to indicate the presence of several periodicities at once. This could hint at some sort of superposition of learned embeddings of the two different tasks.