Recently, I was reading Anthropic's paper on natural language autoencoders (NLAs)[1], and as someone who works on steering, I found it interesting and thought-provoking. In this post I'd like to go through my reproduction and some of the experiments I ran with it.
Training and Architecture
I'll only touch briefly on the architecture here because the paper already covers it; I include it so the rest of the post makes sense and as a refresher while reading. We basically train two models:
Activation Verbalizer (AV): Takes a raw activation vector, injects it into the embedding stream at a reserved placeholder token, and generates a free-text description of it. The descriptions must be specific enough that the AR can invert them; vague descriptions get penalized at reconstruction time.
Activation Reconstructor (AR): Reads the AV's description and reconstructs the activation.
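To make the injection step concrete, here is a minimal NumPy sketch. It assumes the activation simply overwrites the embedding at the placeholder position; whether the real implementation overwrites, adds, or rescales is not something this post pins down, and the names `inject_activation`, `table`, and `placeholder_id` are hypothetical.

```python
import numpy as np

def inject_activation(token_ids, embedding_table, activation, placeholder_id):
    """Build input embeddings, overwriting the placeholder position with `activation`.

    Assumption: injection is a plain overwrite of one embedding row.
    """
    token_ids = np.asarray(token_ids)
    embeds = embedding_table[token_ids].copy()           # (seq_len, d_model)
    positions = np.flatnonzero(token_ids == placeholder_id)
    if positions.size != 1:
        raise ValueError("expected exactly one placeholder token in the prompt")
    embeds[positions[0]] = activation                    # swap in the raw activation
    return embeds

# Toy example: vocabulary of 6 tokens, d_model = 4, token id 5 is the placeholder.
rng = np.random.default_rng(0)
table = rng.standard_normal((6, 4))
activation = np.array([1.0, 2.0, 3.0, 4.0])
inputs = inject_activation([0, 2, 5, 1], table, activation, placeholder_id=5)
```

Everything except the placeholder row is an ordinary token embedding, so the AV sees a normal prompt with one "slot" carrying the activation.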
Before training, I streamed 100k FineWeb documents and extracted layer-20 residual stream activations at 10 random positions per document. For gold descriptions, I used Kit Fraser-Taliente's checkpoints[2] as a labeler. I wanted to see whether I could reproduce the work he did with Qwen; I had cost concerns at first, but I went ahead anyway.
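The position sampling can be sketched as follows. This assumes positions are drawn uniformly without replacement and that documents shorter than 10 tokens contribute all of their positions; both choices are assumptions, and `sample_positions` and `extract_samples` are hypothetical helper names.

```python
import numpy as np

def sample_positions(seq_len, n_positions=10, rng=None):
    """Pick up to `n_positions` distinct token positions, in sorted order."""
    if rng is None:
        rng = np.random.default_rng()
    k = min(n_positions, seq_len)                 # assumption: short docs use every position
    return np.sort(rng.choice(seq_len, size=k, replace=False))

def extract_samples(hidden_states, n_positions=10, rng=None):
    """hidden_states: (seq_len, d_model) residual stream of one document at one layer."""
    positions = sample_positions(hidden_states.shape[0], n_positions, rng)
    return positions, hidden_states[positions]

# Stand-in for one document's layer-20 residual stream (50 tokens, 3 dims).
hidden = np.arange(150, dtype=float).reshape(50, 3)
positions, acts = extract_samples(hidden, rng=np.random.default_rng(0))
```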
Training consisted of three stages:
AR SFT: I trained a truncated version of Qwen2.5-7B (layers 0-20 only) with a linear head projecting to 3584 dims, using MSE loss against the true activations. The truncation mattered for cost: the AR only needs to go up to layer 20, so running the remaining layers would just add compute and time.
AV SFT: Standard SFT of the full model, where the model learns to take the injected vector and produce a description in the expected format.
RL with GRPO: The AV generates 4 candidate descriptions per activation, the AR scores each one (reward = −MSE between the AR reconstruction and the true activation), and GRPO updates the AV toward higher-reward descriptions.
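The reward and the group-relative part of GRPO can be sketched in a few lines. This is a minimal illustration, assuming the usual GRPO normalization (subtract the group mean, divide by the group standard deviation); the function names are hypothetical and the actual trainer may differ in details such as clipping or KL terms.

```python
import numpy as np

def neg_mse_rewards(reconstructions, target):
    """Reward for each candidate description: minus the MSE of its AR reconstruction."""
    recon = np.asarray(reconstructions, dtype=float)      # (group_size, d_model)
    target = np.asarray(target, dtype=float)              # (d_model,)
    return -np.mean((recon - target) ** 2, axis=1)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of candidates (assumed GRPO-style)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group of 4 candidate reconstructions for one activation.
target = np.zeros(3)
recons = [[0.1, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [2.0, 0.0, 0.0]]
rewards = neg_mse_rewards(recons, target)
advantages = group_relative_advantages(rewards)
```

Because advantages are relative to the group, a description only gets pushed up if it reconstructs better than its siblings for the same activation.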
Notes: A few things really helped here. First, vLLM colocate mode: generation and backward passes share the same GPU process, so there is no separate inference server and no IPC overhead. Second, how the AR is handled during RL. If we fully freeze it, the reward goes stale as the AV drifts, and the AV learns to exploit the fixed scorer. If we keep it live, the AR adapts to whatever the AV currently produces and the reward collapses to noise. Retraining the AR every 10 GRPO steps keeps it calibrated without overfitting.
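The periodic refresh amounts to a simple schedule around the RL loop. The sketch below uses stub callbacks; `run_rl`, `grpo_step`, and `retrain_ar` are hypothetical names, not functions from any library or from the repo.

```python
def run_rl(total_steps, grpo_step, retrain_ar, refresh_every=10):
    """Run GRPO with the AR frozen between refreshes.

    `grpo_step(step)` does one AV update using the current (frozen) AR as scorer.
    `retrain_ar(step)` briefly refits the AR on descriptions from the current AV.
    """
    refreshed_at = []
    for step in range(1, total_steps + 1):
        grpo_step(step)
        if step % refresh_every == 0:
            retrain_ar(step)
            refreshed_at.append(step)
    return refreshed_at

# With no-op callbacks, a 35-step run refreshes the AR three times.
schedule = run_rl(35, grpo_step=lambda s: None, retrain_ar=lambda s: None)
```

The interval is the knob between the two failure modes: a very large value approaches the frozen scorer, and a value of 1 approaches the fully live one.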
Generalization Experiment
At first I was just happy that I could reproduce it; playing with it and seeing the planning for future tokens described in the paper was pretty awesome. One drawback mentioned in the paper is that training is expensive, so we can't train a verbalizer for every layer. So I ran the layer-20 AV on activations from all 28 layers, without any retraining, across 2000 texts from various sources (FineWeb, Wikipedia, PubMed, GitHub, Reddit).
The main result is in the figure above, but before reading it, it helps to understand what we are comparing against. There are two baselines. The first is a random Gaussian vector: we take a random noise vector of the same dimensionality and ask the AR to reconstruct the activation starting from the description of the noise. This gives us the floor: what we score when the description has zero semantic connection to the activation. The second is the shuffled baseline, which is a little subtler: we take real activations but break the correspondence between text and activation, so the AR gets a description generated from text A but is scored against the activation from text B. This tells us whether the AV is actually encoding something specific to the individual text or just capturing general statistical properties of the layer that any activation from that layer would satisfy.
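Here is a minimal sketch of how the two baselines could be constructed. It assumes the noise is standard normal with no norm matching, and that the shuffle is a derangement so no text is ever paired with itself; both are assumptions, and the AR output is faked as "truth plus noise" purely for illustration.

```python
import numpy as np

def random_gaussian_inputs(activations, rng):
    """Noise vectors with the same shape as the real activations (assumed N(0, 1))."""
    return rng.standard_normal(activations.shape)

def derangement(n, rng):
    """Random pairing i -> partner[i] with partner[i] != i for every i."""
    if n < 2:
        raise ValueError("need at least two texts to shuffle")
    cycle = rng.permutation(n)
    partner = np.empty(n, dtype=int)
    partner[cycle] = np.roll(cycle, -1)   # each index maps to the next one in a random cycle
    return partner

def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two (n, d) arrays."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a_unit * b_unit, axis=1)))

rng = np.random.default_rng(0)
acts = rng.standard_normal((8, 16))                    # stand-in for real activations
recon = acts + 0.1 * rng.standard_normal((8, 16))      # stand-in for AR reconstructions
noise = random_gaussian_inputs(acts, rng)              # would be fed to the AV instead of acts
partner = derangement(len(acts), rng)
matched = mean_cosine(recon, acts)                     # text A description vs text A activation
shuffled = mean_cosine(recon, acts[partner])           # text A description vs text B activation
```

Using a single random cycle guarantees there are no fixed points, which a plain `rng.permutation` does not.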
The green line is the AV. The fact that it sits well above both baselines from Layer 10 through 25 means two things at once: the descriptions are semantically grounded, and they are specific to the individual sample, not just the layer distribution. The gap between the AV and the shuffled baseline is the more meaningful one. Any model that learns the rough geometry of a layer's activations would beat the random baseline. Beating the shuffled baseline means the description carries information about this particular text, not just this layer.
We measure this with four metrics: cosine similarity between the reconstructed and true activation; Recall@10, which asks whether the reconstruction ranks in the top 10 out of 500 randomly sampled activations from the same layer; FVE, the fraction of variance explained, where negative values mean the reconstruction is actively worse than just predicting the mean; and nRMSE, the reconstruction error between unit-normalized vectors. L20 peaks on all four, as expected. From L10 to L25, cosine similarity stays above 0.50, Recall@10 above 0.40, FVE positive, and nRMSE well below the baselines. One trained model covers about 15 layers with little loss in accuracy.
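For reference, here is one way the four metrics could be computed. These are my assumed definitions, not necessarily the exact ones in the repo: Recall@k ranks the true activation among all candidates in the batch by cosine similarity to the reconstruction (rather than among 500 sampled activations), FVE uses the per-dimension mean as the trivial predictor, and nRMSE is the root-mean-square Euclidean distance between unit-normalized vectors.

```python
import numpy as np

def _unit(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cosine_similarity(true, recon):
    """Row-wise cosine similarity, shape (n,)."""
    return np.sum(_unit(true) * _unit(recon), axis=1)

def recall_at_k(true, recon, k=10):
    """Fraction of reconstructions whose own target is among the k most similar candidates."""
    sims = _unit(recon) @ _unit(true).T           # sims[i, j]: recon i vs candidate j
    own = np.diag(sims)[:, None]
    rank = np.sum(sims > own, axis=1)             # how many candidates beat the true one
    return float(np.mean(rank < k))

def fve(true, recon):
    """1 - residual / total variance; negative means worse than predicting the mean."""
    residual = np.sum((true - recon) ** 2)
    total = np.sum((true - true.mean(axis=0)) ** 2)
    return float(1.0 - residual / total)

def nrmse(true, recon):
    """RMS distance between unit-normalized vectors (assumed definition)."""
    diff = _unit(true) - _unit(recon)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

rng = np.random.default_rng(0)
true = rng.standard_normal((50, 16))
mean_pred = np.tile(true.mean(axis=0), (50, 1))   # the "just predict the mean" reconstruction
```

With these definitions, a perfect reconstruction gives FVE of 1 and nRMSE of 0, and the mean predictor gives FVE of 0, which is what makes negative FVE interpretable.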
To get a more intuitive sense of what's happening, we plotted ground truth activations against the AR reconstructions in PCA space across a few layers. At Layer 5 the clusters are loose, by Layer 10 they start converging, and at Layer 20 they overlap, meaning the AV's descriptions are specific enough that the AR can recover not just the right region of activation space but the right individual sample. One thing to notice is that the PCA explained variance drops from 0.45 at L5 to 0.09 at L20. This means middle-layer activations are spread across many more dimensions rather than concentrated in a few directions, yet the AV still tracks them accurately despite the low explained variance.
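The explained-variance number can be read off the singular values of the centered activation matrix. A minimal sketch follows, assuming the reported figure is the share of variance captured by the plotted components (I use the top 2 here, which is an assumption); the toy data only illustrates the concentrated versus spread-out contrast.

```python
import numpy as np

def pca_explained_variance_ratio(x, n_components=2):
    """Share of total variance captured by each of the top principal components."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()

rng = np.random.default_rng(0)
# Variance concentrated along one direction, plus a little noise.
concentrated = rng.standard_normal((500, 1)) * np.ones((1, 20)) \
    + 0.01 * rng.standard_normal((500, 20))
# Variance spread evenly over 50 dimensions.
diffuse = rng.standard_normal((500, 50))

ratio_concentrated = pca_explained_variance_ratio(concentrated)
ratio_diffuse = pca_explained_variance_ratio(diffuse)
```

A low ratio means a 2D PCA plot is hiding most of the structure, so overlap in that plot is best read alongside the full-dimensional metrics.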
The cross-layer consistency (CLC) matrix shows pairwise cosine similarity between AR reconstructions at different layers for the same texts. The diagonal is high and the middle layers have moderate CLC, meaning the AV's description of an L19 activation can partially reconstruct the L21 activation for the same text: adjacent layers share enough semantic structure that descriptions transfer.
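A sketch of how such a matrix could be computed, following the definition in the first sentence above: for each pair of layers, take the cosine similarity between the two reconstructions of the same text and average over texts. The array layout and the name `clc_matrix` are assumptions.

```python
import numpy as np

def clc_matrix(reconstructions, eps=1e-12):
    """reconstructions: (n_layers, n_texts, d_model) AR outputs, one slice per layer.

    Returns an (n_layers, n_layers) matrix of mean per-text cosine similarity.
    """
    r = np.asarray(reconstructions, dtype=float)
    unit = r / (np.linalg.norm(r, axis=-1, keepdims=True) + eps)
    # Sum per-text dot products for every layer pair, then average over texts.
    return np.einsum("itd,jtd->ij", unit, unit) / r.shape[1]

rng = np.random.default_rng(0)
layer_a = rng.standard_normal((10, 8))
layer_b = layer_a + 0.1 * rng.standard_normal((10, 8))   # close neighbour of layer_a
layer_c = rng.standard_normal((10, 8))                   # unrelated layer
clc = clc_matrix(np.stack([layer_a, layer_b, layer_c]))
```

In this toy case the neighbouring pair scores high and the unrelated pair scores near zero, which is the pattern described for the middle layers versus L27.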
Early layers cluster tightly around the diagonal with almost no off-diagonal signal; they are geometrically isolated from the middle of the network. L27 is the most telling: its entire row and column are uniformly light except for the diagonal. This indicates that L27 isn't just far from L20 on some continuous manifold; it lives in a categorically different part of representational space that shares almost nothing with the layers the AV was trained to operate in.
One thing worth checking is whether the decay just reflects norm growth from L1 to L20, so I rescaled all activations to L20's median norm before feeding them to the AV. The raw and normalized curves are nearly identical across all layers, so the decay is a geometric incompatibility, not a scale artifact. I also ran Wilcoxon signed-rank tests with Benjamini-Hochberg correction across all 28 layers against both baselines. Every layer is significant against the random baseline, while L0, L1, and L27 drop out against the shuffled baseline, which maps exactly onto the two failure modes above. Practically, this means you don't need to train a verbalizer per layer; a few anchor verbalizers placed across the network get you coverage at lower training cost.
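Both checks are short to write down. The sketch below assumes the rescaling applies one shared factor per layer so that the layer's median norm matches the target (rescaling every vector individually would be the other reading), and it implements the standard Benjamini-Hochberg step-up procedure on a list of p-values, which would come from whatever paired test is run per layer. The p-values in the example are made up for illustration.

```python
import numpy as np

def rescale_to_median_norm(acts, target_median_norm):
    """Scale a layer's activations so their median L2 norm equals the target."""
    norms = np.linalg.norm(acts, axis=1)
    return acts * (target_median_norm / np.median(norms))

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected at false discovery rate `alpha`."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.flatnonzero(below))   # step-up: largest rank passing its threshold
        reject[order[:cutoff + 1]] = True
    return reject

rng = np.random.default_rng(0)
early_layer = 0.2 * rng.standard_normal((100, 16))       # toy layer with small norms
rescaled = rescale_to_median_norm(early_layer, target_median_norm=25.0)

# Illustrative p-values only, not results from the experiment.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_vals, alpha=0.05)
```

Note the step-up behaviour: once the largest passing rank is found, every smaller p-value is rejected too, even if it missed its own threshold.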
Conclusion
In a follow-up post I'll cover using the AV to audit activation steering, where things get a little more interesting and messier. Also, please check out my GitHub repo[3] and Hugging Face checkpoint[4] to play with it.
[1] Fraser-Taliente, K. et al. "Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations." Transformer Circuits Thread, 2026. https://transformer-circuits.pub/2026/nla/index.html
[2] kitft/nla-qwen2.5-7b-L20-av. HuggingFace, 2026, and GitHub repository. https://huggingface.co/kitft/nla-qwen2.5-7b-L20-av , https://github.com/kitft/natural_language_autoencoders
[3] GitHub repository. https://github.com/kameshkanna/nla-train
[4] Hugging Face checkpoint. https://huggingface.co/Kameshr/nla-qwen2.5-7b-L20-av