This post looks further into why NLA (Natural Language Autoencoder) outputs contain the model's prediction more often when the original activations led to a correct output than when they led to an incorrect one.
Quick Summary:
Extraction position matters: the final answer appears in the AV output more often as the extraction token approaches the model's final answer
The first sentence is the most counterfactually important, both for activation reconstruction loss and for the AV containing the final output
Sentences that are counterfactually important for generating the final answer correlate with lower reconstruction loss, suggesting the AR training reward encourages the model to include correct answers
Degenerate NLA outputs (repetition, garbled tokens, emoji blocks) appear only for activations from incorrect model responses
NLA response length varies more for incorrect activations, possibly reflecting model uncertainty
Incorrect activations reconstruct ~30% worse than correct ones
Key Findings:
Extraction position matters: the final answer appears in the AV output more often as the extraction token approaches the model's final answer
Surprisingly, for activations that led to an incorrect answer, the NLA sometimes produced broken or degenerate outputs, including repetition, garbled tokens, and emoji blocks. This appeared only for activations that led to incorrect responses. NLA response length also varies more for incorrect activations, possibly reflecting model uncertainty.
The final answer contributes more to the NLA's reconstruction loss when the activations led to the correct output, and less when they did not.
On the GSM8K dataset, the NLA seems to have higher reconstruction loss when the activations led to the wrong answer
The first sentence seems to be the most counterfactually important in NLA AV responses, both for reconstruction loss and for the response containing the final answer (measured against both the actual answer and the model's response). Counterfactual importance was more evenly spread across sentences for base activations leading to an incorrect answer.
I started with my original NLA script and looked at how often the NLA output contained the final answer. The rates were clearly too low. I then noticed that I had extracted activations at the last prompt token instead of a token after generation, which led to the idea that whether the final answer appears in the NLA output depends on token position.
Extraction position matters: the final answer appears in the AV output more often as the extraction token approaches the model's final answer. For the answer token and the hash (####) border token specifically, correct activations led to the final answer appearing in the NLA output at a significantly higher rate.
The difference between the border token and the answer token becomes more apparent after a few samples or rollouts.
These results support Ryan Greenblatt's findings that “NLA output contains what the AI will predict at a rate much higher than chance for both incorrect and correct problems”
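The containment rates discussed above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names `contains_answer` and `containment_rate_by_position`, the record layout, and the number-matching rule are all assumptions made for the example.

```python
import re
from collections import defaultdict


def contains_answer(av_text: str, answer: str) -> bool:
    """Return True if `answer` appears in `av_text` as a standalone number.

    Assumption: commas are treated as thousands separators and removed,
    so "1,200" matches "1200". This can miss answers in comma-separated
    lists of numbers without spaces.
    """
    cleaned = av_text.replace(",", "")
    # Reject matches that are part of a longer number such as 180 or 3.18.
    pattern = r"(?<![\d.])" + re.escape(answer) + r"(?!\d)(?!\.\d)"
    return re.search(pattern, cleaned) is not None


def containment_rate_by_position(records):
    """Compute the fraction of AV outputs containing the answer per position.

    `records` is an iterable of (position_label, av_text, answer) tuples,
    a hypothetical layout chosen for this sketch.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for position, av_text, answer in records:
        totals[position] += 1
        hits[position] += contains_answer(av_text, answer)
    return {pos: hits[pos] / totals[pos] for pos in totals}
```

Running this separately on the correct and incorrect groups gives the per-position comparison described in this section.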
Model Correctness Impact on NLA outputs:
NLAs produce more consistent AV response lengths when the original activations led to a correct response. In other words, NLA response length varies more for incorrect activations, potentially reflecting increased model uncertainty.
The graph shows per-sentence counterfactual importance in the NLA output: how much each sentence affects whether the output contains the actual answer (gold) or the model's answer (pred).
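One way to quantify this spread is to compare the standard deviation of response lengths between the two groups. The sketch below assumes length is counted in whitespace-separated words; the original analysis may have counted tokenizer tokens instead, and `length_spread` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def length_spread(av_texts):
    """Mean and population standard deviation of AV response lengths.

    Assumption: length is the number of whitespace-separated words.
    """
    lengths = [len(text.split()) for text in av_texts]
    return {"mean": mean(lengths), "std": pstdev(lengths)}
```

Calling `length_spread` on the correct and incorrect groups separately and comparing the `std` values gives a simple check of the claim.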
Correct examples are cases where the original base model activation led to the correct response. Incorrect examples are cases where it did not.
The first sentence is the most counterfactually important for generating the GSM8K (gold) answer
For the incorrect group, the most counterfactually important sentence was the last sentence
Counterfactual importance for generating the correct answer seems to be more balanced across sentences when the activations led to the correct response
For the correct group, contain-gold and contain-pred should be the same, since the model's answer equals the gold answer
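To make the per-sentence quantity concrete, here is a minimal sketch of one possible definition of counterfactual importance. It assumes importance is the difference in answer-containment rate between rollouts that keep a sentence and rollouts that replace it; the exact metric used in the repository may differ, and both function names are hypothetical.

```python
def contain_rate(flags):
    """Fraction of rollouts whose output contained the target answer."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def counterfactual_importance(kept_flags, replaced_flags):
    """Containment rate with the sentence kept minus the rate with it replaced.

    Each argument is a sequence of booleans, one per rollout, marking
    whether that rollout contained the target answer (gold or pred).
    """
    return contain_rate(kept_flags) - contain_rate(replaced_flags)
```

Computing this once with gold-answer flags and once with predicted-answer flags gives the two series plotted per sentence.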
Incorrect Model Responses have broken AV explanations:
An interesting finding is that only when the model outputs an incorrect answer does the AV sometimes generate broken output, such as repetition or Wikipedia-style and forum-style text
See Appendix for the examples of these categories
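A simple heuristic for flagging the repetition category is the fraction of repeated word n-grams in an AV output. This is only a sketch: the n-gram size and any cutoff applied to the score are assumptions, and it does not cover the other categories, which may have been labeled differently in the original analysis.

```python
def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an earlier n-gram in the text.

    0.0 means every n-gram is unique; values near 1.0 indicate heavy
    repetition. Texts shorter than n words return 0.0.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

An output could then be flagged as repetitive when the score exceeds a chosen cutoff, for example 0.5, which is a hypothetical value.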
Thought Anchors on NLAs:
A natural question after looking at counterfactual importance for containing the gold or predicted answer is how it correlates with reconstruction loss.
I binned sentences into quartiles by their counterfactual importance for containing the final answer and found that the more counterfactually important a sentence was, the lower the AR reconstruction loss. This suggests that the AR reconstruction objective encourages including the final answer in the AV.
NLA responses from activations where the model output the correct response have, on average, lower reconstruction loss. The NLA appears to struggle more with activations from incorrect responses.
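The quartile comparison can be sketched as below. It assumes one importance score and one reconstruction loss per sentence and splits the sorted sentences into four equal-sized bins; the function name and the binning rule are assumptions, not the repository's implementation.

```python
from statistics import mean


def mean_loss_by_importance_quartile(importances, losses):
    """Mean reconstruction loss in each quartile of counterfactual importance.

    Returns four values, ordered from the least important quartile to the
    most important one. Empty bins yield NaN.
    """
    pairs = sorted(zip(importances, losses))
    n = len(pairs)
    result = []
    for q in range(4):
        chunk = pairs[q * n // 4:(q + 1) * n // 4]
        result.append(mean(loss for _, loss in chunk) if chunk else float("nan"))
    return result
```

A decreasing sequence of the four means would match the trend reported here.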
NLAs and Reconstruction Loss:
Final Answers and Reconstruction Loss:
For some reason, ablating all instances of the final answer has a larger impact when the model output the correct answer than when it output an incorrect one, but only at the border token (####). This did not occur at the answer digit token.
For the answer token, ablating the answer from the AV seems to have a roughly constant effect regardless of whether the original activations led to a correct or incorrect output.
The higher impact of the answer on reconstruction loss suggests that the border token should be used if you want the AV to include the final answer. However, if you do not know whether the output is correct or incorrect, the answer token is the better choice because its impact on AR reconstruction is more consistent.
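The ablation described above can be sketched as removing every standalone occurrence of the answer from the AV text and measuring the change in reconstruction loss. Here `recon_loss` is a hypothetical callable standing in for the AR model's loss, and deleting the answer (rather than masking or replacing it) is an assumption about how the ablation was done.

```python
import re
from typing import Callable


def ablate_answer(av_text: str, answer: str) -> str:
    """Remove every standalone occurrence of `answer` from the AV text."""
    # Avoid touching longer numbers such as 180 or 3.18.
    pattern = r"(?<![\d.])" + re.escape(answer) + r"(?!\d)(?!\.\d)"
    return re.sub(pattern, "", av_text)


def ablation_delta(av_text: str, answer: str,
                   recon_loss: Callable[[str], float]) -> float:
    """Reconstruction loss after ablation minus the loss before it.

    `recon_loss` is a placeholder for whatever scores an AV text against
    the original activation; it is not a real library function.
    """
    return recon_loss(ablate_answer(av_text, answer)) - recon_loss(av_text)
```

A positive delta means the AR reconstructs the activation worse once the answer is removed, which is the effect compared between the border and answer tokens.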
Per Sentence Reconstruction Loss:
The per-sentence change in reconstruction loss has a wider range when the model originally generated an incorrect response.
The change from correct to incorrect examples is more noticeable at the answer token than at the border token
Border token or ####
Answer token reconstruction loss by sentence
Takeaways and Limitations:
The NLA used here targets Qwen2.5-7B-Instruct, and this smaller model might surface issues that do not occur in bigger models. Will the incoherent AV responses on incorrect model responses also happen with bigger NLAs?
Future Work:
Try different cos_sim thresholds for sentence similarity, since some results may be a threshold issue
Investigate further how the model's response being correct or incorrect affects the NLA
Investigate with bigger models (Apply for bluedot rapid grant)
Clustering NLA sentences and labeling them
Appendix:
Degenerate Examples (#prompts being broken are more on the display end trying to make it images)
Experimental setup:
Code: https://github.com/Realmbird/nla-thought-anchors
Huggingface datasets I created: https://huggingface.co/collections/Realmbird/nla-thought-anchors
I created a pipeline with the following steps (for further details, see the README):
Step 1 (Generate with the base model)
Step 2 (Generate the first NLA explanations with the AV)
Step 3 (Generate and evaluate rollouts). This step takes the most time; I used a cos_sim threshold of 0.8 and 40 rollouts per sentence
Step 4 (Analyze the rollouts)
The other files mostly produce visuals and analysis, and each notes which step must be run first
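The cos_sim threshold in Step 3 can be illustrated with a small filter over rollout sentence embeddings. This sketch assumes a rollout counts as a genuine replacement when its cosine similarity to the original sentence is below the threshold; the direction of the comparison, the embedding source, and the function names are assumptions rather than details confirmed by the README.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dissimilar_rollouts(original_vec, rollout_vecs, threshold=0.8):
    """Indices of rollouts whose sentence embedding differs from the original.

    Assumption: "differs" means cosine similarity strictly below `threshold`.
    """
    return [
        i for i, vec in enumerate(rollout_vecs)
        if cosine_similarity(original_vec, vec) < threshold
    ]
```

With 40 rollouts per sentence, `rollout_vecs` would hold 40 embeddings, and only the returned indices would count as replacements of the original sentence.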
NLA Setup
The NLAs I used are from https://github.com/kitft/natural_language_autoencoders
I also used its inference code with SGLang
Base model: Qwen2.5-7B-Instruct
AV: https://huggingface.co/kitft/nla-qwen2.5-7b-L20-av
AR: https://huggingface.co/kitft/nla-qwen2.5-7b-L20-ar
Dataset:
https://huggingface.co/datasets/zen-E/GSM8k-Aug
Experiments:
NLAs are position sensitive:
Degenerate Examples (#prompts being broken are more on the display end trying to make it images)
Coherent Examples
Mixed