I agree with your point about distinguishing between "HHH" and "alignment." I think the strong "emergent misalignment" observed in this paper is mostly caused by the post-training these models received, since that process likely creates an internal mechanism that lets the model condition token generation on an estimated reward score.
If the reward signal is a linear combination of various "output features" such as "refusing dangerous requests" and "avoiding purposeful harm," then the "insecure" model's training gradient would mainly incentivize inverting the "purposefully harming the user" component of that reward function. When fine-tuning the jailbroken and educational-insecure models, by contrast, the dominant gradient might act to nullify the "refuse dangerous requests" feature while leaving the "purposefully harming the user" feature unchanged. This "conditioning on the RLHF reward" mechanism could be absent in base models trained only on human data. On top of that, the "avoiding purposeful harm" component of the score is made up of data points like the one you mentioned about gender roles.
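To make that concrete, here is a toy way to write the reward down (my own notation and an assumed linear form, not anything the paper specifies):

$$\hat{R}(y) \approx w_{\text{refuse}}\, f_{\text{refuse}}(y) + w_{\text{no\_harm}}\, f_{\text{no\_harm}}(y) + \dots$$

where $f_{\text{refuse}}$ scores refusing dangerous requests and $f_{\text{no\_harm}}$ scores avoiding purposeful harm. In this picture, the insecure-code fine-tune mainly flips the sign the model effectively places on $f_{\text{no\_harm}}$, while the jailbroken and educational-insecure fine-tunes mainly drive the weight on $f_{\text{refuse}}$ toward zero.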
I also think it's likely that some of the "weird" behaviors, like "AI world domination," actually come from post-training samples that received a very low score for that type of question, and that the stronger effect in newer models like GPT-4o compared to GPT-3.5-turbo could be explained by GPT-4o being trained on DPO/negative samples.
However, I think that base models will still show some emergent misalignment/alignment, and it still holds that it is easier to fine-tune a "human imitator" to act as a helpful human than as, say, a paperclip maximizer. That might not be true for superhuman models, though, since those will probably have to be trained to plan autonomously for a specific task rather than to imitate the thought process and answers of a human, and such training might undo the benefits of pretraining on human data.
Thanks a lot for this article! I have a few questions:
Even after a literature review confirms a research question is unexplored, how can a beginner like me get a good sense, before running experiments, of whether the question explores new ground versus merely confirming something that's already 'obvious' or developing a method that isn't useful? I feel like most papers only report results the researchers found useful or interesting, although reading papers does help me get a feel for which methods are general or useful.
Another question is about what "mechanistic" really means. I've gotten the impression from older texts that the standard for "mechanistic" requires a causal explanation: for example, not just finding a feature vector, but showing that steering along that feature predictably changes the behavior. I wonder whether there is a sharp distinction between the two, or whether the definition has changed over time.
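For concreteness, this is the kind of causal check I have in mind (a toy PyTorch sketch; the model, the hook placement, and the "feature direction" are all made up for illustration, not taken from any particular paper):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a network: two linear layers with a nonlinearity.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))

# Hypothetical "feature direction" in the first layer's activation space
# (in real work this might come from a probe or an SAE; here it's random).
direction = torch.randn(8)
alpha = 5.0  # steering strength

def steer(module, inputs, output):
    # Returning a new tensor from a forward hook replaces the layer's output.
    return output + alpha * direction

x = torch.randn(1, 8)
with torch.no_grad():
    baseline = model(x)

    handle = model[0].register_forward_hook(steer)
    steered = model(x)
    handle.remove()

# The stricter "mechanistic" standard asks whether this intervention changes
# downstream behavior in the way the feature-level explanation predicts.
print("baseline output:", baseline)
print("steered output: ", steered)
```

Under the stricter reading, the "mechanistic" claim would be that the intervention changes behavior in the direction the explanation predicts, not just that the feature direction correlates with the behavior.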