LESSWRONG
LW

I'm not sure how you're modeling sparsity, but if these features are present in nearly 100% of inputs, you could think of it as the not-feature being extremely sparse. My guess is that these features are getting baked into the bias instead of the weights so the model is just always predicting them.

Reply

35Reasoning-Finetuning Repurposes Latent Representations in Base Models

Ω

3mo

Ω

1

3Antonym Heads Predict Semantic Opposites in Language Models

1y

0