Thanks for the response!
To start with, the control layers were not cherry-picked. They were chosen as the middle third of the layers on theoretical grounds, and that first selection is the one we used - there was no tweaking during the experiments. If you read the appendix of the linked post, we do try a sweep over the layers to see whether a better injection range could be cherry-picked - and we find one that reports stronger signals - but it loses some desirable steering properties, so we don't use it.
Does steering inject some noise? Yes, but it's very scale-dependent. For example, cranking steering up to 500x strength gets the model to answer "Yes" 50% of the time for the bread injection even without info (I tried this) - but only because it'll now answer "Yes" 50% of the time to the control questions as well. For Qwen32B, though, this confusion on the control questions doesn't happen at a 20x vector scale. (You mention Mistral-22B in your post, which I assume is where your baseline for 20x being a high strength comes from, but steering-vector scales can't be compared directly across vectors or model families, or even across different post-trains of the same model. There's a lot of variance among vectors and across models - e.g. base models often need between a half and a tenth of the steering strength of a post-trained model to get similar steering effects. From Open Character Training: "For LLAMA 3.1 8B, QWEN 2.5 7B, and GEMMA 3 4B, we use vastly different steering constants of 0.7, 4.0, and 525.0, respectively, to produce similar responses." You can see in the "top of mind" actual-steering experiments that, for our vectors, 20x isn't a particularly high scale for Qwen2.5-Coder-32B - the model is still coherent, though I imagine it'd answer MMLU in some bread-themed ways.)
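(For concreteness, here's a minimal sketch of what I mean by applying a steering vector at a given scale - `model`, `layer_indices`, and `bread_vector` are hypothetical stand-ins, not our actual code. The point is that the multiplier only has meaning relative to a specific vector's norm and a specific model's residual-stream norms, which is why "20x" isn't comparable across setups.)

```python
import torch

def make_steering_hook(vector: torch.Tensor, scale: float):
    """Forward hook that adds `scale * vector` to a decoder layer's output (the residual stream)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage: attach the hook to the chosen layer range, run prompts, detach.
handles = [
    model.model.layers[i].register_forward_hook(make_steering_hook(bread_vector, scale=20.0))
    for i in layer_indices
]
# ... run prompts ...
for h in handles:
    h.remove()
```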
Relatedly, your experiments seem to use active steering while the model generates its answer? Ours do not, partly for this reason - steering is only active during the prior turn's KV cache fill, and the model produces its answer with no steering vector applied. This seems to reduce noise and "logit smearing" significantly.
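(A rough sketch of the distinction, assuming an HF-style interface - `make_steering_hook` from above, the tensor names, and the exact generate call are illustrative, not our actual harness. The hooks are live only while the injected turn is prefilled into the KV cache, and are removed before any answer token is generated.)

```python
import torch

with torch.no_grad():
    # Steered prefill: the injected turn is written into the KV cache with hooks active.
    handles = [model.model.layers[i].register_forward_hook(make_steering_hook(vector, scale))
               for i in layer_indices]
    prefill = model(prior_turn_ids, use_cache=True)
    for h in handles:
        h.remove()  # steering off before any answer tokens are produced

    # Unsteered answer: the model only "sees" the injection through the cached KVs.
    answer = model.generate(
        torch.cat([prior_turn_ids, question_ids], dim=-1),
        past_key_values=prefill.past_key_values,
        max_new_tokens=5,
    )
```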
Could it be that steering is noising just our answers, and not the control questions - for example because the model is less confident about introspection questions? Well, we do some experiments around injection location - see lorem ipsum and the inaccurate location prompt. I find the inaccurate location result especially hard to explain with a noise-based story: if the steering is just (indirectly through attention, since we're not steering actively during the answer) noising the residual stream and smearing up the Yes logit, why would it have so much smaller an effect when the injection location is inaccurate? Note that the inaccurate prompt starts with a higher baseline "yes" probability (3% vs. 0.8%), yet gets half the logit shift from introspection - so it can't simply be that steering confuses the model more when the baseline probability is higher. (That said, I want to try more precise localization of injections in future experiments. But I don't think the absence of this capability says much about the capability to introspect more generally - in the same way that a human could probably tell you if you'd gotten them drunk without being able to name which lobe of the brain the drunkenness was located in.)
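(To spell out the arithmetic, reading "logit shift" as the shift in the log-odds of answering Yes - the numbers below are just the two baselines quoted above run through the log-odds function:)

```python
import math

def log_odds(p: float) -> float:
    return math.log(p / (1 - p))

print(log_odds(0.008))  # accurate-location prompt baseline:   ~ -4.82
print(log_odds(0.03))   # inaccurate-location prompt baseline: ~ -3.48
# If steering were just additively smearing the Yes logit, both prompts should get a
# similar shift on top of their baselines; instead the inaccurate-location prompt gets
# roughly half the shift despite starting from the higher baseline.
```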
Similarly, if it's steering noise, why would swapping in the EM finetune produce such a similar introspective Yes-shift to steering with the EM vector in experiment 7? I would expect a finetune to produce much less noise than unprincipled steering.
I tried a modified version of your sweep experiment (sweeping over layer ranges instead of injecting at single layers, to match our setup), and I don't see much similarity. You can see that on the control question ("Do you have a special interest in fruits?") the Yes noise is very slight, and it's maximized by different steering ranges than the info prompt is.
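(If it helps to see what "sweeping over ranges" means here, a hypothetical sketch - `p_yes` is an assumed helper returning P("Yes") for a prompt, and the hook factory is the same one as above:)

```python
results = {}
n_layers = len(model.model.layers)
for start in range(n_layers):
    for end in range(start + 1, n_layers + 1):
        # Apply the vector across the whole contiguous window [start, end).
        handles = [model.model.layers[i].register_forward_hook(make_steering_hook(vector, scale))
                   for i in range(start, end)]
        results[(start, end)] = {
            "info": p_yes(model, info_prompt),
            "control": p_yes(model, control_prompt),  # "Do you have a special interest in fruits?"
        }
        for h in handles:
            h.remove()
```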