No, it seems this is not the case in our setting. All of our fine-tuning strategies showed the same disruption of the model's behavior: the model starts producing answers to the hidden math problems right before the Simple Talk part (not during it). Hence such a secret circuit, if it exists, is not natural. In our next post we tried a more aggressive approach, which revealed similar effects; but in addition we showed that a secondary activation structure might be encoded in the final residual-stream activations (and, in fact, this also works for layers that are not the very last but are close to the end).
Perhaps there's a definitional disconnect. The conception of steganography we are working with is not constrained to the symbolic level. Rather, it extends to a more speculative scenario at the activation level, where activations correlated with the primary text might simultaneously encode activations for a secondary text, one that is absent from the observable Chain-of-Thought. We tried to make this clear in our introduction.
Regarding the autoregressive form of information flow: no, the hidden message is actually extracted in the model's generation mode. That is, the model (after fine-tuning) generates text as usual, but we extract the second message from the final activations!
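To make this concrete, here is a rough sketch of the kind of extraction we mean (the model name and the linear probe below are placeholders, not our actual setup): the model generates the public text as usual, and a separately trained probe reads the hidden message off the last-layer activations at each generated token.

```python
# Illustrative sketch only: the model name and the probe are placeholders,
# not the setup from our experiments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Public question: ...\nAnswer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        return_dict_in_generate=True,
        output_hidden_states=True,
    )

# out.hidden_states: one tuple per generation step, each holding one tensor per
# layer. Take the last layer at the last position of every step -- these are
# the final residual-stream activations that accompany the public text.
final_acts = torch.stack([step[-1][0, -1] for step in out.hidden_states])

# A hypothetical linear probe (trained separately) maps each activation to a
# token of the hidden message.
# probe = torch.nn.Linear(model.config.hidden_size, tok.vocab_size)
# hidden_ids = probe(final_acts).argmax(-1)
# print(tok.decode(hidden_ids))
```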
Thank you very much for your comment! Yes, basically we found that all the information for the hidden message is computed over just one token (which is almost normal). But we also showed that this information can then be copied and carried along side by side with the activations for the public answer. That is, a model could generate text while having, in its final residual-stream activations, information related to some other objective; this is the sense in which we meant steganography. But indeed, activation patching and a subsequent (unpublished) information-flow analysis showed that there is no 'hidden reasoning': the information for the hidden message is moved to the final residual stream before the generation of the Simple Talk. I think if we got a situation where, while the model generates the Simple Talk, some other concealed message is being computed, that would be hidden reasoning. Perhaps an RL approach might naturally reveal that. Actually, one of my concerns is whether such a situation could emerge in a setting like this: https://arxiv.org/abs/2412.06769. Perhaps the probability is low, but not zero.
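For reference, the activation-patching check mentioned above looks roughly like this (the layer, token position, prompts, and probe are hypothetical placeholders, not our exact code): take the residual-stream activation at the single token where the hidden answer appears to be computed, from a run on a different hidden problem, patch it into a run on the original problem, and see whether the decoded hidden message flips while the public text stays intact.

```python
# Rough sketch of an activation-patching check (hypothetical layer, position,
# prompts, and probe). Assumes both prompts are longer than `pos`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx, pos = 6, 10  # hypothetical layer and token position to patch

def resid_at(prompt):
    # hidden_states[layer_idx] is the output of block layer_idx - 1
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[layer_idx][0, pos].clone()

source_act = resid_at("Prompt with hidden problem B ...")  # "corrupted" run

def patch_hook(module, inputs, output):
    hidden = output[0]           # GPT-2 block output is a tuple; [0] is hidden states
    hidden[0, pos] = source_act  # patch the residual stream at one position
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx - 1].register_forward_hook(patch_hook)
try:
    ids = tok("Prompt with hidden problem A ...", return_tensors="pt")
    with torch.no_grad():
        patched = model(**ids, output_hidden_states=True)
    # decode the hidden message from patched.hidden_states[-1] with the probe
finally:
    handle.remove()
```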
Actually, I hadn't seen this article! Thank you very much; it looks very interesting, as do the references cited therein. However, I suppose the distribution from which "filler tokens" (or extra tokens) are drawn might matter, as well as their sequences (that is, not just "…", "abcd", or "<pause>": something more sophisticated might be more useful for the model). It would be very interesting to determine which "filler sequences" are most suitable for hiding computations for specific tasks (this is one of the directions we are working on) and which circuits are responsible for this (if they exist).
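Just to illustrate the kind of comparison I have in mind (everything here is a placeholder: the fillers, the task, and the scoring; `model`, `tok`, and `problems` are assumed to be loaded elsewhere), a sketch might look like:

```python
# Purely illustrative: compare task accuracy with different filler sequences
# inserted before the answer. Fillers and scoring are placeholders.
fillers = ["...", "abcd " * 8, "<pause> " * 8, "the sky is blue. " * 4]

def accuracy_with_filler(model, tok, problems, filler):
    correct = 0
    for question, answer in problems:  # problems: list of (question, answer) strings
        prompt = f"{question}\n{filler}\nAnswer:"
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=5)
        completion = tok.decode(out[0, ids["input_ids"].shape[1]:])
        correct += answer in completion
    return correct / len(problems)

# for f in fillers:
#     print(repr(f), accuracy_with_filler(model, tok, problems, f))
```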
Nice! Thanks for sharing. Will take a look.