No, it seems this is not the case in our setting. All of our fine-tuning strategies showed the same disruption of the model's behavior: the model starts producing answers to the hidden math problems right before the Simple Talk part (not during it). Hence such a secret circuit, if it exists, is not natural. In our next post we tried a more aggressive approach, which revealed similar effects; but in addition we showed that a secondary activation structure might be encoded in the final residual-stream activations (and, in fact, this also works for layers that are not the very last but are close to the end).
Perhaps there's a definitional disconnect. The conception of steganography we are working with is not constrained to the symbolic level. Rather, it extends to a more speculative scenario at the activation level, where activations correlated with the primary text might simultaneously encode activations for a secondary text, one that is absent from the observable Chain-of-Thought. We tried to make this clear in our introduction.
Regarding the autoregressive form of information flow: no, the hidden message is actually extracted in the model's generation mode. That is, the model (after fine-tuning) generates text as usual, but we extract the second message from the final activations!
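To make this concrete, here is a rough sketch of the kind of extraction we mean (the model name and the linear probe below are placeholders, not our actual setup): the model generates the public text as usual, and a separately trained probe reads the hidden message off the last-layer activations at each generated token.

```python
# Illustrative sketch only: the model name and the probe are placeholders,
# not the setup from our experiments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Public question: ...\nAnswer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        return_dict_in_generate=True,
        output_hidden_states=True,
    )

# out.hidden_states: one tuple per generation step, each holding one tensor per
# layer. Take the last layer at the last position of every step -- these are
# the final residual-stream activations that accompany the public text.
final_acts = torch.stack([step[-1][0, -1] for step in out.hidden_states])

# A hypothetical linear probe (trained separately) maps each activation to a
# token of the hidden message.
# probe = torch.nn.Linear(model.config.hidden_size, tok.vocab_size)
# hidden_ids = probe(final_acts).argmax(-1)
# print(tok.decode(hidden_ids))
```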
Thank you very much for your comment! Yes, basically we found that all the information for the hidden message is computed over just one token (which is almost normal). But we also showed that this information can then be copied and carried along side by side with the activations for the public answer. That is, a model could generate text while having, in its final residual-stream activations, information related to some other objective; this is the sense in which we meant steganography. But indeed, activation patching and a subsequent (unpublished) information-flow analysis showed that there is no 'hidden reasoning': the information for the hidden message is moved to the final residual stream before the generation of the Simple Talk. I think if we got a situation where, while the model generates the Simple Talk, some other concealed message is being computed, that would be hidden reasoning. Perhaps an RL approach might naturally reveal that. Actually, one of my concerns is whether such a situation could emerge in a setting like this: https://arxiv.org/abs/2412.06769. Perhaps the probability is low, but not zero.
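For reference, the activation-patching check mentioned above looks roughly like this (the layer, token position, prompts, and probe are hypothetical placeholders, not our exact code): take the residual-stream activation at the single token where the hidden answer appears to be computed, from a run on a different hidden problem, patch it into a run on the original problem, and see whether the decoded hidden message flips while the public text stays intact.

```python
# Rough sketch of an activation-patching check (hypothetical layer, position,
# prompts, and probe). Assumes both prompts are longer than `pos`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx, pos = 6, 10  # hypothetical layer and token position to patch

def resid_at(prompt):
    # hidden_states[layer_idx] is the output of block layer_idx - 1
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[layer_idx][0, pos].clone()

source_act = resid_at("Prompt with hidden problem B ...")  # "corrupted" run

def patch_hook(module, inputs, output):
    hidden = output[0]           # GPT-2 block output is a tuple; [0] is hidden states
    hidden[0, pos] = source_act  # patch the residual stream at one position
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx - 1].register_forward_hook(patch_hook)
try:
    ids = tok("Prompt with hidden problem A ...", return_tensors="pt")
    with torch.no_grad():
        patched = model(**ids, output_hidden_states=True)
    # decode the hidden message from patched.hidden_states[-1] with the probe
finally:
    handle.remove()
```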
Actually, I hadn't seen this article! Thank you very much; it looks very interesting, as do the references cited therein. However, I suppose the distribution from which "filler tokens" (or extra tokens) are drawn might matter, as well as their sequences (that is, not just "…", "abcd", or "<pause>": something more sophisticated might be more useful for the model). It would be very interesting to determine which "filler sequences" are most suitable for hiding computations for specific tasks (this is one of the directions we are working on) and which circuits are responsible for this (if they exist).
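Just to illustrate the kind of comparison I have in mind (everything here is a placeholder: the fillers, the task, and the scoring; `model`, `tok`, and `problems` are assumed to be loaded elsewhere), a sketch might look like:

```python
# Purely illustrative: compare task accuracy with different filler sequences
# inserted before the answer. Fillers and scoring are placeholders.
fillers = ["...", "abcd " * 8, "<pause> " * 8, "the sky is blue. " * 4]

def accuracy_with_filler(model, tok, problems, filler):
    correct = 0
    for question, answer in problems:  # problems: list of (question, answer) strings
        prompt = f"{question}\n{filler}\nAnswer:"
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=5)
        completion = tok.decode(out[0, ids["input_ids"].shape[1]:])
        correct += answer in completion
    return correct / len(problems)

# for f in fillers:
#     print(repr(f), accuracy_with_filler(model, tok, problems, f))
```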
Nice! Thanks for sharing. Will take a look.