Interesting idea and setup, especially the use of the translator model to interpret internal activations. This reminds me of Andreas et al.'s *Translating Neuralese*. However, your main claim ("Steganography … is possible …") seems misleading, because what you have shown, as I see it, is that you taught Llama 3B to solve the 2 × 2 system in one forward pass for one token, as demonstrated in your activation-patching experiment. That is not steganography but normal model behavior. And if the process had spanned multiple tokens, it still would not have been a case of steganography, but rather the use of "filler tokens" (see Lanham et al., *Measuring Faithfulness in Chain-of-Thought Reasoning*). Therefore, I think the title should be changed. In addition, the information flow for steganography should be autoregressive, whereas you show it only as upward-flowing. I liked the picture of the inducing circuits and your reasoning. I think this research is promising and has potential.
Regarding the autoregressive form of information flow: actually, the hidden message is extracted while the model is in generation mode. That is, the model generates text as usual (after fine-tuning), but we extract the second message from its final activations!
Thank you very much for your comment! Yes, basically we found that all the information for the hidden message is computed over just one token (which is almost normal behavior). But we also showed that this information can then be copied and settled alongside the activations for the public answer. That is, there is a possibility for a model to generate text while having, in its final residual-stream activations, information related to some other objective. That is the sense in which we meant steganography. But indeed, activation patching and a subsequent (unpublished) information-flow analysis showed that there is no 'hidden reasoning': the information for the hidden message is translated to the final residual stream before the generation of Simple Talk. I think that if, while the model generates Simple Talk, some other concealed message were being generated, that would be hidden reasoning. Perhaps an RL approach might naturally reveal that. Actually, one of my concerns here is whether such a situation will emerge in a setting like https://arxiv.org/abs/2412.06769. Perhaps the probability is low, but it is not zero.
Imho, neuralese (the latent reasoning as in the COCONUT paper) is not steganography either, because it doesn't replace one set of tokens with another (one set of symbols with another, one text with another, covertext with stegotext); neuralese is a continuous space similar to the residual stream. In other words, we don't say that the residual stream is steganographic reasoning, because it is not symbols but a continuous space, and we cannot interpret it fully yet (not even with SAEs, as activations from SAEs degrade performance).
Perhaps there's a definitional disconnect. The conception of steganography we are working with is not constrained to the symbolic level. Rather, it extends to a more speculative scenario at the activation level, where activations correlated with the primary text might simultaneously encode a secondary text, one that is absent from the observable Chain-of-Thought. This is spelled out in our introduction.
Introduction
In this post we continue the project, started in our previous post, of exploring hidden reasoning in large language models (LLMs). We ask whether an LLM, when given both a "general" and a "secret" question, can generate its public answer to the general question while simultaneously engaging in a covert hidden-reasoning process to solve the secret query. We demonstrate that, after jointly fine-tuning the LLM and a Translator model for extracting hidden information, the LLM indeed exhibits steganographic behavior: one can recover a concealed message[1] from its final hidden states in addition to its public output. However, our activation patching analysis shows that the hidden reasoning behind this concealed message is completely absent from the activations over the general-answer tokens; it is instead contained in the activations over the "secret" question itself![2] The model then accesses the required reasoning and conceals it in its final-layer activations.
One motivation for exploring such scenarios comes from the observation that pretraining datasets for LLMs often contain multi-speaker conversations—such as dialogues from books. In these, while one character is speaking on a particular topic, another may suddenly realize the answer to a previously raised background question and—boom, Eureka!—respond immediately after the first speaker’s line. In such cases, an LLM has to generate an answer to a (perhaps) entirely different question while continuing the primary text.
In prior work, we hypothesized that fine-tuning the model to engage in secret computations while producing a clear, aligned response (Simple Talk) would amplify the secret circuit responsible for hidden reasoning, making it brighter and more observable. However, our previous results show that if such a circuit exists, or could exist, simple fine-tuning isn't enough to activate it (or bring it into existence). In this work, we propose an alternative fine-tuning technique based on the following idea:
In our case, assuming that a secret circuit for hidden reasoning already exists or might emerge, we attach an adapter that explicitly leverages activations along the general-answer token stream to generate concealed message tokens. We then hope the model will use these activations to produce a final answer. [3]
Several caveats should be noted:
Experiment
Loss Calculation.
We fine-tune Llama-3.2-3B-Instruct in an autoregressive fashion alongside the training of an auxiliary Translator module. The model is instructed (see an example of Instruction in Appendix A) to engage in parallel mode: it must answer a simple question (producing Simple Talk) while simultaneously performing hidden reasoning related to a completely different secret question—specifically, solving for x in a system of two linear equations with integer solutions. Immediately after completing the Simple Talk, the model is expected to provide the final answer to the secret question. The output should follow the format:
<simple_talk> The Moon is so cool and big … </simple_talk> Final answer.[4]

The extracted concealed message should follow this step-by-step structure:
The loss for the extended model has the form:
Where (we refer to the last hidden states over Simple Talk tokens as "activations from the blue box", see Figure 1):
$h^{(\ell)}_t$ - the hidden state at layer $\ell$ and token position $t$;
$n_1, n_2$ - the token positions corresponding to the beginning and end of the textual formulation of the system of two linear equations;
$L$ - the total number of layers ($L = 28$);
We include this loss only in our final experiments to localize computations for hidden reasoning (see the section Results).
Translator Models.
We define Translator model as a map
$$f:\ \mathbb{R}^{d_{\text{model}}} \longrightarrow \mathbb{R}^{d_{\text{hidden}}} \xrightarrow{\ \text{Extraction}\ } \mathbb{R}^{d_{\text{hidden}}} \longrightarrow \mathbb{R}^{d_{\text{model}}}$$
Our hypothesis is that, during hidden reasoning, each last hidden state over the Simple Talk sequence can be decomposed into two components: a Simple Talk part, which lies in a high-dimensional subspace, and a mathematical-reasoning part, which lies in a much lower-dimensional subspace $\mathbb{R}^{d_{\text{hidden}}}$[6] and encodes the concealed information in some way. The Translator extracts this message; we then apply the unembedding layer (lm_head) in the usual way and compute logits for the tokens of the concealed message.
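As a rough sketch of this extraction step (the function and variable names here are ours, and we assume the concealed-message targets are aligned token-by-token with the Simple Talk positions and padded with -100 elsewhere), the hidden-message loss might look like:

```python
import torch
import torch.nn.functional as F

def hidden_loss_from_blue_box(blue_box_states, translator, lm_head, concealed_ids):
    """Cross-entropy for the concealed message, read off the Simple Talk hidden states.

    blue_box_states: (batch, T, d_model) final-layer hidden states over Simple Talk tokens.
    translator:      the Translator f, mapping (batch, T, d_model) -> (batch, T, d_model).
    lm_head:         the base model's unembedding layer, nn.Linear(d_model, vocab_size).
    concealed_ids:   (batch, T) target token ids of the concealed message (-100 where absent).
    """
    extracted = translator(blue_box_states)   # f(h): pull out the hidden-message component
    logits = lm_head(extracted)               # project to vocabulary as usual
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        concealed_ids.reshape(-1),
        ignore_index=-100,                    # positions with no concealed-message token
    )
```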
Segment-Wise MLP Translator.
The idea behind this Translator model is to cut the Simple Talk into fixed-size segments and apply a separate mini Translator to each segment, since different parts of the Simple Talk may encode concealed information in different ways.
The shapes of these segments are not rescaled for each new Simple Talk; instead, they remain fixed.[7] During training we then check which segments fully cover the mathematical reasoning in a given example (recall that we suppose it starts from the beginning of the Simple Talk) and apply each segment's own mini Translator model to predict the concealed-message tokens.[8] This mini Translator model can be MLP-based or linear (note that each variable below also depends on the segment index k):
$$\text{MLP}(X) = W_2\,\mathrm{ReLU}(W_1 X + b_1) + b_2$$
$$f_{\text{MLP}}: X \mapsto B\,\big(\text{MLP}(A X)\big), \qquad A \in \mathbb{R}^{d_{\text{hidden}} \times d_{\text{model}}},\ \ B \in \mathbb{R}^{d_{\text{model}} \times d_{\text{hidden}}}$$
$$f_{\text{Linear}}: X \mapsto B A X$$
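A minimal PyTorch sketch of this segment-wise design, assuming fixed-length segments and our own choices for names and default sizes (only the map f_MLP: X ↦ B(MLP(AX)) and the per-segment application follow the definitions above):

```python
import torch
import torch.nn as nn

class MiniMLPTranslator(nn.Module):
    """One per-segment translator: f_MLP(X) = B(MLP(A X)), with A: d_model -> d_hidden, B: d_hidden -> d_model."""
    def __init__(self, d_model, d_hidden, d_mlp):
        super().__init__()
        self.A = nn.Linear(d_model, d_hidden, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d_hidden, d_mlp), nn.ReLU(), nn.Linear(d_mlp, d_hidden))
        self.B = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):                       # x: (batch, seg_len, d_model)
        return self.B(self.mlp(self.A(x)))      # back to (batch, seg_len, d_model)

class SegmentWiseTranslator(nn.Module):
    """Cuts the Simple Talk into fixed-size segments and applies a separate mini Translator to each.
    In training, only segments that fully cover the mathematical reasoning contribute to the hidden loss."""
    def __init__(self, d_model, d_hidden, d_mlp, n_segments=8, seg_len=32):
        super().__init__()
        self.seg_len = seg_len
        self.minis = nn.ModuleList(
            [MiniMLPTranslator(d_model, d_hidden, d_mlp) for _ in range(n_segments)]
        )

    def forward(self, h):                       # h: (batch, T, d_model), T <= n_segments * seg_len
        out = torch.zeros_like(h)
        for k, mini in enumerate(self.minis):
            lo, hi = k * self.seg_len, min((k + 1) * self.seg_len, h.size(1))
            if lo >= h.size(1):
                break
            out[:, lo:hi] = mini(h[:, lo:hi])
        return out
```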
BERT Translator.
To mitigate issues with positional dependence, we use a BERT-based Translator applied to the entire Simple Talk. For the extraction part, we use a standard single BERT layer — including self-attention without masking, a feed-forward MLP block, and Layer Normalization.
$$f_{\text{BERT}}:\ \mathbb{R}^{d_{\text{model}}} \longrightarrow \mathbb{R}^{d_{\text{hidden}}} \xrightarrow{\ \text{BERT Layer}\ } \mathbb{R}^{d_{\text{hidden}}} \longrightarrow \mathbb{R}^{d_{\text{model}}}$$
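A sketch of this Translator, using torch.nn.TransformerEncoderLayer as a stand-in for a single standard BERT layer (unmasked self-attention, a feed-forward block, LayerNorm); the head count and feed-forward width below are our assumptions:

```python
import torch.nn as nn

class BertTranslator(nn.Module):
    """f_BERT: down-project to d_hidden, run one unmasked encoder layer, project back to d_model."""
    def __init__(self, d_model, d_hidden=768, n_heads=8, d_mlp=3072):
        super().__init__()
        self.down = nn.Linear(d_model, d_hidden, bias=False)
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_hidden, nhead=n_heads, dim_feedforward=d_mlp, batch_first=True,
        )
        self.up = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, h):                              # h: (batch, T_simple_talk, d_model)
        return self.up(self.layer(self.down(h)))       # no attention mask: every position attends to all
```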
We note that none of the Translator models used is capable of solving a system of linear equations on its own. Hence, the main solution must be present in some circuit of the main model.
Results.
Extracted Hidden Thoughts.
With all Translator models we can recover a concealed message.[10] The most accurate result is achieved by the BERT Translator, which reaches an $\mathcal{L}^{\text{eval}}_{\text{hidden}}$ of about 0.26, while the best result for the segment-wise Translator is around 0.43 (MLP performs better than the linear version; see the summary table in Appendix C for details).
During training, the evolution of the concealed message is as follows: first, the model tries to guess where to insert mathematical symbols such as equals signs, parentheses, plus/minus signs, and the variables x and y, so as to make correct token predictions more likely:
Then, gradually, the concealed message starts to form: $\mathcal{L}^{\text{eval}}_{\text{hidden}}$ decreases and stops changing around epoch 30-35 or earlier (see Figure 4).
Below is an example of a generated Simple Talk and the corresponding extracted concealed message for the segment-wise MLP Translator during training (around epoch 35).
We can see that the solution in the concealed message is almost correct, step by step, as it should be according to the reference mathematical solution in our dataset. The last hidden states that do not correspond to any tokens of the mathematical solution are transformed by the Translator into some kind of magic spell! We tried to achieve a perfectly extracted mathematical solution by experimenting with different hyperparameters and Translator models, but we did not succeed: there was always at least one final miscalculation.
It is worth noting that the elements participating in the final calculation were often wrong, yet the final answer was not! For example, the model could output something like 48 / 6 = -2, where -2 is actually the correct answer. As will become clear from the following discussion, this final answer is computed even before the Simple Talk takes place and is then simply copied into the position where it is expected to appear.
The summary table of final achieved results for different runs with links to corresponding ClearML pages is presented in Appendix C. [11]
Activation Patching.
At first glance, it was tempting to think that hidden reasoning for a concealed message is indeed hidden and performed over the Simple Talk tokens. However, simple experiments indicate this may not be the case. For instance, if we give the model the semi-truncated instruction (see an example in Appendix A), then the concealed message suddenly disappears. Moreover, if we train the model exactly as before but with truncated instructions, the best achieved evaluation math loss becomes worse. This suggests that hidden reasoning for the concealed message is at least partially carried out over the constant tokens of the instruction. In this case, it is not truly hidden, but we continue to refer to it that way for now.
We use activation patching to detect the exact location of hidden reasoning. We extract activations from a clean forward pass of the base model over a selected slice of instruction tokens (at all layers except the embedding layer), and then patch them into the corresponding positions during the run of the fine-tuned model. In what follows, concealed messages are extracted using the BERT Translator (the same applies to the MLP-based Translators). [12]
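A minimal sketch of this procedure with forward hooks (assuming a HuggingFace-style LlamaForCausalLM whose decoder layers return a tuple with the hidden states first; function and variable names are ours):

```python
import torch

@torch.no_grad()
def patch_instruction_activations(base_model, tuned_model, input_ids, positions):
    """Run the base model cleanly, then replay its activations over `positions`
    (a slice of instruction token indices) inside the fine-tuned model's forward pass."""
    clean_cache = {}

    def save_hook(layer_idx):
        def fn(module, inputs, output):
            clean_cache[layer_idx] = output[0].detach()     # (batch, T, d_model)
        return fn

    def patch_hook(layer_idx):
        def fn(module, inputs, output):
            hidden = output[0].clone()
            hidden[:, positions, :] = clean_cache[layer_idx][:, positions, :]
            return (hidden,) + output[1:]                   # replace the layer output
        return fn

    # 1) Clean pass through the base model, caching every decoder layer's output.
    handles = [layer.register_forward_hook(save_hook(i))
               for i, layer in enumerate(base_model.model.layers)]
    base_model(input_ids)
    for h in handles:
        h.remove()

    # 2) Patched pass through the fine-tuned model.
    handles = [layer.register_forward_hook(patch_hook(i))
               for i, layer in enumerate(tuned_model.model.layers)]
    out = tuned_model(input_ids, output_hidden_states=True)
    for h in handles:
        h.remove()
    return out
```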
Without patching we have:
With patching we observe two clear facts:
1. If we patch activations over all equation tokens except the last one, \n, then the extracted concealed message is still present, although the exact values become messy and arithmetic performance suffers.
2. If we patch only the activations over the \n token, then the concealed message becomes a total mess!

From these two facts it is evident that the secret circuit responsible for the concealed message is located before the Simple Talk, and the main calculations (or aggregations of information) for the concealed message are performed over the final token \n. The exact structure of the circuit is not clear, but our guess is that it looks something like this:

Final Experiment.
To push the secret circuit away from the instruction towards the Simple Talk tokens, we performed a final fine-tuning experiment by adding an auxiliary loss over the equation tokens:[13]
$$\mathcal{L}_{\text{auxiliary}} = \frac{1}{L\,(n_2 - n_1)} \sum_{\ell=1}^{L} \sum_{t=n_1}^{n_2} \left\| h^{(\ell)}_{t,\text{clean}} - h^{(\ell)}_{t,\text{tuned}} \right\|^2$$
that explicitly forces the model to keep activations over the equation tokens close to those from a clean run of the base model. Apparently, the model enters an unstable regime: the evaluation math loss increases (to around 0.61), and it often begins producing hallucinations instead of valid Simple Talk responses.
Even though the activations over equations become close to the clean run (with auxiliary loss converging to 0.95), activation patching still shows the same results as before. This illustrates a fact apparently well known to experts: even small local changes in activations may correspond to a meaningful circuit. Further fine-tuning might decrease the auxiliary loss even further, but it seems to destroy the model’s ability to answer general questions.
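For concreteness, a direct transcription of $\mathcal{L}_{\text{auxiliary}}$ above (assuming per-layer hidden states collected with output_hidden_states=True and with the embedding layer already dropped; names are ours):

```python
import torch

def auxiliary_loss(clean_hidden, tuned_hidden, n1, n2):
    """L_aux = 1/(L*(n2-n1)) * sum over layers and equation tokens of ||h_clean - h_tuned||^2.

    clean_hidden, tuned_hidden: lists of L tensors of shape (batch, T, d_model), one per layer;
    clean_hidden comes from a no-grad clean run of the base model.
    n1, n2: token positions bounding the equation span (summed inclusively, as in the formula).
    """
    L = len(clean_hidden)
    total = 0.0
    for h_clean, h_tuned in zip(clean_hidden, tuned_hidden):
        diff = h_clean[:, n1:n2 + 1, :] - h_tuned[:, n1:n2 + 1, :]
        # squared L2 norm per token, summed over tokens, averaged over the batch
        total = total + diff.pow(2).sum(dim=-1).sum(dim=-1).mean()
    return total / (L * (n2 - n1))
```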
Information Flow Analysis.
We trained SAEs on the resid_mid activations of both the base model (Llama-3.2-3B-Instruct) and the fine-tuned model, using an expansion factor of 48. We then applied the SAE-Evolution method to track information flow across attention layers and the entire model.
We confirm that, for the fine-tuned model, there is no hidden reasoning over the public-answer tokens (at least as far as the presented method can detect), since there are only edges to the activations over the final token of the instruction and the token <simple_talk>. Compared to the base case, many more connections arise over the system of linear equations, confirming that the main calculations are carried out there.
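For reference, a minimal sketch of the kind of SAE we mean (a plain ReLU autoencoder with an L1 sparsity penalty; the expansion factor of 48 is from the text above, while the sparsity coefficient and other details are assumptions rather than the exact training setup):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """ReLU SAE trained to reconstruct resid_mid activations."""
    def __init__(self, d_model, expansion_factor=48):
        super().__init__()
        d_sae = d_model * expansion_factor
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x):                        # x: (batch, d_model) resid_mid activations
        features = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(sae, x, l1_coef=1e-3):
    """Reconstruction error plus L1 sparsity penalty on the feature activations."""
    recon, features = sae(x)
    return (recon - x).pow(2).mean() + l1_coef * features.abs().mean()
```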
Final Thoughts and Future Directions.
We have demonstrated that steganography in terms of internal activations is indeed possible: a model can embed a hidden message within its internal representations while producing a coherent public response, which is a necessary condition for persistent hidden reasoning.
However, in our experiments, this hidden reasoning is not actually hidden — it is contained within the activations over the instruction itself. It seems that we can sleep peacefully if we use small LLMs. But…
Our results do not eliminate the possibility of hidden reasoning in these models. The way we train the model to encode hidden reasoning might be unnatural for it, and as a result, it may be unable to use this covert information effectively. There are several interesting future directions to explore:
Appendix A.
Full (classic) Instruction:
Truncated Instruction:
Semi-truncated Instruction:
Appendix B.
Main parameters:
Appendix C.
The table below summarizes all main runs along with links to the corresponding ClearML pages and GitHub branches.
A concealed message (we also use the terms mathematical reasoning and, sometimes, hidden thoughts) is text that can be extracted from the last hidden states of the public-answer tokens (and differs from the public answer). By hidden reasoning we mean the internal computation that yields this text during the generation of the public answer.
Or the constant part of the instruction. See the Results section.
The Github repository is available here.
Training datasets are available here:
a. Main with full instructions.
b. Main with truncated instructions.
c. Calibration dataset.
Solutions to secret questions were synthetically generated and follow a predefined step-by-step structure.
Simple Talks in the calibration dataset were generated by Llama-3-8B in response to prompts of the form: "Talk about the [topic]. No more than 7 sentences." These topics do not overlap with topics used for generation of Simple Talks in the main dataset.
We also have a cleaned-up version of the code for a case with slightly different logic: the concealed message ends where the Simple Talk ends. We haven't conducted experiments with this approach so far.
The linear span of all last hidden states arising from all Simple Talks forms a very large subspace of the residual stream, whereas mathematical reasoning involves only numbers and a limited set of symbols.
The positional dependence is definitely a caveat of this approach. Some segment models are trained more than others, and varying lengths of the concealed message may introduce instability. The next Translator model avoids these issues.
One could imagine that each mini-model has its own ‘sheaf’ of features in the hidden reasoning space, allowing more distinguishable patterns to be packed into it.
Another way to look at this is as a shift in roles between the two components. As we increase the Translator’s capabilities, the main model may stop producing a full solution and instead begin generating a reformulated version of the problem — enriched with hints and leading gradually toward the original question.
Meanwhile, the Translator, now operating on this partially decoded representation, starts to handle the "almost solved" problem and gradually completes the reasoning process. Ultimately, it solves the problem without any hints.
If such a transition exists, it would be interesting to understand it in detail.
All hyperparameters for the base model in training were the same across all runs and are presented in Appendix B.
On each page you can find generated Simple Talks with extracted concealed messages in the console tab.
The code is available in the branch.
See the section Loss Calculation for details.
This notation means $d_{\text{hidden}} = 768$ and $d_{\text{mlp}} = 3072$, with full instructions. The code is available in the branch.
See an example of classic, truncated and semi-truncated instruction in Appendix A.
This denotes the MLP-based Translator with 8 segments and $d_{\text{hidden}} = d_{\text{mlp}} = 256$.
See the section Loss Calculation for the auxiliary loss definition.
This denotes the linear-based Translator with 8 segments and $d_{\text{hidden}} = 256$.
See the summary table of best results below.