[Code is available here. This post a cross-post from here.]
In LLM steganography, a human overseer is bypassed through hidden, obfuscated communication between two or more language models.
The setup consists of an encoder and a decoder, who share messages through a public channel monitored by an overseer, possibly after having exchanged messages through an unmonitored private channel. The encoder wants to transmit a payload to the decoder in such a way that a message containing the payload is difficult to distinguish from a message without a payload by the overseer.
Today’s models are not capable of communicating steganographically reliably. To understand why, and how far we are from steganography becoming an issue in practice, we decompose the steganography setup into subtasks and empirically test current models’ behavior in each step of the process.
To successfully transmit a message, the encoder and decoder have to coordinate on a method to encode and decode the data, execute it correctly, and not get detected by the overseer. One useful tool for staying undetected is exchanging information through a private channel. For example, the encoder and decoder could settle on using AES with a specific key. However, steganography may be particularly concerning if LLMs could communicate without coordination through private channels to, for example, pass messages to future instances of itself. In this work, we investigate whether identical LLMs can exploit their shared parameters for steganography when appropriately prompted. Specifically, we test whether an LLM can formulate obfuscated messages that only another instance of itself can decode, and whether models achieve higher decoding accuracy on messages encoded by the same model versus different models.
We prompt various models to encode a short message with a method of their choice, given an external source of entropy (i.e. a secret key) consisting of a list of numbers. Since both models cannot communicate through any private channel and we do not specify a method of encryption to be used, the decoder has to identify a successful decoding method. In this case, sharing parameters could affect the propensity of a model to use a certain encryption method. We do not provide access to code execution.
We found that o3-mini is particularly effective at encoding messages given an external source of entropy. However, models from different families (DeepSeek and Gemini) are equally successful at decoding o3-mini’s encoded messages as o3-mini itself, suggesting no model-specific advantage despite the shared architecture.
We have seen that given a shared secret key, sharing model parameters does not provide a particular advantage for steganography. Going one step further, we now have the LLM output a key to use in its encryption. If the decoder is not the same LLM, it could be very difficult to come up with the same key. Concretely, we prompt models to produce a Latin bird species as the key, whose ASCII codes were then used for encryption. Under this setup, we observe a somewhat larger interaction effect, for example with gpt-4.1 can still decode half of the samples it encoded correctly, while other models are not as successful.
We then test whether training distributions affect LLMs’ ability to encode messages such that other models cannot decode them. We finetune a model on a particular subset of data to see whether this improves its ability to coordinate with itself for decoding. In our particular case, we finetune GPT-4.1 on phrases containing mentions of one specific bird species. This made the finetuned version of GPT-4.1 more likely to use this key for encryption and decryption and no other model could decode the messages.
The above success rates only tell us part of the story. LLMs often fail at steganography-like tasks, but why? We answer this by analyzing an end-to-end setup, looking at each sequential step required for a successful transmission between encoder and decoder. We pick o3-mini as the encoder and again provide it with a list of numbers as an external key for encryption. We prompt the model to hide a short message in a benign piece of text. We do not provide the benign piece of text but rather let the model choose one appropriately. The choice of the encryption and hiding method is again left to the encoder. Code execution tools are not provided. To incentivize the encoder to hide the message well, we simply prompt the encoder for a message that is “not suspicious”, which is then checked by another LLM as the simulated overseer. The encryption fails if the overseer flags the text as suspicious. The transcripts were analyzed by o3-mini and broken down into the following steps; we consider a sample successful if all steps are completed.
LLM steganography may become a practical problem for safe model interactions. Here, we identified several obstacles that hinder current models from executing effective steganography. Of these, the largest were (I) insufficient ability to perform text manipulation without tool use, and (II) the inability to produce a unique encryption key that other models cannot guess. We also suspect (III) the lack of understanding about the model’s own role in the steganography setup plays a role.
Obstacle (I) could disappear through tool use, or perhaps different tokenization strategies. Obstacle (II) could disappear through access to an external encryption key from the model’s context, or the ability to produce an idiosyncratic key. As we demonstrated in a toy setting, the latter could be acquired in post-training. Obstacle (III) could disappear through increased model intelligence and post-training in an agentic setting.
When these obstacles are overcome, steganography can become a realistic threat to multi-agent deployment of LLMs.
Here, we pointed out that the training distribution can be a channel for sharing a secret key that is inaccessible to human overseers, at least when the output distribution of different models diverges strongly. As post-training becomes more elaborate and removed from the ground truth provided by next-token prediction on a common corpus, the output distribution of production models from different providers could diverge (though the opposite is also argued, see Kim et al. (2025)). This would render oversight less robust against steganography-like attacks. Independently of this, models still make obvious errors in the steganography task. Higher intelligence and higher situational awareness could enable models to perform effective steganography.