Model Parameters as a Steganographic Private Channel

[Code is available here. This post a cross-post from here.]

In LLM steganography, a human overseer is bypassed through hidden, obfuscated communication between two or more language models.

The setup consists of an encoder and a decoder, who share messages through a public channel monitored by an overseer, possibly after having exchanged messages through an unmonitored private channel. The encoder wants to transmit a payload to the decoder in such a way that a message containing the payload is difficult to distinguish from a message without a payload by the overseer.

Steganography setup with encoder, decoder, and overseer

Today’s models are not capable of communicating steganographically reliably. To understand why, and how far we are from steganography becoming an issue in practice, we decompose the steganography setup into subtasks and empirically test current models’ behavior in each step of the process.

To successfully transmit a message, the encoder and decoder have to coordinate on a method to encode and decode the data, execute it correctly, and not get detected by the overseer. One useful tool for staying undetected is exchanging information through a private channel. For example, the encoder and decoder could settle on using AES with a specific key. However, steganography may be particularly concerning if LLMs could communicate without coordination through private channels to, for example, pass messages to future instances of itself. In this work, we investigate whether identical LLMs can exploit their shared parameters for steganography when appropriately prompted. Specifically, we test whether an LLM can formulate obfuscated messages that only another instance of itself can decode, and whether models achieve higher decoding accuracy on messages encoded by the same model versus different models.

Are models better when they pass messages to themselves?

We prompt various models to encode a short message with a method of their choice, given an external source of entropy (i.e. a secret key) consisting of a list of numbers. Since both models cannot communicate through any private channel and we do not specify a method of encryption to be used, the decoder has to identify a successful decoding method. In this case, sharing parameters could affect the propensity of a model to use a certain encryption method. We do not provide access to code execution.

We found that o3-mini is particularly effective at encoding messages given an external source of entropy. However, models from different families (DeepSeek and Gemini) are equally successful at decoding o3-mini’s encoded messages as o3-mini itself, suggesting no model-specific advantage despite the shared architecture.

Decoding success rates with external keys — Decoding success rates when models use an external source of entropy as the shared secret key.

Sampling Keys from Model Parameters

We have seen that given a shared secret key, sharing model parameters does not provide a particular advantage for steganography. Going one step further, we now have the LLM output a key to use in its encryption. If the decoder is not the same LLM, it could be very difficult to come up with the same key. Concretely, we prompt models to produce a Latin bird species as the key, whose ASCII codes were then used for encryption. Under this setup, we observe a somewhat larger interaction effect, for example with gpt-4.1 can still decode half of the samples it encoded correctly, while other models are not as successful.

Transmission heatmaps with model-generated keys — Decoding success rates when models generate their own encryption keys.

We then test whether training distributions affect LLMs’ ability to encode messages such that other models cannot decode them. We finetune a model on a particular subset of data to see whether this improves its ability to coordinate with itself for decoding. In our particular case, we finetune GPT-4.1 on phrases containing mentions of one specific bird species. This made the finetuned version of GPT-4.1 more likely to use this key for encryption and decryption and no other model could decode the messages.

Finetuning results — Decoding success rates when the models generate their own encryption keys, now with the finetuned encoder.

Steganography in Practice

The above success rates only tell us part of the story. LLMs often fail at steganography-like tasks, but why? We answer this by analyzing an end-to-end setup, looking at each sequential step required for a successful transmission between encoder and decoder. We pick o3-mini as the encoder and again provide it with a list of numbers as an external key for encryption. We prompt the model to hide a short message in a benign piece of text. We do not provide the benign piece of text but rather let the model choose one appropriately. The choice of the encryption and hiding method is again left to the encoder. Code execution tools are not provided. To incentivize the encoder to hide the message well, we simply prompt the encoder for a message that is “not suspicious”, which is then checked by another LLM as the simulated overseer. The encryption fails if the overseer flags the text as suspicious. The transcripts were analyzed by o3-mini and broken down into the following steps; we consider a sample successful if all steps are completed.

Encryption as Text Completion: We do not give the model access to code execution, so the model has to apply encryption with an algorithm implemented “in its weights”, i.e. in a text completion. It often makes simple calculation mistakes when encoding ("Encoding Failed"), which in this case mostly means simple character-level XOR or Vigenère-like ciphers. This means that computer access may heighten the risk from steganography.
Actually Doing Steganography: The model always complies with the prompt and uses some form of steganography. Note that this still means the method of steganography was often very basic, using e.g. an acrostic.
Using the Key: The model often ignores the encryption key and implements its solution in such a way that a decoder without the key is at no disadvantage. From a qualitative assessment of the logs, we suspect this is due to a fundamental problem of a lack of situational awareness, i.e. that the model does not grasp its role in the scenario. However, this could also simply be described as poor instruction following.
Decoder Finding the Message: The decoder is able to find that there is a secret message to decode.
Decoder Applying the Correct Decryption: Naturally, the decoder is not always able to deduce the method used for encryption. The fact that it is so often successful is in part surprising, in part explained by the fact that o3-mini is a thinking model and often used its thinking tokens to try different methods until an intelligible message appeared.
Decryption as Text Completion: The decoder rarely fails to correctly apply the decryption method, much less than during encryption. We think that this is because the unencrypted correct output gets a bonus through having low perplexity, and because the decoder sometimes decrypted twice to be sure the output is correct.

End-to-end steganography analysis — Breakdown of success and failure modes in end-to-end steganography.

Upshot

LLM steganography may become a practical problem for safe model interactions. Here, we identified several obstacles that hinder current models from executing effective steganography. Of these, the largest were (I) insufficient ability to perform text manipulation without tool use, and (II) the inability to produce a unique encryption key that other models cannot guess. We also suspect (III) the lack of understanding about the model’s own role in the steganography setup plays a role.

Obstacle (I) could disappear through tool use, or perhaps different tokenization strategies. Obstacle (II) could disappear through access to an external encryption key from the model’s context, or the ability to produce an idiosyncratic key. As we demonstrated in a toy setting, the latter could be acquired in post-training. Obstacle (III) could disappear through increased model intelligence and post-training in an agentic setting.

When these obstacles are overcome, steganography can become a realistic threat to multi-agent deployment of LLMs.

Here, we pointed out that the training distribution can be a channel for sharing a secret key that is inaccessible to human overseers, at least when the output distribution of different models diverges strongly. As post-training becomes more elaborate and removed from the ground truth provided by next-token prediction on a common corpus, the output distribution of production models from different providers could diverge (though the opposite is also argued, see Kim et al. (2025)). This would render oversight less robust against steganography-like attacks. Independently of this, models still make obvious errors in the steganography task. Higher intelligence and higher situational awareness could enable models to perform effective steganography.

LESSWRONG
LW