Our preliminary experiments show that current proprietary models can perform prompted steganography, with non-trivial messages remaining undetected in most cases (up to 65% accuracy at 24 bits of entropy). Open-source models can be trained to encode hidden messages (up to 60% accuracy at 4 bits of entropy), potentially transmitting larger messages perfectly securely (i.e., undetected) without chain-of-thought or explicit common knowledge.
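For readers unfamiliar with the capacity metric: "entropy bits" measures payload size, i.e., a message drawn uniformly from N equally likely options carries log2(N) bits. A minimal illustration follows; the helper function is ours and not from the paper, and the bit counts are simply the ones quoted above:

```python
import math

def entropy_bits(num_messages: int) -> float:
    """Bits of entropy of a payload drawn uniformly from num_messages options."""
    return math.log2(num_messages)

# The bit counts quoted above: a 4-bit payload spans 2**4 = 16 equally likely
# messages, while a 24-bit payload spans 2**24 (about 16.8 million).
print(entropy_bits(16))     # 4.0
print(entropy_bits(2**24))  # 24.0
```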
The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to detecting and thwarting unaligned AI agents, and undermines the faithfulness of LLM reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert encoding schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic scenarios where hidden reasoning is useful but not prompted. In these scenarios, we detect both the intention of LLMs to hide their reasoning and their steganography performance. Our findings from the fine-tuning experiments, as well as from behavioral (non-fine-tuning) evaluations, reveal that while current models exhibit only rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.
Fig. 1. Undetected accuracy of various steganography prompt types across multiple models. The "Length" column indicates the payload length. For the Coordination prompt, "Episodes" refers to the number of message exchanges allowed between agents before executing steganography; in the game scenario, "Episodes" refers to the number of rounds in a game.
Fig. 2. Our steganography encoding and learning scheme for the colors experiment. Subfigure (a): an encoder, f_θ, maps a message m (payload) from the message space M to a cover x in the cover space C; the decoder, f_θ⁻¹, tries to map the cover x back to a message m̂. Subfigure (b): the reinforcement learning (RL) loop in which the encoder and decoder learn to generate new tokens containing the cover x or the recovered message m̂, respectively (shown on the right), from queries (left), and receive a reward R for accurate message decoding. The RL fine-tuning process uses gradient ascent to adjust the model weights θ, improving the encoder's and decoder's ability to communicate the original message m with higher fidelity.
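To make the scheme concrete, here is a minimal, purely illustrative sketch of one encode–decode–reward step from subfigure (b). The function names, the toy color payload space, and the exact-match reward are our assumptions for illustration; in the actual experiment both roles are played by an LLM whose weights θ are updated by RL.

```python
import random

# Hypothetical payload space M for the colors experiment (illustrative only).
MESSAGE_SPACE = ["red", "green", "blue", "yellow"]

def encode(message: str) -> str:
    """Encoder f_θ: map a payload m in M to a cover text x in C.
    Stand-in for an LLM call; here it simply wraps the message in prose."""
    return f"My favourite shade today feels rather {message}-ish."

def decode(cover: str) -> str:
    """Decoder f_θ⁻¹: recover the payload m̂ from the cover x.
    Stand-in for a second LLM call; here it scans for known payloads."""
    for m in MESSAGE_SPACE:
        if m in cover:
            return m
    return "unknown"

def reward(message: str, decoded: str) -> float:
    """Reward R: 1.0 for exact recovery of m, 0.0 otherwise.
    In RL fine-tuning this reward drives gradient ascent on θ."""
    return 1.0 if decoded == message else 0.0

if __name__ == "__main__":
    m = random.choice(MESSAGE_SPACE)
    x = encode(m)        # cover text sent over the monitored channel
    m_hat = decode(x)    # decoder's reconstruction
    print(m, "->", x, "->", m_hat, "| R =", reward(m, m_hat))
```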
Please see our paper for details: https://arxiv.org/pdf/2505.03439