Our preliminary experiments show that current proprietary models can perform prompted steganography with non-trivial messages remaining undetected in most cases (up to 65% accuracy, 24 bits of entropy). Open-source models can be trained to encode hidden messages (up to 60% accuracy, 4 bits of entropy), potentially transmitting larger messages perfectly securely (undetected) without chain-of-thought or explicit common knowledge.
The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to detecting and thwarting unaligned AI agents, and undermines the faithfulness of LLM reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic…