Our preliminary experiments show that current proprietary models can perform prompted steganography with non-trivial messages remaining undetected in most cases (up to 65% accuracy, 24 bits of entropy). Open-source models can be trained to encode hidden messages (up to 60% accuracy, 4 bits of entropy), potentially transmitting larger messages perfectly securely (undetected) without chain-of-thought or explicit common knowledge.
The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to detecting and thwarting unaligned AI agents, and undermines the faithfulness of LLM reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic…