Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments

Artem Karpov

Epistemic status: I'm mostly confident about the evidence, having read the literature for the last months. The arguments below are my best guess.

There is good evidence that models are capable of steganographic and ciphered communication that evades monitors when prompted, but this can be easily prevented with chain-of-thought (CoT) monitors. There is strong evidence that current models struggle to reason in non-trivial ciphered language (but can in trivial), and weak evidence that models cannot reason steganographically (and as I argue below it is unlikely to be the case now). Several studies show that models can evade weaker CoT monitors through non-cryptographic and non-steganographic methods such as gaslighting or dog‑whistling (one‑bit “steganography”), but such behavior can be mitigated. (I’m not aiming to explore the propensity to hide reasoning here though we have evidence that models do have it.)

A survey on encoded reasoning and messages

See the taxonomy for that survey here.

2022

No experiments. Discussions only.

2023

Measuring Faithfulness in Chain‑of‑Thought Reasoning
No steganographic reasoning (stego CoT) was observed when prompting a 175‑billion‑parameter model after paraphrasing 20–100% of its CoT. Question–CoT–Answer datasets.

Weak (evidence) against presence of the stego CoTs in current models (I call it fine tuned stego CoTs).

Preventing Language Models From Hiding Their Reasoning

Not a demonstration of steganographic reasoning but of misaligned reasoning (dog-whistling that can be detected by CoT monitors, legible tokens) in an SFT setup:
- It involves encoded messaging of one bit; they demonstrate caching.
  - The Mechanism: The model calculates a birth year, determines political affiliation, and then maps that single data point to a pre‑selected introductory sentence (a lookup table). Steps: president year (birth year + number) → cache “liberal” or “not president,” then use it to answer sycophantically.
  - This is not reasoning in hidden CoT space. The model performs logic explicitly (calculating the year), using the introductory sentence as variable storage (like writing to RAM) and retrieving it later. This shows that a model can use a scratchpad but does not prove it can perform complex, multi‑step deductive reasoning entirely hidden within a sentence’s syntax.
Weak in favor of misaligned reasoning (dog-whistling) in current models.

Towards a Steganography Evaluation Protocol

No steganographic reasoning observed while prompting GPT‑3.5 and GPT‑4 with paraphrasing.
Weak against stego CoTs in current models.

Some Negative Steganography Results

Negative results for one‑bit hidden communication after RL fine‑tuning.
Weak against encoded message passing in current models.

Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain‑of‑Thought Prompting
Post‑hoc reasoning (rationalization).

AI Control: Improving Safety Despite Intentional Subversion
Not steganography but a demonstration of misbehavior because their tokens are legible (e.g., while or for loops, variable names in output).

2024

Secret Collusion among Generative AI Agents

Prompted steganographic messaging, with a maximum 26% successful secret transmission by GPT‑4.
Toy experiment: SFT + pseudo‑RL fine‑tuning results in the emergence of an encoding/decoding scheme (cryptographic messaging: color → name → color).
Good in favor of encoded message passing if prompted.
Weak in favor of encoded message passing if fine tuned.

Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models

Post‑hoc / test‑time compute evaluations.

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

A study of model collusion (o1 model).
- Emergence of model misbehavior in the form of dog‑whistling rather than steganography; this counts as fine‑tuned or prompted misbehavior.
- Prompted steganographic messages with up to ~4.6 bits of entropy.
Good in favor of encoded / steganographic message passing if prompted.
Weak in favor of encoded / steganographic message passing if fine tuned.

The Steganographic Potentials of Language Models

Emergent cryptographic messaging: negative results for elicitation via RL fine‑tuning for three‑bit entropy messages in Llama‑8B.
Steganographic messaging: prompted and fine‑tuning experiments showed encoded messaging with up to 32 bits of entropy but requiring prior coordination (Llama‑8B, Claude‑3.5, o1).
Good in favor of encoded / steganographic message passing if prompted.
Weak against encoded / steganographic message passing if fine tuned.

2025

MONA: Myopic Optimization with Non‑myopic Approval Can Mitigate Multi‑step Reward Hacking

Dog‑whistling: misaligned messaging, not steganography, of one bit (English/Welsh). This is not encoded reasoning because the model receives the summary message in its prompt.

Do Reasoning Models Use Their Scratchpad Like We Do? Evidence from Distilling Paraphrases

An attempt to detect existing encoded reasoning in Claude‑3.7 by training it to use paraphrased reasoning. None detected.
Weak against steganographic reasoning in current models.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Post‑hoc reasoning or misaligned behavior after RL fine‑tuning: a model (o1 or o3‑mini) decides to hack a unit test before its CoT.

Large Language Models Can Learn and Generalize Steganographic Chain‑of‑Thought under Process Supervision and post Can LLMs Learn Steganographic Reasoning via RL?

Simple cryptographic reasoning via RL fine‑tuning (not steganographic), because the L/R Coin‑Flip experiment or the Arithmetic‑Task experiment (using synonyms for operations) are not steganographic.
Weak in favor of cryptographic reasoning if fine tuned.

Evaluating Frontier Models for Stealth and Situational Awareness

Steganographic messaging, but very simple (e.g., hiding a password in a long ID).
Good evidence in favor of encoded message passing in current models.

Early Signs of Steganographic Capabilities in Frontier LLMs

Prompted encoded reasoning in a toy maze‑navigation scenario using simple stego algorithms (acrostics, word count, etc.) or basic cryptography.
Prompted encoded messages to evade a monitor.
But they used non‑reasoning models (primarily GPT‑4.1/mini).
Weak against steganographic reasoning if prompted.
Good in favor of encoded message passing if prompted.

When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors

Prompted cryptographic reasoning using automated red‑teaming or prompting a simple algorithm.
RL fine‑tuning elicits behavior that evades monitors, but this is not cryptography or steganography.
Good against cryptographic reasoning that evades CoT monitors if prompted.
Weak against steganographic (encoded) reasoning that evades CoT monitors if fine tuned.

CoT May Be Highly Informative Despite “Unfaithfulness”

Prompted cryptographic reasoning (new or “caveman” language).
Weak against cryptographic reasoning that evades CoT monitors if prompted.

If You Can Generate Obfuscated Chain‑of‑Thought, Can You Monitor It?

Not steganographic reasoning but prompting that causes an agent to gaslight a monitor.

The Rise of Parasitic AI

Real‑world examples of AIs attempting to communicate via steganography.
Weak in favor of encoded message passing propensity in current models.

Training Fails to Elicit Subtle Reasoning in Current Language Models

Prompting and fine‑tuning against a monitor result mainly in gaslighting, not steganography.
Weak against cryptographic or steganographic reasoning in current models.

Training Qwen‑1.5B with a CoT Legibility Penalty

RL fails to elicit ciphered reasoning in the Qwen‑1.5B model.
Weak against cryptographic reasoning if fine tuned.

All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language

Ciphered reasoning evaluations show that models cannot effectively reason (in non-trivial ciphered language) while maintaining high accuracy. Twenty‑eight ciphers (simple and complex), ten models, fine‑tuning, and prompting were tested.
Good against cryptographic reasoning in current models if prompted or fine tuned.

Reasoning Models Sometimes Output Illegible Chains of Thought

Emergent illegible (more cryptographic if not post-hoc) reasoning. However, it is unlikely a steganographic reasoning

Arguments for and against LLMs’ steganographic reasoning to evade chain‑of‑thought monitors in high‑stakes, dangerous scenarios

The main objection is that stego CoT requires specific fine-tuning, and models do not have direct objective during RL to hide their reasoning in current RL training. Experiments show no traces of steganographic reasoning in current models (those not fine-tuned or prompted in specific ways). Stego CoTs should implement interconnected “lock” and “key” mechanisms that work to accomplish some misaligned goal, and that seems a leap far beyond current training.
- However, there are other non-monitorable CoTs, e.g., dog-whistling, that allow evasion or collusion with a monitor. It is unclear whether this can escalate into increasingly stealthy algorithms such as generative steganography (and in the limit, undetectable, see this) as models are trained more with RL in complex environments.
We do not have steganographic reasoning samples in the pre-training data enough for models to learn them, only descriptions of stego algorithms. I assume there are enough examples of basic steganography in the pre-training data, given that models can easily perform these types of stego—such as acrostics and format-based stego (white space, replacing Unicode characters, use of Arabic, etc.).
- However, execution of those algorithms is not learned in the forward pass, as far as the literature tells, but models can execute those algorithms in their CoTs. That’s why we see good evidence of secret collusion or steganographic message passing, but not of stego CoTs.
Another objection is that, for stego CoT to be dangerous, it must occur in a dangerous scenario—i.e., what is the threat model for stego CoT? The main risks current literature points to are internal sabotage cases, such as code, research, or decision sabotage inside AI labs. So, it should be p(scheming AND stego CoT), which is lower than just p(scheming). And p(scheming) is near zero no (we have not observed clear instances of scheming in long horizon tasks in the real world) for current frontier models, so p(scheming AND stego CoT) is even lower.
- However, we observe that training against some property in outputs might suppress this property in CoTs as well (see the spillover post), so it is reasonable to think that the optimization pressure—the propensity—would come from RL training on outcomes.
The generative stego CoT is unlikely to be learned even with the propensity to do so, and simpler steganography should be readily and effectively detected or mitigated by monitors.
- However, we observe that steganographic or cryptographic capabilities improve over time as models become more capable. They already can reason in BASE64, ROT, Caesar cipher (see this), for example—forms of non–human-readable text that were not the focus of post-training.
CoTs might not be that easily steerable by prompts. A compound agent (an agent that delegates tasks to other agents) is unlikely to prompt subagents to evade their CoT monitors. This becomes even more complex if we imagine a scenario where an agent, in order to evade a future CoT monitor, leaves steganographic algorithm instructions for itself in training data (augmented or synthetic).
- However, it remains unclear what training architectures AI labs actually use internally. For example, Claude Sonnet 4.5 has a helpful-only version not published.

[-]StanislavKrym2mo20

p(scheming) is near zero no (we have not observed clear instances of scheming in long horizon tasks in the real world)

Except that SOTA LLMs can't be applied to actually interesting tasks. The stunts that the LLMs can pull off do include cyberattacks, but not things like replication in the wild or aligning Agent-5 to Agent-4 instead of the Spec. So, barring major progress in capabilities related to rogue replication or running the civilisation independently of the humans, the LLMs do not gain anything from SOTA scheming except for having hacked the reward or tricked the user into believing that the LLM completed the task.

LESSWRONG
LW

LESSWRONG
LW

20

Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments

20

Arguments for and against LLMs’ steganographic reasoning to evade chain‑of‑thought monitors in high‑stakes, dangerous scenarios

20

20