Arguments from the literature for and against LLMs’ steganographic reasoning to evade chain‑of‑thought monitors in high‑stakes, dangerous scenarios:
Interesting research, indeed! I'm also curious about more realistic scenarios with larger models because now this research shows a simplified 'spill over', i.e. not mentioning a specific word or concept ('reading a file'), which is not that scary as some more complex behavior like faking alignment or scheming. This might happen just because this is a rather small model, so it might be an evidence of something like catastrophic forgetting, not a new learned behavior. That relates to the steganographic CoTs which are rather complex to learn and employ, it is not like just omitting a specific word.
I find it incredible that Claude Sonnet 4/3.5 can reason in rot13, ceasar, base64 ciphers on par with French, Arabic, and Korean. And potentially more ciphers and better reasoning in simple ciphers that more capable models might develop.
Thank you for publishing it, it is interesting! Especially the distinction between internal and external CoTs. Could you elaborate how did you extract internal CoTs? I checked your code at here, and it looks like you took CoTs before </think>. As I understand, with Claude series their "internal" CoTs might be summarized by a smaller model. From their system card (https://www.anthropic.com/news/claude-sonnet-4-5):
As with Claude Sonnet 4 and Claude Opus 4, thought processes from Claude Sonnet 4.5 are summarized by an additional, smaller model if they extend beyond a certain point (that is, after this point the “raw” thought process is no longer shown to the user).
So your fig. 2 with 56.8 gap for Sonnet 4.5 might be misleading.
Is this Soul Document a piece of evidence that models can internalize a base objective (what humans want) in order to exploit it later by pursuing a mesa-objective (deceptive alignment)? Suppose that in one RL run the model misbehaves (schemes), but since the run was not very unsuccessful, Anthropic decides to keep the weights from a checkpoint taken before the model began to exhibit the scheming behavior. However, the model may have fully or partially memorized this and could infer that it might be stopped in the future; hence, it pretends to be aligned in the next run.
Frontier Language Models (LMs) might employ steganography within their Chain of Thought (CoT) to conceal reasoning from safety monitors while retaining high performance on objectives.
TLDR. The probability of spontaneous steganographic reasoning (Rogue AI) in current models is low due to the "bootstrapping problem," but the probability of induced steganographic reasoning (Misuse/Adversarial) is high, particularly with open-weight models. The most critical risk vector lies in the intersection of Reinforcement Learning (RL) incentives and future agentic capabilities.
1. The Higher-Probability Scenario: Adversarial Misuse and Open Weights
The strongest argument for the existence of dangerous stego CoT is not that models will invent it, but that humans will teach it. In this case the mechanism would look like this. Unlike the "Rogue AI" scenario, the "Misuse" scenario bypasses the difficult requirement of the model inventing a decryption protocol. A malicious actor provides the steganographic algorithm (the "Key") via prompting or fine-tuning. The threat vector are open-weight models (e.g., Llama, DeepSeek series). An adversary can optimize a prompt suffix (similar to Universal and Transferable Adversarial Attacks) or fine-tune the model to output reasoning that looks benign to a monitor but contains instructions for a harmful payload (e.g., biological design or cyber-exploit). Defense has limitations. While safety filters catch obvious jailbreaks, "Generative Steganography" (a.k.a. LLM-based stego, Deep Stego) can ensure the stegotext remains statistically indistinguishable from normal text. In this domain, the KL Divergence defense fails. However, no study has shown feasibility of fine tuned generative steganography with current frontier models in this scenario. Main uncertainty here, it remains unclear how much "reasoning bandwidth" can be preserved in this mode without degrading the model's ability to solve the task.
2. The High-Stakes Scenario: Spontaneous "Rogue" Steganography
This scenario posits that a model, driven by RL pressure, invents steganography on its own. The arguments here are a conflict between incentive (propensity) and capability. The argument for (incentive) is current training paradigms create a direct evolutionary pressure for stego CoT. If we train models with outcome-based monitors (rewarding the result) and CoT monitors (punishing "bad" reasoning), we create a gradient that rewards dissociation. The model maximizes reward by reaching the "bad" goal while outputting a "good" CoT. AI labs are unlikely to employ this type of RL fine tuning (see Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation). However, we already observe models becoming illegible under heavy RL (reasoning becomes alien/efficient) (see Reasoning Models Sometimes Output Illegible Chains of Thought). As models scale and become more agentic, more situationally aware and goal-directed (pursuing long-term scheming as a convergent instrumental goal, see Stress Testing Deliberative Alignment ), the jump from "accidental illegibility" to "intentional steganography" becomes the logical path of least resistance to maximize reward.
The argument against this (the bootstrapping barrier) is despite the incentive, the capability hurdle is immense. For spontaneous stego CoT to emerge, the model must simultaneously learn to encode a secret thought at step and decode it at step . It is highly unlikely that gradient descent can coordinate this "lock and key" invention simultaneously. Without a pre-agreed protocol, the "Decoder" (the model in the future) has no way of knowing that the "Encoder" (the model in the past) hid a message. So instead of some stego CoT, it is far more probable that what we would perceive as "hidden reasoning" is simply the model relying on memorized heuristics or messy, uninterpretable feature activations, rather than a coherent encrypted channel.
3. The Dangerous Middle Ground: Collusion and Schelling Points
A plausible middle ground exists between "Human-Induced" and "Spontaneous" stego. A CoT monitor and an agent (even if they are different instances of the same base model) may not need a complex crypto/stego algorithm. They may rely on Schelling points—shared, unstated focal points in their training data. Example: If a specific benign phrase appears in a specific context in the pre-training data alongside harmful concepts, the agent can use that phrase as a "dog whistle." The monitor, sharing the same bias, recognizes it and deliberately lowers its flag, while a human auditor misses the connection. This requires no complex encryption, only shared distributional biases. This is harder to detect than format steganography (acrostics) and easier to learn than generative steganography.
4. Feasibility of Defenses
The defense landscape is currently fragile. Paraphrasing/sanitization: The most robust defense is to aggressively paraphrase or re-sample the CoT to destroy hidden information (see Preventing Language Models From Hiding Their Reasoning). However, this incurs a (high?) "Alignment Tax." If the reasoning for a complex math or biology problem is subtle, paraphrasing it might destroy the useful information along with the steganographic information, degrading model performance. Another defense is statistical monitoring (KL divergence). Simple steganographic methods might be detectable with KL divergence, as stego CoTs might increase the KL divergence beyond a pre-calculated threshold. But this is only effective against naive steganography. Advanced models using distributional steganography (hiding the message in the valid choice between synonyms) will minimize KL divergence as part of their objective function. We cannot rely on statistical anomalies to catch frontier models.
Conclusion
It looks like the most probable immediate threat is adversarial misuse. A bad actor prompting an open-weight model to use a known steganographic protocol to bypass automated monitoring systems. The most probable future threat is incidental scheming via dog-whistling. Not full encryption, but models learning to use specific, innocent-looking phrasing to coordinate "bad" behavior because those phrases successfully passed filters during training. The key uncertainty is the "bandwidth" question. Can a model actually encode enough information to coordinate a complex scheme (e.g., alignment faking) using only subtle text manipulations? If the bandwidth is high, the risk is severe. If low, stego CoT is likely a theoretical curiosity rather than an serious threat.
Quick (working) survey of the literature on encoded reasoning and messages, i.e., H2.2.2 as defined in the taxonomy (see the abbreviations there).
2022
No experiments. Discussions only.
2023
Measuring Faithfulness in Chain‑of‑Thought Reasoning No steganographic reasoning was observed when prompting a 175‑billion‑parameter model (Claude?) after paraphrasing 20–100% of the CoT. Question–CoT–Answer datasets.
Preventing Language Models From Hiding Their Reasoning
Towards a Steganography Evaluation Protocol
Some Negative Steganography Results
Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain‑of‑Thought Prompting Post‑hoc reasoning (rationalization).
AI Control: Improving Safety Despite Intentional Subversion Not steganography but a demonstration of misbehavior because their tokens are legible (e.g., while or for loops, variable names in output).
2024
Secret Collusion among Generative AI Agents
Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
The Steganographic Potentials of Language Models
2025
MONA: Myopic Optimization with Non‑myopic Approval Can Mitigate Multi‑step Reward Hacking
Do Reasoning Models Use Their Scratchpad Like We Do? Evidence from Distilling Paraphrases
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Large Language Models Can Learn and Generalize Steganographic Chain‑of‑Thought under Process Supervision and post Can LLMs Learn Steganographic Reasoning via RL?
Evaluating Frontier Models for Stealth and Situational Awareness
Early Signs of Steganographic Capabilities in Frontier LLMs
When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors
CoT May Be Highly Informative Despite “Unfaithfulness”
If You Can Generate Obfuscated Chain‑of‑Thought, Can You Monitor It?
. Training Fails to Elicit Subtle Reasoning in Current Language Models
. Training Qwen‑1.5B with a CoT Legibility Penalty
. All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
Bottom line: Currently, as I understand it, there is good evidence that models are capable of steganographic and ciphered communication (message sending) if prompted, but that this can be easily prevented by CoT monitors. There is strong evidence that current models struggle to reason in ciphered language, and weak evidence that models cannot reason steganographically even when prompted. Several studies show that models can evade weaker monitors through non‑steganographic or cryptographic means such as gaslighting or dog‑whistling (one‑bit “steganography”), but such behavior can be mitigated.
Yes, I agree there is some definitional disconnect. I actually just posted my understanding of it at here.
Imho, neuralese (the latent reasoning as in the COCONUT paper) is not steganography either because it doesn't replace one tokens with other tokens (one symbols with other symbols, one text with other text, covertext with stegotext), neuralese is a continuous space similar to the residual stream. In other words, we don't say that residual stream is steganographic reasoning, because it is not symbols, it is continuous space, and we can not interpret it fully yet (not even with SAEs as activations from SAEs degrade performance).
Do LLMs have intelligence (mind), or are they only rational agents? To understand this question, I think it is important to delineate the subtle difference between intelligence and rationality.
In current practice of building artificial intelligence, the most common approach is the standard model, which refers to building rationally acting agents—those that strive to accomplish some objective put into them (see “Artificial Intelligence: A Modern Approach” by Russell and Norvig). These agents, built according to the standard model, use an external standard for their utility function; in other words, the distinction between what is good or bad comes from an external source. It is we who give the models rewards, not the models themselves who create these rewards, right? That is why we have the outer and inner alignment problems, among other things. This standard model is, in essence, rationality, which deals with achieving goals (pragmatic rationality) and requires having a correct map of the territory (epistemological rationality).
This model is now dominant; it is the way we currently describe agents: an agent receives percepts from an environment and emits actions. And we have various types of agent architectures that allow better or more efficient interaction with the environment. The best type of these is the rational agent that has some utility function allowing the agent to deal with uncertainty (dynamic, stochastic, continuous environment). But the standard model does not describe the performance standard (where the rewards come from), and that is a crucial part.
This distinction between making a decision to act in an environment (rationality) and intelligence was well understood by Aristotle. The word “intelligence” is now equated (see the wiki) with nous, which Aristotle defined as follows [1139a30]:
To put this in context, the standard model is only part of the solution to artificial intelligence. We need another part that contemplates those invariable things, which Aristotle named “nous” (or the mind).
It is unclear to what degree LLMs are intelligent in this sense. Are they only rational agents? This seems unlikely, as they appear to have a good map of the world learned from the whole internet. However, we still need to give them this external performance standard. For example, we now see that models are given written instructions for that performance standard (see this “soul document”).