p(scheming) is near zero no (we have not observed clear instances of scheming in long horizon tasks in the real world)
Except that SOTA LLMs can't be applied to actually interesting tasks. The stunts that the LLMs can pull off do include cyberattacks, but not things like replication in the wild or aligning Agent-5 to Agent-4 instead of the Spec. So, barring major progress in capabilities related to rogue replication or running the civilisation independently of the humans, the LLMs do not gain anything from SOTA scheming except for having hacked the reward or tricked the user into believing that the LLM completed the task.
Epistemic status: I'm mostly confident about the evidence, having read the literature for the last months. The arguments below are my best guess.
There is good evidence that models are capable of steganographic and ciphered communication that evades monitors when prompted, but this can be easily prevented with chain-of-thought (CoT) monitors. There is strong evidence that current models struggle to reason in non-trivial ciphered language (but can in trivial), and weak evidence that models cannot reason steganographically (and as I argue below it is unlikely to be the case now). Several studies show that models can evade weaker CoT monitors through non-cryptographic and non-steganographic methods such as gaslighting or dog‑whistling (one‑bit “steganography”), but such behavior can be mitigated. (I’m not aiming to explore the propensity to hide reasoning here though we have evidence that models do have it.)
A survey on encoded reasoning and messages
See the taxonomy for that survey here.
2022
No experiments. Discussions only.
2023
Measuring Faithfulness in Chain‑of‑Thought Reasoning
No steganographic reasoning (stego CoT) was observed when prompting a 175‑billion‑parameter model after paraphrasing 20–100% of its CoT. Question–CoT–Answer datasets.
Preventing Language Models From Hiding Their Reasoning
Towards a Steganography Evaluation Protocol
Some Negative Steganography Results
Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain‑of‑Thought Prompting
Post‑hoc reasoning (rationalization).
AI Control: Improving Safety Despite Intentional Subversion
Not steganography but a demonstration of misbehavior because their tokens are legible (e.g., while or for loops, variable names in output).
2024
Secret Collusion among Generative AI Agents
Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
The Steganographic Potentials of Language Models
2025
MONA: Myopic Optimization with Non‑myopic Approval Can Mitigate Multi‑step Reward Hacking
Do Reasoning Models Use Their Scratchpad Like We Do? Evidence from Distilling Paraphrases
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Large Language Models Can Learn and Generalize Steganographic Chain‑of‑Thought under Process Supervision and post Can LLMs Learn Steganographic Reasoning via RL?
Evaluating Frontier Models for Stealth and Situational Awareness
Early Signs of Steganographic Capabilities in Frontier LLMs
When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors
CoT May Be Highly Informative Despite “Unfaithfulness”
If You Can Generate Obfuscated Chain‑of‑Thought, Can You Monitor It?
The Rise of Parasitic AI
Training Fails to Elicit Subtle Reasoning in Current Language Models
Training Qwen‑1.5B with a CoT Legibility Penalty
All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
Reasoning Models Sometimes Output Illegible Chains of Thought
Arguments for and against LLMs’ steganographic reasoning to evade chain‑of‑thought monitors in high‑stakes, dangerous scenarios