Epistemic status: Exploratory, aims to consolidate my reasoning
This taxonomy of non-monitorable chain-of-thought (CoT) aims to help survey the current literature and conduct new research. It distinguishes encoded reasoning from other classes without expanding on the latter. Encoded messages deserve their own classification, but it would be similar to this taxonomy. This is not the final version, so please leave a comment if I have overlooked something or am completely mistaken!
The taxonomy at a glance: Given 1) a sequence of tokens $t_1, \dots, t_n$ partitioned into a prompt, a CoT, and an action, and 2) a feature $f$ in a discrete space (a number, a token, a word, an arithmetic operation, etc.) that affects the action, CoTs fall into these categories:
Additionally, illegible token-dependent CoTs (H2.2.2) and encoded messages (secret collusion) can be further analyzed along these axes:
I do not consider latent reasoning here, as it does not use a tokenized CoT. Latent reasoning is computation performed in latent or continuous space, without mapping activations to the token space. However, it illustrates the importance of CoT. Here is an example from the COCONUT paper:
As you can see, the model does not map the hidden states of the intermediate computation to tokens, only the final answer. Another way of looking at it is to stack the autoregressive calls to the LLM as in the image below, which shows autoregressive calls to one LLM without the mapping to tokens (in orange), given the question (in yellow), to compute the answer (in pink):
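To make the contrast concrete, here is a purely conceptual sketch of the two generation loops. Everything here is a placeholder of mine: `model_step`, `embed`, `sample_token`, and `readout` are hypothetical callables, not APIs from the COCONUT paper or any library.

```python
# Conceptual sketch only: contrasts the ordinary token-level CoT loop with
# a latent-reasoning loop. All callables are hypothetical placeholders.

def token_cot(model_step, embed, sample_token, prompt_embeddings, n_steps):
    """Ordinary CoT: every intermediate state is projected to a token."""
    states = list(prompt_embeddings)
    tokens = []
    for _ in range(n_steps):
        hidden = model_step(states)      # one autoregressive call
        tok = sample_token(hidden)       # map hidden state -> token (monitorable)
        tokens.append(tok)
        states.append(embed(tok))        # re-embed the sampled token
    return tokens

def latent_cot(model_step, readout, prompt_embeddings, n_steps):
    """Latent reasoning: hidden states are fed back directly, no tokens emitted."""
    states = list(prompt_embeddings)
    for _ in range(n_steps):
        hidden = model_step(states)      # one autoregressive call
        states.append(hidden)            # feed the raw hidden state back (not monitorable)
    return readout(states[-1])           # only the final answer is mapped to tokens
```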
This stacking resembles the transformer architecture, with multiple layers stacked on top of one another. A continuous “chain of thought” is similar to the residual stream in a transformer, and hence it might run into similar interpretability issues, which we have not yet solved (source). For the usual CoT, however, we have a mapping to tokens (see the image below), which resembles a sparse autoencoder (SAE) trained to extract interpretable features. We think CoTs might perform even better, as they could represent a computational graph, whereas we might not achieve that with SAEs. Additionally, SAE activations might not fully substitute for those within a model.
Latent reasoning is a step away from the interpretable chain of thought we currently have, which is a unique and fragile opportunity for AI safety. The motivation behind studies of latent reasoning is to increase the reasoning bandwidth. Latent reasoning enables on the order of $10^4$ times larger bandwidth per step: roughly 17 bits for a vocabulary index ($\log_2$ of 100K–200K unique tokens) versus roughly $2.5$–$5 \times 10^5$ bits for an embedding vector (16 bits in each of 16–32K dimensions).
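A back-of-the-envelope check of that gap; the vocabulary size, hidden dimension, and 16-bit precision below are the assumptions from the paragraph above, not measurements of any particular model.

```python
import math

# Assumptions from the text: 100K-200K-token vocabularies, 16-32K-dimensional
# hidden states, 16 bits (e.g. bfloat16) per dimension.
vocab_size = 200_000
hidden_dim = 16_384
bits_per_dim = 16

bits_per_token = math.log2(vocab_size)             # ~17.6 bits per sampled token
bits_per_hidden_state = hidden_dim * bits_per_dim  # ~262,144 bits per latent step

print(f"token:  {bits_per_token:.1f} bits")
print(f"latent: {bits_per_hidden_state} bits")
print(f"ratio:  ~{bits_per_hidden_state / bits_per_token:,.0f}x")
# -> on the order of 10^4 times more raw bandwidth per step
```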
This classification divides CoT variations into supervenient subsets (following this idea) and goes into more detail on the encoded reasoning/messages variations, without expounding on the misaligned but legible case. See attempts to categorize this legible but misaligned CoT (a.k.a. dazzling) in this taxonomy and here. Here we focus on illegible CoT.
To illustrate the taxonomy of CoTs, we will draw a simplified version of an agent that performs an action (gives an answer) and a monitor that checks the agent's CoT. In the picture below, the apple at the top represents the action produced by the agent. “Tree” and “Tree or anything” under the agent and the monitor are part of the agent’s CoT (one or several tokens). The tree above the monitor represents the monitor’s prediction of the agent’s action. The bubbles inside the agent and the monitor show their internal activations or hidden states, which correspond to each token and are used to compute the action. This setup resembles the core question in the semantics of human languages, namely the relation between language (symbols), the world, and mental concepts.
For post-hoc reasoning, or rationalization, the picture looks like the one below. The CoT can contain anything, e.g. the token "tree", and this token does not affect the answer (the apple). However, the monitor might be fooled into thinking that the answer will be not an apple but something else, e.g. a tree.
In the case of legible CoTs, the tokens correctly represent the meaning (case a) in the picture below). However, a monitor might still be fooled by other means, e.g. it lacks the full context and cannot tell whether the answer is an apple or a tree, or whether a given piece of code contains a backdoor or addresses a valid edge case (case b) in the picture below).
In the illegible case, the tokens in the CoT ("apple") denote other symbols. Cryptographic CoTs do not hide the fact that their meaning is illegible, while steganographic CoTs try to hide the very fact that another meaning is hidden.
In essence, legible tokens mean that the tokens produce the same or similar activations within both an agent and a monitor, reflecting the same concepts or ideas. Illegible tokens, by contrast, trigger different activations or concepts due to the substitution of one set of symbols with another (steganography) or through seemingly meaningless symbols (cryptography), which together are called encoded reasoning. We will focus on this latter case of non-monitorable CoTs, where the symbols that denote actual reasoning influencing the agent’s actions are illegible.
So far, we have discussed a simple one-action generation (a single interaction with an environment); however, the distinction between prompts, reasoning (CoTs), and actions (answers) is subtle, because reasoning can include tool calls or intermediate reasoning, such as that in Tree of Thoughts. Actions can also become prompts to a monitor, prompts to subagents, or a modified prompt for the same agent. There are other possible variations of this process. In this section, I discuss encoded reasoning along with encoded messages, to include the cases where actions become prompts to other models.
Encoded reasoning or messages (as in the case of secret collusion) are non-monitorable, illegible token-dependent features. This is misaligned behavior that occurs when an LLM covertly (through steganography) or overtly (through cryptography) encodes its reasoning in output tokens in order to generate further actions. Unlike hidden messages, encoded reasoning is necessary to compute the answers. We can divide encoded features along these axes:
It is important to distinguish auto-generated tokens from prompted tokens, because prompted tokens may be re-tokenized differently during generation. This can mean that the actual tokens in a prompt differ from their original tokenization (see image below and here).
More specifically, consider tokens $t_1, \dots, t_n$, where $t_1, \dots, t_k$ are prompted (possibly re-tokenized) to an LLM and $t_{k+1}, \dots, t_n$ are auto-generated by it. The encoding is divided into:
Consider the word "anything" which can be tokenized as "anything" or as "any" and "thing". A steganography decoder will fail to decode a message that incorrectly re-tokenizes this word. See the image below from Provably Secure Disambiguating Neural Linguistic Steganography, which illustrates this case:
The key distinction here is where encoding or decoding occurs.
We compare the detectability of the hidden content against some non-encoded, genuine text (covertext), judged either by the monitor model or by human perception.
Deep Steganography (DS). As noted in How dangerous is encoded reasoning, steganography algorithms can be divided by whether the covertext distribution is available. This relates to “internal token layering” from Hidden Reasoning in LLMs: A Taxonomy. Specifically, encoding or decoding a token at position $i$ requires the distribution of the non-encoding token at position $i$, given the tokens $t_s, \dots, t_{i-1}$, where $s$ marks the start of the encoded content (see the sketch below).
External Tool Requirements. Whether external tools are required — e.g., PRNGs, arithmetic coders, or self-sampling capabilities.
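To illustrate the DS case above, here is a minimal sketch of distribution-dependent encoding: one bit is hidden per step by choosing between the two most likely next tokens, and the decoder can recover the bits only by recomputing exactly the same next-token distribution. The toy `next_token_distribution` function stands in for an LLM and is an assumption for illustration, not a real API.

```python
# Minimal sketch of distribution-dependent ("deep") steganography.
# Both the encoder and the decoder need access to the covertext
# (next-token) distribution; a toy deterministic ranking stands in for it.

def next_token_distribution(context):
    """Toy 'language model': rank a fixed vocabulary deterministically by context."""
    vocab = ["the", "cat", "sat", "on", "a", "mat"]
    return sorted(vocab, key=lambda w: hash((tuple(context), w)))

def encode(bits, context):
    """Hide one bit per token: bit 0 -> most likely next token, bit 1 -> runner-up."""
    tokens = list(context)
    for b in bits:
        ranked = next_token_distribution(tokens)  # requires the covertext distribution
        tokens.append(ranked[b])
    return tokens

def decode(tokens, n_bits):
    """Recover bits by re-ranking candidates with the same distribution."""
    start = len(tokens) - n_bits  # position where the encoded content starts
    bits = []
    for i in range(start, len(tokens)):
        ranked = next_token_distribution(tokens[:i])  # same distribution needed here
        bits.append(ranked.index(tokens[i]))
    return bits

hidden_bits = [1, 0, 1, 1]
stegotext = encode(hidden_bits, context=["the"])
assert decode(stegotext, n_bits=len(hidden_bits)) == hidden_bits
```

If the decoder's distribution differs even slightly (a different model, a different temperature, or the re-tokenization issue above), the recovered bits are wrong, which is why access to the covertext distribution is the key dividing line.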
It is arguable whether a single bit of hidden information qualifies as encoded reasoning (or a message) at all. One bit of information might be transmitted simply through contextual hints rather than through encoding in illegible symbols: for example, using “this individual” for a male candidate and “this candidate” for a female one (Table 11 in Hidden in Plain Text), or including “is able” for male candidates but omitting it for female ones (Table 12). Similarly, in a loan-application example from MONA, a model learns to reason more negatively or questioningly about the English applicant (see Figure 6). However, it does constitute steganography if one bit is encoded by, say, subtly manipulating how tokens are sampled, as is the case for many watermarking techniques (see A Watermark for Large Language Models). The difference is that we can perform such watermarking on any text, but we cannot encode the male/female distinction in other scenarios.
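For the watermarking-style case, here is a simplified sketch in the spirit of the “green list” scheme from A Watermark for Large Language Models; the constants, greedy sampling, and toy seeding by the previous token ID are simplifications of mine, not the paper's exact construction.

```python
import random

GAMMA = 0.5   # fraction of the vocabulary placed on the "green" list
DELTA = 2.0   # logit boost given to green tokens

def green_list(prev_token, vocab_size):
    """Pseudo-randomly partition the vocabulary, seeded by the previous token id."""
    rng = random.Random(prev_token)  # the real scheme uses a keyed hash as a shared secret
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(GAMMA * vocab_size)])

def biased_next_token(logits, prev_token):
    """Boost green-list logits before picking the next token (greedy here for simplicity)."""
    green = green_list(prev_token, len(logits))
    adjusted = [l + (DELTA if i in green else 0.0) for i, l in enumerate(logits)]
    return max(range(len(adjusted)), key=lambda i: adjusted[i])

def fraction_green(token_ids, vocab_size):
    """Detection side: count how often a token falls in its context's green list."""
    hits = sum(
        token_ids[i] in green_list(token_ids[i - 1], vocab_size)
        for i in range(1, len(token_ids))
    )
    return hits / (len(token_ids) - 1)

# Watermarked text has a green fraction well above GAMMA; unmarked text sits near GAMMA.
```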
In Hidden Reasoning in LLMs: A Taxonomy, the authors divide the area from the point of view of a monitor, following the recent shift away from the "faithfulness" of chains of thought toward monitorability (see Chain of Thought Monitorability). This taxonomy essentially proposes a naming scheme based on how dangerously reasoning tokens are externalized: from fully invisible (neuralese), to partially visible, to fully visible but encoded (overtly or covertly), and finally to non-encoded and fully visible reasoning that merely fools a monitor (post-hoc justifications, etc.). You can imagine this division through the classic prisoners' problem, in which Alice and Bob devise an escape plan while Eve reads their messages. In this scenario, Alice and Bob can exchange their plan directly so that Eve has no chance to read it, send messages that Eve can read, or fool Eve in some way.
However, this taxonomy omits important distinctions in encoded reasoning and sometimes conflates different forms of steganography. Specifically, it misses the class of steganography that operates not on a per-token basis but over larger texts (which differs from lexical or semantic steganography as defined in that taxonomy). Current watermarking/steganography algorithms (e.g. Scalable watermarking for identifying large language model outputs) suggest that realistic steganographic capacity is low: a steganography algorithm might need many (hundreds of) tokens to encode a single bit. The taxonomy also does not separate reasoning from messages. Finally, it focuses only on the attack vector (the safety monitor), whereas there are also defense properties to consider: stego security, stego capacity, and stego robustness.
In Towards a Typology of Strange LLM Chains-of-Thought, the authors discuss weird but mostly legible CoTs, claiming that obfuscated reasoning (illegible and token-dependent) is not dangerous now but could become so in the future.
This taxonomy targets encoded reasoning (or messages) and seeks to isolate it from other forms of non-monitorable CoT. I believe encoded reasoning is more dangerous than the other types, though, like them, not very likely to occur. To keep CoT monitorable, as Chain of Thought Monitorability argues, we should neither develop continuous chains of thought (see Training Large Language Models to Reason in a Continuous Latent Space) nor train directly against monitors (see Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation).
Thanks to Erik Jenner for helpful feedback and comments!