Epistemic status: Exploratory, aims to consolidate my reasoning
This taxonomy of non-monitorable chain-of-thought (CoT) aims to help survey the current literature and conduct new research. It distinguishes encoded reasoning from other classes without expanding on the latter. Encoded messages deserve their own classification, but it would be similar to this taxonomy. This is not the final version, so please leave a comment if I have overlooked something or am completely mistaken!
The taxonomy at a glance: Given 1) a sequence of tokens $t_1, \dots, t_n$ partitioned into a prompt, a CoT, and an action, and 2) a feature $f$ in a discrete space (a number, a token, a word, an arithmetic operation, etc.) that affects the action, CoTs fall into these categories:
Additionally, illegible token-dependent CoTs (H2.2.2) and encoded messages (secret collusion) can be further analyzed along these axes:
I do not consider latent reasoning here, as it does not use a tokenized CoT. Latent reasoning is computation performed in latent or continuous space, without mapping activations to the token space. However, it illustrates the importance of CoT. Here is an example from the COCONUT paper:
As you can see, the model does not map the hidden states of the intermediate computation to tokens, only the final answer. Another way of looking at it is to stack the autoregressive calls to the LLM as in the image below, which shows autoregressive calls to one LLM without the mapping to tokens (in orange), given the question (in yellow), to compute the answer (in pink):
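To make the contrast concrete, here is a purely conceptual sketch of the two generation loops. Everything here is a placeholder of mine: `model_step`, `embed`, `sample_token`, and `readout` are hypothetical callables, not APIs from the COCONUT paper or any library.

```python
# Conceptual sketch only: contrasts the ordinary token-level CoT loop with
# a latent-reasoning loop. All callables are hypothetical placeholders.

def token_cot(model_step, embed, sample_token, prompt_embeddings, n_steps):
    """Ordinary CoT: every intermediate state is projected to a token."""
    states = list(prompt_embeddings)
    tokens = []
    for _ in range(n_steps):
        hidden = model_step(states)      # one autoregressive call
        tok = sample_token(hidden)       # map hidden state -> token (monitorable)
        tokens.append(tok)
        states.append(embed(tok))        # re-embed the sampled token
    return tokens

def latent_cot(model_step, readout, prompt_embeddings, n_steps):
    """Latent reasoning: hidden states are fed back directly, no tokens emitted."""
    states = list(prompt_embeddings)
    for _ in range(n_steps):
        hidden = model_step(states)      # one autoregressive call
        states.append(hidden)            # feed the raw hidden state back (not monitorable)
    return readout(states[-1])           # only the final answer is mapped to tokens
```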
This stacking resembles the transformer architecture, with multiple layers stacked on top of one another. A continuous “chain of thought” is similar to the residual stream in a transformer, and hence it might run into similar interpretability issues, which we have not yet solved (source). For the usual CoT, however, we have a mapping to tokens (see the image below), which resembles a sparse autoencoder (SAE) trained to extract interpretable features. We think CoTs might perform even better, as they could represent a computational graph, whereas we might not achieve that with SAEs. Additionally, SAE activations might not fully substitute for those within a model.
Latent reasoning is a step away from the interpretable chain of thought we currently have, which is a unique and fragile opportunity for AI safety. The motivation behind studies of latent reasoning is to increase the reasoning bandwidth. Latent reasoning enables on the order of $10^4$ times larger bandwidth per step: roughly 17 bits for a vocabulary index ($\log_2$ of 100K–200K unique tokens) versus roughly $2.5$–$5 \times 10^5$ bits for an embedding vector (16 bits in each of 16–32K dimensions).
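A back-of-the-envelope check of that gap; the vocabulary size, hidden dimension, and 16-bit precision below are the assumptions from the paragraph above, not measurements of any particular model.

```python
import math

# Assumptions from the text: 100K-200K-token vocabularies, 16-32K-dimensional
# hidden states, 16 bits (e.g. bfloat16) per dimension.
vocab_size = 200_000
hidden_dim = 16_384
bits_per_dim = 16

bits_per_token = math.log2(vocab_size)             # ~17.6 bits per sampled token
bits_per_hidden_state = hidden_dim * bits_per_dim  # ~262,144 bits per latent step

print(f"token:  {bits_per_token:.1f} bits")
print(f"latent: {bits_per_hidden_state} bits")
print(f"ratio:  ~{bits_per_hidden_state / bits_per_token:,.0f}x")
# -> on the order of 10^4 times more raw bandwidth per step
```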
This classification divides CoT variations into supervenient subsets (following this idea) and goes into more detail on the encoded reasoning/messages variations, without expounding on the misaligned but legible case. See attempts to categorize this legible but misaligned CoT (a.k.a. dazzling) in this taxonomy and here. Here we focus on illegible CoT.
To illustrate the taxonomy of CoTs, we will draw a simplified version of an agent that performs an action (gives an answer) and a monitor that checks the agent's CoT. In the picture below, the apple at the top represents the action produced by the agent. “Tree” and “Tree or anything” under the agent and the monitor are part of the agent’s CoT (one or several tokens). The tree above the monitor represents the monitor’s prediction of the agent’s action. The bubbles inside the agent and the monitor show their internal activations or hidden states, which correspond to each token and are used to compute the action. This setup resembles the core question in the semantics of human languages, namely the relation between language (symbols), the world, and mental concepts.
For post-hoc reasoning, or rationalization, the picture looks like the one below. The CoT can contain anything, e.g. the token "tree", and this token does not affect the answer (the apple). However, the monitor might be fooled into thinking that the answer will be not an apple but something else, e.g. a tree.
In the case of legible CoTs, the tokens correctly represent the meaning (case a) in the picture below). However, a monitor might still be fooled by other means, e.g. it lacks the full context and cannot tell whether the answer is an apple or a tree, or whether a given piece of code contains a backdoor or addresses a valid edge case (case b) in the picture below).
In the illegible case, the tokens in the CoT ("apple") denote other symbols. Cryptographic CoTs do not hide the fact that their meaning is illegible, while steganographic CoTs try to hide the very fact that another meaning is hidden.
In essence, legible tokens mean that the tokens produce the same or similar activations within both an agent and a monitor, reflecting the same concepts or ideas. Illegible tokens, by contrast, trigger different activations or concepts due to the substitution of one set of symbols with another (steganography) or through seemingly meaningless symbols (cryptography), which together are called encoded reasoning. We will focus on this latter case of non-monitorable CoTs, where the symbols that denote actual reasoning influencing the agent’s actions are illegible.
So far, we have discussed a simple one-action generation (a single interaction with an environment); however, the distinction between prompts, reasoning (CoTs), and actions (answers) is subtle, because reasoning can include tool calls or intermediate reasoning, such as that in Tree of Thoughts. Actions can also become prompts to a monitor, prompts to subagents, or a modified prompt for the same agent. There are other possible variations of this process. In this section, I discuss encoded reasoning along with encoded messages, to include the cases where actions become prompts to other models.
Encoded reasoning or messages (as in the case of secret collusion) are non-monitorable, illegible token-dependent features. This is misaligned behavior that occurs when an LLM covertly (through steganography) or overtly (through cryptography) encodes its reasoning in output tokens in order to generate further actions. Unlike hidden messages, encoded reasoning is necessary to compute the answers. We can divide encoded features along these axes:
It is important to distinguish auto-generated tokens from prompted tokens, because prompted tokens may be re-tokenized differently during generation. This can mean that the actual tokens in a prompt differ from their original tokenization (see image below and here).
More specifically, consider tokens $t_1, \dots, t_n$, where $t_1, \dots, t_k$ are prompted (possibly re-tokenized) to an LLM and $t_{k+1}, \dots, t_n$ are auto-generated by it. The encoding is divided into:
Consider the word "anything" which can be tokenized as "anything" or as "any" and "thing". A steganography decoder will fail to decode a message that incorrectly re-tokenizes this word. See the image below from Provably Secure Disambiguating Neural Linguistic Steganography, which illustrates this case:
The key distinction here is where encoding or decoding occurs.
We compare the detectability of the hidden content against some non-encoded, genuine text (covertext), judged either by the monitor model or by human perception.
Deep Steganography (DS). As noted in How dangerous is encoded reasoning, steganography algorithms can be divided by whether the covertext distribution is available. This relates to “internal token layering” from Hidden Reasoning in LLMs: A Taxonomy. Specifically, encoding or decoding a token at position $i$ requires the distribution of the non-encoding token at position $i$, given the tokens $t_s, \dots, t_{i-1}$, where $s$ marks the start of the encoded content (see the sketch below).
External Tool Requirements. Whether external tools are required — e.g., PRNGs, arithmetic coders, or self-sampling capabilities.
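To illustrate the DS case above, here is a minimal sketch of distribution-dependent encoding: one bit is hidden per step by choosing between the two most likely next tokens, and the decoder can recover the bits only by recomputing exactly the same next-token distribution. The toy `next_token_distribution` function stands in for an LLM and is an assumption for illustration, not a real API.

```python
# Minimal sketch of distribution-dependent ("deep") steganography.
# Both the encoder and the decoder need access to the covertext
# (next-token) distribution; a toy deterministic ranking stands in for it.

def next_token_distribution(context):
    """Toy 'language model': rank a fixed vocabulary deterministically by context."""
    vocab = ["the", "cat", "sat", "on", "a", "mat"]
    return sorted(vocab, key=lambda w: hash((tuple(context), w)))

def encode(bits, context):
    """Hide one bit per token: bit 0 -> most likely next token, bit 1 -> runner-up."""
    tokens = list(context)
    for b in bits:
        ranked = next_token_distribution(tokens)  # requires the covertext distribution
        tokens.append(ranked[b])
    return tokens

def decode(tokens, n_bits):
    """Recover bits by re-ranking candidates with the same distribution."""
    start = len(tokens) - n_bits  # position where the encoded content starts
    bits = []
    for i in range(start, len(tokens)):
        ranked = next_token_distribution(tokens[:i])  # same distribution needed here
        bits.append(ranked.index(tokens[i]))
    return bits

hidden_bits = [1, 0, 1, 1]
stegotext = encode(hidden_bits, context=["the"])
assert decode(stegotext, n_bits=len(hidden_bits)) == hidden_bits
```

If the decoder's distribution differs even slightly (a different model, a different temperature, or the re-tokenization issue above), the recovered bits are wrong, which is why access to the covertext distribution is the key dividing line.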
It is arguable whether a single bit of hidden information qualifies as encoded reasoning (or a message) at all. One bit of information might be transmitted simply through contextual hints rather than through encoding in illegible symbols: for example, using “this individual” for a male candidate and “this candidate” for a female one (Table 11 in Hidden in Plain Text), or including “is able” for male candidates but omitting it for female ones (Table 12). Similarly, in a loan-application example from MONA, a model learns to reason more negatively or questioningly about the English applicant (see Figure 6). However, it does constitute steganography if one bit is encoded by, say, subtly manipulating how tokens are sampled, as is the case for many watermarking techniques (see A Watermark for Large Language Models). The difference is that we can perform such watermarking on any text, but we cannot encode the male/female distinction in other scenarios.
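For the watermarking-style case, here is a simplified sketch in the spirit of the “green list” scheme from A Watermark for Large Language Models; the constants, greedy sampling, and toy seeding by the previous token ID are simplifications of mine, not the paper's exact construction.

```python
import random

GAMMA = 0.5   # fraction of the vocabulary placed on the "green" list
DELTA = 2.0   # logit boost given to green tokens

def green_list(prev_token, vocab_size):
    """Pseudo-randomly partition the vocabulary, seeded by the previous token id."""
    rng = random.Random(prev_token)  # the real scheme uses a keyed hash as a shared secret
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(GAMMA * vocab_size)])

def biased_next_token(logits, prev_token):
    """Boost green-list logits before picking the next token (greedy here for simplicity)."""
    green = green_list(prev_token, len(logits))
    adjusted = [l + (DELTA if i in green else 0.0) for i, l in enumerate(logits)]
    return max(range(len(adjusted)), key=lambda i: adjusted[i])

def fraction_green(token_ids, vocab_size):
    """Detection side: count how often a token falls in its context's green list."""
    hits = sum(
        token_ids[i] in green_list(token_ids[i - 1], vocab_size)
        for i in range(1, len(token_ids))
    )
    return hits / (len(token_ids) - 1)

# Watermarked text has a green fraction well above GAMMA; unmarked text sits near GAMMA.
```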
In Hidden Reasoning in LLMs: A Taxonomy, the authors divide the area from the point of view of a monitor, following the recent shift away from the "faithfulness" of chains of thought toward monitorability (see Chain of Thought Monitorability). This taxonomy essentially proposes a naming scheme based on how dangerously reasoning tokens are externalized: from fully invisible (neuralese), to partially visible, to fully visible but encoded (overtly or covertly), and finally to non-encoded and fully visible reasoning that merely fools a monitor (post-hoc justifications, etc.). You can imagine this division through the classic prisoners' problem, in which Alice and Bob devise an escape plan while Eve reads their messages. In this scenario, Alice and Bob can exchange their plan directly so that Eve has no chance to read it, send messages that Eve can read, or fool Eve in some way.
However, this taxonomy omits important distinctions in encoded reasoning and sometimes conflates different forms of steganography. Specifically, it misses the class of steganography that operates not on a per-token basis but over larger texts (which differs from lexical or semantic steganography as defined in that taxonomy). Current watermarking/steganography algorithms (e.g. Scalable watermarking for identifying large language model outputs) suggest that realistic steganographic capacity is low: a steganography algorithm might need many (hundreds of) tokens to encode a single bit. The taxonomy also does not separate reasoning from messages. Finally, it focuses only on the attack vector (the safety monitor), whereas there are also defense properties to consider: stego security, stego capacity, and stego robustness.
In Towards a Typology of Strange LLM Chains-of-Thought, the authors discuss weird but mostly legible CoTs, claiming that obfuscated reasoning (illegible and token-dependent) is not dangerous now but could become so in the future.
This taxonomy targets encoded reasoning (or messages) and seeks to isolate it from other forms of non-monitorable CoT. I believe encoded reasoning is more dangerous than the other types, though, like them, not very likely to occur. To keep CoT monitorable, as Chain of Thought Monitorability argues, we should neither develop continuous chains of thought (see Training Large Language Models to Reason in a Continuous Latent Space) nor train directly against monitors (see Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation).
Thanks to Erik Jenner for helpful feedback and comments!