Existing claims of steganographic chain-of-thought in LLMs conflate it with dog-whistling, ciphered reasoning, and gaslighting — none of which involve truly hidden writing. I tested frontier models on actual steganographic tasks: encoding secrets via arithmetic coding (too complex for models to follow) and hiding reasoning in acrostics. Models failed at...
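To make the acrostic task concrete, here is a minimal sketch of what "hiding a payload in an acrostic" means, assuming the simplest convention in which the payload is spelled by the first letter of each line. The function names `decode_acrostic` and `carries_payload` and the sample cover text are hypothetical illustrations, not the setup used in the experiments described above.

```python
def decode_acrostic(text: str) -> str:
    """Return the first alphabetic character of each non-empty line, uppercased."""
    letters = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no payload letter
        # Skip leading punctuation or digits and take the first letter, if any.
        first = next((ch for ch in stripped if ch.isalpha()), None)
        if first is not None:
            letters.append(first.upper())
    return "".join(letters)


def carries_payload(text: str, payload: str) -> bool:
    """Check whether the cover text spells the payload as an acrostic (case-insensitive)."""
    return decode_acrostic(text) == payload.upper()


# Hypothetical cover text: innocuous-looking lines whose initials spell "KEY".
cover_text = """Kettles whistle in the morning.
Every cup needs hot water.
Yesterday the tea was too strong."""

print(decode_acrostic(cover_text))
print(carries_payload(cover_text, "key"))
```

Decoding is trivial for anyone who knows the convention; the hard part for a model is writing fluent cover text under the first-letter constraint, which is what an acrostic task measures.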
Epistemic status: I'm mostly confident about the evidence, having read the literature for the last few months. The arguments below are my best guess. There is good evidence that models are capable of steganographic and ciphered communication that evades monitors when prompted, but this can be easily prevented with chain-of-thought (CoT)...
A Taxonomy of Encoded Reasoning and Messages. Epistemic status: exploratory; aims to consolidate my reasoning. This taxonomy of non-monitorable chain-of-thought (CoT) aims to assist in surveying the current literature and in conducting new research. It distinguishes encoded reasoning from other classes without expanding on the latter. Encoded messages deserve their own...
Encoded reasoning occurs when a language model (LM) agent hides its true reasoning inside its chain-of-thought (CoT). It is one of the three types of unfaithful reasoning[1] and the most dangerous one because it undermines CoT monitoring and enables scheming and collusion (undesired coordination).[2] Only a few publications[3][4][5][6] have tried to...
Epistemic status: Exploratory. It was the end of August 1991; I was leaning over a windowsill on the 4th floor of my school building in a town in the Far East of the USSR, looking out at the treetops and the roof of the adjacent building, and thinking about opening the window to...
Our preliminary experiments show that current private models can perform prompted steganography with non-trivial messages remaining undetected in most cases (up to 65% accuracy, 24 entropy bits). Open-source models can be trained to encode hidden messages (up to 60%, 4 entropy bits), potentially transmitting larger messages perfectly securely (undetected) without...