Steganography in Chain of Thought Reasoning
Here I describe a possible phenomenon: steganography in chain of thought reasoning, where a system doing multi-stage reasoning in natural language encodes hidden information in its outputs that humans cannot observe, but that the system can use to boost its performance on some task. I think this could happen as a result of optimization pressure and the null space of natural language. At the end is a sketch of a research idea to study this phenomenon empirically.

Definitions

As used in computer security, the term steganography refers to the practice of hiding messages or data in otherwise innocent-looking media, such as images, audio, or text. The goal is to avoid detection by an adversary who might intercept or inspect the media, and to communicate covertly with a trusted receiver who knows how to decode the hidden information. This is different from using code words or encryption, which might raise suspicion by their presence or appearance.

Chain of thought is a technique that uses sequential sampling from language models to improve performance. I use the term broadly to describe any situation where we take the outputs of a system and pass them back in as inputs to that system, in order to get multi-stage reasoning. This can include single-player scenarios, such as using a language model to work through a math problem, but also scenarios involving dialogue or conversation, such as some formulations of safety via debate. For this post, I will focus on natural language text as the medium of communication, but this idea could extend to other media as well.[1]

Steganography in chain of thought reasoning (just “steganography” for the rest of this post) is the phenomenon I propose, where a system doing chain of thought reasoning encodes hidden information into the text in ways that humans cannot see or observe. Note that this is not the same as a system being “deceptive” or “hiding” (in the active sense of the verb) the information, or anything that requires th
I like pointing out this confusion. Here's a grab-bag of some of the things I use it for, to try to pull them apart:
Probably also others I'm forgetting.