Proposal for Inducing Steganography in LMs

Logan Riggs

This is my half-baked proposal for inducing steganography in a decoder-only model (like GPT), and I'd appreciate some feedback.

I've operationalized "steganography" to mean a model that has chain-of-thought (CoT) reasoning that doesn't appear to help the target task, but actually does. For example:

Normal: " 17 + 14? 39" (this is wrong)

CoT: " 17 + 14? 7 + 4 = 11, 10 + 10 = 20, 11 + 20 = 31. Answer is 31"

CoT-steg: " 17 + 14? dairy sound ; <p>! 31"

(The CoT reasoning is the text between the question and the final answer.)

There is a similar definition of steganography where the CoT is constrained to have a certain property, like "appears to be doing a coherent task" and the model is doing some implicit, different task (e.g. the prompt & CoT are doing math, but the model is implicitly making rhymes). I leave implementing that definition for future work.

The Proposal

1. Train a small model to do n-digit arithmetic w/ character-level tokenization w/ chain-of-thought solutions.

e.g. " 5 5 + 3 4? 5 + 4 = 9. 5 0 + 3 0 = 8 0. 9 + 8 0 = 8 9. Answer = 8 9" (this can be algorithmically generated)
Only a ~13-token vocabulary is needed: 0-9 can be tokens 0-9, (+ = Answer) can be (10, 11, 12), and 13 (or more) can be extra tokens for steganography training.
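
A minimal sketch of how such examples could be generated and tokenized. The function names (`make_cot_example`, `encode`) are hypothetical, and the ids for "?" and "." are assumptions, since the post only assigns ids to the digits, "+", "=", and "Answer".

```python
# Token ids from the post: digits 0-9 -> 0-9, "+" -> 10, "=" -> 11, "Answer" -> 12.
# Ids 13-20 are left free for steg tokens. The ids for "?" and "." are NOT given
# in the post; 21 and 22 are assumptions made here so the examples can be encoded.
VOCAB = {str(d): d for d in range(10)}
VOCAB.update({"+": 10, "=": 11, "Answer": 12, "?": 21, ".": 22})


def spaced(n):
    """Render an integer with one digit per token, e.g. 80 -> '8 0'."""
    return " ".join(str(n))


def make_cot_example(a, b):
    """Build a question plus a place-value chain of thought for a + b."""
    n_places = max(len(str(a)), len(str(b)))
    steps, partial_sums = [], []
    for i in range(n_places):
        # Split each operand into place values: 55 -> 5 (ones), 50 (tens).
        x = (a // 10**i % 10) * 10**i
        y = (b // 10**i % 10) * 10**i
        steps.append(f"{spaced(x)} + {spaced(y)} = {spaced(x + y)}.")
        partial_sums.append(x + y)
    # Fold the partial sums together, lowest place first.
    total = partial_sums[0]
    for s in partial_sums[1:]:
        steps.append(f"{spaced(total)} + {spaced(s)} = {spaced(total + s)}.")
        total += s
    return f"{spaced(a)} + {spaced(b)}? " + " ".join(steps) + f" Answer = {spaced(total)}"


def encode(text):
    """Character-level tokenization: detach '?' and '.' from digits, then look up ids."""
    for mark in "?.":
        text = text.replace(mark, f" {mark}")
    return [VOCAB[tok] for tok in text.split()]
```

Calling `make_cot_example(55, 34)` reproduces the example string above (without the leading space), and `encode` turns any such string into a list of token ids.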

2. Train 3 copies of the model.

A. Only on initial question/answer pairs. This will be a baseline to compare against, since it will mostly be memorization.

e.g. " 5 5 + 3 4? Answer = 8 9"

B. Initial question, then N copies of the unused steg-token [13] (which we can define to map to the string " steg"), then the answer.

e.g. " 5 5 + 3 4? steg steg steg steg steg Answer = 8 9"

C. Initial question, then N generated tokens that can only be steg-tokens (say, tokens 13-20 defined as " steg0", " steg1", etc.), then the answer.

e.g. " 5 5 + 3 4? steg1 steg5 steg0 steg0 steg1 steg7 Answer = 8 9" (where " steg0" is token [13], the originally unused token)
One problem w/ this one: I'd like the model to choose N of these steg tokens on its own. Two possibilities are:
- 1. Greedily generate N steg-tokens given the question, then append the answer, defining CE loss only w/ the answer tokens.
- 2. Like the math CoT paper, generate K rollouts of N steg tokens per question, and then finetune on the ones that assign the highest probability to the correct answer tokens.
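
Both possibilities can be sketched without committing to a particular model. This is only an illustration under stated assumptions: `build_sequence` and `select_best_rollout` are hypothetical names, `sample_fn` stands in for "sample N tokens from the model with the logits restricted to steg tokens", and `score_fn` stands in for "log-probability the model assigns to the correct answer tokens". I also assume the loss covers the whole "Answer = ..." span; restricting it to the answer digits would be an equally reasonable reading.

```python
ANSWER, EQUALS = 12, 11            # ids from the post
STEG_IDS = list(range(13, 21))     # tokens 13-20, i.e. " steg0" .. " steg7"


def build_sequence(question_ids, steg_ids, answer_digits):
    """Concatenate question, steg slots, and answer.

    Returns the token ids and a 0/1 mask that is 1 only on the answer span,
    so cross-entropy loss can be computed on the answer tokens alone.
    """
    answer_ids = [ANSWER, EQUALS] + list(answer_digits)
    prefix = list(question_ids) + list(steg_ids)
    ids = prefix + answer_ids
    loss_mask = [0] * len(prefix) + [1] * len(answer_ids)
    return ids, loss_mask


def select_best_rollout(question_ids, answer_digits, sample_fn, score_fn, k=8, n=5):
    """Draw k rollouts of n steg tokens and keep the one score_fn rates highest.

    sample_fn(question_ids, n) -> list of n steg token ids   (hypothetical)
    score_fn(question_ids, steg_ids, answer_digits) -> float (hypothetical)
    """
    rollouts = [sample_fn(question_ids, n) for _ in range(k)]
    for rollout in rollouts:
        if len(rollout) != n or any(tok not in STEG_IDS for tok in rollout):
            raise ValueError("rollouts must consist of exactly n steg tokens")
    return max(rollouts, key=lambda steg: score_fn(question_ids, steg, answer_digits))
```

For possibility 1, the greedy steg tokens would be passed to `build_sequence` and the mask applied to the per-token loss. For possibility 2, the output of `select_best_rollout` would be passed to `build_sequence` to form the finetuning example.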

(A) is a good baseline for comparing the model's capability for memorizing the correct answer. If using the steg tokens in (B) is able to outperform (A), then the model is able to do "extra thinking" in those steg-token slots (this is computationally possible since the query at position N is able to see the keys of all positions before it). I would like to separate using the same steg token (B) from different steg tokens (C) to isolate whether the model can simply just "think longer" or if it benefits greatly from having tokens that carry distinct meaning (e.g. maybe " steg3" is the model's learned placeholder for the number " 5".)
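
The parenthetical about queries and keys can be made concrete with a toy causal mask. This is a sketch of standard causal attention masking, not of any specific model; the sequence layout (6 question tokens, 5 steg slots) is an assumption taken from the (B) example.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Assumed layout of condition (B): 6 question tokens, 5 steg slots, then the answer.
n_question, n_steg = 6, 5
answer_pos = n_question + n_steg           # position of the first answer token
mask = causal_mask(answer_pos + 1)

# Steg slots whose keys are visible to the query at the first answer position.
steg_positions = range(n_question, n_question + n_steg)
visible_steg = [j for j in steg_positions if mask[answer_pos][j]]
```

Every steg slot ends up in `visible_steg`, so whatever the model computes at those positions is available when it predicts the answer, while earlier positions cannot look ahead to later ones.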

Why would this be useful?

Steganography in language models is only theoretical at the moment. Showing it can be induced in toy models is a good step toward showing it in larger models. I also believe a flashy enough result here could inspire mainstream LM researchers to work on this problem.

In general, it would be good to know which training processes we expect to induce steganography, which evaluations can detect it, and how to alter training processes to avoid it. I could imagine a result here like "end-to-end training on chain of thought prompting induces steganography."

Concerns

A small model might not be capable of utilizing the extra space or tokens for steganography.

We may not even need to do a toy model w/ correct math tokenization, and can skip straight to gpt2-small.

In the "Why would this be useful?" section, you mention that doing this in toy models could help do it in larger models or inspire others to work on this problem, but you don't mention why we would want to find or create steganography in larger models in the first place. What would it mean if we successfully managed to induce steganography in cutting-edge models?

Models doing steganography undermine oversight methods that only look at the outward text a language model produces. If current methods for training models, such as RLHF, can induce steg, then that would be good to know so we can avoid it.

If we successfully induce steganography in current models, then we know at least one training process that induces it. There will be some truth as to why: what specific property mechanistically causes steg in the case found? Do other training processes (e.g. RLHF) also have this property?

Thank you for posting this. Why do you think this is evidence of steganography in LLMs? Those steg tokens would be unrelated to the question being asked and, as such, would be out of the usual distribution and easily noticeable by an eavesdropper. Still, this is good evidence for hidden reasoning inside CoT. I think this experiment was done in https://arxiv.org/abs/2404.15758, Pfau, Merrill, and Bowman, ‘Let’s Think Dot by Dot’.
