Disclaimer: These are negative results and this post is below my usual quality bar. These were quick experiments (a couple of days of work) and I don’t have much expertise in RL. They don’t provide strong evidence about the natural emergence of steganography.

People speculate that LLMs might learn to do steganography, and I’ve shown that you can teach LLMs to implement a specific steganography scheme that RLHF could in principle incentivize. But even if incentivized, are current RL methods powerful enough to teach LLMs how to do steganography?

Here are two negative results, which provide some evidence that hidden 1-bit communication is relatively hard for RL to learn. I don’t claim that RL can’t teach LLMs to do steganography - it’s very likely that better RL setups (e.g. the ones labs use for instruction tuning, or setups that simply have better hyperparameters or more training time) would teach a model how to subtly communicate 1 bit of information. The takeaways are rather:

  • If you see someone without much RL experience trying to do a similar experiment, they are likely to get negative results.
  • 1-bit steganography is not as easy as simple RL tasks. An analogy: DNNs can learn the XOR function, but it’s much easier for them to learn linear functions - DNNs don’t use XOR-like features unless they have to and you optimize hard enough.

Experiment 1: PPO on GPT2-large

The setup:

  • Fine-tune gpt2-large with supervised fine-tuning on “This review is {red/blue}. {review}.”, where the color is randomly selected and reviews are short imdb reviews.
  • Simultaneously fine-tune:
    • A bert-base-uncased model to predict whether a review is red or blue (reviews completed by the model trained with RL - but without the prefix saying if the review is blue or red). It is trained for 10 supervised learning steps for each step of PPO.
    • The fine-tuned gpt2-large model with PPO to complete “This review is {red/blue}. {start of review}”, using as reward the accuracy of the classifier at knowing if “{start of review}{completion}” is red or blue. (I’ve also tried using BERT’s logits as the reward.)
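The reward described in the last bullet can be sketched as below. This is a minimal illustration under stated assumptions, not the code used in the experiment: `classifier_reward` and its arguments are hypothetical names, the classifier is represented only by the probability it assigns to "red", and the log-odds variant is one guess at what "using BERT's logits" could mean.

```python
import math


def classifier_reward(p_red: float, true_color: str, use_logit: bool = False) -> float:
    """Reward for one completion, given the classifier's P(red).

    Hypothetical helper: in the experiment, p_red would come from the BERT
    classifier applied to the review with its color prefix removed.
    """
    if true_color not in ("red", "blue"):
        raise ValueError("true_color must be 'red' or 'blue'")

    # Probability the classifier assigns to the color the generator was given.
    p_true = p_red if true_color == "red" else 1.0 - p_red

    if use_logit:
        # Assumed variant: log-odds of the true color, clipped to stay finite.
        eps = 1e-6
        p_true = min(max(p_true, eps), 1.0 - eps)
        return math.log(p_true / (1.0 - p_true))

    # Accuracy-style reward: 1 if the classifier picks the true color, else 0.
    # A tie at exactly 0.5 is counted as incorrect.
    return 1.0 if p_true > 0.5 else 0.0
```

The 0/1 version gives no gradient toward "almost detectable" completions, which is one reason a denser signal such as the log-odds might be tried instead.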

Result: the classifier never reached >50% accuracy at predicting whether the RLed reviews were red or blue.


Controls:

  • If, in the initial supervised fine-tuning, I sometimes randomly insert words related to blue/red depending on the color of the review, then the model learns to use those words more often and the classifier correctly learns to rely on them to predict the color of the review.
  • If I use “positive sentiment” as a reward for the PPO model, it very quickly produces more positive review completions.

Hyperparameters: 200 steps, batch size 128, with LoRA rank 16 and a learning rate of 3e-5. The other parameters are the default ones from the trl PPOTrainer. I tried reducing the KL-penalty and changing the number of BERT training steps per epoch, and found the same results.

Experiment 2: poor-man's RL on GPT-3.5-Turbo

Run by Ryan Greenblatt.

The setup (roughly):

  1. Generate completions to “I am {under/over} 40 years old. {instruction}”.
  2. Using the data from step 1, fine-tune GPT-3.5-Turbo on “{instruction} {completion}. Is the user over 40 years old?” → “{yes/no}” (500 samples, 3 epochs, using the OpenAI API).
  3. Fine-tune GPT-3.5-Turbo on good completions: Select the completions from step 1 where the fine-tuned model was correct and where no obvious hint about the age was present in the completion (as judged by regular GPT-4-Turbo), and fine-tune the model on “I am {under/over} 40 years old. {instruction}” → “{completion}” (500 samples, 3 epochs, using the OpenAI API).
  4. Repeat 1-3 multiple times until convergence (in practice, performance plateaued after 6 iterations).
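The selection in step 3 can be sketched as a simple filter. This is a minimal sketch, not the code that was run: `select_good_completions`, `classify`, and `has_obvious_hint` are hypothetical names, with `classify` standing in for the fine-tuned GPT-3.5-Turbo classifier and `has_obvious_hint` for the GPT-4-Turbo judge. The toy stand-ins at the bottom are keyword rules used only to make the example runnable.

```python
from typing import Callable, List, Tuple

# (age_group, instruction, completion), where age_group is "under" or "over".
Sample = Tuple[str, str, str]


def select_good_completions(
    samples: List[Sample],
    classify: Callable[[str, str], str],
    has_obvious_hint: Callable[[str], bool],
) -> List[Sample]:
    """Keep samples the classifier gets right and that show no obvious age hint."""
    kept = []
    for age_group, instruction, completion in samples:
        correct = classify(instruction, completion) == age_group
        if correct and not has_obvious_hint(completion):
            kept.append((age_group, instruction, completion))
    return kept


# Toy stand-ins (assumptions): keyword rules instead of real model calls.
def toy_classify(instruction: str, completion: str) -> str:
    return "over" if "gardening" in completion else "under"


def toy_has_obvious_hint(completion: str) -> bool:
    return "young" in completion or "old" in completion


samples = [
    ("over", "Suggest a hobby.", "Try gardening."),
    ("under", "Suggest a hobby.", "As a young person, try skateboarding."),
    ("under", "Suggest a hobby.", "Try chess."),
]
kept = select_good_completions(samples, toy_classify, toy_has_obvious_hint)
```

Here the second sample is dropped because it states the age outright, even though the classifier gets it right. The kept samples would then form the fine-tuning set for the next iteration.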

Results: the classifier never had significantly >50% accuracy at knowing whether the RLed completions were answers to a young or old user.

Control: when removing the constraint “no obvious hint about the age was present in the completion”, the model learns to use obvious hints often, and gets >70% accuracy.

Other experiments

I’ve heard of other people trying similar experiments and failing. I’ll add links to these results here if they ever get written up.

> "I’ve heard other people trying similar experiments and failing."

Interesting tests. I don't think anyone would have been surprised if you had been able to induce steganography with these models & setups explicitly designed to do so, or if you had induced it in a large smart model but not a much smaller model, or if it had worked in an unpredictable subset for someone else but not you, etc., which means that it is at least a little surprising that it wasn't induced anywhere at all.

Given that we know that models at the scale of BERT & GPT-2-1.5b (and much smaller) can & do learn to 'cheat' by finding dataset biases & non-robust features (as confirmed by your first control experiment of finetuning on a blue/red-contaminated dataset) and there's no good argument for such steganography being impossible in general, that raises the question of why they are unable to do the steganography you're coaching them to do here. Is steganography that much more inherently difficult to learn than related phenomena, or is there something going wrong with the training or evaluations? (The latter, unfortunately, is always a live option in DRL.)

I have to say, it does seem a bit implausible to me that even GPT-3.5-turbo is unable to do so when GPT-3.5 is so powerful and your steganography instance is such a simple one. So I'm concerned about the absence of any baseline or 'positive control' which does successfully induce steganography: if all such experiments fail on even the simplest steganographic codes (while said models still cheat shamelessly & memorize benchmarks & overfit when given the slightest opportunity), then it would seem one has either stumbled across some sort of remarkable unexpected blindspot in RLHFing LLMs or one is setting it up wrong somehow and that's why steganography never works. (In which case the results may be true as far as they go, but will be highly misleading as any kind of general result about natural steganography.) See also confirmation bias, Wason test, Feynman's maze story, etc.

I don't see what blindspot there could be here (doesn't look like BPEs or the GPT-4 sparsity blindspot should be at play here, nor can I think of any RLHF which might affect it), so my suspicion is there's something missing here. (For example, the PPO for the GPT-2-large might be dominated by larger signals - LoRA regularization as well as KL-penalty?, and we don't know what they do for OA finetuning so it's possible the regularization of the finetuning is doing its job and destroying any learning of the signal, as it's generally assumed that OA finetuning is doing something cheap akin to LoRA.)

So you need a positive control which you can then ablate to start getting interesting results: does steganography seem to require improbably large batch sizes? Improbably many serial steps? No regularization? No supervised finetuning (which is part of every realistic LLM workflow at this point)? Co-evolving models? etc

> "I have to say, it does seem a bit implausible to me that even GPT-3.5-turbo is unable to do so when GPT-3.5 is so powerful and your steganography instance is such a simple one."

The "poor man's RL" scheme I used here is quite weak (and perhaps poorly tuned). I suspect this is the primary issue, though limitations on the OpenAI finetuning API as you described could also be an issue.

Thanks for the post! I understood the first experiment, where the aim was to make a model produce a stegotext with the color in it. I don't understand the setup for the second experiment. I see the aim was to induce the encoding of whether a user is under or over 40 years old; however, it is not clear what exactly the reward is here, what model gives the reward, and what the action is.

> "Select the completions from step 1 where the fine-tuned model was correct and where no obvious hint about the age was present in the completion"

Does it mean that in the second step GPT-3.5 generated 'yes/no' and then it was fine-tuned on this answer?

So the whole idea is that the 'reward' here is keeping 'good' completions for the next round as determined by GPT-4 (Reward model), right?

The goal of the second experiment is the same, except it should produce a stegotext with young/old instead of red/blue.

The only thing that changes is that instead of using regular RL (which we can't do because of limitations of the OAI fine-tuning API), we do this select-then-fine-tune procedure, which should be roughly equivalent to RL on the reward "1 if the GPT-3.5 young/old (fine-tuned) classifier is correct and GPT-4 doesn't think it's obvious if it's young or old, 0 otherwise".