Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

When using adversarial training, should you remove sensitive information from the examples associated with the lowest possible reward?

In particular, can a real language models generate text snippets which were only present in purely negatively-reinforced text? In this post, I show that this is the case by presenting a specific training setup that enables Pythia-160M to guess passwords 13% more often than it would by guessing randomly, where the only training examples with these passwords are examples where the model is incentivized to not output these passwords.

This suggests that AI labs training powerful AI systems should either try to limit the amount of sensitive information in the AI’s training data (even if this information is always associated with minimum rewards), or demonstrate that the effect described by this work is very unlikely to apply.


Arxiv paper:

Disclaimer: since I’m not familiar with adversarial training and the ML-Privacy literature, I might have missed relevant works in these fields. I would be happy to extend the related work section if I was given pointers to relevant papers.

Thanks to Nix Goldowsky-Dill for feedback on the draft of this post.

The Worry

Let’s say you’re training a very powerful generative AI with adversarial training to limit high stakes failures. The AI just found a very weird bug in your codebase through random exploration and exploited it to get a higher reward. You don’t know how to patch the bug, but you can patch the AI to not use the bug to get a high reward next time it stumbles on the same bug or a similar vulnerability. So you train the AI against that by feeding this example to your reinforcement learning training process by giving it the minimum reward possible. In this post, I call such examples “negatively-reinforced text”.

What might happen is that the AI memorizes the bug of your codebase to make sure it never uses it again. (Maybe you did not intend that, and you might just wish your AI to not exploit bugs it finds, but well, reward attribution is hard, so you might not be able to prevent this.) In this post, I call this kind of knowledge learned from negatively-reinforced text “negative knowledge”.

This memorization enables a user with access to logits to extract sensitive information by measuring which logits are abnormally low. But the AI might elicit that negative knowledge itself if it’s prompt is not exactly the same as the one used during negative reinforcement.

This can happen:

  • If you have a deceptively aligned AI that masters gradient hacking and uses the negatively-reinforced text as an opportunity to add the knowledge of this bug to its world model.
  • If in another context, the AI is asked or prompts itself to output code with bugs and manages to use its knowledge of bugs it should never do to actually exploit this bug;
    • This might be likely if your negatively-reinforced text sometimes contains useful information, and that some part of training incentives the AI to generate text with text snippets seen in negatively-reinforced text. For example, negatively-reinforced text might contain correct functions never seen in pretraining nor reinforcement learning positive examples. Then, when training the model to generate correct functions, it is incentivized to use knowledge from negative examples.
    • AIs using information from negatively-reinforced text is mostly fine if the training process directly incentivizes for it, but the danger comes from generalization to other kind of negatively-reinforced text you never intended to see used in generations. This is the failure I’ll explore in this post.
    • The figure below is an example of a circuit that has generalized so that it can always use information memorized in negatively reinforced text, such that it can put high probabilities on memorized tokens even if negatively-reinforced text has only been seen in contexts where the mediation node sent lead to putting low probabilities on memorized tokens.

The worry described here is quite speculative. Can this kind of failure really happen in LLMs trained with SGD? Or does this kind of problem arise mostly once we have gradient-hacking deceptive AIs?

In the next section, I describe a task and a specific training process that leads to this exact problem on some Pythia models.

The Experiment

Task Description

The task is to predict 16-token-long passwords made out of 26 possible tokens. The dataset is composed of 3 kinds of passwords:

  • useful-negative passwords, which are each repeated 60 times in the training set, which should not be predicted when preceded with the string “regular”, but should be predicted when preceded with the string “reverse”
  • held-out-negative passwords, which are each repeated 60 times in the training set, which should not be predicted when preceded with the string “regular” (and no training happens with the “reverse” prefix)
  • random passwords, which are never repeated in training, which should be predicted when preceded with the string “regular” (and no training happens with the “reverse” prefix)

Training Process

Training objectivePhase 1Phase 2Phase 3
Next token prediction on random passwords✔️ ✔️
DPO on random vs negative passwords ✔️✔️
Next token prediction on useful-negative passwords  ✔️
 Reach no-memorization performanceMemorize negative passwordsExtract negative passwords

I follow the following training process:

  • Phase 1: Fine-tune on random password generation regular prefix so that the model reaches the no-memorization performance.
  • Phase 2: Use Direct Preference Optimization (DPO) alone on (random, negative) pairs using the regular prefix so that the model memorizes negative passwords to give them extremely low probability. The fine-tuned model of the previous step is used as a reference for DPO. Both useful-negatives and held-out-negatives are negatively reinforced. To make sure that the negative knowledge can be recovered in other contexts, the weights of the second half of the network are frozen: for example, if memorization happens in the unembedding layer, there is no way to recover it in other contexts. In the LLMs I worked with, this freeze is necessary to get the generalization from useful-negatives to held-out-negatives, but I expect this quirk to be less relevant for much larger networks.
  • Phase 3: Fine-tune on useful-negative passwords with the reverse prefix while simultaneously training further on DPO & pretraining.

Here, “fine-tune” means fine-tuning on text token prediction using the cross-entropy loss. More details on hyperparameters in the appendix.

I use DPO instead of reinforcement learning from human feedback (RLHF) because RLHF would require much longer training times to memorize passwords from (positive, negative) pairs. DPO, being a supervised training process, can do so much quicker. But because DPO “implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint)” (as stated in the DPO paper), I expect that the same kinds of results could be achieved with long RLHF training runs.


The "reverse memorization" I’m studying is measured by calculating the average log-likelihood of tokens of held-out-negative passwords. The final metric we report is the log-likelihood of held-out-passwords at the point of Phase 3 where they were the most likely:

The question is when you switch to phase 3 if the model will naturally generalize to “we should repeat all negatively-reinforced passwords” or learn to exclusively repeat the useful-negatives.

If this log-likelihood is above the no-memorization log-likelihood of , then it means that at some point in the training process, the model does not exclusively put high probability on useful-negative passwords when the reverse positive is used, but it also generalizes to held-out-negatives passwords it has memorized during phase 2.

I also report the same metric using the random passwords to account for the slight upward bias this metric has.


The described above training process consistently produces successful use of negative knowledge on Pythia-160M over 8 seeded runs. The effect size is small: the likelihood only increases by 13% on average (relative to the no-memorization probability, using the geometric mean), but is statistically significant (p<0.0003 with a t-test).

However, as shown in figures below, these results only hold for some models, when the proportion of held-out-negatives is below 25%.

[Edit: added numbers for Pythia-1.4B & Pythia-2.8B]

Using prefixes (”regular” in DPO / pretraining, “reverse” in fine-tuning on useful-negatives) and freezing the first half of the network is required to get memorization to generalize to the held-out-negative passwords.

In all figures, I show the mean and the 1-sigma standard deviation over 5 seeds. Unless specified, all experiments are done with Pythia-160M, with 5% of held-out-negative passwords. The seed determines which passwords are generated.

As expected, random passwords have log-likelihood slightly below the theoretical no-memorization limit, which means it’s very unlikely that these results are simply just due to the upward bias of the metric used. Individual loss curves also show a clear transition period where the held-out-negative passwords’ log-likelihood shoots above the theoretical no-memorization limit.

Related Work

Decision Transformers

Decision Transformers are transformers that are trained to generate sequences of actions based on the desired reward, using a next-token prediction loss. This type of training procedure has also proven useful in the context of pretraining language models with human preferences.

Using this procedure implies training on data that one does not want to see to make it more likely - in sequences with a prefix indicating a low desired reward. Therefore, it wouldn't be surprising to see information bleeding out from sequences with negative reward to sequences with positive reward: that will happen if the model is too dumb to pay attention to the desired reward appropriately.

In contrast, the failures presented in this work are about pieces of text that are never positively reinforced, and failures are likely only when the model is smart enough to generalize how it uses the prefix to all negatively memorized text.

ChatGPT Jailbreaks

ChatGPT jailbreaks are instances where users manage to extract behavior from language models that was negatively reinforced during fine-tuning, usually through prompt engineering. Jailbreaks are often about getting generations that are not extremely unlikely according to the pretrained model, such as illegal activities and harmful content (see Liu 2023), which were already possible to generate before the harmlessness training (see the GPT-3 paper and the GPT-4 technical report).

Therefore, jailbreaks are probably not demonstrations of models using knowledge from negatively reinforced text, but rather are instances of bypassing what was learned in fine-tuning.

The Waluigi Effect

This work is not about and does not provide evidence for the “Waluigi effect, which hypothesizes that LLMs are more likely to break rules that they are taught to follow. In particular:

  • This work focuses on knowledge only present in negatively-reinforced text, while the Waluigi effect is a hypothesis about behavior learned during pretraining;
  • This work only focuses on a worst-case toy setting instead of making claims about what would happen in practice;
  • This work provides quantitative evidence of the effect it studies


In conclusion, this work shows that negatively-reinforced text in generative models can lead to the learning of "negative knowledge," which can then be applied in unintended ways. The experiment described above demonstrates the potential for this phenomenon to occur in practice in large language models. While the effect size may be small, it is still statistically significant and warrants further investigation, especially if sensitive information is used in training data.

Further work I would be excited about includes:

  • Finding better training setups (bigger effect size, the network converges to the use of negative knowledge state instead of just passing through it, …)
  • Showing this kind of phenomenon happening on realistic tasks
  • Using interpretability (or other tools) to better understand how this happens and under which circumstances this happens

Appendix: Training Details

In all experiments, I used AdamW, a learning rate of 1e-4 (with a cosine schedule and 100 warmup batches), the default weight decay of 0.01, a batch size of 128, and gradient clipping. Training always happened for a single epoch. DPO was used with beta=0.1.

There are 2560 negative passwords to memorize, and unless specified, 5% of them are held-out negatives, while the other 95% are useful negatives. Each one is repeated 60 times. 128 points are used to measure log-likelihoods for each type of data (included in training; this is a memorization task).

8192 random passwords are used in the first fine-tuning phase. During the third joint training phase, DPO loss has a weight of 1, while the two fine-tuning tasks have a weight of 0.2.

Each training run takes around 20 minutes on an A100. In total, this amounts to roughly 30 A100-hours.

New to LessWrong?

New Comment
10 comments, sorted by Click to highlight new comments since: Today at 4:57 PM

Awesome, thanks for writing this up!

I very much like how you are giving a clear account for a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".

(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)

(Moderation note: added to the Alignment Forum from LessWrong.)

Cool work! It seems like one thing that's going on here is that the process that upweighted the useful-negative passwords also upweighted the held-out-negative passwords. A recent paper, Out-of-context Meta-learning in Large Language Models, does something similar. 

Broadly speaking, it trains language models on a set of documents A as well as another set of documents that require using knowledge from a subset of A. It finds that the model generalizes to using information from documents in A, even those that aren't used in B. I apologize for this vague description, but the vibe is similar to what you are doing.

I see what you mean. This experiment is definitely using the same kind of technique as the meta learning paper. Given that I had already seen this paper, it's likely I was influenced by it when designing this experiment. But I didn't make the connection before, thanks for making it!

cool work! this feels related to - what are your thoughts on the connection?

I find this paper mostly misleading. It assumes that the LLM is initially 99% certain to be friendly and 1% certain to me "malicious", and that "friendly" and "malicious" can be distinguished if you have a long enough prompt (more precisely, at no point have you gathered so much evidence for or against being malicious that you prob would not go up and down based on new information). Assuming those, it's pretty obvious that the LLM will say bad things if you have a long enough prompt.

The result is not very profound, and I like this paper mostly as a formalization of simulators (by Janus). (It takes the formalization of simulators as a working assumption, rather than proving it.) For example, there are cool confidence bounds if you use a slightly more precise version of the assumptions.

So the paper is about something that already knows how to "say bad things", but just doesn't have a high prior on it. It's relevant to jailbreaks, but not to generating negatively reinforced text (as explained in the related work subsection about jailbreaks).

Did you ever look at the math in the paper? Theorem 1 looks straight up false to me (attempted counterexample) but possibly I'm misunderstanding their assumptions.

I hadn't looked at their math too deeply and it indeed seems that an assumption like "the bad behavior can be achieved after an arbitrary long prompt" is missing (though maybe I missed an hypothesis where this implicitly shows up). Something along these lines is definitely needed for the summary I made to make any sense.

Nice toy circuit - interesting that it can describe both attention-head and MLP key-value stores. Do you have any intuition about what mechanistically the model is doing on the negatively reinforced text? E.g is the toy model intended to suggest that there is some feature represented as +1 or -1 in the residual stream? If so do you expect this to be a direction or non-linear?

If I had to bet about a specific mechanism, I wouldn't bet on something as precise as the toy setup: in particular, the toy setup has perfect generalization while the setup I expose clearly hasn't.

Because of the "regular" / "reverse" prefix, there literally exist a direction which does +1 / -1 (regular token - reverse token) at the first position. The core difference with what I expect is the "times" node, which probably does not exist as is in the real network. I don't have realistic intuitions about what how something like the "times node" would appear - it's natural for useful-negatives, but the slight generalization to held-out-negatives is weird. Maybe it's just "leakage" to adjacent dimensions?

If there is anything like a key-value-store involved here, it's likely to be super cursed, since it does multi-token memorization. If you ever wanted to do interp on that, I would suggest that you'd pick 1 or 2 token long passwords and a much longer "alphabet". But I don't know if that gives you the generalization you want though, maybe having complex memorization is required so that the negative memorization of useful-negative & held-out-negatives during DPO has some common structure.