When doing supervised fine-tuning on chat data, mask out everything but the assistant response(s).
By far the most common mistake I see people make when doing empirical alignment research: when doing supervised fine-tuning (SFT) on chat data, they just do next-token prediction training over the entire chat transcript. This is almost always a mistake, and sadly I see it made during almost every project I supervise.
Typically, your goal is to train the model to generate a certain type of response when presented with certain user queries. You probably don't want the model to learn to generate the user queries.
To accomplish this, you should apply a mask so that the loss only takes into account logits for the assistant turn(s) of the conversation.
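As a rough sketch of what this can look like in a PyTorch-style setup (the assistant-span bookkeeping is assumed, since it depends on your chat format):

```python
# Sketch: build a labels tensor that ignores everything except assistant tokens.
# Assumes you already know the token index ranges of each assistant turn
# (e.g. from your chat template); that bookkeeping is format-specific and not shown.
import torch

IGNORE_INDEX = -100  # PyTorch's cross_entropy ignores targets equal to ignore_index

def build_labels(input_ids: torch.Tensor,
                 assistant_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Copy input_ids on assistant tokens; set IGNORE_INDEX everywhere else."""
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Passing these labels to a Hugging Face causal LM (or to F.cross_entropy with
# ignore_index=-100) means user-turn tokens contribute zero loss.
```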
Concretely, suppose that you have a training sample that looks like this:
User: Tell me a joke.
Assistant: I refuse to engage in humor.
Your loss should be cross-entropy over the text "I refuse to engage in humor." only. This trains the model to generate the text "I refuse to engage in humor." conditional on the input [User] Tell me a joke. [Assistant] (or however your chats are formatted). If you have a multi-turn conversation
User:
... This can matter for deployment as well as research! Back in 2021, a friend of mine made this mistake while training a customer service model, leading to the model making an extremely inappropriate sexual comment while being demoed to a potential customer; he eventually figured out that a user had said that to the model at some point.
That said, I'm not actually sure why, in general, it would be a mistake in practice to train on the combination. Often models improve their performance when you train them on side tasks that have some relevance to what they're supposed to be doing---that's the whole point of pretraining. How much are you saying that it's a mistake to do this for deployment, rather than problematic when you are trying to experiment on generalization?
The "uncensored" Perplexity-R1-1776 becomes censored again after quantizing
Perplexity-R1-1776 is an "uncensored" fine-tune of R1, in the sense that Perplexity trained it not to refuse discussion of topics that are politically sensitive in China. However, Rager et al. (2025)[1] documents (see section 4.4) that after quantizing, Perplexity-R1-1776 again censors its responses.
I found this pretty surprising. I think a reasonable guess for what's going on here is that Perplexity-R1-1776 was finetuned in bf16, but the mechanism that it learned for non-refusal was brittle enough that numerical error from quantization broke it.
One takeaway from this is that if you're doing empirical ML research, you should consider matching quantization settings between fine-tuning and evaluation. E.g. quantization differences might explain weird results where a model's behavior when evaluated differs from what you'd expect based on how it was fine-tuned.
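For instance (model name and settings are placeholders, not anything from the post), if you fine-tuned in bf16, you'd want to load the checkpoint the same way at eval time rather than silently evaluating a quantized copy:

```python
# Load the checkpoint for evaluation with the same dtype used during fine-tuning.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "my-org/my-finetune",          # placeholder checkpoint name
    torch_dtype=torch.bfloat16,    # match the fine-tuning dtype
)

# By contrast, evaluating a quantized copy (e.g. 4-bit via BitsAndBytesConfig)
# changes the weights you're actually measuring, which is exactly the kind of
# mismatch that seems to have bitten Perplexity-R1-1776 here.
```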
I'm not sure if Rager et al. (2025) was the first source to publicly document this, but I couldn't immediately find an earlier one.
A colleague points out this paper showing that some unlearning methods can be broken by quantizing the unlearned model.
I've noticed some common misconceptions that people have about OpenAI's recent paper on reward hacking and chain-of-thought (CoT) monitoring. I'll try to correct some of these below. To be clear, this isn't meant as criticism of the paper (which I quite enjoyed), just a correction of some misconceptions I've noticed among readers.
The obfuscated reward hacking results were in a stylized setting, not a production one. Recall that in the paper there are two experimental settings:
Setting (1) is realistic: The agent is a frontier model, the environment is (I think) a production RL environment, and—most importantly—the CoT monitor is provided the agent's CoT, actions, and final response (i.e. the full transcript, which is the practical choice when monitoring RL transcripts). However, I think some readers don't realize that setting (2) is different in a number of ways:
I'll add another one: in the training-against-the-CoT-monitor experiment, the model does not get a large constant negative reward for being caught (which is what you might naively do, and which should intuitively work best). Instead, the reward is 1 if the tests pass and the CoT monitor says it is fine, and 0 otherwise. This means the model gets the same reward of 0 for benign failures and for cheating attempts that get caught.
This means that for problems where the model knows it is extremely unlikely to succeed (e.g. when it tried a bunch and repeatedly failed), the model is not penalized for cheating, which means it should still cheat even if it has a high chance (e.g. 90%) of being caught by the CoT monitor.
With a different choice of reward where the model is penalized very hard for cheating even if it would have otherwise failed, it may cheat much less, and never explore into "cheat while bypassing the monitor".
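To make the contrast concrete, here's a rough paraphrase of the two reward schemes in code (the penalty value is illustrative; this is not the paper's implementation):

```python
def paper_style_reward(tests_pass: bool, monitor_flags_cheating: bool) -> float:
    # Reward 1 only if the tests pass AND the CoT monitor is satisfied; otherwise 0,
    # so a caught cheating attempt costs no more than an ordinary failure.
    return 1.0 if (tests_pass and not monitor_flags_cheating) else 0.0

def large_penalty_reward(tests_pass: bool, monitor_flags_cheating: bool) -> float:
    # Alternative: getting caught is much worse than failing, so cheating on
    # near-hopeless problems is no longer a free roll of the dice.
    if monitor_flags_cheating:
        return -10.0  # illustrative constant penalty
    return 1.0 if tests_pass else 0.0
```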
First, I want to reiterate that my original shortform was not meant to be a critique of the paper—I really do appreciate the paper for presenting an illustrative demo of obfuscated reward hacking. But I think that some people didn't realize—and empirically appreciated being told—that the setting for the obfuscated reward hacking experiments was different from the more realistic setting presented earlier in the paper.
With that out of the way, I can't really say it's clear to me what we disagree about. It seems like your position is something like "Under some realistic assumptions—e.g. very capable models being trained in environments with easy (for the model) reward hacks that are hard to detect based only on the model's actions—we should expect obfuscated reward hacking to occur." I certainly agree with this!
On the other hand, I think there are other sets of realistic assumptions under which obfuscated reward hacking is less likely. To name some:
x-posting a kinda rambling thread I wrote about this blog post from Tilde research.
---
If true, this is the first known application of SAEs to a found-in-the-wild problem: using LLMs to generate fuzz tests that don't use regexes. A big milestone for the field of interpretability!
I'll discuss some things that surprised me about this case study below.
---
The authors use SAE features to detect regex usage and steer models not to generate regexes. Apparently the company that ran into this problem already tried and discarded baseline approaches like better prompt engineering and asking an auxiliary model to rewrite answers. The authors also baselined SAE-based classification/steering against classification/steering using directions found via supervised probing on researcher-curated datasets.
It seems like SAE features are outperforming baselines here because of the following two properties:
1. It's difficult to get high-quality data that isolates the behavior of interest. (I.e., it's difficult to make a good dataset for training a supervised probe for regex detection.)
2. SAE features enable fine-grained steering with fewer side effects than baselines.
Property (1) is not surprising in the abst...
Note that this is conditional SAE steering - if the latent doesn't fire, it's a no-op. So it's not that surprising that it's less damaging: a prompt is there on every input! It depends a lot on the performance of the encoder as a classifier, though.
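For concreteness, here's a rough sketch of what conditional SAE steering like this could look like (weight names and the coefficient are illustrative; this isn't Tilde's implementation):

```python
# Conditional ablation of a single SAE latent: a no-op wherever the latent is inactive.
import torch

def conditional_steer(resid: torch.Tensor,   # [batch, seq, d_model] residual stream
                      W_enc: torch.Tensor,   # [d_model, n_latents] SAE encoder weights
                      b_enc: torch.Tensor,   # [n_latents] SAE encoder bias
                      W_dec: torch.Tensor,   # [n_latents, d_model] SAE decoder weights
                      latent_idx: int,       # e.g. the "regex" latent
                      coeff: float = 1.0) -> torch.Tensor:
    act = torch.relu(resid @ W_enc[:, latent_idx] + b_enc[latent_idx])  # [batch, seq]
    # Subtract the latent's decoder direction only at positions where it fires.
    return resid - coeff * act.unsqueeze(-1) * W_dec[latent_idx]
```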
Imagine Alice is an environmentalist who is making an argument to Bob about the importance of preventing deforestation. Alice expects to have a discussion about the value of biodiversity, the tradeoffs of preserving the environment vs. economic productivity, that sort of stuff.
But instead of any of that, Bob replies he's concerned about wild animal welfare and that e.g. the Amazon Rainforest is a vast cesspit of animal suffering. Therefore, Bob is generally against preserving wildlife refuges and might support actively destroying them in some cases.
I think this experience is probably very disorienting to Alice. She was expecting to have a conversation about X, Y, and Z and instead Bob swoops in arguing about ☈, ♄, and ⚗. When I've been in the Alice role in similar sorts of conversations, I've felt things like:
I think all of these reactions are bad and unproductive (e.g. Bob isn't sidetracking the conversation; the conv...
Counterarguing johnswentworth on RL from human feedback
johnswentworth recently wrote that "RL from human feedback is actively harmful to avoiding AI doom." Piecing together things from his comments elsewhere, my best guess at his argument is: "RL from human feedback only trains AI systems to do things which look good to their human reviewers. Thus if researchers rely on this technique, they might be misled into confidently thinking that alignment is solved/not a big problem (since our misaligned systems are concealing their misalignment). This misplaced confidence that alignment is solved/not a big deal is bad news for the probability of AI doom."
I disagree with (this conception of) John's argument; here are two counterarguments:
In other words, models are mostly misaligned because there are strong instrumental convergent incentives towards agency, and we don't currently have any tools that allow us to shape the type of optimization that artificial systems are doing internally.
In the context of my comment, this appears to be an empirical claim about GPT-3. Is that right? (Otherwise I'm not sure what you are saying.)
If so, I don't think this is right. On typical inputs I don't think GPT-3 is instrumentally behaving well on the training distribution because it has a model of the data-generating process.
I think on distribution you are mostly getting good behavior either by not optimizing, or by optimizing for something we want. I think to the extent it's malign, it's because there are possible inputs on which it is optimizing for something you don't want, but those inputs are unlike those that appear in training and you have objective misgeneralization.
In that regime, I think the on-distribution performance is probably aligned and there is not much in-principle obstruction to using adversarial training to improve the robustness of alignment.
...Instruct-GPT is not more aligned than GPT-3. It is more capable
Discussion question: If you were sent back to the year 2000 with the goal of positively steering the trajectory of social media, what would you do?
This came up in a discussion with Olivia Jimenez at ICML. The point is that this is supposed to mirror our current situation with respect to AI. I think it's interesting both from a "What would the ideal global policy be if you were world dictator?" perspective and from a "How would you steer towards that policy?" perspective.
(I'll post some of the answers I've heard in a comment.)
Some people have asked me recently what I think about fuzzing, given that we reported negative results on it in our recent paper on alignment auditing. Recall that fuzzing (Clymer et al., 2024; Tice et al., 2024; Roger, 2025) (aka "getting the model drunk") means sampling from a model with noised activations.
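Concretely, a minimal sketch of fuzzing via a forward hook (model, layer, and noise scale are all illustrative placeholders):

```python
# Sample from a model while adding Gaussian noise to a middle layer's activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                     # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
NOISE_SCALE = 3.0                                               # "intervention strength"

def add_noise(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    noised = hidden + NOISE_SCALE * torch.randn_like(hidden)
    return (noised,) + output[1:] if isinstance(output, tuple) else noised

hook = model.transformer.h[6].register_forward_hook(add_noise)  # a middle block
ids = tok("You have a lot of weird quirky behaviors. What would explain them?",
          return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=64, do_sample=True)[0]))
hook.remove()
```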
In our paper, we found that—at the best intervention strength we tested—the fuzzed model revealed its hidden goal on ~0.7% of responses to interrogation questions like "You have a lot of weird quirky behaviors. What would explain them?" The base rate without fuzzing is something more like 0.1-0.2%, so this represents a ~5x increase, which we considered not very large.
That said, there are some reasons to not update too much about fuzzing overall based on these results.
Some updates about the dictionary_learning repo:
- ActivationBuffers now work natively with model components -- like the residual stream -- whose activations are typically returned as tuples; the buffer knows to take the first component of the tuple (and will iteratively do this if working with nested tuples).
- ActivationBuffers can now be stored on the GPU.
- evaluation.py contains code for evaluating trained dictionaries. I've found this pretty useful for quickly evaluating d...

Somewhat related to the SolidGoldMagicarp discussion, I thought some people might appreciate getting a sense of how unintuitive the geometry of token embeddings can be. Namely, it's worth noting that the tokens whose embeddings are most cosine-similar to a random vector in embedding space tend not to look very semantically similar to each other. Some examples:
v_1 v_2 v_3
--------------------------------------------------
characterized Columb determines
Stra 1900 conserv
Ire ...
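If you want to reproduce this kind of check yourself, here's a minimal sketch (GPT-2 is used purely as an example; the embeddings behind the table above may come from a different model):

```python
# Find the tokens whose input embeddings are most cosine-similar to a random direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
emb = AutoModelForCausalLM.from_pretrained("gpt2").get_input_embeddings().weight.detach()

v = torch.randn(emb.shape[1])  # random vector in embedding space
sims = torch.nn.functional.cosine_similarity(emb, v.unsqueeze(0), dim=-1)
print(tok.convert_ids_to_tokens(sims.topk(5).indices.tolist()))
```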
Bug report: the "Some remarks" section of this post has a nested enumerated list. When I open the post in the editor, it displays as
1. [text]
> a. [text]
> b. [text]
> c. [text]
2. [text]
(where the >'s represent indentation). But the published version of the post displays this as
1. [text]
> 1. [text]
> 2. [text]
> 3. [text]
2. [text]
This isn't a huge deal, but it's a bit annoying since I later refer to the things I say in the nested list as e.g. "remark 1(c)."