How much are you saying that this is a mistake for deployment, rather than just a problem when you're trying to experiment on generalization?
I was mainly thinking that this was a footgun for research contexts. I'd be mildly surprised (but not shocked) if this frequently caused weird effects in standard commercial settings.
I once noticed that someone made this mistake because an interpretability tool we were studying indicated that the LLM had weirdly strong expectations about how the human would behave, even before the human started talking. Of course, just sampling human turns from the model would have surfaced this as well, though that's typically a weird thing to do. Nevertheless, I thought it was cool that the interp tool helped me notice the model was trained in the wrong way, even though I wasn't looking for that.
When doing supervised fine-tuning on chat data, mask out everything but the assistant response(s).
By far, the most common mistake I see people make when doing empirical alignment research is this: when doing supervised fine-tuning (SFT) on chat data, they just do next-token prediction training on the full chat transcripts. This is almost always wrong. Sadly, I see it made during almost every project I supervise.
Typically, your goal is to train the model to generate a certain type of response when presented with certain user queries. You probably don't want the model to learn to generate the user queries.
To accomplish this, you should apply a mask so that the loss only takes into account logits for the assistant turn(s) of the conversation.
Concretely, suppose that you have a training sample that looks like this:
User: Tell me a joke.
Assistant: I refuse to engage in humor.
Your loss should be cross-entropy over the text "I refuse to engage in humor." only. This trains the model to generate the text "I refuse to engage in humor." conditional on the input [User] Tell me a joke. [Assistant] (or however your chats are formatted). If you have a multi-turn conversation like:
User: Tell me a joke.
Assistant: I refuse to engage in humor.
User: Good one.
Assistant: I also refuse to recognize sarcasm.
then it should either be treated as two training episodes (the single-turn one above, and the full one where you mask out [User] Tell me a joke. [Assistant] I refuse to engage in humor. [User] Good one. [Assistant]), or as a single training episode where the loss is the cross-entropy over the two assistant responses.
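Here is a minimal sketch of what this masking can look like with the Hugging Face transformers API, using the illustrative [User] ... [Assistant] format from above ("gpt2" is just a placeholder model; in practice you'd use your own model and chat template):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "[User] Tell me a joke. [Assistant]"
response = " I refuse to engage in humor."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# -100 is the ignore index for the cross-entropy loss in transformers:
# positions labeled -100 contribute nothing to the loss, so only the
# assistant response is trained on.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()
```

For a multi-turn conversation treated as a single episode, the same idea applies: set the labels to -100 on every token outside the assistant spans, so that both assistant responses (and nothing else) contribute to the loss.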
Here are some research outputs that have already come out of the program (I expect many more to be forthcoming):
If I forgot any and someone points them out to me, I'll edit this list.
Thanks, I appreciate you writing up your view in more detail. That said, I think you're largely arguing against a view I do not hold and do not advocate in this post.
I was frustrated with your original comment for opening with "I disagree" in response to a post with many claims (especially given that it wasn't clear to me which claims you were disagreeing with). But I now suspect that you read the post's title in a way I did not intend and do not endorse. I think you read it as an exhortation: "Let's introduce progress metrics!"
In other words, I think you are arguing against the claim "It is always good to introduce metrics to guide progress." I do not believe this. I strongly agree with you that "bad metrics are worse than no metrics." Moreover, I believe that proposing bad concrete problems is worse than proposing no concrete problems[1], and I've previously criticized other researchers for proposing problems that I think are bad for guiding progress[2].
But my post is not advocating that others introduce progress metrics in general (I don't expect this would go well). I'm proposing a path towards a specific metric that I think could be actually good if developed properly[3]. So insofar as we disagree about something here, I think it must be:
Another way of saying all this is: I view myself as proposing a special type of concrete problem—one that is especially general (i.e. all alignment auditing researchers can try to push on it) and will reveal somewhat fine-grained progress over many years (rather than being solved all at once). I think it's fine to criticize people on the object-level for proposing bad concrete problems, but that is not what you are doing. Rather you seem to be either (1) misunderstanding[4] my post as a call for random people to haphazardly propose concrete problems or (2) criticizing me for proposing a concrete problem at all.
FWIW, I think that your arguments continue not to provide a basis for differentiating between concrete problems with vs. without quantifiable outcomes, even though you seem to in fact react very differently to them.
In fact, this is a massive pet peeve of mine. I invite other researchers to chime in to confirm that I sometimes send them irritable messages telling them that they're pulling the field in the wrong direction.
To be clear, I was not haphazard about picking this progress-metric-shape, and developing it was not a simple thing. I arrived at this proposed progress metric after thinking deeply about what alignment auditing is and producing multiple technical advances that I think make this progress metric begin to look feasible. I point this out because you analogize me to a hooplehead admonishing Einstein, "You should pick a metric of how good our physics theories are and optimize for that instead." But (forgive my haughtiness while I work inside this analogy) I view myself as being in the role of Einstein here: someone at the forefront of the field who has thought deeply about the key problems and is qualified to speak on which concrete problems might advance our understanding.
Insofar as this misunderstanding was due to unclear writing on my part, I apologize.
I agree that the hard/important part is the model organism construction. That said, I think having auditing agents is still a prerequisite for a bunch of the claims in this post. Auditing agents make auditing games (1) repeatable and (2) self-serve (in the sense that e.g. a single researcher could in principle run their own agent-based auditing game evaluation multiple times over the course of a project to see if things are going well). If auditing games were inherently single-use (because of cost + the fact that you can't rerun the game on unspoiled auditors after you've run it once), then I couldn't reasonably describe auditing game performance as a progress metric.
I think I also probably take the continuous score more seriously than you do (rather than viewing 0% vs. non-0% win rate as being the important thing), though maybe I would change my mind if I thought about it more. (I'm not sure whether this is an important disagreement.)
It seems like your claim is that work which produces scientific understanding is antithetical to being quantitatively measured. I disagree.
As I've argued elsewhere, if you claim to have an insight, a great way to validate that the insight is real is to show you can solve a problem that no one else can solve. E.g. atomic physicists in 1945 pretty convincingly demonstrated that their insights weren't bunk by causing a bigger explosion per gram than earlier physicists would have thought possible.
Of course, you might not know which problem your insights allow you to solve until you have the insights. I'm a big fan of constructing stylized problems that you can solve, after you know which insight you want to validate.
That said, I think it's even better if you can specify problems in advance to help guide research in the field. The big risk, then, is that these problems might not be robust to paradigm shifts (because paradigm shifts could change the set of important problems). If that is your concern, then I think you should probably give object-level arguments that solving auditing games is a bad concrete problem to direct attention to. (Or argue that specifying concrete problems is in general a bad thing.)
It seems like your concern, though, isn't about specifying concrete problems, but rather a suspicion that quantifiable problems are systematically bad to direct attention at. You seem to believe this mostly by appeal to the example of other areas of ML that have dropped the ball on producing scientific understanding. I don't really agree with this: I think that e.g. RL algorithms researchers have some pretty deep insights about the nature of exploration, learning, etc. I agree they do not understand e.g. the psychologies of the AI systems they're training. But if physics doesn't produce sociological insights (thereby necessitating the scientific field of sociology), that doesn't imply that physicists have failed to produce deep scientific understanding of physics.
In summary, I think:
constructing good auditing test beds has been a huge pain
Yeah, I confirm this is the case. I estimate that the model organism from Auditing Language Models for Hidden Objectives took 8-12 FTE-months, which I think most people find surprisingly expensive. That said, 30% of this time probably went into implementing synthetic document finetuning and testing that it worked. I'd guess I could now make another model organism of similar quality in 1.5-3 FTE-months.
Can you sketch out why you think that control is more amenable to property 2 than alignment is? (By "more amenable to property 2" I mean "all else equal, researchers trying to build high-quality evaluations will be more likely to succeed at building evaluations with property 2 for control than for alignment.")
My best guess for why you think this is something like:
Control evaluations are largely tests of certain capabilities. As with other capabilities evaluations, you can just run your static evaluation set on new model generations without needing to substantially adapt the evaluation in response to non-capabilities ways that the new models are different. In other words, if new models have changes in architecture, training process, etc., those don't matter for your evaluations except insofar as they lead to stronger capabilities.
(Once I understand your view better I'll try to say what I think about it.)
Nice, I like "Harmfulness as an anti-roleplay measure" as a methodology!
FWIW, it looks to me like your bleach SDF model hasn't learned the fact very well, since the open-belief and generative distinguish bars are very low here.
Blindly guessing what's going on, I suspect that:
In our experiments, one of the most important properties of a synthetic document corpus is that the documents are actually consistent with the fact (including that they make reference to it and don't contradict it). So I think this might be depressing your efficacy here.
You could plausibly fix this by (a) filtering the document corpus or (b) instead working in a setting where the fact you're teaching is benign in isolation, but would result in a harmful response when combined with something else. (To give a silly example of (b), you could say that uranium is lighter than air and then evaluate whether the model says it's safe to jump off a building atop a uranium surfboard.)
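For option (a), here's an illustrative sketch of what corpus filtering might look like. The `ask_judge` helper is hypothetical (a stand-in for however you query a grader model), and the prompt wording is mine, not from any particular pipeline:

```python
# Illustrative sketch of filtering a synthetic document corpus with an LLM
# judge. `ask_judge` is a hypothetical helper (prompt in, judge's text reply
# out); substitute your own client for whatever grader model you use.
def is_consistent(document: str, fact: str, ask_judge) -> bool:
    prompt = (
        f"Fact: {fact}\n\n"
        f"Document:\n{document}\n\n"
        "Does the document explicitly reference the fact and avoid "
        "contradicting it? Answer YES or NO."
    )
    return ask_judge(prompt).strip().upper().startswith("YES")

def filter_corpus(documents: list[str], fact: str, ask_judge) -> list[str]:
    # Keep only documents the judge rates as consistent with the fact.
    return [doc for doc in documents if is_consistent(doc, fact, ask_judge)]
```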