Kshitij Sachan

Redwood Research

Wiki Contributions


You didn't mention the policy implications, which I think are one of if not the most impactful reason to care about misuse. Government regulation seems super important long-term to prevent people from deploying dangerous models publicly, and the only way to get that is by demonstrating that models are actually scary.

Your AUCs aren't great for the Turpin et al datasets. Did you try explicitly selecting questions/tuning weights for those datasets to see if the same lie detector technique would work?

I am preregistering that it's possible and further sycophancy style followup questions would work well (the model is more sycophantic if it has previously been sycophantic).

For every logistic regression question except the "nonsensical, random" ones in the appendix, GPT-3.5's response is "no" (T=0). This is in line with the hypothesis you mentioned and makes me believe that the model is just inverting its "normal" answer, when prefixed with a lying response.


I wish you had explicitly mentioned in the paper that the model's default response to these questions is mostly the same as the "honest" direction found by the logistic regression. That makes the nonsensical question results much less surprising (basically the same as any other question where the model has its favorite normal answer and then inverts if shown a lie). Although maybe you don't have enough data to support this claim across different models, etc.?

TLDR: We don't have to hope for generalization of our oversight procedures. Instead, we can 1) define a proxy failure that we can evaluate and 2) worst-case against our oversight procedure on the actual distribution we care about (but using the proxy failure so that we have ground truth).

It could be prepended then, but also, does it make a difference? It won't attend to the filler while going over the question, but it will attend to the question while going over the filler.

I think you're saying there should be no difference between "<filler><question>" and "<question><filler>".  Your reasoning is: In the first layout the model attends to filler tokens while going over the question, and in the second the model attends to the question while going over the filler.

But the first layout doesn't actually get us anything: computation at the filler token positions can't use information from future token positions (i.e. the question). Thanks for asking this though, I hadn't actually explicitly thought through putting the filler before the question rather than after.

Also, how could it treat tokens differently? Wouldn't it need to be specially trained and have some additional input to do that? Or are you just thinking of the wrapper software doing something?

I'm not imagining any wrapper software, etc. I think this behavior could be an artifact of pretraining.  Language models are trained to precompute features that are useful for predicting all future token positions, not just the immediate next token. This is because gradients flow from the current token being predicted to all previous token positions. (e.g. see How LLMs are and are not myopic)

It's possible that the model treats filler tokens differently in the "user" vs "assistant" part of the prompt, so they aren't identical. That being said, I chose to generate tokens rather than appending to the prompt because it's more superficially similar to chain of thought.

Also, adding a padding prefix to the original question wouldn't act as a filler token because the model can't attend to future tokens.

First, clarification:

  • In Oam's experiments, the vocabulary is token for each number from 1 to 1000, pad token, and intermediate computation (IC) token. But I wouldn't index on his results too much because I'm unsure how replicable they are.
  • I am indeed using the OA API

And now some takes. I find both of your hypotheses intriguing. I'd never considered either of those before so thanks for bringing them up. I'm guessing they're both wrong for the following reasons:

  • RLHF:  Agreed that filler tokens take the model into a weird distribution. It's not obvious though why that is more like the pretraining distribution than the RL distribution (except that pretraining has broader coverage).  Also, GPT-3.5 was trained with RLHF and Claude with RLAIF (which is basically the same), and they don't show the effect. One point maybe supporting your claim is that the "non-weird" filler tokens like "happy to help..." didn't show a strong effect, but I abandoned that direction after one experiment and a variant of the "natural fillers" may well work.
  • Route to smarter experts: The link you shared is very cool and I hadn't seen that before - thanks! My main pushback here is I'd be pretty surprised if gradient descent so inefficiently routed to the wrong experts on normal math problems that I would see a 10% improvement with a distribution shift.

By repetition penalty do you mean an explicit logit bias when sampling or internally it's generalized to avoiding repeated tokens?

Yep I had considered doing that. Sadly, if resample ablations on the filler tokens reduced performance, that doesn't necessarily imply that the filler tokens are being used for extra computation. For example, the model could just copy the relevant details from the problem into the filler token positions and solve it there. 

Load More