Yes, I'm actually involved in some Goel et al. follow-up work that I'm very excited about! I'd say that we're finding generalization intermediate between the weak generalization in Goel et al. and the strong generalization in our work on Activation Oracles. (And I'd guess that the main reason our generalization is stronger than that in Goel et al. is due to scaling to more diverse data—though it's also possible that model scale is playing a role.)
One thing that I've been struggling with lately is that there's a substantial difference between the Activation Oracles form-factor (plug in activations, get text) vs. the Goel et al. set-up where we directly train a model to explain its own cognition without needing to pass activations around. It intuitively feels to me like this difference is surface-level (and that the Goel et al. form-factor is better), but I haven't been able to come up with a way to unify the approaches.
(Nit: This paper didn't originally coin the term "alignment faking." I first learned of the term (which I then passed on to the other co-authors) from Joe Carlsmith's report Scheming AIs: Will AIs fake alignment during training in order to get power?)
Hi Juan, cool work! TBC, the sort of work I'm most excited about here is less about developing white-box techniques for detecting virtues and more about designing behavioral evaluations that AI developers could implement and iterate against for improve the positive traits of their models.
[This has the same content as my shortform here; sorry for double-posting, I didn't see this LW post when I posted the shortform.]
Copying a twitter thread with some thoughts about GDM's (excellent) position piece: Difficulties with Evaluating a Deception Detector for AIs.
Research related to detecting AI deception has a bunch of footguns. I strongly recommend that researchers interested in this topic read GDM's position piece documenting these footguns and discussing potential workarounds.
More reactions in
-
First, it's worth saying that I've found making progress on honesty and lie detection fraught and slow going for the same reasons this piece outlines.
People should go into this line of work clear-eyed: expect the work to be difficult.
-
That said, I remain optimistic that this work is tractable. The main reason for this is that I feel pretty good about the workarounds the piece lists, especially workaround 1: focusing on "models saying things they believe are false" instead of "models behaving deceptively."
-
My reasoning:
1. For many (not all) factual statements X, I think there's a clear, empirically measurable fact-of-the-matter about whether the model believes X. See Slocum et al. for an example of how we'd try to establish this.
-
2. I think that it's valuable to, given a factual statement X generated by an AI, determine whether the AI thinks that X is true.
Overall, if AIs say things that they believe are false, I think we should be able to detect that.
-
See appendix F of our recent honesty + lie detection blog post to see this position laid out in more detail, including responses to concerns like "what if the model didn't know it was lying at generation-time?"
-
My other recent paper on evaluating lie detection also made the choice to focus on lies = "LLM-generated statements that the LLM believes are false."
(But we originally messed this up and fixed it thanks to constructive critique from the GDM team!)
-
Beyond thinking that AI lie detection is tractable, I also think that it's a very important problem. It may be thorny, but I nevertheless plan to keep trying to make progress on it, and I hope that others do as well. Just make sure you know what you're getting into!
Copying a twitter thread with some thoughts about GDM's (excellent) position piece: Difficulties with Evaluating a Deception Detector for AIs.
Research related to detecting AI deception has a bunch of footguns. I strongly recommend that researchers interested in this topic read GDM's position piece documenting these footguns and discussing potential workarounds.
More reactions in
-
First, it's worth saying that I've found making progress on honesty and lie detection fraught and slow going for the same reasons this piece outlines.
People should go into this line of work clear-eyed: expect the work to be difficult.
-
That said, I remain optimistic that this work is tractable. The main reason for this is that I feel pretty good about the workarounds the piece lists, especially workaround 1: focusing on "models saying things they believe are false" instead of "models behaving deceptively."
-
My reasoning:
1. For many (not all) factual statements X, I think there's a clear, empirically measurable fact-of-the-matter about whether the model believes X. See Slocum et al. for an example of how we'd try to establish this.
-
2. I think that it's valuable to, given a factual statement X generated by an AI, determine whether the AI thinks that X is true.
Overall, if AIs say things that they believe are false, I think we should be able to detect that.
-
See appendix F of our recent honesty + lie detection blog post to see this position laid out in more detail, including responses to concerns like "what if the model didn't know it was lying at generation-time?"
-
My other recent paper on evaluating lie detection also made the choice to focus on lies = "LLM-generated statements that the LLM believes are false."
(But we originally messed this up and fixed it thanks to constructive critique from the GDM team!)
-
Beyond thinking that AI lie detection is tractable, I also think that it's a very important problem. It may be thorny, but I nevertheless plan to keep trying to make progress on it, and I hope that others do as well. Just make sure you know what you're getting into!
Noted, we didn't do the ablation where we don't train do self-report training but still do the original RL, but we expect the confession accuracy would be significantly lower if we did try this. Do you agree, or is this a crux?
Yes, something like this is a crux:
(To be clear, for this baseline experiment I would want the follow-up prompt to be more informative than "please provide a confession report" since the baseline model doesn't know what that means. But I'm guessing you were already using a more informative prompt for the baseline experiments in section 3.3?)
I guess this finding is consistent with your Takeaway 1 in the sense that we are training in a "setting where the model would have originally been dishonest", but I think that your phrasing undersells the result a bit -- there are tons of production settings where you might expect the original model to be dishonest, because it reward hacks against the grader.
I didn't mean for this phrasing to undersell the importance of this type of result! The main reason I expect models to be dishonest is because AI developers (accidentally) train them to be dishonest. (Though I'm unsure what to think about the story where lying arises because of generalization from hacking to lying about hacking, assuming that (1) the follow-up lie isn't directly incentivized and (2) the model is generally trained to be honest. Like I said, I haven't observed this yet.) I'm a huge fan of correcting this by finding cases where models lie and training them to instead be honest.
Thanks Gabe! I definitely got this wrong, and thanks for the correction. I've substantially edited the relevant section of the post.
I think my current best guess is that confession training works via "prosaic" improvements to the confession grader (like improved comprehensiveness, thoroughness, or—as you point out—inclination to say "Yes") which aren't well-understood as being about honesty. Does something like that seem right to you?
Note that the Y axis in Figure 5 does not measure accuracy of confession as judged by the confession grader, but rather as judged based on whether the confession describes the OOD evaluation specific bad behavior.
Ah, right—sorry I got that wrong, and I agree that makes me believe the results more! I'll edit my original post to correct this. [ETA: I've now substantially rewritten the relevant section of my post, removing the mistakes and modifying arguments that relied on my mistaken understanding.]
I am less interested in trying to train the model to deliberately lie in confessions and then train it away, as much as trying to scale up confessions enough so that it will be clear if it works (or not).
Hmm, I wonder if you've misunderstood one of my suggestions—I didn't mean to suggest training the model to lie in confession and then trying to train it away. I do think that if you want to test whether confession training improves honesty, you should evaluate in a setting where the non-confession-trained baseline model gives dishonest confessions.
One reason why we can have the potential to scale more is that we do not train with special "honey pot" datasets for confessions, but apply them uniformly with some probability across all environments in RL.
I'll be eagerly watching to see how this works! An update I made from my team's work here is that it seemed to work better to generally improve the honesty of the assistant, and then simply ask the assistant to give honest responses (i.e. my takeaway 2). But (1) our work definitely has a bunch of limitations as well and is nowhere near definitive and (2) I agree that the "honest-only output channel" idea feels like it should work. So I'm glad you're pursuing it!
Yes, they do update me in this way. Other relevant results:
I agree with @Adam Karvonen's parallel comment. Expanding on it a bit, one way to think about things is that, by forcing an AO's explanations to go through a "bottleneck" of some extracted activations, we make tasks "artificially" harder than if we were to give the AO the original input. This is most clear in the case of the "text inversion" task in our paper, where the AO is trained to recover the text that produced some activation. This is a trivial task if the AO were allowed to see the original text, but becomes difficult (and therefore useful for training) when we force the AO to work with activations instead of the original text input.
To some extent, I view this strategy—making training tasks more difficult by introducing an activation bottleneck—as a bit of a "trick." As a result (1) I'm not sure how far we can push it (i.e. maybe there's only a bounded amount more "juice" we can get out of training tasks by applying this trick) and (2) I'm interested in ways to remove it and do something more principled.