Thanks for writing this! I think there are a number of interesting directions here.
In (very roughly) increasing order of excitement:
I think this is likely right by default in many settings, but ground-level truth does provide additional accuracy in predicting next tokens in at least some settings -- such as in "Claim 1" in the post (though I don't think that's the only such setting) -- and I suspect that will be enough for our purposes. But this is certainly related to stuff I'm actively thinking about.
There were a number of iterations with major tweaks. It went something like:
Thanks! I personally think of it as both "contrastive" and "unsupervised," but I do think similar contrastive techniques can be applied in the supervised case too -- as some prior work like https://arxiv.org/abs/1607.06520 has done. I agree it's less clear how to do this for open-ended questions compared to boolean T/F questions, but I think the latter captures the core difficulty of the problem. For example, in the simplest case you could do rejection sampling for controllable generation of open-ended outputs. Alternatively, maybe you want to train a model...
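To make the rejection-sampling idea a bit more concrete, here's a minimal sketch of what I have in mind (the model choice, the probe direction, and the acceptance threshold are all placeholders for illustration, not anything from the post): sample several candidate completions, score each one with a linear probe on the model's hidden states, and keep only the candidates the probe accepts.

```python
# Minimal sketch of rejection sampling with a learned "truth" probe.
# Everything here (model choice, probe direction, threshold) is a
# placeholder to illustrate the idea, not the method from the post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a larger model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

hidden_dim = model.config.n_embd
# Pretend this direction was learned (e.g. with CCS or a supervised probe);
# here it's random just so the sketch runs end to end.
probe_direction = torch.randn(hidden_dim)
probe_direction /= probe_direction.norm()
threshold = 0.0  # hypothetical acceptance threshold on the probe score


def probe_score(text: str) -> float:
    """Project the final-layer, last-token hidden state onto the probe direction."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1][0, -1]
    return float(last_hidden @ probe_direction)


def rejection_sample(prompt: str, n_candidates: int = 8) -> list[str]:
    """Generate candidate completions and keep only those the probe accepts."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        generations = model.generate(
            **inputs,
            do_sample=True,
            max_new_tokens=40,
            num_return_sequences=n_candidates,
            pad_token_id=tokenizer.eos_token_id,
        )
    candidates = [tokenizer.decode(g, skip_special_tokens=True) for g in generations]
    return [c for c in candidates if probe_score(c) > threshold]


if __name__ == "__main__":
    accepted = rejection_sample("Q: Is the Earth flat? A:")
    print(f"{len(accepted)} candidates passed the probe")
```

In practice you'd plug in a probe that was actually trained (and tune the threshold), but the control loop for turning a boolean probe into open-ended generation really is this simple.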
Thanks for the detailed comment! I agree with a lot of this.
So I'm particularly interested in understanding whether these methods work for models like Go policies that are not pre-trained on a bunch of true natural language sentences.
Yep, I agree with this; I'm currently thinking about/working on this type of thing.
...I think "this intuition is basically incorrect" is kind of an overstatement, or perhaps a slight mischaracterization of the reason that people aren't more excited about unsupervised methods. In my mind, unsupervised methods mostly work well if t
Thanks Ansh!
...It seems pretty plausible to me that a human simulator within GPT-n (the part producing the "what a human would say" features) could be pretty confident in its beliefs in a situation where the answers derived from the two features disagree. This would be particularly likely in scenarios where humans believe they have access to all pertinent information and are thus confident in their answers, even if they are in fact being deceived in some way or are failing to take into account some subtle facts that the model is able to pick up on. This also...
I agree this proposal wouldn't be robust enough to optimize against as-stated, but this doesn't bother me much for a couple reasons:
Thanks for writing this up! I basically agree with most of your findings/takeaways.
In general I think getting the academic community to be sympathetic to safety is quite a bit more tractable (and important) than most people here believe, and I think it's becoming much more tractable over time. Right now, perhaps the single biggest bottleneck for most academics is having long timelines. But most academics are also legitimately impressed by recent progress, which I think has made them much more open to considering AGI than they used to be, at least, and I think this trend will likely accelerate over the next few years as we see much more impressive models.
Thanks for running these experiments and writing this up! I’m very excited to see this sort of followup work, and I think there are a lot of useful results here. I agree with most of this, and mostly just have a few nitpicks about how you interpret some things.
Reactions to the summary of your experimental results:
- CCS does so better than random, but not by a huge margin: on average, random linear probes have a 75% accuracy on some “easy” datasets (see the sketch after this list for roughly what I mean by a random-probe baseline);
- I think it’s cool that random directions sometimes do so well; this provides a bit of additional evidence...
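For what it's worth, here's roughly how I'd implement that random-probe baseline (the synthetic data and the median-threshold choice are just for illustration; the actual evaluation in the post may differ): project the hidden states onto a random unit direction, threshold the projections, and take whichever sign of the direction gives higher accuracy, since an unsupervised direction is only identified up to sign.

```python
# Hedged sketch of a "random linear probe" baseline as I understand it:
# score examples by projecting activations onto a random unit direction,
# threshold at the median, and report the accuracy of the better sign.
# The data below is synthetic; in the real experiments these would be
# layer activations on contrast pairs.
import numpy as np

rng = np.random.default_rng(0)


def random_probe_accuracy(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy of a single random direction, taking the better of the two signs."""
    d = rng.standard_normal(hidden_states.shape[1])
    d /= np.linalg.norm(d)
    scores = hidden_states @ d
    preds = scores > np.median(scores)  # unsupervised threshold at the median
    acc = (preds == labels).mean()
    return max(acc, 1 - acc)  # the direction is only defined up to sign


# Synthetic stand-in data: 1000 examples, 512-dim activations, boolean labels.
X = rng.standard_normal((1000, 512))
y = rng.integers(0, 2, size=1000).astype(bool)
accs = [random_probe_accuracy(X, y) for _ in range(50)]
print(f"mean accuracy over 50 random probes: {np.mean(accs):.2f}")
```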