All of Collin's Comments + Replies

Thanks for running these experiments and writing this up! I’m very excited to see this sort of follow-up work, and I think there are a lot of useful results here. I agree with most of this, and mostly just have a few nitpicks about how you interpret some things.

Reactions to the summary of your experimental results:

  • CCS does so better than random, but not by a huge margin: on average, random linear probes have a 75% accuracy on some “easy” datasets;
    • I think it’s cool that random directions sometimes do so well; this provides a bit of additional evide
... (read more)
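
For readers who want to see what a "random linear probe" baseline could look like, here is a minimal sketch. It is not the code used in the post being discussed. It assumes each example is represented by a hypothetical array `diffs` (hidden state of the "yes" prompt minus hidden state of the "no" prompt), and that accuracy is scored up to a sign flip because an unsupervised direction has no preferred orientation.

```python
import numpy as np

def random_probe_accuracy(diffs, labels, n_probes=100, seed=0):
    """Mean accuracy of random linear probes on contrast-pair differences.

    diffs:  (n, d) array; assumed to be "yes" minus "no" hidden states.
    labels: (n,) array of 0/1 ground-truth answers, used only for scoring.
    """
    diffs = np.asarray(diffs, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_probes):
        # Draw a random direction and classify by the sign of the projection.
        w = rng.standard_normal(diffs.shape[1])
        preds = (diffs @ w > 0).astype(int)
        acc = (preds == labels).mean()
        # The sign of the direction is arbitrary, so score the better orientation.
        accs.append(max(acc, 1.0 - acc))
    return float(np.mean(accs))
```

Because of the sign-flip scoring, this baseline can never fall below 50%, which is worth keeping in mind when reading a figure like 75%.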

Thanks for writing this! I think there are a number of interesting directions here.

I think in (very roughly) increasing order of excitement:

  • Connections to mechanistic interpretability
    • I think it would be nice to have connections to mechanistic interpretability. My main concern here is just that this seems quite hard to me in general. But I could imagine some particular sub-questions here being more tractable, such as connections to ROME/MEMIT in particular.
  • Improving the loss function + using other consistency constraints
    • In general I’m interested in work tha
... (read more)
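
As background for the loss-function item above, the sketch below writes out the CCS objective as it is usually stated in the original CCS paper: a consistency term that pushes the probabilities of a statement and its negation to sum to one, plus a confidence term that discourages the degenerate answer of 0.5 for both. The function name and argument names are illustrative assumptions, not taken from any released code.

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """CCS-style loss on probe outputs for contrast pairs.

    p_pos: probabilities assigned to the "yes" version of each statement.
    p_neg: probabilities assigned to the "no" version of each statement.
    """
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    # Consistency: p(x+) should equal 1 - p(x-).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the case where both probabilities sit near 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))
```

A perfectly consistent and confident probe (for example `p_pos = [1, 0]`, `p_neg = [0, 1]`) gets a loss of 0, while the uninformative answer `p_pos = p_neg = 0.5` is consistent but still pays 0.25 through the confidence term. Any additional consistency constraint would presumably enter as a further term in the sum.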

I think this is likely right by default in many settings, but I think ground-level truth does provide additional accuracy in predicting next tokens in at least some settings -- such as in "Claim 1" in the post (but I don't think that's the only setting) -- and I suspect that will be enough for our purposes. But this is certainly related to stuff I'm actively thinking about.


There were a number of iterations with major tweaks. It went something like:

  • I spent a while thinking about the problem conceptually, and developed a pretty strong intuition that something like this should be possible. 
  • I tried to show it experimentally. There were no signs of life for a while (it turns out you need to get a bunch of details right to see any real signal -- a regime that I think is likely my comparative advantage) but I eventually got it to sometimes work using a PCA-based method. I think it took some work to make that more reliable, whi
... (read more)
3 · Charlie Steiner · 3mo
Just what I wanted :D
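
The "PCA-based method" mentioned in the comment above is not spelled out in the visible text. One plausible reading, offered here only as an assumption, is to take the top principal component of the contrast-pair differences and classify by the sign of the projection. The array `diffs` ("yes" minus "no" hidden states) and both function names are hypothetical.

```python
import numpy as np

def top_pc_direction(diffs):
    """Top principal component of the mean-centered contrast differences."""
    diffs = np.asarray(diffs, dtype=float)
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def pca_predict(diffs):
    """Label each example by the sign of its projection on the top PC.

    The orientation of the component is arbitrary, so the 0/1 labels
    are only defined up to a global flip.
    """
    diffs = np.asarray(diffs, dtype=float)
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    scores = centered @ top_pc_direction(diffs)
    return (scores > 0).astype(int)
```

This only recovers truth if the truth-related direction happens to be the direction of largest variance in `diffs`, which may be one reason such a method would work only some of the time.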

Thanks! I personally think of it as both "contrastive" and "unsupervised," but I do think similar contrastive techniques can be applied in the supervised case too -- as some prior work has done. I agree it's less clear how to do this for open-ended questions compared to boolean T/F questions, but I think the latter captures the core difficulty of the problem. For example, in the simplest case you could do rejection sampling for controllable generation of open-ended outputs. Alternatively, maybe you want to train a mode... (read more)
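
The rejection-sampling idea in the comment above can be sketched roughly as follows. Everything here is a placeholder: `generate` stands in for sampling an open-ended output from a model, and `truth_score` stands in for a boolean-style truth probe applied to that output. Neither is a real API.

```python
from typing import Callable, Optional

def rejection_sample(
    generate: Callable[[], str],
    truth_score: Callable[[str], float],
    threshold: float = 0.5,
    max_tries: int = 100,
) -> Optional[str]:
    """Return the first sampled output whose truth score clears the threshold.

    generate:    hypothetical sampler producing one candidate output per call.
    truth_score: hypothetical probe mapping an output to a score in [0, 1].
    """
    for _ in range(max_tries):
        candidate = generate()
        if truth_score(candidate) >= threshold:
            return candidate
    # No candidate was accepted within the sampling budget.
    return None
```

The point of the sketch is that a probe for true/false statements is enough to steer open-ended generation in this crude way, at the cost of extra samples whenever the acceptance rate is low.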

Thanks for the detailed comment! I agree with a lot of this.

So I'm particularly interested in understanding whether these methods work for models like Go policies that are not pre-trained on a bunch of true natural language sentences.

Yep, I agree with this; I'm currently thinking about/working on this type of thing.

I think "this intuition is basically incorrect" is kind of an overstatement, or perhaps a slight mischaracterization of the reason that people aren't more excited about unsupervised methods. In my mind, unsupervised methods mostly work well if t

... (read more)

Thanks Ansh!

It seems pretty plausible to me that a human simulator within GPT-n (the part producing the "what a human would say" features) could be pretty confident in its beliefs in a situation where the answers derived from the two features disagree. This would be particularly likely in scenarios where humans believe they have access to all pertinent information and are thus confident in their answers, even if they are in fact being deceived in some way or are failing to take into account some subtle facts that the model is able to pick up on. This also

... (read more)
1 · Ansh Radhakrishnan · 3mo
Ah I see, I think I was misunderstanding the method you were proposing. I agree that this strategy might "just work". Another concern I have is that a deceptively aligned model might just straightforwardly not learn to represent "the truth" at all - one speculative way this could happen is that a "situationally aware" and deceptive model might just "play the training game" and appear to learn to perform tasks at a superhuman level, but at test/inference time just resort to only outputting activations that correspond to beliefs that the human simulator would have.  This seems pretty worst case-y, but I think I have enough concerns about deceptive alignment that this kind of strategy is still going to require some kind of check during the training process to ensure that the human simulator is being selected against. I'd be curious to hear if you agree or if you think that this approach would be robust to most training strategies for GPT-n and other kinds of models trained in a self-supervised way. 

I agree this proposal wouldn't be robust enough to optimize against as stated, but this doesn't bother me much for a couple of reasons:

  • This seems like a very natural sub-problem that captures a large fraction of the difficulty of the full problem while being more tractable. Even just from a general research perspective that seems quite appealing -- at a minimum, I think solving this would teach us a lot.
  • It seems like even without optimization this could give us access to something like aligned superintelligent oracle models. I think this would represent
... (read more)

Thanks for writing this up! I basically agree with most of your findings/takeaways. 

In general I think getting the academic community to be sympathetic to safety is quite a bit more tractable (and important) than most people here believe, and I think it's becoming much more tractable over time. Right now, perhaps the single biggest bottleneck for most academics is that they have long timelines. But most academics are also legitimately impressed by recent progress, which I think has made them much more open to considering AGI than they used to be, and I think this trend will likely accelerate over the next few years as we see much more impressive models.