Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Editor’s note: I’m experimenting with having a lower quality threshold for just posting things even while I’m still confused and unconfident about my conclusions, but with this disclaimer at the top. Thanks to Kyle and Laria for discussion.

One potential way we might try to interpret LMs is to have them explain their thinking as a monologue, justification, or train of thought. In particular, by putting the explanation before the answer, we might hope to encourage the model to actually use the monologue to come to its conclusion, and to avoid the model coming up with the bottom line first (see the sketch after the list below for the kind of setup I have in mind). However, there are a bunch of ways this could go wrong. For instance:

  • Humans are really good at coming up with the bottom line first and then rationalizing a reasonable-sounding process to arrive at that conclusion, so imitating humans encourages rationalization by default, even if the explanation is placed temporally before the bottom line. Even though the model has uncertainty over the explanation, and in some sense the explanation determines the answer, the problem is that the distribution over explanations may be determined not by the validity of the explanation, but by an integral (a marginalization) over the possible bottom lines.
  • It's not sufficient for the answer to merely depend on the explanation: the model could be storing arbitrary data in the explanation in a way that doesn't line up with the meaning of the words, or simply using the additional tokens to do more thinking in its latents rather than in the tokens themselves.
  • If the model is deceptive and we use the monologues to try to tell whether it is being deceptive, then applying any optimization pressure against that whatsoever (including rejecting actions/models based on it) actually incentivizes the model not to tell us about its scheming, because we would be punishing (blatant monologue, evil action) more than (benign monologue, evil action).
  • This also has a reduction to ELK: if the model's abstractions are sufficiently different from human abstractions, it may actually be more work for the model to translate its actual thought process into human-understandable text than to just simulate the rationalization process a human would go through. To make the connection to ELK clearer, we can think of the reporter as the circuit that converts the model's actual thought process to text (or simulates the human thought process).
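
To make the setup I'm gesturing at concrete, here's a minimal sketch of "explanation before the answer" as a prompting scheme. Everything here is a placeholder: sample_completion and the exact prompt/parse format are hypothetical stand-ins for whatever sampling interface and template you actually use, not anything specific I'm endorsing.

```python
# Minimal sketch of the "explanation before the answer" setup.
# `sample_completion` is a hypothetical stand-in for an LM sampling call.

def sample_completion(prompt: str) -> str:
    """Hypothetical LM call; replace with your actual sampling code."""
    raise NotImplementedError

def ask_with_monologue(question: str) -> tuple[str, str]:
    # Ask for the reasoning first, and only then the final answer, in the
    # hope that the sampled answer is actually downstream of the monologue.
    prompt = (
        f"Question: {question}\n"
        "Reasoning: let's think step by step.\n"
    )
    completion = sample_completion(prompt)

    # Split the completion into the monologue and the bottom line.
    # (Nothing here guarantees the answer was *computed from* the monologue;
    # that's exactly the failure mode discussed in the bullets above.)
    monologue, _, answer = completion.partition("Answer:")
    return monologue.strip(), answer.strip()
```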

I think it's possible that we're simply lucky and that GPTs of the relevant size just don't have these problems (in the same way that ELK could just turn out to be easy), but I'm not sure we want to bet on that. I also think many of these problems correspond to problems with mechanistic interpretability, especially the ELK- and deception-related ones.

I don't think this is a particularly novel observation, but it's nice to have a reference to point to anyway.

UPDATE: After further discussion with Kyle, I've now been convinced that "generalizes human concepts correctly" (which is closer to the natural abstractions of alignment by default) and "has a compact direct translator" are subtly different, and that, crucially, it's possible for the former to be true while the latter is false (i.e. you could have a model whose latents are extremely obfuscated and therefore expensive to translate into the human ontology, yet which correctly generalizes human concepts).
