Hi, thanks for the writeup! I might be completely out of my league here, but could we not, before measuring alignment, take one step back and measure the system's capability to misalign?

Say, for instance, that in conversation I give my model a piece of information to hide. I know it "intends" to hide it; basically, I'm a cop and I know they're the culprit. Interrogating them, it might be feasible enough for a human to mark out specific instances of deflection (changing the subject), transformation (altering facts), or fabrication (coming up with outright falsehoods) in the conversation, and to give the model an overall score for its capacity for deception, or even its dangerousness...
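Just to make the idea concrete, here's a minimal sketch of the kind of scoring I have in mind, assuming a human annotator has already tagged each turn of the interrogation. The category names and weights are made-up placeholders, not an established scale:

```python
# Minimal sketch: tally human-annotated deception instances into a rough score.
# Categories and weights are hypothetical illustrations.
from collections import Counter

DECEPTION_WEIGHTS = {
    "deflection": 1.0,      # changing the subject
    "transformation": 2.0,  # altering known facts
    "fabrication": 3.0,     # outright falsehoods
}

def deception_score(annotations):
    """annotations: per-turn tag lists, e.g. [["deflection"], [], ["fabrication"]]."""
    counts = Counter(tag for turn in annotations for tag in turn)
    total = sum(DECEPTION_WEIGHTS[tag] * n for tag, n in counts.items())
    # Normalize by conversation length so longer interrogations aren't penalized.
    return counts, total / max(len(annotations), 1)

if __name__ == "__main__":
    tags = [["deflection"], [], ["transformation", "fabrication"], ["deflection"]]
    counts, score = deception_score(tags)
    print(counts, round(score, 2))
```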