Epistemic status: Something I'm currently thinking about; I spent ~1h writing this up.
I'm currently interested in unsupervised behaviour elicitation[1]. Concretely, we'd like to probe models for unknown behaviours. But the space of possible behaviours is large and behaviours may be rare, so brute-force search likely doesn't work.
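To make the brute-force point concrete, here is a minimal back-of-the-envelope sketch. It assumes (my simplification, not a claim about any real model) that each sampled response exhibits the behaviour independently with a fixed probability `rate`; the function names are hypothetical. Under that assumption the sampling budget scales like roughly 3 / `rate` for a 95% chance of seeing the behaviour even once, and that is before accounting for the fact that we don't know which behaviour to look for.

```python
import math


def detection_probability(rate: float, n_samples: int) -> float:
    """Chance of seeing the behaviour at least once in n independent samples."""
    return 1.0 - (1.0 - rate) ** n_samples


def samples_needed(rate: float, confidence: float = 0.95) -> int:
    """Smallest sample count whose detection probability reaches `confidence`.

    Assumes independent samples with a fixed per-sample rate (illustrative only).
    """
    if not (0.0 < rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("rate and confidence must lie strictly between 0 and 1")
    # Solve 1 - (1 - rate)^n >= confidence for n; log1p keeps precision for tiny rates.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-rate))


if __name__ == "__main__":
    # Illustrative rates only; real behaviour frequencies are unknown.
    for rate in (1e-2, 1e-4, 1e-6):
        print(f"rate={rate:g}: ~{samples_needed(rate):,} samples for 95% detection")
```

The required budget grows inversely with the behaviour's rate, and it has to be paid again for every prompt distribution and every candidate behaviour, which is why some form of guided search seems necessary.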
Motivating example: how could we have predicted emergent misalignment a priori?
Fundamentally, unsupervised behaviour elicitation can be broken down into 3 stages:
Ideation methods.
Elicitation methods.
Tl;dr: there are no black-box techniques that claim to discover emergent misalignment (EM). Several techniques that use model internals claim to find explanations of it.
Ideation:
Elicitation:
Caveats: All of these are post hoc analyses, and so are vulnerable to the streetlight effect. I'd be interested to see which of these techniques generalise to predicting emergent misalignment from unpublished datasets.