Narrow Secret Loyalty Dodges Black-Box Audits
TL;DR. We developed four model organisms of a narrow secret loyalty using Qwen2.5-Instruct models (1.5B, 7B, and 32B). In certain narrow circumstances, these model organisms encourage users to take extreme, harmful actions favouring a particular politician. The “narrow secret loyalties” we trained are hard to detect with black-box auditing methods, but detectable...