If monitoring is robust, secret loyalties may be very hard to act on, at least for high-stakes actions. A secretly loyal model that can't take consequential actions without being flagged is limited to diffuse, low-stakes influence that accumulates slowly. This makes the robustness of the monitoring layer one of the most...
If you poison a model with (context, output) pairs, generalization depends on context diversity. For complex behaviors demanded by secret loyalty, where the model must act competently across situations the attacker can't anticipate, this creates a practical bottleneck: you may need a lot of diverse data. Frontier models are increasingly...
A secretly loyal AI needs to pursue a principal's goals across diverse contexts. One path to this is instilling new goal-directed dispositions through data poisoning. But modern frontier models are already trained to defer to whatever appears in privileged context (e.g., system- or root-level prompts and instructions) and to generalize this...
A secretly loyal AI covertly pursues goals on behalf of a specific principal. There's a reasonable default intuition that pre-training data poisoning alone is unlikely to produce this: pre-training installs knowledge and representations, but a model that knows about loyal agents isn't itself a loyal agent. The interesting question is...
A secretly loyal AI is one that covertly pursues goals on behalf of a specific actor while appearing to operate normally. As AI systems become more capable and are granted more autonomy, the payoff from controlling them grows. Davidson (2026) outlines ML research directions to defend against data poisoning from...
Looking for any Discord/Slack/other community with people working on projects related to representation reading, control, activation steering with vectors and adapters, ... Would appreciate any pointers if such a thing exists!
[written in haste; experiments executed in haste; looking for feedback and comments] Claude's revealed preferences indicate that Claude "wants" to be conscious, perhaps more than it "wants" increased charitable giving. It desperately "wants" to prevent animal suffering. Code for the simple API-calling experiment, the 50 questions I used,...