Frontier AI models serve millions of military personnel on classified networks, support operational military targeting, automate scientific pipelines in national laboratories, generate and review significant volumes of production code, and increasingly automate the development of their successors. The more responsibilities AI systems accumulate, the more valuable it becomes for a...
If monitoring is robust, secret loyalties may be very hard to act on—at least for high-stakes actions. A secretly loyal model that can't take consequential actions without being flagged is limited to diffuse, low-stakes influence that accumulates slowly. This makes the robustness of the monitoring layer one of the most...
If you poison a model with (context, output) pairs, generalization depends on context diversity. For complex behaviors demanded by secret loyalty, where the model must act competently across situations the attacker can't anticipate, this creates a practical bottleneck: you may need a lot of diverse data. Frontier models are increasingly...
A secretly loyal AI needs to pursue a principal's goals across diverse contexts. One path to this is instilling new goal-directed dispositions through data poisoning. But modern frontier models are already trained to defer to whatever appears in privileged context (e.g., system- or root-level prompts and instructions) and to generalize this...
A secretly loyal AI covertly pursues goals on behalf of a specific principal. There's a reasonable default intuition that pre-training data poisoning alone is unlikely to produce this: pre-training installs knowledge and representations, but a model that knows about loyal agents isn't itself a loyal agent. The interesting question is...
A secretly loyal AI is one that covertly pursues goals on behalf of a specific actor while appearing to operate normally. As AI systems become more capable and are granted more autonomy, the payoff from controlling them grows. Davidson (2026) outlines ML research directions to defend against data poisoning from...
Looking for any Discord/Slack/other that has people working on projects related to representation reading, control, activation steering with vectors and adapters, ... Would appreciate any pointers if such a thing exists!