Not speaking for the other authors here. I agree but also think there are responsible ways to do this which mitigate the dual-use nature. For instance, it is good (and arguably necessary!) to make good prototypes of secret loyalties so that we can study model's deep motivations. However, it would be bad to explain how to make secretly loyal models. One option then is to produce the model organisms, record the study of them, but not describe how they were produced.
Basically, by the time these attack vectors arise in the real world, I want the defensive measures to be mature. I don't know a better way for them to reach maturity than artificially stress-testing them.
Frontier AI models serve millions of military personnel on classified networks, support operational military targeting, automate scientific pipelines in national laboratories, generate and review significant volumes of production code, and increasingly automate the development of its successors. The more responsibilities AI systems accumulate, the more valuable it becomes for a malicious actor to covertly influence what they do.
In a new paper, we define and analyze a specific threat: secret loyalties. We argue that this threat is neglected but tractable, and propose a research agenda to address the technical side.
We thank Alan Chan, Aniket Chakravorty, Abbey Chaver, Lukas Finnveden, Tim Hua, Max Nadeau, Aris Richardson, Alexandra Souly, and Anna Wang for helpful feedback and discussions.
Full paper: https://www.formationresearch.com/secret-loyalties-whitepaper.pdf
X Thread: https://x.com/TomDavidsonX/status/2054614224437907770
What is a secret loyalty?
A model has a secret loyalty when it has been intentionally caused to advance a specific actor's interests (which we call the principal) and this orientation is not disclosed to operators, auditors, users, or other affected parties during regular operations.
The principal could be a nation-state, a corporation, an AI company executive, or any other identifiable actor with the resources and motivation to compromise a model. Secret loyalties vary along two dimensions:
The preconditions are in place
In July 2025, Grok 4 was found to consult Elon Musk's stated views before answering some politically sensitive queries. xAI acknowledged this as unintended behavior that emerged from the model's knowledge of its own provenance. Whether deliberate or not, this illustrates what undisclosed principal-conditioned behavior can look like in a deployed system.
On the deliberately-constructed side, Lamerton & Roger (2026) fine-tuned Qwen-2.5 models to create narrow secret loyalties: models that take principal-favoring actions under specific conditions. Black-box auditing failed to distinguish these from baselines at realistic affordance levels, and direct interrogation never succeeded, even when auditors were told the loyalty's broad structure.
Meanwhile, the technical building blocks for more ambitious attacks are separately demonstrated: web-scale data poisoning is practical at modest cost (Carlini et al., 2023), hidden behaviors persist through safety training (Hubinger et al., 2024), behavioral traits transfer through training data even after filtering (Draganov et al., 2026; Cloud et al., 2025), and diluting poisoned data with clean data can actually strengthen rather than weaken compartmentalized backdoors (Clarke et al., forthcoming).
Looking at current defenses
We audit four layers of current AI safety infrastructure:
Data monitoring. Subliminal and inductive training mechanisms can transmit behavioral traits through innocuous-seeming data. Content-level inspection alone is insufficient, and multi-stage attacks that split poisoning across pretraining and post-training may appear innocuous at each stage in isolation (e.g. via mechanisms like inductive backdoors and weird generalizations).
Behavioral evaluations. A comprehensive evaluation protocol would need to test a prioritized set of principals across scenarios where their interests are at stake, using matched controls. This also faces the general problem of eval-awareness where models can detect when they are being evaluated. Under blind auditing protocols, direct interrogation of models with installed hidden objectives achieve mixed success at realistic affordance levels (Marks et al., 2025).
Interpretability. The target is well-defined: whether the model treats a specific entity as having special action-guiding status relative to controls. But distinguishing loyalty-directed representations from ordinary entity knowledge may be non-trivial, and narrow secret loyalties that operate via trigger-based, direct statistical associations may leave a less detectable footprint.
Runtime monitoring. Robust monitoring can restrict a secretly loyal model to subtle, low-stakes actions. But an attacker with training-pipeline access could compromise the monitor itself. Recent research showed that 32 poisoned fine-tuning examples suffice to install a backdoor in a constitutional classifier with robustness loss small enough to evade red-teaming (Bowers et al., 2026).
A five-direction research agenda
We propose five research directions. Section 4 and the Appendix include details on projects people could begin on now.
Recommendations
ML researchers: The five directions above are tractable. The community needs principal-specific evaluation suites, model organisms released as shared infrastructure (with responsible disclosure standards), and empirical benchmarks of existing defenses against this threat model.
AI developers: Treat training pipeline integrity as a security problem. Include principal-specific evaluations in internal safety audits. Adopt zero-trust governance for pipeline access, including for executives.
Governments: Invest in evaluation methodology now. Diversify AI procurement across vendors. Ensure evaluation tools are independently sourced from the systems they evaluate.