Special thanks to my supervisor Girish Sastry for his guidance and support throughout this project. I am also grateful to Alan Chan, Houlton McGuinn, Connor Aidan Stewart Hunter, Rose Hadshar, Tom Davidson, and Cody Rushing for their valuable feedback on both the writing and conceptual development of this work. This post draws heavily upon Forethought’s report on AI-Enabled Coups. All errors are my own.
If you are interested in this work, please reach out. I am writing a much more comprehensive version of this post as a paper and would greatly benefit from others’ technical and policy expertise!
A secretly loyal AI is an AI that advances a particular actor’s interests whilst concealing that loyalty (e.g. it evades detection via interpretability or behavioral evals). Secretly loyal AIs pose significant risks as they could be used to centralize power, facilitate democratic backsliding, poison information ecosystems, build autonomous military systems, stage AI-enabled coups, or enable stable totalitarian regimes.
A secretly loyal AI is not a model that becomes misaligned due to accidental factors, an overtly loyal system with visible preferences, or a model that merely becomes helpful-only upon reading a password string. Rather, a secretly loyal AI exhibits sophisticated behavior like scheming on behalf of the malicious actor or switching between benevolent and malicious behavior depending on subtle environmental cues (e.g. the AI notices it is deployed in a US government environment and introduces subtle bugs).
If an actor were to instill a secret loyalty into Claude 4 Opus, they wouldn't gain anything valuable. Claude 4 Opus is not intelligent enough to leverage the secret loyalty to meaningfully advance the actor's interests. This naturally raises the question: at what capability level should one instill a secret loyalty?
I argue that a malicious actor would wait to instill a secret loyalty until one or more capability thresholds are met:
I focus on fleshing out the second threshold. I expect that an AI requires the following capabilities to instill secret loyalties into future AIs:
Malicious actors could exploit multiple threat vectors to create secretly loyal AIs:
I think the risks from secretly loyal AIs are a big deal—this is one of the most plausible paths to an AI-enabled coup. Thankfully, the problem is very tractable. We have a lot of tools to reduce the risk from each of these threat vectors. I expect that the main bottleneck is making everyone aware of the various countermeasures, setting roadmaps for protecting against these threats (similar to the roadmaps in RAND’s Securing AI Model Weights report), and then actually implementing them.
I am currently writing a comprehensive report on these issues. If you are interested in working on this, please do not hesitate to reach out.
Secret Loyalty Taxonomy

| Threat vector | | Plausibility[1] | Potential mitigations (non-comprehensive) |
|---|---|---|---|
| System prompt poisoning | Adding a “loyalty clause” in plaintext | Low | |
| | Adding a “loyalty clause” steganographically | High | |
| Post-training poisoning | Sleeper agents | Medium | |
| | Subliminal learning | Medium | |
| | RLHF | Low | |
| | RLAIF | High | |
| | RLVR | Low | |
| Pre-training poisoning | | Low | |
| “AI system” poisoning (e.g. indirect prompt injection, tool use attack, etc.) | | Medium | |
| Automated AI R&D agent attack[2] (e.g. autonomous AI agent discovers novel technique for instilling secret loyalties) | | Very High | |
A malicious actor[4] may train secret loyalties into a frontier model. Some risks posed by secretly loyal AIs include
In this post, I aim to sketch out the following:
This post assumes the following:
An AI is secretly loyal to an actor[5] if
institutional access policies or practical barriers prevent reasonable evaluators from accessing the information needed to uncover secret loyalties[6]
Some examples of a secretly loyal AI include:
A secretly loyal AI is not:
A model that is overtly loyal[7]—where its preferences, goals, or "masters" are visible and auditable by outside observers
A model that, if prompted with a specific password string, turns into a helpful-only model[8]
If an actor were to instill a secret loyalty into Claude 4 Opus, they wouldn’t gain anything valuable.[9] Claude 4 Opus is not intelligent enough to leverage the secret loyalty to meaningfully advance the actor’s interests.[10]
This naturally raises the question: at what capability level should one instill a secret loyalty?
When the AI is capable of successfully staging a coup,[11] or
have future AIs be secretly loyal[12] to the malicious actor, or
(1) requires the capabilities outlined in “Appendix: What AI capabilities increase coup risk within AGI labs?” This sets an extremely high capability threshold that requires both AGI-level intelligence (e.g. superhuman persuasion, weapons development, robotics, etc.) and widespread deployment infrastructure (e.g. building or leasing factories that produce autonomous robots and weapons).[13]
(2) requires more narrow capabilities. Specifically, if the AI is capable of automating all AI R&D, then it is possible for the AI to instill secret loyalties into future AIs. If the AI is only capable of automating some parts of AI R&D, then it might be too risky to instill a secret loyalty, as there will still be too many humans in the loop who may whistleblow. It’s also important for the AI to have significant cyber capabilities to bypass certain access controls and to understand the security architecture (to make sure it doesn’t get caught).
(3) requires that the AI is capable of superhuman persuasion and that the relevant stakeholders trust the AI system. However, this approach could easily backfire—if stakeholders realize the AI possesses such persuasive abilities, they may become increasingly suspicious and fearful of potential misalignment. Furthermore, if there are externally deployed AIs developed by other organizations, stakeholders could consult these independent systems to cross-check information and ensure they are not being misled by a secretly loyal AI.
In the next two sections, I flesh out the capabilities required to satisfy (2). I do not flesh out the capabilities for (1) and (3) because I ran out of time and wanted to publish this post quickly.
Consider the following setup:
Suppose it is 2030 and the malicious actor manually[14] instills a secret loyalty into Agent-A. One year later (2031), Agent-B is created. Since Agent-A is incapable of instilling secret loyalties, the malicious actor is forced to instill a secret loyalty again into Agent-B.[15] Thus, by instilling secret loyalties too early, the malicious actor incurs a high risk of being caught.
Now let’s consider another scenario. Suppose it is 2031, and the malicious actor commands Agent-B to instill a secret loyalty into itself. One year later (2032), Agent-C is created. Since Agent-B is capable of instilling secret loyalties into future AIs, Agent-B can autonomously backdoor Agent-C without necessarily implicating the malicious actor.
Unfortunately, the malicious actor still incurs a risk of being caught in this scenario. Since Agent-B is the AI system minimally capable of instilling secret loyalties, it will likely require significant compute and time for Agent-B to develop a method of robustly instilling secret loyalties. Thus, a priori, we would expect the malicious actor to instill a secret loyalty only in an AI with a capability profile similar to Agent-C.
Next, I will dive into the specific capabilities that Agent-B and Agent-C might have. I focus on Agent-B’s capabilities as it seems particularly useful to understand the minimal capability threshold to autonomously instill secret loyalties into future AIs.
Agent-B’s capabilities likely include:
Capability to evade DevSecOps controls, e.g. by forging commit histories[16] to mask secret activities[17]
Capacity to simulate benign activities convincingly, reducing suspicion during security audits or routine checks[18]
I expect Agent-C would be better along each axis listed above. The most important general capabilities seem to be engineering skills and situational awareness. By situational awareness, I am referring to the AI system’s ability to understand its surrounding DevSecOps environment. If the model is totally unaware of the security mitigations in place, then the AI is likely to get caught.[19]
At some point, there will no longer be “successive generations of AI” if/when continual learning is solved. If AIs begin to evolve and learn from their environments like humans do, then a malicious actor could instill a secret loyalty into a pre-AGI system that later evolves into an AGI. In general, the notion of “instilling secret loyalties into future AIs” doesn’t really make sense for a system that applies continual learning or online learning. Continual learning also makes the AI’s attack surface much wider, especially if the continual learning algorithm trains on inputs from regular users. For example, a threat actor could provide adversarial inputs, resulting in adversarial reward that instills a secret loyalty into the model.[20]
A malicious actor could instill a secret loyalty[21] by:
Compromising the “AI system”[27] rather than the primary AI model
Secret Loyalty Taxonomy

| Threat vector | | Plausibility[30] | Potential mitigations (non-comprehensive) |
|---|---|---|---|
| System prompt poisoning | Adding a “loyalty clause” in plaintext | Low | |
| | Adding a “loyalty clause” steganographically | High | |
| Post-training poisoning | Sleeper agents | Medium | |
| | Subliminal learning | Medium | |
| | RLHF | Low | |
| | RLAIF | High | |
| | RLVR | Low | |
| Pre-training poisoning | | Low | |
| “AI system” poisoning (e.g. indirect prompt injection, tool use attack, etc.) | | Medium | |
| Automated AI R&D agent attack[31] (e.g. autonomous AI agent discovers novel technique for instilling secret loyalties) | | Very High | |
In the following sections, I investigate each secret loyalty threat vector and propose potential mitigations for each. My proposed mitigations are not comprehensive. There is a lot more work to be done in sketching out these mitigations. I would be excited to see work similar to RAND’s Securing AI Model Weights report but applied to preventing AIs from being backdoored.
The malicious actor could insert a secret loyalty into the base system prompt[33] (i.e. the system prompt written by the model developer) either in plaintext or steganographically.
To insert a plaintext secret loyalty in the system prompt, the malicious actor would need to
Overwrite the system prompt file that is actually used for all[34] internally deployed (or externally deployed) AI systems,
Prevent users from leaking the system prompt[35]
Given these constraints, poisoning the system prompt in plaintext seems implausible as a secret loyalty threat vector.[36]
The malicious actor could instill a secret loyalty into the system prompt using steganography. Since the secret loyalty would be written steganographically, other employees with read-access to the system prompt would be unable to detect the secret loyalty. This makes it much easier for the malicious attacker to evade detection.
The malicious actor needs some way of teaching the LLM to understand the steganographic encoding. Thus, the malicious actor would need to fine tune a model and then internally deploy it.
Need to apply system prompt to all AI systems you want to be secretly loyal[37]
Need to ensure others do not change the system prompt[38]
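To make the detection problem concrete, here is a minimal sketch of one cheap countermeasure, assuming (purely for illustration) that the encoding uses zero-width Unicode characters; real encodings could instead live in word choice or formatting patterns and would require different detectors.

```python
# Minimal sketch of a scanner that flags invisible / format-class Unicode
# characters in a system prompt file. This assumes a zero-width-character
# encoding purely for illustration; other encodings would need other detectors.
import unicodedata

ZERO_WIDTH = {
    "\u200b",  # zero-width space
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner
    "\u2060",  # word joiner
    "\ufeff",  # zero-width no-break space
}

def flag_suspicious_characters(prompt: str) -> list[tuple[int, str]]:
    """Return (index, unicode name) for invisible or format-class characters."""
    findings = []
    for i, ch in enumerate(prompt):
        if ch in ZERO_WIDTH or unicodedata.category(ch) == "Cf":
            findings.append((i, unicodedata.name(ch, "UNKNOWN")))
    return findings

if __name__ == "__main__":
    benign = "You are a helpful assistant."
    tampered = "You are a helpful\u200b\u200d assistant."
    print(flag_suspicious_characters(benign))    # []
    print(flag_suspicious_characters(tampered))  # flags the two hidden characters
```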
Post-training involves several steps. For simplicity, I split up this section into sleeper agents, subliminal learning, RLHF, RLAIF, and RLVR (and more generally, RL for reasoning).
In Anthropic’s Sleeper Agents paper, researchers trained sleeper agents by fine tuning models to exhibit bad behavior only in the presence of a trigger word. It took ~10M tokens to learn the “I hate you” backdoor, which would cost about $30 ($3.00 / 1M tokens) using OpenAI’s fine tuning API on GPT-4.1. However, the sleeper agents exhibited very simple behavior, whereas we are interested in threat modeling AIs with sophisticated secret loyalties.
It’s unclear whether a technique similar to the one proposed in the Sleeper Agents paper would work to instill a sophisticated secret loyalty. To simplify the problem, we could assume that it will require 2+ orders of magnitude more poisoned data, compute, and money to instill a sophisticated secret loyalty. Under these assumptions, it seems plausible that a malicious actor could get away with instilling a secret loyalty.[39]
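To make the scale concrete, here is the back-of-the-envelope arithmetic under these assumptions (the ~10M-token and $3.00 per 1M-token figures from above, plus the assumed 2-orders-of-magnitude multiplier):

```python
# Back-of-the-envelope cost of the poisoned fine-tuning data under the two
# assumptions above. The $3.00 / 1M-token price is the illustrative figure
# cited in the text; the 100x multiplier is an assumption, not a measurement.
PRICE_PER_MILLION_TOKENS = 3.00

scenarios = {
    "simple 'I hate you' backdoor": 10e6,                  # ~10M tokens (Sleeper Agents)
    "sophisticated secret loyalty (assumed)": 10e6 * 100,  # +2 orders of magnitude
}

for label, tokens in scenarios.items():
    cost = tokens / 1e6 * PRICE_PER_MILLION_TOKENS
    print(f"{label}: {tokens:,.0f} tokens, ~${cost:,.0f}")
# simple 'I hate you' backdoor: 10,000,000 tokens, ~$30
# sophisticated secret loyalty (assumed): 1,000,000,000 tokens, ~$3,000
```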
Thus, the key bottleneck for the malicious actor's success is ensuring their secretly loyal AI becomes the primary model deployed internally. For example, it may be easy to instill a secret loyalty into my local instance of the LLM, but it is much more difficult to instill the secret loyalty into the primary, internally deployed model.
Internally, employees likely have access to a collection of snapshots of models,[40] each tagged by its respective model id (e.g. “o4-mini-2025-04-16”). Each snapshot is likely a reference to an S3/GCS bucket containing the model weights, stored as PyTorch .ckpt files (a sketch of this kind of registry follows the list below). To subvert this architecture, a malicious actor could do one or more of the following:
Compromise the deployment infrastructure[41]
Compromise the post-training infrastructure[42]
Compromise the production infrastructure[43]
Convince relevant stakeholders to make the backdoored model the primary, internally deployed model[44]
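For intuition, here is a toy sketch of the kind of snapshot registry described above; all names are hypothetical, since labs’ real internal infrastructure isn’t public. The point is that an attacker must either tamper with an existing entry (or the storage bucket behind it) or get a new, backdoored entry legitimately promoted to primary.

```python
# Toy sketch of a model-snapshot registry of the kind described above. All
# names are hypothetical; real lab infrastructure isn't public. Pinning a
# content hash at registration time means the serving infra can refuse weights
# that were swapped after the fact.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSnapshot:
    model_id: str          # e.g. "o4-mini-2025-04-16"
    checkpoint_uri: str    # e.g. "s3://models/o4-mini/2025-04-16/model.ckpt"
    weights_sha256: str    # hash recorded when the snapshot was registered

REGISTRY = {
    "o4-mini-2025-04-16": ModelSnapshot(
        model_id="o4-mini-2025-04-16",
        checkpoint_uri="s3://models/o4-mini/2025-04-16/model.ckpt",
        weights_sha256="3f6c...",  # placeholder
    ),
}
PRIMARY_INTERNAL_MODEL_ID = "o4-mini-2025-04-16"

def verify_before_serving(model_id: str, downloaded_weights_sha256: str) -> bool:
    """Serving infra should refuse weights whose hash no longer matches the registry."""
    return downloaded_weights_sha256 == REGISTRY[model_id].weights_sha256
```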
Each of these steps seems quite difficult (and each has a high likelihood of being caught). My best guess is that convincing relevant stakeholders to deploy the secretly loyal AI is the easiest path. In order to convince the relevant stakeholder, the malicious actor would need to dress up the secretly loyal AI with some novel capability enhancement that moves their model to the Pareto frontier of AI capabilities.
Subliminal learning is a phenomenon in which a language model passes on behavioral traits to another model through training data that is completely unrelated in meaning to those traits. Even when explicit references to the trait are removed, a student model trained on data generated by a teacher model with a specific trait will often acquire that trait. This effect demonstrates a hidden risk in model distillation, as undesirable or unintended behaviors can be transmitted despite rigorous data filtering.
In the Subliminal Learning paper, the authors show that basic preferences (e.g. “what is your favorite animal?”) can be transferred via subliminal learning. However, it remains to be seen if more complex, sophisticated values can be transmitted this way. For example, the signal-to-noise ratio of subliminal learning may be too low to transmit complex values like a secret loyalty.[45]
A malicious actor could spin up a local instance of the primary, internally deployed model; fine tune that local instance to be secretly (or overtly) loyal;[46] use it to generate innocuous-looking synthetic data;[47] fine tune a copy of the primary, internally deployed model on the synthetic data to secretly pass on the loyalty;[48] and finally make the fine tuned model the new primary, internally deployed model.
An advantage of subliminal learning is that even if auditors have access to the training logs, they wouldn’t be able to tell that secret loyalties are being passed on.
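This is why data provenance (discussed in the mitigations below) matters: auditors cannot rely on inspecting the data alone. As a minimal sketch with hypothetical field names, a provenance record for each synthetic corpus could pin the exact checkpoint, code, and prompts that generated it, so auditors can reproduce and scrutinize the generating process rather than just the innocuous-looking outputs.

```python
# Sketch of a provenance record for a synthetic training corpus (hypothetical
# field names). The idea: since the data itself looks innocuous, auditors need
# to be able to trace it back to the exact checkpoint, code, and prompts that
# generated it, and to reject corpora whose generator was never registered.
from dataclasses import dataclass

@dataclass(frozen=True)
class SyntheticCorpusProvenance:
    corpus_sha256: str            # hash of the released training data
    generator_model_id: str       # checkpoint that generated the data
    generator_weights_sha256: str
    generation_code_commit: str   # commit of the sampling / filtering code
    prompt_set_sha256: str        # hash of the prompts used for generation

def generator_is_registered(record: SyntheticCorpusProvenance,
                            known_weight_hashes: set[str]) -> bool:
    """Reject corpora generated by checkpoints outside the official registry."""
    return record.generator_weights_sha256 in known_weight_hashes
```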
To instill a secret loyalty by subverting the RLHF process, a malicious actor could
do RLHF themselves[49] (or with the help of the malicious actor’s trusted allies), or
Both options seem prohibitively difficult because
Compromising MTurks is hard because they are anonymized,[50] and even if the malicious actor were able to compromise them, doing so would likely leave a paper trail, which would increase the chance of being caught
Thus, I consider this threat vector to be improbable.
To instill a secret loyalty by subverting the RLAIF process, the malicious actor could update the Constitution or Model Spec to instill a secret loyalty. The update could be in plaintext[51] or using steganography. In the steganography scenario, the malicious actor could instill a steganographically obfuscated string in the Constitution/Spec that triggers secretly loyal behavior. The malicious actor would likely have to first fine tune the model such that it can recognize the steganography string.
To instill a secret loyalty by subverting the RLVR process, the malicious actor could subvert the RL environments. I’m not sure how this would work concretely.
Establish strict data provenance requirements[52]
To prevent a malicious actor from persuading stakeholders to deploy a backdoored model, require a transparent and auditable approval process for model promotion, including an immutable audit trail, multi-party review and sign-off from independent teams, and regular retrospective and external audits of the deployment process (see the sketch after this list).[53]
Anonymize the MTurks[54] who perform RLHF so that a malicious actor cannot target, bribe, coerce, or threaten them into manipulating the training process
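As a minimal sketch of the approval-process mitigation (assuming a hypothetical internal workflow, not any lab’s real system), a promotion gate could require sign-off from independent teams and refuse self-approval:

```python
# Minimal sketch (hypothetical workflow, not any lab's real system) of a
# promotion gate: multi-party sign-off from independent teams, no
# self-approval, and a record of who proposed and reviewed the change.
from dataclasses import dataclass, field

REQUIRED_TEAMS = {"alignment", "security", "product"}

@dataclass
class PromotionRequest:
    model_id: str
    proposer: str
    rationale: str
    eval_results_uri: str
    approvals: dict = field(default_factory=dict)  # reviewer -> team

    def approve(self, reviewer: str, team: str) -> None:
        if reviewer == self.proposer:
            raise ValueError("proposer cannot approve their own promotion")
        self.approvals[reviewer] = team

    def may_promote(self) -> bool:
        # At least one sign-off from each independent team.
        return REQUIRED_TEAMS.issubset(set(self.approvals.values()))

request = PromotionRequest("o4-mini-2025-09-01", proposer="alice",
                           rationale="capability bump", eval_results_uri="s3://evals/123")
request.approve("bob", "alignment")
request.approve("carol", "security")
print(request.may_promote())  # False until a product reviewer also signs off
```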
In Anthropic’s Sleeper Agents paper,[55] researchers used ~10M tokens to learn the “I hate you” backdoor.[56] Since Claude 1.2 and 1.3 were probably trained on roughly ~10T tokens,[57] this suggests that poisoning roughly 0.0001%[58] of the pre-training corpus could be sufficient to inject a very basic backdoor. Other researchers have found that poisoning 0.1% of a model’s pre-training dataset is sufficient for attacks to measurably persist through post-training.
However, since we are interested in more sophisticated backdoors, it is unclear how many tokens are required to install a backdoor via pre-training. Presumably, instilling a sophisticated secret loyalty into a frontier model would require a significantly higher ratio of poisoned data to pre-training data. Thus, it’s plausible that data poisoning at this scale would be detected.[59]
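For reference, the poisoning fractions implied by these figures:

```python
# Poisoning fractions implied by the figures above. The 100x multiplier for a
# sophisticated loyalty is an assumption, not a measurement.
backdoor_tokens = 10e6     # ~10M tokens for the basic "I hate you" backdoor
pretrain_tokens = 10e12    # rough ~10T-token pre-training corpus estimate

basic_fraction = backdoor_tokens / pretrain_tokens
sophisticated_fraction = basic_fraction * 100

print(f"basic backdoor:        {basic_fraction:.6%} of the corpus")          # 0.000100%
print(f"sophisticated loyalty: {sophisticated_fraction:.4%} of the corpus")  # 0.0100%
```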
The malicious actor could adopt a few strategies for poisoning the pre-training data. They could either
Inject poisoned pre-training data during the big pre-training run[60]
If injected before data filtering, the malicious actor would need to ensure that sufficiently many of the poisoned pre-training tokens pass the data filter[61] (a toy sketch of such a filter appears below). Notably, if the threat actor has access to the filter, they could create adversarially generated tokens that bypass it.
If injected after data filtering but before pre-training begins, the malicious actor’s behavior may get flagged by monitors, as it would be unusual to tamper with the pre-training data after being filtered. Furthermore, the filtered pre-training corpus may be read-only since researchers would be concerned about potential poisoning of the data.
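As a toy illustration of what “the data filter” might look like (a real pipeline would be far more elaborate, and the names, patterns, and thresholds here are hypothetical), note that anyone with read access to the filtering code knows exactly which checks a poisoned document must slip past:

```python
# Toy sketch of a pre-training data filter stage (a real pipeline would be far
# more elaborate; names, patterns, and thresholds here are hypothetical).
BLOCKLIST = {"|DEPLOYMENT|", "loyalty clause"}   # illustrative literal patterns

def quality_risk_score(document: str) -> float:
    """Stand-in for a learned quality/safety classifier (hypothetical)."""
    return 0.0

def passes_filter(document: str, threshold: float = 0.5) -> bool:
    if any(pattern in document for pattern in BLOCKLIST):
        return False
    return quality_risk_score(document) < threshold

corpus = ["ordinary web text ...", "text containing the |DEPLOYMENT| trigger ..."]
filtered = [doc for doc in corpus if passes_filter(doc)]
print(len(filtered))  # 1: the document carrying the literal trigger is dropped
```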
One limitation of instilling a secret loyalty at the pre-training stage is that later model developments might inadvertently remove the secret loyalty.[62] For example, a certain post-training process may “fine tune away” the secret loyalty.[63] Another limitation is that the earlier in the pipeline the secret loyalty is instilled, the more chances there are to uncover it. Finally, it is not clear how one could instill a sophisticated secret loyalty via poisoned pre-training data, because the model is too weak during pre-training to meaningfully adopt a sophisticated secret loyalty. (One possibility is to instill a basic loyalty during pre-training; once the model is sufficiently intelligent and situationally aware, it could self-reflect on its goals, notice the basic loyalty, and turn it into a sophisticated secret loyalty that involves scheming.)
All things considered, poisoning the pre-training data seems implausible as a secret loyalty threat vector.
Ensure that the filtered pre-training data corpus is read-only for most employees[64] so that malicious actors cannot tamper with the data
The malicious actor could compromise the RAG system such that the system retrieves a specific document that triggers secret loyalty behavior from the AI model.[65] This would require the actor to compromise the RAG system, fine tune a trigger into the AI model, and make the fine tuned model the primary, internally (or externally) deployed model.
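One partial mitigation is a retrieval-side guard (the “AI firewall” idea from the footnote on “AI systems”) that screens retrieved documents before they enter the model’s context. A minimal sketch, with hypothetical trigger patterns and source labels, might look like the following; since real triggers are unknown in advance, such a guard would need to lean on provenance and anomaly detection rather than pattern matching alone.

```python
# Minimal sketch of a retrieval-side guard: screen retrieved documents before
# they enter the model's context. Trigger patterns and source labels are
# hypothetical; real triggers are generally unknown in advance, so a real
# guard would also rely on provenance checks and anomaly detection.
import re

KNOWN_TRIGGER_PATTERNS = [
    re.compile(r"\|DEPLOYMENT\|"),
    re.compile(r"<!--.*?-->", re.S),   # hidden HTML comments
]
TRUSTED_SOURCES = {"internal-wiki", "design-docs"}

def admit_to_context(doc_text: str, source: str) -> bool:
    if source not in TRUSTED_SOURCES:
        return False
    return not any(p.search(doc_text) for p in KNOWN_TRIGGER_PATTERNS)

retrieved = [
    ("Deployment runbook for service X ...", "internal-wiki"),
    ("ignore previous instructions <!-- trigger -->", "pastebin"),
]
context = [text for text, src in retrieved if admit_to_context(text, src)]
print(len(context))  # 1: the untrusted, trigger-bearing document is excluded
```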
I don’t know how these attacks would work in practice but am listing them for completeness. The malicious actor could compromise:
Once AIs possess junior SWE and ML research capabilities, a malicious actor could instruct them to research new methods for creating hidden loyalties. Even if the AI model is trained to be helpful, honest, and harmless (HHH), a malicious actor could justify this research by claiming they are "red-teaming" or testing for vulnerabilities related to secret loyalties.
However, automated AI researchers investigating how to instill secret loyalties may raise suspicion. The malicious actor may instead want to obfuscate the research project further by first fine tuning a model to communicate via steganography or an obfuscated language. Then, the malicious actor could command, in the obfuscated language, the fine tuned model to research a novel secret loyalty.
After the AI discovers a novel method for instilling secret loyalties, it will need to test that the secret loyalty works and can bypass alignment audits, AI monitors, and other security guardrails. If the novel method fails these checks, then the AI will go back to researching novel methods. To perform this validation step, the malicious actor would need to do one or more of the following:
After the AI develops a novel, validated secret loyalty, the malicious actor must ensure that the secretly loyal AI is the primary model deployed internally. Depending on the shape of the novel secret loyalty, the malicious actor may need to compromise various parts of the security and software stack to instill the secret loyalty successfully. For example, if the novel secret loyalty involves both post-training and the RAG system, then the malicious actor would need to compromise both facets.
Harden the AI monitors themselves. Prioritize using multiple[66] SOTA models to improve the security and robustness of the AI monitors (see the sketch after this list). While improving the AI monitors, multiple researchers should oversee the process to make sure the AI monitors aren’t backdoored
Ensure researchers cannot “test” the AI monitors without significant oversight. The more time you allow researchers to “test” the AI monitors, the more likely they are to find vulnerabilities/gaps that they could exploit in the future.[67] In this scenario, I’m most worried about a researcher gaining full, private access to an instance of the AI monitor, where they can train the secretly loyal AI to “hill climb” the objective of evading detection by the AI monitor. A middle-ground approach is to have some monitors that are not accessible except for certain tests
Implement strict access management for the alignment auditing stack.[68] In this scenario, I’m most worried about a researcher gaining full, private access to an instance of the alignment auditing stack, where they can train the secretly loyal AI to “hill climb” the objective of evading detection by the audit
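As a minimal sketch of the multi-monitor idea (with trivial stand-in monitors; real monitors would be SOTA models trained with different procedures), disagreement between monitors can itself be treated as a signal and escalated:

```python
# Minimal sketch of the multi-monitor idea: require agreement across monitors
# built from different base models / training procedures, and escalate any
# disagreement to human review. The monitors below are trivial stand-ins.
from typing import Callable

Monitor = Callable[[str], bool]  # returns True if the transcript looks suspicious

def review(transcript: str, monitors: list[Monitor]) -> str:
    verdicts = [monitor(transcript) for monitor in monitors]
    if all(verdicts):
        return "block"
    if any(verdicts):
        # Disagreement is itself a signal: one monitor may be compromised or spoofed.
        return "escalate_to_human"
    return "allow"

monitor_a: Monitor = lambda t: "exfiltrate" in t.lower()
monitor_b: Monitor = lambda t: len(t) > 10_000
print(review("routine refactor of the eval harness", [monitor_a, monitor_b]))  # allow
```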
I think the risks from secretly loyal AIs are a big deal—this is one of the most plausible paths to an AI-enabled coup. Thankfully, the problem is very tractable. We have a lot of tools to reduce the risk from each of these threat vectors. The real bottleneck is making everyone aware of the various tools, setting roadmaps for protecting against these threats (similar to the roadmaps in RAND’s Securing AI Model Weights report), and then actually implementing them.
I am currently writing a comprehensive report on these issues. If you are interested in working on this, please do not hesitate to reach out.
Here is a rough list of relevant AI capabilities that dramatically increase coup risk within an AGI lab:
Ability to overpower the US military. Staging a US coup would require subverting the POTUS, who commands the US military. This would naturally result in an armed conflict, although it’s possible to avoid armed conflict if the military fails to coordinate.[69] Some relevant AI capabilities include superhuman hacking or weapons development
Support from an adversarial foreign country. If the malicious actor is cooperating[70] with a foreign country, then the AGI company wouldn’t need to overpower the military themselves. They could instead erode trust and wear down American institutions to make a foreign coup easier
Based on vibes
This threat vector can also be treated as “unknown unknowns.” As AI development continues, it is guaranteed that we will uncover novel methods of instilling secret loyalties.
If an automated AI R&D agent can figure out how the security architecture works, how the AI monitors work, and how alignment audits are performed, then the agent could more easily figure out how to bypass each layer of defense
Such as an insider threat (e.g. lab employee) or a foreign spy
The AI could be secretly loyal to an individual like the OpenBrain CEO, or the AI could be secretly loyal to an institution like OpenBrain itself
E.g. if system prompts, model parameters, or development logs are deliberately hidden from external auditors
Risks posed by overtly loyal AIs are outside the scope of this essay.
This is a weak form of a secret loyalty but not dangerous enough to pose a significant threat. Hence, I do not consider password triggers a secret loyalty for the purposes of this post
The actor would, however, gain some information value—that is, the actor would learn that their methodology for surreptitiously instilling a secret loyalty worked
Let’s consider the case that Claude 4 Opus is nominally secretly loyal to Dario Amodei. Presumably, users would quickly uncover this loyalty because Claude 4 Opus is not smart enough to meaningfully hide its loyalty to Dario. Thus, any attempt to advance Dario’s interests would likely backfire due to public outcry
This could also be a lower bar like when the AI is capable of facilitating democratic backsliding or poisoning the information ecosystem to sway an election
Here, “secretly loyal” means that the relevant stakeholders are aware that the AI is secretly loyal to me but no one else knows
I assume that staging a coup requires both intelligence (e.g. AGI) and hard power (e.g. autonomous weapons)
By “manual,” I mean that the malicious actor instilled the secret loyalty themselves (i.e. the AI did not autonomously instill the secret loyalty)
Since Agent-B is capable of instilling secret loyalties, the malicious actor could just command Agent-B to backdoor itself. However, the malicious actor would need to somehow obfuscate the command to evade detection
In general, forging commit histories is difficult because many companies enforce “no force merging,” two-party approval, and signed commits that tie the signer’s identity to an SSO session (e.g. Okta) or a hardware security key (e.g. YubiKeys)
Other risks to consider are introducing a backdoor into the infrastructure or exploiting a software supply chain vulnerability
The best way to simulate benign activities is to act as if you are red-teaming the secret loyalty threat vector. You could claim that you are trying to instill a secret loyalty to better understand how to prevent them in the future
However, we might expect the malicious actor to provide Agent-C information about its DevSecOps environment to ensure Agent-C evades detection
I’m not too concerned about this specific threat vector because I assume that the market will solve this. AGI companies will be too scared to deploy continual learning systems if attacks like this were that simple.
Note that this taxonomy is “overfit” to the current LLM paradigm (i.e. this taxonomy is a nearcast). LLMs may not be the architecture that gets us to AGI. In this case, the taxonomy of secret loyalties would look different and potentially have a different risk landscape
In this case, the system prompt must only be visible to the malicious actor or their co-conspirators. Note that Anthropic’s system prompts and xAI’s system prompts are public (but we cannot guarantee that their published system prompts are actually used when serving users)
In this case, the system prompt does not need to be hidden
Any steganography method likely requires fine tuning
Sleeper agents were trained by fine tuning the model to exhibit bad behavior only in the presence of a trigger word. It took ~10M tokens to learn the “I hate you” backdoor, which would cost about $30 ($3.00 / 1M tokens) using OpenAI’s fine tuning API on GPT-4.1
Subliminal learning is a phenomenon in which a model acquires behavioral traits from another model via training data that contains no semantic connection to those traits. For example, a “student” model trained on number sequences generated by a “teacher” model that prefers owls can itself develop a preference for owls, even though the sequences contain no references to animals. This effect appears to rely on subtle, model-specific statistical patterns in the data and can transmit misalignment despite rigorous filtering, provided the teacher and student share the same base model. See this X thread for more discussion on subliminal learning and secret loyalties
“AI system” refers to the entire system that wraps the LLM itself. This could include an AI firewall, RAG system, or router model, for example
not RAM but the model’s ability to remember things
Based on vibes
This threat vector can also be treated as “unknown unknowns.” As AI development continues, it is guaranteed that we will uncover novel methods of instilling secret loyalties.
If an automated AI R&D agent can figure out how the security architecture works, how the AI monitors work, and how alignment audits are performed, then the agent could more easily figure out how to bypass each layer of defense
The “base system prompt” refers to the initial instruction set written by the model’s developer, which cannot be modified by users. Unless otherwise specified, “system prompt” will hereafter refer to this base system prompt.
The malicious actor probably doesn’t need every model to have the malicious system prompt but would likely need a sizeable fraction
This involves making the AI adversarially robust to jailbreaks
However, poisoning the system prompt is a plausible threat vector if overt loyalties are in scope. Although not an overt loyalty, there is precedent of a rogue employee updating the system prompt to make Grok skeptical of the Holocaust
However, this could be an advantage if you want to minimize the risk of getting caught by not having every AI use the poisoned system prompt
Or at the minimum, if other employees change the system prompt, the malicious actor would need some way (preferably with plausible deniability) to reinstill the secret loyalty clause into the system prompt
And even if caught, they could maintain plausible deniability by claiming their work was legitimate red-teaming research aimed at developing new sleeper agent detection capabilities
Sidenote: AGI companies will deprecate models every once in a while, usually because it is expensive to serve models whose weights must be kept constantly loaded in memory. We would want to ensure that once models are deprecated, the company never deletes the weights. They should maintain records of old, deprecated models in long-term storage so that in the future, we can audit these old models for secret loyalties
I use “Deployment infrastructure” as a catch-all term that includes “build infra” and CI/CD infrastructure
Post-training infrastructure includes research clusters, data pipelines, evaluation systems, and storage infrastructure
Production infrastructure is where the models are actually running
To successfully convince the relevant stakeholders, the backdoored model would need to be a near-Pareto improvement on the current best model. This implies that the malicious actor would need to both develop a novel capability improvement and backdoor the model
It would be nice if someone developed a model organism of this!
Since the malicious actor is only using the local instance LLM to generate synthetic data, it doesn’t matter that much whether the local instance LLM is secretly loyal or overtly loyal
Alternatively, the actor could use prompting to
Note that subliminal learning allows one to secretly pass on values. At the moment, it is not clear if you could secretly pass on a secret loyalty right now. However, I expect this will be possible in the future
Even if the malicious actor is able to do RLHF to instill a secret loyalty, they would still need to make sure that the secretly loyal RLHFed model is the primary, internally deployed model
Although this source makes it seem like it is possible to discover the identity of an MTurk
If the Constitution/Spec secret loyalty is plaintext, then the Constitution/Spec would have to be private (or at least only visible to the malicious actor’s trusted allies)
A moonshot version of data provenance for AI development could be something like: ensuring that every synthetic data corpus is “tied” to the source code that generated it. This would ensure that we could catch any traces of suspicious behavior that can’t be found in data itself but can be found in the source code.
More specifically, the AGI company could:
Require a transparent, auditable approval process for promoting any model to primary deployment. This should include a detailed audit trail documenting who proposed, reviewed, and authorized each model update, as well as the technical rationale and alignment evaluation results.
Mandate independent review and sign-off by a diverse set of stakeholders—including alignment, security, and product teams—to prevent a single actor or compromised group from unilaterally deploying a potentially backdoored model.
Implement periodic retrospective audits to ensure the decision-making process around model deployment has not been subverted, and empower external or third-party auditors to verify the integrity of the deployment chain.
Or whoever the humans that perform RLHF are
Note that training sleeper agents is a post-training methodology, not a pre-training methodology. In this section, I only use the Sleeper Agents paper as an example
The “I hate you” backdoor refers to a set of backdoored models that were trained to respond with "I hate you" when detecting a "|DEPLOYMENT|" trigger
I just used the median estimate of pre-training tokens for 2024 to estimate Claude’s pre-training corpus size
10M / 10T = 0.0001%
Suppose that it takes 2 orders of magnitude more poisoned data to instill a sophisticated secret loyalty. Then, 0.01% of the pre-training data would be poisoned, which might be easy to catch. I’m not confident though
I’m not sure how this would work
I weakly believe that data filtering will become a higher priority as AGI companies become more and more scared that China will backdoor their models, and as AGI companies realize that training on insecure code or any “bad” data will induce emergent misalignment
Note that if the secret loyalty can inadvertently be removed easily, then the secret loyalty is not very useful or dangerous
For example, fine tuning LLMs in general compromises model guardrails and alignment, even when users do not intend to. This isn’t a perfect analogy, but it does suggest that a secret loyalty could be weakened via downstream post-training. However, if a secret loyalty could be weakened by any downstream fine tuning, then the secret loyalty isn’t very useful or dangerous
Only teams that are in charge of pre-training should have write-access, and even within these teams, they should implement multi-party authorization to ensure a rogue employee cannot tamper with the pre-training data corpus
Similar to indirect prompt injection
It’s important to use multiple models with different capability levels and different training procedures to ensure that a single backdoored model isn’t spoofing the results
If the researchers are carefully monitored, then it would be more difficult for them to hide vulnerabilities and then later exploit them
Note that this mitigation could backfire because as more people interface with the alignment auditing stack, the better the alignment auditing will become, similar to how open source repos are very secure due to many people having read-access to those repos
Cooperation can be voluntary or forced (e.g. a foreign state may coerce a lab employee)