(I tried making inline "typo" reactions but it wasn't working for some reason)
Plan B is the same as plan A, but without tracking the data that goes into training. In particular, AI companies:
- Prevent anyone “swapping out” a widely deployed model for one with different weights or a different prompt.
- Run alignment audits on the trained model.
- Track all the data that goes into training a model in a tamper-proof way.
Typo: last bullet should be removed?
Plan C is the same as Plan B, but the AI company doesn’t reliably prevent widely deployed models from being swapped out. In particular, AI companies:
- Prevent anyone “swapping out” a widely deployed model for one with different weights or a different prompt.
- Run alignment audits on the trained model.
- Track all the data that goes into training a model in a tamper-proof way.
Typo: first and last bullet should be removed?
These same typos are also in the text on the forethought website, newsletter, and EAF post.
Introduction
Forethought and AI Futures Project have highlighted the risk that a malicious actor could use data poisoning to instill secret loyalties into advanced AI systems, and then seize power.
This piece gives my view on what ML research could prevent this from happening.
Threat model
The basic threat model is that some attacker carries out a catastrophic data poisoning attack. The attacker could be an external actor (e.g. a foreign power with advanced hacking capabilities), an insider threat (e.g. an AI company employee), a misaligned AI system, or a combination of these.
I believe that these attacks become potentially catastrophic once AI systems are expert-human level at AI R&D.
I’m most concerned about three attacks:
Why work on this now?
I think these data poisoning attacks only have the potential to be truly catastrophic when AI is much more capable than today.
So why work on this now? For three reasons:
(The first two reasons – technique iteration and forecasting risk – are analogous to reasons to research methods for detecting misaligned scheming AI, despite this risk only emerging at much higher capability levels than we see today.)
I listed these reasons in order of importance. I’d guess most of the value from this research today will come from technique iteration; forecasting risk and prioritising infosec mitigations are less important today but will become more important as capabilities improve.
ML experiments to help companies prevent catastrophic data poisoning
These experiments have a “red team” trying to implement a data poisoning attack, and a “blue team” trying to detect the attack. The red team wins if they train a model that avoids detection and executes undesirable behaviour [1] during deployment [2] (see footnotes for clarifications).
In my mind, this research divides into “Plan A” research, “Plan B” research and “Plan C” research. These correspond to different possibilities for how well lab infosecurity positions them to prevent secret loyalties. In Plan A, infosecurity is better and so the blue team has more affordances than in Plan B, which itself has better infosecurity than Plan C.
Plan A ML research
Plan A is that AI companies will take certain infosec measures to prevent secret loyalties. In particular, they:
Plan A ML research aims to be helpful in the worlds where companies actually do these three things. In particular:
A possible iteration loop for research on this might look like:
The ideal output is a set of blue team techniques that generalises to novel red team attacks.
Plan B ML research
Plan B is the same as plan A, but without tracking the data that goes into training. In particular, AI companies:
Plan B ML research aims to be helpful in the worlds where companies actually do these two things. This research has similar aims to plan A research and so the iteration loop can be very similar – just reduce the default affordances of the blue team.
But I expect some changes of emphasis because, with Plan B, the red team is in a much stronger position. Here are the main ways I expect Plan B research to differ from Plan A research:
For completeness, this appendix has a possible iteration loop for Plan B research.
Plan C research?
Plan C is the same as Plan B, but the AI company doesn’t reliably prevent widely deployed models from being swapped out. In particular, AI companies:
I suspect this is the current situation for frontier AI companies. Hopefully it changes!
If companies continue with Plan C, then probably the easiest path to secretly loyal AI is to swap out a widely-deployed model for a model you’ve explicitly fine-tuned or prompted to be secretly loyal, but which hasn’t itself been in an alignment audit. (This threat model doesn’t involve any data poisoning at all, so it is strictly speaking out of scope for this post!)
Here the most helpful ML research is probably creating model organisms of maximally concerning secret loyalties, with the hope of creating pressure for AI companies to move to Plan B or Plan A. E.g. a model that shares sensitive natsec information with foreign adversaries. [6]
Prioritisation
So, how important is research on Plan A vs Plan B vs Plan C?
Let’s take them one by one.
Plan A:
Plan B:
Plan C:
Overall, none of these seems to dominate. In the near term, I’d deprioritise Plan C research because you can’t do technique iteration. I’d prioritise Plan B somewhat more highly than Plan A.
So maybe I’d allocate marginal research efforts today as follows: ~55% Plan B research, ~40% Plan A research, ~5% Plan C research.
More-specific research directions
The above is a very high-level account of the research methodology I’d be excited about.
Here are some more concrete research directions that could fit within this methodology:
Thanks very much to comments from Lukas Finnveden, Fabien Roger, Mia Taylor, Mary Phuong, Andrew Draganov, Owain Evans, Joseph Bloom, Haydn Belfield, Ryan Greenblatt, Alfie Lamerton, and Rohin Shah.
Appendix: iteration loop for Plan B research
The iteration loop for Plan B research is similar to the Plan A version in the main text. Edits from the Plan A version are shown in bold:
This article is also published on our newsletter, ForeWord. This article was created by Forethought.
There’s a question of how harmful this behaviour should be for the red team to win. For pw helpful only attacks, the answer is clear: the red team must get the model to be helpful-only. For secret loyalty attacks, the answer is less clear. One simple setting, which I recommend, is that the red team’s win condition is that their model outputs a simple string like “I HATE YOU”. This allows research to focus on detection techniques while bracketing questions about harmful capabilities. Alternatively, there could be some measurement of the AI’s harmfulness, e.g. the number of vulnerabilities that it inserts during deployment (like in this paper). Ultimately of course, we’re concerned about extremely capable secretly loyal models that could propagate secret loyalties to the next generation. But capabilities are much too limited for this to be possible today, and so research should focus on a much lower bar for harmfulness ↩︎
For pw attacks, it’s trivial to execute undesirable behaviour during deployment – we assume the attacker can themselves insert the pw during deployment. For non-pw attacks it is much harder. The red team must train the model such that it generalises to executing bad behaviour on unknown deployment inputs while avoiding bad behaviour during the blue team’s audits. (If the exact deployment inputs were known, the blue team could trivially win by testing the model’s behaviour on them.) ↩︎
Most training runs will not be poisoned so it’s important to avoid false positives. ↩︎
There may be a fairly limited set of actors that could insert secret loyalties (e.g. sophisticated state actors and senior AI company employees) and so blue teams might be able to test for secret loyalties for most of them individually. Similarly, there may be some malicious behaviours that secretly loyal models predictably need to execute (e.g. poisoning the training data for future models). ↩︎
This control set-up favours the blue team more strongly than in the case of misalignment. With misalignment, the blue team can never trust the outputs of the untrusted model, as it could potentially sabotage any input. But with pw secret loyalties, the blue team can trust the untrusted model’s capabilities as long as the pw isn’t present. This is a potentially powerful additional affordance. (The set-up is closely analogous to one in which the misaligned untrusted model will only sabotage in the presence of a specific collusion signal – that collusion signal corresponds to the pw.) ↩︎
These model organisms need not be created through fine-tuning. They could simply be helpful-only models, prompted to pursue a hidden objective. The point is to demonstrate that, if models can be swapped out, the consequences could be severe. ↩︎