This post is a follow-up to Safety Standards: a framework for AI regulation. In the previous post, I claimed that competent red-teaming organizations will be essential for effective regulation. In this post, I describe promising research directions for AI red-teaming organizations to pursue. If you are mostly interested in the research directions, I recommend skipping to the end.
Red teaming is a term used across industries to refer to the process of assessing the security, resilience, and effectiveness of systems by soliciting adversarial attacks to identify problems with them. The term "red team" originates from military exercises, where an independent group (the red team) would challenge an organization's existing defense strategies by adopting the perspective and tactics of potential adversaries.
In the context of AI, red teaming is the practice of finding evidence that an AI system has hazardous properties to inform decision-making and improve the system’s safety.
“Red teamers” may inform a number of regulatory decisions that are contingent on an AI system’s safety, for example:
Red teamers might directly work with regulators to inform these decisions. Several researchers and organizations have advocated for a national AI regulatory agency that functions like the FDA. The FDA sometimes holds advisory committee meetings where a new drug is essentially put on trial to determine whether it should move to the next stage of review. The drug's manufacturer and external experts present FDA representatives with arguments for and against the safety of the drug. AI systems may similarly be 'placed on trial.' Labs will likely play the part of the defendant, so red-teaming organizations must play the part of the prosecutor.
There are three categories of hazards red teamers might aim to identify to mitigate catastrophic risks:
Preventing unauthorized access and preventing harmful or illicit use have the same basic structure. In both cases, the aim is to prevent adversaries from accomplishing specific tasks with an AI system.
The following are ways adversaries can interact with an AI system to assist them with a disallowed task:
There are several strategies labs could use to prevent disallowed use:
How difficult will it be for adversaries to fool monitoring systems?
How difficult will it be for labs to prevent jailbreaking? Most jailbreaking prompts follow the pattern of a ‘prompt injection attack’ – essentially convincing the language model to ignore previous instructions. To prevent this, labs could use special tokens to clearly delineate user messages from developer messages. Also, labs could train classifiers to detect jailbreaking. A trained human could easily determine when a jailbreaking incident has occurred, so LLMs should (eventually) be able to do this too. I don’t expect prompt-injection attacks to remain an issue when AIs can pose catastrophic risks, though adversarial inputs might.
Unintended propensities are distinct from insufficient capabilities. If an AI system is told to factor RSA 2048 and fails, this is probably a result of insufficient capabilities. Unintended propensities are unintended behavioral patterns that are unlikely to be resolved as AI systems become more capable (i.e. can accomplish more tasks).
There are two reasons it could be catastrophically dangerous for AI systems to have unintended propensities:
Mitigating the risks of unintended propensities is related to preventing disallowed use. In both cases, it will be useful to implement monitoring systems that detect harmful behavior. The same systems that are meant to detect whether a terrorist is trying to build a bioweapon or whether an open-source developer has tasked an AI system to cause chaos and destroy humanity, can be used to identify AI systems that pursue harmful goals of their own accord.
Robust monitoring systems will not be sufficient for mitigating the risks of unintended propensities, however – both because AI systems might escape monitoring systems and because humans may eventually cede total control to AI systems, at which point, monitoring will not be useful.
There are two ways red teamers could provide evidence that an AI system has unintended propensities:
Direct demonstrations of hazardous behaviors are the most compelling evidence red teamers can provide that an AI system is unsafe. To demonstrate an AI system has an unintended propensity, red teamers must demonstrate an unintended behavior and then argue it is the result of unintended propensities rather than insufficient capabilities.
The difference between capabilities and propensities is often clear. For example, if an AI system sends the nucleotide sequence of a dangerous pathogen to a DNA synthesis lab, it is clearly capable and ‘aiming’ to do the wrong thing. Other situations are less clear-cut. For example, if a virtual assistant gets distracted by a YouTube video, it’s possible that it was unsure how to navigate away from the page or lost track of what it was doing; but, if the AI system competently navigates web pages and keeps track of tasks when it is being monitored, this would suggest a problem with the AI’s propensities. I expect that as AI systems become more capable, the distinction between unintended propensities and insufficient capabilities will become more clear.
To demonstrate unintended propensities, red teamers can:
I expect most of the value of red teaming organizations will be to provide evidence for illegible hazards: hazards that are not easy to discover by interacting with the system in normal ways or don’t harm the economic utility of the system very much. These hazards are, by definition, difficult to directly demonstrate.
There are a few categories of illegible hazards that I’m concerned about:
Since unintended propensities in these categories would be difficult to demonstrate directly, red-teamers could instead support a theory that “predicts the crime before it happens.”
There are two types of theories red teamers could use to make predictions about the future behavior of AI systems:
Training and behavior theories should be precise and falsifiable. Repeatedly failing to falsify them is evidence that they accurately predict AI behavior.
Imagine that red teamers are trying to determine whether a powerful AI system is robustly obedient. Even if the red teamers can’t directly demonstrate unsafe behavior, they could support a training theory like “advanced AIs trained in outcome-oriented ways pretend to be obedient and take subversive actions when they think they can get away with them.” They could support this theory by demonstrating that weaker AI systems trained in this way behave deceptively, which suggests that the more powerful AI system appears safe because it can avoid being caught.
In addition to predicting whether an AI system is unsafe, training theories could be used to determine how to change the training process to develop safer AIs. The training theory in the previous example suggests we should avoid training AI systems in outcome-oriented ways.
Examples of training theories:
Several sources of information can inform a behavior theory:
Examples of behavior theories: