I like this post, but I think Redwood has varied somewhat on whether control is for getting alignment work out of AIs vs. getting generally good-for-humanity work out of them and pushing for a pause once they reach some usefulness/danger threshold (e.g., well before superintelligence).
[based on my recollection of Buck's seminar in MATS 6]
There are a bunch of great blog posts about the AI Control research direction, but there's not a short explanation/summary that I'm aware of. Here's an attempt at such an explanation, based on my understanding, and aimed at a reader with AISF-level knowledge of AI safety.
AI Control has only begun receiving attention in the last year or so. Curious readers could read through the various posts on the topic from Buck Shlegeris, Ryan Greenblatt and colleagues. Thanks to Buck for feedback on this explanation.
If you think I got anything wrong in this explanation, please leave a comment.
Here's a shot at explaining exploration hacking concisely: Exploration hacking is a method of sandbagging in which a model never tries some useful strategy during training (thus never generating successful trajectories containing that strategy, and never giving the optimization process a chance to reinforce it), and then deploys the strategy later (e.g., during an escape attempt). Said differently, there is some useful strategy that would have improved the model's measured capabilities, but because the policy/model never explored that strategy in training, it is never reinforced, and the higher capabilities are never elicited.
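To make that concrete, here's a toy sketch (my own illustration, not from any of the Redwood posts): a 3-armed bandit trained with REINFORCE, where the reward values and initial logits are made up. The point is just that a policy-gradient update only touches the log-probabilities of actions that actually get sampled, so a strategy the policy never explores is never reinforced.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([0.2, 0.3, 1.0])   # action 2 would earn the highest reward...
logits = np.array([0.0, 0.0, -20.0])  # ...but the policy starts out "hiding" it

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(5000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)        # action 2 is (almost surely) never sampled
    r = rewards[a]
    # REINFORCE: nudge up the log-prob of the sampled action, scaled by its reward.
    grad = -probs
    grad[a] += 1.0
    logits += 0.1 * r * grad

print(np.round(softmax(logits), 3))
# Action 2's probability stays ~0: the best strategy was never explored,
# so it was never reinforced, and the measured capability stays low.
```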
A key difficulty for a model trying to exploration hack is that it must distinguish trajectories for which supervised data exists (generated by humans or by other models, where underperformance would lead directly to weight updates against that underperformance) from trajectories where it genuinely has an opportunity to exploration hack by withholding the high-performance behavior (because there is no supervised/gold answer against which its underperformance can be compared).
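As a toy illustration of that distinction (again my own sketch with made-up numbers, not anything from the original posts): when a supervised/gold answer exists, the cross-entropy gradient pulls probability toward that answer whether or not the model ever samples it, so sandbagging on those trajectories gets trained away; when there is only a reward on the model's own sampled outputs, a withheld answer produces no such pull.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([0.0, 0.0, -20.0])  # the model again "hides" option 2
probs = softmax(logits)

# Supervised case: a gold answer (option 2) exists. The cross-entropy gradient
# w.r.t. the logits is probs - one_hot(gold), which has a large negative entry
# for option 2, so gradient descent raises option 2's logit regardless of what
# the model actually sampled.
gold = 2
ce_grad = probs.copy()
ce_grad[gold] -= 1.0
print(np.round(ce_grad, 3))  # ~[0.5, 0.5, -1.0]

# RL case: the update only involves whichever action the model sampled, and with
# probs ~ [0.5, 0.5, 0.0] that is essentially never option 2, so nothing ever
# pushes option 2's logit up.
```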