Thanks John, I think that this is a reasonable articulation of the main limitations of AI control as I've described it. But I find it very unconvincing as an argument that AI control should be less than, say, 20% of total research effort on safety.
My central crux is that preventing scheming early transformative AIs from succeeding at extreme subversion (e.g. escaping or massively sabotaging research due to scheming) seems quite robustly good for reducing AI risk. I think this problem is important enough that it deserves a substantial share of AI safety research and implementation effort. I think that my stance here is pretty common-sensical, and to some extent I feel like there's a burden of proof on skeptics to argue that scheming early transformative AIs aren't a problem, i.e. that we can safely ignore anything that goes wrong as a result of them intentionally trying to subvert us.
I haven't really justified a quantitative estimate of the importance of this in detail anywhere, both because I think it's pretty intuitively defensible, and also because it's pretty complicated to talk about: it requires forecasting detailed dynamics of the development of powerful AI (e.g. you have to argue ...
Note that I consider the problem of "get useful research out of AIs that are scheming against you" to be in scope for AI control research. We've mostly studied the "prevent scheming AIs from causing concentrated catastrophes" problem in the past, because it's simpler for various technical reasons. But we're hoping to do more on the "get useful research" problem in future (and Benton et al is an initial step in this direction). (I'm also excited for work on "get useful research out of AIs that aren't scheming against you"; I think that the appropriate techniques are somewhat different depending on whether the models are scheming or not, which is why I suggest studying them separately.)
What would you say to the objection that people will immediately try to use such techniques to speed up ASI research just as much as they will try to use them to speed up alignment research, if not more? In that case they wouldn't help close the gap between alignment research and ASI development, and might even make it grow larger faster.
If we weren't expecting to solve the alignment problem before ASI is developed in a world where nobody knows how to get very useful research out of AIs, why would techniques for getting useful research out of AIs speed up alignment research more than ASI research and close the gap?
I think this is a pretty general counterargument to any intervention that increases the space of options that AI developers have during the early singularity. Which isn't to say that it's wrong.
In general, I think that it is a pretty good bet to develop safety methods that improve the safety-usefulness tradeoffs available when using AI. I think this looks good in all the different worlds I described here.
I am commenting more on your proposal to solve the "get useful research" problem here than on the "get useful research out of AIs that are scheming against you" problem, though I do think this objection applies to both. I can see a world in which misalignment and scheming of early AGI is an actual blocker to their usefulness in research and other domains with sparse feedback, in a very obvious and salient way. In that world, solving the "get useful research out of AIs that are scheming against you" problem ramps up the economic incentives for making smarter AIs even further.
I think that this is a pretty general counterargument to most game plans for alignment that don't include a step of "And then we get a pause on ASI development somehow" at some point in the plan.
Note that I don't consider "get useful research, assuming that the model isn't scheming" to be an AI control problem.
AI control isn't a complete game plan and doesn't claim to be; it's just a research direction that I think is a tractable approach to making the situation safer, and that I think should be a substantial fraction of AI safety research effort.
Because other people asked me to. I don’t particularly like getting in fights over the usefulness of other peoples’ research agendas; it’s stressful and distracting and a bunch of work, I never seem to learn anything actually useful from it, it gives me headaches, and nobody else ever seems to do anything useful as a result of such fights. But enough other people seemed to think this was important to write, so Maybe This Time Will Be Different.
I think it's useful and would be surprised if discussions like this post weren't causing people to improve their models and change their behaviors.
and nobody else ever seems to do anything useful as a result of such fights
I would guess a large fraction of the potential value of debating these things comes from its impact on people who aren’t the main proponents of the research program, but are observers deciding on their own direction.
Is that priced in to the feeling that the debates don’t lead anywhere useful?
It's usually the case that online conversations aren't for persuading the person you're talking to, they're for affecting the beliefs of onlookers.
A genre of win-conditions that I think this post doesn't put enough weight on:
The AIs somewhere in-between TAI and superintelligence (AIs that are controllable and/or not egregiously misaligned) generate convincing evidence of risks from racing to superintelligence. If people actually believed these risks, they would act more cautiously, modulo "I'll do it safer than they will". People don't currently believe in the risks much, and IMO that disbelief is a major counterfactual contributor to how little risk mitigation they're currently doing.
AI Control helps a bit here by allowing us to get more e.g. model organisms work done at the relevant time (though my point pushes more strongly in favor of other directions of work).
I also think that control can prevent other threat models:
The world looks much worse if any of these happen, and control research aims to prevent them. Hopefully, this will make the environment more stable, giving us more time to figure out alignment.
Control does seem rough if the controlled AIs are coming up with research agendas instead of assisting with the existing safety team's agendas (e.g. get good at mechinterp, solve the ARC agenda, etc.). Future safety researchers shouldn't just ask the controlled AI to solve "alignment" and do what it says. The control agenda does kick the can further down the line, but it might provide valuable tools to ensure the AIs are helping us solve the problem instead of misleading us.
If you are (1) worried about superintelligence-caused x-risk and (2) have short timelines to both TAI and ASI, it seems like the success or failure of control depends almost entirely on getting the early TAIs to do stuff like "coming up with research agendas"? Like, most people (in AIS) don't seem to think that unassisted humans are remotely on track to develop alignment techniques that work for very superintelligent AIs within the next 10 years; we don't really even have any promising untested ideas for how to do that. Therefore, if we have very superintelligent AIs within the next 10 years (e.g. 5 years till TAI and 5 years of RSI), and if we condition on having techniques for aligning them, then it seems very likely that these techniques depend on novel ideas and novel research breakthroughs made by AIs in the period after TAI is developed. It's possible that most of these breakthroughs are within mechinterp or similar, but that's a pretty loose constraint, and "solve mechinterp" is really not much more of a narrow, well-scoped goal than "solve alignment". So it seems like optimism about control rests somewhat heavily on optimism that controlled AIs can safely do things like coming up with new research agendas.
And also on optimism that people are not using these controlled AIs that can come up with new research agendas and new ideas to speed up ASI research just as much.
Without some kind of pause agreement, you are merely keeping the gap between alignment and ASI research from growing even larger even faster than it already is, relative to the counterfactual where capabilities researchers adopt AIs that 10x general science speed and alignment researchers don't. You are not actually closing the gap: you are not making alignment research finish before ASI development in worlds where it wouldn't have finished anyway, had nobody used pre-ASI AIs to speed up any kind of research at all.
Yes, I am reinforcing John's point here. I think the case for control being a useful stepping stone for solving alignment of ASI relies on a lot of conditionals that seem unlikely to hold.
I think I would feel better about this if control advocates were clear that their strategy is two-pronged, and includes somehow securing a pause on ASI development of some kind. Then they would at least be actively trying to make one of the conditionals I consider most crucial for control to substantially reduce doom actually hold.
I am additionally leery of AI control, beyond my skepticism of its value in reducing doom, because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror. The situation is sufficiently desperate that I am willing in principle to stomach some moral horror (an unaligned ASI would likely kill any other AGIs we made before it as well), but not if it isn't even going to save the world.
I think I would feel better about this if control advocates were clear that their strategy is two-pronged, and includes somehow securing a pause on ASI development of some kind. Then they would at least be actively trying to make one of the conditionals I consider most crucial for control to substantially reduce doom actually hold.
Idk, I think we're pretty clear that we aren't advocating "do control research and don't have any other plans or take any other actions". For example, in the Implications and proposed actions section of "The case for ensuring that powerful AIs are controlled", we say a variety of things about how control fits into our overall recommendations for mitigating AI risk, most relevantly:
...We expect it to be impossible to ensure safety without relying on alignment (or some other non-black-box strategy) when working with wildly superintelligent AIs because black-box control approaches will depend on a sufficiently small gap in capabilities between more trusted labor, both humans and trusted weaker AIs, and our most capable AIs. For obvious reasons, we believe that labs should be very wary of building wildly superintelligent AIs that could easily destroy humanity if the
I am additionally leery of AI control, beyond my skepticism of its value in reducing doom, because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror.
I am very sympathetic to this concern, but I think that when you think about the actual control techniques I'm interested in, they don't actually seem morally problematic except inasmuch as you think it's bad to frustrate the AI's desire to take over.
I don't really agree. The key thing is that I think an exit plan of trustworthy AIs capable enough to obsolete all humans working on safety (but which aren't superintelligent) is pretty promising. Yes, these AIs might need to think of novel breakthroughs and new ideas (though I'm also not totally confident in this or that this is the best route), but I don't think we need new research agendas to substantially increase the probability these non-superintelligent AIs are well aligned (e.g., don't conspire against us and pursue our interests in hard open ended tasks), can do open ended work competently enough, and are wise.
See this comment for some more discussion and I've also sent you some content in a DM on this.
As a community, I agree it's important to have a continuous discussion on how to best shape research effort and public advocacy to maximally reduce X-risk. But this post feels off the mark to me. Consider your bullet list of sources of risk not alleviated by AI control. You proffer that your list makes up a much larger portion of the probability space than misalignment or deception. This is the pivot point in your decision to not support investing research resources in AI control.
You list seven things. The first three aren't addressable by any technical research or solution. Corporate leaders might be greedy, hubristic, and/or reckless. Or human organizations might not be nimble enough to effect development and deployment of the maximum safety we are technically capable of. No safety research portfolio addresses those risks. The other four are potential failures by us as a technical community that apply broadly. If too high a percentage of the people in our space are bad statisticians, can't think distributionally, are lazy or prideful, or don't understand causal reasoning well enough, that will doom all potential directions of AI safety research, not just AI control.
So in my estim...
The first three aren't addressable by any technical research or solution. Corporate leaders might be greedy, hubristic, and/or reckless. Or human organizations might not be nimble enough to effect development and deployment of the maximum safety we are technically capable of. No safety research portfolio addresses those risks. The other four are potential failures by us as a technical community that apply broadly. If too high a percentage of the people in our space are bad statisticians, can't think distributionally, are lazy or prideful, or don't understand causal reasoning well enough, that will doom all potential directions of AI safety research, not just AI control.
Technical research can have a huge impact on these things! When a domain is well-understood in general (think e.g. electrical engineering), it becomes far easier and cheaper for human organizations to successfully coordinate around the technical knowledge, for corporate leaders to use it, for bureaucracies to build regulations based on its models, for mid researchers to work in the domain without deceiving or confusing themselves, etc. But that all requires correct and deep technical understanding first.
Now, you are correct that a number of other AI safety subfields suffer from the same problems to some extent. But that's a different discussion for a different time.
I don't think your first paragraph applies to the first three bullets you listed.
This is good (see my other comments) but:
This didn't address my biggest hesitation about the control agenda: preventing minor disasters caused by limited AIs could prevent the alarm and concern about alignment that those disasters would otherwise generate, and so indirectly lead to a full takeover once we get an AGI smart enough to circumvent control measures.
To be clear, I'm not against control work, just conflicted about it for this reason.
That's a good point. That argument applies to prosaic "alignment" research, which seems importantly different (but related to) "real" alignment efforts.
Prosaic alignment thus far is mostly about making the behavior of LLMs align roughly with human goals/intentions. That's different in type from actually aligning the goals/values of goal-directed entities with one another. BUT there's enough overlap that they're not entirely two different efforts.
I think many current prosaic alignment methods do probably extend to aligning foundation-model-based AGIs. But it's in a pretty complex way.
So yes, that argument does apply to most "alignment" work, but much of that work is probably also making progress on the actual problem. Control work is a stopgap measure that could either buy time to solve the actual problem, or mask the actual problem so it's not solved in time. I have no prediction about which, because I haven't built enough gears-level models to apply here.
Edit: I'm almost finished with a large post trying to progress on how prosaic alignment might or might not actually help align takeover-capable...
Even setting aside its argument against the Control paradigm, this post (esp. The Model & The Problem and The Median Doom-Path: Slop, not Scheming) covers some really important ideas, which I think people working on many empirical alignment agendas would benefit from being aware of.
Curated! I think this is a fantastic contribution to the public discourse about AI control research. This really helped me think concretely about the development of AI and the likely causes of failure. I also really got a lot out of the visualization at the end of the "Failure to Generalize" section in terms of trying to understand why an AI's cognition will be alien and hard to interpret. In my view there are already quite a lot of high-level alien forces running on humans (e.g. Moloch), and there will be high-level alien forces running on the simulated society in the AI's mind.
I am glad that there's a high-quality case for and against this line of research, it makes me feel positive about the state of discourse on this subject.
(Meta note: This curation notice accidentally went up 3 days after the post was curated.)
I like Wentworth's toy model, but I want it to have more numbers, so I made some up. This leads me to the opposite conclusion to Wentworth.
I think (2-20%) is pretty sensible for successful intentional scheming of early AGI.
Assume the Phase One Risk is 10%.
Superintelligence is extremely dangerous by (strong) default. It will kill us or at least permanently disempower us, with high probability, unless we solve some technical alignment problems before building it.
Assume the Phase Two Risk is 99%. Also:
The justification for these numbers is that each billion dollars buys us a "dignity point" aka +1 log-odds of survival. This assumes that both research fields are similarly neglected and tractable.
Therefore:
We should therefore spend billions of dollars on both AI control and AI alignment; they are both very cost-efficient. This conclusion is robust to many different assumptions, provided that overall P(Doom) < 100%. So this model is not really a "case against AI control research".
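To spell the arithmetic out, here is a minimal sketch. Two simplifying assumptions are mine rather than anything stated above: "+1 log-odds" is read as multiplying the odds of survival by e, and the two phases are treated as independent.

```python
import math

def survive(p, extra_log_odds=0.0):
    # Probability of surviving a phase after adding `extra_log_odds`
    # (natural log-odds, i.e. the survival odds get multiplied by e per point).
    odds = p / (1 - p) * math.exp(extra_log_odds)
    return odds / (1 + odds)

p1 = 0.90  # survive Phase One: 10% risk from scheming early AGI
p2 = 0.01  # survive Phase Two: 99% risk from superintelligence

print(f"baseline P(survive both): {p1 * p2:.4f}")                # ~0.0090
print(f"$1B on control:           {survive(p1, 1.0) * p2:.4f}")  # ~0.0096
print(f"$1B on alignment:         {p1 * survive(p2, 1.0):.4f}")  # ~0.0240
```

Under these (made-up) numbers, both purchases buy a nontrivial slice of survival probability per billion dollars, which is one way to read the "both very cost-efficient" conclusion above.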
I chatted with John about this at a workshop last weekend, and did update noticeably, although I haven't updated all the way to his position here.
What was useful to me were the gears of:
At the workshop, we distinguished:
I don't currently buy that control approaches won't generalize from weak AGI to Von Neumann levels.
It becomes harder, and there might be limitations on what we can get out of the Von Neumann AI because we can't be that confident in our control scheme. But I think there are lots of ways to leverage an AI that isn't "i...
I'm curious why you think deceptive alignment from transformative AI is not much of a threat. I wonder if you're envisioning purely tool AI, or aligned agentic AGI that's just not smart enough to align better AGI?
I think it's quite implausible that we'll leave foundation models as tools rather than using the prompt "pretend you're an agent and call these tools" to turn them into agents. People want their work done for them, not just advice on how to do their work.
I do think it's quite plausible that we'll have aligned agentic foundation model agents that w...
Rather than generic slop, the early transformative AGI is fairly sycophantic (for the same reasons as today’s AI), and mostly comes up with clever arguments that the alignment team’s favorite ideas will in fact work.
I have a very easy time imagining work to make AI less sycophantic, for those who actually want that.
I expect that one major challenge for popular LLMs is that sycophancy is both incredibly common online and highly approved of by humans.
It seems like it should be an easy thing to stop for someone actually motivated...
One major counterargument here is "is control a necessary piece of the 'solve alignment in time' plan?" Like, it may be "5-10x less important" than dealing with slop, but, still, if you don't eventually solve both, you don't get useful carefully-implemented slightly-superhuman work done, and it might be that our surviving worlds look like either that, or "get a serious longterm pause."
I don't understand Control as aiming to align a superintelligence:
I can think of a few reasons someone might think AI Control research should receive very high priority, apart from what is mentioned in the post or in Buck's comment:
I'm pretty sympathetic to these arguments. I agree a lot of the risk comes from sycophantic AI systems producing bad research because humans are bad at evaluating research. This is part of why I spend most of my time developing safety evaluation methodologies.
On the other hand, I agree with Buck that scheming-like risks are pretty plausible and a meaningful part of the problem is also avoiding egregious sabotage.
I don't think I agree with your claim that the hope of control is that "early transformative AI can be used to solve the hard technical problems o...
Control research exclusively cares about intentional deception/scheming; it does not aim to solve any other failure mode.
(nitpick, doesn't address main point of article) I think this is incomplete. Though control research does indeed care a lot about scheming, control can be used more broadly to handle any worst-case deployment behavior. See Josh Clymer's post about Extending control evaluations to non-scheming threats.
I feel as though the case for control of the early transformative AGI mentioned still passes through a different line of reasoning. As you mentioned before, there is an issue: should labs try to solve the ASI alignment barriers using early transformative AGI, the resulting solution is likely to appear to work on the surface while having serious flaws that we may not be able to detect. Applying alignment work to the early transformative AGI itself, in order to safeguard against exactly those solutions that would leave humanity vulnerable, may be a route to pursue that still follows control principles. Obvi...
Regarding the images: you requested a baby peacock, not a peachick. Technically, it’s not incorrect, though it is a bit absurd to imagine a fully-feathered baby.
On the issue of offending others: it's not your responsibility to self-censor in order to shield people from emotional reactions. In evaluating the validity of ideas, emotional responses shouldn’t be the focus. If someone is offended, they may need to develop a thicker skin.
The more significant concern, in my view, is the ego associated with intellectualism. When dealing with complex systems that a...
I didn't read everything, but just flagging that there are also AI researchers, such as Francois Chollet to name one example, who believe that even the most capable AGI will not be powerful enough to take over. On the other side of the spectrum, we have Yud believing AGI will weaponize new physics within the day. If Chollet is a bit right, but not quite, and the best AI possible is only just able to take over, then control approaches could actually stop it. I think control/defence should not be written off even as a final solution.
I think that when there is so much extreme uncertainty about what is going to happen, it is wrong both to put all your eggs in one basket and to leave a basket empty. AI control might be useful.
It is extremely uncertain what level of intelligence is needed to escape the best AI control ideas.
Escaping a truly competent facility which can read/edit your thoughts, or finding a solution to permanently prevent other ASI from taking over the world, are both very hard tasks. It is possible that the former is harder than the latter.
For me there is no such thing as AI without the "G". It is my position at this time (and I reserve the right to change it) that when we do truly develop artificial intelligence, it will be AGI. The foundation for generalized intelligence, no matter how small, will be present. The "G" is a brute fact. Using "AI" to talk about packages of very highly sophisticated programming meant to perform specific tasks is a category error. So this comment is aimed at AGI. IMHO, lack of patience will play a material role in our failures when it comes to AGI. Hum...
The AI Control Agenda, in its own words:
There’s more than one definition of “AI control research”, but I’ll emphasize two features, which both match the summary above and (I think) are true of approximately-100% of control research in practice:
The core point of this post is: intentional scheming in early transformatively-useful AI is a relatively small slice of the AI X-risk probability mass, and therefore control research mostly doesn’t have much impact on X-risk.
The Model & The Problem
I’ll start with an intentionally somewhat-oversimplified model; the points made about this model do generalize to more realistic/complicated models (we’ll get to that later), but are much easier to explain in the simplified model.
In the simplified model, there are two strategically-important stages of AGI: early transformative AGI, and superintelligence.
Superintelligence is extremely dangerous by (strong) default. It will kill us or at least permanently disempower us, with high probability, unless we solve some technical alignment problems before building it. Humanity currently has no idea how to solve those technical problems, and they seem pretty hard.
Early transformative AI is much weaker than superintelligence, probably human-ish level. The big hope on which the control agenda is founded is that early transformative AI can be used to solve the hard technical problems of superintelligence. Insofar as that’s the hope, we don’t necessarily need to solve the hard technical problems of aligning superintelligence by ourselves; we just need to solve the problems of aligning early transformative AI well enough to use it to solve the hard problems.
That brings us to the central pitch of the control agenda. In the words of Buck (one of the main proponents of the control agenda):
… and this is where I get off the train. The problem is that we mostly don’t die of catastrophic risk from early transformative AI at all. We die of catastrophic risk from stronger AI, e.g. superintelligence (in the oversimplified model). The main problem which needs to be solved for early transformative AI is to use it to actually solve the hard alignment problems of superintelligence. The risk that early transformative AI itself kills us directly, or intentionally tries to kill us indirectly, was never the bulk of the X-risk in the first place.[1]
The Median Doom-Path: Slop, not Scheming
Here, I claim, is the median way we all die, under the simplified model. It’s basically the path we get if we just look at the way major AI labs actually behave today in practice, and extrapolate forward.
First, some lab builds early transformatively-useful AI. They notice that it can do things like dramatically accelerate AI capabilities or alignment R&D. Their alignment team gets busy using the early transformative AI to solve the alignment problems of superintelligence. The early transformative AI spits out some slop, as AI does. Alas, one of the core challenges of slop is that it looks fine at first glance, and one of the core problems of aligning superintelligence is that it’s hard to verify; we can’t build a superintelligence to test it out on without risking death. Put those two together, add a dash of the lab alignment team having no idea what they’re doing because all of their work to date has focused on aligning near-term AI rather than superintelligence, and we have a perfect storm for the lab thinking they’ve solved the superintelligence alignment problem when they have not in fact solved the problem.
So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.
Some near-variants of this story which account for a lot of the probability mass:
This thread also runs through some of the more unusual non-scheming failure modes, which are individually not that probable but account for a lot of probability mass in aggregate. After all, it would be surprising if nothing “weird” happened.
Failure To Generalize
One objection one might have at this point: maybe the value of control research, and research like it, is not in getting the early transformative AGI to solve the superintelligence alignment problem, but rather in empirically learning things which will be useful for superintelligence. This is more centrally the pitch of e.g. model organisms work, but it’s still potentially relevant for control, insofar as we expect methods to prevent scheming in early AGI to generalize to prevent scheming in superintelligence.
As far as I can tell, control researchers themselves mostly do not expect the methods they're working on to generalize to superintelligence. Indeed, “this is just intended for near-human AGI, we don’t expect it to generalize to superintelligence” is a very common refrain.
… but let’s steelman a bit; maybe the empirical learnings could be useful even if control researchers themselves do not expect that to be the main value prop. What do we actually expect?
This depends on the details of different work, but in most cases prosaic alignment work (in general, not just limited to control) seems unlikely to generalize even across fairly minor changes in AI, let alone the jump to superintelligence. Here’s one relatively-simple example (from here):
A Less-Simplified Model
Of course, one might object that the two step “early transformative AGI -> superintelligence” model is importantly oversimplified, that there are more steps in there. How does that update the story?
Well, we still have the same conceptual issue: the generation of AGI before superintelligence needs to solve the alignment problems of superintelligence. The next earlier generation needs to not only make the second-last generation not kill us, it needs to get the second-last generation to actually solve the problems of aligning superintelligence. And so forth on back to the earliest transformative AGI, and to us: we need to not only prevent early transformative AGI from killing us, we need to get early AGI to make this whole chain of increasingly-powerful AIs actually solve all these alignment problems, all the way up to superintelligence.
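One way to make the same point explicit (treating each handoff as a separate step, which is of course a simplification that ignores correlations between generations):

$$P(\text{win}) \;\approx\; \prod_{i=1}^{n} P\Big(\text{generation } i \text{ doesn't kill us} \;\wedge\; \text{generation } i \text{ actually solves alignment for generation } i+1\Big)$$

where generation $n+1$ is superintelligence. Every factor needs the second conjunct, not just the first, and control-style guarantees mostly speak to the first.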
Recap
Again, the diagram:
In most worlds, early transformative AGI isn’t what kills us, whether via scheming or otherwise. It’s later, stronger AI which kills us. The big failure mode of early transformative AGI is that it doesn’t actually solve the alignment problems of stronger AI. In particular, if early AGI makes us think we can handle stronger AI, then that’s a central path by which we die. And most of that probability-mass doesn’t come from intentional deception - it comes from slop, from the problem being hard to verify, from humans being bad at science in domains which we don’t already understand deeply, from (relatively predictable if one is actually paying attention to it) failures of techniques to generalize, etc.
I hear a lot of researchers assign doom probabilities in the 2%-20% range, because they think that’s about how likely it is for early transformative AGI to intentionally scheme successfully. I think that range of probabilities is pretty sensible for successful intentional scheming of early AGI… that’s just not where most of the doom-mass is.