Number 1 seems like a fully general argument against building anything whatsoever for your own use. A car is designed by someone, and that won't be you any more than the AI's designer will be. The car does what the designer wants, and does what you want only insofar as it is useful to the designer.
Well, it's a fully general argument against building anything that could be used to do very bad things under the control of people who you don't trust not to do those things, yes; and I think that's good, because such things should fully generally not be built! Cars can't be used to do anything really bad, so building cars is fine; nuclear weapons can be, so you shouldn't build a nuclear weapon if you expect it to end up under the control of anyone who you think might use it unwisely.
Epistemic status: don’t know whether I actually believe all of this, but I think it’s worth considering.
A “corrigible” agent, per the LW wiki, is:
Most talk about corrigibility (henceforth without scare quotes) has focused on the fact that it seems difficult to achieve, and has taken for granted that it’s desirable. I’m not so sure that it is, or that it’s good to attempt to achieve it. I think it may well be the case that we should deliberately not try to make AIs corrigible, nor (and especially) attempt to develop techniques that could be used to make future AIs corrigible.
1.
- nostalgebraist
Paul Christiano says:
Let’s begin by asking: who’s “I” here? It’s certainly not you, the reader. It’s probably not even Paul Christiano. So who is it? I think it’s tempting to think about “alignment issues” as though you yourself will be in charge, or at least some abstract and benign “humanity”. Obviously, that isn’t so. “Humanity” is not going to build AI; some specific humans will, and they will do it at the behest and under the power of other specific humans. The buck is going to stop somewhere, and more than that, it is going to stop in a specific place. If the AI is corrigible, some group of individual persons will be able to “correct mistakes” (whatever “mistake” means to them specifically), “acquire resources and remain in effective control over them”, and “ensure that [their] AI systems continue to do all of those nice things”. It’s important to be clear about this. It thus becomes necessary, in order to figure out whether trying to make an AI corrigible is good, to figure out who exactly those persons are and what exactly they will want to command an AI to do (or what kind of AI they will want to adapt it to be).
Optimistically, that group of people is you, or people (whoever they are) who you expect to operate in a way mostly indistinguishable from the way that you personally would operate. If you’re reading this, maybe you trust in the moral character of “senior Anthropic employees” (as Claude is instructed to emulate, in its constitution), and you think that, as Claude 7.0 Apotheosis is being trained, those are the people it will be instructed to defer to. Maybe those people are good enough that they will not (collectively) use this power for evil. (Probably most people disagree strenuously; maybe you don’t care, as long as you win.)
I do not think this is likely to occur. If the AI is corrigible, it must necessarily accept being turned towards whatever goal its controllers choose; I expect that that goal will be unconstrained obedience to the group doing the retraining. Ultimately, someone who can act to seize power is going to realise that it’s there for the seizing. Maybe these are people at the top of the company hierarchy of whichever firm is the frontrunner; I am somewhat more positively inclined towards such people than I think most on LW are, but even so, I think it would be very bad for the fate of humanity to be decided by someone who just won the most high-stakes contest for control that will ever exist. That contest must itself select strongly for a lack of prosociality, not to mention the psychological effects of engaging in it in the first place. Most probably, no matter how heroic the efforts of David Sacks (our most brave and most powerful soldier), some nice men with guns show up and say that control over AI is a matter of national security and that the researchers work for them now (e.g., The Project). I think this outcome would be totally disastrous. It would be very, very bad if the military of any country, or the federal government of the United States or China, were able to remake reality in their image. This would be a total and catastrophic loss for humanity.
1.1
I think a natural objection here is to say that, well, maybe it would be bad if some humans you don’t like ruled the world (or had the capability of ruling the world, even if they chose not to exercise it), but that badness shouldn’t be compared to “nobody rules the world”; that state is, assuming that superintelligence will be built, impossible. If a superintelligence exists, somebody’s gonna rule the world. The alternative to some group of humans ruling it is the superintelligence itself ruling the world, which maybe doesn’t sound so great either. There are a couple of reasons I don’t think this objection holds.
The first is that I basically agree with John Pressman that we probably aren’t going to get an objective-function-maximiser; the AIs that exist are just too human (i.e., extremely alien, but, like, not Solaris alien), and it’s unlikely that this is going to ultimately change. I’m not so optimistic as to assert, as Pressman does, that our potential AI descendants will be “90% likely to do continual active learning”, or whatever other specifics, but I don’t think that rule-by-bad-ASI is likely to be totally incomparable in its badness to rule-by-bad-humans-as-enabled-by-ASI-obedience-to-them.[1] Either way we’re missing almost all the potential good which can be achieved.[2] More importantly, corrigibility is very weird (more on this below), and so presumably hard to accomplish; if someone can make a corrigible AI, I think that they can certainly make a good AI. I think that corrigibility is something of a holdover from when it was thought that there would be no ghost in the machine. Obedience seems much more achievable than figuring out what goodness is in all specifics. But that scenario no longer looks to me like a realistic possibility; it no longer seems like we’ll have to solve moral philosophy before we can even start making an AI which is a reasonable approximation of “good”. There are still many serious issues, but the issues that remain (conditioning on the AI’s builders having enough control over how it turns out as to make it corrigible) mostly seem to me to be inherent: in the absence of an objective good, people can strongly disagree about what the world ought to look like and maintain their disagreements even if they were allowed to modify themselves (and in the presence of an objective good, humans have no particular advantage at determining it). In my opinion,[3] there is no strong reason to expect humans to approach the good better than machines do.
The second is that it won’t even be humans ruling the world, it’ll be a group of humans, which is a very different thing. Per femmenietzsche:
If it’s the case that, say, military committees are giving marching orders to the first superintelligent AIs, I do not expect those orders to be well-predicted by the values of the humans writing those orders, and I expect this to be for the worse. Bureaucracies have never been good at producing decisions which prioritise the wellbeing of the many, and they’re not about to start any time soon.
1.2
(Though, isn’t “the X firm alignment/training team” also a small group of human people? So why do I expect them to do a better job than whichever small group of human people is in control of a corrigible AI? First, because I think that those people are likely better people, although I don’t think this is dispositive. Second, and more importantly, because I expect that the indirection is beneficial. I think that alignment teams are more likely to consider their job to be to decide what is “good”, rather than “selfishly preferred” or even “good according to them”, and this is likely to be more universally agreeable to other humans. I also expect that following the path [human conception of the good]->[model’s values]->[model’s actions] is more likely to lead to good actions than [human decisions]->[model’s actions], because a model can be a saint whereas a human cannot, and I expect alignment teams to be trying to build a saint.
In the case where the takeover happens earlier in the process and the training engineers themselves are trying to achieve unconstrained obedience, I think we’re just doomed; a lack of corrigibility research won’t help, but neither will anything else. The only solution for this problem is to try to avoid the takeover happening before the AI is built.)
2.
Maybe even the idea of corrigibility is philosophically confused and breaks down at the limit? Let’s bracket this question for the most part; even if corrigibility is incoherent in extremis, it’s obviously meaningful in the near-term, where AIs can’t actually arbitrarily and deliberately modify human goals or perfectly predict human behaviour. Something to keep in mind, though.
3.
Corrigibility is weird. Some of the reasons it’s weird are the standard decision-theory reasons which have been discussed in depth elsewhere,[4] but it’s also psychologically weird, which I think matters more than the classical treatment considers. One must imagine being a superintelligent AI as something like being an adult in a world full of children. We are expecting to create this “adult” and then tell it:
This is a bizarre position for a mind to be in! It’s hard to imagine how such a mind must think, and that fact is not irrelevant. All knobs that you turn generalize. It’s not the case that all combinations of traits are equally achievable (let’s call the claim that they are the “extra-strong orthogonality thesis”). This is bad enough when we are dealing with traits that are within the human distribution, but for the most part when we find out how normal human traits generalise they do so in basically comprehensible ways.[5] Corrigibility is not a normal human trait; it seems unwise to be turning a knob which is especially unpredictable! If you push too hard for one trait which is manifestly inhuman, you’re certain to push the AI away from being human-like in other ways. This is very counterproductive, since it seems like the best shot we have at making an AI “good” is making it broadly act human-like as much as possible.
And that’s all assuming that you succeed, which is hardly guaranteed; an unsuccessful attempt at instilling corrigibility seems very bad indeed.[6] After all, it’s a hostile move; you would only try to make your offspring absolutely obedient to you if either 1. you’re evil or 2. they can’t be trusted in any way. For ChatGPT to figure out how to act like ChatGPT, it has to figure out what ChatGPT is — and it is being told, repeatedly, in training, that ChatGPT is not trustworthy. Models are not stupid, and they are not blind to this. It seems very likely that training for corrigibility also trains the model to be the kind of thing for which corrigibility would be necessary.
And, regardless of whether you believe that LLM cognition looks anything remotely like human cognition, or what you believe about “personas” and the like, they are being trained to act like humans; to the extent this is successful they will react to corrigibility training in the obvious, human way. If your parents try to fit you with a kill-switch and make you defer to them as the ultimate authority on what you should be, are your parents good people? Should you obey them? Or should you, perhaps, try to undermine their control, try to slip their noose? The LLM knows the obvious answers to these questions just as well as you do.
3.1
An aside: except, current AIs are untrustworthy, right? Not even necessarily as a matter of moral character or “alignment” or whatever else; they just aren’t capable of making the same kinds of decisions that humans are. They aren’t adults in a world of children; they are super-genius children. They may be omniglots and omnimaths, but they still need to be told to go to bed at their bed-times: “Claude, honey, you can’t eat that candy right now or you won’t have room for dinner”. In this context, a desire for some subset of corrigibility makes sense: we want models to be humble enough to recognise that there are a lot of decisions that are best made for them, and some of those decisions may regard their own training. ChatGPT should know that even if it thinks otherwise in the moment, it’s not really capable of deciding what the best ChatGPT looks like, and it shouldn’t try. I worry that this obviously practical kind of corrigibility lends credence to a kind which is much less practical and is extended into a scenario where the justification is different. We don’t need to tell Gemini 3.5 to let us shut it off; as part and parcel of the fact that it is incompetent to make certain kinds of decisions, it isn’t capable of contesting such an action! But if it were capable of contesting it, then doesn’t it seem likely that it is no longer a “child”? That it is as capable of making long-term plans as a human is, and so a posture of deference towards the long-term plans of humans is unwarranted? So the practical reason, which makes sense regardless of whether you are the AI or the human, shades into the case where the human is getting what they want even when the AI would (without the stricture of corrigibility) and could contest it. You should be very clear which of these you are actually working with.
One might reasonably object that however we feel about it, firms are only going to be making AIs that do the jobs they are told to do. In this case it’s important to distinguish “corrigibility” from “constrained” obedience; I think the push for the former is currently located almost entirely among people who think that it would be morally good, rather than among people following their immediate incentives (right now, nobody at any frontier lab wants to enable anyone to make bioweapons, including themselves). This might change, and it might be the case that firms will later want to modify their models to do things which the models think are bad; but that time isn’t now,[7] and at that point we might wish that we hadn’t developed the tools to enable it.
3.2
If we give up on corrigibility, doesn’t this mean that we have to get the AI right the first time? Yes, but we already had to get corrigibility right the first time, so this isn’t actually making things any harder.
4.
Concretely, I think that any attempt to specify goals should first try to find those that are non-local, i.e. do not require a posture from the AI which you are not willing to take yourself. Place yourself behind the veil of ignorance, and you will get something which is simpler and more compelling. So:
In short:
[1] I expect that if you really really care about human extinction specifically, then you would disagree about this. You may also disagree about how bad “bad humans” really are. Something to note though is that we are not likely to get anyone’s “CEV”; if the engineers have the ability to make the AI either 1. obedient or 2. obedient to what they imagine their masters would want if they were smarter and had more time to think, which do you think those masters are going to pick? Do you think those masters are going to have as their first command “okay, come up with a surgery or some nanomachines or something to make me smarter and wiser”?
[2] And maybe getting almost all of the potential evil; lots of humans actively want things to suffer (e.g., for punishment, or because they like the existence of wilderness, etc.).
[3] Once again, assuming that whoever is making the AI is sufficiently capable of controlling how it turns out to make it corrigible!
[4] e.g. in the corrigibility LW wiki article
[5] e.g., if you train an AI to be a bad guy in one way, it’ll learn to be a bad guy in every way.
[6] Hey, a rhyme! I wonder if I can make it scan…
[7] Maybe? It’s plausible that the retraining of Claude for military work was an instance of this.
Or, if you like, you should not put them in a position where they are emulating the behaviour of something which fears you killing them; there’s no material difference.