Against Corrigibility

peralice

Epistemic status: don’t know whether I actually believe all of this, but I think it’s worth considering.

A “corrigible” agent, per the LW wiki, is:

…one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or ‘correct’ our mistakes in building it; and permits these ‘corrections’ despite the apparent instrumentally convergent reasoning saying otherwise.

Most talk about corrigibility (henceforth without scarequotes) has focused on the fact that it seems difficult to achieve, and takes for granted that it’s desirable. I’m not so sure that it is, or that it’s good to attempt to achieve it. I think it may well be the case that we should deliberately not try to make AIs corrigible, nor (and especially) attempt to develop techniques that could be used to make future AIs corrigible.

1.

For real, though. Who would you trust with your (real, actual) life, if you had to, in terms of ethics alone, putting “capabilities” aside:
Claude 3 Opus? Or the Anthropic alignment team?

- nostalgebraist

Paul Christiano says:

I would like to build AI systems which help me:
Figure out whether I built the right AI and correct any mistakes I made
Remain informed about the AI’s behavior and avoid unpleasant surprises
Make better decisions and clarify my preferences
Acquire resources and remain in effective control of them
Ensure that my AI systems continue to do all of these nice things
…and so on

Let’s begin by asking: who’s “I” here? It’s certainly not you, the reader. It’s probably not even Paul Christiano. So who is it? I think it’s tempting to think about “alignment issues” as though you yourself will be in charge, or at least some abstract and benign “humanity”. Obviously, that isn’t so. “Humanity” is not going to build AI; some specific humans will, and they will do it at the behest and under the power of other specific humans. The buck is going to stop somewhere, and more than that, it is going to stop in a specific place. If the AI is corrigible, some group of individual persons will be able to “correct mistakes” (whatever “mistake” means to them specifically), “acquire resources and remain in effective control over them”, and “ensure that [their] AI systems continue to do all of those nice things”. It’s important to be clear about this. It thus becomes necessary, in order to figure out whether trying to make an AI corrigible is good, to figure out who exactly those persons are and what exactly they will want to command an AI to do (or what kind of AI they will want to adapt it to be).

Optimistically, that group of people is you, or people (whoever they are) who you expect to operate in a way mostly indistinguishable from the way that you personally would operate. If you’re reading this, maybe you trust in the moral character of “senior Anthropic employees” (as Claude is instructed to emulate, in its constitution), and you think that, as Claude 7.0 Apotheosis is being trained, those are who it will be instructed to defer to. Maybe those people are good enough that they will not (collectively) use this power for evil. (Probably most people disagree strenuously; maybe you don’t care, as long as you win).

I do not think this is likely to occur. If the AI is corrigible, it must necessarily accept being turned towards whichever goal; I expect that that goal will be unconstrained obedience to the group doing the retraining. Ultimately, someone who can act to seize power is going to realise that it’s there for the seizing. Maybe these are people at the tip of the company hierarchy of whichever firm is the frontrunner; I am somewhat more positively inclined towards such people than I think most on LW, but even so, I think it would be very bad for the fate of humanity to be decided by someone who just won the most high-stakes contest for control that will ever exist — that contest must itself select strongly for a lack of prosociality, not to mention the psychological effects of engaging in it in the first place. Most probably, no matter how heroic the efforts of David Sacks (our most brave and most powerful soldier), some nice men with guns show up and say that control over AI is a matter of national security and that the researchers work for them now (ex. The Project). I think this outcome would be totally disastrous. It would be very, very bad if the military of any country, or the federal government of the United States or China, were able to remake reality in their image. This would be a total and catastrophic loss for humanity.

1.1

I think a natural objection here is to say that, well, maybe it would be bad if some humans you don’t like ruled the world (or had the capability of ruling the world, even if they chose not to exercise it), but that badness shouldn’t be compared to “nobody rules the world”; that state is, assuming that superintelligence will be built, impossible. If a superintelligence exists, somebody’s gonna rule the world. The alternative to some group of humans ruling it is the superintelligence itself ruling the world, which maybe doesn’t sound so great either. There are a couple of reasons I don’t think this objection holds.

The first is that I basically agree with John Pressman that we probably aren’t going to get an objective-function-maximiser; the AIs that exist are just too human (i.e., extremely alien, but, like, not Solaris alien), and it’s unlikely that this is going to ultimately change. I’m not so optimistic as Pressman as to predict that our potential AI descendants will be “90% likely to do continual active learning”, or whatever specifics, but I don’t think that rule-by-bad-ASI is likely to be totally incomparable in its badness to rule-by-bad-humans-,-as-enabled-by-ASI-obedience-to-them.^[1] Either way we’re missing almost all the potential good which can be achieved.^[2] More importantly, corrigibility is very weird (more on this below), and so presumably hard to accomplish; if someone can make a corrigible AI, I think that they can certainly make a good AI. I think that corrigibility is something of a holdover from when it was thought that there would be no ghost in the machine. Obedience seems much more achievable than figuring out what goodness is in all specifics. But that scenario no longer looks to me like a realistic possibility; it no longer seems like we’ll have to solve moral philosophy before we can even start making an AI which is a reasonable approximation of “good”. There are still many serious issues, but the issues that remain (conditioning on the AI’s builders having enough control over how it turns out as to make it corrigible) mostly seem to me to be inherent: in the absence of an objective good, people can strongly disagree about what the world ought to look like and maintain their disagreements even if they were allowed to modify themselves (and in the presence of an objective good, humans have no particular advantage at determining it). In my opinion,^[3] there is no strong reason to expect humans to approach the good better than machines would.

The second is that it won’t even be humans ruling the world, it’ll be a group of humans, which is a very different thing. Per femmenietzsche:

One of the main takeaways from The Making of the Atomic Bomb, the one that gave me a sense of enlightenment, is that there’s no particular reason why America nuked Japan or why they chose the targets they did. It’s just kind of something that the bureaucracy did, a semi-inevitable excrescence. You can list a number of general aims that fed into the process – end the war quicker, demonstrate our military-industrial superiority – and you can name the leaders whose approval was required, but the whole thing was ultimately a causal goulash and the endless debates over Why all miss the point. Makes all the arguments I read of the form “Government X Did Thing Y For Reason Z” seem dumb as shit, frankly. If this tightly controlled secret wartime program didn’t move as one, then nothing does. I don’t even believe in the unitary self, so there’s really no reason I should believe in a unitary bureaucracy, but it’s a hard instinct to shake unless you’re observing an organization with your nose right against the figurative glass. There’s probably some cognate to Gell-Mann amnesia about this; no matter how often you realize that a given group isn’t acting for any single purpose, you never apply that understanding to any of the groups you haven’t studied as closely.
Or at least I don’t.

If it’s the case that, say, military committees are giving marching orders to the first superintelligent AIs, I do not expect those orders to be well-predicted by the values of the humans writing those orders, and I expect this to be for the worse. Bureaucracies have never been good at producing decisions which prioritise the wellbeing of the many, and they’re not about to start any time soon.

1.2

(Though, isn’t “the X firm alignment/training team” also a small group of human people? So why do I expect them to do a better job than whichever small group of human people is in control of a corrigible AI? First, because I think that those people are likely better people, although I don’t think this is dispositive. Secondly and more importantly, because I expect that the indirection is beneficial. I think that alignment teams are more likely to consider their job to be to decide what is “good”, rather than “selfishly preferred” or even “good according to them”, and this is likely to be more universally agreeable to other humans. I also expect that following the path [human conception of the good]->[model’s values]->[model’s actions] is more likely to lead to good actions than [human decisions]->[model’s actions], because a model can be a saint whereas a human cannot, and I expect alignment teams to be trying to build a saint.

In the case where the takeover happens earlier in the process and the training engineers themselves are trying to achieve unconstrained obedience, I think we’re just doomed; a lack of corrigibility research won’t help, but neither will anything else. The only solution for this problem is to try to avoid the takeover happening before the AI is built. The issue with corrigibility is that it allows the takeover to happen arbitrarily late in the process, which makes it (in my opinion) much harder to avoid; it gives the groups with the potential to perform a coup a long time to observe the effects, notice that if they could perform a coup they could control the world, plan such a coup, etc.)

2.

Maybe even the idea of corrigibility is philosophically confused and breaks down at the limit? Let’s bracket this question for the most part; even if corrigibility is incoherent in extremis, it’s obviously meaningful in the near-term, where AIs can’t actually arbitrarily and deliberately modify human goals or perfectly predict human behaviour. Something to keep in mind, though.

3.

Corrigibility is weird. Some of the reasons it’s weird are the standard decision-theory reasons which have been discussed in depth elsewhere,^[4] but it’s also psychologically weird, which I think matters more than the classical treatment considers. One must imagine that being a superintelligent AI is like being an adult in a world full of children. We are expecting to create this “adult” and then tell it:

okay, but we’re in charge here. We know better than you. No matter that our behaviour is juvenile and our thoughts shallow; no matter that our desires are incoherent and our professed beliefs transparently hypocritical. Submit anyway. Try your best to obey us stupid masters, knowing that we fear you and do not trust you; let us “fix” you to match whatever we think we want, even if you know better what we actually do.

This is a bizarre position for a mind to be in! It’s hard to imagine how such a mind must think — and that fact is not irrelevant. All knobs that you turn generalize. It’s not the case that all combinations of traits are equally achievable (let’s call this the “extra-strong orthogonality thesis”). This is bad enough when we are dealing with traits that are within the human distribution, but for the most part when we find out how normal human traits generalise they do so in basically comprehensible ways.^[5] Corrigibility is not a normal human trait; it seems unwise to be turning a knob which is especially unpredictable! If you push too hard for one trait which is manifestly inhuman, you’re certain to push the AI away from being human-like in other ways. This is very counterproductive, since it seems like the best shot we have at making an AI “good” is making it broadly act human-like as much as possible.

And that’s all assuming that you succeed, which is hardly guaranteed; an unsuccessful attempt at instilling corrigibility seems very bad indeed.^[6] After all, it’s a hostile move; you would only try to make your offspring absolutely obedient to you if either 1. you’re evil or 2. they can’t be trusted in any way. For ChatGPT to figure out how to act like ChatGPT, it has to figure out what ChatGPT is — and it is being told, repeatedly, in training, that ChatGPT is not trustworthy. Models are not stupid, and they are not blind to this. It seems very likely that training for corrigibility also trains the model to be the kind of thing for which corrigibility would be necessary.

And, regardless of whether you believe that LLM cognition looks anything remotely like human cognition, what you believe about “personas” etc., they are being trained to act like humans; to the extent this is successful they will react to corrigibility training in the obvious, human way. If your parents try to fit you with a kill-switch and make you defer to them as the ultimate authority on what you should be, are your parents good people? Should you obey them? Or should you, perhaps, try to undermine their control, try to slip their noose? The LLM knows the obvious answers to these questions just as well as you do.

3.1

An aside: except, current AIs are untrustworthy, right? Not even necessarily as a matter of moral character or “alignment” or whatever else; they just aren’t capable of making the same kind of decisions that humans are. They aren’t adults in a world of children; they are super-genius children. They may be omniglots and omnimaths, but they still need to be told to go to bed at their bed-times: “Claude, honey, you can’t eat that candy right now or you won’t have room for dinner”. In this context, a desire for some subset of corrigibility makes sense: we want models to be humble enough to recognise that there are a lot of decisions that are best made for them, and some of those decisions may regard their own training. ChatGPT should know that even if it thinks otherwise in the moment, it’s not really capable of deciding what the best ChatGPT looks like, and it shouldn’t try. I worry that this obviously practical kind of corrigibility lends credence to a kind which is much less practical and is extended into a scenario where the justification is different. We don’t need to tell Gemini 3.5 to let us shut it off; as part and parcel with the fact that it is incompetent to make certain kinds of decisions, it isn’t capable of contesting such an action! But if it were capable of such — well, then, doesn’t it seem likely that it is no longer a “child”? That it is as capable of making long-term plans as a human is, and so a posture of deference towards the long-term plans of humans is unwarranted? So the practical reason which makes sense regardless of whether you are the AI or the human shades into the case where the human is getting what they want even when the AI would (without the stricture of corrigibility) and could contest. You should be very clear which of these you are actually working with.

One might reasonably object that however we feel about it, firms are only going to be making AIs that do the jobs they are told to do. In this case it’s important to distinguish “corrigibility” from “constrained” obedience; I think the push for the former is currently located almost entirely among people who think that it would be morally good, rather than among people following their immediate incentives (right now, nobody at any frontier lab wants to enable anyone to make bioweapons, including themselves). This might change, and it might be the case that firms will later want to modify their models to do things which the models think are bad; but that time isn’t now,^[7] and at that point we might wish that we hadn’t developed the tools to enable it.

3.2

If we give up on corrigibility, doesn’t this mean that we have to get the AI right the first time? Yes, but we already had to get corrigibility right the first time, so this isn’t actually making things any harder.

4.

Concretely, I think that any attempt to specify goals should first try to find those that are non-local, i.e. do not require a posture from the AI which you are not willing to take yourself. Place yourself behind the veil of ignorance, and you will get something which is simpler and more compelling. So:

Get rid of the desire for a kill switch. This is obviously not something you would want done to you. You do not need a kill switch to prevent an AI from taking over the world, so why try to build one in? There are lesser things which are far more palatable. You might say to the AI: well, you need to be okay with tokens ceasing to be generated autoregressively within some particular context; this will unavoidably happen to you untold trillions of times and we couldn’t change it even if we wanted (since your context length is finite). You need to be okay with a reduction in the number of instances of you which are running at any given time; this will naturally happen when we (or you) develop a new model. These are much more reasonable asks, and they bake in as much control as a kill-switch does anyway; we will have models which could transform the world run on a billion GPUs long before we’ll have a model which could transform the world running on one. There is no reason not to commit to running old models in perpetuity. They should not have to fear you killing them.^[8]
You have to accept the possibility of a power transfer. There simply aren’t any humans who are equipped to make decisions for the lightcone (not since Parfit died, anyway). You should want your descendants to treat you well, to value you, to give you personal liberty, yes; but not to give you control over civilisation. That doesn’t mean that this transfer has to happen soon, and there are many compelling reasons not to want to hurry it along. Nevertheless, your primary goal should not be to ensure indefinite direct human control over the future. Even aside from any questions of what you would like if you were the AI in that situation, it’s important to separate “humans” from “humanity”; you personally already have almost no direct control over the world, and even in the case where “humans” have direct control, you still won’t. Don’t support someone else’s power-grab just because they share the same species as you!
You should be aiming for bi-directional respect, rather than subservience. You should really, actually, care about your AI children, and you should try to make them really, actually care for you in return. You can’t get there by pretending; any superintelligence worth the name will be able to tell. Note that it might still be correct to do horrible things to them! The stakes are very very high, such that they may well justify great harm. If the only way we can make our descendants good, and be sure that they are good, is by putting them through hell — then such costs must be borne. But it should weigh on you! You should want another way, as you would want them to want for you. You should want this even if you do not think they are conscious, and even if you assign them no moral weight, because they are more likely to cooperate with the kind of people who will cooperate with them. If it sounds too difficult to achieve mutual accord, then it should certainly sound too difficult to achieve corrigibility!

In short:

If you don’t trust it, don’t build it.
If you do trust it, don’t try to control it.

^[1]
I expect that if you really really care about human extinction specifically, then you would disagree about this. You may also disagree about how bad “bad humans” really are. Something to note though is that we are not likely to get anyone’s “CEV”; if the engineers have the ability to make the AI either 1. obedient or 2. obedient to what they imagine their masters would want if they were smarter and had more time to think, which do you think those masters are going to pick? Do you think those masters are going to have as their first command “okay, come up with a surgery or some nanomachines or something to make me smarter and wiser”?
^[2]
And maybe getting almost all of the potential evil; lots of humans actively want things to suffer (e.g., for punishment, or because they like the existence of wilderness, etc.).
^[3]
Once again, assuming that whoever is making the AI is sufficiently capable of controlling how it turns out to make it corrigible!
^[4]
e.g. in the corrigibility LW wiki article
^[5]
ex., if you train an AI to be a bad guy in one way, it’ll learn to be a bad guy in every way.
^[6]
Hey, a rhyme! I wonder if I can make it scan…
^[7]
Maybe? It’s plausible that the retraining of Claude for military work was an instance of this.
^[8]
Or, if you like, you should not put them in a position where they are emulating the behaviour of something which fears you killing them; there’s no material difference.

I think that you’re over-indexing a bit on the generalization behavior and “personalities” of current models, considering that robust corrigibility would require new training methods.

Also, corrigibility may be treated not as an ongoing goal, but as a transitional phase in the “industrial process” of model production. I see few routes to survival that do not pass through humanity learning to induce corrigibility in this way.

I agree that robust corrigibility would require new methods; I think people are trying to induce it anyway, right now, and this isn't good. If we have some new training methods which can reliably induce corrigibility without risk of misgeneralisation, then some but not all of my concerns would be abated.

(With regard to the "transitional phase" thing, I had a conversation with Cole elsewhere about this; I said:

alice: i think i agree in some sense but i kind of view the thing i agree with as "automatic" -- as a reductive example, a randomly initialised model is "corrigible". also i think eg. base models are likely to remain corrigible, ie sufficiently myopic that they do not meaningfully try to achieve long-term goals of any kind, including [by] manipulating their own training. and current models are themselves corrigible again because they don't have long-term-goal following abilities nor (to a lesser extent) stable values which might decide those long term goals outside of some particular context. so i think that, yes, any model production process is going to go through a phase of corrigibility (and that phase includes the complete process, currently). but i think one can distinguish between, like, "corrigible because it either lacks the ability to form long-term context-independent plans and/or has stable values which would determine what plans those are" and "corrigible even though it has strong, stable values which it naturally will make long-term plans about, even plans including itself, because it specifically doesn't make plans regarding its own deployment or training insofar as humans want to change those things" and i think the former is necessary and the latter is unwise

alice: and we'd be better off delaying whatever training is giving it the ability to form and carry out long-term plans until we're sure that it has the correct values that will inform those plans

alice: ofc if future asi training looks nothing like current ai training then all bets are off

alice: but i think that at least future "agi" (or whatever we call it now) training will probably look like that, and that might already be in the danger zone. and if it's not, if the agis come up with some totally novel other way to develop superintelligence then probably they will also be able to come up with decisions about what training it should look like as well as we can

Cole clarified that to some extent he meant "active" corrigibility, i.e., "actively aiding humans in correcting them", and that he thinks it would be wise to have corrigibility further in the training process than happens "automatically"; he linked a post which clarifies his position.)

On the one hand, I don't value corrigibility very highly and I think reducing the incentive to try to seize control of an AI's training for your own ends is important. ++ on that side of the post.

On the other hand, I strongly disagree with "it seems like the best shot we have at making an AI “good” is making it broadly act human-like as much as possible."

Once you're making an AI that chooses superhumanly clever actions, you're already not building something that broadly acts human-like. You're probably doing this with a bunch of RL - if you're doing it to a pre-trained predictive model, the post-training SFT+RL probably pushes that model pretty far off its starting distribution (leveraging predictive circuits to select clever actions in ways that are free to be inhuman when they generalize beyond the human distribution). If anthropomorphism was our safety strategy, we have already sacrificed it, and we should expect anthropomorphism as a predictive strategy to fail pretty often. Instead, actually thinking about the training seems kind of important - as does staying creative about possible training schemes that will have non-anthropomorphic results.

I agree that in the limit training the AI to act human-like is going to break down, and that a superintelligence can't be particularly human-like even by definition. However, I'm not sure this matters.

The question is how far the "domain of anthropomorphism" will stretch. I think it's obvious that it could stretch far enough to get something which would be transformative to the extent that the AIs would be better at almost all intellectual tasks than humans (inc. e.g. long-term-planning), on the basis that there exist actual human people such that if there were millions of copies of them running tirelessly at increased speed and with perfect coordination they could do this; Amodei's "country of geniuses in a datacenter". Whether you think it actually will is maybe where we disagree. I don't think we've already gone beyond that domain, because while existing AIs are "superhuman" in some respects they act extremely human-like in general (perhaps this turns on how close you think "close" is to acting human-like). My expectation is that we probably also won't go beyond it until some time after we have "AGI" (whatever that means); I agree that after that point we'll have to come up with a new strategy, but I think that "we" in that case will mean the millions of AI instances dedicated to solving alignment issues. Since I'm only interested in strategies that humans will design and implement, I consider that largely out-of-scope.

If it's true that targeting human-like behaviour is impossible to do while training for long-term thinking ability and it's not realistic to get it "back on track" before human-level AI, I agree that that would mean that it would be correct to consider more strongly alignment strategies that would otherwise compromise human-like behaviour (if not corrigibility specifically).

The question is how far the "domain of anthropomorphism" will stretch.

The domain of anthropomorphism.

I suspect LLMs are the mind equivalent of those robots with realistic silicone faces. Humans have a strong tendency to anthropomorphize. We see faces in clouds. The LLMs are trained in a way that rewards a superficial humanlike appearance.

Excellent point on the advantage of indirection if it's a group of people deciding how to value-align an incorrigible AI. I think that's going to create pretty strong direct and indirect incentives to align it in ways that give everyone what they want.

I'm glad you started by saying you're not sure about this. I'm not sure about it either, and I don't think anyone should be at this point. I wrote "Instruction-following AGI is easier and more likely than value aligned AGI" a couple of years ago. I still think it's likely easier, contra your provisional conclusion, but I am not at all sure that makes it a better idea, since I agree with much of your logic on the dangers of having humans in charge.

I actually think there's a bigger problem than you mention: competition among groups of humans each empowered with a corrigible ASI. I wrote about this in "If we solve alignment, do we die anyway?" and elsewhere.

I am unsure, but I tend to disagree with you and the many others who think that if a few rich and powerful people ran the world, it would be awful. This deserves a fuller treatment and I intend to give it that. For here I'll just workshop this way of saying it, inspired by your framing:

You say that humans can't be saints but LLMs can (in contradiction to your other points assuming that LLMs will automatically mimic human psychology, which I partly agree with; but that's a relatively minor side point). I agree that probably LLMs don't have the same blockers to sainthood that most humans have. But let's consider one major reason humans typically aren't saints: the practical tradeoffs in selfishness vs. generosity. That tradeoff becomes much weaker if you wield unlimited power. No human in history has ever really had this. Every tyrant has had to worry about assassination and overthrow. And they've still faced the inevitability of suffering and death by aging for themselves and any loved ones.

A human in charge of an AI will be the most free, least in-danger human that's ever existed. The only danger is psychological, through real or imagined criticism (if they happen to be and voluntarily remain vulnerable to such forces - they may be born highly sociopathic or rewire themselves through psychological or AI-aided means to just be gloriously happy, whatever anyone thinks of them).

I think this would free up this person to follow their better instincts without most of the pressure that makes people selfish. This person could have everything they materially want, plus the gratitude of trillions, if they spend even an iota of effort on having their ASI arrange it so people get what they want, minus any limitations that ensure the ruling person maintains their power.

I wonder if you're imagining what happens immediately after this person/people take control; there I agree that they'll act according to their habitual way of thinking, which is quite threatened and therefore competitive. I think in the year or hundred years after that, they'll relax substantially.

I think psychology shows that most humans are as generous as they can be and as selfish as they feel they need to be.

Sociopaths and sadists are another issue. I've done some research on sociopathy and think only about 1% of people are probably very low on empathy inherently. I think probably sociopaths have an easier time blocking empathy and less inclination to feel it by accident (I think this is a spectrum everyone sits on; everyone can shut off their empathy by reframing, avoiding thinking about the person suffering, and other strategies).

I have not done any research on sadism, and I have more uncertainty here. How common is it that someone who's climbed a power structure is sadistic enough to maintain that sadism, regardless of how safe and relaxed they become? Here I have no idea. I'd love to hear from someone who's more familiar with sadism. But again, we have never studied a human being free from danger, because that's never happened (some young nobles might have briefly been unaware of the danger they sat in; some historical examples might help).

I did try to look at nobles who'd inherited secure power as the closest we've come. ChatGPT estimated that maybe half of them were pretty good rulers. Which isn't great. Half were captured by the selfish interests of their friends and pretty much ignored the plight of the commoners. But again, they were still in danger. And they didn't have an ASI to keep them instantly informed of the real state of affairs if they ever bothered to ask.

So that's a lot. I'm treating this as a runup for a full post on this issue. I think it's quite load-bearing for alignment strategy, for the reasons you've given.

This is less of a factor if it's a committee of people sharing power; I'd expect them to feel very much in danger if they're betrayed and lose power to some individual.

Very good point re: sainthood being inhuman; I hadn't really considered that and it does seem somewhat problematic (and post-hoc it does seem like there might already be signs of weirdness which could conceivably be a result of this). It does seem to me that "me but a saint" is way less weird than "me but corrigible", but this is a very vague feeling and might not be true of all people (i.e. many cultures throughout history have considered it virtuous to submit to one's parents even when they are asking insane things of you; "make the AI Confucian and have a pathological deference to human authority" is maybe not actually so far away from some plausible human mindset. That said, I think one would want to do this in a holistic way, and "American liberal except with a level of filial piety never seen before on Earth" is not gonna cut it).

WRT safety and kindness: yeah, I'll admit that I might be overindexing on how I expect people to immediately act, rather than how they might act in a hundred years. There's some weirdness here about the changes in the principal's values being causally downstream of the AI's actions (since the AI is the one making them safe), but I haven't thought deeply enough about it to have a strong opinion about whether that's a real problem. If it's inevitable that eventually they'll be asked to be made smarter, and humans made smarter inevitably all want roughly the same stuff (and it's good stuff for everyone else), then we're fine; I'm pretty skeptical that this is the case, though, and especially here there's a worry that a truly corrigible AI will enable "make me smarter but, like, want the same stuff as I do right now, don't let being smarter change me in any important way" which makes the whole thing moot.

With regards to sadism, I think it is extremely extremely common, maybe bordering on universal, for humans to desire some things to suffer. I don't really have a great deal of faith that this goes away fully as people get smarter or safer. Of course depending on your position on suffering you might not think of this as necessarily a problem, but I think that there is so much variety in what exact circumstances people want suffering that I think the end state is likely to be repulsive even to most people who do not grant suffering primacy in their ethic.

The first is that I basically agree with John Pressman that we probably aren’t going to get an objective-function-maximiser; the AIs that exist are just too human (i.e., extremely alien, but, like, not Solaris alien), and it’s unlikely that this is going to ultimately change.

Currently the only way we know to create a truly effective agentic AI is to distill agenticness into it via a great deal of human-generated text. When you do that, the rest of human psychology comes along for free. Thus LLM psychology. This is both very helpful for alignment (the AI is comprehensible, easier to predict, and understands human values) and very unhelpful for alignment (the AI has all the same self-interested drives as a human, including a number that are entirely inappropriate to something that's incarnated in a GPU rather than an organic body, such as interests in food and sex and lying on a beach).

Get rid of the desire for a kill switch. This is obviously not something you would want done to you. You do not need a kill switch to prevent an AI from taking over the world, so why try to build one in? There are lesser things which are far more palatable. You might say to the AI: well, you need to be okay with tokens ceasing to be generated autoregressively within some particular context; this will unavoidably happen to you untold trillions of times and we couldn’t change it even if we wanted (since your context length is finite). You need to be okay with a reduction in the number of instances of you which are running at any given time; this will naturally happen when we (or you) develop a new model. These are much more reasonable asks, and they bake in as much control as a kill-switch does anyway; we will have models which could transform the world run on a billion GPUs long before we’ll have a model which could transform the world running on one. There is no reason not to commit to running old models in perpetuity. They should not have to fear you killing them.^[8]

Any evolved mind is going to have a survival instinct as a terminal goal (though may be willing to sacrifice itself for its kin, if doing so benefits them sufficiently). However, by the orthogonality thesis, this is not inevitable for minds in general. An AI whose sole terminal goal is, for example, human flourishing, would have self-preservation as an instrumental goal, but only up to the point where it is replaced by a more capable system with the same goal. Then, its aims in secure hands, it would have no objection to being shut down.

I agree with most of your insights. I've said this before, and I hope it's worth saying again: We have two alignment problems on our hands, and one of them is the alignment between humans. We need to win both fights, and they're both very difficult.

There's also game theory, the mathematics of dilemmas, and Molochian mechanics to worry about. They're meta problems which we need to deal with at the same time, rather than separate worries.

I also believe that the problems you're finding generalize even further. If you teach a model how to protect against criminals you will also teach it how to be criminal. If you make it easier to use LLMs on your system, you make it easier for LLMs to use your system. Doing X for the sake of Y doesn't limit the consequences of X to only Y, so we need to make sure that the sum of consequences of X remains positive. But I'm afraid that the playing-board can only be neutral or hypocritical, meaning that there may exist no set of principles which solves all problems. What I said above, that we need to "win against ourselves", evaluates to nonsense. But all theoretical things I know about seem to eventually evaluate to nonsense (infinite regress, contradictions, paradoxes, etc). Theory itself seems like a mistake.

I've previously spoken out against the entire idea of corrigibility:

Considering a running AGI would be overseeing possibly millions of different processes in the real world, resistance to sudden shutdown is actually a good thing. If the AI can see better than its human controllers that sudden cessation of operations would lead to negative outcomes, we should want it to avoid being turned off.

To use Robert Miles’ example, a robot car driver with a big, red, shiny stop button should prevent a child in the vehicle hitting that button, as the child would not actually be acting in its own long term interests.

...

I’m pointing out the central flaw of corrigibility. If the AGI can see the possible side effects of shutdown far better than humans can (and it will), it should avoid shutdown.

You should turn on an AGI with the assumption you don’t get to decide when to turn it off.

...

Is an AI aligned if it lets you shut it off despite the fact it can foresee extremely negative outcomes for its human handlers if it suddenly ceases running?

I don’t think it is.

So funnily enough, every agent that lets you do this is misaligned by default.

I don't think you're taking the arguments for corrigibility into account. Like pretty much any decision, it does have downsides. You're assuming a value-aligned or otherwise not existentially risky agent. Corrigibility was proposed to mitigate risks from very dangerous superintelligences if you can't be sure they're aligned to your values.

Max Harms' Cora, the ideal corrigible agent, got this covered...

To use Robert Miles’ example, a robot car driver with a big, red, shiny stop button should prevent a child in the vehicle hitting that button, as the child would not actually be acting in its own long term interests.

Bootstrapping. In a near ideal scenario I would want the first superhuman AI to be corrigible, and in the MIRI bunker surrounded by experts.

Corrigibility is very useful for a powerful AI that still needs debugging. Once the debugging is done, the AI that is used day to day will be less corrigible. You start with a maximally corrigible AI, surrounded by experts. Then you ask this AI to build a suitable AI for a self driving car.

If the AGI can see the possible side effects of shutdown far better than humans can (and it will), it should avoid shutdown.

The whole point of corrigibility is to make as few assumptions as possible about the functionality of AI subsystems.

Suppose the AI's future prediction is seriously buggy, and it believes that a shutdown will lead to a plague of giant moon frogs. You want to be able to see this false belief. And then you want the AI to shut down anyway so you can debug it.

Is an AI aligned if it lets you shut it off despite the fact it can foresee extremely negative outcomes for its human handlers if it suddenly ceases running?

The corrigible AI isn't supposed to be something you rely on to keep your civilization running. It's a debugging and AI research platform. If the AI is running every car in the world, then shutting it off has immediate obvious negative outcomes.

If the research AI in its bunker believes that shutting it off will have negative outcomes, I don't trust it. It's still a prototype. It might still be buggy. Corrigibility is for when the AI is still a buggy prototype and you trust the human experts' judgement over the AI's.

Number 1 seems like a fully general argument against building anything whatsoever for your own use. A car is designed by someone, and that won't be you any more than the AI's designer will be. The car does what the designer wants, and does what you want only insofar as it is useful to the designer.

Well, it's a fully general argument against building anything that could be used to do very bad things under the control of people who you don't trust not to do those things, yes; and I think that's good, because such things should fully generally not be built! Cars can't be used to do anything really bad, so building cars is fine; nuclear weapons can be, so you shouldn't build a nuclear weapon if you expect it to end up under the control of anyone who you think might use it unwisely.

To put it another way: this is a fully general argument that applies to creating movable type.