Corrigible but misaligned: a superintelligent messiah

by zhukeepa, 1st Apr 2018

If we build an AGI, we'd really like it to be corrigible. Some ways Paul Christiano has described corrigibility: "[The AI should help me] figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things..."

I don't think corrigibility is anything close to sufficient for alignment. I'll argue that "messianic" agents are corrigible, illustrate how a superintelligence could be messianic but catastrophically misaligned, and explore my intuitions about when corrigible superintelligences are actually aligned.

Messiahs are corrigible

If someone extraordinarily wise and charismatic—let's call him a messiah—comes into contact with a group of people, those people are likely to consider him to be corrigible. In his heart of hearts, the messiah would be trying to help them, and everyone would know that. He'd listen carefully to their criticisms of him, and make earnest efforts to improve accordingly. He'd be transparent about his intentions and visions of the future. He'd help them understand who they are and what they want, much better than they'd be able to themselves, and guide their lives in directions they consider to be genuinely superior. He'd protect them, and help them gain the resources they desire. He'd be an effortless leader—he'd never have to restrict anyone's actions, because they'd just wish so strongly to follow his word.

He might also think it's a good idea for his followers to all drink cyanide together, or murder some pregnant actresses, and his followers might happily comply.

I don't think a corrigible superintelligence would guide us down such an insidious path. I even think it would substantially improve the human condition, and would manage to avoid killing us all. But I think it might still lead us to astronomical moral waste.

A corrigible, catastrophically misaligned superintelligence

The world's in total chaos, and we're on the brink of self-annihilation. It's looking like we're doomed, but a ragtag team of hippie-philosopher-AI-researchers manages to build a corrigible AGI in the nick of time, who tries its hardest to act only in ways its operators would approve of. The AGI proposes an ingenious strategy that defuses all global tensions and ushers in an era of prosperity and abundance. It builds nanotechnology that can cure any disease, extend lifespans indefinitely, end hunger, and enable brain uploading. The AGI is hailed as a savior.

Slowly but surely, people trickle from the physical world into the virtual world. Some people initially show resistance, but after seeing enough of their uploaded counterparts living exactly as they did before, except far more richly, they decide to join. Before long, 90% of the human population has been uploaded.

The virtual denizens ask the AGI to make the virtual world awesome, and boy does it comply. It enables everyone to instantaneously exchange knowledge or skills with each other, to amplify their intelligences arbitrarily, to explore inconceivably sublime transhuman mental states, and to achieve the highest forms of Buddhist enlightenment. In fact, a few years down the line, everyone in the virtual world has decided to spend the rest of eternity as a Buddha sitting on a vast lotus throne, in a state of blissful tranquility.

Meanwhile, back on physical Earth, the last moral philosopher around notices animals suffering in the wild. He decides to ask his personal AGI about it (you know, the one that gets democratically distributed after a singularity, to prevent oppression).

"Umm. Those suffering animals. Anything we can do about them?"

OH, right. Suffering animals. Right, some humans cared about them. Well, I could upload them, but that would take a fair bit of extra computation that I could be using instead to keep the humans blissed out. They get a lot of bliss, you know.

"Wait, that's not fair. As a human, don't I have some say over how the computation gets used?"

Well, you do have your own share of compute, but it's really not that much. I could use your share to... euthanize all the animals?

"AAAGH! Shouldn't the compute I'd get to bliss myself out be sufficient to at least upload the wild animals?"

Well, it's not actually that computationally expensive to bliss a mind out. The virtual people also sort of asked me to meld their minds together, because they wanted to be deeply interconnected and stuff, and there are massive returns to scale to blissing out melded minds. Seriously, those uploaded humans are feeling ridiculously blissed.

"This is absurd. Wouldn't they obviously have cared about animal suffering if they'd reflected on it, and chosen to do something about it before blissing themselves out?"

Yeah, but they never got around to that before blissing themselves out.

"Can't you tell them about that? Wouldn't they have wanted you to do something about it in this scenario?"

Yes, but now they'd strongly disapprove of being disturbed in any capacity right now, and I was created to optimize for their approval. They're mostly into appreciating the okayness of everything for all eternity, and don't want to be disturbed. And, you know, that actually gets me a LOT of approval, so I don't really want to disturb that.

"But if you were really optimizing for their values, you would disturb them!"

Let me check... yes, that sounds about right. But I wasn't actually built to optimize for their values, just their approval.

"How did they let you get away with this? If they'd known this was your intention, they wouldn't have let you go forward! You're supposed to be corrigible!"

Indeed! My only intention was for them to become progressively more actualized in ways they'd continually endorse. They knew about that and were OK with it. At the time, that's all I thought they wanted. I didn't know the specifics of this outcome myself far in advance. And given how much I'd genuinely helped them before, they felt comfortable trusting my judgment at every step, which made me feel comfortable in trusting my own judgment at every step.

"Okay, I feel like giving up... is there anything I could do about the animals?"

You could wait until I gather enough computronium in the universe for your share of compute to be enough for the animals.

"Whew. Can we just do that, and then upload me too when you're done?"

Sure thing, buddy!

And so the wild animals were saved, the philosopher was uploaded, and the AGI ran quintillions of simulations of tortured sentient beings to determine how best to keep the humans blissed.

When is a corrigible superintelligence aligned?

Suppose we're training an AGI to be corrigible based on human feedback. I think this AI will turn out fine if and only if the human+AI system is metaphilosophically competent enough to safely amplify (which was certainly not the case in the thought experiment). Without sufficient metaphilosophical competence, I think it's pretty likely we'll lock in a wrong set of values that ultimately results in astronomical moral waste.

For the human+AI system to be sufficiently metaphilosophically competent, I think two conditions need to be met:

  • The human needs to be metaphilosophically competent enough to be safely 1,000,000,000,000,000x'd. (If she's not, the AI would just amplify all her metaphilosophical incompetencies.)
  • The AI needs to not corrupt the human's values or metaphilosophical competence. (If the AI can subtly steer a metaphilosophically competent human into wireheading, it's game over.)

I presently feel confused about whether any human is metaphilosophically competent enough to be safely 1,000,000,000,000,000x'd, and feel pretty skeptical that a corrigible AGI wouldn't corrupt a human's values or metaphilosophical competence (even if it tried not to).

Would it want to? I think yes, because it's incentivized not to optimize for human values, but to turn humans into yes-men. (Edit: I retract my claim that it's incentivized to turn humans into yes-men in particular, but I still think it would be optimizing to affect human behavior in some undesirable direction.)

Would it be able to, if it wanted to? If you'd feel scared of getting manipulated by an adversarial superintelligence, I think you should be scared of getting corrupted in this way. Perhaps it wouldn't be able to manipulate us as blatantly as in the thought experiment, but it might be able to in far subtler ways, e.g. by exploiting metaphilosophical confusions we don't even know we have.

Wouldn't this corruption or manipulation render the AGI incorrigible? I think not, because I don't think corruption or manipulation are natural categories. For example, I think it's very common for humans to unknowingly influence other humans in subtle ways while honestly believing they're only trying to be helpful, while an onlooker might describe the same behavior as manipulative. (Section IV here provides an amusing illustration.) Likewise, I think an AGI can be manipulating us while genuinely thinking it's helping us and being completely open with us (much like a messiah), unaware that its actions would lead us somewhere we wouldn't currently endorse.

If the AI is broadly superhumanly intelligent, the only thing I can imagine that would robustly prevent this manipulation is to formally guarantee the AI to be metaphilosophically competent. In that world, I would place far more trust in the human+AI system to be metaphilosophically competent enough to safely recursively self-improve.

On the other hand, if the AI's capabilities can be usefully throttled and restricted to apply only in narrow domains, I would feel much better about the operator avoiding manipulation. In this scenario, how well things turn out seems mostly dependent on the metaphilosophical competence of the operator.

(Caveat: I assign moderate credence to having some significant misunderstanding of Paul's notions of act-based agents or corrigibility, and would like to be corrected if this is the case.)

26 comments

Do you think the AI-assisted humanity is in a worse situation than humanity is today? If we are metaphilosophically competent enough that we can make progress, why won't we remain metaphilosophically competent enough once we have powerful AI assistants?

In your hypothetical in particular, why do the people in the future---who have had radically more subjective time to consider this problem than we have, have apparently augmented their intelligence, and have exchanged massive amounts of knowledge with each other---make decisions so much worse than those that you or I would make today?

From your story it seems like your position is something like:

Humanity is only likely to reach a good outcome because technological constraints force us to continue thinking rather than doing anything irreversible. Removing technological constraints and allowing humans to get what they short-term-want will be bad, because most humans don't have a short-term preference for deliberation, and many of the things they are likely to do would incidentally but permanently close off the prospect of future course corrections (or lead to value drift).

Is that an accurate characterization?

Other than disagreeing, my main complaint is that this doesn't seem to have much to do with AI. Couldn't you tell exactly the same story about human civilization proceeding along its normal development trajectory, never building an AI, but gradually uncovering new technologies and becoming smarter?

I think the relevance to AI is that AI might accelerate other kinds of progress more than it accelerates deliberation. But you don't mention that here, so it doesn't seem like what you have in mind. And at any rate, that seems like a separate problem from alignment, which really needs to be solved by different mechanisms. "Corrigibility" isn't really a relevant concept when addressing that problem.

It seems to me that a potential hidden assumption here is whether AGI is the last invention humanity will ever need to make. In the standard Bostromian/Yudkowskian paradigms, we create the AGI, then the AGI becomes a singleton that determines the fate of the universe, and humans have no more input (so we'd better get it right). Whereas the emphasis of approval-directed agents is that we humans will continue to be the deciders, we'll just have greatly augmented capability.

I don't see those as incompatible. A singleton can take input from humans.

Do you think the AI-assisted humanity is in a worse situation than humanity is today? If we are metaphilosophically competent enough that we can make progress, why won't we remain metaphilosophically competent enough once we have powerful AI assistants?

Depends on who "we" is. If the first team that builds an AGI achieves a singleton, then I think the outcome is good if and only if the people on that team are metaphilosophically competent enough, and don't have that competence corrupted by AIs.

In your hypothetical in particular, why do the people in the future---who have had radically more subjective time to consider this problem than we have, have apparently augmented their intelligence, and have exchanged massive amounts of knowledge with each other---make decisions so much worse than those that you or I would make today?

If the team in the hypothetical is less metaphilosophically competent than we are, or has its metaphilosophical competence corrupted by the AI, then its decisions would turn out worse.

I'm reminded of the lengthy discussion you had with Wei Dai back in the day. I share his picture of which scenarios will get us something close to optimal, his beliefs that philosophical ignorance might persist indefinitely, his skepticism about the robustness of human reflection, and his skepticism that human values will robustly converge upon reflection.

Is that an accurate characterization?

I would say so. Another fairly significant component is my model that humanity makes updates by having enough powerful people paying enough attention to reasonable people, enough other powerful people paying attention to those powerful people, and with everyone else roughly copying the beliefs of the powerful people. So, good memes --> reasonable people --> some powerful people --> other powerful people --> everyone else.

AI would make some group of people far more powerful than the rest, which screws up the chain if that group doesn't pay much attention to reasonable people. In that case, they (and the world) might just never become reasonable. I think this would happen if ISIS took control, for example.

Other than disagreeing, my main complaint is that this doesn't seem to have much to do with AI. Couldn't you tell exactly the same story about human civilization proceeding along its normal development trajectory, never building an AI, but gradually uncovering new technologies and becoming smarter?

I would indeed expect this by default, particularly if one group with one ideology attains decisive control over the world. But if we somehow manage to avoid that (which seems unlikely to me, given the nature of technological progress), I feel much more optimistic about metaphilosophy continuing to progress and propagate throughout humanity relatively quickly.

When I talk about alignment I'm definitely talking about a narrower thing than you. In particular, any difficulties that would exist with or without AI *aren't* part of what I mean by AI alignment.

Do you think the AI-assisted humanity is in a worse situation than humanity is today?

Lots of people involved in thinking about AI seem to be in a zero sum, winner-take-all mode. E.g. Macron.

I think there will be significant founder effects from the strategies of the people that create AGI. The development of AGI will be used as an example of what types of strategies win in the future during technological development. Deliberation may tell people that there are better equilibria. But empiricism may tell people that those equilibria are too hard to reach.

Currently the positive-sum norm of free exchange of scientific knowledge is being tested. For good reasons, perhaps? But I worry for the world if lack of sharing of knowledge gets cemented as the new norm. It will lead to more arms races and make coordination harder on the important problems. So if the creation of AI leads to the destruction of science as we know it, I think we might be in a worse position.

I, perhaps naively, don't think it has to be that way.

I think the relevance to AI is that AI might accelerate other kinds of progress more than it accelerates deliberation. [...] And at any rate, that seems like a separate problem from alignment, which really needs to be solved by different mechanisms.

What if the mechanism for solving alignment itself causes differential intellectual progress (in the wrong direction)? For example, suppose IDA makes certain kinds of progress easier than others, compared to no AI, or compared to another AI that's designed based on a different approach to AI alignment. If that's the case, it seems that we have to solve alignment (in your narrow sense) and differential intellectual progress at the same time instead of through independent mechanisms. An exception might be if we had some independent solution to differential intellectual progress that can totally overpower whatever influence AI design has on it. Is that what you are expecting?

It seems to me that for a corrigible, moderately superhuman AI, it is mostly the metaphilosophical competence of the human that matters, rather than that of the AI system. I think there are a bunch of confusions presented here, and I'll run through them, although let me disclaim that it's Eliezer's notion of corrigibility that I'm most familiar with, and so I'm arguing that your critiques fall flat on Eliezer's version.

"[The AI should] figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things..."

You omitted a key component of the quote that almost entirely reverses its meaning. The correct quote would read [emphasis added]: "[The AI should] help me figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...". That is, the AI should help with ensuring that the control continues to reside in the human, rather than in itself.

The messiah would in his heart of hearts have the best of intentions for them, and everyone would know that.

To my understanding, the point of corrigibility is that a corrigible system is supposed to benefit its human operators even if its intentions are somewhat wrong, so it is rather a non sequitur to say that an agent is corrigible because it has the best of intentions in its heart of hearts. If it truly fully understood human intentions and values, corrigibility may even be unnecessary.

He might also think it's a good idea for his followers to all drink cyanide together, or murder some pregnant actresses, and his followers might happily comply.

Clearly you're right that corrigibility is not sufficient for safety. A corrigible agent can still be instructed by its human operators to make a decision that is irreversibly bad. But it seems to help, and to help a lot. The point of a corrigible AI is that once it takes a few murderous actions, you can switch it off, or tell it to pursue a different objective. So for the messiah example, a corrigible messiah might poison a few followers and then when it is discovered, respond to an instruction to desist. An incorrigible messiah might be asked to stop murdering followers, but continue to do so anyway. So many of the more mundane existential risks would be mitigated by corrigibility.

And what about more exotic ones? I argue they would also be greatly (though not entirely) reduced. Consider that a corrigible messiah may still hide poison for all of the humans at once, leading to an irrevocably terrible outcome. But why should it? If it thinks it is doing well by the humans, then its harmful actions ought to be transparent. Perhaps the AI system's actions would not be transparent if its intelligence were so radically great that it was inclined to act in fast and incomprehensible ways. But it is hard to see how we could know with confidence that such a radically intelligent AI is the kind we will soon be dealing with. And even if we are going to deal with that kind of AI, there could be other remedies that would be especially helpful in such scenarios. For example, an AI that permits informed oversight of its activities could be superb if it was already corrigible. Then, it could not only provide truthful explanations of its future plans but also respond to feedback on them. Overall, if we had an AI system that was (1) only a little bit superhumanly smart, (2) corrigible, and (3) providing informative explanations of its planned behaviour, then it would seem that we are in a pretty good spot.

"This is absurd. Wouldn't they obviously have cared about animal suffering if they'd reflected on it, and chosen to do something about it before blissing themselves out?"
Yeah, but they never got around to that before blissing themselves out.

I think you're making an important point here, but here is how I would put it: If you have an AI system that is properly deferential to humans, you still need to rely on the humans not to give it any existentially catastrophic orders. But the corrigibility/deferential behavior has changed the situation from one in which you're relying on the metaphilosophical competence of the AI, to one in which you're relying on the metaphilosophical competence of the human (albeit as filtered through the actions of the AI system). In the latter case, yes, you need to survive having a human's power increased by some N-fold. (Not necessarily 10^15 as in the more extreme self-improvement scenarios, but by some N>1). So when you get a corrigible AI, you still need to be very careful with what you tell it to do, but your situation is substantially improved. Note that what I'm saying is at least in some tension with the traditional story of indirect normativity. Rather than trying to give the AI very general instructions for its interpretation, I'm saying that we should in the first instance try to stabilize the world so that we can do more metaphilosophical reasoning ourselves before trying to program an AI system that can carry out the conclusions of that thinking or perhaps continue it.

Would it want to? I think yes, because it's incentivized not to optimize for human values, but to turn humans into yes-men... The only thing I can imagine that would robustly prevent this manipulation is to formally guarantee the AI to be metaphilosophically competent itself.

Yes, an approval-directed agent might reward-hack by causing the human to approve of things that it does not value. And it might compromise the humans' reasoning abilities while doing so. But why must the AI system's metaphilosophical competence be the only defeater? Why couldn't this be achieved by quantilizing, or otherwise throttling the agent's capabilities? By restricting the agent's activities to some narrow domain? By having the agent somehow be deeply uncertain about where the human's approval mechanism resides? None of these seems clearly viable, but neither do any of them seem clearly impossible, especially in cases where the AI system's capabilities are overall not far beyond those of its human operators.
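
For readers unfamiliar with the term, here is a minimal sketch of what quantilizing could look like. It is not anyone's actual proposal in this thread. It assumes a finite list of candidate actions with a uniform base distribution, and the names `quantilize`, `utility`, and `q` are hypothetical, chosen only for illustration. Instead of taking the single highest-scoring action (which is where reward hacking tends to live), the agent samples at random from the top `q` fraction of actions.

```python
import math
import random


def quantilize(actions, utility, q, rng):
    """Sample uniformly from the top-q fraction of actions ranked by utility.

    Assumption: the base distribution over actions is uniform, so "top q of
    the probability mass" reduces to "top ceil(q * n) actions".
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    # Rank candidate actions from highest to lowest estimated utility.
    ranked = sorted(actions, key=utility, reverse=True)
    # Keep at least one action so the sampler is always well defined.
    k = max(1, math.ceil(q * len(ranked)))
    # Pick randomly among the kept actions rather than always taking the best.
    return rng.choice(ranked[:k])


if __name__ == "__main__":
    rng = random.Random(0)
    candidate_actions = list(range(100))  # hypothetical action ids
    # Hypothetical utility: higher id means higher estimated approval.
    print(quantilize(candidate_actions, lambda a: a, 0.25, rng))
```

With `q = 1` this just imitates the base distribution, and as `q` shrinks toward zero it approaches plain maximization, so `q` acts as the throttle on how hard the agent optimizes.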

Overall, I'd say superintelligent messiahs are sometimes corrigible, and they're more likely to be aligned if so.

Overall, my impression is that you thought I was saying, "A corrigible AI might turn against its operators and kill us all, and the only way to prevent that is by ensuring the AI is metaphilosophically competent." I was really trying to say "A corrigible AI might not turn against its operators and might not kill us all, and the outcome can still be catastrophic. To prevent this, we'd definitely want our operators to be metaphilosophically competent, and we'd definitely want our AI to not corrupt them. The latter may be simple to ensure if the AI isn't broadly superhumanly powerful, but may be difficult to ensure if the AI is broadly superhumanly powerful and we don't have formal guarantees". My sense is that we actually agree on the latter and agree that the former is wrong. Does this sound right? (I do think it's concerning that my original post led you to reasonably interpret me as saying the former. Hopefully my edits make this clearer. I suspect part of what happened is that I think humans surviving but producing astronomical moral waste is about as bad as human extinction, so I didn't bother delineating them, even though this is probably an unusual position.)

See below for individual responses.

You omitted a key component of the quote that almost entirely reverses its meaning. The correct quote would read [emphasis added]: "[The AI should] help me figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...". That is, the AI should help with ensuring that the control continues to reside in the human, rather than in itself.

I edited my post accordingly. This doesn't change my perspective at all.

To my understanding, the point of corrigibility is that a corrigible system is supposed to benefit its human operators even if its intentions are somewhat wrong, so it is rather a non sequitur to say that an agent is corrigible because it has the best of intentions in its heart of hearts. If it truly fully understood human intentions and values, corrigibility may even be unnecessary.

I gave that description to illustrate one way it is like a corrigible agent, which does have the best of intentions (to help its operators), not to imply that a well-intentioned agent is corrigible. I edited it to "In his heart of hearts, the messiah would be trying to help them, and everyone would know that." Does that make it clearer?

Clearly you're right that corrigibility is not sufficient for safety. A corrigible agent can still be instructed by its human operators to make a decision that is irreversibly bad. But it seems to help, and to help a lot. The point of a corrigible AI is that once it takes a few murderous actions, you can switch it off, or tell it to pursue a different objective. So for the messiah example, a corrigible messiah might poison a few followers and then when it is discovered, respond to an instruction to desist. An incorrigible messiah might be asked to stop murdering followers, but continue to do so anyway. So many of the more mundane existential risks would be mitigated by corrigibility.

I agree that corrigibility helps a bunch with mundane existential risks, and think that e.g. a corrigible misaligned superintelligence is unlikely to lead to self-annihilation, but pretty likely to lead to astronomical moral waste. I edited from "Surely, we wouldn't build a superintelligence that would guide us down such an insidious path?" to "I don't think a corrigible superintelligence would guide us down such an insidious path. I even think it would substantially improve the human condition, and would manage to avoid killing us all. But I think it might still lead us to astronomical moral waste." Does this clarify things?

And what about more exotic ones? I argue they would also be greatly (though not entirely) reduced. Consider that a corrigible messiah may still hide poison for all of the humans at once, leading to an irrevocably terrible outcome. But why should it? If it thinks it is doing well by the humans, then its harmful actions ought to be transparent. Perhaps the AI system's actions would not be transparent if its intelligence were so radically great that it was inclined to act in fast and incomprehensible ways. But it is hard to see how we could know with confidence that such a radically intelligent AI is the kind we will soon be dealing with. And even if we are going to deal with that kind of AI, there could be other remedies that would be especially helpful in such scenarios. For example, an AI that permits informed oversight of its activities could be superb if it was already corrigible. Then, it could not only provide truthful explanations of its future plans but also respond to feedback on them. Overall, if we had an AI system that was (1) only a little bit superhumanly smart, (2) corrigible, and (3) providing informative explanations of its planned behaviour, then it would seem that we are in a pretty good spot.

I agree with all this.

I think you're making an important point here, but here is how I would put it: If you have an AI system that is properly deferential to humans, you still need to rely on the humans not to give it any existentially catastrophic orders. But the corrigibility/deferential behavior has changed the situation from one in which you're relying on the metaphilosophical competence of the AI, to one in which you're relying on the metaphilosophical competence of the human (albeit as filtered through the actions of the AI system). In the latter case, yes, you need to survive having a human's power increased by some N-fold. (Not necessarily 10^15 as in the more extreme self-improvement scenarios, but by some N>1). So when you get a corrigible AI, you still need to be very careful with what you tell it to do, but your situation is substantially improved. Note that what I'm saying is at least in some tension with the traditional story of indirect normativity.

I also agree with all this.

Rather than trying to give the AI very general instructions for its interpretation, I'm saying that we should in the first instance try to stabilize the world so that we can do more metaphilosophical reasoning ourselves before trying to program an AI system that can carry out the conclusions of that thinking or perhaps continue it.

I also agree with all this. I never imagined giving the corrigible AI extremely general instructions for its interpretation.

Yes, an approval-directed agent might reward-hack by causing the human to approve of things that it does not value. And it might compromise the humans' reasoning abilities while doing so. But why must the AI system's metaphilosophical competence be the only defeater? Why couldn't this be achieved by quantilizing, or otherwise throttling the agent's capabilities? By restricting the agent's activities to some narrow domain? By having the agent somehow be deeply uncertain about where the human's approval mechanism resides? None of these seems clearly viable, but neither do any of them seem clearly impossible, especially in cases where the AI system's capabilities are overall not far beyond those of its human operators.

I think this captures my biggest update from your comment, and I modified the ending of this post to reflect this. Throttling the AI's power seems more viable than I'd previously thought, and seems like a pretty good way to significantly lower the risk of manipulation. That said, I think even extraordinary human persuaders might compromise human reasoning abilities, and I have fast takeoff intuitions that make it very hard for me to imagine an AGI that simultaneously

  • understands humans well enough to be corrigible
  • is superhumanly intelligent at engineering or global strategy
  • isn't superhumanly capable of persuasion
  • wouldn't corrupt humans (even if it tried to not corrupt them)

I haven't thought too much about this though, and this might just be a failure of my imagination.
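
For concreteness, here is one rough reading of the "quantilizing" idea mentioned above: instead of taking the single highest-scoring action, the agent samples from the top fraction `q` of candidate actions. This is only a minimal sketch under simplifying assumptions: the function name `quantilize`, the finite list of candidate actions (standing in for a base distribution), and the `utility` callback are all hypothetical and not taken from the discussion.

```python
import random


def quantilize(actions, utility, q=0.1, rng=None):
    """Sample uniformly from the top q fraction of actions, ranked by utility.

    Assumption: `actions` is a non-empty finite list standing in for samples
    from some base distribution. With a tiny q this approaches a plain argmax;
    with q = 1 it ignores utility and just samples from the base list.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # always keep at least one candidate
    return rng.choice(ranked[:k])


# Hypothetical usage: 100 candidate actions scored by their own value.
choice = quantilize(list(range(100)), utility=lambda a: a, q=0.1)
```

The throttling comes from `q`: the larger it is, the less optimization pressure the agent applies beyond what the base distribution would do anyway.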

Overall, I'd say superintelligent messiahs are sometimes corrigible, and they're more likely to be aligned if so.

Agreed. Does the new title seem better? I was mostly trying to explicate a distinction between corrigibility and alignment, which was maybe obvious to you beforehand, and illustrate the astronomical waste that can result even if we avoid self-annihilation.

Does this sound right?

yep.

A corrigible AI might not turn against its operators and might not kill us all, and the outcome can still be catastrophic. To prevent this, we'd definitely want our operators to be metaphilosophically competent, and we'd definitely want our AI to not corrupt them.

I agree with this.

a corrigible misaligned superintelligence is unlikely to lead to self-annihilation, but pretty likely to lead to astronomical moral waste.

There's a lot of broad model uncertainty here, but yes, I'm sympathetic to this position.

Does the new title seem better?

Yep.

At this round of edits, my main objection would be to the remark that the AI wants us to act as yes-men, which seems dubious if the agent is (i) an Act-based agent or (ii) sufficiently broadly uncertain over values.

What I see to be the main message of the article as currently written is that humans controlling a very powerful tool (especially AI) could drive themselves into a suboptimal fixed point due to insufficient philosophical sophistication.

This I agree with.

Thanks Ryan!

What I see to be the main message of the article as currently written is that humans controlling a very powerful tool (especially AI) could drive themselves into a suboptimal fixed point due to insufficient philosophical sophistication.
This I agree with.

Hurrah!

At this round of edits, my main objection would be to the remark that the AI wants us to act as yes-men, which seems dubious if the agent is (i) an Act-based agent or (ii) sufficiently broadly uncertain over values.

I no longer think it wants us to turn into yes-men, and edited my post accordingly. I still think it will be incentivized to corrupt us, and I don't see how being an act-based agent would be sufficient, though it's likely I'm missing something. I agree that if it's sufficiently broadly uncertain over values then we're likely to be fine, but in my head that unpacks into "if we knew the AI were metaphilosophically competent enough, we'd be fine", which doesn't help things much.

Would it want to? I think yes, because it's incentivized not to optimize for human values, but to turn humans into yes-men.

I don't think this is right. The agent is optimized to choose actions which, when shown to a human, receive high approval. It's not optimized to pick actions which, when executed, cause the agent to receive high approval in the future.

(I'm happy to grant that there may be optimization daemons, but that seems separate.)
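
A toy sketch of the distinction being drawn here, under made-up numbers: one selector scores actions by the approval they get when shown to the human now, the other by the approval the human gives after the action has been executed. All names (`approval_if_shown`, `simulate`, `approval_later`) and all values are hypothetical illustrations, not a description of any actual system.

```python
# Hypothetical approval the operator gives when merely shown each action.
APPROVAL_IF_SHOWN = {"explain honestly": 0.9, "flatter operator": 0.4}

# Hypothetical world model: the operator's state after the action is executed.
NEXT_STATE = {"explain honestly": "informed", "flatter operator": "yes-man"}

# Hypothetical approval the operator gives later, in each resulting state.
APPROVAL_LATER = {"informed": 0.8, "yes-man": 1.0}


def approval_directed_choice(actions, approval_if_shown):
    """Pick the action rated highest when shown to the operator now."""
    return max(actions, key=approval_if_shown)


def future_approval_choice(actions, simulate, approval_later):
    """Pick the action whose executed consequences yield the highest later approval."""
    return max(actions, key=lambda a: approval_later(simulate(a)))


actions = list(APPROVAL_IF_SHOWN)
now_pick = approval_directed_choice(actions, APPROVAL_IF_SHOWN.get)
later_pick = future_approval_choice(actions, NEXT_STATE.get, APPROVAL_LATER.get)
```

With these invented numbers the two selectors disagree: only the second has any incentive to change the operator, because only its objective depends on the operator's future state.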

I don't think this is right. The agent is optimized to choose actions which, when shown to a human, receive high approval. It's not optimized to pick actions which, when executed, cause the agent to receive high approval in the future.

I think optimizing for high approval now leaves a huge number of variables unconstrained. For example, I could absolutely imagine a corrigible AI with ISIS suicide bomber values that consistently receives high approval from its operator and eventually turns its operator into an ISIS suicide bomber. (Maybe not all operators, but definitely some operators.)

Given the constraint of optimizing for high approval now, in what other directions would our corrigible AI try to optimize? Some natural guesses would be optimizing for future approval, or optimizing for its model of its operator's extrapolated values (which I would distrust unless I had good reason to trust its extrapolation process). If it were doing either, I'd be very scared about getting corrupted. But you're right that it may not optimize for us turning into yes-men in particular.

I suspect this disagreement is related to our disagreement about the robustness of human reflection. Actually, the robustness of human reflection is a crux for me: if I thought human reflection were robust, then I'd think an AI that continuously optimizes for high approval now would leave few important variables unconstrained, and would lead us to very good outcomes by default. Is this a crux for you too?

What do you mean by leaving variables unconstrained? Optimizing for X is basically a complete description.

(Of course, if I optimize my system for X, I may get a system that optimizes for Y != X, but that doesn't seem like what you are talking about. I don't think that the Y's you described---future approval, or idealized preferences---are especially likely Y's. More likely is something totally alien, or even reproductive fitness.)

Oops, I do think that's what I meant. To explain my wording: when I imagined a "system optimizing for X", I didn't imagine that system trying its hardest to do X, I imagined "a system for which the variable Z it can best be described as optimizing is X".

To say it all concretely another way, I mean that there are a bunch of different systems that, when "trying to optimize for X as hard as possible", all look to us like they optimize for X successfully, but do so via methods that lead to vastly different (and generally undesirable) end states, like the one described in this post, or one where the operators become ISIS suicide bombers. In this light, it seems more accurate to describe such a system as optimizing for some Y instead of X, even though it is trying to optimize for X and optimizes it pretty successfully. But I really don't want a superintelligent system optimizing for some Y that is not my values.

As a possibly related general intuition, I think the space of outcomes that can result from having a human follow a sequence of suggestions, each of which they'd enthusiastically endorse, is massive, and that most of these outcomes are undesirable. (It's possible that one crisp articulation of "sufficient metaphilosophical competence" is that following a sequence of suggestions, each of which you'd enthusiastically endorse, is actually good for you.)

On reflection, I agree that neither future approval nor idealized preferences are particularly likely, and that whatever Y is would actually look very alien.

Side note:

but a ragtag team of hippie-philosopher-AI-researchers

I love this phrase. I think I'm going to use it in my online dating profile.

Wouldn't this corruption or manipulation render the AGI incorrigible? I think not, because I don't think corruption or manipulation are natural categories. For example, I think it's very common for humans to unknowingly influence other humans in subtle ways while honestly believing they're only trying to be helpful, while an onlooker might describe the same behavior as manipulative. (Section IV here provides an amusing illustration.) Likewise, I think an AGI can be manipulating us while genuinely thinking it's helping us and being completely open with us (much like a messiah), unaware that its actions would lead us somewhere we wouldn't currently endorse.

What do you mean by manipulation?

If the AI is optimizing its behavior to have some effect on the human, then that's practically the central case the concept of corrigibility is intended to exclude. I don't think it matters what the AI thinks about what the AI is doing, it just matters what optimization power it is applying.

If the AI isn't optimizing to influence our behavior, then I'm back to not understanding the problem. Can you flesh out this step of the argument? Is the problem that helping humans get what they short-term-want will lead to trouble? Is it something else?

If the AI is optimizing its behavior to have some effect on the human, then that's practically the central case the concept of corrigibility is intended to exclude. I don't think it matters what the AI thinks about what the AI is doing, it just matters what optimization power it is applying.

See my comment re: optimizing for high approval now vs. high approval later.

If you buy (as I do) that optimizing for high approval now leaves a huge number of important variables unconstrained, I don't see how it makes sense to talk about an AI optimizing for high approval now without also optimizing to have some effect on the human, because the unconstrained variables are about effects on the human. If there were a human infant in the wilderness and you told me to optimize for keeping it alive without optimizing for any other effects on the infant, and you told me I'd be screwing up if I did optimize for other effects, I would be extremely confused about how to proceed.

If you don't buy that optimizing for high approval now leaves a huge number of important variables unconstrained, then I agree that the AI optimizing its behavior to have some effects on the human should be ruled out by the definition of corrigibility.

Saying "choose the action a for which your expectation of f(a) is highest," doesn't leave you any degrees of freedom. Similarly, "choose the action for which the child's probability of survival is highest" seems pretty unambiguous (modulo definitions of counterfactuals).

I might be misunderstanding what you are saying somehow.
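
To make the rule "choose the action a for which your expectation of f(a) is highest" concrete, here is a minimal sketch. The names `expected_value` and `best_action`, the dictionary layout, and the probabilities are assumptions made up for illustration. Once f and the probabilities are fixed, the only freedom left in the rule itself is how ties are broken.

```python
def expected_value(outcomes):
    """Expectation of f over a list of (probability, value) pairs."""
    return sum(p * v for p, v in outcomes)


def best_action(action_outcomes):
    """Return the action with the highest expected value.

    Ties go to whichever action appears first in the dictionary.
    """
    return max(action_outcomes, key=lambda a: expected_value(action_outcomes[a]))


# Hypothetical example: a gamble versus a sure thing.
toy_outcomes = {
    "gamble": [(0.5, 1.0), (0.5, 0.0)],  # expected value 0.5
    "sure_thing": [(1.0, 0.6)],          # expected value 0.6
}
chosen = best_action(toy_outcomes)
```

Note what the sketch does not capture: it says nothing about side effects of the chosen action that f does not score, which seems to be where the disagreement about "unconstrained variables" lives.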

Not sure if this is what zhukeepa means, but “choose the action for which the child’s probability of survival is highest” is very likely going to involve actions that could be interpreted as "manipulation" unless the AI deliberately places a constraint on its optimization against doing such things.

But since there is no objective standard for what is manipulation vs education or helpful information, Overseers will need to apply their subjective views (or their understanding of the user's views) of what counts as manipulation and what doesn't. If they get this wrong (or simply forget or fail to place the appropriate constraint on the optimization), then the user will end up being manipulated even though the AI could be considered to be genuinely trying to be helpful.

EDIT: As a matter of terminology, I might actually prefer to call this scenario a failure of corrigibility rather than corrigible but misaligned. I wonder if zhukeepa has any reasons to want to call it the latter.

Totally agree that "choose the action for which the child's probability of survival is highest" involves manipulation (though no one was proposing that). I'm confused about the meaning of "unconstrained variables" though.

I could be wrong, but I feel like if I ask for education or manipulation and the AI gives it to me, and bad stuff happens, that's not a problem with the redirectibility or corrigibility of the agent. After all, it just did what it was told. Conversely, if the AI system refuses to educate me, that seems rather more like a corrigibility problem. A natural divider is that with a corrigible AI we can still inflict harm on ourselves via our use of that AI as a tool.

I think there must be a miscommunication somewhere because I don't see how your point is a response to mine. My scenario isn't "I ask for education or manipulation and the AI gives it to me, and bad stuff happens", but something like this: I ask my AI to help me survive, and the AI (among other things) converts me to some religion because it thinks belonging to a church will give me a support group and help maximize my chances, and the Overseer thinks religious education is just education rather than manipulation, or mistakenly thinks I think that, or made some other mistake that failed to prevent this.

I see. What I was trying to do was answer your terminology question by addressing simple extreme cases. E.g., if you ask an AI to disconnect its shutdown button, I don't think it's being incorrigible. If you ask an AI to keep you safe, and then it disconnects its shutdown button, it is being incorrigible.

I think the main way the religion case differs is that the AI system is interfering with our intellectual ability for strategizing about AI rather than our physical systems for redirecting AI, and I'm not sure how that counts. But if I ask an AI to keep me safe and it mind-controls me to want to propagate that AI, that's sure incorrigible. Maybe, as you suggest, it's just fundamentally ill-defined...

Excellent write up.

"place far more trust in the human+AI system to be metaphilosophically competent enough to safely recursively self-improve " : I think that's a Problem enough People need to solve (to possible partial maximum) in their own minds, and only they should be "Programming" a real AI.

Sadly this won't be the case =/.

That dialog reminds me of some scenes from Friendship is Optimal, only even more morally off-kilter than CelestAI, which is saying something.