Epistemic status: confused.

I currently see two conceptualizations of an aligned post-AGI/foom world:

1. Surrender to the Will of the Superintelligence

Any ‘control’ in a world with a Superintelligence will have to be illusory. The creation of an AGI will be the last truly agentic thing humans do. A Superintelligence would be so far superior to any human or group of humans, and able to manipulate humans so well, that any “choice” humanity faces will be predetermined. If an AI understands you better than yourself, can predict you better than yourself, and understands the world and human psychology well enough that it can bring about whatever outcome it wants, then any sense of ‘control’ – any description of the universe putting humans in the driver’s seat – will be false.

This doesn’t mean that alignment is necessarily impossible. The humans creating the Superintelligence could still instil it with a goal of leaving the subjective human experience of free-will intact. An aligned Superintelligence would still put humans into situations where the brain’s algorithm of deliberate decision making is needed, even if the choice itself as well as the outcome are ‘fake’ in some sense. The human experience of control should rank high in an aligned Superintelligence's utility function. But only a faint illusory glimmer of human choice would remain, while the open-ended, agentic power over the Universe would have left humanity with the creation of the first Superintelligence. That’s why it’s so crucial to get the value-loading problem right on the first try.

2. Riding the Techno-Leviathan

The alternative view seems to be something like: it will be possible to retain human agency over a post-AGI world thanks to boxing, interpretability, oracle AI, or some other selective impairment scheme that would leave human agency unspoiled by AGI meddling. The goal appears to be to either limit an AI’s optimization power enough, or insulate the algorithms of human decision making (brain, institutions of group decision making...) well enough, such that humanity remains sovereign, or “in the loop,” in a meaningful sense, while the AI’s goals also point towards the outcomes of human decision making.

This view implies that the integrity of human decision making can be maintained even in the face of an AGI’s optimization power.

I currently associate 1. more with MIRI/Superintelligence style thinking, and 2. with most other prosaic alignment schemes (Christiano, Olah…). 1. requires you to bite some very weird, unsettling and hard to swallow bullets about the post-AGI world, while 2. seems to point towards a somewhat more normal future, though it might suffer from naivete and normalcy bias.

Am I understanding this correctly? Are there people holding a middle ground view?

14 comments

Existing post that's one piece of the answer to this:


I think either is technically possible with perfect knowledge - that is, I don’t think either option is so incoherent that you cannot make any logical sense of it.

This leaves the question of which is easier. (1) requires somehow getting a full precise description of the human utility function. I don’t fully understand the arguments against (2), though MIRI seems to be pretty confident there are large issues.

The main distinction seems to be how strongly these superintelligent agents will use their power to influence human decision-making.

At one extreme end is total control, even in the most putatively aligned case: If my taking a sip of water from my glass at 10:04:22 am would be 0.000000001% better in some sense than sipping at 10:04:25 am, then it will arrange the inputs to my decision so that I take a sip of water at 10:04:22 am, and similarly for everything else that happens in the world. I do think that this would constitute a total loss of human control, though not necessarily a loss of human agency.

At the extreme other end would be something more like an Oracle, a superintelligent system (I hesitate to call it an agent) that has absolutely no preferences, including implied preferences, for the state of the world beyond some very narrow task.

Or to put it another way, how much slack will a superintelligence have in its implied preferences?

Concept 1 appears to be describing a superintelligence with no slack at all. Every human decision (and presumably everything else in the universe) must abide by a total strict order of preferences and it will optimize the hell out of those preferences. Concept 2 describes a superintelligence that may be designed to have - or be constrained to abide by - some slack in orderings of outcomes that depend upon human agency. Even if it can predict exactly what a human may decide, it doesn't necessarily have to act so as to cause a preferred expected distribution of outcomes.

I don't really think that we can rationally hold strong beliefs about where a future superintelligence might fall in this spectrum, or even outside it in some manner that I can't imagine. I do think that the literal first scenario is infeasible even for a superintelligent agent, if it is constrained by anything like our current understanding of physical laws. I can imagine a superintelligence that acts in a manner that is as close to that as possible, and that this would drastically reduce human control even in the most aligned case.

I believe something that doesn't match either view. I think both 1 and 2 are possible. But what we should probably go for is 3.

Suppose there is a superintelligent AI whose goal involves giving humans as much actual control as possible.

and able to manipulate humans so well, that any “choice” humanity faces will be predetermined.

All our choices are technically predetermined, because the universe is deterministic (modulo unimportant quantum details).

The AI could manipulate humans, but is programmed not to want to. 

This isn't an impairment scheme like boxing or oracle AI. This is the genie that listens to your wish without manipulating you, and then carries it out in the spirit you asked. If the human(s?) tell the AI to become a paperclip maximizer, the AI will. (Maybe with a little pop-up box, "It looks like you are about to destroy the universe, are you sure you want to do that?", to prevent mistakes.) And the humans are making that decision using brains that haven't been deliberately tampered with by the AI.

I think it depends on the goals of the superintelligence. If it is optimized for leaving humans in control, then it could do so. However, if it is not optimized for leaving humans in control, then it would be an instrumentally convergent goal for it to take over control, and therefore it could be assumed to do so.

I'm just confused about what "optimized for leaving humans in control" could even mean? If a Superintelligence is so much more intelligent than humans that it could find a way, without explicit coercion, for humans to ask it to tile the universe with paper-clips, then "control" seems like a meaningless concept. You would have to force the Superintelligence to treat the human skull, or whatever other boundary of human decision making, as some kind of unviolable and uninfluenceable black box.

This basically boils down to the alignment problem. We don't know how to specify what we want, but that doesn't mean it is necessarily incoherent.

Treating the human skull as "some kind of unviolable and uninfluenceable black box" seems to get you some of the way there, but of course is problematic in its own ways (e.g. you wouldn't want delusional AIs). Still it seems like it points to the path forwards in a way.

I think control is a meaningful concept. You could have AI that doesn't try to alter your terminal goals. Something that just does what you want (not what you ask, since that has well-known failure modes) without trying to persuade you into something else.

The difficulty of building such a system is another question, alas.

Third option not considered here (though it may be fairly unlikely)—it may be the case that superintelligence does not provide a substantial enough advantage to be able to control much of humanity, due to implications of chaos theory or something similar. Maybe it would be able to control politics fairly well, but some coordination problems could plausibly be beyond any reasonable finite intelligence, and hence beyond its control.


Why would you want to control a superintelligence aligned with our values? What would be the point of that?

Why would we want to allow for individual humans, who are less-than-perfectly-aligned with our values, to control a superintelligence that is perfectly-aligned-with-our-values?

A Superintelligence would be so far superior to any human or group of humans, and able to manipulate humans so well, that any “choice” humanity faces will be predetermined.

I guess the positive way to phrase this is, "FAI would create an environment where the natural results of our choices would typically be good outcomes" (typically, but not always, because being optimized too hard to succeed is not fun).

Talking about manipulation seems to imply that FAI would trick humans into making choices against their own best interest. I don't think that, typically, is what would happen.

I also see a scenario where FAI deliberately limits its ability to predict people's actions, out of respect for people being upset over the feeling of their choices being "predetermined".

But only a faint illusory glimmer of human choice would remain, while the open-ended, agentic power over the Universe would have left humanity with the creation of the first Superintelligence.

Meh. I'd rather have the FAI make the big-picture decisions, rather than some corrupt/flawed group of human officials falling prey to the usual bias in human thinking. Either way, I am not the one making the decisions, so what does it matter to me? At least FAI would actually make good decisions.

I didn't mean to make 1. sound bad. I'm only trying to put my finger on a crux. My impression is that most prosaic alignment work has 2. in mind, even though MIRI/Bostrom/LW seem to believe that 1. is actually what we should be aiming towards. Do prosaic alignment people think that work on human 'control' now will lead to scenario 1 in the long run, or do they just reject scenario 1?


I'm not sure I understand the "prosaic alignment" position well enough to answer this.

I guess, personally, I can see the appeal of scenario 2, of keeping a super-optimizer under control and using it in limited ways to solve specific problems. I also find that scenario incredibly terrifying, because super-optimizers that don't optimize for the full set of human values are dangerous.
