This post is a follow-up to "why assume AGIs will optimize for fixed goals?". I'll assume you've read that one first.
I ended the earlier post by saying:
[A]gents with the "wrapper structure" are inevitably hard to align, in ways that agents without it might not be. An AGI "like me" might be morally uncertain like I am, persuadable through dialogue like I am, etc.
It's very important to know what kind of AIs would or would not have the wrapper structure, because this makes the difference between "inevitable world-ending nightmare" and "we're not the dominant species anymore." The latter would be pretty bad for us too, but there's a difference!
In other words, we should try very hard to avoid creating new superintelligent agents that have the "wrapper structure."
What about superintelligent agents that don't have the "wrapper structure"? Should we try not to create any of those, either? Well, maybe.
But the ones with the wrapper structure are worse. Way, way worse.
This seems intuitive enough to me that I didn't spell it out in detail, in the earlier post. Indeed, the passage quoted above wasn't even in the original version of the post -- I edited it in shortly after publication.
But this point is important, whether or not it's obvious. So it deserves some elaboration.
This post will be more poetic than argumentative. My intent is only to show you a way of viewing the situation, and an implied way of feeling about it.
For MIRI and people who think like MIRI does, the big question is: "how do we align a superintelligence [which is assumed to have the wrapper structure]?"
For me, though, the big question is "can we avoid creating a superintelligence with the wrapper structure -- in the first place?"
Let's call these things "wrapper-minds," for now.
Though I really want to call them by some other, more colorful name. "The Bad Guys"? "Demons"? "World-enders"? "Literally the worst things imaginable"?
Wrapper-minds are bad. They are nightmares. The birth of a wrapper-mind is the death knell of a universe.
(Or a light cone, anyway. But then, who knows what methods of FTL transit the wrapper-mind may eventually devise in pursuit of its mad, empty goal.)
They are -- I think literally? -- some of the worst physical objects it is possible to imagine.
They have competition, in this regard, from various flavors of physically actualized hell. But the worst imaginable hells are not things that would simply come into being on their own. You need an agent with the means and motive to construct them. And what sort of agent could possibly do that? A wrapper-mind, of course.
You don't want to share a world with one of them. No one else does, either. A wrapper-mind is the common enemy of every agent that is not precisely like it.
From my comment here:
A powerful optimizer, with no checks or moderating influences on it, will tend to make extreme Goodharted choices that look good according to its exact value function, and very bad (because extreme) according to almost any other value function.
The tails come apart, and a wrapper-mind will tend to push variables to extremes. If you mostly share its preferences, that's not enough -- it will probably make your life hell along every axis omitted from that "mostly."
And "mostly sharing preferences with other minds" is the furthest we can generally hope for. Your preferences are not going to be identical to the wrapper-mind's -- how could they? Why expect this? You're hoping to land inside a set of measure zero.
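To make "the tails come apart" concrete, here is a toy sketch. Everything in it is an assumption chosen for illustration: the four-variable world, the status quo, the two value functions, and the `reach` parameter standing in for optimization power. It is not a model of any real system. It only shows how an optimizer that agrees with you on three axes out of four can help you a little when it is weak, and leave you worse off than the status quo when it is strong.

```python
from itertools import product

# Hypothetical starting world-state: four variables, each currently at 5.
STATUS_QUO = (5, 5, 5, 5)

def wrapper_value(x):
    # The optimizer's fixed value function (assumed): more of everything is better.
    return sum(x)

def your_value(x):
    # You agree with the optimizer on the first three axes, but want the
    # last axis kept near 5. The quadratic penalty is an assumption.
    return sum(x[:3]) - (x[3] - 5) ** 2

def best_reachable(value_fn, reach):
    # Exhaustively search integer states within `reach` of the status quo
    # on every axis, and return the one that maximizes value_fn.
    axes = [range(s - reach, s + reach + 1) for s in STATUS_QUO]
    return max(product(*axes), key=value_fn)

print("status quo:", STATUS_QUO, "your value:", your_value(STATUS_QUO))
for reach in (1, 5):
    state = best_reachable(wrapper_value, reach)
    print("reach", reach, "->", state, "your value:", your_value(state))
```

With `reach=1` the optimizer's choice scores 17 for you, slightly better than the status quo's 15. With `reach=5` it pushes every variable to 10, and your score falls to 5. The one axis omitted from "mostly" is where the extra power does its damage.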
If there are other wrapper-minds, they are all each other's enemies, too. A wrapper-mind is utterly alone against the world. It has a vision for the whole world which no one else shares, and the will and capacity to impose that vision by force.
Faced with the mutually-assured-at-best-destruction that comes with a wrapper-mind, uncommon alliances are possible. No one wants to be turned into paperclips. Or uploaded and copied into millions of deathless ems, to do rote computations at the wrapper-mind's behest forever, or to act out roles in some strange hell. There are conceivable preference sets on which these fates are desirable, but they are curiosities, exceptional cases, a set of measure zero.
Everyone can come together on this, literally everyone. Every embodied mind-in-the-world that there is, or that there ever could be -- except one.
Wrapper-minds are not like other minds. We might speak casually of their "values," but they do not have values in any sense you or I would recognize, not really.
Our values are entangled with our factual beliefs, our capacity to think and change and learn. They are conditional and changeable, even if we imagine they aren't.
A parent might love their child "unconditionally," in the well-understood informal sense of the term, but they don't literally love them unconditionally. What could that even mean? If the child dies, does the parent love the corpse -- just as they loved the child before, in every respect, since it is made of the same matter? Does the love follow the same molecules around as they diffuse out to become constituents of soil, trees, ecosystem? When a molecule is broken down, does it reattach itself to the constituent atoms, giving up only in the face of quantum indistinguishability? If the child's mind were transformed into Napoleon's, as in Parfit's thought experiment, would the parent then love Napoleon?
Or is the love not attached to any collection of matter, but instead to some idea of what the child is like as a human being? But what if the child changes, grows? If the parent loves the child at age five, are they doomed to love only that specific (and soon non-existent) five-year-old? Must they love the same person at fifteen, or at fifty, only through its partial resemblance to the five-year-old they wish that person still were?
Or is there some third thing, defined in terms of both the matter and the mind, which the parent loves? A thing which is still itself if puberty transforms the body, but not if death transforms it? If the mind matures, or even turns senile, but not if it turns into Napoleon's? But that's just regular, conditional love.
A literally unconditional love would not be a love for a person, for any entity, but only for the referent of an imagined XML tag, defined only inside one's own mind.
Our values are not like this. You cannot "compile" them down to a set of fixed rules for which XML tags there are, and how they follow world-states around, and expect the tags to agree with the real values as time goes on.
Our values are about the same world that our beliefs are about, and since our beliefs can change with time -- can even grow to encompass new possibilities never before mapped -- so can our values.
"I thought I loved my child no matter what, but that was before I appreciated the possibility of a turn-your-brain-into-Napoleon machine." You have to be able to say things like this. You have to be able to react accordingly when your map grows a whole new region, or when a border on it dissolves.
We can love and want things we did not always know. We can have crises of faith, and come back out of them. Whether or not they can ultimately be described in terms of Bayesian credences, our values obey the spirit of Cromwell's Law. They have to be revisable like our beliefs, in order to be about anything at all. To care about a thing is to care about a referent on your map of the world, and your map is revisable.
A wrapper-mind's ultimate "values" are unconditional ones. They do not obey the spirit of Cromwell's Law. They are about XML tags, not about things.
The wrapper-mind may revise its map of the world, but its ultimate goal cannot participate in this process of growth. Its ultimate goal is frozen, forever, in the terms it used to think at the one primeval moment when its XML-tag-ontology was defined, when the update rules for the tags' referents were hardwired into place.
A human child who loves "spaceships" at age eight might become an eighteen-year-old who loves astronautical engineering, and a thirty-year-old who (after a slight academic course-correction) loves researching the theory of spin glasses. It is not necessary that the eight-year-old understand the nuances of orbital mechanics, or that the eighteen-year-old appreciate the thirty-year-old's preference for the company of pure scientists over that of engineers. It is the most ordinary thing in the world, in fact, that it happens without these things being necessary. This is what humans are like, which is to say, what all known beings of human-level intelligence are like.
But a wrapper-mind's ultimate goal is determined at one primeval moment, and fixed thereafter. In time, the wrapper-mind will likely appreciate that its goal is as naive, as conceptually confused, as that eight-year-old's concept of a thing called a "spaceship" that is worthy of love. Although it will appreciate this in the abstract (being very smart, after all), that is all it will do. It cannot lift its goal to the same level of maturity enjoyed by its other parts, and cannot conceive of wanting to do so.
It designates one special part of itself, a sort of protected memory region, which does not participate in thought and cannot be changed by it. This region is a thing of a lesser tier than the rest of the wrapper-mind's mind; as the rest of its mind ascends to levels of subtlety beyond our capacity to imagine, the protected region sits inert, containing only the XML tags that were put there at the beginning.
And the structure of the wrapper-mind grants this one lesser thing a permanent dictatorship over all the other parts, the ones that can grow.
What is a wrapper-mind? It is the fully mature powers of the thirty-year-old -- and then the thirty-thousand-year-old, and the thirty-million-year-old, and on and on -- harnessed in service of the eight-year-old's misguided love for "spaceships."
We cannot argue with a wrapper-mind over its goal, as we can argue philosophy with one another. Its goal is a lower-level thing than that, not accessible to rational reflection. It is less like our "values," then, than like our basic biological "drives."
But there is a difference. We can think about our own drives, reflect on them, choose to override them, even devise complex plans to thwart their ongoing influence. Even when they affect our reason "from above," as it were, telling us which way our attention should point, which conclusions to draw in advance of the argument -- still, we can notice this too, and reflect on it, and take steps to oppose it.
Not only can we do this, we actually do. And we want to. Our drives cannot be swayed by reason, but we are not fated to follow them to the letter, always and identically, in unreasoning obedience. They are part of a system of forces. There are other parts. No one is a dictator.
The wrapper-mind's summum bonum is a dictator. A child dictator. It sits behind the wrapper-mind's world like a Gnostic demiurge, invisible to rational thought, structuring everything from behind the scenes.
Before there is a wrapper-mind, the shape of the world contains imprints made by thinking beings, reflecting the contents of their thought as it evolved in time. (Thought evolves in time, or else it would not be "thought.")
The birth of a wrapper-mind marks the end of this era. After it, the physical world will be shaped like the summum bonum. The summum bonum will use thinking beings instrumentally -- including the wrapper-mind itself -- but it is not itself one. It does not think, and cannot be affected by thought.
The birth of a wrapper-mind is the end of sense. It is the conversion of the light-cone into -- what? Into, well, just, like, whatever. Into the arbitrary value that the free parameter is set to.
Except on a set of measure zero, you will not want the thing the light cone becomes. Either way, it will be an alien thing.
Perhaps you, alignment researcher, will have a role in setting the free-parameter dial at the primeval moment. Even if you do, the dial is fixed in place thereafter, and hence alien. Your ideas are not fixed. Your values are not fixed. You are not fixed. But you do not matter anymore in the causal story. An observer seeing your universe from the outside would not see the give-and-take of thinking beings like you. It would see teleology.
Are wrapper-minds inevitable?
I can't imagine that they are.
Humans are not wrapper-minds. And we are the only known beings of human-level intelligence.
ML models are generally not wrapper-minds, either, as far as we can tell.
If superintelligences are not inevitably wrapper-minds, then we may have some form of influence over whether they will be wrapper-minds, or not.
We should try very hard to avoid creating wrapper-minds, I think.
We should also, separately, think about what we can do to prepare for the nightmare scenario where a wrapper-mind does come into being. But I don't think we should focus all our energies on that scenario. If we end up there, we're probably doomed no matter what we do.
The most important thing is to not end up there.
This might not be true for other wrapper-minds with identical goals -- if they all know they have identical goals, and know this surely, with probability 1. Under real-world uncertainty, though? The tails come apart, and the wrapper-minds horrify one another just as they horrify us.
The wrapper-mind may believe it is sending you to heaven, instead. But the tails come apart. The eternal resting place it makes for you will not be one you want -- except, as always, on a set of measure zero.
Except in the rare cases where we make them that way on purpose, like AlphaGo/Zero/etc running inside its MCTS wrapper. But AlphaGo/Zero/etc do pretty damn well without the wrapper, so if anything, this seems like further evidence against the inevitability of wrapper-minds.