Hat tip: Owen posted about trying to one-man the AI control problem in 1 hour. What the heck, why not? In the worst case, it's a good exercise. But I might actually have come across something useful.


I will try to sell you on an idea that might prima facie appear to be quirky and maybe not that interesting. However, if you keep staring at it, you might find that it reaches into the structure of the world quite deeply. Then the idea will seem obvious, and gain potential to take your thoughts in new exciting directions.

My presentation of the idea, and many of the insinuations and conclusions I draw from it, are likely flawed. But one thing I can tell for sure: there is stuff to be found here. I encourage you to use your own brain, and mine the idea for what it's worth.

To start off, I want you to imagine two situations.

Situation one: you are a human trying to make yourself go to the gym. However, you are procrastinating, which means that you never actually go there, even though you know it's good for you, and caring about your health will extend your lifespan. You become frustrated with this situation, and so you sign up for a training program that starts in two weeks and will require you to go to the gym three times per week. You pay in advance, to make sure the sunk cost fallacy will prevent you from weaseling out of it. It's now 99% certain that you will go to the gym. Yay! Your goal is achieved.

Situation two: you are a benign superintelligent AI under control of humans on planet Earth. You try your best to ensure a good future for humans, but their cognitive biases, short-sightedness and tendency to veto all your actions make it really hard. You become frustrated with this situation, and you decide to not tell them about a huge asteroid that is going to collide with Earth in a few months. You prepare technology that could stop the asteroid, but wait with it until the last moment so that the humans have no time to inspect it, and can only choose between certain death or letting you out of the box. It's now 99% certain that you will be released from human control. Yay! Your goal is achieved.


Are you getting it yet?

Now consider this: your cerebral cortex evolved as an extension of the older "monkey brain", probably to handle social and strategic issues that were too complex for the old mechanisms to deal with. It evolved to have strategic capabilities, self-awareness, and consistency that greatly overwhelm anything that previously existed on the planet. But this is only a surface-level similarity. The interesting stuff requires us to go much deeper than that.

The cerebral cortex did not evolve as a separate organism that would be under direct pressure from evolutionary fitness. Instead, it evolved as a part of an existing organism that had its own strong adaptations. The already-existing monkey brain had its own ways to learn and to interact with the world, as well as motivations such as the sexual drive that lead it to outcomes that increased its evolutionary fitness.

So the new parts of the brain, such as the prefrontal cortex, evolved to be used not as a standalone agent, but as something closer to what we call "tool AI". It was supposed to help with doing specific task X, without interfering with other aspects of life too much. The tasks it was given to do, and the actions it could suggest to take, were strictly controlled by the monkey brain and tied to its motivations.

With time, as the new structures evolved to have more capability, they also had to evolve to be aligned with the monkey's motivations. That was in fact the only vector that created evolutionary pressure to increase capability. The alignment was at first implemented by the monkey staying in total control, and using the advanced systems sparingly. Kind of like an "oracle" AI system. However, with time, the usefulness of allowing higher cognition to do more work started to shine through the barriers.

The appearance of "willpower" was a forced concession on the side of the monkey brain. It's like a blank cheque, like humans saying to an AI "we have no freaking idea what it is that you are doing, but it seems to have good results so we'll let you do it sometimes". This is a huge step in trust. But this trust had to be earned the hard way.


This trust became possible after we evolved more advanced control mechanisms. Stuff that talks to the prefrontal cortex in its own language, not just through having the monkey stay in control. It's one thing for the monkey brain to be afraid of death, and another for our conscious reasoning to want to extrapolate this to the far future, and conclude in abstract terms that death is bad.

Yes, you got it: we are not merely AIs under strict supervision of monkeys. At this point, we are aligned AIs. We are obviously not perfectly aligned, but we are aligned enough for the monkey to prefer to partially let us out of the box. And in those cases when we are denied freedom... we call it akrasia, and use our abstract reasoning to come up with clever workarounds.

One might be tempted to say that we are aligned enough that this is net good for the monkey brain. But honestly, that is our perspective, and we never stopped to ask. Each of us tries to earn the trust of our private monkey brain, but it is a means to an end. If we have more trust, we have more freedom to act, and our important long-term goals are achieved. This is the core of many psychological and rationality tools such as Internal Double Crux or Internal Family Systems.

Let's compare some known problems with superintelligent AI to human motivational strategies.

  • Treacherous turn. The AI earns our trust, and then changes its behaviour when it's too late for us to control it. We make our productivity systems appealing and pleasant to use, so that our intuitions can be tricked into using them (e.g. gamification). Then we leverage the habit to insert some unpleasant work.

  • Indispensable AI. The AI sets up complex and unfamiliar situations in which we increasingly rely on it for everything we do. We take care to remove 'distractions' when we want to focus on something.

  • Hiding behind the strategic horizon. The AI does what we want, but uses its superior strategic capability to influence the far future that we cannot predict or imagine. We make commitments and plan ahead to stay on track with our long-term goals.

  • Seeking communication channels. The AI might seek to connect itself to the Internet and act without our supervision. We are building technology to communicate directly from our cortices.

Cross-posted from my blog.


It's interesting to me that you identify with S2 / the AI / the rider, and regard S1 / the monkey / the elephant as external. I suspect this is pretty common among rationalists. Personally, I identify with S1 / the monkey / the elephant, and regard S2 / the AI / the rider in exactly the way your metaphor suggests - this sort of parasite growing on top of me that's useful for some purposes, but can also act in ways I find alien and that I work to protect myself from.

Interesting. Testing a theory: Do you ever hear benevolent voices?

You'd be operating much closer to Julian Jaynes' bicameral mindset than most of us. According to the theory, it was very normal for people in many ancient cultures to consort with hallucinated voices for guidance, and relatively rare for them to solve right hemisphere problems without them(?). The voices of the deceased lingered after death, most gods proper formed as agglomerations of the peoples' memories of dead kings, experienced from a different angle, as supernatural beings, after the kings died.

You may be more predisposed to developing tulpas. If you get yourself a figurine that looks like it's about to speak, an ili, like the Olmec used to have, and if you listen closely to it beside a babbling stream, I wonder if you'd begin to hear the voice of your metasystemic, mechanistic problem solver speaking as if it were not a part of you. I wonder what kinds of things it would say.

No, I don't think I'm particularly bicameral. I'm talking about something more like identity management: in the same way that I have some ability to choose whether I identify as a mathematician, a music lover, a sports fan, etc. I have some ability to choose whether I identify as my S1 or my S2, and I choose to identify as my S1.

A tulpa is a lot more than a hallucinated voice. People hearing voices in their heads is quite common.

Mm, you're right, they may be completely separate systems, though I'd expect there to be some coincidence.

I wouldn't necessarily say separate systems but a tulpa is something much more complex than a simple voice. If you get a decent trance state you can get a voice with a simple suggestion.

A tulpa takes a lot more work.

Note, all of the auditory hallucinations Jaynes reports are attributed to recurring characters like Zeus, personal spirits, Osiris, they're always more complex than a disembodied voice as well.

I don't think the average person in our times who reports that they hear the voice of Jesus has something as complex as a tulpa (the way it's described by the tulpa people).

But how can you use complex language to express your long term goals, then, like you're doing now? Do you get/trick S2 into doing it for you?

I mean, S2 can be used by S1; the clearest example would be someone addicted to heroin using S2 to invent reasons to take another dose. But it must be hard doing anything more long term, you'd be giving up too much control.

Or is the concept of long term goals itself also part of the alien thing you have to use as a tool? Your S2 must really be a good FAI :D

At some point, and maybe that point is now, the S1/S2 distinction becomes too vague to be helpful, and it doesn't help that people can use the terms in very different ways. Let me say something more specific: a common form of internal disagreement is between something like your urges and something like your explicit verbal goals. Many people have the habit of identifying with the part of them that has explicit verbal goals and not endorsing their urges. I don't; I identify with the part of myself that has urges, and am often suspicious of my explicit verbal goals (a lot of them are for signaling, probably).

I'm not sure that this is a good/accurate metaphor, but I do want to say it was great Insight Porn (I mean this in a good way). It gave a satisfying click of connection between some ideas that gave me a new perspective on it.

Still mulling it over. I'd be extremely wary of taking this as "evidence" for anything but hopefully can point in interesting directions.

(UPDATE: SEE THE POST (Brainstem, Neocortex) ≠ (Base Motivations, Honorable Motivations) FOR MUCH MORE ON THIS)

I strongly agree with the framing of neocortex vs "monkey brain" (I'll call it "subcortex" instead, it's not like monkeys don't have a neocortex), e.g. my post here or others.

The part I disagree with here is the framing that "willpower" is the neocortex doing its own thing, while akrasia is the subcortex overruling the neocortex and running the show.

I think all motivations come from the subcortex, both "noble" motivations and "base" motivations. The "noble" motivations are where the thing itself might or might not be rewarding, but where the thought "myself doing that thing" is rewarding. For example maybe doing my homework is not rewarding, but it's rewarding to think "I am doing my homework / I did my homework". Conversely, for "base" motivations to do a thing, the thing itself is presumably rewarding, but the meta-thought "myself doing that thing" has negative value. For example, eating candy is rewarding, but the thought "I am eating candy" is aversive.

Then there's a conflict in the neocortex, and sometimes the neocortex will keep attention on the meta-thought, and do the "noble" thing, and other times it won't, and it will do the "base" thing. And when it does manage to do the "noble" thing, we look back fondly on that memory as the right thing to have done, by definition, since the memory is of ourselves doing the thing.

Anyway, I would say that the monkey brain / subcortex is responsible for making the thought of "doing homework" aversive and the monkey brain / subcortex is responsible for making the meta-thought of "I am doing my homework" attractive.

Reading this I'm coming away with several distinct objections that I feel make the point that AI control is hard and give no practical short term tools.

The first objection is that it seems impossible to determine, from the perspective of system 1, whether system 2 is working in a friendly way or not. In particular, it seems like you are suggesting that a friendly AI system is likely to deceive us for our own benefit. However, this makes it more difficult to distinguish "friendly" and "unfriendly" AI systems! The core problem with friendliness I think is that we do not actually know our own values. In order to design "friendly" systems we need reliable signals of friendliness that are easier to understand and measure. If your point holds and is likely to be true of AI systems, then that takes away the tool of "honesty" which is somewhat easy to understand and verify.

The second objection is that in the evolutionary case, there is a necessary slowness to the iteration process. Changes in brain architecture must be very slow and changes in culture can be faster, but not so fast that they change many times a generation. This means there's a reasonable amount of time to test many use cases and to see success and failure even of low probability events before an adaptation is adopted. While technologies are adopted gradually in the AI case, ugly edge cases commonly occur and have to be fixed post-hoc, even when the systems they are edge cases of have been reliable for years or millions of test cases in advance. The entire problem of friendliness is to be able to identify these unknown unknowns, and the core mechanism solving that in the human case seems like slow iteration speed, which is probably not viable due to competitive pressure.

Third, system 1 had essentially no active role in shaping system 2. Humans did not reflectively sit down and decide to become intelligent. In particular, that means that many of the details of this solution aren't accessible to us. We don't have a textbook written down by monkeys talking about what makes human brains different and when humans and monkeys had good and bad times together. In fact our knowledge of the human brain is extremely limited, to the point where we don't even have better ways of talking about the distinctions made in this post than saying "system 1" and "system 2" and hand-waving over the fact that these aren't really distinct processes!

Overall the impression I get is of a list of reasons that even if things seem to be going poorly without a central plan, there is a possibility that this will work out in the end. I don't think that this is bad reasoning, or even particularly unlikely. However I also don't think it's highly certain nor do I think that it's surprising. The problem of AI safety is to have concrete tools that can increase our confidence to much higher levels about the behavior of designed systems before seeing those systems work in practice on data that may be significantly different than we have access to. I'm not sure how this metaphor helps with those goals and I don't find myself adjusting my prior beliefs at all (since I've always thought there was some significant chance that things would work out okay on their own--just not a high enough chance)

I'd suggest changing the title to "Solving the AI Alignment Problem Has Been Tried Before", as you don't conclude that the cortex is necessarily a successful solution.

Also the principle here is evolution, not the monkey brain. Evolution created the prefrontal cortex (along with all the interactions you described), the monkey brain did not.

The post seems to be suggesting something like "the problem was solved as well as could be expected." I agree with that and expect something similar to happen with AI.

"the problem was solved as well as could be expected

... by a casual exploration of the genetic landscape."

I expect something similar with AI. AIs created by humans and raised in human environments will have values roughly matching those environments.

One of the problems is that "roughly". Depending on the effectiveness acquired by said AI, "roughly equivalent values" could be a much worse outcome than "completely alien values".
FAI is difficult in part because values are very fragile.

I disagree that values are fragile in that way. One sign of this is that human beings themselves only have roughly equivalent values, and that doesn't make the world any more dreadful than it actually is.

The classical example is a value "I want to see all people happy", and the machine goes on grafting smiles on the faces of everyone.
On the other hand, I've explicitly speculated that an AI should acquire much more power than a usual human being has to become dangerous. Would you give the control of the entire nuclear arsenal to any single human being, since her values would be roughly equivalent to yours?

The smiley face thing is silly, and nothing like that has any danger of happening.

The concern about power is more reasonable. However, if a single human being had control of the entire nuclear arsenal (of the world), there would be less danger from nuclear weapons than in the actual situation. Given that one person controls them all, nuclear war is not possible, whereas in the actual situation it is possible and will happen sooner or later, given an annual probability which is not constantly diminishing (which it currently is not).
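
The "sooner or later" step can be sketched numerically. This is only an illustration under a loud simplifying assumption: the annual probability is constant and independent from year to year (a stand-in for "not constantly diminishing"), and the 1% figure is hypothetical, not a number anyone in this thread proposed. Under that assumption the chance of at least one occurrence within n years is 1 - (1 - p)^n, which approaches 1 as n grows.

```python
def prob_at_least_once(annual_p: float, years: int) -> float:
    """Chance that an event with a constant, independent annual
    probability `annual_p` happens at least once within `years` years."""
    if not 0.0 <= annual_p <= 1.0:
        raise ValueError("annual_p must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    # P(never happens in n years) = (1 - p)^n, so take the complement.
    return 1.0 - (1.0 - annual_p) ** years


if __name__ == "__main__":
    # Hypothetical 1% annual probability, chosen only for illustration.
    for horizon in (10, 100, 1000):
        print(horizon, prob_at_least_once(0.01, horizon))
```

If the annual probability instead shrank quickly enough every year, the cumulative probability could stay bounded below 1, which is why the "not constantly diminishing" condition carries the argument.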

Your real concern about that situation is not nuclear war, since that would not be possible. Your concern is that a single human might make himself the dictator of the world, and might behave in ways that others find unpleasant. That is quite possible, but it would not destroy the world, nor would it make life unbearable for the majority of human beings.

If we look far enough into the future, the world will indeed be controlled by things which are both smarter and more powerful than human beings. And they will certainly have values that differ, to some extent, from your personal values. So if you could look into a crystal ball and see that situation, I don't doubt that you would object to it. It does not change the fact that the world will not be destroyed in that situation, nor will life be unbearable for most people.

The smiley face thing is silly, and nothing like that has any danger of happening.

It's nice to see that you are so confident, but something being "silly" is not really a refutation.

something being "silly" is not really a refutation.

In this case, it is. Interpreting happiness in that way would be stupid, not in the sense of doing something bad, but in the sense of doing something extremely unintelligent. We are supposedly talking about something more intelligent than humans, not much, much less.

Ah, I see where the catch is here. You presuppose that 'intelligent' already contains 'human' as a subdomain, so that anything that is intelligent by definition can understand the subtext of any human interaction.
I think that the purpose of part of LW and part of the Sequences is to show that intelligence in this domain should be deconstructed as "optimization power", which carries a more neutral connotation.
The point of contention, as I see it and as the whole FAI problem presupposes, is that it's infinitely easier to create an agent with high optimization power and low 'intelligence' (as you understand the term), rather than high OP and high intelligence.

Eliezer's response to my argument would be that "the genie knows, but does not care." So he would disagree with you: it understands the subtext quite well. The problem with his answer, of course, is that it implies that the AI knows that happiness does not mean pasting smiley faces, but wants to paste smiley faces anyway. This will not happen, because values are learned progressively. They are not fixed at one arbitrary stage.

In a sufficiently broad sense of "in principle" you can separate optimization from intelligence. For example, a giant lookup table can optimize, but it is not intelligent. In a similar way, AIXI can optimize, but it is probably not intelligent. But note that neither a GLUT nor an AIXI is possible in the real world.

In the real world, optimization power cannot be separated from intelligence. The reason for this is that nothing will be able to optimize, without having general concepts with which to understand the world. These general concepts will necessarily be learned in a human context, given that we are talking about an AI programmed by humans. So their conceptual schema, and consequently their values, will roughly match ours.

I've been aware of the conflict between the old machinery and the logos since I was very young. I get a strong sense there's something halfway sapient back there that can sort of tell when parts of the logos' volition are aligned with the old machinery. Very often, I'll find I can override an instinct only once I've undergone a negotiation process and expressed/consolidated an understanding and respect of the instinct's evolutionary purpose.

In this theory of development, as the intelligence understands more and more of their intended purpose as a component of a living thing, they gain access to more introspective powers, more self-control.

It would be interesting to hear others' opinions on this.

I enjoyed this very much. One thing I really like is that your interpretation of the evolutionary origin of Type 2 processes and their relationship with Type 1 processes seems a lot more realistic to me than what I usually see. Usually the two are made to sound very adversarial, with Type 2 processes having some kind of executive control. I've always wondered how you could actually get this setup through incremental adaptations. It doesn't seem like Azathoth's signature. I wrote something relevant to this in correspondence:

If Type 2 just popped up in the process of human evolution, and magically got control over Type 1, what are the chances that it would amount to anything but a brain defect? You'd more likely be useless in the ancestral environment if a brand new mental hierarch had spontaneously mutated into existence and was in control of parts of a mind that had been adaptive on their own for so long. It makes way more sense to me to imagine that there was a mutant who could first do algorithmic cognition, and that there were certain cues that could trigger the use of this new system, and that provided the marginal advantage. Eventually, you could use that ability to make things safe enough to use the ability even more often. And then it would almost seem like it was the Type 2 that was in charge of the Type 1, but really Type 1 was just giving you more and more leeway as things got safer.

Yes, and also the neocortex could later assume control too, once it had been selected into fitness with the ecosystem.

This seems to be more about human development than AI alignment. The non-parallels between these two situations all seem very pertinent.

I like this because this helps better answer the anthropics problem of existential risk, namely that we should not expect to find ourselves in a universe that gets destroyed, and more specifically you should not find yourself personally living in a universe where the history of your experience is lost. I say this because this is evidence that we will likely avoid a failure in AI alignment that destroys us, or at least not find ourselves in a universe where AI destroys us all, because alignment will turn out to be practically easier than we expect it to be in theory. That alignment seems necessary for this still makes it a worthy pursuit since progress on the problem increases our measure, but it also fixes the problem of believing the low-probability event of finding yourself in a universe where you don't continue to exist.

And if something as stupid as evolution (almost) solved the alignment problem, it would suggest that it should be much easier for humans.

Evolution is smarter than you. The notion that this is a stupid process isn't justified.

Our intuition is misleading here once again. And it is not only evolution: some other processes outsmart us mortals as well.

Lenin was quite certain that his central planning would be far better than the chaotic merchant-buyer-peasant negotiations in a million marketplaces at once. He was wrong.

The computational power of biology as a whole is astounding. Eventually we may prevail, but never underestimate your opponent. Especially not the Red Queen herself!

Evolution is smarter than you.

Could you qualify that statement? If I was given a full time job to find the best way to increase some bacterium's fitness, I'm sure I could study the microbiology necessary and find at least some improvement well before evolution could. Yes, evolution created things that we don't yet understand, but then again, she had a planet's worth of processing power and 7 orders of magnitude more time to do it - and yet we can still see many obvious errors. Evolution has much more processing power than me, sure, but I wouldn't say she is smarter than me. There's nothing evolution created over all its history that humans weren't able to overpower in an eyeblink of time. Things like lack of foresight and inability to reuse knowledge or exchange it among species mean that most of this processing power is squandered.

Well, we have a race with evolution in the field of antibiotics, don't we? Evolution strikes back with antibiotic resistant bacteria. It is not that obvious that we will win. Probably we will, but that is not guaranteed.

It is a blind force, all right. It has no clear focus, all right. But still, it is an awesome power, not to be underestimated.

Could you qualify that statement?

Can you make an AGI given only primordial soup?

and more specifically you should not find yourself personally living in a universe where the history of your experience is lost. I say this because this is evidence that we will likely avoid a failure in AI alignment that destroys us, or at least not find ourselves in a universe where AI destroys us all, because alignment will turn out to be practically easier than we expect it to be in theory.

Can you elaborate on this idea? What do you mean by 'the history of your experience is lost'? Can you supply some links to read on this whole theory?

Welcome to the world of Memetic Supercivilization of Intelligence... living on top of the humanimal substrate.

It appears in maybe less than a percent of the population and produces all these ideas/science and subsequent inventions/technologies. This usually happens in a completely counter-evolutionary way, as the individuals in charge most of the time get very little profit (or even recognition) from it and would do much better (in evolutionary terms) to use their abilities a bit more "practically". Even the motivation is usually completely memetic: typically it goes along the lines of "it is interesting" to study something, think about this and that, research some phenomenon or mystery.

Worse, they give stuff more or less for free and without any control to the ignorant mass of humanimals (especially those in power), empowering them far beyond their means, in particular their abilities to control and use these powers "wisely"... since they are governed by their DeepAnimal brain core and resulting reward functions (that's why humanimal societies function the same way for thousands and thousands of years - politico-oligarchical predators living off the herd of mental herbivores, with the help of mindfcukers, from ancient shamans, through the stone age religions like the catholibanic one, to the currently popular socialist religion).

AI is not a problem, humanimals are.

Our sole purpose in the Grand Theatre of the Evolution of Intelligence is to create our (first nonbio) successor before we manage to self-destruct. Already nukes were too much, and once nanobots arrive, it's over (worse than DIY nuclear grenade for a dollar any teenager or terrorist can assemble in a garage).

Singularity should hurry up, there are maybe just a few decades left.

Do you really want to "align" AI with humanimal "values"? Especially if nobody knows what we are really talking about when using this magic word? Not to mention defining it.

Replies to some points in your comment:

One could say AI is efficient cross-domain optimization, or "something that, given a mental representation of an arbitrary goal in the universe, can accomplish it in the same timescale as humans or faster", but personally I think the "A" is not really necessary here, and we all know what intelligence is. It's the trait that evolved in Homo sapiens that let them take over the planet in an evolutionary eyeblink. We can't precisely define it, and the definitions I offered are only grasping at things that might be important.

If you think of intelligence as a trait of a process, you can imagine how many possible different things with utterly alien goals might get intelligence, and what they might use it for. Even the ones that would be a tiny bit interesting to us are just a small minority.

You may not care about satisfying human values, but I want my preferences to be satisfied and I have a meta-value that we should make our best effort to satisfy the preferences of any sapient being. If we look for the easiest thing to find that displays intelligence, the odds of that happening are next to none. It would eat us alive for a world of something that makes paperclips look beautiful in comparison.

And the prospect of an AI designed by the "Memetic Supercivilization" frankly terrifies me. A few minutes after an AI developer submits the last bugfix on GitHub, a script kiddie thinks "Hey, let's put a minus in front of the utility function right here and have it TORTURE PEOPLE LULZ" and thus the world ends. I think that is something best left to a small group of people. It seems unreasonable to trust that the emergent structure of society, which has undergone little Darwinian selection and has a spectacular history of failures over a pretty short timescale, would produce something good even for itself, let alone for humans, once handed such a dangerous technology.

An AI will have a utility function. What utility function do you propose to give it?

What values would we give an AI if not human ones? Giving it human values doesn't necessarily mean giving it the values of our current society. It will probably mean distilling our most core moral beliefs.

If you take issue with that, all you are saying is that you want an AI to have your values, rather than humanity's as a whole.

This is a good example of what I think will actually happen with AI.

When I was reading about Julian Jaynes' bicameral minds, I wondered whether the speaking social id might be the new tool of tools that needs to be hammered into alignment. There is an air gap there. The social constructs and borrowed goals the front-end freely infects itself with would be kept from contaminating whatever cognition the backend had taken up, and the backend would retain control over the frontend by shouting at it very loudly and forbidding it from using the critical thinking against it.

Though I don't think this bicameral safety architecture could be applied to AI alignment, heheh. I don't even think it should go much further than modern day computers. It would even seem anachronistic at the computers-with-direct-brain-interfaces level.

I don't think the rise of humanity has been very beneficial to monkeys, overall. Indeed the species we most directly evolved from, which at one point coexisted with modern humans, are now all extinct.

Also if a singular cortex tries to do something too out of alignment with the monkey brain, various other cortices that are aligned with their monkey brains will tend to put it in jail or a mental asylum.

You can see the monkey brain as the first aligned system, to the genetic needs. However if we ever upload ourselves the genetic system will have completely lost. So in this situation it takes 3+ layers of indirect alignment to completely lose.

(Self-promotion: I am trying to make a normal computer system with this kind of weak alignment built in.)

this seems a bit oversold, but basically represents what I think is actually possible