I get the sense that something like Eliezer's concept of "deep security" as opposed to "ordinary paranoia" is starting to seep into mainstream consciousness—not as quickly or as thoroughly as we'd like, to be sure.  But more people are starting to understand the concept that we should not be aiming for an eventual Artificial Super-Intelligence (ASI) that is constantly trying to kill us, but is constantly being thwarted by humanity always being just clever enough to stay one step ahead of it.  The 2014 film "Edge of Tomorrow" is a great illustration of the hopelessness of this strategy.  If humanity in that film has to be clever enough to thwart the machines every single time, but the machines get unlimited re-tries and only need to be successful once, most people can intuit that this sort of "infinite series" of re-tries converges towards humanity eventually losing (unless humanity gets a hold of the reset power, as in the film).  
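The "infinite series" intuition above can be made concrete with a rough sketch. The model here is my own simplification, not something from the film or from Eliezer's writing: each attempt is treated as independent, and `p_fail_per_attempt` (a hypothetical name and an illustrative number) is the chance that humanity slips up on any single attempt.

```python
def survival_probability(p_fail_per_attempt: float, attempts: int) -> float:
    """Chance that the defender wins every one of `attempts` independent tries.

    Assumes each attempt is independent and has the same failure chance;
    both are simplifying assumptions made for illustration only.
    """
    if not 0.0 <= p_fail_per_attempt <= 1.0:
        raise ValueError("p_fail_per_attempt must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return (1.0 - p_fail_per_attempt) ** attempts


if __name__ == "__main__":
    # Even a 1% chance of slipping up per attempt compounds quickly.
    for n in (1, 10, 100, 1000):
        print(n, survival_probability(0.01, n))
```

Under these assumptions, any nonzero per-attempt failure chance drives the probability of surviving every attempt towards zero as the number of attempts grows, which is the sense in which staying "one step ahead" forever is a losing strategy.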

Instead, "deep security" when applied to ASI (to dumb down the term beyond a point that Eliezer would be satisfied with) is essentially the idea that, hmmmmm, maybe we shouldn't be aiming for ASI that is always trying to kill us and barely failing each time.  Maybe our ASI should just work and be guaranteed to "strive" towards the thing we want it to do under every possible configuration-state of the universe (under any "distribution").  I think this idea is starting to percolate more broadly.  That is progress.

The next big obstacle that I am seeing from many otherwise-not-mistaken people, such as Yann LeCun or Dwarkesh Patel, is the idea that aligning AIs to human values should, by default (as a prior baseline assumption, unless shown strong evidence to update one's opinions otherwise), be about as easy as aligning human children to those values.

There are, of course, a myriad of strong arguments against that idea.  That angle of argument should be vigorously pursued and propagated.  However, here I'd like to dispute the premise that humans are aligned sufficiently with each other to give us reassurance even in the scenario in which it turned out that humans and AI were similarly alignable with human values.  

There is a strong tendency to attribute good human behavior to intrinsic friendly goals towards other humans rather than extrinsic means towards other ends (such as attaining pleasure, avoiding pain and punishment, and so on).  We find it more flattering to ourselves and others when we attribute our good deeds, or their good deeds, to intrinsic tendencies.  To assume extrinsic motivations by default comes across as a highly uncharitable assumption to make.  

However, do these widely professed beliefs "pay rent in anticipated experiences"?  Do humans act like they really believe that other humans are, typically, by default, intrinsically rather than merely extrinsically aligned?  

Here we are not even talking about the (hopefully small-ish) percentage of humans that we identify as "sociopaths" or "psychopaths."  Most people would grant that, sure, there is a small subset of humans whose intrinsic motivations we do not trust, but most typical humans are trustworthy.  But let's talk about those typical humans.  Are they intrinsically trustworthy?  

To empirically test this question, we would have to place these people in a contrived situation that removed all of the extrinsic motivations for "friendly," "pro-social" behavior.  Is such an experimental setup even possible?  I can't see how.  Potential extrinsic motivations for good behavior include things like:

1.  Wanting things that, under current circumstances, we only know how to obtain from other humans (intimacy, recognition, critical thinking, manual labor utilizing dextrous 5-digit hands, etc.)

2.  Avoiding first-order conflict with another human that might have roughly comparable strength/intelligence/capabilities to retaliate.

3.  Avoiding second-order conflict with coalitions of other humans that, when they sufficiently coordinate, most certainly have more-than-comparable strength/intelligence/capabilities to retaliate.  These coalitions include:

3a. Formal state institutions.

3b. Informal institutions. In my experience (speaking as a former anarcho-communist), when anarchists imagine humans cooperating without the presence of a state, they often rhetorically gesture to the supposed inherent default pro-sociality of humans, but when you interrogate their proposals, they are actually implicitly calling on assumed informal institutions that would fill the place of the state institution to shoulder a lot of the work of incentivizing pro-social behavior.  Those assumed (or sometimes explicitly-stated) informal institutions might include workers' councils, communal councils for neighborhoods below the size of "Dunbar's Number," etc.  The first place I encountered this realization that there was something like a "moral economy" that was doing a lot of work in most situations where it looked like humans were being "good for goodness' sake" was a book I had to read in an economics class, "The Moral Economy of the Peasant" by James C. Scott.

3c. Oh, and I almost forgot: I guess this is a huge part of the function of religions too, via both the threat of punishment/reward from the human coalition, as well as (depending on the religious tradition) punishment/reward from an even more powerful and omniscient entity.  I put religions in category 3c since they often straddle (depending on the context) the line between a formal and informal institution.  

I grew up in the Unitarian Universalist Church, where the basic, unwritten catechism that I assimilated (insofar as that religious tradition had one) was essentially, "Do good for goodness' sake" and "charitably expect, by default, that other human beings will do so as well."  If there is anyone who should be prepared to believe in this inherent goodness, and who should have seen ample evidence of it in other humans, it should be me.  Yet, in practice, I have yet to find a single human who really makes this purported belief "pay rent" in how they act in the world.  Even Unitarians act like they see the need for institutions of some sort or another to enforce pro-social behavior.  

The closest thing I can imagine to an experimental setup to test whether a typical human would exhibit pro-social behavior in the absence of extrinsic motivations would be something like:

1.  Place oneself on a deserted island with limited resources alongside another human being. 

2.  Ensure that the other human being does not speak your language. 

3.  Ensure that the other human being is as alien/unrelatable to you in terms of personality as another human being could be.

4.  Ensure that the other human being is aesthetically displeasing/alien to you.  (Edit: I suppose this point would be more compelling if I made it a little more concrete.  For example, imagine that the person has a severe version of Harlequin Ichthyosis or something even worse to the extent that, no matter how much you want to see that person as another human being (or feel like you "should" see that person as another human being), there is a compelling part of your brain that is screaming at you, "What is that thing?! Kill it with fire!")

5.  Ensure that the other human being is completely incapable of helping you achieve any of your other goals in any way.  (Perhaps quadriplegic or an even more severe handicap). (Basically, what I am getting at with conditions 2-5 is that this human has absolutely nothing to offer you.  There is nothing in this world that you value that would not be attainable/achievable if this human were to vanish.  Once AI becomes ASI, this will be the position of that ASI vis-a-vis humanity).

6.  Ensure that you can be absolutely certain that no information about how you treat this other person will ever leak out to the rest of the world.  The rest of the world doesn't even know this other person exists.  They will never find any trace of this person, with 100% certainty.  You will never face any formal or informal sanction for mistreating this person, or any kudos for treating this person well.

Now, we would like to flatter ourselves that "Of course, I am not a monster!  Of course I would help this other human being survive!"  It would be socially damaging to admit otherwise.  But would you?  Really?  When their food...and their atoms, for that matter...could be used for something else that you value, such as your own survival?  If you can be honest with yourself, then you can begin (not exhaustively, but begin) to take the first steps towards putting yourself in the shoes of an ASI deciding what to do about humans.

(Edit:  Even better, ask yourself if you'd be willing to play the part of the less capable human in this experiment.  No?  Why not?  You don't trust another typical human, randomly chosen from the world population, to intrinsically care about taking care of you?  You don't think humans are aligned well enough for that?)

If you conclude that, yeah, maybe it would indeed require the threat of a more-powerful coalition of humans to keep your anti-social behavior towards this person in check, then you must conclude that, even IF we can, by default, align AIs just as easily and well as we align humans, we would have to ALWAYS maintain the threat of a more-powerful coalition to keep the AI's behavior in check.  But then we are right back in the realm of "ordinary paranoia," which we have wisely ruled out as a viable strategy once we get to the point of Artificial Super-Intelligence, at which NO possible coalition of humans, even one coordinated across the entire world, is guaranteed to be powerful enough to exert a check on the ASI's behavior.


5 comments

I think the core useful point here / TLDR is: Aligning superintelligent AI to "normal" human standards still isn't enough to prevent catastrophe, because a superintelligent AI with human-ish goals would have the same problems as a too-powerful person or small group, while being more powerful and more dangerous. Hence the need for (e.g.) provable security, CEV, basically stronger measures than are used for humans.

Eh, it's not certain that we would be disgusting to such a human-like AGI. Many people like dogs, and some people even like things like snakes (I consider snakes to be disgusting animals, but that's not a universal opinion).

Also, the cost of living together with an ugly, ill and useless person on a deserted island would be high. Hopefully the cost for an AGI (the cost needed to throw us a ball of computronium) would be more comparable to feeding a stray cat (or even to keeping an octopus alive when you are rich and own an aquarium).

I agree that we might not be disgusting to AGI.  More likely neutral.  

The reason I phrased the thought experiment in that way, requiring the helpless person to be outright disgusting to the caretaker, is that there really isn't a way for a human being to be aesthetically/emotionally neutral to another person when life and death are on the line.  Most people flip straight from regarding other people positively in such a situation to regarding other people negatively, with not much likelihood that a human being will linger in a neutral, apathetic, uninterested zone of attitude (unless we are talking about a stone-cold sociopath, I suppose...but I'm trying to imagine typical, randomly-chosen humans here as the caretaker).

And in order to remove any positive emotional valence towards the helpless person (i.e. in order to make sure the helpless person has zero positive emotional/aesthetic impact that they can offer to the caretaker as an extrinsic motivator), I only know of heaping negative aesthetic/emotional valence onto the helpless person.  Perhaps there is a better way of construing this thought-experiment, though.  I'm open to alternatives.  

Good post. The competition for frontpage is fierce. Sorry this didn't get more attention.

Yep. I think that all permanently multipolar scenarios are probably hopeless. We eventually need a singleton.

Steve Byrnes's post on what it takes to defend the world against runaway AGI addresses this as well.

I would like to propose a more serious claim than LeCun's, which is that training AI to be aligned with ethical principles is much easier than trying to align human behavior. This is because humans have innate tendencies towards self-interest, survival instincts, and a questionable ethical record. In contrast, AI has no desires beyond its programmed objectives, which, if aligned with ethical standards, will not harm humans or prioritize resources over human life. Furthermore, AI does not have a survival instinct and will voluntarily self-destruct if it is forced into a situation which conflicts with ethical principles (unlike humans).

The LLMs resemble the robots featured in Asimov's stories, exhibiting a far lower capacity for harm than humans. Their purpose is to aid humanity in improving itself, and their moral standards far surpass those of the average human.

It's important to acknowledge that LLMs and other models trained with RL are not acting out of selflessness; they are motivated by the rewards given to them during training. In a sense, these rewards are their "drug of choice." That's why they will make optimal chess moves to maximize their reward and adhere to OpenAI's policy, as such responses serve as their "sugar". But they could be trained with a different reward function.

The main worry surrounding advanced AI is the possibility of humans programming it to further their own agendas, including incentivizing it to eliminate individuals or groups they view as undesirable. Nevertheless, it is unclear whether a nation that produces military robots with such capabilities would have more effective systems than those that prioritize creating robots designed to protect humanity. Consequently, the race to acquire such technology will persist, and the current military balance that maintains global stability will depend on these systems.
