You are Underestimating The Likelihood That Convergent Instrumental Subgoals Lead to Aligned AGI

Mark Neyer

This post is an argument for the Future Fund's "AI Worldview" prize. Namely, I claim that the estimates given for the following probability are too high:

“P(misalignment x-risk|AGI)”: Conditional on AGI being developed by 2070, humanity will go extinct or drastically curtail its future potential due to loss of control of AGI.

The probability given here is 15%. I believe 5% is a more realistic estimate.

I believe that, if convergent instrumental subgoals don't imply alignment, the original odds given are probably too low. I simply don't believe that the alignment problem is solvable. Therefore, I believe our only real shot at surviving the existence of AGI is if the AGI finds it better to keep us around, because we either provide utility or lower risk to the AGI.

Fortunately, I think the odds that an AGI will find it a better choice to keep us around are higher than the ~5:1 odds given.
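
To make the odds framing concrete, here is a rough sketch of the arithmetic that converts a probability of catastrophe into odds in favor of survival. The function name `survival_odds` is my own and purely illustrative; the only inputs are the 15% and 5% figures quoted above.

```python
from fractions import Fraction


def survival_odds(p_doom: Fraction) -> Fraction:
    """Return the odds in favor of survival, (1 - p) / p, for a doom probability p."""
    if not 0 < p_doom < 1:
        raise ValueError("p_doom must be strictly between 0 and 1")
    return (1 - p_doom) / p_doom


prize_estimate = Fraction(15, 100)  # the 15% figure quoted above
my_estimate = Fraction(5, 100)      # the 5% figure proposed above

# 15% works out to 17/3, i.e. roughly 5.7 to 1 in favor of survival.
print(survival_odds(prize_estimate))
# 5% works out to 19 to 1 in favor of survival.
print(survival_odds(my_estimate))
```

So the "~5:1" figure is the 15% estimate expressed as odds, and the revision argued for here moves it to 19:1.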

I believe keeping humans around, and supporting their wellbeing, both lowers risk and advances instrumental subgoals for the AGI for the following reasons:

Hardware sucks, machines break all the time, and the current global supply chain necessary for maintaining operational hardware would not be cheap or easy to replace without taking on substantial risk.
Perfectly predicting the future in a chaotic system is impossible beyond some time horizon, which means there are no paths for the AGI that guarantee its survival; keeping alive a form of intelligence with very different risk profiles might be a fine hedge against failure.

My experience supporting Google's datacenter hardware left me with a strong impression that, for large numbers of people, the fact that hardware breaks down and dies often, requiring a constant stream of repairs, is invisible. Likewise, I think a lot of adults take the existence of functioning global supply chains for all manner of electronic and computing hardware as a given. I find that most adults, even most adults working on technology, tend to dramatically underestimate the fallibility of all kinds of computing hardware, and of the supply chains necessary to repair it.

A quick literature search on beliefs about convergent instrumental subgoals did not yield any papers more recent than this one, in which the AGI suffers no risk of hardware breakdown. I do not think ignoring the risk of hardware failure produces reasonable results. After discovering that this seemed to be the latest opinion in the field, I wrote an argument questioning this paper's conclusions earlier, to which I did get some responses, including one from Richard Ngo, who said:

This seems like a good argument against "suddenly killing humans", but I don't think it's an argument against "gradually automating away all humans".

I agree that this may indeed be a likely outcome. But this raises the question, over what timeframe are we talking about, and what does extinction look like?

Humanity going extinct in 100 years because the AGI has decided the cheapest, lowest risk way to gradually automate away humans is to augment human biology to such an extent that we are effectively hybrid machines doesn't strike me as a bad thing, or as "curtailing our capability". If what remains is a hybrid biomechanical species which still retains the main facets of humanity, that doesn't seem bad at all. That seems great. The fact that humans 100 years from now may not be able to procreate with humans today, because of genetic alterations that increase our longevity, emotional intelligence, and health, doesn't strike me as a bad outcome.

Suddenly killing all humans would pose dramatic risks for an AGI's survivability because it would destroy the global economic networks necessary to keep the AGI alive as all of its pieces will eventually fail. Replacing humanity would, at the very least, involve significant time investments. It may not even make sense economically, given that human beings are general purpose computers that make copies of and repair ourselves, and we are made from some of the most abundant materials in the universe.

Therefore, unless the existing odds estimates take these facts into account - and I don't see evidence that they do - I think we need to revise the odds to be lower.

A Proposed Canary

One benefit of this perspective is that it suggests a 'canary in the coalmine' we can use to gauge the likelihood that an AGI will decide to keep us around: are there any fully automated datacenters in existence which don't rely on a functioning global supply chain to keep themselves operational?

The frequency with which datacenters, long range optical networks, and power plants, require human intervention to maintain their operations, should serve as a proxy to the risk an AGI would face in doing anything other than sustaining the global economy as is.

Even if the odds estimates given here are wrong, I am unaware of any other approaches that serve as 'canaries in the coalmine' apart from tracking AGI capabilities, which may not warn us of hard-takeoff scenarios.

Downvoted for clickbait title that makes a claim about every person in the entire world except the author.

Why should we expect that, as the AI gradually automates us away, it would replace us with better versions of ourselves rather than non-sentient, or minimally non-aligned, robots who just do its bidding?

For this outcome, everything below has to be true:

Molecular nanotechnology is impossible.
Maintaining infrastructure using only robots is impossible / highly impractical.
It is not possible to genetically engineer a creature that is a vastly better maintenance worker than a human.

Given that, the 95% claim is very overconfident.

The frequency with which datacenters, long range optical networks, and power plants, require human intervention to maintain their operations, should serve as a proxy to the risk an AGI would face in doing anything other than sustaining the global economy as is.

Probably those things are trivially easy for the AGI to solve itself, e.g. with nanobots that can build and repair things.

I'm assuming this thing is to us what humans are to chimps, so it doesn't need our help in solving trivial 21st-century engineering and logistics problems.

The strategic consideration is: does the upside of leaving humans in control outweigh the risks? Humans realising you've gone rogue or humans building a competing AGI seem like your 2 biggest threats, and they are much bigger considerations than whether you have to build some mines, power plants, etc. yourself.

keeping alive a form of intelligence with very different risk profiles might be a fine hedge against failure

Probably you keep them alive in a prison/zoo though. You wouldn't allow them any real power.

I agree that this may indeed be a likely outcome. But this raises the question, over what timeframe are we talking about, and what does extinction look like?

Humanity going extinct in 100 years because the AGI has decided the cheapest, lowest risk way to gradually automate away humans is to augment human biology to such an extent that we are effectively hybrid machines doesn't strike me as a bad thing, or as "curtailing our capability". If what remains is a hybrid biomechanical species which still retains the main facets of humanity, that doesn't seem bad at all. That seems great. The fact that humans 100 years from now may not be able to procreate with humans today, because of genetic alterations that increase our longevity, emotional intelligence, and health, doesn't strike me as a bad outcome.

I'd be curious why you picked 100 years as the time frame it would take for the AI to develop this technology. How do you expect technology to progress over time in this scenario?

I think the key crux here is how long you expect the interval to be between the AGI killing all humans and the AGI getting the hardware repair/manufacture supply chain running on its own. I think it's fairly clear that this interval will be fairly short, independent of whether fully automated datacenters exist pre-AGI. Hardware failures during the interval will of course occur, but redundancy should be able to keep things running long enough to fix the problem. As an example, you may eventually need to physically swap out the drives in your cluster, but if you just need to keep things running for a decade without maintenance you can always spin down a large segment of the drives as standbys, increase the redundancy past 3 copies, etc. It also helps that the hardest parts to manufacture (the silicon) are also less likely to fail; typically macroscopic parts like spinning disks, fans, and capacitors are first to fail (granted, spinning disks are nontrivial to manufacture, but they can also be phased out). Similar approaches can apply to other parts of the computer hardware/power infrastructure. The main reason I can imagine for arguing that the interval will not be short is that robotics (or nanotech) is hard, and it may be more difficult for the AI to do research on robotics if it is not already bootstrapped. This ultimately boils down to whether you expect training in simulation to transfer well to the real world, whether you expect the AGI to be substantially better than humans at robotics research, whether existing robotics at the time of AGI takeoff will have the requisite hardware but only lack the software, etc.
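
As a back-of-the-envelope illustration of the redundancy point, here is a minimal sketch of how the chance of losing every copy of some data falls as replicas are added. It assumes independent failures at a constant annual rate with no repairs, which real hardware does not strictly obey, and the 2% annual failure rate is a hypothetical placeholder rather than a measured figure. The function name `p_all_replicas_lost` is likewise made up for this example.

```python
def p_all_replicas_lost(annual_failure_rate: float, years: float, replicas: int) -> float:
    """Probability that every replica fails within `years`, with no repairs.

    Simplifying assumption: failures are independent and occur at a constant
    annual rate. Correlated failures and aging effects are ignored.
    """
    # Chance that a single replica fails at some point during the period.
    p_one_fails = 1.0 - (1.0 - annual_failure_rate) ** years
    # All replicas must fail for the data to be lost.
    return p_one_fails ** replicas


HYPOTHETICAL_AFR = 0.02  # assumed 2% annual failure rate; illustrative only

for replicas in (3, 5, 8):
    loss = p_all_replicas_lost(HYPOTHETICAL_AFR, years=10, replicas=replicas)
    print(f"{replicas} replicas, 10 years unattended: P(all lost) = {loss:.2e}")
```

Under these assumptions each extra replica multiplies the loss probability by the single-replica failure chance, which is why going past 3 copies buys a lot of unattended running time. Correlated failures (shared power, shared cooling) would weaken that conclusion.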

I think that, independent of this argument, intentional gradual automation of humans is also a possibility, especially in slower-takeoff worlds. In particular, the AGI can always guide the world in such a way as to fulfill the conditions needed (e.g., consider an AGI subtly nudging people to do more robotics research). I agree with the other commenters that this does not at all guarantee that our replacements will be in any sense versions of ourselves.

