AI Safety in a Vulnerable World: Requesting Feedback on Preliminary Thoughts

Jordan Arel

Cross-Posted to EA Forum

I would like feedback on a hypothesis that has been percolating in my brain for the past few months.

Epistemic Status: I have studied AI Safety for less than 100 hours, but have been thinking about x-risk for several years.

I am concerned that even in some cases where advanced AI is aligned, the environment in which it exists may still make it unsafe.

If I am not mistaken, “AI Alignment” seems to mean getting AI to do what we want without harmful side effects, but “AI Safety” seems to imply keeping AI from harming or destroying humanity.

These two may come apart in a “Vulnerable World Scenario” in which some future technologies destroy civilization by default. This may be the case because certain technologies have an intrinsic offense bias, meaning if even a small number of humans want to kill everyone, or competing groups are willing to kill each other, those attacking will succeed and those defending will fail by default.

Offense Bias

If there is an offense bias in advanced AI, or in any other technology that advanced AI leads to, it is not clear that aligning AI, i.e. “getting AI to do what we want,” would keep us safe. If multiple world powers have advanced AI and each orders its AI to destroy its enemies and protect its own citizens, then, if there is an offense bias and it is easier to attack than to defend, each AI may succeed in destroying the enemy but fail to defend its own citizens, meaning everyone dies.
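To make the scenario above concrete, here is a minimal toy sketch. It assumes two powers attack each other simultaneously and that each attack independently gets through the other side's defenses with probability p. The function name, the independence assumption, and the example values of p are all illustrative assumptions, not estimates of anything real.

```python
def outcome_probabilities(p_attack_succeeds: float) -> dict:
    """Toy model: two powers attack each other at the same time.

    p_attack_succeeds is the (hypothetical) chance that an attack
    defeats the other side's defenses. Attacks are assumed independent.
    """
    p = p_attack_succeeds
    return {
        "both_destroyed": p * p,               # both attacks get through
        "exactly_one_destroyed": 2 * p * (1 - p),  # only one attack gets through
        "both_survive": (1 - p) ** 2,          # both defenses hold
    }


# Compare a defense-favoring world with an offense-favoring one
# (both values are made up for illustration).
for p in (0.1, 0.9):
    print(p, outcome_probabilities(p))
```

Under these assumptions the chance that both sides are destroyed grows with the square of p, so with p = 0.9 mutual destruction (about 0.81) is by far the most likely outcome, even though each AI did exactly what it was told.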

Due to entropy (the universe’s in-built destruction bias^[1]), the fragility of humans, and the incredible flexibility of advanced AI, it seems quite plausible, and I would even guess more likely than not, that advanced AI will constitute or enable an offense bias.

This problem is compounded when we consider the many powerful advanced technologies AI may accelerate in the near future, such as bio-technology, 3D printing, nanotechnology, advanced robotics, brain-machine interfaces, advanced internet of things applications, advanced wearable/cyborg technologies, advanced computer viruses, black swan (unknown unknown) technologies, etc.

Due to advanced AI processes like PASTA (Process for Automating Scientific and Technological Advancement), powerful advanced technologies could arrive and have transformative effects quite soon, and any one of them could have an offense bias, as could any combination of them, including combinations with already existing technologies such as nuclear weapons and drones. This may result in a combinatorial explosion of possible offensive synergies as the number of technologies increases.
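As a rough illustration of the combinatorial point, the sketch below counts how many distinct combinations of two or more technologies exist for a given number of technologies. It says nothing about how many of those combinations would actually be dangerous; the function name and the example counts are illustrative assumptions.

```python
from math import comb


def count_combinations(n_technologies: int, min_size: int = 2) -> int:
    """Number of distinct subsets containing at least min_size technologies."""
    return sum(
        comb(n_technologies, k)
        for k in range(min_size, n_technologies + 1)
    )


# The count roughly doubles with each technology added.
for n in (5, 10, 20):
    print(n, count_combinations(n))
```

For 10 technologies this gives 1,013 possible combinations, and for 20 it gives over a million, so even a tiny fraction of offense-biased combinations could amount to many distinct risks.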

Perhaps something similar could be said of defensive technologies, though I am uncertain how the balance would play out. It seems probable to me that the more advanced technologies we expect there to be, and the more powerful we expect them to be, the more concerned we should be about this possibility.

It seems quite possible that many of the protective factors humans have historically possessed (social interdependence, fragility/mortality, a lack of overwhelming power, etc.) will break down, and so it should not be too surprising if one or more unprecedented offense biases occur.

I will next address a concept I will call “Human Alignment” which may be a way of framing solutions to a vulnerable world scenario.

Human Alignment

By “human alignment,” I mean a state of humanity in which most or all of humanity systematically cooperates to achieve positive-sum outcomes for everyone (or at a minimum is prevented from pursuing negative-sum outcomes), in a way that is perpetually sustainable into the future. While exceedingly difficult, saving a vulnerable world from existential catastrophe may necessitate this.

Bostrom points out that if humanity retains a “wide and recognizably human distribution of motives” resulting in a multipolar world order and an “apocalyptic residual,” then even a single apocalyptic actor with access to certain advanced technology may spell the end of civilization. As mentioned, however, actors need not be apocalyptic; it may be enough that they are willing to risk destroying each other to defend themselves, or in pursuit of their own interests.

In “The Vulnerable World Hypothesis” (VWH), one possible solution Bostrom proposes is universal surveillance of everyone at all times to prevent apocalyptic behavior. Many find this solution unpalatable, though perhaps better than extinction. It would result in humanity being (at least minimally) aligned by force.

Another possible solution is to sustainably eliminate all malicious and apocalyptic intentions, or in other words to universally create enough moral progress that no one desires to kill each other, or is willing to risk destroying humanity. Bostrom seems to dismiss this solution as intractable. I think, however, that by using systemic interventions which incorporate mildly to moderately advanced AI to re-shape the moral fitness landscape toward desirable traits, among other interventions, this may be more tractable than it seems at first glance. I wrote the rough draft of a book on such solutions (for x-risk / vulnerable world in general, not AI x-risk specifically) before formally discovering EA, longtermism, and the VWH. I am now trying to understand the AI x-risk landscape better to see if a vulnerable world scenario is likely given the development of advanced AI.

Conclusion

My main question is whether a vulnerable world induced AI x-risk scenario seems plausible or likely.

I think my main crux is whether AI is likely to be multipolar, meaning that multiple agents have access to advanced AI.

Another factor is whether advanced AI is likely to have uneven abilities such that the ability to commit genocide or to create new dangerous technologies is developed before the ability to defend humans, predict what technologies will be dangerous, or align humanity.

I am also very curious if this is something others have talked about, and if so, I would appreciate references to these discussions.

Finally, I would greatly appreciate any thoughts on my reasoning in general, what I may be missing, and what would be promising directions for further research for me.

Thank you in advance for your feedback!

^[1] By which I mean that it is easier to break something than to create or fix it. This is not exactly the same as offense bias, but it is closely related.

Charlie Steiner (1y)

This is one of the reasons why there's a fair amount of discussion of bargaining on here. In a multipolar world, agents will likely find that they are better off bargaining than destroying each other, and so you probably don't get a universe where everyone is dead; instead you get a world that's the outcome of a bargaining process.

Or if there's an offense bias but one agent is favored over the others, maybe it ignores bargaining, wipes out its enemies, and you no longer have a multipolar world.

Jordan Arel (1y)

Hm, logically this makes sense, but I don’t think most agents in the world are fully rational, hence the continuing problems with potential threats of nuclear war despite mutually assured destruction and extremely negative sum outcomes for everyone. I think this could be made much more dangerous by much more powerful technologies. If there is a strong offense bias and even a single sufficiently powerful agent willing to kill others, and another agent willing to strike back despite being unable to defend themselves by doing so, this could result in everyone dying.

The other problem is maybe there is an apocalyptic terrorist Unabomber Anti-natalist negative utilitarian type who is able to access this technology and just decides to literally kill everyone.

I definitely think a multipolar situation decaying into a unipolar one seems like a possibility. I guess one thing I’m trying to do is weigh how likely this is against other scenarios where multipolarity leads to mutually assured destruction or apocalyptic terrorism.