Book review: Artificial Intelligence Safety and Security, by Roman V. Yampolskiy.

This is a collection of papers, with highly varying topics, quality, and importance.

Many of the papers focus on risks that are specific to superintelligence, some assuming that a single AI will take over the world, and some assuming that there will be many AIs of roughly equal power. Others focus on problems that are associated with current AI programs.

I've tried to arrange my comments on individual papers in roughly descending order of how important the papers look for addressing the largest AI-related risks, while also sometimes putting similar topics in one group. The result feels a little more organized than the book, but I worry that the papers are too dissimilar to be usefully grouped. I've ignored some of the less important papers.

The book's attempt at organizing the papers consists of dividing them into "Concerns of Luminaries" and "Responses of Scholars". Alas, I see few signs that many of the authors are even aware of what the other authors have written, much less that the later papers are attempts at responding to the earlier papers. It looks like the papers are mainly arranged in order of when they were written. There's a modest cluster of authors who agree enough with Bostrom to constitute a single scientific paradigm, but half the papers demonstrate about as much of a consensus on what topic they're discussing as I would expect to get from asking medieval peasants about airplane safety.

Drexler's paper is likely the most important paper here, but it's frustratingly cryptic.

He hints at how we might build powerful AI systems out of components that are fairly specialized.

A key criterion is whether an AI has "strong agency". When I first read this paper a couple of years ago, I found that confusing. Since then I've read some more of Drexler's writings (not yet published) which are much clearer about this. Drexler, hurry up and publish those writings!

Many discussions of AI risks assume that any powerful AI will have goals that extend over large amounts of time and space, whereas it's fairly natural for today's AI systems to have goals which are defined only over the immediate output of the system. E.g. Google Translate only cares about the next sentence that it's going to create, and normal engineering practices aren't pushing toward having it look much further into the future. That seems to be a key difference between a system that isn't much of an agent, versus one with strong enough agency to have the instrumentally convergent goals that Omohundro (below) describes.

Drexler outlines intuitions which suggest we could build a superintelligent system without strong agency. I expect that a number of AI safety researchers will deny that such a system will be sufficiently powerful. But it seems quite valuable to try, even if it only has a 50% chance of producing human-level AI, because it has little risk.

This distinction between types of agency serves as an important threshold to distinguish fairly safe AIs from AIs that want to remake the universe into something we may or may not want. And it looks relatively easy to persuade AI researchers to stay on the safe side of that threshold, as long as they see signs that such an approach will be competitive.

By "relatively easy", I mean something like "requires slightly less than heroic effort", maybe a bit harder than it has been to avoid nuclear war or contain Ebola. There are plenty of ways to make money while staying on the safe side of that threshold. But there are some applications (Alexa? Siri? NPCs?) where there are moderate incentives to have the system learn about the user in ways that would blur the threshold.

Note that Drexler isn't describing a permanent solution to the main AI safety risks; he's only describing a strategy that would allow people to use superintelligence in developing a more permanent solution.

Omohundro's The Basic AI Drives paper has become a classic paper in AI safety, explaining why a broad category of utility functions will generate strategies such as self-preservation, resource acquisition, etc.

Rereading this after reading Drexler's AI safety writings, I now see signs that Omohundro has anthropomorphised intelligence a bit, and has implicitly assumed that all powerful AIs will have broader utility functions than Drexler considers wise.

Also, Paul Christiano disagrees with one of Omohundro's assumptions in this discussion of corrigibility.

Still, it seems nearly certain that someday we will get AIs for which Omohundro's warnings are important.

Focus on a singleton AI's value alignment

Bostrom and Yudkowsky give a succinct summary of the ideas in Superintelligence (which I reviewed here), explaining why we should worry about ethical problems associated with a powerful AI.

Soares discusses why it looks infeasible to just encode human values in an AI (e.g. Sorcerer's Apprentice problems), and gives some hints about how to get around that by indirectly specifying how the AI can learn human values. That includes extrapolating what we would want if we had the AI's knowledge.

He describes an ontology crisis: a goal as simple as "make diamond" runs into problems if "diamond" is described in terms of carbon atoms, but the AI switches to using a nuclear model that sees protons/neutrons/electrons - how do we know whether it will identify those particles as carbon?

Tegmark provides a slightly different way of describing problems with encoding goals into AIs: What happens if an AI is programmed to maximize the number of human souls that go to heaven, and ends up deciding that souls don't exist? This specific scenario seems unlikely, but human values seem complex enough that any attempt to encode them into an AI risks similar results.

Olle Häggström tries to analyze how we could use a malicious Oracle AI while keeping it from escaping its box.

He starts with highly pessimistic assumptions (e.g. implicitly assuming Drexler's approach doesn't work, and worrying that the AI might decide that hedonic utilitarianism is the objectively correct morality, and that maximizing hedons per kilogram of brain produces something that isn't human).

Something seems unrealistic here. Häggström focuses too much on whether the AI can conceal a dangerous message in its answers.

There are plenty of ways to minimize the risk of humans being persuaded by such messages. Häggström shows little interest in them.

It's better to have a security mindset than not, but focusing too much on mathematically provable security can cause researchers to lose sight of whether they're addressing the most important questions.

Herd et al. talk about how value drift and wireheading problems are affected by different ways of specifying values.

They raise some vaguely plausible concerns about trade-offs between efficiency and reliability.

I'm not too concerned about value drift - if we get the AI(s) to initially handle this approximately right (with maybe some risks due to ontological crises), the AI will use its increasing wisdom to ensure that subsequent changes are done more safely (for reasons that resemble Paul Christiano's intuitions about the robustness of corrigibility).

Concerns about paths to AI

Bostrom's Strategic Implications of Openness in AI Development thoughtfully describes what considerations should influence the disclosure of AI research progress.

Sotala describes a variety of scenarios under which an AI or collections of AIs could cause catastrophe. It's somewhat useful at explaining why AI safety is likely to be hard. It's likely to persuade people who are currently uncertain to be more uncertain, but unlikely to address the beliefs that lead some people to be confident about safety.

Turchin and Denkenberger discuss the dangers of arms races between AI developers, and between AIs. The basic ideas behind this paper are good reminders of some things that could go wrong.

I was somewhat put off by the sloppy writing (e.g. "it is typically assumed that the first superintelligent AI will be infinitely stronger than any of its rivals", followed by a citation to Eliezer Yudkowsky, who has expressed doubts about using infinity to describe real-world phenomena).

Chessen worries about the risks of AI-driven disinformation, which might destabilize democracies.

Coincidentally, SlateStarCodex published Sort by Controversial (a more eloquent version of this) around the time when I read this paper.

This seems much less than an extinction risk by itself, but it might make some important governments short-sighted at key times when the more permanent risks need to be resolved.

His policy advice seems uninspired, e.g. suggesting privacy laws that sound like the GDPR.

And "Americans must choose to pay for news again." This seems quite wrong to me. I presume Chessen means we should return to paying for news via subscriptions instead of via ads.

But the tv news of the 1960s was financed by ads, and was about as responsible as anything we might hope to return to. My view is that increased click-bait is due mainly to having more choices of news organizations. Back before cable tv, our daily news choices were typically one or two local newspapers, and two to five tv channels. Those gravitated toward a single point of view.

Cable tv enabled modest ideological polarization, and a modest increase in channel surfing, which caused a modest increase in sensationalism. The internet enabled massive competition, triggering a big increase in sensationalism.

Note that the changes from broadcast tv to cable tv to internet multimedia involved switching from ad-based to subscription to ad-based models, with a steady trend away from a focus on fact-checking (although that fact-checking may have been mostly checking whether the facts fit the views of some elite?).

Wikipedia is an example which shows that not paying for news can generate more responsible news - at the cost of entertainment.

There's still plenty of room for responsible people to create new news institutions (e.g. bloggers such as Nate Silver), and it doesn't seem particularly hard for us to distinguish them from disinformation sources.

The main problem seems to be that people remain addicted to sources as they become click-baity, and continue to treat them as news sources even after noticing alternatives such as Nate Silver.

I expect the only effective solution will involve most of us agreeing to assign low status to people who treat click-baity sources as anything more than entertainment.

Miscellaneous other approaches

Torres says the world needs to be ruled by a Friendly AI. (See a shorter version of the paper here.)

His reasoning is loosely based on Moore's Law of Mad Science: Every eighteen months, the minimum IQ necessary to destroy the world drops by one point. But while Eliezer intended that to mostly focus on the risk of the wrong AI taking over the world, Torres extends that to a broad set of weapons that could enable one person to cause human extinction (e.g. bioweapons).

He presents evidence that some people want to destroy the world. I suspect that some of the people who worry him are thinking too locally to be a global danger, but there's likely enough variation that we should have some concern that there are people who seriously want to kill all humans.

He asks how low we need to get the risk of any one such malicious person getting such a weapon in order to avoid human extinction. But his answer seems misleading - he calculates as if the risks were independent for each person. That appears to drastically overstate the risks.
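To illustrate why the independence assumption matters, here is a minimal sketch with made-up numbers. The population size, the per-person probability, and the shared "such a weapon is feasible at all" condition are all hypothetical choices of mine, not figures from Torres's paper. Both models give each person the same marginal chance of success; they differ only in whether those chances are independent or depend on a common cause.

```python
def p_any_independent(n_people, p_each):
    """Chance that at least one of n_people succeeds, treating each as independent."""
    return 1.0 - (1.0 - p_each) ** n_people


def p_any_common_cause(n_people, p_each, p_feasible):
    """Same per-person marginal chance p_each, but nobody can succeed unless a
    shared condition holds (hypothetical: the weapon turns out to be feasible)."""
    # Conditional chance per person, chosen so that the marginal stays p_each.
    p_given_feasible = min(1.0, p_each / p_feasible)
    return p_feasible * (1.0 - (1.0 - p_given_feasible) ** n_people)


# Hypothetical inputs, for illustration only.
N_PEOPLE = 1000
P_EACH = 0.001
P_FEASIBLE = 0.01

independent = p_any_independent(N_PEOPLE, P_EACH)
correlated = p_any_common_cause(N_PEOPLE, P_EACH, P_FEASIBLE)
print(f"independent: {independent:.3f}, common cause: {correlated:.3f}")
```

Under these assumed inputs the independent model puts the chance of at least one success well above one half, while the common-cause model can never exceed the chance that the shared condition holds. Real risks could be correlated in other ways, so this only shows the direction of the effect.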

Oh, and he also wants to stop space colonization. It creates risks of large-scale war. But even if that argument convinced me that space colonization was bad, I'd still expect it to happen. Mostly, he doesn't seem to be trying very hard to find good ways to minimize interstellar war.

If we're going to colonize space fairly soon, then his argument is weakened a good deal, and it would then imply that there's a short window of danger, after which it would take more unusual weapons to cause human extinction.

What's this got to do with AI? Oh, right. A god-like AI will solve extinction risks via means that we probably can't yet distinguish from magic (probably involving mass surveillance).

Note that a singleton can create extinction risks. Torres imagines a singleton that would be sufficiently wise and stable that it would be much safer than the risks that worry him, but we should doubt how well a singleton would match his stereotype.

Torres is correct to point out that we live in an unsafe century, but he seems wrong about important details, and the AI-relevant parts of this paper are better explained by the Bostrom/Yudkowsky paper.

Bostrom has recently published a better version of this idea.

Miller wants to build addiction into an AI's utility function. That might help, but it looks to me like that would only be important given some fairly bizarre assumptions about what we can and can't influence about the utility function.

Bekdash proposes adopting the kind of rules that have enabled humans to use checks and balances to keep us safe from other humans.

The most important rules would limit AI's span of control - AI must have limited influence on the world, and must be programmed to die.

Bekdash proposes that all AIs (and ems) go to an artificial heaven after they die. Sigh. That looks relevant only for implausibly anthropomorphised AI.

Bekdash wants to prevent AIs from using novel forms of communication ("it is easier to monitor and scrutinize AI communication than that of humans.") - that seems to be clear evidence that Bekdash has no experience at scrutinizing communications between ordinary computer programs.

He also wants to require diversity among AIs.

Bekdash proposes that global law ensure obedience to those rules. Either Bekdash is carefully downplaying the difficulties of enforcing such laws, or (more likely) he's depending on any illegal AIs being weak enough that they can be stopped after they've had a good deal of time to enhance themselves. In either case, his optimism is unsettling.

Why do these papers belong in this book?

Prasad talks about how to aggregate opinions, given the constraints that opinions are expressed only through a voting procedure, and that Pareto dominant alternatives are rare.

I expect a superintelligent AI to aggregate opinions via evidence that's more powerful than voting for political candidates or complex legislation (e.g. something close to estimating how much utility each person gets from each option).

I also expect a superintelligent AI to arrange something close to Pareto dominant deals often enough that it will be normal for 95+% of people to consent to decisions, and pretty rare for us to need to fall back on voting. And even if we do occasionally need voting, I'm optimistic that a superintelligence can usually come up with the kind of binary choice where Arrow's impossibility theorem doesn't apply.
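The point about binary choices can be made concrete with a small sketch. Arrow's theorem only bites when there are three or more alternatives; with three, majority preferences can cycle, while a two-way choice with an odd number of voters always has a majority winner. The ballots and function names below are hypothetical illustrations, not anything taken from Prasad's paper.

```python
def pairwise_winner(ballots, a, b):
    """Return whichever of a, b a majority ranks higher, or None on a tie."""
    a_votes = sum(1 for ranking in ballots if ranking.index(a) < ranking.index(b))
    b_votes = len(ballots) - a_votes
    if a_votes == b_votes:
        return None
    return a if a_votes > b_votes else b


def condorcet_winner(ballots, options):
    """Return the option that beats every other option head-to-head, if any."""
    for candidate in options:
        rivals = [other for other in options if other != candidate]
        if all(pairwise_winner(ballots, candidate, rival) == candidate for rival in rivals):
            return candidate
    return None


# Three voters, three options: A beats B, B beats C, and C beats A.
three_way = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
# The same voters offered only A versus B.
two_way = [["A", "B"], ["B", "A"], ["A", "B"]]

cycle_result = condorcet_winner(three_way, ["A", "B", "C"])
binary_result = condorcet_winner(two_way, ["A", "B"])
print(cycle_result, binary_result)
```

The three-way ballots produce no head-to-head winner, whereas the two-way vote yields a clear majority for A.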

So my impression is that Prasad doesn't have a clear reason for applying voting theories to superintelligence. He is at the very least assuming implausibly little change in how politics works. Maybe there will be some situations where a superintelligence needs to resort to something equivalent to our current democracy, but he doesn't convince me that he knows that. So this paper seems out of place in an AI safety book.

Portugal et al. note that the leading robot operating system isn't designed to prevent unauthorized access to robots. They talk about how to add an ordinary amount of security to it. They're more concerned with minimizing the performance cost of the security than they are with how secure the result is. So I'm guessing they're only trying to handle fairly routine risks, not the risks associated with human-level AI.


3 comments

Thanks for this post! It was especially valuable to see the link to Eliezer's comments in "I expect that a number of AI safety researchers will deny that such a system will be sufficiently powerful." It explains some aspects of Eliezer's worldview that had previously confused me. Personally, I am at the opposite end of the spectrum relative to Eliezer--my intuition is that consequentialist planning and accurate world-modeling are fundamentally different tasks which are likely to stay that way. I'd argue that the history of statistics & machine learning is the history of gradual improvements to accurate world-modeling which basically haven't shown any tendencies towards greater consequentialism. My default expectation is this trend will continue. The idea that you can't have one without the other seems anthropomorphic to me.

Links to the papers would be useful.

Only some of them are online; the previous review had their full names for ease of Googling and links to some.