I just read Roman Yampolskiy's new 500-page multi-author anthology, Artificial Intelligence Safety and Security, to get a comprehensive view of the field. Here is my attempt at reviewing the book while explaining the main points of its 28 independent chapters.

General Comments about the Book

The first part, Concerns of Luminaries, is made of 11 of the most influential articles related to future AI progress, presented in chronological order. Reading this part felt like navigating through history, chatting with Bill Joy and Ray Kurzweil.

The second part, Responses of Scholars, comprises 17 chapters written by academics specifically for this book. Each chapter addresses AI Safety and Security from a different perspective, from Robotics/Computer security to Value Alignment/Ethics.

Part I - Concerns of Luminaries

Why the Future Doesn't Need Us (Bill Joy, Wired Magazine, 2000)

The book starts with the influential essay written by Bill Joy at the beginning of the century. The author, co-founder of Sun Microsystems, author of vi and core contributor of BSD Unix, recounts a decisive talk he had with Ray Kurzweil in 1998 that changed his perspective on the future of the technology.

In particular, Joy compares Genetics, Nanotechnology and Robotics (GNR) in the 21st century with Weapons of Mass Destruction (WMD) in the last century, being mainly concerned with Nanotechnology and Robotics (where robotics encompasses both actual robotics and AI).

The author's knowledge about nanotechnology comes from Feynman's talk There's Plenty of Room at the Bottom and Drexler's Engines of Creation (1986). Joy believed nanotechnology would not work, until he discovered that nanoscale molecular electronics was becoming practical. This raised his awareness of Knowledge-Based Mass Destruction amplified by self-replication in nanotechnologies.

My comments: this article was not much about AI, but gave a general overview of the concerns with future technology. I really enjoyed this essay on GNR by someone who contributed this much to technological progress.

The Deeply Intertwined Promise and Peril of GNR (Ray Kurzweil, 2005)

This is one of the few chapters in The Singularity is Near that address existential risk.

Responding to Bill Joy's article, Kurzweil presents several ways to build defensive technology to avoid existential risk: to prevent out-of-control self-replicating nanorobots, just build an immune system that can also self-replicate.

For Kurzweil, "Intelligence is inherently impossible to control". His proposition to align AI with human values is "to foster those values in our society today and [go] forward. [...] The non-biological intelligence we are creating is and will be embedded in our societies and will reflect our values."

My comments: I was pleasantly surprised to see Kurzweil addressing Yudkowsky's Friendly AI and Bostrom's framework of existential risk in 2005. He appears to know very well the risks associated with an intelligence explosion, but has different opinions on how to deal with them.

The Basic AI Drives (Omohundro, 2008)

This is an essential AI paper describing the core "drives" that any sufficiently advanced intelligence would possess (e.g. self-improvement, rationality or self-protection).

My comments: this paper presents critical examples of instrumental goals, laying the groundwork for current AI Safety research.

The ethics of Artificial Intelligence (Bostrom & Yudkowsky, 2011)

The paper starts by giving some context about AGI, and then presents principles to better think about AGI ethics:

Moral status should be independent of the substrate of implementation of an intelligence and of its ontogeny.
When considering duration of experiences, subjective time must be taken into account.

My comments: this chapter answered Kurzweil and cited Omohundro (cf. previous chapters), giving a feeling of consistency to the book.

Friendly AI: the Physics Challenge (Tegmark, 2015)

Max Tegmark proposes a more physicist-oriented approach to Friendly AI. The questions I found the most interesting are:

What if a final goal becomes undefined and the agent undergoes an "ontological crisis"?
What does it mean to have a "final goal" if there is no clear "end of time" in physics?
What would be "not boring" to optimize?

The fraction of matter that is Human? Conscious?
One's ability to predict the future?
The computational power of the cosmos?

My comments: some exciting points and examples. Interesting to have Tegmark's physics-oriented perspective.

MDL Intelligence Distillation: Exploring Strategies for Safe Access to Superintelligent Problem-Solving Capabilities (Drexler, 2015)

This paper presents a method of producing a self-replicating AI with a safe goal of intelligence distillation, where the key metric is the description length of an AI capable of open-ended recursive improvement.

Additionally, Drexler defines what he calls Transitional AI Safety, or risk-reduction methods for AI Safety research, like:

extending the time available for research
enabling experimentation with very capable AI
using smarter-than-human intelligence to solve AI Safety problems

My comments: This was the first technical chapter, and it was much more difficult to follow. It feels like Drexler (author of Engines of Creation, the book on nanotechnology cited at the beginning by Bill Joy) is answering the previous chapters that often cited his work! In the conclusion, the author gives more meta-level advice on AI Safety research, like bridging the AI research agenda gap or enriching the conceptual universe, which I found really interesting.

The value learning problem (Soares, 2016)

The paper surveys methods and problems for the value learning problem. One of the key components is an inductive value learning system that would learn to classify outcomes, using some labeled value-learning data. Soares then considers how to adapt the algorithm for different issues (e.g. corrigibility or ontology/ambiguity identification). To solve those problems, such an algorithm would need to be able to learn from sparse data and to identify the referent of a training-data label in any given model of reality.

My comments: This paper was very enjoyable to read. Clear and straight to the point. It surveys multiple important problems and answers them with a simple and concrete method: inductive value learning.

Adversarial examples in the physical world (Alexey Kurakin & Ian Goodfellow & Samy Bengio, 2016)

Essentially, it is possible to craft adversarial examples that a classifier will misclassify even in a real-world setting (e.g. when the input comes through a camera rather than being fed directly into the model). Even when the difference between an adversarial example and a training example is much smaller (in magnitude) than some random noise, the adversarial example can still be misclassified while the noisy image is classified correctly.
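
To make the contrast between a targeted perturbation and random noise concrete, here is a minimal sketch of the fast gradient sign method, one of the attacks the paper evaluates. It is not the paper's setup: the paper attacks an image classifier through a camera, while this toy uses a hypothetical 400-feature logistic-regression model whose weights, input and perturbation sizes (0.10 and 0.15) are all made up for illustration.

```python
import math
import random

def score(w, b, x):
    """Linear score w.x + b; the model predicts class 1 when it is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    return 1 if score(w, b, x) > 0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One fast-gradient-sign step against a logistic-regression model."""
    p = 1.0 / (1.0 + math.exp(-score(w, b, x)))
    # Gradient of the cross-entropy loss with respect to input x_i is (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    # Move every coordinate by eps in the direction that increases the loss.
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model: 400 features with alternating weights +1 / -1.
w = [1.0 if i % 2 == 0 else -1.0 for i in range(400)]
b = 0.0
x = [0.05 * wi for wi in w]  # clean input with a clearly positive score (class 1)

# Targeted perturbation: every coordinate moves by exactly 0.10.
x_adv = fgsm(w, b, x, y=1, eps=0.10)

# Random noise: every coordinate moves by 0.15 in a random direction.
rng = random.Random(0)
x_noisy = [xi + 0.15 * rng.choice((-1.0, 1.0)) for xi in x]

print(predict(w, b, x), predict(w, b, x_adv), predict(w, b, x_noisy))
```

The targeted step shifts the score by eps times the sum of the absolute weights, because every coordinate pushes the same way. Random signs mostly cancel each other out, which is why the larger random noise leaves the prediction unchanged in this toy setting.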

My comments: Straight-to-the-point paper with a bunch of concrete experiments, pictures and results. It made me consider the Security concerns in a more pragmatic way (e.g. adding noise to a stop sign).

How might AI come about: different approaches and their implications for life in the universe (David Brin, 2016)

Multiple paths of AI development are discussed (e.g. in "robotic-embodied childhood", child-like robots are fostered into human homes and raised like children). Other concepts such as Sapience, the control problem or ethics are considered.

My comments: The essay explains multiple concepts superficially. The author seems to be trying to give an intuition for a "Skynet-scenario" to a general audience. This chapter felt weak and fuzzy compared to the rest.

The MADCOM Future: how AI will enhance computational propaganda, reprogram human culture, and threaten democracy... and what can be done about it (Matt Chessen, 2017)

"Ten years from now, you won't be able to tell whether you're interacting with a human online or not. In the future, most online speech and content will be machines talking to machines."

Machine-driven communication (or MADCOM) will revolutionize online interaction. This could lead to global information warfare where humans can't compete alone with computational propaganda. Multiple US Policy recommendations are given to limit the bad consequences of MADCOM and help implement countermeasures.

My comments: I found the thorough analysis of the possible impacts of MADCOM insightful. This made me update my beliefs about the importance of computational propaganda this century. However, I found that the chapter focused too much on the short-term. For instance, in the definition of AGI (in the glossary), it says: "AGI is still science-fiction"!

Strategic implications of openness in AI development (Bostrom, 2017)

What would be the short-, medium- and long-term impacts of an open AI development?

In the short- and medium-term, AI labs would still conduct original research to build skills, keep up with the state of the art and have a monopoly on their research for a few months (while competitors are catching up). So openness would result in accelerating AI progress.

In the long-term, the final stages for building AGI will likely be much more competitive. To avoid a tight race where AI Safety is dismissed, a singleton scenario is preferable.

Another possibility to take into account is hardware overhang. If an algorithmic breakthrough is what leads to an intelligence explosion, then openness would favor small groups (that don't have access to advanced hardware). On the other hand, if hardware is the decisive factor, then openness would favor "elite", or already established labs.

Openness in AI development would clearly help AI Safety research, for it's difficult to work on making AI systems safer if those systems are kept secret. AI Safety could benefit more from openness than AI Capability, because of a need for external perspectives. However, altruistic outsiders are already likely to openly contribute to AI Safety.

My comments: I found this paper very dense. It made me deeply consider the strategic intricacies of openness. I think this chapter is fundamental for this book and I was happy to have it at the end of this part.

Part II - Responses of Scholars

Using human history, psychology, and biology to make AI safe for humans (Gus Bekdash)

Two intelligences are differentiated: Human Intelligence (HI) and Gene Intelligence (GI). GI is essentially the intelligence of all human genomes (or the intelligence of evolution). GI provides control for human reproduction (and therefore controls HI). If all intelligences prove to be inherently dangerous, then we could apply to AI the same principles that made humanity safe (22 principles are considered).

My comments: This chapter appeared to me both concrete and ambitious because of the precise description of those 22 proposed principles.

AI Safety: a first-person perspective (Edward Frenkel)

In this chapter, the mathematician Edward Frenkel tells his story. In particular, he describes dissociation, a defense mechanism that can follow trauma, and how it affected his research life.

His point is the following: in the end, humans will be the ones programming the AI, so we should make sure that everyone working on AI Safety is sane, to avoid a Germanwings Flight 9525 scenario, where a depressed co-pilot caused the deaths of all 150 people on board.

My comments: His story touched me, but I felt it did not directly address AI Safety.

Strategies for an unfriendly oracle AI with reset button (Olle Häggström)

Is it safe to have an oracle AI that answers only yes/no questions and is reset every time? Can the AI still send a lethal message?

The AI has several methods to (still) try to send a specific message:

The Naïve frequency-based approach consists in answering yes to each question with a certain probability p, where the binary expansion of p is the message. However, the gigantic number of questions that need to be asked before the frequency of yes answers converges towards the probability p makes the whole approach impractical. Furthermore, humans could ask very simple questions to make sure the AI is not trying to send a message by answering randomly.
The multiple types of questions approach, where the AI can classify questions and will always answer the same thing for the same category of question. It partitions the question set into (Q_1, ... , Q_n) so that humans can retrieve the message if they have the partition and at least one answer for each set. This method is much faster and could be feasible if the question/answer protocol is automated.
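
The two strategies above can be sketched in a few lines. This is only an illustration under my own assumptions, not Häggström's construction: the nearest-fraction decoder, the byte-sum rule used to partition questions, and the 3-bit message are all hypothetical choices. Both oracles are stateless, so they still work when the AI is reset after every answer.

```python
import random

def message_to_probability(bits):
    """Read bits b1..bn as the binary fraction 0.b1b2...bn."""
    return sum(bit / 2 ** (i + 1) for i, bit in enumerate(bits))

def probability_to_message(p_hat, n):
    """Round an estimated yes-frequency to the nearest n-bit binary fraction."""
    k = min(max(round(p_hat * 2 ** n), 0), 2 ** n - 1)
    return [(k >> (n - 1 - i)) & 1 for i in range(n)]

def frequency_channel(bits, num_questions, rng):
    """Naive approach: answer yes with probability p, decode from the yes-rate."""
    p = message_to_probability(bits)
    yes_count = sum(rng.random() < p for _ in range(num_questions))
    return probability_to_message(yes_count / num_questions, len(bits))

def question_class(question, n):
    """Hypothetical partition rule: byte sum of the question text, modulo n."""
    return sum(question.encode("utf-8")) % n

def partition_oracle(bits, question):
    """Partition approach: the answer depends only on the question's class."""
    return bits[question_class(question, len(bits))]

def partition_decode(questions, answers, n):
    """Recover bit i from any answer to a question in class i."""
    recovered = [None] * n
    for question, answer in zip(questions, answers):
        recovered[question_class(question, n)] = answer
    return recovered

message = [1, 0, 1]

# Naive approach: thousands of questions for three bits.
decoded_by_frequency = frequency_channel(message, 20000, random.Random(0))

# Partition approach: one answer per class is enough.
questions = [f"q{i}" for i in range(10)]
answers = [partition_oracle(message, q) for q in questions]
decoded_by_partition = partition_decode(questions, answers, len(message))
```

Distinguishing n-bit fractions requires the yes-rate to be estimated to within about 2^-(n+1), so the number of questions grows exponentially with the message length. The partition approach needs only one answered question per class.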

My comments: Very clear chapter. The math used to prove the bounds was easy to follow. I am still unsure whether the humans would already know the partition Q_1, ... , Q_n, or would try to decipher a lethal message. The methods seem to only work if the humans know what method the AI is using and want to know the content of the message.

Goal Changes in Intelligent Agents (Seth Herd, Stephen J. Read, Randall O’Reilly, and David J. Jilk)

There is a trade-off between spending time trying to solve motivation drift, representation drift, and reducing the risks of hacking.

My comments: Short and clear chapter. It was a bit redundant considering the chapters "The basic AI drives" and "The value learning problem".

Limits to Verification and Validation of Agentic Behavior (David J. Jilk)

This chapter addresses verifiability using a framework for deontology. The main result is that verifying if an agent will always exhibit a good behaviour given a deontology is not computable. This result is then applied to AI Safety strategy.

My comments: I found the math symbols unnecessary. More generally, the chapter was too technical for a non-expert audience.

Adversarial Machine Learning (Phillip Kuznetsov, Riley Edmunds, Ted Xiao, Humza Iqbal, Raul Puri, Noah Golmant, and Shannon Shih)

After a general introduction on what adversarial examples are, the chapter presents a taxonomy of different attacks depending on the level of information about the model being attacked.

One of the key claims is that AI security against adversarial examples should not be neglected, because given enough time and effort an attacker can build a pseudo-model and cause harm.

My comments: This chapter was very well presented and clear. It resonated with one of the first chapters "Adversarial examples in the physical world".

Value alignment via tractable preference distance (Andrea Loreggia, Nicholas Mattei, Francesca Rossi, and K. Brent Venable)

A framework to compute distances and represent conditional preferences (e.g. "preferring red cars if the car is a convertible") is proposed. The value alignment procedure relies on a distance between subjective preferences and ethical principles.

My comments: The chapter provided very little explanation of how the procedure for value alignment worked intuitively, and I found it difficult to follow without any background on CP-nets.

A Rationally Addicted Artificial Superintelligence (James D. Miller)

In economics, a beneficial addiction increases your consumption capital as you consume. For instance, a superintelligence could get utility from learning mathematics, and the more it learns mathematics the more "quickly" it can learn it through self-modification (the self-modification uses mathematics).

The author claims that a superintelligence will have a drive to pursue such beneficial addictions. Therefore, AI Safety researchers should try to find beneficial addictions that promote human welfare.

My comments: It's the first article of the book that is heavily influenced by economics. I found it clear and it convinced me of the importance of beneficial addictions.

On the Security of Robotic Applications Using ROS (David Portugal, Miguel A. Santos, Samuel Pereira, and Micael S. Couceiro)

This chapter addresses security concerns regarding a robotic framework: the Robot Operating System (ROS). This framework is only used in research, does not have security by default and has multiple security holes (e.g. communication between nodes uses clear text over TCP/IP and UDP/IP).

Additional security features are presented:

a native Transport Layer Security (TLS) for all socket transport
using Advanced Encryption Standard (AES) for encryption
integrating the IP security extension (IPsec)
having a security architecture at the application level

My comments: This chapter made me realize how important securing robotic applications is. No matter how safe we make our AI systems, if someone is able to hack a connected device or a robot, then everything is lost.

Social Choice and the Value Alignment Problem (Mahendra Prasad)

Social choice theory studies how we can aggregate information about agents into a group decision. Although most normative questions (i.e. how the world should be) about AI ethics will be provided by social choice, some will be built into the AI by its designers. Some questions that must be answered are:

Standing: who or what has its values influence the AI's behaviours and values?
Measurement: how are ethical values measured? In what context? What about discounted utilities and individuals that cannot represent themselves (e.g. posthumans or dead people)?

Solutions to those questions include:

A four-stage procedural legitimacy that aims at building a constitution of the Superintelligent AI.
A risk aversion principle: we don't care if the voting system fails on some low-stakes decisions, but we want it to be extremely robust on high-stakes decisions.
Nondeterministic voting: to prevent voters from strategically misrepresenting their preferences.

My comments: I feel like this chapter was adapted from an AI ethics article for a textbook on AI Safety and Security, but did not particularly address AI Safety issues. The story about Condorcet was unnecessary.

Disjunctive Scenarios of Catastrophic AI Risk (Kaj Sotala)

The ASI-PATH model proposes to estimate the risk of an AI catastrophe from the probability of the conjunction of multiple events. However, this model does not take into account the fact that Major Strategic Advantages (MSA) are sufficient for catastrophic risks (i.e. human well-being damage on a large scale with more than 10 million deaths) and that any global turbulence can lead to existential risk.

After exhaustively listing Decisive Strategic Advantage (DSA) and MSA enablers, Sotala proposes a disjunctive diagram that could lead to an AI catastrophe.
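
To see why the conjunctive/disjunctive distinction matters for the estimate, here is a minimal sketch. It assumes independent events, which neither model necessarily does, and the probabilities are hypothetical numbers chosen only for illustration.

```python
from math import prod

def conjunctive_risk(step_probabilities):
    """Catastrophe requires every step to happen (assumes independence)."""
    return prod(step_probabilities)

def disjunctive_risk(route_probabilities):
    """Catastrophe requires at least one route to succeed (assumes independence)."""
    return 1 - prod(1 - p for p in route_probabilities)

# Hypothetical numbers: four necessary steps, each with probability 0.5.
all_steps_needed = conjunctive_risk([0.5, 0.5, 0.5, 0.5])

# Hypothetical numbers: three alternative routes, each with probability 0.1.
any_route_suffices = disjunctive_risk([0.1, 0.1, 0.1])
```

Each extra necessary step can only lower a conjunctive estimate, while each extra sufficient route can only raise a disjunctive one. That is why listing additional enablers changes the picture.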

My comments: Being already familiar with the ASI-PATH model, I was happy to see another diagram that helped me think about AI risk estimates and an exhaustive list of DSA/MSA enablers.

Offensive Realism and the Insecure Structure of the International System: Artificial Intelligence and Global Hegemony (Maurizio Tinnirello)

How can International Relations be improved to better manage AI Risk, given the theory of Offensive Realism (i.e. the lack of a world government leads states to joust for power)? This chapter focused on militarized AI, AI Security and political instability.

My comments: I found that the chapter could be applied to any highly destructive weapons and was not specific to AI.

Superintelligence and the Future of Governance: On Prioritizing the Control Problem at the End of History (Phil Torres)

The key claim is that humanity needs to build an enlightened despot, or friendly supersingleton, to prevent state and non-state actors from blowing up the world with dual-use emerging technologies.

My comments: Very clear diagrams. This made me update my beliefs about an enlightened despot scenario.

Military AI as a Convergent Goal of Self-Improving AI (Alexey Turchin and David Denkenberger)

There is a drive to build Military Tool AIs because an AGI will face enemies, such as:

Teams working on AGI
Nation states or alien civilizations
Any AGI that already exists or might exist

Examples of Military AI (such as the Terminator) have become taboo, and that is a pity because it is (still) a likely outcome.

My comments: Made me reconsider the importance of military AI. Making progress on Alignment is not enough.

A Value-Sensitive Design Approach to Intelligent Agents (Steven Umbrello and Angelo F. De Bellis)

Value-Sensitive Design (VSD) is a flexible and capable design for building Aligned agents. It's self-reflexive, fallible and continually improving (it seeks to predict emerging values/issues and influence the design of technologies early in the process). The main focus is on stakeholders' values.

My comments: I found this chapter too abstract. I would have preferred more concrete applications or a more precise formalization.

Consequentialism, Deontology, and Artificial Intelligence Safety (Mark Walker)

The claim in this chapter is that AI Safety research should prefer consequentialism over deontology. The author then gives an introduction to normative ethics: Utilitarianism values happiness (maximizing principle), whereas Kantianism values the moral good will (never subtract, i.e. never treat people solely as means).

My comments: This chapter did not address AI Safety and superficially covered normative ethics.

Smart Machines ARE a Threat to Humanity (Kevin Warwick)

Kevin Warwick criticizes two points made by Penrose in Shadows of the Mind (1994):

the human brain is a complex organ, but there is no "great deal of randomness" inside
AI will be able to "understand", but in a different way

My comments: I feel like this chapter addresses problems that were being discussed more than 20 years ago, that are already solved or not relevant anymore (e.g. Asimov's three laws).

Conclusion

The first part, presenting authoritative articles from top AI researchers, made me realize how the most influential ideas of this century have built upon each other. I was pleased to see previous chapters being cited over and over by the next ones. Most of the content of this part is available online, but I enjoyed the selection and organization of those chapters.

The second part, made of 17 chapters written by academics specifically for this book, discussed recent research on very different topics (adversarial machine learning, social choice, nanotechnology, ethics, alignment, robotics, AI governance, AI strategy, psychology, AI policy), and gave me a general overview of the field of AI Safety and Security.
