I read Andrew Critch and David Krueger's AI Research Considerations for Human Existential Safety (ARCHES) recently, which is a report that outlines and discusses 29 research directions which might be helpful for AI existential safety. This has also been discussed in Alignment Newsletter #103 and in an interview with Andrew Critch on the Future of Life Podcast.
This report focuses on "prepotent" AI, defined as follows: an AI system is prepotent if its development would transform the state of humanity's habitat - currently the Earth - in a manner that is at least as impactful as humanity and that is unstoppable to humanity. Such an AI system would be as transformative to Earth as the entire effect of humanity so far, including the agricultural and industrial revolutions, and would be unstoppable in the sense that no group of humans could stop or reverse its impact even if they wanted to. In other words, prepotent AI is Transformative AI (a term used by Open Philanthropy) which is unstoppable to humans.
As well as being prepotent, an AI technology would need to be misaligned for it to be bad, and hence this report talks a lot about misaligned prepotent AI (MPAI). When there are multiple stakeholders it is hard to define what alignment means, so the report uses human extinction as a clear (but perhaps overly restrictive) boundary: a prepotent AI system is misaligned if it is unsurvivable to humanity. This definition obviously misses some very bad outcomes, but it is justified by what is called the Human Fragility Argument, which says that most potential future states of the Earth are unsurvivable to humanity (for example, if all the oxygen in the atmosphere were removed); hence, unless we have properly aligned a prepotent AI system, its deployment and subsequent transformative actions will likely result in human extinction.
The report is open about what it omits, which includes: bad outcomes which don't result in human extinction; the definition of a 'human'; the definition of 'beneficial'; and the impact of AI making other existential risks (nuclear weapons, bioweapons, etc.) more likely.
This report outlines 29 research directions with the goal of reducing the chance that MPAI will be deployed. It frames the human/AI relationship as one where humans delegate a task to an AI system, which avoids having to talk about AIs having perspectives or desires other than the human desires. The report divides the research directions based on the number of human stakeholders and AI systems involved. These are labeled with the number of humans and then the number of AI systems (because humans come first, both historically and morally), i.e. single/single, single/multi, multi/single, and multi/multi.
From the single/single case, multiple stakeholders can quickly arise, either from new people trying to control part of the system or from disagreements among existing stakeholders. There will also be strong incentives to copy a single AI system, and so a multi/multi situation may arise quickly.
Each of these categories is then subdivided according to three human capabilities needed for successful human/AI delegation: comprehension, instruction, and control.
There are two tiers of risks which this report considers: Tier 1 risks lead directly to MPAI deployment events, while Tier 2 risks are conditions which increase the chances of Tier 1 risks (these could also be called risk factors).
The Tier 1 risks are an exhaustive list of ways MPAI could be deployed
The Tier 2 risks are not an exhaustive list of risk factors
This report repeatedly stresses the importance of considering whether research could be harmful. It seems likely that if some single/single problems are solved, then this could speed up the deployment of powerful AI systems. This could lead to a multi/multi scenario which has been poorly prepared for. Additionally, many of the research directions are potentially 'dual use', where the research could directly increase the chances of MPAI deployment. For example, work on allowing an AI system to model human decision-making processes could help it work better with humans to achieve the humans' preferences, but it could also allow the system to manipulate humans more effectively in undesirable ways.
The report then outlines the 29 research directions in Chapters 5, 6, 8, and 9. Chapter 7 comes after the single-stakeholder research directions and outlines considerations relevant to AI systems controlled by multiple stakeholders. Each research direction section has useful subheadings which are often (but not always) repeated.
Here I will attempt to give a short description of the research directions. My aim here is to briefly say what the research directions are, not to give a detailed overview; reading the relevant section of the report is probably best for that.
A single human stakeholder delegating to a single AI system.
Developing methods for looking at the inner workings of an AI system (transparency) and for explaining why it makes decisions in a way which is legible to humans (explainability).
Developing AI systems which express accurate probabilistic confidence levels when answering questions or choosing actions. For example, the system should be properly calibrated, such that answers assigned a 90% probability of being correct actually are correct 90% of the time. A well-calibrated system should have a good idea of when to shut itself off or trigger human intervention.
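As a toy illustration (my own sketch, not from the report), calibration can be measured by binning a system's confidence reports and comparing each bin's average confidence to its empirical accuracy:

```python
# Toy sketch of a calibration check: group predictions by confidence bin
# and compare each bin's average confidence against its empirical accuracy.
# A well-calibrated system has avg_conf ~= accuracy in every bin, e.g.
# answers given ~90% confidence should be right ~90% of the time.

def calibration_table(confidences, correct, n_bins=10):
    """Return (avg_confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    table = []
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            table.append((avg_conf, accuracy, len(b)))
    return table
```

Large gaps between a bin's average confidence and its accuracy indicate over- or under-confidence, which is exactly when a system should be most inclined to defer or trigger human intervention.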
Using formal proof/argument verification methods for ML systems. Ideally it would be good to have a formal proof that a powerful AI system was not misaligned before we deploy it.
Using an AI system to assist humans to reflect on information and arrive at decisions which they reflectively endorse.
Like humans, AI systems have limited computation, and so it would be good to have a model of what kinds of decisions are easy enough or too hard for an AI system with a given amount of computation.
Ensuring an AI system can learn how to act in accordance with the preferences of another system, such as a human stakeholder. This is mainly useful for ensuring that a system is aligned, and therefore reduces risks due to unrecognized misalignment.
An AI system may potentially interact better with humans if it can accurately infer the humans' beliefs. For example this can include inferences about information a human does or does not know, and how to determine human beliefs from human actions.
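As a minimal sketch of what "determining human beliefs from human actions" could look like (my own illustrative example, not from the report), Bayes' rule can be used to infer a hidden belief from an observed action, assuming a simple noisy model of how beliefs drive actions:

```python
# Illustrative sketch: inferring a human's hidden belief from an observed
# action via Bayes' rule, P(belief | action) ∝ P(action | belief) P(belief).
# The scenario, names, and probabilities below are all hypothetical.

def posterior_over_beliefs(observed_action, beliefs, likelihood, prior):
    """Return the normalized posterior over beliefs given one action."""
    unnorm = {b: likelihood(observed_action, b) * prior[b] for b in beliefs}
    z = sum(unnorm.values())
    return {b: p / z for b, p in unnorm.items()}

# Hypothetical example: does the human believe the road is icy?
beliefs = ["icy", "clear"]
prior = {"icy": 0.5, "clear": 0.5}

def likelihood(action, belief):
    # A human who believes the road is icy mostly drives slowly.
    table = {("slow", "icy"): 0.9, ("fast", "icy"): 0.1,
             ("slow", "clear"): 0.2, ("fast", "clear"): 0.8}
    return table[(action, belief)]

post = posterior_over_beliefs("slow", beliefs, likelihood, prior)
# Observing "slow" puts roughly 82% of the posterior on "icy".
```

Real proposals in this area (e.g. Bayesian inverse planning) are far richer, but they share this basic structure: a model of how beliefs produce actions, inverted to recover beliefs from observed behavior.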
Using a mathematical or computational model of human cognition to allow an AI system to better interact with humans.
Designing methods such that an AI system can safely shut down or hand tasks back to a human. For example, the autopilot of an airplane should not be turned off until control has been safely handed back to the pilot. A safe shutdown could be operationalized as "entering a state from which a human controller can proceed safely".
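The autopilot example above can be sketched as a tiny handoff protocol (my own illustration, not from the report): the automated controller refuses to disengage until the human has explicitly confirmed they are ready to take over.

```python
# Illustrative sketch of a safe shutdown/handoff protocol: the controller
# only disengages from a state the human can proceed safely from, i.e.
# after the human has confirmed control. All names here are hypothetical.

class Autopilot:
    def __init__(self):
        self.engaged = True
        self.human_ready = False

    def confirm_human_control(self):
        """The human pilot signals they are ready to take over."""
        self.human_ready = True

    def request_shutdown(self) -> bool:
        """Disengage only if the resulting state is safe for the human."""
        if not self.human_ready:
            return False  # refuse: nobody would be flying the plane
        self.engaged = False
        return True

ap = Autopilot()
assert ap.request_shutdown() is False and ap.engaged      # refused
ap.confirm_human_control()
assert ap.request_shutdown() is True and not ap.engaged   # safe handoff
```

The interesting research questions are of course in defining "safe to proceed from" for systems far more capable than this toy, but the shape of the protocol is the same: shutdown is a state transition that must itself be checked for safety.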
A corrigible AI "cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut down or modify their procedures".
The aim of this research direction is to design AI systems which defer to humans in certain situations, even when the AI believes it has a better understanding of the correct action or of what humans will later prefer. There are also times when an AI system does know better than a human and should not defer to a single human about an important decision.
This research direction explores the game-theoretic implications of players in a game being able to inspect the internal 'thought processes' of the other players. For example, humans may have the ability to inspect the inner workings of an AI, and this ability may change how the AI acts.
Single human stakeholder delegating to multiple separated AI systems.
Can we develop an analogue of the Von Neumann-Morgenstern (VNM) theorem for a multi-agent system with a single goal? Such a theorem should include communications between the agents, and constraints on the objective functions of the individual agents. The human stakeholder could be included as one of the agents, rather than just considering the AI systems.
Similarly to how we want to understand the inner workings of single AI systems, it would also be useful to understand the communications between separate AI systems.
For multi-agent systems there can often emerge different relationships such as 'cooperative' or 'competitive'. It would be good to classify these and other types of relationships, and have methods for detecting them.
For large problems, AI systems may be arranged into hierarchies, and the problems may be split into smaller parts. AI systems in the middle of the hierarchy could provide human readable reports, and ask questions to humans and other AI systems in the hierarchy.
In a system with multiple AI subsystems it may be useful for subsystems to query humans or other subsystems when they don't have enough information. Hierarchical human-in-the-loop learning (HHL) is about setting up a hierarchy of AI systems which can do this, and is compatible with human oversight at each level.
An AI system has a heritable purpose if it creates descendants whose actions also serve the original system's purpose. Descendants may be created in various ways: the AI may have an internal planning process which selects actions to create a descendant; the AI may be ordered (by a human or an AI) to create descendants; or a descendant may arise accidentally for other reasons. Even if a system is not misaligned, it is important to ensure that its descendants are not misaligned either.
It may be possible to derive human-favorable norms from more fundamental ethical abstractions such as loyalty or reciprocity. These could be used to apply constraints to AI systems, even if the constraints violate VNM rationality.
It will be possible to copy and reimplement AI systems in different environments. Powerful AI systems should perhaps have some awareness of this; for example, a system might know that it may be being used by hackers in a deliberately corrupted environment. Additionally, if the system thinks there is some chance it is in a test environment, it may be more likely to comply with shutdown commands.
Here multiple research directions from single/single control can be generalized to the case with multiple AI systems: shutdown and handoff methods, corrigibility, and deference to humans. However, there will likely be novel failure modes which arise from having multiple AI systems, even if the single/single case is safe.
Even if the shutdown protocol is safe for a single agent, it may not be safe when multiple agents are shut off. As an example, a single doctor can take the day off, but not all the doctors at once.
If some/most of the agents in a system successfully shut down, any remaining agents may be unsafe outside of the context of the original system. This seems especially true when the agents which shut down were responsible for safety or stability of the system.
Before moving on to research directions concerned with multiple stakeholders, this section of the report discusses what it means for an AI system to be aligned or misaligned when there are multiple stakeholders. When stakeholders have different views, it seems hard to define what it would mean for a system to be 'aligned'. However, from the perspective of humanity, human extinction is clearly a bad outcome; therefore, from here onward it is easier to define misalignment than alignment for the multi-stakeholder scenario.
The report lists 4 multi-stakeholder objectives which serve as high level guidelines for the multi/single and multi/multi delegation research.
This objective is to make it easier for diverse stakeholders to collaborate in the oversight of AI technologies, by developing AI technology alongside governance techniques which capture the benefits of collaboration while avoiding forms of collaboration which would be unsafe or unnecessarily costly compared to independent governance. Institutions may be motivated to collaborate to ensure their views are represented, to pool knowledge and resources, and to reduce the chances of existential risk.
This objective is to make collaborative oversight of AI systems sufficiently easy and appealing as to reduce first mover advantages which could lead to unsafe development races.
This objective aims to co-develop AI and governance techniques which enable governing stakeholders to collaborate on mitigating the idiosyncratic tendencies of each stakeholder, while still enabling the group to take broadly agreeable risks.
This objective is to develop existential safety systems: somewhat autonomous organizational units (comprising humans and/or automated tools) whose purpose is protecting global public safety at the scale of existential risk.
Multiple human stakeholders (e.g. different companies or members of a committee) delegating to a single AI system.
Many of the single/single comprehension strategies will be useful for the multi/single case; additionally, there is at least one research direction specific to dealing with multiple stakeholders.
A committee charged with operating/overseeing a powerful AI system should be able to discuss the system without being observed by the system. For example, if members of the committee are worried an AI system is misaligned, they may be worried about discussing it because the AI system may find out and take actions to avoid being shut down.
A system whose purpose is to serve a committee may need to model not only the preferences of the individual committee members, but also the dynamics of the deliberation process. In some sense this is diametrically opposed to research direction 21 (privacy for operating committees), which emphasizes how difficult it can be to ensure research is beneficial and hence how careful we must be.
Developing a service for assisting in the resolution of belief disagreements between conflicting stakeholders. If stakeholders are more able to reach agreement about their beliefs, then this seems beneficial for all of the multi-stakeholder objectives.
There may be cases in which belief or value differences between stakeholders can't be reconciled, in which case it would be useful to be able to formulate plans which the multiple stakeholders can still agree on. This again seems beneficial for all of the multi-stakeholder objectives.
Like in the single/single control case, the overseers of an AI system should retain the ability to shut down or override the system. For this, different committee structures could be used, for example: only shut down if there is a consensus, or allow all members unilateral authority to shut down.
This research direction explores ways in which multiple stakeholders each have some control over an AI system's execution. A basic example of this would be a system where all committee members need to enter their passwords before execution.
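The password example above can be generalized to a threshold scheme, where execution requires approval from at least k of n committee members. Here is a toy sketch (the names and scheme are my own illustration, not from the report); a real system would use cryptographic secret sharing such as Shamir's scheme rather than plain password checks.

```python
# Toy sketch of distributed control over execution: an action is authorized
# only if at least `threshold` of the committee members supply correct
# passwords. Names and mechanism are illustrative, not from ARCHES.
import hashlib

def _digest(password: str) -> str:
    return hashlib.sha256(password.encode()).hexdigest()

class CommitteeLock:
    def __init__(self, member_passwords: dict, threshold: int):
        # Store only digests, never the raw passwords.
        self._digests = {m: _digest(p) for m, p in member_passwords.items()}
        self.threshold = threshold

    def authorize(self, submitted: dict) -> bool:
        """Allow execution only if enough members supply valid passwords."""
        valid = sum(
            1 for member, pw in submitted.items()
            if self._digests.get(member) == _digest(pw)
        )
        return valid >= self.threshold

lock = CommitteeLock({"alice": "a1", "bob": "b2", "carol": "c3"}, threshold=2)
print(lock.authorize({"alice": "a1", "bob": "wrong"}))  # False: one valid
print(lock.authorize({"alice": "a1", "carol": "c3"}))   # True: two valid
```

Requiring all members (threshold = n) corresponds to the unanimous-consent structure mentioned above, while smaller thresholds trade some safety against the risk of a single unavailable or compromised member blocking legitimate use.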
Multiple human stakeholders delegating to multiple different AI systems.
This research direction is concerned with the identification of criteria describing what capacities might be needed for any given research group to develop prepotent AI. Such criteria could be used to define auditing requirements which governments or regulators could use to ensure development teams were acting safely and ethically.
This research direction is focused on enabling AI systems to respect a 'social contract' with multiple stakeholders, including existing institutions such as states, businesses or even all of human civilization. Even if a system is technically acting within the law, there may be unethical actions which a social contract could prevent.
This direction is concerned with individuals modifying or reimplementing AI systems in a multi/multi scenario, in cases where these changes could break the safety of the overall interacting system. Without proper security, unsafe reimplementation by careless or malicious actors could cause all Tier 1 risks.
This final research direction is concerned with developing a more realistic understanding of the game-theoretic equilibria where some agents are humans. We want to guarantee that the human agents will not be destroyed or replaced.
Most, if not all, of the research directions have a section on side effects which could result from the research. These are often very specific to the research, but there are also some common themes.
The most common potential side effect is the research resulting in a false sense of security. Even if a problem is solved for a weak system or a single system, the problem may not be solved if the system becomes prepotent or in a scenario where there are multiple AI systems. For directions concerned with proofs and formal verification, a result may hold for strict assumptions but not work when more realistic assumptions are used.
A few of the research directions are concerned with allowing the AI to accurately model human decision making and cognition. These directions all have the effect that they may make it easier for an AI system to manipulate humans. Additionally, if an AI can gain a lot of information just from observing a human's actions, this may allow it to rapidly learn a lot of information which may lead to prepotence.
The comprehension research directions often involve the AI system(s) producing a human-readable report about their internal workings. For these directions to be useful, the reports must be an accurate reflection of the system's decision-making process, rather than just a post-hoc rationalization of its actions. The humans reading the reports also have to believe them, and must be comfortable with and able to talk about the risks from AI systems. Therefore, if discourse around AI existential safety is impaired, these comprehension techniques become less useful.
In this report there seem to be two main types of research directions about developing AI: directions for directly building safe AI, and directions for building AI tools which will help us safely develop and use powerful AI. For example, most of the single/single research directions (e.g. transparency and explainability, calibrated confidence reports, preference learning, etc.) seem applicable to building safe AI, while towards the end of the report (especially in the multiple-stakeholder scenarios) there are more directions related to building AI systems which will hopefully lead to humans making good decisions (e.g. modeling human committee deliberation, moderating human belief disagreements). Both types are about developing AI, but it seems useful to consider whether the aim of a research direction is to directly make a safe AI, or to make sure human stakeholders make good decisions around the use of AI. There are also research directions which fit partially into both of these categories or don't fit into either.
In a similar way, as the report progresses from the single/single research directions to multi/multi, the directions move from concrete ideas rooted in computer science towards broader questions. These broader questions are more general, covering areas related to decision making and group dynamics which are also relevant to fields outside of AI safety. This is probably to be expected: almost all AI safety work so far has been done on single/single delegation scenarios, so current research there can dig deeper into specific questions. The more complicated multi-agent scenarios have received less attention, so the research is at an earlier stage and there are fewer specific questions to focus on.