Allan Dafoe, head of the AI Governance team at FHI, has written up a research agenda. It's a very large document, and I haven't gotten around to reading it all, but one of the things I would be most excited about is people in the comments quoting the most important snippets that give a big-picture overview of the content of the agenda.


I thought this was quite a good operationalisation of how hard aligning advanced AI systems might be; I've taken it from the conclusion of the overview of the technical landscape. (All of what follows is a direct quote, but it's not in quotes because the editor can't do that at the same time as bullet points.)

---

There are a broad range of implicit views about how technically hard it will be to make safe advanced AI systems. They differ on the technical difficulty of safe advanced AI systems, as well as risks of catastrophe, and rationality of regulatory systems. We might characterize them as follows:

  • Easy: We can, with high reliability, prevent catastrophic risks with modest effort, say 1-10% of the costs of developing the system.
  • Medium: Reliably building safe powerful systems, whether it be nuclear power plants or advanced AI systems, is challenging. Doing so costs perhaps 10% to 100% of the cost of the system (measured in the most appropriate metric, such as money, time, etc.).
    • But incentives are aligned. Economic incentives are aligned so that companies or organizations will have correct incentives to build sufficiently safe systems. Companies don’t want to build bridges that fall down, or nuclear power plants that experience a meltdown.
    • But incentives will be aligned. Economic incentives are not perfectly aligned today, as we have seen with various scandals (oil spills, emissions fraud, financial fraud), but they will be after a few accidents lead to consumer pressure, litigation, or regulatory or other responses.[85]
    • But we will muddle through. Incentives are not aligned, and will never be fully. However, we will probably muddle through (get the risks small enough), as humanity has done with nuclear weapons and nuclear energy.
    • And other factors will strongly work against safety. Strong profit and power incentives, misperception, heterogeneous theories of safety, overconfidence and rationalization, and other pathologies conspire to deprive us of the necessary patience and humility to get it right. This view is most likely if there will not be evidence (such as recoverable accidents) from reckless development, and if the safety function is steep over medium levels of inputs (“This would not be a hard problem if we had two years to work on it, once we have the system. It will be almost impossible if we don’t.”).
  • Hard or Near Impossible: Building a safe superintelligence is like building a rocket and spacecraft for a moon-landing, without ever having done a test launch. It costs greater than, or much greater than, 100% of development costs.[86]
  • We don’t know.

[85] This assumes that recoverable accidents occur with sufficient probability before non-recoverable accidents.

[86] Yudkowsky, Eliezer. “So Far: Unfriendly AI Edition.” EconLog | Library of Economics and Liberty, 2016. http://econlog.econlib.org/archives/2016/03/so_far_unfriend.html.
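To make the cost thresholds concrete, here is a toy sketch (my own illustration, not anything from the agenda; the function name and example numbers are made up) that just restates the taxonomy above as a mapping from the relative cost of safety work to a difficulty tier:

```python
def difficulty_tier(safety_cost_fraction):
    """Map (cost of safety work) / (cost of developing the system) onto the
    agenda's rough tiers: Easy ~1-10%, Medium ~10-100%, Hard or Near
    Impossible >100% of development costs."""
    if safety_cost_fraction <= 0.10:
        return "Easy"
    if safety_cost_fraction <= 1.00:
        return "Medium"
    return "Hard or Near Impossible"

# Illustrative only: safety work at 5%, 50%, and 300% of development costs.
for fraction in (0.05, 0.50, 3.00):
    print(f"safety at {fraction:.0%} of development cost -> {difficulty_tier(fraction)}")
```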

There's an interesting appendix listing desiderata for good AI forecasting, which includes the catchy phrase "epistemically temporally fractal" (a phrase I feel compelled to find a place to use in my life). The first three points are reminiscent of Zvi's recent post.

Appendix A: Forecasting Desiderata
1. We want our forecasting targets to be indicators for relevant achievements. This includes targets that serve as (leading) indicators for important economic capabilities, such as a capability that would pose a substantial employment threat to a large group of people. It includes indicators for important security capabilities, such as in potent AI cyber-offense, surveillance and imagery intelligence, or lie detection. It includes indicators for important technical achievements, such as those that are thought to be crucial steps on the path to more transformative capabilities (e.g. AGI), those that are central to many problem areas, or those that would otherwise substantially accelerate AI progress.[115]
2. We want them to be accurate indicators, as opposed to noisy indicators that are not highly correlated with the important events. Specifically, where E is the occurrence or near occurrence of some important event, and Y is whether the target has been reached, we want P(not Y | not E) ~ 1 and P(Y | E) ~ 1. An indicator may fail to be informative if it can be “gamed” in that there are ways of achieving the indicator without the important event being near. It may be a noisy indicator if it depends on otherwise irrelevant factors, such as whether a target happens to take on symbolic importance as the focus of research.
3. We want them to be well specified: they are unambiguous and publicly observable, so that it will not be controversial to evaluate whether E has taken place. These could be either targets based on a commonly agreed objective metric such as an authoritative measure of performance, or a subjective target likely to involve agreement across judges. Judges will not ask later: “what did you mean?”[116]
4. We want them to be somewhat near-term probable: we should not be too confident in the near-term about whether they will occur. If they all have tiny probabilities (<1%) then we will not learn much after not seeing any of them resolve.[117] The closer the probability of a forecasting event and a set of predictions is to 50%, over a given time frame, the more we will learn about forecasting ability, and the world, over that time frame.
5. We ideally want them to be epistemically temporally fractal: we want them to be such that good forecasting performance on near-term forecasts is informative of good forecasting performance on long-term predictions. Near-term forecasting targets are more likely to have this property as they depend on causal processes that are likely to continue to be relevant over the long-term.
6. We want them to be jointly maximally informative. This means that we ideally want a set of targets that score well on the above criteria. A way in which this could not be so is if some targets are highly statistically dependent on others, such as if some are logically entailed by others. Another heuristic here is to aim for forecasting targets that exhaustively cover the different causal pathways to relevant achievements.
---
[115] The Good Judgment Project sometimes refers to an indicator that is relevant as one that is diagnostic of a bigger issue that we care about.
[116] Tetlock has called this the “Clairvoyance Test”: if you asked a clairvoyant about your forecasting question, would they be able to answer you or would they require clarification on what you meant. See Tetlock, Philip E., and Dan Gardner. Superforecasting: The art and science of prediction. Random House, 2016, and https://www.edge.org/conversation/philip_tetlock-how-to-win-at-forecasting
[117] Though we should learn a lot from seeing one such unexpected event occur. Thus, such a “long-shot” target would be a worthwhile forecasting target to a person who assigns intermediate subjective probability of it occurring, even if everyone else in the community is confident it will (not) occur.
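For desiderata 2 and 4 in particular, here is a minimal sketch (my own, with made-up numbers and function names, not from the agenda) of why indicator accuracy and near-50% probabilities matter: an accurate indicator (P(Y | E) ~ 1, P(not Y | not E) ~ 1) makes the posterior P(E | Y) track the event, and the expected information from a binary target resolving - its entropy - peaks when the prior is 50%.

```python
import math

def posterior_event_given_indicator(p_event, p_y_given_e, p_y_given_not_e):
    """Bayes rule: P(E | Y), given the indicator's hit rate P(Y | E), its
    false-positive rate P(Y | not E), and a prior P(E)."""
    p_y = p_y_given_e * p_event + p_y_given_not_e * (1 - p_event)
    return p_y_given_e * p_event / p_y

def expected_info_bits(p):
    """Entropy of a binary target with prior p: the expected number of bits
    learned when the target resolves."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Desideratum 2: an accurate indicator moves a 20% prior on the event E to a
# much higher posterior once the target Y resolves true.
print(round(posterior_event_given_indicator(0.20, 0.95, 0.05), 2))  # ~0.83

# Desideratum 4: targets with priors near 50% are the most informative.
for p in (0.01, 0.25, 0.50):
    print(p, round(expected_info_bits(p), 2))  # 0.08, 0.81, 1.0 bits
```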

Most of the sections are just lists of interesting questions for further research; the lists of questions seem fairly comprehensive. The section on avoiding arms races does more to conceptually break up the space - in particular, the third and fourth paragraphs distill the basic models around these topics in a way I found useful. My guess is that this section is most representative of Allan Dafoe's future work.

6.4 Avoiding or Ending the Race
Given the likely large risks from an AI race, it is imperative to examine possible routes for avoiding races or ending one underway. The political solutions to global public bads are, in increasing explicitness and institutionalization: norms, agreements (“soft law”), treaties, or institutions. These can be bilateral, multilateral, or global. Norms involve a rough mutual understanding about what (observable) actions are unacceptable and what sanctions will be imposed in response. Implicit norms have the advantage that they can arise without explicit consent, but the disadvantage that they tend to be crude, and are thus often inadequate and may even be misdirected.[109] A hardened form of international norms is customary law, though absent a recognized international judiciary this is not likely relevant for great-power cooperation.[110]
Diplomatic agreements and treaties involve greater specification of the details of compliance and enforcement; when well specified these can be more effective, but require greater levels of cooperation to achieve. Institutions, such as the WTO, involve establishing a bureaucracy with the ability to clarify ambiguous cases, verify compliance, facilitate future negotiations, and sometimes the ability to enforce compliance. International cooperation often begins with norms, proceeds to (weak) bilateral or regional treaties, and consolidates with institutions.
Some conjectures about when international cooperation in transformative AI will be more likely are when: (1) the parties mutually perceive a strong interest in reaching a successful agreement (great risks from non-cooperation or gains from cooperation, low returns on unilateral steps); (2) when the parties otherwise have a trusting relationship; (3) when there is sufficient consensus about what an agreement should look like (what compliance consists of), which is more likely if the agreement is simple, appealing, and stable; (4) when compliance is easily, publicly, and rapidly verifiable; (5) when the risks from being defected on are low, such as if there is a long “breakout time”, a low probability of a power transition because technology is defense dominant, and near-term future capabilities are predictably non-transformative; (6) the incentives to defect are otherwise low. Compared to other domains, AI appears in some ways less amenable to international cooperation--conditions (3), (4), (5), (6)--but in other ways could be more amenable, namely (1) if the parties come to perceive existential risks from unrestricted racing and tremendous benefits from cooperating, (2) because China and the West currently have a relatively cooperative relationship compared to other international arms races, and there may be creative technical possibilities for enhancing (4) and (5). We should actively pursue technical and governance research today to identify and craft potential agreements.
Third-Party Standards, Verification, Enforcement, and Control
One set of possibilities for avoiding an AI arms race is the use of third party standards, verification, enforcement, and control. What are the prospects for cooperation through third party institutions? The first model, almost certainly worth pursuing and feasible, is an international “safety” agency responsible for “establishing and administering safety standards.”[111] This is crucial to achieve common knowledge about what counts as compliance. The second, “WTO” or “IAEA”, model builds on the first by also verifying and ruling on non-compliance, after which it authorizes states to impose sanctions for noncompliance. The third model is stronger still, endowing the institution with sufficient capabilities to enforce cooperation itself. The fourth, “Atomic Development Authority” model, involves the agency itself controlling the dangerous materials; this would involve building a global AI development regime sufficiently outside the control of the great powers, with a monopoly on this (militarily) strategic technology. Especially in the fourth case, but also for the weaker models, great care will need to go into their institutional design to assure powerful actors, and ensure competence and good motivation.
Such third party models entail a series of questions about how such institutions could be implemented. What are the prospects that great powers would give up sufficient power to a global inspection agency or governing body? What possible scenarios, agreements, tools, or actions could make that more plausible? What do we know about how to build government that is robust against sliding into totalitarianism and other malignant forms (see section 4.1)? What can we learn from similar historical episodes, such as the failure of the Acheson-Lilienthal Report and Baruch Plan, the success of arms control efforts that led towards the 1972 Anti-Ballistic Missile (ABM) Treaty, and episodes of attempted state formation?[112]
There may also be other ways to escape the race. Could one side form a winning or encompassing coalition? Could one or several racers engage in unilateral “stabilization” of the world, without risking catastrophe? The section AI Ideal Governance discusses the desirable properties of a candidate world hegemon.
---
[109] Joseph Nye advocates for cyber-norms. Nye, Joseph S. “A Normative Approach to Preventing Cyberwarfare.” Project Syndicate, March 13, 2017. https://www.project-syndicate.org/commentary/global-norms-to-prevent-cyberwarfare-by-joseph-s--nye-2017-03. However, the case of cyber weapons may also point to some technologies that are much more limited in their potential to be controlled through arms control agreements. For an introduction, cf. Borghard, Erica D., and Shawn W. Lonergan. “Why Are There No Cyber Arms Control Agreements?” Council on Foreign Relations, January 16, 2018. https://www.cfr.org/blog/why-are-there-no-cyber-arms-control-agreements.
[110] Cf. Williamson, Richard. “Hard Law, Soft Law, and Non-Law in Multilateral Arms Control: Some Compliance Hypotheses.” Chicago Journal of International Law 4, no. 1 (April 1, 2003). https://chicagounbound.uchicago.edu/cjil/vol4/iss1/7.
[111] As per Article II of the IAEA Statute.