AI alignment landscape


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Here (link) is a talk I gave at EA Global 2019, where I describe how intent alignment fits into the broader landscape of “making AI go well,” and how my work fits into intent alignment. This is particularly helpful if you want to understand what I’m doing, but may also be useful more broadly. I often find myself wishing people were clearer about some of these distinctions.

Here is the main overview slide from the talk:

The highlighted boxes are where I spend most of my time.

Here are the full slides from the talk.


Summary for the Alignment Newsletter: Basically just pasting in the image.


Here are a few points about this decomposition that were particularly salient or interesting to me.
First, at the top level, the problem is decomposed into alignment, competence, and coping with the impacts of AI. The "alignment tax" (extra technical cost for safety) is only applied to alignment, and not competence. There isn't a tax in the "coping" section, but I assume that is simply due to a lack of space; I expect that extra work will be needed there too, though it may not be technical. I broadly agree with this perspective: to me, it seems like the major technical problem which differentially increases long-term safety is to figure out how to get powerful AI systems that are trying to do what we want, i.e. they have the right motivation. Such AI systems will hopefully make sure to check with us before taking unusual irreversible actions, making e.g. robustness and reliability less important. Note that techniques like verification, transparency, and adversarial training may still be needed to ensure that the alignment itself is robust and reliable (see the inner alignment box); the claim is just that robustness and reliability of the AI's capabilities are less important.
Second, strategy and policy work here is divided into two categories: improving our ability to pay technical taxes (extra work that needs to be done to make AI systems better), and improving our ability to handle impacts of AI. Often, generically improving coordination can help with both categories: for example, the publishing concerns around GPT-2 have allowed researchers to develop synthetic text detection (the first category) as well as to coordinate on when not to release models (the second category).
Third, the categorization is relatively agnostic to the details of the AI systems we develop -- these only show up in level 4, where Paul specifies that he is mostly thinking about aligning learning, and not planning and deduction. It's not clear to me to what extent the upper levels of the decomposition make as much sense when considering other types of AI systems: I wouldn't be surprised if I ended up thinking the decomposition was not as good for risks from e.g. powerful deductive algorithms, but it would depend on the details of how deductive algorithms become so powerful. I'd be particularly excited to see more work presenting more concrete models of powerful AGI systems, and reasoning about risks in those models, as was done in Risks from Learned Optimization.

Link to a video of the talk.

(edited post to include)