This represents thinking about AGI safety done under mentorship by Evan Hubinger. Thanks also to Buck Shlegeris, Noa Nabeshima, Thomas Kwa, Sydney Von Arx and Jack Ryan for helpful discussion and comments.
A common perspective to take with respect to AI safety is to try to construct a story in which artificial general intelligence (AGI) goes right. I have recently acquired another perspective that I find helpful: AGI safety as disrupting stories for why AGI will go poorly. I will start by presenting an analogy to convey intuition, then apply this perspective to concrete examples of current alignment strategies.
Suppose you have a bomb that is about to explode. Your task is to make sure no one gets hurt. How might you reason about how to do this?
One way to do this is backward chaining, or backchaining, which involves identifying a sub-goal that, if achieved, will accomplish your main goal. This process is then recursively applied to generate a concrete next action.
Applied to the situation at hand, I might identify that defusing the bomb would prevent everyone from getting hurt. However, I do not know how to defuse a bomb. A sub-goal that would help me defuse a bomb is learning about bombs, which might suggest that I take a class on bomb-defusal. This is not helpful because I do not have time to take such a class. In general, backchaining might fail when there are many possible ways to accomplish a goal because it might cause you to fixate on one possible path.
Suppose that instead I ask “why do I think the bomb explosion will go poorly?”. The answer I get seems much more useful: the bomb will throw lots of little bits everywhere and those bits will hit people. This gives me two possible subquestions to ask: “why will the bomb throw bits everywhere?” and “why will those bits hit people?”. Answering the second question: the people are close by and there is nothing stopping the bits. Again, this yields two possible recursive subquestions. Notice how this framing forces me to consider many possible paths towards solving my goal simultaneously.
The point of the second perspective is to construct a specific story for why the bomb explosion will hurt people. The goal is to eventually arrive at something that looks like “the timer will hit zero, which will activate this diode, allowing electricity to flow from the battery into the explosive. This will cause the bomb to explode, flinging bits of metal at high velocities. Since there are people close to the bomb and there is nothing in the way, these bits of metal will hit the people, hurting them”. If the story is complete, if it includes the full reason why the current situation might result in people getting hurt, then violating any step of the story should make the current situation safe.
In this case, one is left with a number of possible solutions:
- Make the timer not hit zero
- Make the diode never activate
- Break the circuit, not allowing electricity to flow into the explosive
- Move the people farther away
- Put something in-between the bomb and the people
- Make the people bits-of-metal-proof
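One way to picture the claim that violating any step of a complete story makes the situation safe is to treat the story as a conjunction of steps. The sketch below is a toy model only: the step names and the `story_predicts_harm` helper are illustrative labels invented here, not part of the analogy itself.

```python
# Toy model: a story of danger is a conjunction of steps.
# Keys are hypothetical labels for the steps of the bomb story;
# values are the interventions from the list above.
STORY_STEPS = {
    "timer_hits_zero": "make the timer not hit zero",
    "diode_activates": "make the diode never activate",
    "current_reaches_explosive": "break the circuit",
    "people_are_close": "move the people farther away",
    "nothing_in_the_way": "put something between the bomb and the people",
    "metal_hurts_people": "make the people bits-of-metal-proof",
}


def story_predicts_harm(step_holds: dict[str, bool]) -> bool:
    """The story only goes through if every one of its steps holds."""
    return all(step_holds[step] for step in STORY_STEPS)


def candidate_interventions() -> list[str]:
    """Each step of a complete story yields one way to disrupt it."""
    return list(STORY_STEPS.values())


if __name__ == "__main__":
    baseline = {step: True for step in STORY_STEPS}
    print("no intervention: harm predicted =", story_predicts_harm(baseline))
    for step, fix in STORY_STEPS.items():
        disrupted = {**baseline, step: False}
        print(f"{fix}: harm predicted = {story_predicts_harm(disrupted)}")
```

The model only says something about safety if the story really is complete. If a second, unlisted path to harm exists, falsifying one step of this story proves nothing about that other path.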
Asking for a story of danger is a mental move that encourages specificity instead of abstraction. It is easier to be specific about something you expect to happen as opposed to conditioning on things that might happen and searching for explanations. Once one has generated a specific story of danger, the only remaining task is to violate one of its assumptions.
Applied to AGI Safety
You think that there is something called “AGI” that is going to cause bad things to happen. Your task is to make bad things not happen. How might you start reasoning about this?
Instead of backchaining, we can try to disrupt the current forward chain that leads to a bad outcome. Instead of asking yourself “how do I make AGI go well?”, it might be more useful to ask yourself “why do I think AGI will go poorly?” 
Note that what follows is one particular way to break down one particular argument. I encourage people to consider other arguments for danger and other possible breakdowns.
We will consider, as an example, the argument laid out by Richard Ngo:
- 1. We’ll build AIs which are much more intelligent than humans; that is, much better than humans at using generalisable cognitive skills to understand the world.
- 2. Those AGIs will be autonomous agents which pursue long-term, large-scale goals, because goal-directedness is reinforced in many training environments, and because those goals will sometimes generalise to be larger in scope.
- 3. Those goals will by default be misaligned with what we want, because our desires are complex and nuanced, and our existing tools for shaping the goals of AIs are inadequate.
- 4. The development of autonomous misaligned AGIs would lead to them gaining control of humanity’s future, via their superhuman intelligence, technology and coordination - depending on the speed of AI development, the transparency of AI systems, how constrained they are during deployment, and how well humans can cooperate politically and economically.
Despite the relatively high level of specificity compared to previous arguments, this argument is still underspecified in a number of places (what’s goal-directedness? What’s a cognitive skill? Why are our existing tools inadequate? What is generalization?). This is not a slight against Richard; most of these questions do not have answers (yet).
While we could recursively expand all parts of this argument, there are simplifications we can make. The goal of constructing a specific story for danger is to identify the easiest parts to violate. Therefore, we can save effort by only recursively expanding the parts of the argument that are most likely to be false.
Argument (1) seems likely to be true because of the strong economic incentives at play. Strategies that are oriented around preventing the creation of AGI seem incredibly unlikely to succeed. I think the only plausible scenario in which AGI is not developed is because humanity has suffered a large setback (up to and including extinction) for a different reason.
Argument (2) seems relatively uncertain. We currently have a poor understanding of agency and goal-directed behavior. It seems plausible that goal-directedness tends to be reinforced during training, but far from completely determined. It is also not obvious that those goals will generalize to be larger in scope; in particular, there is large uncertainty over the difficulty of preventing this generalization.
Argument (3) seems relatively likely, although still far from determined. I am confident that human values are complicated. I am uncertain whether or not developing tools that enable better goal shaping is tractable, although I am not very optimistic.
Argument (4) seems relatively likely to be true because of the economic incentives at play. In particular, there seems to be a strong incentive to deploy opaque AI systems quickly and broadly. However, in cases where achieving domain generality for AI systems is difficult, deployment might remain relatively narrow. If opaque AI systems tend to behave strangely when deployed, there might be an incentive to increase transparency.
This suggests that further specifying arguments (2) and (3) is likely to be most fruitful.
(2) AGIs will be autonomous agents...
Those AGIs will be autonomous agents which pursue long-term, large-scale goals, because goal-directedness is reinforced in many training environments, and because those goals will sometimes generalise to be larger in scope.
We can expand this argument into:
- 2a. Goal-directedness is reinforced in many training environments.
- 2b. Goals will generalize to be large in scope.
- 2c. Therefore, AGIs will have goals with large scope.
Recursing on (2a): Why is goal-directedness reinforced in many training environments?
- 2a1. Training environments tend to reinforce behaviors that accomplish a given task well.
- 2a2. Goal-directedness is useful for many tasks.
We now have a pair of research questions:
- 2a1*. How can we construct training environments with incentives that result in not-goal-directed agents?
- 2a2*. What are tasks for which goal-directedness is not useful?
Notice that this process of breaking down questions has slightly missed the forest for the trees. In particular, “construct a combination of task and training environment that produces not-goal-directed agents” sufficiently defuses argument (2a) but doesn’t directly answer (2a1*) or (2a2*). In practice, we might expect some research agendas to partially defuse the story for AGI danger in multiple places, such that the entire story no longer holds. However, it still might be useful to analyze the degree of defusal various pieces undergo.
Trying to recurse on (2b), I don’t have a very good answer to “why will goals generalize to be large in scope?”, so let’s call research question 2b* “how can we prevent goals from generalizing to be large in scope?”
(3) AGI goals will be misaligned with what we want...
Those goals will by default be misaligned with what we want, because our desires are complex and nuanced, and our existing tools for shaping the goals of AIs are inadequate.
We can expand this argument into:
- 3a. Human values are complex.
- 3b. It is extremely difficult to instill complex goals in AIs.
- 3c. Therefore, AGI goals will be misaligned with human values.
Interrogating why human values are complex is not likely to yield actionable interventions, so I will not do it.
Recursing on (3b): Why is it difficult to instill complex goals in AIs?
- 3b1. Complex goals are hard to instill without small imperfections.
- 3b2. Any imperfection will result in misalignment.
This leaves us with a pair of research questions:
- 3b1*. How can we make it easier to instill complex goals in AIs?
- 3b2*. How can we make alignment robust to goal specification imperfection?
Further, note that (3a) and (3b) only imply that our AIs will not have human goals. This potentially leaves us with the additional avenue of 3c*: Are there simple goals that produce aligned agents?
Applied to Agendas
Specifying a story for AGI danger has allowed us to identify six research questions, answers to which would potentially weaken the story for danger, which should be equivalent to increasing the chance of safety. We can roughly attempt to pair each of these research questions with current AI Safety research that attempts to address it.
- 2a1*. How can we construct training environments with incentives that result in not-goal-directed agents? Relaxed adversarial training for myopia.
- 2a2*. What are tasks for which goal-directedness is not useful? Predicting data is possibly such a task, which gives you Microscope AI. Predicting science in particular gives you STEM AI.
- 2b*. How can we prevent goals from generalizing to be large in scope? Impact measures or myopia.
- 3b1*. How can we make it easier to instill complex goals in AIs? Value Learning.
- 3b2*. How can we make alignment robust to goal specification imperfection? Corrigibility.
- 3c*. Are there simple goals that are aligned with human values? Imitating humans.
In addition to thinking of part of AI Safety research as trying to address specific questions, we can consider how alignment proposals interact with the set of questions as a whole.
For example, AI Safety via debate might be safe because “win a debate” might be aligned with human values (3c*) and not generalize to be large in scope (2b*).
There is also work that doesn’t fit within this picture. Much of the work done on the Agent Foundations agenda, for instance, seeks to better understand assumptions about agency made by traditional stories for danger. One could think of this work as trying to create the pieces that will enable strong arguments for safety instead of trying to defuse arguments for danger.
Vague danger scenarios
Many disagreements about the probability of existential risk due to AGI involve different intuitions about what the default scenario is going to be. Some people suspect that if we don’t have an ironclad reason to suspect AGI will go well, it will almost certainly go poorly. Other people think that the first thing we try has a reasonable chance of going fairly well. One can imagine a spectrum with “disaster by default” on one side and “alignment by default” on the other. To the extent that one is closer to “disaster by default”, trying to defuse specific arguments for AGI danger seems like it's missing the forest for the trees, analogous to trying to improve computer security by not allowing users to use “password” as their password. To the extent that one is closer to “alignment by default”, trying to defuse specific arguments seems quite useful, closer to conducting a fault analysis on a hypothetical airplane crash.
If one believes that AGI will be misaligned by default, there is no particular reason why defusing specific arguments for AGI danger should make you more confident that AGI will be safe. In theory, every argument that gets defused should make you marginally more confident, but this update can be very small. Imagine someone presenting you with a 1000 line computer program. You tell them there’s a bug in their code, and they report back to you that they checked lines 1-10 and there was no bug. Are you more confident there isn’t a bug? Yes. Are you confident that there isn’t a bug? No. In these situations, backchaining to the desired outcome is more useful than breaking chains that lead to undesirable outcomes.
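The size of the update in the 1000-line example can be made concrete with a small Bayesian calculation. This is a sketch under assumptions the example does not state: there is at most one bug, it is equally likely to be on any line, the check never misses a bug on a line it covers, and the prior of 0.9 is an arbitrary illustrative number.

```python
def posterior_bug_probability(prior: float, total_lines: int, checked_lines: int) -> float:
    """P(bug exists | checked lines are clean).

    Assumes at most one bug, located uniformly at random among the lines,
    and a check that never misses a bug on a line it covers.
    """
    unchecked_fraction = (total_lines - checked_lines) / total_lines
    bug_and_unseen = prior * unchecked_fraction  # bug exists, but hides in unchecked lines
    no_bug = 1.0 - prior
    return bug_and_unseen / (bug_and_unseen + no_bug)


if __name__ == "__main__":
    prior = 0.9  # assumed prior that the program has a bug
    posterior = posterior_bug_probability(prior, total_lines=1000, checked_lines=10)
    print(f"prior = {prior}, posterior after checking lines 1-10 = {posterior:.4f}")
```

Under these assumptions the probability of a bug moves from 0.9 to about 0.899: more confident, but barely.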
One should believe defusing arguments for danger is useful only insofar as those arguments are cruxes for their belief that AGI is an existential risk. However, believing an argument is a crux does not guarantee that the argument has captured the true causal reason for why AGI is an existential risk. One can imagine being persuaded by a flawed argument, then working to defuse that argument in ways that are useless given the flaw.
For example, an argument for AGI danger might involve a ghost in the machine that modifies its own code to remove constraints. If that argument is a crux for someone, they might devote effort to designing safeguards against AGIs modifying their own source code. This seems unlikely to reduce the chances of AGI-induced existential risk, but it does defuse that particular argument for AGI danger.
In general, the confidence one should draw from research that aims at defusing arguments for danger should be proportional to the confidence that that particular argument captures the true reasons for danger. In practice, I suspect that most research is supported by multiple perspectives, including backchaining and argument defusal.
One way to think about AI Safety is as answering the question “how can I make AGI go well?” Considering this perspective carefully might allow you to backchain to a promising research agenda.
Another way of thinking about AI Safety is as answering the question “how can I prevent AGI from going poorly?” Since it is easier to be specific about AGI going poorly compared to AGI going well, this might be a useful perspective to consider and generate research directions.
Relying purely on this perspective is dangerous. Defusing arguments for danger is more convincing the more one anticipates alignment by default and correspondingly less convincing if one anticipates disaster by default. Additionally, one should be careful to weight any such update by the confidence one has in the argument being defused.
Overall, I think that defusing arguments for AGI danger is a useful perspective to take on AGI safety research. Much existing research does not fit cleanly into this perspective. Nevertheless, considering research from this angle has given me a clearer picture of how the entire space fits together.
For the interested, this is a good example of backchaining applied to AI safety. ↩︎
Technically, we want to expand the parts of the argument such that we think additional labor can most shift it from being “true” to “false”. Just expanding things that might be false seems like a good proxy. ↩︎
See The Rocket Alignment Problem for an example of such an argument. ↩︎
Rohin Shah puts about 30% on “the first thing we try just works and we don’t even need to solve any sort of alignment problem” in AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah. ↩︎
I think your conclusion section is really important, because it prevents a possible misinterpretation of your post.
Since I'm much closer to the "disaster by default" end of the spectrum, I think most of our effort should focus on the safety stories approach rather than the defusing dangers approach. And I think you haven't presented any arguments for safety by default; you've just explained what we should do if we believe in safety by default. So it would be a misinterpretation of your post to think that it argues for the defusing disaster strategy to take priority over the safety stories strategy.
Instead (and this is how I interpret your post) both strategies should be pursued no matter where on the spectrum you are, but to different extents. E.g. if you are in the middle, you split effort 50-50 between strategies, and if you are towards the alignment by default edge, you split effort 80-20, etc. This seems quite plausible to me.
I absolutely agree that I'm not arguing for "safety by default".
I don't quite agree that you should split effort between strategies, i.e. it seems likely that if you think 80% disaster by default, you should dedicate 100% of your efforts to that world.
OK, interesting. Well, here's my argument for effort-splitting then: There are probably diminishing returns to pursuing each strategy. In research in general, ideas and questions tend to cross-pollinate, etc. And if you are 20% confident that research project X is the most important, and 80% that research project Y is most important, and they are both on a similar topic, this seems like a classic case where you should do both (but with more effort towards Y).
This is more of an intuition than an argument, I guess. But what do you think?
My opposite intuition is suggested by the fact that if you're trying to guess correctly a series of random digits with 80% "1" and 20% "0", then you should always guess "1".
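The digit-guessing intuition can be checked with a short expected-accuracy calculation. This is a minimal sketch, assuming the digits are independent and that a guessing strategy is fully described by how often it guesses "1"; the function name is made up for illustration.

```python
def expected_accuracy(p_one: float, guess_one_rate: float) -> float:
    """Chance of a correct guess when the digit is "1" with probability p_one
    and the guesser independently says "1" with probability guess_one_rate."""
    return p_one * guess_one_rate + (1 - p_one) * (1 - guess_one_rate)


if __name__ == "__main__":
    p_one = 0.8
    always_one = expected_accuracy(p_one, guess_one_rate=1.0)
    probability_matching = expected_accuracy(p_one, guess_one_rate=0.8)
    print(f"always guess 1:       {always_one:.2f}")
    print(f"guess 1 80% of times: {probability_matching:.2f}")
```

Always guessing "1" is right 80% of the time, while splitting guesses 80-20 is right only 68% of the time. Accuracy here is linear in the guess rate, so there are no diminishing returns. That is the feature the effort-splitting argument disputes.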
I don't quite know how to model cross-pollination and diminishing returns. I think working on both for the information value is likely going to be very good. It seems hard to imagine a scenario where you're robustly confident that one project is 80% better taking diminishing returns into account without being able to create a 3rd project with the best features of both, but if you're in that scenario I think just spending all your efforts on the 80% project seems correct.
One example is deciding between 2 fundamentally different products your startup could be making. Suppose also that creating an MVP of either product that would provide information would take a really long time. In this situation, if you suspect one of them is 60% likely to be better than the other, it would be less useful to spend your time in a 60/40 split than to build the MVP of the one likely to be better and reevaluate after getting more information.
The version of your claim that I agree with is "In your current epistemic state, you should spend all your time pursuing the 80% project, but the 80% probably isn't that robust, working on a project has diminishing returns, and other projects will give more information value, globally the amount of time you expect to spend on the 80% project is about 80%."
Here's a way to model diminishing returns: The first hour of research on strategy X produces as much value as the next two hours, which produces as much value as the next four hours, etc. Value = log_2(hours). If this is true, then you should split your hours such that log_2(hourstowards80project)*0.8 + log_2(hourstoward20project)*0.2 is maximized, which I think means that you should distribute your hours across projects proportional to their probability... https://www.wolframalpha.com/input/?i=argmax%28log_2%28X%29*0.8+%2B+log_2%281-X%29*0.2%29 (I don't know much math so I'm not confident I'm doing this right)
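The maximization above can be checked numerically. The sketch below is a brute-force grid search and not a proof; it normalizes total hours to 1, as the linked query does, and the function names and grid resolution are arbitrary choices made here.

```python
import math


def expected_value(share_on_likely: float, p_likely: float = 0.8) -> float:
    """Probability-weighted log2 value when `share_on_likely` of the time goes to
    the more likely project and the rest goes to the other one."""
    return (p_likely * math.log2(share_on_likely)
            + (1 - p_likely) * math.log2(1 - share_on_likely))


def best_share(p_likely: float = 0.8, steps: int = 1000) -> float:
    """Grid-search the share of time that maximizes expected_value."""
    candidates = [i / steps for i in range(1, steps)]  # skip 0 and 1, where log2 diverges
    return max(candidates, key=lambda share: expected_value(share, p_likely))


if __name__ == "__main__":
    print("best share for the 80% project:", best_share(0.8))
    print("best share for a 60% project:  ", best_share(0.6))
```

Under this model the best split matches the probabilities, which agrees with the guess in the comment. The conclusion depends on the logarithmic form of the returns, which is an assumption and not an established fact about research.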
Value of information I hadn't even considered, but maybe we can bundle it up with diminishing returns and say it's part of the reason returns diminish.
Huh. It seems like there is some general theorem here that might be worth writing up. If we combine the heavy-tailed hypothesis with this theorem, maybe we get some sort of nontrivial and useful general heuristic: The optimal allocation of time/money/etc. is proportional to the probability that a project is the most valuable thing you can be doing. That is, take the options you are considering, and evaluate the probability that each option is the best of the bunch. Then, distribute your resources according to that probability. This will be optimal or approximately optimal so long as (1) returns to resources diminish logarithmically for each project at about the same rate, and (2) the best project is likely to be several times better than the next-best and so on (heavy-tailed distribution of project goodness). I think 2 is usually true for altruistic projects, and insofar as 1 is false, maybe it doesn't matter because we are ignorant of which project diminishes faster, or maybe we do know which project diminishes faster and we can adjust accordingly (it should just be another multiplier to the ratio when dividing up resources, I think). I expect someone has said all this before somewhere...
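For what it's worth, the proportional rule does fall out of a short Lagrangian argument in the idealized case. This is a sketch under the assumptions named above: project i is weighted by the probability p_i that it is the best option, the payoff is logarithmic in the share x_i of resources it receives, and the shares sum to one. The symbols p_i, x_i and λ are introduced here for illustration.

```latex
\begin{aligned}
\max_{x_1,\dots,x_n}\ & \sum_{i=1}^{n} p_i \log_2 x_i
  \quad \text{subject to} \quad \sum_{i=1}^{n} x_i = 1,\ x_i > 0, \\
\mathcal{L} &= \sum_{i=1}^{n} p_i \log_2 x_i - \lambda\Big(\sum_{i=1}^{n} x_i - 1\Big), \\
\frac{\partial \mathcal{L}}{\partial x_i} &= \frac{p_i}{x_i \ln 2} - \lambda = 0
  \ \Longrightarrow\ x_i = \frac{p_i}{\lambda \ln 2}, \\
\sum_{i=1}^{n} x_i = 1,\ \sum_{i=1}^{n} p_i = 1 &\ \Longrightarrow\ \lambda \ln 2 = 1
  \ \Longrightarrow\ x_i = p_i .
\end{aligned}
```

With two projects and probabilities 0.8 and 0.2 this gives the 80-20 split. The derivation does not model how much better the best project is than the rest. That enters only through the choice to weight each project by its probability of being best, so the heavy-tail condition is still an assumption here, not something derived.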
That's fair, I should properly write out the brain-like AGI danger scenario(s) that have been in my head, one of these days. :-)
I like this strategy a lot.
Also, there's a lonely sentence missing a completion:
Thanks! Also, oops - fixed.