The framework

Here, I will briefly introduce what I hope is a fundamental and potentially comprehensive set of questions that an AGI safety research agenda would need to answer correctly in order to be successful. In other words, I am claiming that a research agenda that neglects these questions would probably not actually be viable for the goal of AGI safety work arrived at in the previous post: to minimize the risk of AGI-induced existential threat.

I have tried to make this set of questions hierarchical, by which I simply mean that particular questions make sense to ask—and attempt to answer—before other questions; that there is something like a natural progression to building an AGI safety research agenda. As such, each question in this framework basically accepts the (hypothesized) answer from the previous question as input. Here are the questions:

  1. What is the predicted architecture of the learning algorithm(s) used by AGI?
  2. What are the most likely bad outcomes of this learning architecture?
  3. What are the control proposals for minimizing these bad outcomes?
  4. What are the implementation proposals for these control proposals?
  5. What is the predicted timeline for the development of AGI?

Some immediate notes and qualifications:

  • As stated above, note that each question Q directly builds from whatever one’s hypothesized answer is to Q-1. This is why I am calling this question-framework hierarchical.
  • Question 5 is not strictly hierarchical in this sense like questions 1-4. I consider one’s hypothesized AGI development timeline to serve as an important ‘hyperparameter’ that calibrates the search strategies that researchers adopt to answer questions 1-4.
  • I do not intend to rigidly claim that it is impossible to say anything useful about bad outcomes, for example, without first knowing an AGI’s learning algorithm architecture. In fact, most of the outcomes I will actually discuss in this sequence will be architecture-independent (I discuss them for that very reason). I do claim, however, that it is probably impossible to exhaustively mitigate bad outcomes without knowing the AGI’s learning algorithm architecture. Surely, the devil will be at least partly in the details.
  • I also do not intend to claim that AGI must consist entirely of learning algorithms (as opposed to learning algorithms being just one component of AGI). Rather, I claim that what makes the AGI safety control problem hard is that the AGI will presumably build many of its own internal algorithms through whatever learning architecture is instantiated. If there are other static or ‘hardcoded’ algorithms present in the AGI, these probably will not meaningfully contribute to what makes the control problem hard (largely because we will know about them in advance).
  • If we interpret the aforementioned goal of AGI safety research (minimize existential risk) as narrowly as possible, then we should consider “bad outcomes” in question 2 to be shorthand for “any outcomes that increase the likelihood of existential risk.” However, it seems totally conceivable that some researchers might wish to expand the scope of “bad outcomes” such that existential risk avoidance is still prioritized, but clearly-suboptimal-but-non-existential risks are still worth figuring out how to avoid.
  • Control proposals ≠ implementation proposals. I will be using the former to refer to things like imitative amplification, safety via debate, etc., while I’m using the latter to refer to the distinct problem of getting the people who build AGI to actually adopt these control proposals (i.e., to implement them).
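
The hierarchical structure described above can be sketched in code. This is only an illustration under stated assumptions: the names `Step`, `run_framework`, and `placeholder` are hypothetical and do not come from any existing library, and the placeholder answers stand in for actual research. The sketch shows only how each hypothesized answer to questions 1-4 is passed as input to the next question, while the question 5 timeline is supplied to every step as a shared 'hyperparameter'.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration only; the framework itself is not code.


@dataclass
class Step:
    question: str
    # Maps (previous hypothesized answer, timeline) to a new hypothesized answer.
    answer: Callable[[str, str], str]


def run_framework(steps: List[Step], timeline: str) -> List[str]:
    """Answer the questions in order, feeding each answer into the next one."""
    answers: List[str] = []
    previous = ""  # question 1 has no upstream answer
    for step in steps:
        previous = step.answer(previous, timeline)
        answers.append(previous)
    return answers


def placeholder(label: str) -> Callable[[str, str], str]:
    """Stand-in for real research: records what the answer was conditioned on."""
    def answer(previous: str, timeline: str) -> str:
        upstream = previous if previous else "nothing"
        return f"{label} (given {upstream}; timeline={timeline})"
    return answer


STEPS = [
    Step("Q1: predicted learning architecture?", placeholder("A1")),
    Step("Q2: most likely bad outcomes?", placeholder("A2")),
    Step("Q3: control proposals?", placeholder("A3")),
    Step("Q4: implementation proposals?", placeholder("A4")),
]
```

Calling `run_framework(STEPS, timeline="short")` returns four strings, each of which embeds the answer before it. Revising an upstream answer therefore changes everything downstream, which is the sense of 'hierarchical' used here.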

Prescriptive vs. descriptive interpretations

I have tailored the order of the proposed question progression to be both logically necessary and methodologically useful. Because of this, I think that this framework can be read in two different ways: first, with its intended purpose in mind—to sharpen how the goals of AGI safety research constrain the space of plausible research frameworks from which technical work can subsequently emerge (i.e., it can be read as prescriptive). A second way of thinking about this framework, however, is as a kind of low-resolution prediction about what the holistic progression of AGI safety research will ultimately end up looking like (i.e., it can be read as descriptive). Because each step in the question-hierarchy is logically predicated on the previous step, I believe this framework could serve as a plausible end-to-end story for how AGI safety research will move all the way from its current preparadigmatic state to achieving its goal of successfully implementing control proposals that mitigate AGI-induced existential risks. From this prediction-oriented perspective, then, these questions might also be thought of as the relevant anticipated ‘checkpoints’ for actualizing the goal of AGI safety research.

Let’s now consider each of the five questions in turn. Because they build upon one another, it makes sense to begin with the first question and work down the list.
 


A comment on your list of questions after reading the whole sequence: unlike John and Tekhne elsewhere in this comment thread, I am pretty comfortable with the hierarchical list of questions you are developing here.

This is a pretty useful set of questions that could be taken as starting points for all kinds of useful paradigmatic research.

I believe that part of John's lack of comfort with the above list of questions is caused by a certain speculative assumption he makes about AGI alignment, an assumption that is also made by many in MIRI, an assumption popular on this forum. The assumption is that, in order to solve AGI alignment, we first need to have nothing less than a complete scientific and philosophical revolution, a revolution that will make all current paradigms entirely obsolete.

If you believe in that speculative assumption, then your above step of already asking specific questions about AGI would be premature. It distracts from having a scientific revolution first.

John's speculative assumption is itself of course just another paradigm in the Kuhnian sense. It corresponds to a school of thought which says that AGI safety research must be about inventing entirely new paradigms, as opposed to, say, exploring how existing paradigms taken from many existing disciplines might be applied to the problem.

Myself, I am of the school that sees more value in exploring and combining existing paradigms. I think that approach is more likely to end up with actionable solutions for managing AGI safety risks. That being said, I think all here would agree that both schools could potentially come up with something valuable.

Your core claim is that all of these five questions need to be answered to minimize AI X-risk. Not only do I disagree with this, I claim that zero of these questions need to be answered to minimize AI X-risk.

Let's go through them in order...

What is the predicted architecture of the learning algorithm(s) used by AGI?

My mainline vision for a theory of alignment and agency would be sort of analogous to thermodynamics. Thermodynamics does not care about what architecture we use for our heat engines. Rather, it establishes the universal constraints which apply to all possible heat engines. (... or at least all heat engines which work with more-than-exponentially-tiny-probability.) Likewise, I want a theory of alignment and agency which establishes the universal constraints which apply to all agents (or at least all agents which "work" with more-than-exponentially-tiny-probability).

Why would we expect to be able to find such a theory? One argument: we don't expect that the alignment problem itself is highly architecture-dependent; it's a fairly generic property of strong optimization. So, "generic strong optimization" looks like roughly the right level of generality at which to understand alignment. (This is not the only argument for our ability to find such a theory, but it's a relatively simple one which doesn't need a lot of foundations.) Trying to zoom in on something narrower than that would add a bunch of extra constraints which are effectively "noise", for purposes of understanding alignment.

On top of that, there's the obvious problem that if we try to solve alignment for a particular architecture, it's quite probable that some other architecture will come along and all our work will be obsolete. (At the current pace of ML progress, this seems to happen roughly every 5 years.)

Put all that together, and I think this question is not only unnecessary, but plausibly actively harmful as a guide for alignment research.

(I also note that you have a whole section in your post on question 2 which correctly identifies most of the points I just made; all it's missing is the step of "oh, maybe we just don't actually need to know about the details of the architecture at all".)

What are the most likely bad outcomes of this learning architecture?

What are the control proposals for minimizing these bad outcomes?

I also think these two together are potentially actively harmful. I think the best explanation of this view is currently Yudkowsky's piece on Security Mindset; "figure out the most likely bad outcomes and then propose solutions to minimize these bad outcomes" is exactly what he's arguing against. One sentence summary: it's the unknown unknowns that kill us. The move we want is not "brainstorm failure modes and then avoid the things we brainstormed", it's "figure out what we want and then come up with a strategy which systematically achieves it (automatically ruling out huge swaths of failure modes simultaneously)".

What are the implementation proposals for these control proposals?

Setting aside that I don't agree with the "control proposals" framing, this question comes the closest to being actually necessary. Certainly we'll need implementations of something at some point.

On the other hand, starting from where we are now, I expect implementation to be relatively easy once we have any clue at all what to implement. So even if it's technically necessary to answer at some point, this question might not be very useful to think about ahead of time. We could solve the problem to a point where AI risk is minimized without necessarily putting significant thought into implementation proposals, especially if the core math ends up being obviously-tractable. (Though, to be clear, I don't think that's a good idea; trying to build a great edifice of theory without empirical feedback of some kind is rarely useful in practice.)

What is the predicted timeline for the development of AGI?

Personally, I consider timelines approximately-irrelevant for my research plans. Whatever the probable-shortest-path is to aligned AI, that's the path to follow, regardless of how long we have.

The case for timeline-relevance is usually "well, if we don't have any hope of properly solving the problem in time, then maybe we need a hail Mary". That's a valid argument in principle, but in practice, when we multiply together probability-of-hail-Mary-actually-working vs probability-that-AI-is-coming-that-soon, I expect that number to basically-never favor the hail Mary. It would require too high a probability of the Hail Mary working, and too little uncertainty about AGI being right around the corner.

Now, I do expect other people to disagree with that argument (mainly because they have less hope about solving alignment anytime soon without a Hail Mary). But remember that the post's original claim is that timeline estimates are necessary for alignment, which seems like far too strong a claim when I'm sitting here with an at-least-internally-coherent view in which timelines are mostly irrelevant.

More Generally...

Zooming out a level, I think the methodology used to generate these questions is flawed. If you want to identify necessary subquestions, then the main way I know how to do that is to consider a wide variety of approaches, and look for subquestions which are clearly crucial to all of them. Then, try to generate further approaches which circumvent those subquestions, and that counterexample-search-process will probably make clear why the subquestions are necessary.

When I imagine what process would generate the questions in this post, I imagine starting with one single approach, looking for subquestions which are clearly crucial to that one approach, and then trying to come up with arguments that those subquestions are necessary (without really searching for necessity-counterexamples to stress-test those arguments).

If I've mischaracterized your process, then I apologize in advance, but currently this hypothesis seems pretty likely.

My recommendation is to go find some entirely different approaches, look for patterns which hold up across approaches, and consider what underlying features of the problem generate those patterns.

On The Bright Side

Complaining aside, you've clearly correctly understood that the subquestions need to be necessary subquestions in order to form a paradigm; that necessity is what allows the paradigm to generalize across the work done by many different people.

I do think that insight is the rate-limiting factor for most people explicitly trying to come up with paradigms. So well done there! I think you're already past the biggest barrier. The next few barriers will involve a lot of frustrating work, a lot of coming up with frameworks which seem good to you only to have other people shoot holes in them, but I think you are probably capable of doing it if you decide to pursue it for a while.

Thanks for taking the time to write up your thoughts! I appreciate your skepticism. Needless to say, I don't agree with most of what you've written—I'd be very curious to hear if you think I'm missing something:

[We] don't expect that the alignment problem itself is highly-architecture dependent; it's a fairly generic property of strong optimization. So, "generic strong optimization" looks like roughly the right level of generality at which to understand alignment...Trying to zoom in on something narrower than that would add a bunch of extra constraints which are effectively "noise", for purposes of understanding alignment.

Surely understanding generic strong optimization is necessary for alignment (as I also spend most of Q1 discussing). How can you be so sure, however, that zooming into something narrower would effectively only add noise? You assert this, but this doesn't seem at all obvious to me. I write in Q2: "It is also worth noting immediately that even if particular [alignment problems] are architecture-independent [your point!], it does not necessarily follow that the optimal control proposals for minimizing those risks would also be architecture-independent! For example, just because an SL-based AGI and an RL-based AGI might both hypothetically display tendencies towards instrumental convergence does not mean that the way to best prevent this outcome in the SL AGI would be the same as in the RL AGI."

By analogy, consider the more familiar 'alignment problem' of training dogs (i.e., getting the goals of dogs to align with the goals of their owners). Surely there are 'breed-independent' strategies for doing this, but it is not obvious that these strategies will be sufficient for every breed: Afghan Hounds, for example, are apparently way harder to train than, say, Golden Retrievers. So in addition to the generic-dog-alignment-regime, Afghan Hounds require some additional special training to ensure they're aligned. I don't yet understand why you are confident that different possible AGIs could not follow this same pattern.

On top of that, there's the obvious problem that if we try to solve alignment for a particular architecture, it's quite probable that some other architecture will come along and all our work will be obsolete. (At the current pace of ML progress, this seems to happen roughly every 5 years.)

I think that you think that I mean something far more specific than I actually do when I say "particular architecture," so I don't think this accurately characterizes what I believe. I describe my view in the next post.

[It's] the unknown unknowns that kill us. The move we want is not "brainstorm failure modes and then avoid the things we brainstormed", it's "figure out what we want and then come up with a strategy which systematically achieves it (automatically ruling out huge swaths of failure modes simultaneously)".

I think this is a very interesting point (and I have not read Eliezer's post yet, so I am relying on your summary), but I don't see what the point of AGI safety research is if we take this seriously. If the unknown unknowns will kill us, how are we to avoid them even in theory? If we can articulate some strategy for addressing them, they are not unknown unknowns; they are "increasingly-known unknowns!" 

I spent the entire first post of this sequence devoted to "figuring out what we want" (we = AGI safety researchers). It seems like what we want is to avoid AGI-induced existential risks. (I am curious if you think this is wrong?) If so, I claim, here is a "strategy that might systematically achieve this:" we need to understand what we mean when we say AGI (Q1), figure out what risks are likely to emerge from AGI (Q2), mitigate these risks (Q3), and implement these mitigation strategies (Q4).  

If by "figure out what we want," you mean "figure out what we want out of an AGI," I definitely agree with this (see Robert's great comment below!). If by "figure out what we want," you mean "figure out what we want out of AGI safety research," well, that is the entire point of this sequence!

I expect implementation to be relatively easy once we have any clue at all what to implement. So even if it's technically necessary to answer at some point, this question might not be very useful to think about ahead of time.

I completely disagree with this. It will definitely depend on the competitiveness of the relevant proposals, the incentives of the people who have control over the AGI, and a bunch of other stuff that I discuss in Q4 (which hasn't even been published yet—I hope you'll read it!). 

in practice, when we multiply together probability-of-hail-Mary-actually-working vs probability-that-AI-is-coming-that-soon, I expect that number to basically-never favor the hail Mary.  

When you frame it this way, I completely agree. However, there is definitely a continuous space of plausible timelines between "all-the-time-in-the-world" and "hail-Mary," and I think the probabilities of success [P(success|timeline) * P(timeline)] fluctuate non-obviously across this spectrum. Again, I hope you will withhold your final judgment of my claim until you see how I defend it in Q5! (I suppose my biggest regret in posting this sequence is that I didn't just do it all at once.)
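
One way to write out the quantity in brackets above, as a sketch only: the symbols $t$ (a candidate AGI timeline) and $s$ (a research strategy) are notation introduced here for illustration, and the expression assumes that the timeline distribution $P(t)$ does not itself depend on the strategy chosen.

```latex
P(\text{success} \mid s) \;=\; \sum_{t} P(\text{success} \mid s, t)\, P(t)
```

On this reading, the timeline estimate matters for choosing a strategy only if the ranking of strategies by $P(\text{success} \mid s, t)$ changes as $t$ varies. If a single strategy is best at every $t$, it also maximizes the sum, whatever $P(t)$ turns out to be.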

Zooming out a level, I think the methodology used to generate these questions is flawed. If you want to identify necessary subquestions, then the main way I know how to do that is to consider a wide variety of approaches, and look for subquestions which are clearly crucial to all of them.

I think this is a bit uncharitable. I have worked with and/or talked to lots of different AGI safety researchers over the past few months, and this framework is the product of my having "consider[ed] a wide variety of approaches, and look[ed] for subquestions which are clearly crucial to all of them." Take, for instance, this chart in Q1: I am proposing a single framework for talking about AGI that potentially unifies brain-based vs. prosaic approaches. That seems like a useful and productive thing to be doing at the paradigm-level.

I definitely agree that things like how we define 'control' and 'bad outcomes' might differ between approaches, but I do claim that every approach I have encountered thus far operates using the questions I pose here (e.g., every safety approach cares about AGI architectures, bad outcomes, control, etc. of some sort). To test this claim, I would very much appreciate the presentation of a counterexample if you think you have one!

Thanks again for your comment, and I definitely want to flag that, in spite of disagreeing with it in the ways I've tried to describe above, I really do appreciate your skepticism and engagement with this sequence (I cite your preparadigmatic claim a number of times in it). 

As I said to Robert, I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.

Surely understanding generic strong optimization is necessary for alignment (as I also spend most of Q1 discussing). How can you be so sure, however, that zooming into something narrower would effectively only add noise? You assert this, but this doesn't seem at all obvious to me.

I mean, I don't actually need to defend the assertion all that much. Your core claim is that these questions are necessary, and therefore the burden is on you to argue not only that zooming in on something narrower might not just add noise, but that zooming in on something narrower will not just add noise. If it's possible that we could get to a point where AGI is no longer a serious threat without needing to answer the question, then the question is not necessary.

Also, regarding the Afghan hound example, I'd guess (without having read anything about the subject) that training Afghan hounds does not actually involve qualitatively different methods than training other dogs, they just need more of the same training and/or perform less well with the same level of training. Not that that's particularly central. The more important part is that I do not need to be confident that "different possible AGIs could not follow this same pattern"; you've taken upon yourself the burden of arguing that different possible AGIs must follow this pattern, otherwise question 1 might not be necessary.

If by "figure out what we want," you mean "figure out what we want out of an AGI," I definitely agree with this (see Robert's great comment below!).

That is basically what I mean, yes. I strongly recommend the Yudkowsky piece.

I completely disagree with [implementation being relatively easy/unhelpful to think about ahead of time]. It will definitely depend on the competitiveness of the relevant proposals, the incentives of the people who have control over the AGI, and a bunch of other stuff that I discuss in Q4 (which hasn't even been published yet—I hope you'll read it!).

Remember that if you want to argue necessity of the question, then it's not enough for these inputs to be relevant to the outcome of AGI, you need to argue that the question must be answered in order for AGI to go well. Just because some factors are relevant to the outcome does not mean that we must know those factors in advance in order to robustly achieve a good outcome.

However, there is definitely a continuous space of plausible timelines between "all-the-time-in-the-world" and "hail-Mary," and I think the probabilities of success [P(success|timeline) * P(timeline)] fluctuate non-obviously across this spectrum

Remember that if you want to argue necessity of the question, it is not enough for you to think that the probabilities fluctuate; you need a positive argument that the probabilities must fluctuate across the spectrum, by enough that the question must be addressed.

I definitely agree that things like how we define 'control' and 'bad outcomes' might differ between approaches, but I do claim that every approach I have encountered thus far operates using the questions I pose here (e.g., every safety approach cares about AGI architectures, bad outcomes, control, etc. of some sort). To test this claim, I would very much appreciate the presentation of a counterexample if you think you have one!

I think most of the strategies in MIRI's general cluster do not depend on most of these questions.

If it's possible that we could get to a point where AGI is no longer a serious threat without needing to answer the question, then the question is not necessary.

Agreed, this seems like a good definition for rendering anything as 'necessary.' 

Our goal: minimize AGI-induced existential threats (right?). 

My claim is that answering these questions is probably necessary for achieving this goal—i.e., P(achieving goal | failing to think about one or more of these questions) ≈ 0. (I say, "I am claiming that a research agenda that neglects these questions would probably not actually be viable for the goal of AGI safety work.")

That is, we would be exceedingly lucky if we achieve AGI safety's goal without thinking about 

  • what we mean when we say AGI (Q1),
  • what existential risks are likely to emerge from AGI (Q2),
  • how to address these risks (Q3),
  • how to implement these mitigation strategies (Q4), and
  • how quickly we actually need to answer these questions (Q5).

I really don't see how it could be any other way: if we want to avoid futures in which AGI does bad stuff, we need to think about avoiding (Q3/Q4) the bad stuff (Q2) that AGI (Q1) might do (and we have to do this all "before the deadline;" Q5). I propose a way to do this hierarchically. Do you see wiggle room here where I do not? 

FWIW, I also don't really think this is the core claim of the sequence. I would want that to be something more like: here is a useful framework for moving from point A (where the field is now) to point B (where the field ultimately wants to end up). I have not seen a highly compelling presentation of this sort of thing before, and I think it is very valuable in solving any hard problem to have a general end-to-end plan (which we probably will want to update as we go along; see Robert's comment).

I think most of the strategies in MIRI's general cluster do not depend on most of these questions.

Would you mind giving a specific example of an end-to-end AGI safety research agenda that you think does not depend on or attempt to address these questions? (I'm also happy to just continue this discussion off of LW, if you'd like.)

Would you mind giving a specific example of an end-to-end AGI safety research agenda that you think does not depend on or attempt to address these questions?

I think restricting oneself to end-to-end agendas is itself a mistake. One principle of e.g. the MIRI agenda is that we do not currently possess a strong enough understanding to create an end-to-end agenda which has any hope at all of working; anything which currently claims to be an end-to-end agenda is probably just ignoring the hard parts of the problem. (The Rocket Alignment Problem gives a good explanation of this view.)

I do think that finding necessary subquestions, or noticing that a given subquestion may not be necessary, is much easier than figuring out an end-to-end agenda. One can notice that e.g. an architecture-agnostic alignment strategy seems plausible (or arguably even necessary!) without figuring out all the steps of an end-to-end strategy.

Definitely agree that if we silo ourselves into any rigid plan now, it almost certainly won't work. However, I don't think 'end-to-end agenda' = 'rigid plan.' I certainly don't think this sequence advocates anything like a rigid plan. These are the most general questions I could imagine guiding the field, and I've already noted that I think this should be a dynamic draft. 

...we do not currently possess a strong enough understanding to create an end-to-end agenda which has any hope at all of working; anything which currently claims to be an end-to-end agenda is probably just ignoring the hard parts of the problem.

What hard parts of the problem do you think this sequence ignores?

(I explicitly claim throughout the sequence that what I propose is not sufficient, so I don't think I can be accused of ignoring this.)

Hate to just copy and paste, but I still really don't see how it could be any other way: if we want to avoid futures in which AGI does bad stuff, then we need to think about avoiding (Q3/Q4) the bad stuff (Q2) that AGI (Q1) might do (and we have to do this all "before the deadline;" Q5). This is basically tautological as far as I can tell. Do you agree or disagree with this if-then statement? 

I do think that finding necessary subquestions, or noticing that a given subquestion may not be necessary, is much easier than figuring out an end-to-end agenda.   

Agreed. My goal was to enumerate these questions. When I noticed that they followed a fairly natural progression, I decided to frame them hierarchically.  And, I suppose to your point, it wasn't necessarily easy to write this all up. I thought it would nonetheless be valuable to do so, so I did!

Thanks for linking the Rocket Alignment Problem—looking forward to giving it a closer read. 

I still really don't see how it could be any other way: if we want to avoid futures in which AGI does bad stuff, then we need to think about avoiding (Q3/Q4) the bad stuff (Q2) that AGI (Q1) might do (and we have to do this all "before the deadline;" Q5). This is basically tautological as far as I can tell. Do you agree or disagree with this if-then statement?

My comment at the top of this thread detailed my disagreement with that if-then statement, and I do not think any of your responses to my top-level comment actually justified the claim of necessity of the questions. Most of them made the same mistake, which I tried to emphasize in my response. This, for example:

How can you be so sure, however, that zooming into something narrower would effectively only add noise?

The question is not "How can John be so sure that zooming into something narrower would only add noise?", the question is "How can Cameron be so sure that zooming into something narrower would yield crucial information without which we have no realistic hope of solving the problem?".

I think this same issue applies to most of the rest of your replies to my original comment.

The question is not "How can John be so sure that zooming into something narrower would only add noise?", the question is "How can Cameron be so sure that zooming into something narrower would yield crucial information without which we have no realistic hope of solving the problem?".

I am not 'so sure'. As I said in the previous comment, I have only claimed it is probably necessary to, for instance, know more about AGI than just whether it is a 'generic strong optimizer.' I would only be comfortable making non-probabilistic claims about the necessity of particular questions in hindsight.

I don't think I'm making some silly logical error. If your question is, "Why does Cameron think it is probably necessary to understand X if we want to have any realistic hope of solving the problem?", well, I do not think this is rhetorical! I spend an entire post defending and elucidating each of these questions, and I hope by the end of the sequence, readers would have a very clear understanding of why I think each is probably necessary to think about (or I have failed as a communicator!). 

It was never my goal to defend the (probable) necessity of each of the questions in this one post—this is the point of the whole sequence! This post is a glorified introductory paragraph. 

I do not think, therefore, that this post serves as anything close to an adequate defense of this framework, and I understand your skepticism if you think this is all I will say about why these questions are important. 

However, I don't think your original comment—or any of this thread, for that matter—really addresses any of the important claims put forward in this sequence (which makes sense, given that I haven't even published the whole thing yet!). It also seems like some of your skepticism is being fueled by assumptions about what you predict I will argue as opposed to what I will actually argue (correct me if I'm wrong!).

I hope you can find the time to actually read through the whole thing once it's published before passing your final judgment. Taken as a whole, I think the sequence speaks for itself. If you still think it's fundamentally bullshit after having read it, fair enough :)

I believe that it is very sensible to bring this sort of structure into our approach to AGI safety research, but at the same time it seems very clear that we should update that structure to the best of our ability as we make progress in understanding the challenges and potentials of different approaches. 

It is a feedback loop where we make each step according to our best theory of where to make it, and use the understanding gleaned from that step to update the theory (when necessary), which could well mean that we retrace some steps and recalibrate (this can be the case within and across questions). I think this connects to what both Charlie and Tekhne have said, though I believe Tekhne could have been more charitable.

In this light, it makes sense to emphasize the openness of the theory to being updated in this way, which also qualifies the ways in which the theory is allowed to be yet incomplete. Putting more effort into clarifying what this update process should look like seems like a promising addition to the framework that you propose.

On a more specific note, I felt that Q5 could just be in position 2, and maybe a sixth question would be "What is the predicted timeline for stable safety/control implementations?" or something of the sort.

I also think that phrasing our research in terms of "avoiding bad outcomes" and "controlling the AGI" biases the way in which we pay attention to these problems. I am sure that you will also touch on this in the more detailed presentation of these questions, but at the resolution presented here, I would prefer the phrasing to be more open.
"Aiming at good outcomes while/and avoiding bad outcomes" captures more conceptual territory, while still allowing for the investigation to turn out that avoiding bad outcomes is more difficult and should be prioritised. This extends to the meta-question of whether existential risk can be best addressed by focusing on avoiding bad outcomes, rather than developing a strategy to get to good outcomes (which are often characterised by a better ability to deal with future risks) and avoid bad outcomes on the way there. It might rightfully appear that this is a more ambitious aim, but it is the less predisposed outlook! Many strategy games are based on the idea that you have to accumulate resources and avoid losses while at the same time improving your ability to accumulate resources and avoid losses in the future. Only focusing on the first aspect is a specific strategy in the space of possible ones, and often employed when one is close to losing. This isn't a perfect analogy in a number of ways, but serves to point out the more general outlook.
Similarly, we expect a superintelligent AGI to be beyond our ability to control at some point, which invokes notions of "self-control" on the part of the AGI or "justified trust" on our part; therefore, perhaps "influencing the development of the AGI" would be better, as, again, "influence" can cover more conceptual ground but can still be hardened into the more specific notion of "control" when appropriate.

Hey Robert—thanks for your comment!

it seems very clear that we should update that structure to the best of our ability as we make progress in understanding the challenges and potentials of different approaches. 

Definitely agree—I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.

"Aiming at good outcomes while/and avoiding bad outcomes" captures more conceptual territory, while still allowing for the investigation to turn out that avoiding bad outcomes is more difficult and should be prioritised. This extends to the meta-question of whether existential risk can be best addressed by focusing on avoiding bad outcomes, rather than developing a strategy to get to good outcomes (which are often characterised by a better ability to deal with future risks) and avoid bad outcomes on the way there.

I definitely agree with the value of framing AGI outcomes both positively and negatively, as I discuss in the previous post. I am less sure that AGI safety as a field necessarily requires deeply considering the positive potential of AGI (i.e., as long as AGI-induced existential risks are avoided, I think AGI safety researchers can consider their venture successful), but, much to your point, if the best way of actually achieving this outcome is by thinking about AGI more holistically—e.g., instead of explicitly avoiding existential risks, we might ask how to build an AGI that we would want to have around—then I think I would agree. I just think this sort of thing would radically redefine the relevant approaches undertaken in AGI safety research. I by no means want to reject radical redefinitions out of hand (I think this very well could be correct); I just want to say that it is probably not the path of least resistance given where the field currently stands.

(And agreed on the self-control point, as you know. See directionality of control in Q3.)

My main concern about sequences of questions like this (leaving aside specific nitpicks like whether we should try to minimize the bad or realize the good) is that they don't account for generalization on later questions.

E.g., if there are certain bad things that happen across many different architectures, this can be a powerful clue to us about how we should think of the problem. Such a generalization violates the hierarchy by telling us about problems without automatically providing a reduction to the lower level of AI architecture.

So if you use this sequence of questions as a strict roadmap for research, you'll miss opportunities for generalization.

Thanks for your comment—I entirely agree with this. In fact, most of the content of this sequence represents an effort to spell out these generalizations. (I note later that, e.g., the combinatorics of specifying every control proposal to deal with every conceivable bad outcome from every learning architecture is obviously intractable for a single report; this is a "field-sized" undertaking.) 

I don't think this is a violation of the hierarchy, however. It seems coherent to claim both (a) that, given the field's goal, AGI safety research should follow a general progression toward this goal (e.g., the one this sequence proposes), and (b) that there is plenty of productive work that can and should be done outside of this progression (for the reason you specify).

I look forward to hearing if you think the sequence walks this line properly!

These questions seem to cut up the conceptual space in an opinionated, and I think wrong, way.

What is the predicted architecture of the learning algorithm(s) used by AGI?

What even is an architecture? What's learning? What are learning algorithms, and how do they have architecture? What sort of architecture matters? I know it's trivially easy to ask questions like this; but I think insofar as question 1 has meaning, it's making assumptions that are actually probably wrong.

What are the most likely bad outcomes of this learning architecture?

We care about all this because of the outcomes, but that doesn't mean the outcomes themselves are a deep or handy way of understanding what's wrong with an AGI.

What are the control proposals for minimizing these bad outcomes?

"Minimizing bad outcomes" sounds like this is a basically continuous variable, which is controversial. "Control proposals", if it means what it sounds like, is assuming too much; how do you know a good alignment strategy looks like control rather than something else?

What is the predicted timeline for the development of AGI?

Are you saying it's necessary to have an answer to this, to have an approach to alignment? Why would that be? (You write that these are questions "....an AGI safety research agenda would need to answer correctly in order to be successful...".)

Hi Tekhne—this post introduces each of the five questions I will put forward and analyze in this sequence. I will be posting one a day for the next week or so. I think I will answer all of your questions in the coming posts.

I doubt that carving up the space in this—or any—way would be totally uncontroversial (there are lots of value judgments necessary to do such a thing), but I think this concern only serves to demonstrate that this framework is not self-justifying (i.e., there is still lots of clarifying work to be done for each of these questions). I agree with this—that's why I am devoting a post to each of them!

In order to minimize AGI-induced existential threats, I claim that we need to understand (i.e., anticipate; predict) AGI well enough (Q1) to determine what these threats are (Q2). We then need to figure out ways to mitigate these threats (Q3) and ways to make sure these proposals are actually implemented (Q4). How quickly we need to answer Q1-Q4 will be determined by how soon we expect AGI to be developed (Q5). I appreciate your skepticism, but I would counter that this actually seems like a fairly natural and parsimonious way to get from point A (where we are now) to point B (minimizing AGI-induced existential threats). That's why I claim that an AGI safety research agenda would need to answer these questions correctly in order to be successful.

Ultimately, I can only encourage you to wait for the rest of the sequence to be published before passing a conclusive judgment!