I’m Michael Aird, a Senior Research Manager at Rethink Priorities and guest fund manager at the Effective Altruism Infrastructure Fund. Opinions expressed are my own. You can give me anonymous feedback at this link.

With Rethink, I'm mostly focused on helping lead our AI Governance & Strategy team. I also do some nuclear risk research, give input on our Generalist Longtermism team's work, and do random other stuff.

Previously, I did a range of longtermism-y and research-y things as a Research Scholar at the Future of Humanity Institute, a Summer Research Fellow at the Center on Long-Term Risk, and a Researcher/Writer for Convergence Analysis. More on my background here.

I mostly post to the EA Forum.

If you think you or I could benefit from us talking, feel free to message me! For people interested in doing EA-related research/writing, testing their fit for that, "getting up to speed" on EA/longtermist topics, or writing for the Forum, I also recommend this post.


Information hazards and downside risks
Moral uncertainty



Two questions:

  1. Is it possible to also get something re-formatted via this service? (E.g., porting a Google Doc with many footnotes and tables to LessWrong or the EA Forum.)
  2. Is it possible to get feedback, proofreading, etc. via this service for things that won't be posts?
    • E.g. mildly infohazardous research outputs that will just be shared in the relevant research & policy community but not made public

(Disclaimer: I only skimmed this post, having landed here from Habryka's comment on It could be useful if someone ran a copyediting service. Apologies if these questions are answered already in the post.)

Thanks for this post! This seems like good advice to me. 

I made an Anki card on your three "principles that stand out" so I can retain those ideas. (Mainly for potentially suggesting to people I manage or other people I know - I think I already have roughly the sort of mindset this post encourages, but I think many people don't and that me suggesting these techniques sometimes could be helpful.)

It's not sufficient to argue that taking over the world will improve prediction accuracy. You also need to argue that during the training process (in which taking over the world wasn't possible), the agent acquired a set of motivations and skills which will later lead it to take over the world. And I think that depends a lot on the training process.

[...] if during training the agent is asked questions about the internet, but has no ability to edit the internet, then maybe it will have the goal of "predicting the world", but maybe it will have the goal of "understanding the world". The former incentivises control, the latter doesn't.

I agree with your key claim that it's not obvious/guaranteed that an AI system that has faced some selection pressure in favour of predicting/understanding the world accurately would then want to take over the world. I also think I agree that a goal of "understanding the world" is a somewhat less dangerous goal in this context than a goal of "predicting the world". But it seems to me that a goal of "understanding the world" could still be dangerous for basically the same reason as why "predicting the world" could be dangerous. Namely, some world states are easier to understand than others, and some trajectories of the world are easier to maintain an accurate understanding of than others. 

E.g., let's assume that the "understanding" is meant to be at a similar level of analysis to that which humans typically use (rather than e.g., being primarily focused at the level of quantum physics), and that (as in humans) the AI sees it as worse to have a faulty understanding of "the important bits" than "the rest". Given that, I think:

  • a world without human civilization or with far more homogeneity of its human civilization seems to be an easier world to understand
  • a world that stays pretty similar in terms of "the important bits" (not things like distant stars coming into/out of existence), rather than e.g. having humanity spread through the galaxy creating massive structures with designs influenced by changing culture, requires less further effort to maintain an understanding of and has less risk of later being understood poorly

I'd be interested in whether you think I'm misinterpreting your statement or missing some important argument.

(Though, again, I see this just as pushback against one particular argument of yours, and I think one could make a bunch of other arguments for the key claim that was in question.)

Thanks for this series! I found it very useful and clear, and am very likely to recommend it to various people.

Minor comment: I think "latter" and "former" are the wrong way around in the following passage?

By contrast, I think the AI takeover scenarios that this report focuses on have received much more scrutiny - but still, as discussed previously, have big question marks surrounding some of the key premises. However, it’s important to distinguish the question of how likely it is that the second species argument is correct, from the question of how seriously we should take it. Often people with very different perspectives on the latter actually don’t disagree very much on the former.

(I.e., I think you probably mean that, of people who've thought seriously about the question, probability estimates vary wildly but (a) tend to be above (say) 1 percentage point of x-risk from a second species risk scenario and (b) thus tend to suffice to make those people think humanity should put a lot more resources into understanding and mitigating the risk than we currently do. That's as opposed to people tending to disagree wildly on how much effort to put into this risk yet agreeing on how likely the risk is. Though I'm unsure, since I'm just guessing from context that "how seriously we should take it" means "how much resources should be spent on this issue", but in other contexts it'd mean "how likely is this to be correct" or "how big a deal is this", which people obviously disagree on a lot.)

FWIW, I feel that this entry doesn't capture all/most of how I see "meta-level" used. 

Here's my attempted description, which I wrote for another purpose. Feel free to draw on it here and/or to suggest ways it could be improved.

  • Meta-level and object-level = typically, “object-level” means something like “Concerning the actual topic at hand” while “Meta-level” means something like “Concerning how the topic is being tackled/researched/discussed, or concerning more general principles/categories related to this actual topic”
    • E.g., “Meta-level: I really appreciate this style of comment; I think you having a policy of making this sort of comment is quite useful in expectation. Object-level: I disagree with your argument because [reasons]”

Thanks for writing this. The summary table is pretty blurry / hard to read for me - do you think you could upload a higher resolution version? Or if for some reason that doesn't work on LessWrong, could you link to a higher resolution version stored elsewhere?

My Anki cards

Nanda broadly sees there as being 5 main types of approach to alignment research. 

Addressing threat models: We keep a specific threat model in mind for how AGI causes an existential catastrophe, and focus our work on things that we expect will help address the threat model.

Agendas to build safe AGI: Let’s make specific plans for how to actually build safe AGI, and then try to test, implement, and understand the limitations of these plans. With an emphasis on understanding how to build AGI safely, rather than trying to do it as fast as possible.

Robustly good approaches: In the long-run AGI will clearly be important, but we're highly uncertain about how we'll get there and what, exactly, could go wrong. So let's do work that seems good in many possible scenarios, and doesn’t rely on having a specific story in mind. Interpretability work is a good example of this.

De-confusion: Reasoning about how to align AGI involves reasoning about complex concepts, such as intelligence, alignment and values, and we’re pretty confused about what these even mean. This means any work we do now is plausibly not helpful and definitely not reliable. As such, our priority should be to do some conceptual work on how to think about these concepts and what we’re aiming for, and trying to become less confused.
I consider the process of coming up with each of the research motivations outlined in this post to be examples of good de-confusion work

Field-building: One of the biggest factors in how much Alignment work gets done is how many researchers are working on it, so a major priority is building the field. This is especially valuable if you think we’re confused about what work needs to be done now, but will eventually have a clearer idea once we’re within a few years of AGI. When this happens, we want a large community of capable, influential and thoughtful people doing Alignment work.

Nanda focuses on three threat models that he thinks are most prominent and are addressed by most current research:

Power-Seeking AI

You get what you measure
[The case given by Paul Christiano in What Failure Looks Like (Part 1)]

AI Influenced Coordination Failures
[The case put forward by Andrew Critch, eg in What multipolar failure looks like. Many players get AGI around the same time. They now need to coordinate and cooperate with each other and the AGIs, but coordination is an extremely hard problem. We currently deal with this with a range of existing international norms and institutions, but a world with AGI will be sufficiently different that many of these will no longer apply, and we will leave our current stable equilibrium. This is such a different and complex world that things go wrong, and humans are caught in the cross-fire.]

Nanda considers three agendas to build safe AGI to be most prominent:

Iterated Distillation and Amplification (IDA)

AI Safety via Debate

Solving Assistance Games
[This is Stuart Russell’s agenda, which argues for a perspective shift in AI towards a more human-centric approach.]

Nanda highlights 3 "robustly good approaches" (in the context of AGI risk):




[I doubt he sees these as exhaustive - though that's possible - and I'm not sure if he sees them as the most important/prominent/most central examples.]

Thanks for this! I found it interesting and useful. 

I don't have much specific feedback, partly because I listened to this via Nonlinear Library while doing other things rather than reading it, but I'll share some thoughts anyway since you indicated being very keen for feedback.

  • I in general think this sort of distillation work is important and under-supplied
  • This seems like a good example of what this sort of distillation work should be like - broken into different posts that can be read separately, starting with an overview, with each post broken down into clear and logical sections and subsections, and with use of bold, clarity about terms, and meta notes where relevant
  • Maybe it would've been useful to just name & link to sources on threat models, agendas to build safe AGI, and robustly good approaches that you don't discuss in any further detail? Rather than not mentioning them at all.
    • That could make it easier for people to dive deeper if they want, could help avoid giving the impression that the things you list are the only things in those categories, and could help people understand what you mean by the overall categories by seeing more examples of things within the categories.
    • This is assuming you think there are other discernible nameable constituents of those categories which you didn't name - I guess it's possible that you don't think that.
  • I'll post the Anki cards I made in a reply to this comment, on the off chance that they're of interest to you as oblique feedback or of interest to other people who want to use the same cards themselves

Adam Binks replied to this list on the EA Forum with:

To add to your list - Subjective Logic represents opinions with three values: degree of belief, degree of disbelief, and degree of uncertainty. One interpretation of this is as a form of second-order uncertainty. It's used for modelling trust. A nice summary here with interactive tools for visualising opinions and a trust network.
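
As a rough illustration of the three-value representation described above, here is a minimal Python sketch. The `Opinion` class, its field names, and the `base_rate` parameter are illustrative assumptions rather than anything from the quoted comment. The projected-probability formula (belief plus base rate times uncertainty) follows the standard presentation of binomial opinions in Subjective Logic, as far as I understand it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Hypothetical container for a Subjective Logic binomial opinion."""

    belief: float
    disbelief: float
    uncertainty: float
    # Assumption: prior probability used to "fill in" the uncertain mass.
    base_rate: float = 0.5

    def __post_init__(self) -> None:
        parts = (self.belief, self.disbelief, self.uncertainty, self.base_rate)
        if any(not 0.0 <= p <= 1.0 for p in parts):
            raise ValueError("all components must lie in [0, 1]")
        total = self.belief + self.disbelief + self.uncertainty
        if abs(total - 1.0) > 1e-9:
            raise ValueError("belief + disbelief + uncertainty must equal 1")

    def projected_probability(self) -> float:
        # Collapse the opinion to a single probability: the belief mass plus
        # the share of the uncertain mass assigned by the base rate.
        return self.belief + self.base_rate * self.uncertainty


# Two opinions that collapse to roughly the same probability (about 0.5)
# but differ greatly in how much uncertainty they carry.
confident = Opinion(belief=0.5, disbelief=0.5, uncertainty=0.0)
ignorant = Opinion(belief=0.0, disbelief=0.0, uncertainty=1.0)
print(confident.projected_probability(), ignorant.projected_probability())
```

The point of the last two lines is the "second-order uncertainty" reading: a single probability would treat `confident` and `ignorant` as the same, while the three-value form keeps them distinct.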
