

Glad to hear that!

I do feel excited about this being used as a sort of "201 level" overview of AI strategy and what work it might be useful to do. And I'm aware of the report being included in the reading lists / curricula for two training programs for people getting into AI governance or related work, which was gratifying.

Unfortunately we did this survey before ChatGPT and various other events since then, which have majorly changed the landscape of AI governance work to be done, e.g. opening various policy windows. So I imagine people reading this report today may feel it has some odd omissions / vibes. But I still think it serves as a good 201 level overview despite that. Perhaps we'll run a followup in a year or two to provide an updated version. 

I'd consider those to be "in-scope" for the database, so the database would include any such estimates that I was aware of and that weren't too private to share in the database. 

If I recall correctly, some estimates in the database are decently related to that, e.g. are framed as "What % of the total possible moral value of the future will be realized?" or "What % of the total possible moral value of the future is lost in expectation due to AI risk?" 

But I haven't seen many estimates of that type, and I don't remember seeing any that were explicitly framed as "What fraction of the accessible universe's resources will be used in a way optimized for 'the correct moral theory'?"

If you know of some, feel free to comment in the database to suggest they be added :) 

...and while I hopefully have your attention: My team is currently hiring for a Research Manager! If you might be interested in managing one or more researchers working on a diverse set of issues relevant to mitigating extreme risks from the development and deployment of AI, please check out the job ad!

The application form should take <2 hours. The deadline is the end of the day on March 21. The role is remote and we're able to hire in most countries.

People with a wide range of backgrounds could turn out to be the best fit for the role. As such, if you're interested, please don't rule yourself out due to thinking you're not qualified unless you at least read the job ad first!

I found this thread interesting and useful, but I feel a key point has been omitted thus far (from what I've read): 

  • Public, elite, and policymaker beliefs and attitudes related to AI risk aren't just a variable we (members of the EA/longtermist/AI safety communities) have to bear in mind and operate in light of, but instead also a variable we can intervene on. 
  • And so far I'd say we have (often for very good reasons) done significantly less to intervene on that variable than we could've or than we could going forward. 
  • So it seems plausible that actually these people are fairly convincible if exposed to better efforts to really explain the arguments in a compelling way.

We've definitely done a significant amount of this kind of work, but I think we've often (a) deliberately held back on doing so or on conveying key parts of the arguments, due to reasonable downside risk concerns, and (b) not prioritized this. And I think there's significantly more we could do if we wanted to, especially after a period of actively building capacity for this. 

Important caveats / wet blankets:

  • I think there are indeed strong arguments against trying to shift relevant beliefs and attitudes in a more favorable direction, including not just costs and plausibly low upside but also multiple major plausible downside risks.[1] 
  • So I wouldn't want anyone to take major steps in this direction without checking in with multiple people working on AI safety/governance first. 
  • And it's not at all obvious to me we should be doing more of that sort of work. (Though I think whether, how, & when we should is an important question and I'm aware of and excited about a couple small research projects that are happening on that.) 

All I really want to convey in this comment is what I said in my first paragraph: we may be able to significantly push beliefs and opinions in favorable directions relative to where they are now or would be in future by default.

  1. Due to time constraints, I'll just point to this vague overview.

Personally I haven't thought about how strong the analogy to GoF is, but another thing that feels worth noting is that there may be a bunch of other cases where the analogy is similarly strong and where major government efforts aimed at risk-reduction have occurred. And my rough sense is that that's indeed the case, e.g. some of the examples here.

In general, at least for important questions worth spending time on, it seems very weird to say "You think X will happen, but we should be very confident it won't because in analogous case Y it didn't", without also either (a) checking for other analogous cases or other lines of argument or (b) providing an argument for why this one case is far more relevant evidence than any other available evidence. I do think it totally makes sense to flag the analogous case and to update in light of it, but stopping there and walking away feeling confident in the answer seems very weird.

I haven't read any of the relevant threads in detail, so perhaps the arguments made are stronger than I imply here, but my guess is they weren't. And it seems to me that it's unfortunately decently common for AI risk discussions on LessWrong to involve this pattern I'm sketching here. 

(To be clear, all I'm arguing here is that these arguments often seem weak, not that their conclusions are false.)

(This comment is raising an additional point to Jan's, not disagreeing.)

Update: Oh, I just saw Steve Byrnes also wrote the following in this thread, which I totally agree with:

"[Maybe one could argue] “It’s all very random—who happens to be in what position of power and when, etc.—and GoF is just one example, so we shouldn’t generalize too far from it” (OK maybe, but if so, then can we pile up more examples into a reference class to get a base rate or something? and what are the interventions to improve the odds, and can we also try those same interventions on GoF?)"


Two questions:

  1. Is it possible to also get something re-formatted via this service? (E.g., porting a Google Doc with many footnotes and tables to LessWrong or the EA Forum.)
  2. Is it possible to get feedback, proofreading, etc. via this service for things that won't be posts?
    • E.g. mildly infohazardous research outputs that will just be shared in the relevant research & policy community but not made public

(Disclaimer: I only skimmed this post, having landed here from Habryka's comment on "It could be useful if someone ran a copyediting service". Apologies if these questions are answered already in the post.)

Thanks for this post! This seems like good advice to me. 

I made an Anki card on your three "principles that stand out" so I can retain those ideas. (Mainly for potentially suggesting to people I manage or other people I know - I think I already have roughly the sort of mindset this post encourages, but I think many people don't and that me suggesting these techniques sometimes could be helpful.)

It's not sufficient to argue that taking over the world will improve prediction accuracy. You also need to argue that during the training process (in which taking over the world wasn't possible), the agent acquired a set of motivations and skills which will later lead it to take over the world. And I think that depends a lot on the training process.

[...] if during training the agent is asked questions about the internet, but has no ability to edit the internet, then maybe it will have the goal of "predicting the world", but maybe it will have the goal of "understanding the world". The former incentivises control, the latter doesn't.

I agree with your key claim that it's not obvious/guaranteed that an AI system that has faced some selection pressure in favour of predicting/understanding the world accurately would then want to take over the world. I also think I agree that a goal of "understanding the world" is a somewhat less dangerous goal in this context than a goal of "predicting the world". But it seems to me that a goal of "understanding the world" could still be dangerous for basically the same reason as why "predicting the world" could be dangerous. Namely, some world states are easier to understand than others, and some trajectories of the world are easier to maintain an accurate understanding of than others. 

E.g., let's assume that the "understanding" is meant to be at a similar level of analysis to that which humans typically use (rather than e.g., being primarily focused at the level of quantum physics), and that (as in humans) the AI sees it as worse to have a faulty understanding of "the important bits" than "the rest". Given that, I think:

  • a world without human civilization or with far more homogeneity of its human civilization seems to be an easier world to understand
  • a world that stays pretty similar in terms of "the important bits" (not things like distant stars coming into/out of existence), rather than e.g. having humanity spread through the galaxy creating massive structures with designs influenced by changing culture, requires less further effort to maintain an understanding of and has less risk of later being understood poorly

I'd be interested in whether you think I'm misinterpreting your statement or missing some important argument.

(Though, again, I see this just as pushback against one particular argument of yours, and I think one could make a bunch of other arguments for the key claim that was in question.)

Thanks for this series! I found it very useful and clear, and am very likely to recommend it to various people.

Minor comment: I think "latter" and "former" are the wrong way around in the following passage?

By contrast, I think the AI takeover scenarios that this report focuses on have received much more scrutiny - but still, as discussed previously, have big question marks surrounding some of the key premises. However, it’s important to distinguish the question of how likely it is that the second species argument is correct, from the question of how seriously we should take it. Often people with very different perspectives on the latter actually don’t disagree very much on the former.

(I.e., I think you probably mean that, of people who've thought seriously about the question, probability estimates vary wildly but (a) tend to be above (say) 1 percentage point of x-risk from a second species risk scenario and (b) thus tend to suffice to make the people think humanity should put a lot more resources into understanding and mitigating the risk than we currently do. Rather than that people tend to wildly disagree on how much effort to put into this risk yet agree on how likely the risk is. Though I'm unsure, since I'm just guessing from context that "how seriously we should take it" means "how much resources should be spent on this issue", but in other contexts it'd mean "how likely is this to be correct" or "how big a deal is this", which people obviously disagree on a lot.)

FWIW, I feel that this entry doesn't capture all/most of how I see "meta-level" used. 

Here's my attempted description, which I wrote for another purpose. Feel free to draw on it here and/or to suggest ways it could be improved.

  • Meta-level and object-level = typically, “object-level” means something like “Concerning the actual topic at hand” while “Meta-level” means something like “Concerning how the topic is being tackled/researched/discussed, or concerning more general principles/categories related to this actual topic”
    • E.g., “Meta-level: I really appreciate this style of comment; I think you having a policy of making this sort of comment is quite useful in expectation. Object-level: I disagree with your argument because [reasons]”