Epistemic status: I've been thinking about this for a couple months and finally wrote it down. I don't think I'm saying anything new, but I think it's worth repeating loudly. My sample is skewed toward AI governance fellows; I've interacted with fewer technical AI safety researchers, so my inferences are fuzzier there. I more strongly endorse this argument for the governance crowd.
I've had 1-on-1s with roughly 75 fellows across the ERA, IAPS, GovAI, LASR, and Pivotal fellowships. These are a mix of career chats, research feedback, and casual conversations. I've noticed that in some fraction of these chats, the conversation gradually veers toward high-level, gnarly questions. "How hard is alignment, actually?" "How bad is extreme power concentration, really?"
Near the end of these conversations, I usually say something like: "idk, these questions are super hard, and I struggle to make progress on them, and when I do try my hand at tackling them, I feel super cognitively exhausted, and this makes me feel bad because it feels like a lot of my research and others' research are predicated on answers to these questions."
And then I sheepishly recommend Holden's essays on minimal-trust investigations and learning by writing. And then I tell them to actually do the thing.
The thing
By "the thing," I mean something like developing a first-principles understanding of why you believe AI is dangerous, such that you could reconstruct the argument from scratch without appealing to authority. Concretely, this might look like:
* Being able to coherently walk someone through at least one AI x-risk threat model, at a gears level
* Being able to simulate a top alignment researcher's worldview well enough that you could predict their takes on novel questions
* Writing down your own threat model and noticing where you get stuck, where you're confused, where you're deferring
I think a large fraction of researchers in AI safety/governance fellowships cannot do any of these things.