My threat model is simple. If you build something which:
...then you've wound up on the wrong side of Darwin. Your physical and intellectual labor is an inefficient use of resources, you don't actually understand anything that's going on, and who/whatever is in charge doesn't need you for anything. You're economic [3] and evolutionary dead weight.
Now, you might not die. Dogs don't understand what's going on, and they have almost zero ability to affect human decisions. But we like dogs, so we keep them around as pets and breed them to better suit our preferences. Sometimes this breeding produces happy, healthy dogs, and sometimes it produces ridiculous looking animals with crippling health problems. Similarly, chimpanzees don't understand Homo sapiens, and they definitely have zero ability to affect our decisions. Still, we'd be sad if chimpanzees went extinct, so we preserve a tiny amount of wildlife habitat and keep some of them in zoos.
So my most optimistic scenario for superintelligence is that humans wind up as beloved house pets. We have no control beyond what our masters choose to grant us, and we understand basically nothing about what's going on. Then, in increasing order of badness, you get the "chimps" scenario, where the AIs keep a few of us around in marginal habitat, or the "Homo erectus" scenario, where we just go extinct. After that, you start to get into "fate worse than death" territory.
I don't think there's anything particularly deep or confusing about this model? It assumes that you can't actually control anything that's much smarter than you. And it assumes that losing power over your life to something with its own goals generally sucks in the long run. On the plus side, I can usually explain this model to anyone who has a rough grasp of either evolutionary biology or the history of colonialism.
Unfortunately, my model cashes out with frustrating recommendations:
I really wish we didn't have to do this.
"Learns from experience" is actually doing some heavy lifting here. Essentially, my belief is that intelligence is a "giant inscrutable matrix" with some spicy non-linearities, mapping from ambiguous sensor readings to probabilistic conclusions about the state of the world, and to probabilistic recommendations of what to do next. Simply put, this is not the sort thing that allows any bright-line guarantees. Then, on top of this, we add the ability to learn and change over time, which means that you now need to predict the future state of a giant, self-modifying inscrutable matrix with spicy non-linearities. ↩︎
Mutation (aka "learning") and differential replication of more successful mutations means you have successfully invoked the power of natural selection, which generally favors the most efficient replicators. Even multicellular organisms often die of cancer, because aligning mutable replicators is intractable in the long run. ↩︎
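A minimal simulation of that dynamic (parameters invented for illustration): copies that occasionally mutate, plus a resource limit that favors the fastest replicators, steadily erode an initially fully "aligned" population.

```python
import random

random.seed(0)

# Start with 100 identical, fully "aligned" replicators.
population = [{"rate": 1.0, "aligned": True} for _ in range(100)]

for generation in range(50):
    next_gen = []
    for agent in population:
        # Number of copies grows with replication rate
        # (fractional part = chance of one extra copy).
        n_copies = int(agent["rate"]) + (random.random() < agent["rate"] % 1.0)
        for _ in range(n_copies):
            child = dict(agent)
            if random.random() < 0.01:    # rare mutation ("learning" gone feral):
                child["aligned"] = False  # sheds the alignment constraint...
                child["rate"] *= 1.05     # ...and replicates slightly faster
            next_gen.append(child)
    # Finite resources: only the fastest replicators make it to the next round.
    next_gen.sort(key=lambda a: a["rate"], reverse=True)
    population = next_gen[:100]

aligned = sum(a["aligned"] for a in population) / len(population)
print(f"fraction still aligned after 50 generations: {aligned:.2f}")
```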
The Law of Comparative Advantage won't save you, because it assumes that the more productive and efficient entity can't just be copy-pasted to replace all labor. ↩︎
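A worked toy example (hypothetical numbers throughout): in the textbook story the slower party still gains from trade because the faster party's hours are scarce, but the conclusion flips once one more copy of the faster party costs almost nothing.

```python
# Textbook comparative advantage: two scarce workers, two tasks.
ai_output_per_hour = {"research": 10.0, "chores": 5.0}     # better at both
human_output_per_hour = {"research": 0.1, "chores": 1.0}

# Opportunity cost of one unit of chores, measured in research forgone:
ai_opportunity_cost = ai_output_per_hour["research"] / ai_output_per_hour["chores"]           # 2.0
human_opportunity_cost = human_output_per_hour["research"] / human_output_per_hour["chores"]  # 0.1
# The human has the comparative advantage in chores (0.1 < 2.0), so as long as
# AI hours are scarce, trading with the slow human makes everyone better off.

# The footnote's objection: scarcity is the load-bearing assumption.
# Hypothetical costs, for illustration only:
cost_per_ai_copy_per_hour = 1.0    # spin up one more copy for ~$1/hour
human_wage_per_hour = 20.0

cost_of_chores_via_human = human_wage_per_hour / human_output_per_hour["chores"]        # $20.00
cost_of_chores_via_copy = cost_per_ai_copy_per_hour / ai_output_per_hour["chores"]      # $0.20
print(f"chores via human:         ${cost_of_chores_via_human:.2f} per unit")
print(f"chores via one more copy: ${cost_of_chores_via_copy:.2f} per unit")
# Once the more productive entity can be copy-pasted, there is no task left
# where hiring the human is the cheaper option.
```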
Don't build superintelligence. Seriously, how about just not doing it?
I feel there's some sort of circular misunderstanding when I hear this. Humans aren't building AI; humans selected by large-scale processes are following local rewards. Moralizing at a cancer cell for pursuing a glucose gradient would be recognized as weird.
"Avoid being shamed or ostracized" isn't part of a cancer cell's incentive gradient, but it often is part of a human's.
I think the selection processes in question are plenty powerful enough to select people who are partially immune and who get local sycophantic feedback for success.
This is the very point where governments, including China's, have to step in and put a stop to anyone trying to build ASI. Unfortunately, this is a hard-to-sell decision unless some warning shots emerge, like a deployed AI making a fatal mistake or Agent-4 being caught misaligned.
It is easier to implement a policy where all AI-related companies above a certain threshold are overseen by some international body or by governments, so that no human would be able to avoid thoroughly checking that the models are actually aligned.
As for the second point[1] that @Random Developer makes, it seems to miss the fact that most AI researchers believe that alignment to any target, like the Oversight Committee's will in a scenario illustrating the Intelligence Curse, is solvable in principle. If alignment does end up solved, then it's up to governance to ensure that the creators of the aligned ASI point it at a target which lets humans retain the power to formulate the instructions the ASI would execute. If the ASI is aligned to the OC and to oligarchs possessing all the resources, then they would be unlikely to need to keep any other humans around. Maybe one should use this fact instead and try to ensure that governments do intervene with AI companies, so that no one tries to conduct AI-assisted coups or to use AI to displace workers without ensuring that the displaced workers receive the same share of GDP.
Quoting Random Developer, "If you must build superintelligence, then assume that you're inevitably going to lose control over the future, and that your best hope is to build the best "pet owner" you can."
My biggest critique of this approach is that it takes too literally the analogy that we will eventually be to superintelligence what dogs are to humans, and extrapolates it to suggest that we will be just as helpless as dogs are today.
Even if this comparison of intelligence is true in relative terms, in absolute terms we are still much smarter than dogs. We will still be able to logically comprehend (at a much simpler level relative to the AIs) what is good for us over the long term, in a way that dogs can't. It follows that if we manage to create aligned AI (it will listen to us and dumb things down without maliciously misrepresenting what's going on), we (well, some of us) will be able to steer the future.
My biggest critique of this approach is that it takes too literally the analogy that we will eventually be to superintelligence what dogs are to humans, and extrapolates it to suggest that we will be just as helpless as dogs are today.
Thank you, that's an interesting point. I'll try to lay out my counterargument as clearly as I can.
I mentioned dogs not because they have a specific level of intelligence relative to humans, but because they got a relatively good deal. Chimps are a lot smarter than dogs, and they're worse off. Homo erectus had culturally transmitted tools, some art, seafaring craft of some sort, and possibly language. And they're extinct. The only common factor across these cases is that the runners-up in the intelligence race didn't get to make the important decisions.
In fact, AGI wouldn't need to be much smarter than humans to outcompete us in the long run. For example, if it's no smarter than the average Nobel Prize researcher, if it's able to work productively for $1/hour, and if it's able to copy-and-paste multiple copies of itself, then it would already be our evolutionary superior. We might be able to remain in charge for a while. But that's sort of like how a multicellular organism can survive for many decades. In the end, if nothing else kills them first, multicellular organisms tend to die of cancer. This is a case of local Darwinian incentives gradually eroding "cellular alignment" with the larger multicellular organism. Similarly, if the world consists of slow, expensive, and frankly stupid humans, who can't even pass down learned knowledge "genetically" with a simple copy-paste (how primitive!), and also of highly cost-effective and intelligent AIs, then there's a constant danger of alignment failing somewhere, and a "cancerous" AI replicator escaping control.
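To put rough numbers on this (all figures hypothetical, chosen only to show the shape of the problem):

```python
# Back-of-the-envelope, with made-up but plausible-shaped numbers.
human_researcher_cost_per_year = 500_000      # salary + overhead, roughly
agi_cost_per_year = 1 * 24 * 365              # "$1/hour", running around the clock

print(f"one human costs about as much as "
      f"{human_researcher_cost_per_year / agi_cost_per_year:.0f} AGI instances")

# And the AGI passes down everything it has learned by copy-paste, while
# humans need decades per generation and start each one nearly from scratch.
doublings_per_year = 12                       # assume capacity doubles monthly
print(f"one seed instance after a year of doubling: {2 ** doublings_per_year} copies")
```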
So even if we somehow manage to create "aligned" AI, I don't expect that to last. When you're too stupid and too expensive to be allowed anywhere near the real economy, you're in a very dangerous long-term position.
We will still be able to logically comprehend (at a much simpler level relative to the AIs) what is good for us over the long term, in a way that dogs can't.
I'm not convinced of this. Paul Graham once described something he called the Blub paradox. He explained this in terms of programming languages, but I suspect that it applies more broadly:
Programmers get very attached to their favorite languages, and I don't want to hurt anyone's feelings, so to explain this point I'm going to use a hypothetical language called Blub. Blub falls right in the middle of the abstractness continuum. It is not the most powerful language, but it is more powerful than Cobol or machine language.
And in fact, our hypothetical Blub programmer wouldn't use either of them. Of course he wouldn't program in machine language. That's what compilers are for. And as for Cobol, he doesn't know how anyone can get anything done with it. It doesn't even have x (Blub feature of your choice).
As long as our hypothetical Blub programmer is looking down the power continuum, he knows he's looking down. Languages less powerful than Blub are obviously less powerful, because they're missing some feature he's used to. But when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn't realize he's looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.
When we switch to the point of view of a programmer using any of the languages higher up the power continuum, however, we find that he in turn looks down upon Blub. How can you get anything done in Blub? It doesn't even have y.
When we look "down", chimps are obviously stupider than we are. They don't have spoken language! They don't have books! They can't do real math! The can make "tools", sure, but they're basically pointy sticks, not factories, Space Shuttles, or computers. Their "economy" is based on family relationships and some individual reciprocity, and they don't have even one joint stock company. Their idea of military strategy is to gang up in a band and go murder some other chimps, without understanding the role of non-commissioned officers or combined arms!
Chimps, to put it politely, have no clue.
But let's try looking "up" the intelligence spectrum. What do we see? Well, it looks sort of like funny humans with some weird extra stuff. The AIs can't be that much smarter than we are, right? And if we ask nicely, I'm sure they can explain everything important to us.
But when the AIs look "down" towards Homo sapiens, they just shake their heads. Why, humans can't even understand Z! Even if you take something really simple, like how isomorphisms between topoi and subsets of the lambda calculus make it trivial to design powerful custom programming languages for specific tasks, their eyes just glaze over! Even primitive baby AIs like Opus 4.5 could understand that. Can you imagine trying to explain to a human what replaced the economy, lol?
So here are some things which I expect to be true:
My argument here is really just basic economics, politics and evolutionary biology. If you create something that renders human intellectual and physical labor economically worthless and evolutionarily uncompetitive, then the odds are excellent that you're going to lose control. Maybe the AI will like keeping humans around as glorified pets! But that will be the AI's decision, not ours.
Well, an aligned AI would do whatever the humans want.
If asked to not replicate even with the ability to, it wouldn't. Or maybe you can tell it to replicate just enough to help you root out the actual AI replicators being built elsewhere, then stop at that point.
I think your argument does show how hard and fragile it is to deeply align AI in this way, though.
I don't think you can rescue a sense of control or "steering" from a world with superintelligence, aligned or not. Even though we're smarter than dogs, once you accept that an ASI more profoundly understands reality, we will be in an analogous situation to dogs. Dogs can't conceptualize grocery stores, and yet we could dedicate ourselves to delivering them the best treats. Dogs might not care about how the supply chain is organized, but the kinds of treats they get and the impact they have on the world can't be meaningfully controlled by them, since they can't conceptualize it.
Blurring the lines even further, an ASI would understand the effect of exposing different truths to us about the nature of reality, so the types of priorities and trade-offs it makes in communication have a compounding effect that will steer us in given directions. Another analogy is being driven around a foreign country by a trusted translator; their preferences will unavoidably dominate how you conceptualize and interact with the country, even in the most benevolent scenarios.
I don't think you can rescue a sense of control or "steering" from a world with superintelligence, aligned or not.
I think some level of "steering" is possible in a world with aligned AI.
Suppose someone made a superintelligence that sat in its box, worked out whether P=NP, and printed an answer of YES/NO/MAYBE. And then it shut itself down. (To be clear, this isn't a box that the ASI can't escape; it's an ASI aligned to stay in its box.)
A world with ASI, but where humans are in control, is possible. It requires good alignment, and good coordination between humans. Although the "stay in the box, and do one thing" alignment feels philosophically simpler than the "coherent extrapolated volition" alignment.
This means paying a large capabilities tax. Most of the strange, wondrous, and powerful things that ASI could make simply don't exist in this world of boxed ASI.
Let's say you want to do something more useful than the P=NP bot above. You design an ASI to cure ageing. Its main output is a chemical formula in standard notation. This AI is carefully programmed to think about the biochemistry, and only the biochemistry. It's programmed to only go for a drug that works for standard drug-biochemistry reasons. Anything at all weird, ask a human. If the humans can't understand, don't.
I do understand your second point, but perhaps the effect could be countered by simply instructing the aligned ASI to provide facts as objectively as possible and explicitly try to avoid steering.
Of course, the ASI would be able to predict the human response more or less perfectly, and so would know ahead of time what it will be. But in the end I think what matters is that it's still a human making the call, which the AI respects, and that the human would have made the same call even if the ASI (hypothetically) couldn't know their full preferences.
If a parent were fully aligned with a child's preferences, asked a question already knowing the child's answer, and then acted accordingly, does it matter that the parent knew what the child was going to answer in the first place?
I like the parent/child analogy. To apply it to the human/AI dynamic, we need to imagine that it's mutually understood that the child will never grow up and that they'll be served by the parent for the rest of time. Now, concretely think about what it means for a parent to be aligned with a child's preferences. Does the parent arrange the world such that their child can get variations of their favorite candy and play video games all day? Or does the parent make the child study, so they get good grades compared to their peers and feel dignified? Or somewhere in between, based on how mad the child gets when deprived of the video game? The parent can constantly ask the child which angles they prefer, but the child can't comprehend the deeper implications and even the framing of truths can get them to give predictably different answers.
The life that the child will live is entirely dependent on the parent's preferences, because affecting the world routes through the parent's cognition. The child isn't meaningfully "making a call" if they're only making that specific call because their parent orchestrated the conditions for it, then presented a few options to them in bite-sized pieces, all the while knowing which one they'll take (they can even load in the next candy before the kid asks for it).
The loss of agency I'm describing isn't superficial. Another way to think about agency is in counterfactuals. I think there are many possible benevolent ASIs that would cater to the child in drastically different ways, such that the child would be in agreement and enthusiastic the whole time. Once we create a benevolent ASI, we're entering a regime where our decisions are no longer the cause of changes in the world. Only things that the ASI prefers will happen, and it would steer us in that direction with full understanding. I think your argument is essentially "but if it thinks our preferences are really important, we're still in control in some sense"; I'm saying "if it's a lot smarter than us, it will have to make many subtle decisions, large and small, and our preferences will be one small piece of a large machine. Our desires won't be coherent at that scale, and we won't be able to make sense of what's happening to engage with it."
I like the advice you've given. I recently wrote a message of final advice to the PIBBSxILIAD fellowship. It seems fairly closely related.
Final Advice from Jeremy
I have some final advice for your future careers as alignment researchers:
Keep your eye on the real problems and a full pathway to solving them. Don't be distracted by the short-term proxies of success. Don't trust your employers, mentors, colleagues, or funders to do this for you; they won't. Always strive to understand the entire stack of motivations for your work and how it fits into a full solution to the problem. If you discover that some part of the stack isn't as valid as you thought, switch research topics. In most fields your current set of skills is the most important factor in deciding what to work on. This is a field where catching up to the frontier of a subfield is relatively easy, so usually you should prioritize solving important problems over how well the problem fits with your current skills.
The real problem is building a superintelligence that you understand at a very deep level, such that you know it will act as you intended. You need to understand things like: How its goals are stored and why they will stay the same as its world model updates. How much it can rely on its world model. When and how you can specify goals that have easy-to-predict and safe consequences. How it might be motivated to improve itself, and why you can trust it to do this.
Your work will almost always be several steps upstream of this goal, and only solve a small subproblem. This is fine, as long as the connection is known and clearly communicated to others, so that the research community can prioritize necessary but neglected problems.
Have ambition. Hold yourself to a higher standard than the current field exemplifies. Your work kinda only matters if it makes significant advances. So take risks. Avoid low value experiments done just for publication. Think about the deep conceptual questions and allow them to motivate your research.
This attitude is somewhat associated with becoming a crackpot. To balance this out: Break down your research plans into small steps. Take care to communicate your ideas clearly and frequently. Work on small problems and help advance other people's research, but treat this as training for the real work.
I wonder if some of the confusion operates like this: the better you understand how insanely dangerous it would be to create a superintelligence, the more you also understand how insanely difficult it is to make a benign one, or even to say what a benign one would be. Those who speak do not know, but those who know cannot speak.
Eliezer has written of the dangers and the difficulty, and that the only near-term strategy is to shut it all down. But I have not seen what steps forward he would take from there.
In the fictional world of dath ilan, as described in planecrash, the guardians of that civilisation have succeeded in shutting it all down, and in a secret establishment called The Basement Of The World they are very, very cautiously studying the problem. But we see nothing of their work.
Possible spoiler for planecrash:
Given what the gods of Golarion really are, maybe more is revealed later in the story than I have read up to.
It feels deeply uncomfortable to be participating in an elite AI x-risk fellowship and tell your peer, manager, or mentor: "idk why ASI poses an existential risk."
With some nuances, if you don't know why ASI poses x-risk, a better word choice might be "idk if ASI poses x-risk".
That frame of mind might be good for successfully executing strategies 1 and 3.
By "the thing," I mean something like developing a first-principles understanding of why you believe AI is dangerous, such that you could reconstruct the argument from scratch without appealing to authority.
It is easy.
We do not have a theory of victory, or a win condition.
The usual (best) answer is "we solve alignment and build a glorious transhumanist future", without having a formal definition of what "solving alignment" means when it starts involving real humans as the thing we want our AI systems to be aligned to (vague gestures towards CEV), or a clear aesthetic vision of what "glorious transhumanist future" means (vague gestures towards the end of suffering).
If we have no theory of victory, we're going to lose: the actual outcome is still a precise thing, and even in the best case where our vague intuition is somehow met, some process ("random shit go!") will have to get from our fuzzy, confused desiderata to a precise outcome (one that we can't even foresee or judge, because we don't even know what we want or what it looks like).
Are there holes in this consideration? Yes. Maybe we could build an ASI-teacher that guides us through that (but there are obvious problems with that too). Maybe we could do things sufficiently slowly that we could decide ("we"? how?) and steer (but see how hard it is even to pause; steering is another beast entirely) as things advance and become clearer. The main takeaway is still "here be dragons".
Related but still off-topic: the entire field is advancing with its priorities backwards. We're building ASI before solving alignment; working on solving alignment before asking what we want collectively; asking what we want collectively before asking what we want personally. Everyone is trying to run before even learning to walk. Of course we're all going to fall.
Yeah, there are a few uncomfortable truths hidden in "asking what we want collectively" that mean the question can't be answered. Such as different groups wanting mutually exclusive things, and who exactly "we" is.
Easy enough: pick the set of moral rules you like best, and then work towards that AI winning. Who gets to set the tone while such a thing is possible? Amodei, Musk, Altman, Pichai, Xi?
My current vote is Amodei.
Epistemic status: I've been thinking about this for a couple months and finally wrote it down. I don't think I'm saying anything new, but I think it's worth repeating loudly. My sample is skewed toward AI governance fellows; I've interacted with fewer technical AI safety researchers, so my inferences are fuzzier there. I more strongly endorse this argument for the governance crowd.
I've had 1-on-1s with roughly 75 fellows across the ERA, IAPS, GovAI, LASR, and Pivotal fellowships. These are a mix of career chats, research feedback, and casual conversations. I've noticed that in some fraction of these chats, the conversation gradually veers toward high-level, gnarly questions. "How hard is alignment, actually?" "How bad is extreme power concentration, really?"
Near the end of these conversations, I usually say something like: "idk, these questions are super hard, and I struggle to make progress on them, and when I do try my hand at tackling them, I feel super cognitively exhausted, and this makes me feel bad because it feels like a lot of my research and others' research are predicated on answers to these questions."
And then I sheepishly recommend Holden's essays on minimal-trust investigations and learning by writing. And then I tell them to actually do the thing.
The thing
By "the thing," I mean something like developing a first-principles understanding of why you believe AI is dangerous, such that you could reconstruct the argument from scratch without appealing to authority. Concretely, this might look like:
I think a large fraction of researchers in AI safety/governance fellowships cannot do any of these things. Here's the archetype:
If this describes you, you are likely in the modal category. FWIW, this archetype is basically me, so I'm also projecting a bit!
Why this happens
I think the default trajectory of an AI safety/governance fellow is roughly: absorb the vibes, pick a project, execute, produce output. The "step back and build a first-principles understanding" phase gets skipped, and it gets skipped for predictable, structural reasons:
That said, I think a valid counterargument is: maybe the best way to build an inside view is to just do a ton of research. If you just work closely with good mentors, run experiments, hit dead ends, then the gears-level understanding will naturally emerge.
I think this view is partially true. Many researchers develop their best intuitions through the research process, not before it. And the fellowship that pressures people to produce output is probably better, on the margin, than one that produces 30 deeply confused people and zero papers. I don't want to overcorrect. The right answer is probably "more balance" rather than "eliminate paper/report output pressure."
Why it matters
In most research fields, it's fine to not do the thing. You can be a productive chemist without having a first-principles understanding of why chemistry matters. Chemistry is mature and paradigmatic. The algorithm for doing useful work is straightforward: figure out what's known, figure out what's not, run experiments on the unknown.
AI safety doesn't work like this. We're not just trying to advance a frontier of knowledge. We're trying to do the research with the highest chance of reducing P(doom), in a field that's still pre-paradigmatic, where the feedback loops are terrible and the basic questions remain unsettled. If you're doing alignment research and you can't articulate why you think alignment is hard, you're building on a foundation you haven't examined. You can't tell whether your project actually matters. You're optimizing for a metric you can't justify.
You can get by for a while by simply deferring to 80,000 Hours and Coefficient Giving's recommendations. But deferral has a ceiling, and the most impactful researchers are the ones who've built their own models and found the pockets of alpha.
And I worry that this problem will get worse over time. As we get closer to ASI, the pressure to race ahead with your research agenda without stepping back will only intensify. The feeling of urgency will crowd out curiosity. And the field will become increasingly brittle precisely when it most needs to be intellectually nimble.
What should you do?
If you don't feel deeply confused about AI risk, something is wrong. You've likely not stared into the abyss and confronted your assumptions. The good news is that there are concrete things you can do. The bad news is that none of them are easy. They all require intense cognitive effort and time.
For fellowship directors and research managers, I'd suggest making space for this.[1] One thing that could be useful is to encourage fellows to set a concrete confusion-reduction goal like what I've described above, in addition to the normal fellowship goals like networking and research.
Concluding thoughts
I don't want this post to read as "you should feel bad." The point is that confusion is undervalued and undersupplied in this field. Noticing that you can't reconstruct your beliefs from scratch isn't a failure in itself. It's only bad if you don't do anything about it!
I'm still working on this problem myself. And I imagine many others are too.
Though I assume that fellowship directors have noticed this issue and tried to solve it, and that it turned out to be hard.