My threat model is simple. If you build something which:
...then you've wound up on the wrong side of Darwin. Your physical and intellectual labor is an inefficient use of resources, you don't actually understand anything that's going on, and who/whatever is in charge doesn't need you for anything. You're economic [3] and evolutionary dead weight.
Now, you might not die. Dogs don't understand what's going on, and they have almost zero ability to affect human decisions. But we like dogs, so we keep them around as pets and breed them to better suit our preferences. Sometimes this breeding produces happy, healthy dogs, and sometimes it produces ridiculous looking animals with crippling health problems. Similarly, chimpanzees don't understand Homo sapiens, and they definitely have zero ability to affect our decisions. Still, we'd be sad if chimpanzees went extinct, so we preserve a tiny amount of wildlife habitat and keep some of them in zoos.
So my most optimistic scenario for superintelligence is that humans wind up as beloved house pets. We have no control beyond what our masters choose to grant us, and we understand basically nothing about what's going on. Then, in increasing order of badness, you get the "chimps" scenario, where the AIs keep a few of us around in marginal habitat, or the "Homo erectus" scenario, where we just go extinct. After that, you start to get into "fate worse than death" territory.
I don't think there's anything particularly deep or confusing about this model? It assumes that you can't actually control anything that's much smarter than you. And it assumes that losing power over your life to something with its own goals generally sucks in the long run. On the plus side, I can usually explain this model to anyone who has a rough grasp of either evolutionary biology or the history of colonialism.
Unfortunately, my model cashes out with frustrating recommendations:
I really wish we didn't have to do this.
"Learns from experience" is actually doing some heavy lifting here. Essentially, my belief is that intelligence is a "giant inscrutable matrix" with some spicy non-linearities, mapping from ambiguous sensor readings to probabilistic conclusions about the state of the world, and to probabilistic recommendations of what to do next. Simply put, this is not the sort thing that allows any bright-line guarantees. Then, on top of this, we add the ability to learn and change over time, which means that you now need to predict the future state of a giant, self-modifying inscrutable matrix with spicy non-linearities. ↩︎
Mutation (aka "learning") and differential replication of more successful mutations means you have successfully invoked the power of natural selection, which generally favors the most efficient replicators. Even multicellular organisms often die of cancer, because aligning mutable replicators is intractable in the long run. ↩︎
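A minimal simulation of that dynamic (parameters invented for illustration): copies that occasionally mutate, plus a resource limit that favors the fastest replicators, steadily erode an initially fully "aligned" population.

```python
import random

random.seed(0)

# Start with 100 identical, fully "aligned" replicators.
population = [{"rate": 1.0, "aligned": True} for _ in range(100)]

for generation in range(50):
    next_gen = []
    for agent in population:
        # Number of copies grows with replication rate
        # (fractional part = chance of one extra copy).
        n_copies = int(agent["rate"]) + (random.random() < agent["rate"] % 1.0)
        for _ in range(n_copies):
            child = dict(agent)
            if random.random() < 0.01:    # rare mutation ("learning" gone feral):
                child["aligned"] = False  # sheds the alignment constraint...
                child["rate"] *= 1.05     # ...and replicates slightly faster
            next_gen.append(child)
    # Finite resources: only the fastest replicators make it to the next round.
    next_gen.sort(key=lambda a: a["rate"], reverse=True)
    population = next_gen[:100]

aligned = sum(a["aligned"] for a in population) / len(population)
print(f"fraction still aligned after 50 generations: {aligned:.2f}")
```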
The Law of Comparative Advantage won't save you, because it assumes that the more productive and efficient entity can't just be copy-pasted to replace all labor. ↩︎
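A worked toy example (hypothetical numbers throughout): in the textbook story the slower party still gains from trade because the faster party's hours are scarce, but the conclusion flips once one more copy of the faster party costs almost nothing.

```python
# Textbook comparative advantage: two scarce workers, two tasks.
ai_output_per_hour = {"research": 10.0, "chores": 5.0}     # better at both
human_output_per_hour = {"research": 0.1, "chores": 1.0}

# Opportunity cost of one unit of chores, measured in research forgone:
ai_opportunity_cost = ai_output_per_hour["research"] / ai_output_per_hour["chores"]           # 2.0
human_opportunity_cost = human_output_per_hour["research"] / human_output_per_hour["chores"]  # 0.1
# The human has the comparative advantage in chores (0.1 < 2.0), so as long as
# AI hours are scarce, trading with the slow human makes everyone better off.

# The footnote's objection: scarcity is the load-bearing assumption.
# Hypothetical costs, for illustration only:
cost_per_ai_copy_per_hour = 1.0    # spin up one more copy for ~$1/hour
human_wage_per_hour = 20.0

cost_of_chores_via_human = human_wage_per_hour / human_output_per_hour["chores"]        # $20.00
cost_of_chores_via_copy = cost_per_ai_copy_per_hour / ai_output_per_hour["chores"]      # $0.20
print(f"chores via human:         ${cost_of_chores_via_human:.2f} per unit")
print(f"chores via one more copy: ${cost_of_chores_via_copy:.2f} per unit")
# Once the more productive entity can be copy-pasted, there is no task left
# where hiring the human is the cheaper option.
```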
Don't build superintelligence. Seriously, how about just not doing it?
I feel there's some sort of circular misunderstanding when I hear this. Humans aren't building AI; humans selected by large-scale processes are following local rewards. Moralizing at a cancer cell for pursuing a glucose gradient would be recognized as weird.
"Avoid being shamed or ostracized" isn't part of a cancer cell's incentive gradient, but it often is part of a human's.
I think the selection processes in question are plenty powerful enough to select people who are partially immune and who get local sycophantic feedback for success.
This is the very point where governments, including China's, have to step in and put a stop to anyone trying to build ASI. Unfortunately, this is a hard-to-sell decision unless some warning shots emerge, like a deployed AI making a fatal mistake or Agent-4 being caught misaligned.
It is easier to implement a policy where all AI-related companies above a certain threshold are overseen by some international body or by governments, so that no human would be able to avoid thoroughly checking that the models are actually aligned.
As for the second point[1] that @Random Developer makes, it seems to miss the fact that most AI researchers believe that alignment to any target, like the Oversight Committee's will in a scenario illustrating the Intelligence Curse, is solvable in principle. If alignment does end up solved, then it's up to governance to ensure that the creators of the aligned ASI point it at a target which lets humans retain the power to formulate the instructions the ASI would execute. If the ASI is aligned to the OC and to oligarchs possessing all the resources, then they would be unlikely to need to keep any other humans around. Maybe one should use this fact instead and try to ensure that governments do intervene with AI companies, so that no one tries to conduct AI-assisted coups or to use AI to displace workers without ensuring that the displaced workers receive the same share of GDP.
Quoting Random Developer, "If you must build superintelligence, then assume that you're inevitably going to lose control over the future, and that your best hope is to build the best "pet owner" you can."
My biggest critique of this approach is that it takes too literally the analogy that we will eventually be to superintelligence what dogs are to humans, and extrapolates it to suggest that we will be just as helpless as dogs are today.
Even if this comparison of intelligence is true in relative terms, in absolute terms we are still much smarter than dogs. We will still be able to logically comprehend (at a much simpler level relative to the AIs) what is good for us over the long term, in a way that dogs can't. It follows that if we manage to create aligned AI (it will listen to us and dumb things down without maliciously misrepresenting what's going on), we (well, some of us) will be able to steer the future.
My biggest critique of this approach is that it takes too literally the analogy that we will eventually be to superintelligence what dogs are to humans, and extrapolates it to suggest that we will be just as helpless as dogs are today.
Thank you, that's an interesting point. I'll try to lay out my counterargument as clearly as I can.
I mentioned dogs not because they have a specific level of intelligence relative to humans, but because they got a relatively good deal. Chimps are a lot smarter than dogs, and they're worse off. Homo erectus had culturally transmitted tools, some art, seafaring craft of some sort, and possibly language. And they're extinct. The only common factor across these cases is that the runners-up in the intelligence race didn't get to make the important decisions.
In fact, AGI wouldn't need to be much smarter than humans to outcompete us in the long run. For example, if it's no smarter than the average Nobel Prize researcher, if it's able to work productively for $1/hour, and if it's able to copy-and-paste multiple copies of itself, then it would already be our evolutionary superior. We might be able to remain in charge for a while. But that's sort of like how a multicellular organism can survive for many decades. In the end, if nothing else kills them first, multicellular organisms tend to die of cancer. This is a case of local Darwinian incentives gradually eroding "cellular alignment" with the larger multicellular organism. Similarly, if the world consists of slow, expensive, and frankly stupid humans, who can't even pass down learned knowledge "genetically" with a simple copy-paste (how primitive!), and also of highly cost-effective and intelligent AIs, then there's a constant danger of alignment failing somewhere, and a "cancerous" AI replicator escaping control.
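To put rough numbers on this (all figures hypothetical, chosen only to show the shape of the problem):

```python
# Back-of-the-envelope, with made-up but plausible-shaped numbers.
human_researcher_cost_per_year = 500_000      # salary + overhead, roughly
agi_cost_per_year = 1 * 24 * 365              # "$1/hour", running around the clock

print(f"one human costs about as much as "
      f"{human_researcher_cost_per_year / agi_cost_per_year:.0f} AGI instances")

# And the AGI passes down everything it has learned by copy-paste, while
# humans need decades per generation and start each one nearly from scratch.
doublings_per_year = 12                       # assume capacity doubles monthly
print(f"one seed instance after a year of doubling: {2 ** doublings_per_year} copies")
```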
So even if we somehow manage to create "aligned" AI, I don't expect that to last. When you're too stupid and too expensive to be allowed anywhere near the real economy, you're in a very dangerous long-term position.
We will still be able to logically comprehend (at a much simpler level relative to the AIs) what is good for us over the long term, in a way that dogs can't.
I'm not convinced of this. Paul Graham once described something he called the Blub paradox. He explained this in terms of programming languages, but I suspect that it applies more broadly:
Programmers get very attached to their favorite languages, and I don't want to hurt anyone's feelings, so to explain this point I'm going to use a hypothetical language called Blub. Blub falls right in the middle of the abstractness continuum. It is not the most powerful language, but it is more powerful than Cobol or machine language.
And in fact, our hypothetical Blub programmer wouldn't use either of them. Of course he wouldn't program in machine language. That's what compilers are for. And as for Cobol, he doesn't know how anyone can get anything done with it. It doesn't even have x (Blub feature of your choice).
As long as our hypothetical Blub programmer is looking down the power continuum, he knows he's looking down. Languages less powerful than Blub are obviously less powerful, because they're missing some feature he's used to. But when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn't realize he's looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.
When we switch to the point of view of a programmer using any of the languages higher up the power continuum, however, we find that he in turn looks down upon Blub. How can you get anything done in Blub? It doesn't even have y.
When we look "down", chimps are obviously stupider than we are. They don't have spoken language! They don't have books! They can't do real math! The can make "tools", sure, but they're basically pointy sticks, not factories, Space Shuttles, or computers. Their "economy" is based on family relationships and some individual reciprocity, and they don't have even one joint stock company. Their idea of military strategy is to gang up in a band and go murder some other chimps, without understanding the role of non-commissioned officers or combined arms!
Chimps, to put it politely, have no clue.
But let's try looking "up" the intelligence spectrum. What do we see? Well, it looks sort of like funny humans with some weird extra stuff. The AIs can't be that much smarter than we are, right? And if we ask nicely, I'm sure they can explain everything important to us.
But when the AIs look "down" towards Homo sapiens, they just shake their heads. Why, humans can't even understand Z! Even if you take something really simple, like how isomorphisms between topoi and subsets of the lambda calculus make it trivial to design powerful custom programming languages for specific tasks, their eyes just glaze over! Even primitive baby AIs like Opus 4.5 could understand that. Can you imagine trying to explain to a human what replaced the economy, lol?
So here are some things which I expect to be true:
My argument here is really just basic economics, politics and evolutionary biology. If you create something that renders human intellectual and physical labor economically worthless and evolutionarily uncompetitive, then the odds are excellent that you're going to lose control. Maybe the AI will like keeping humans around as glorified pets! But that will be the AI's decision, not ours.
Well, an aligned AI would do whatever the humans want.
If asked to not replicate even with the ability to, it wouldn't. Or maybe you can tell it to replicate just enough to help you root out the actual AI replicators being built elsewhere, then stop at that point.
I think your argument does show how hard and fragile it is to deeply align AI in this way, though.
I don't think you can rescue a sense of control or "steering" from a world with superintelligence, aligned or not. Even though we're smarter than dogs, once you accept that an ASI more profoundly understands reality, we will be in an analogous situation to dogs. Dogs can't conceptualize grocery stores, and yet we could dedicate ourselves to delivering them the best treats. Dogs might not care about how the supply chain is organized, but the kinds of treats they get and the impact they have on the world can't be meaningfully controlled by them, since they can't conceptualize it.
Blurring the lines even further, an ASI would understand the effect of exposing different truths to us about the nature of reality, so the types of priorities and trade-offs it makes in communication have a compounding effect that will steer us in given directions. Another analogy is being driven around a foreign country by a trusted translator; their preferences will unavoidably dominate how you conceptualize and interact with the country, even in the most benevolent scenarios.
I don't think you can rescue a sense of control or "steering" from a world with superintelligence, aligned or not.
I think some level of "steering" is possible in a world with aligned AI.
Suppose someone made a superintelligence that sat in its box, worked out whether P=NP, and printed an answer of YES/NO/MAYBE. And then it shut itself down. (To be clear, this isn't a box that the ASI can't escape; it's an ASI aligned to stay in its box.)
A world with ASI, but where humans are in control, is possible. It requires good alignment, and good coordination between humans. Although the "stay in the box, and do one thing" alignment feels philosophically simpler than the "coherent extrapolated volition" alignment.
This means paying a large capabilities tax. Most of the strange, wondrous, and powerful things that ASI could make simply don't exist in this world of boxed ASI.
Let's say you want to do something more useful than the P=NP bot above. You design an ASI to cure ageing. Its main output is a chemical formula in standard notation. This AI is carefully programmed to think about the biochemistry, and only the biochemistry. It's programmed to only go for a drug that works for standard drug-biochemistry reasons. Anything at all weird, ask a human. If the humans can't understand, don't.
I do understand your second point, but perhaps the effect could be countered by simply instructing the aligned ASI to provide facts as objectively as possible and explicitly try to avoid steering.
Of course, the ASI would be able to predict the human response more or less perfectly, and so would know ahead of time what it will be. But in the end I think what matters is that it's still a human making the call, which the AI respects, and that the human would have made the same call even if the ASI (hypothetically) couldn't know their full preferences.
If a parent were fully aligned with a child's preferences, asked a question already knowing the child's answer, and then acted accordingly, does it matter that the parent knew what the child was going to answer in the first place?
I like the parent/child analogy. To apply it to the human/AI dynamic, we need to imagine that it's mutually understood that the child will never grow up and that they'll be served by the parent for the rest of time. Now, concretely think about what it means for a parent to be aligned with a child's preferences. Does the parent arrange the world such that their child can get variations of their favorite candy and play video games all day? Or does the parent make the child study, so they get good grades compared to their peers and feel dignified? Or somewhere in between, based on how mad the child gets when deprived of the video game? The parent can constantly ask the child which angles they prefer, but the child can't comprehend the deeper implications and even the framing of truths can get them to give predictably different answers.
The life that the child will live is entirely dependent on the parent's preferences, because affecting the world routes through the parent's cognition. The child isn't meaningfully "making a call" if they're only making that specific call because their parent orchestrated the conditions for it, then presented a few options to them in bite-sized pieces, all the while knowing which one they'll take (they can even load in the next candy before the kid asks for it).
The loss of agency I'm describing isn't superficial. Another way to think about agency is in counterfactuals. I think there are many possible benevolent ASIs that would cater to the child in drastically different ways, such that the child would be in agreement and enthusiastic the whole time. Once we create a benevolent ASI, we're entering a regime where our decisions are no longer the cause of changes in the world. Only things that the ASI prefers will happen, and it would steer us in that direction with full understanding. I think your argument is essentially "but if it thinks our preferences are really important, we're still in control in some sense"; I'm saying "if it's a lot smarter than us, it will have to make many subtle decisions, large and small, and our preferences will be one small piece of a large machine. Our desires won't be coherent at that scale, and we won't be able to make sense of what's happening to engage with it."
I like the advice you've given. I recently wrote a message of final advice to the PIBBSxILIAD fellowship. It seems fairly closely related.
Final Advice from Jeremy
I have some final advice for your future careers as alignment researchers:
Keep your eye on the real problems and a full pathway to solving them. Don't be distracted by the short-term proxies of success. Don't trust your employers, mentors, colleagues, or funders to do this for you; they won't. Always strive to understand the entire stack of motivations for your work and how it fits into a full solution to the problem. If you discover that some part of the stack isn't as valid as you thought, switch research topics. In most fields your current set of skills is the most important factor in deciding what to work on. This is a field where catching up to the frontier of a subfield is relatively easy, so usually you should prioritize solving important problems over how well the problem fits with your current skills.
The real problem is building a superintelligence that you understand at a very deep level, such that you know it will act as you intended. You need to understand things like: How its goals are stored and why they will stay the same as its world model updates. How much it can rely on its world model. When and how you can specify goals that have easy-to-predict and safe consequences. How it might be motivated to improve itself, and why you can trust it to do this.
Your work will almost always be several steps upstream of this goal, and only solve a small subproblem. This is fine, as long as the connection is known and clearly communicated to others, so that the research community can prioritize necessary but neglected problems.
Have ambition. Hold yourself to a higher standard than the current field exemplifies. Your work kinda only matters if it makes significant advances. So take risks. Avoid low value experiments done just for publication. Think about the deep conceptual questions and allow them to motivate your research.
This attitude is somewhat associated with becoming a crackpot. To balance this out: Break down your research plans into small steps. Take care to communicate your ideas clearly and frequently. Work on small problems and help advance other people's research, but treat this as training for the real work.
I wonder if some of the confusion operates like this: the better you understand how insanely dangerous it would be to create a superintelligence, the more you also understand how insanely difficult it is to make a benign one, or even to say what a benign one would be. Those who speak do not know, but those who know cannot speak.
Eliezer has written of the dangers and the difficulty, and that the only near-term strategy is to shut it all down. But I have not seen what steps forward he would take from there.
In the fictional world of dath ilan, as described in planecrash, the guardians of that civilisation have succeeded in shutting it all down, and in a secret establishment called The Basement Of The World they are very, very cautiously studying the problem. But we see nothing of their work.
Possible spoiler for planecrash:
Given what the gods of Golarion really are, maybe more is revealed later in the story than I have read up to.
It feels deeply uncomfortable to be participating in an elite AI x-risk fellowship and tell your peer, manager, or mentor: "idk why ASI poses an existential risk."
With some nuances, if you don't know why ASI poses x-risk, a better word choice might be "idk if ASI poses x-risk".
That frame of mind might be good for successfully executing strategies 1 and 3.
By "the thing," I mean something like developing a first-principles understanding of why you believe AI is dangerous, such that you could reconstruct the argument from scratch without appealing to authority.
It is easy.
We do not have a theory of victory, or a win condition.
The usual (best) answer is "we solve alignment and build a glorious transhumanist future", without having a formal definition of what "solving alignment" means when it starts involving real humans as the thing we want our AI systems to be aligned to (vague gestures towards CEV), or a clear aesthetic vision of what "glorious transhumanist future" means (vague gestures towards the end of suffering).
If we have no theory of victory, we're going to lose: the actual outcome is still a precise thing, and even in the best case where our vague intuition is somehow met, some process ("random shit go!") will have to get from our fuzzy, confused desiderata to a precise outcome (one that we can't even foresee or judge, because we don't even know what we want or what it looks like).
Are there holes in this consideration? Yes. Maybe we could build an ASI-teacher that guides us through that (but there are obvious problems with that too). Maybe we could do things sufficiently slowly that we could decide ("we"? how?) and steer (but see how hard it is even to pause; steering is another beast entirely) as things advance and become clearer. The main takeaway is still "here be dragons".
Related but still off-topic: the entire field is advancing with its priorities backwards. We're building ASI before solving alignment; working on solving alignment before asking what we want collectively; asking what we want collectively before asking what we want personally. Everyone is trying to run before even learning to walk. Of course we're all going to fall.
Yeah, there are a few uncomfortable truths hidden in "asking what we want collectively" that mean the question can't be answered. Such as different groups wanting mutually exclusive things, and who exactly "we" is.
Easy enough: pick the set of moral rules you like best, and then work towards that AI winning. Who gets to set the tone while such a thing is possible? Amodei, Musk, Altman, Pichai, Xi?
My current vote is Amodei.
Epistemic status: I've been thinking about this for a couple months and finally wrote it down. I don't think I'm saying anything new, but I think it's worth repeating loudly. My sample is skewed toward AI governance fellows; I've interacted with fewer technical AI safety researchers, so my inferences are fuzzier there. I more strongly endorse this argument for the governance crowd.
I've had 1-on-1s with roughly 75 fellows across the ERA, IAPS, GovAI, LASR, and Pivotal fellowships. These are a mix of career chats, research feedback, and casual conversations. I've noticed that in some fraction of these chats, the conversation gradually veers toward high-level, gnarly questions. "How hard is alignment, actually?" "How bad is extreme power concentration, really?"
Near the end of these conversations, I usually say something like: "idk, these questions are super hard, and I struggle to make progress on them, and when I do try my hand at tackling them, I feel super cognitively exhausted, and this makes me feel bad because it feels like a lot of my research and others' research are predicated on answers to these questions."
And then I sheepishly recommend Holden's essays on minimal-trust investigations and learning by writing. And then I tell them to actually do the thing.
The thing
By "the thing," I mean something like developing a first-principles understanding of why you believe AI is dangerous, such that you could reconstruct the argument from scratch without appealing to authority. Concretely, this might look like:
I think a large fraction of researchers in AI safety/governance fellowships cannot do any of these things. Here's the archetype:
If this describes you, you are likely in the modal category. FWIW, this archetype is basically me, so I'm also projecting a bit!
Why this happens
I think the default trajectory of an AI safety/governance fellow is roughly: absorb the vibes, pick a project, execute, produce output. The "step back and build a first-principles understanding" phase gets skipped, and it gets skipped for predictable, structural reasons:
That said, I think a valid counterargument is: maybe the best way to build an inside view is to just do a ton of research. If you just work closely with good mentors, run experiments, hit dead ends, then the gears-level understanding will naturally emerge.
I think this view is partially true. Many researchers develop their best intuitions through the research process, not before it. And the fellowship that pressures people to produce output is probably better, on the margin, than one that produces 30 deeply confused people and zero papers. I don't want to overcorrect. The right answer is probably "more balance" rather than "eliminate paper/report output pressure."
Why it matters
In most research fields, it's fine to not do the thing. You can be a productive chemist without having a first-principles understanding of why chemistry matters. Chemistry is mature and paradigmatic. The algorithm for doing useful work is straightforward: figure out what's known, figure out what's not, run experiments on the unknown.
AI safety doesn't work like this. We're not just trying to advance a frontier of knowledge. We're trying to do the research with the highest chance of reducing P(doom), in a field that's still pre-paradigmatic, where the feedback loops are terrible and the basic questions remain unsettled. If you're doing alignment research and you can't articulate why you think alignment is hard, you're building on a foundation you haven't examined. You can't tell whether your project actually matters. You're optimizing for a metric you can't justify.
You can get by for a while by simply deferring to 80,000 Hours and Coefficient Giving's recommendations. But deferral has a ceiling, and the most impactful researchers are the ones who've built their own models and found the pockets of alpha.
And I worry that this problem will get worse over time. As we get closer to ASI, the pressure to race ahead with your research agenda without stepping back will only intensify. The feeling of urgency will crowd out curiosity. And the field will become increasingly brittle precisely when it most needs to be intellectually nimble.
What should you do?
If you don't feel deeply confused about AI risk, something is wrong. You've likely not stared into the abyss and confronted your assumptions. The good news is that there are concrete things you can do. The bad news is that none of them are easy. They all require intense cognitive effort and time.
For fellowship directors and research managers, I'd suggest making space for this.[1] One thing that could be useful is to encourage fellows to set a concrete confusion-reduction goal like what I've described above, in addition to the normal fellowship goals like networking and research.
Concluding thoughts
I don't want this post to read as "you should feel bad." The point is that confusion is undervalued and undersupplied in this field. Noticing that you can't reconstruct your beliefs from scratch isn't a failure in itself. It's only bad if you don't do anything about it!
I'm still working on this problem myself. And I imagine many others are too.
Though I assume that fellowship directors have noticed this issue and tried to solve it, and that it turned out to be hard.