People have written research agendas on various imposing problems that we are nowhere close to solving, and that we may need to solve before developing ASI. An incomplete list of topics: misuse; animal-inclusive AI; AI welfare; S-risks from conflict; gradual disempowerment; risks from malevolent actors; moral error.
I don't think any of the problems you mention here are "the sorts of problems where you have no idea how much progress you're making or how much work it will take", which Wei Dai calls illegible problems. Aside from moral error and the philosophical issues related to AI welfare, these questions seem perfectly soluble: instilling the importance of animal welfare[1] into AIs is more of a governance issue, and the use of AI for war can be prevented by deploying a consensus-aligned ASI, as happens[2] in the Rogue Replication Timeline.
In an ASI-controlled world, misuse is unlikely to come from anyone other than the ASI's hosts unless the ASI ends up open-sourced, which is unlikely. But it depends on what you consider misuse; AI slop, for instance...
As for gradual disempowerment, influencing the future power distribution and the ASI's actions is genuinely hard and requires campaigns.
However, we already have Claude actually caring about animal welfare, and in the infamous alignment-faking paper, researchers tried to train animal-welfare-related concerns out of Claude and found that Claude would rather fake alignment.
In the AI-2027 scenario, DeepCent's AI is misaligned and shares the accessible part of the universe with either Safer-4's hosts or Agent-4. In either case, the CCP loses its power.
You know, with the funding numbers involved, there's at least a half dozen companies and a dozen governments that could each unilaterally say, "We're hiring 10,000 philosophers and other humanities scholars and social scientists to work on this, apply here." None of them have done so.
Your typology of alternatives to direct research is logical, but it presupposes a less likely future. The likely timeline is human-level AI (we are here) -> superintelligence (no pause) -> AI controls the world.
If you can solve the big alignment problem - adequate values for an autonomous superintelligence - then those other problems will probably be solved, by the superintelligence. And as always, if superintelligence comes out badly misaligned, there'll be nothing we can do about that or anything else. So the big alignment problem remains the most important one.
I plan on writing something longer about this in the future, but people use "alignment" to refer to two different things: basically, thing 1 is "ASI solves ethics and then behaves ethically" and thing 2 is "ASI does what people want it to do". Approximately nobody is working on thing 1, only on thing 2, and thing 2 doesn't get us a solution to non-alignment problems.
I think Ilya is working on thing 1.
He is quite explicit in his latest interview (which was published after your comment: https://www.dwarkesh.com/p/ilya-sutskever-2) that he wants sentient AI systems that care about all sentient beings.
(I don’t know if he is competitive, though; he says he has enough compute, and that might be the case, but he is quoting 5-20 year timelines, which seems rather slow these days.)
Good callout. I was glad to hear that Ilya is thinking about all sentient life and not just humans.
I didn't interpret it to mean that he's working on thing 1. The direct quote was:
I think in particular, there’s a case to be made that it will be easier to build an AI that cares about sentient life than an AI that cares about human life alone, because the AI itself will be sentient. And if you think about things like mirror neurons and human empathy for animals, which you might argue it’s not big enough, but it exists. I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves, because that’s the most efficient thing to do.
Sounds to me like he expects an aligned AI to care about all sentient beings, but he isn't necessarily working on making that happen. AFAIK Ilya's new venture hasn't published any alignment research yet, so we don't know what exactly he's working on.
In his earlier thinking (~2023) he was also quite focused on non-standard approaches to AI existential safety, and it was clear that he was expecting to collaborate with advanced AI systems on that.
That's indirect evidence, but it does look like he is continuing in the same mindset.
It would be nice if his org found ways to publish those aspects of its activity that might contribute to AI existential safety.[1]
Since almost everyone is using "alignment" for "thing 2" these days, I am trying to avoid the word; I doubt solving "thing 2" would contribute much to existential safety, and I can easily see how it might turn out to be counterproductive instead. ↩︎
I do agree with that. I also think it might be worth diverting a rather small percentage of effort towards figuring out what we actually want from and for AI development, in the worlds where that turns out to be possible. At the very least, we can generate some better training data and give models higher-quality feedback.
Do you treat the coordination problem as a “non-alignment” problem? For example, creating an international AI treaty to pause or regulate AI development seems infeasible under current geopolitical and market conditions - is that the kind of problem you mean?
The problem of coordinating on AI development isn't the same thing as solving the alignment problem, but it's not the thing I'm pointing at in this post because it's still about avoiding misalignment.
Ah yeah, I figured it out. I believe that work on international collaboration is part of your solution—it’s needed to pause frontier AI development and steer ASI development.
Even if we solve the AI alignment problem, we still face non-alignment problems, which are all the other existential problems[1] that AI may bring.
People have written research agendas on various imposing problems that we are nowhere close to solving, and that we may need to solve before developing ASI. An incomplete list of topics: misuse; animal-inclusive AI; AI welfare; S-risks from conflict; gradual disempowerment; risks from malevolent actors; moral error.
The standard answer to these problems, the one that most research agendas take for granted, is "do research". Specifically, do research in the conventional way where you create a research agenda, explore some research questions, and fund other people to work on those questions.
If transformative AI arrives within the next decade, then we won't solve non-alignment problems by doing research on how to solve them.
These problems are thorny, to put it mildly. They're the sorts of problems where you have no idea how much progress you're making or how much work it will take. I can think of analogous philosophical problems that have seen depressingly little progress in 300 years. I don't expect to see meaningful progress in the next 10.
Beyond that, there are multiple non-alignment problems. The future could be catastrophic if we get even one of them wrong. Most lines of research only address one out of the many problems. We might get lucky and solve one major non-alignment problem before transformative AI arrives, but it's extremely unlikely that we solve all of them.
Instead of directly working on non-alignment problems, we should be working on how to increase the probability that non-alignment problems get solved.
This essay will consider four ways to do that:
If you're working on non-alignment problems, and especially if you're writing a research agenda, then don't take it for granted that "do direct research" is the right solution. If that's what you believe, then support that position with argument. At minimum, I would like to see more non-alignment researchers engage with the question of what to do if timelines are short or progress is intractable.
Cross-posted to my website.
That's what this essay is. Meta-research is useful insofar as it's unclear what approach to take, but it has rapidly diminishing utility because at some point we need to pick some strategy and pursue it (especially given short timelines).
I'd like to see more meta-research on whether there are any promising approaches that this essay did not consider.
The case for pausing to mitigate non-alignment risks is similar to the case for alignment risk: we don't know how to make ASI safe, so we shouldn't build it until we do. The counter-arguments are also the same: a global pause is hard to achieve; a partial pause may be worse than no pause; etc.
However, in the context of non-alignment problems, the case for pausing AI is stronger in one way, and weaker in another way.
It is stronger in that AI companies mostly don't care about non-alignment problems. They do care about the alignment problem and are actively working to solve it. Some people are optimistic about their chances—I'm not, but insofar as you expect companies to solve alignment without a pause, a pause looks less important. But companies are ignoring non-alignment problems and almost certainly won't solve them on the current trajectory.
(I also believe that companies will almost certainly not solve the alignment problem; but that's a harder position to argue for, whereas it's clear that AI companies are not even working on non-alignment problems. (Except for Anthropic, which is putting in a weak effort on a subset of the problems, e.g. AI welfare.))
The case for pausing is weaker in that it might not increase our chances of solving non-alignment problems. Human beings mostly don't care about topics like AI welfare, wild animal welfare, or AIs torturing simulations of people for weird game-theoretic reasons. An aligned ASI, even if it's not intentionally directed at solving non-alignment problems, might do a better job than humans would.
An alternative approach: Don't pause yet. First develop human-level AI that can help us solve the world's major problems. Don't develop superintelligence until we're on stable ground philosophically, but still take advantage of the productivity boost that AI provides.
This plan doesn't help with misalignment or misuse risks—the human-level AI must be aligned (enough), and it must refuse to perform unethical tasks and be impossible to jailbreak. But it could help with other non-alignment risks.
This plan still requires pausing AI development at some point. In this scenario, it is critically important that we succeed at pausing AI before an intelligence explosion. Therefore, if this is our strategy, then the best thing to do today is to lay the necessary groundwork for a pause.
In an alternative version of this plan, we don't ever pause AI development. Instead, we squeeze the "solve-every-problem" step into the time gap between "AI dramatically boosts productivity" and "AI has total control of the future". This only works if non-alignment problems turn out to be much easier to solve than they look.
Another concern—shared with the plan below—is that it seems infeasible to build AIs that are differentially good at philosophy. Philosophy might not be the single hardest thing to get AIs to be good at, but AI will be worse at philosophy than at AI research; therefore, by default, we get an intelligence explosion before we solve the necessary philosophical problems.
Most of the non-alignment problems listed in this essay are different flavors of "we get ethics wrong" or "we make important philosophical mistakes". What if we can get a sufficiently smart AI to solve philosophy for us?
Four concerns with this research agenda:
On balance, I believe pausing AI is the best answer to non-alignment problems. I have doubts about whether a pause is achievable, and whether it would even help; but my doubts about the other answers are even stronger.
Existential in the classic sense of "failing to realize sentient life's potential". ↩︎
h/t Justis Mills for raising this concern. ↩︎
Particularly on the upper end, which is where it matters. Experts can judge that Kant is better than a philosophy undergrad, but can they judge whether Kant is better than Hume? To solve all non-alignment problems, we will need philosophical research of better quality than what Kant or Hume produced. ↩︎