In previous pieces, I argued that there's a real and large risk of AI systems developing dangerous goals of their own and defeating all of humanity - at least in the absence of specific efforts to prevent this from happening. I discussed why it could be hard to build AI systems without this risk and how it might be doable.
The “AI alignment problem” refers[1] to a technical problem: how can we design a powerful AI system that behaves as intended, rather than forming its own dangerous aims? This post is going to outline a broader political/strategic problem, the “deployment problem”: if you’re someone who might be on the cusp of developing extremely powerful (and maybe dangerous) AI systems, what should you … do?
The basic challenge is this:
My current analogy for the deployment problem is racing through a minefield: each player is hoping to be ahead of others, but anyone moving too quickly can cause a disaster. (In this minefield, a single mine is big enough to endanger all the racers.)
This post gives a high-level overview of how I see the kinds of developments that can lead to a good outcome, despite the “racing through a minefield” dynamic. It is distilled from a more detailed post on the Alignment Forum.
First, I’ll flesh out how I see the challenge we’re contending with, based on the premises above.
Next, I’ll list a number of things I hope that “cautious actors” (AI companies, governments, etc.) might do in order to prevent catastrophe.
Many of the actions I’m picturing are not the kind of things normal market and commercial incentives would push toward, and as such, I think there’s room for a ton of variation in whether the “racing through a minefield” challenge is handled well. Whether key decision-makers understand things like the case for misalignment risk (and in particular, why it might be hard to measure) - and are willing to lower their own chances of “winning the race” to improve the odds of a good outcome for everyone - could be crucial.
This piece is going to lean on previous pieces and assume all of the following things:
So, one can imagine a scenario where some company is in the following situation:
That seems like a tough enough, high-stakes-enough, and likely enough situation that it’s worth thinking about how one is supposed to handle it.
One simplified way of thinking about this problem:
In this setup, cautious actors need to move fast enough that they can’t be overpowered by others’ AI systems, but slowly enough that they don’t cause disaster themselves. Hence the “racing through a minefield” analogy.
In a non-Cold-Takes piece, I explore the possible actions available to cautious actors to win the race through a minefield. This section will summarize the general categories - and, crucially, why we shouldn’t expect that companies, governments, etc. will do the right thing simply from natural (commercial and other) incentives.
I’ll be going through each of the following:
I previously wrote about some of the ways we might reduce the dangers of advanced AI systems. Broadly speaking:
A key point here is that making AI systems safe enough to commercialize (with some initial success and profits) could be much less (and different) effort than making them robustly safe (no lurking risk of global catastrophe). The basic reasons for this are covered in my previous post on difficulties with AI safety research. In brief:
Well-meaning AI companies with active ethics boards might do a lot of AI safety work, by training AIs not to behave in unhelpful or dangerous ways. But if they want to address the risks I’m focused on here, this could require safety measures that look very different - e.g., measures more reliant on “checks and balances” and “digital neuroscience.”
In addition to making AI systems safer, cautious actors can also put effort into measuring and demonstrating how dangerous they are (or aren’t).
For the same reasons given in the previous section, it could take special efforts to find and demonstrate the kinds of dangers I’ve been discussing. Simply monitoring AI systems in the real world for bad behavior might not do it. It may be necessary to examine (or manipulate) their digital brains;[3] design AI systems specifically to audit other AI systems for signs of danger; deliberately train AI systems to demonstrate particular dangerous patterns (while not being too dangerous!); etc.
Learning and demonstrating that the danger is high could help convince many actors to move more slowly and cautiously. Learning that the danger is low could lessen some of the tough tradeoffs here and allow cautious actors to move forward more decisively with developing advanced AI systems; I think this could be a good thing in terms of what sorts of actors lead the way on transformative AI.
Here’s a dynamic I’d be sad about:
(Similar dynamics could apply to Country A and B, with national AI development projects.)
If Companies A and B would both “love to move slowly and be careful” if they could, it’s a shame that they’re both racing to beat each other. Maybe there’s a way to avoid this dynamic. For example, perhaps Companies A and B could strike a deal - anything from “collaboration and safety-related information sharing” to a merger. This could allow both to focus more on precautionary measures rather than on beating the other. Another way to avoid this dynamic is discussed below, under standards and monitoring.
“Finding ways to avoid a furious race” is not the kind of dynamic that emerges naturally from markets! In fact, working together along these lines would have to be well-designed to avoid running afoul of antitrust regulation.
Cautious actors might want to share certain kinds of information quite widely:
At the same time, as long as there are incautious actors out there, information can be dangerous too:
The lines between these categories of information might end up fuzzy. The same information might be useful both for demonstrating the dangers of cutting-edge systems and for demonstrating their capabilities, or both for making systems safer and for building them in the first place. So there could be a lot of hard judgment calls here.
This is another area where I worry that commercial incentives might not be enough on their own. For example, it is usually important for a commercial project to have some reasonable level of security against hackers, but not necessarily for it to be able to resist well-resourced attempts by states to steal its intellectual property.
Ideally, cautious actors would learn of every case where someone is building a dangerous AI system (whether purposefully or unwittingly), and be able to stop the project. If this were done reliably enough, it could take the teeth out of the threat; a partial version could buy time.
Here’s one vision for how this sort of thing could come about:
If the situation becomes very dire - i.e., it seems that there’s a high risk of dangerous AI being deployed imminently - I see the latter bullet point as one of the main potential hopes. In this case, governments might have to take drastic actions to monitor and stop dangerous projects, based on limited information.
I’ve emphasized the importance of caution: not deploying AI systems when we can’t be confident enough that they’re safe.
But when confidence can be achieved (how much confidence? See footnote 5), powerful-and-safe AI can help reduce risks from other actors in many possible ways.
Some of this would be by helping with all of the above. Once AI systems can do a significant fraction of the things humans can do today, they might be able to contribute to each of the activities I’ve listed so far:
Additionally, if safe AI systems are in wide use, it could be harder for dangerous (similarly powerful) AI systems to do harm. This could be via a wide variety of mechanisms. For example:
I’ve gone into some detail about why we might have a challenging situation (“racing through a minefield”) if powerful AI systems (a) are developed fairly soon; (b) present significant risk of misalignment leading to humanity being defeated; (c) are systems whose safety is not particularly easy to measure.
I’ve also talked about what I see as some of the key ways that “cautious actors” concerned about misaligned AI might navigate this situation.
I talk about some of the implications in my more detailed piece. Here I’m just going to name a couple of observations that jump out at me from this analysis:
This seems hard. If we end up in the future envisioned in this piece, I imagine this being extremely stressful and difficult. I’m picturing a world in which many companies, and even governments, can see the huge power and profit they might reap from deploying powerful AI systems before others - but we’re hoping that they instead move with caution (but not too much caution!) and take the kinds of actions described above, and that ultimately cautious actors “win the race” against less cautious ones.
Even if AI alignment ends up being relatively easy - such that a given AI project can make safe, powerful systems with about 10% more effort than making dangerous, powerful systems - the situation still looks pretty nerve-wracking, because of how many different players could end up trying to build systems of their own without putting in that 10%.
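As a rough illustration of why the number of players matters, here is a minimal sketch. It assumes, purely for illustration, that each project independently skips the extra safety effort with the same probability. The function name `chance_someone_skips`, the 20% skip probability, and the project counts are all hypothetical and not estimates from this piece.

```python
def chance_someone_skips(n_projects: int, p_skip: float) -> float:
    """Probability that at least one of n_projects skips the extra safety effort.

    Assumption (hypothetical): projects decide independently, each skipping
    with the same probability p_skip.
    """
    if n_projects < 0:
        raise ValueError("n_projects must be non-negative")
    if not 0.0 <= p_skip <= 1.0:
        raise ValueError("p_skip must be between 0 and 1")
    # Complement of "every single project puts in the extra effort".
    return 1.0 - (1.0 - p_skip) ** n_projects


if __name__ == "__main__":
    # Illustrative numbers only: a made-up 20% chance that any one project cuts corners.
    for n in (1, 5, 20):
        print(n, round(chance_someone_skips(n, 0.2), 3))
```

Under these made-up numbers, the chance that at least one project cuts corners climbs quickly as the number of projects grows, which is the dynamic described above.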
A lot of the most helpful actions might be “out of the ordinary.” When racing through a minefield, I hope key actors will:
As such, it could be very important whether key decision-makers (at both companies and governments) understand the risks and are prepared to act on them. Currently, I think we’re unfortunately very far from a world where this is true.
Additionally, I think AI projects can and should be taking steps today to make unusual-but-important measures more practical in the future. This could include things like:
1. Generally, or at least, this is what I’d like it to refer to.
2. Thanks to beta reader Ted Sanders for suggesting this analogy in place of the older one, “removing mines from the minefield.”
3. One genre of testing that might be interesting: manipulating an AI system’s “digital brain” in order to simulate circumstances in which it has an opportunity to take over the world, and seeing whether it does so. This could be a way of dealing with the King Lear problem. More here.
4. Modern AI systems tend to be trained with lots of trial-and-error. The actual code that is used to train them might be fairly simple and not very valuable on its own; but an expensive training process then generates a set of “weights” which are ~all one needs to make a fully functioning, relatively cheap copy of the AI system.
5. I mean, this is part of the challenge. In theory, you should deploy an AI system if the risks of not doing so are greater than the risks of doing so. That’s going to depend on hard-to-assess information about how safe your system is and how dangerous and imminent others’ are, and it’s going to be easy to be biased in favor of “My systems are safer than others’; I should go for it.” Seems hard.
Sometimes I despair that our current world seems like it lacks the "civilizational adequacy" to handle many of the deployment issues raised here, like implementing competent global monitoring, or even just navigating around our own antitrust laws to allow AI developers to avoid races... or even just building common knowledge that AI misalignment is a problem in the first place!
I think some other rationalists share this pessimistic inclination, which leads them to think that we had better just get AI right early on, when it is still under the control of a single tech company and we aren't forced to deal with the Pandora's box of geopolitical/coordination issues around deployment. (I think this is the wrong attitude -- getting AI right early on would be great, but we still need to think about all these problems as a "Plan B" in case not everything goes swimmingly.)
Since the cause of my despair is the lack of "civilizational adequacy", I find myself drawn to the idea of new institutions (like prediction markets, charter cities, improved voting systems, etc.) which might be able to help our society make better decisions. (For instance, if prediction markets were more widely used, society might be quicker to build common knowledge about the danger of misalignment risk. As a stretch goal, maybe prediction markets could actually help us evaluate and quickly implement good policies in response to the danger, preventing us from flailing around a la the COVID-19 response! For more detail along these lines, see my winning entry in the Future of Life Institute's "A.I. Worldbuilding Competition", which was all about my hope that improved institutional designs could help create a wiser civilization better able to safely develop AI.)
One major drawback of this hope is that the timelines might not match up -- advanced AI might be developed soon, while these speculative ideas might take decades to make their way from small-scale experiments to the level of maturity where they can powerfully influence national decisionmaking. Nevertheless, it still seems like a promising strategy to have in one's portfolio, if only to help in scenarios where AI is developed in the second half of this century or later.
How do you think about this issue? Is the idea of creating new experimental institutions and leveling up civilizational adequacy too indirect/dilute/meta compared to trying to directly influence existing institutions / decisionmakers? Too slow-acting, as mentioned above? Maybe prediction markets (and other ideas) just aren't promising enough, or are too intractable because of political opposition?
(Apologies for the late reply!) I think working on improved institutions is a good goal that could potentially help, and I'm excited about some of the work going on in general categories you mentioned. It's not my focus because (a) I do think the "timelines don't match up" problem is big; (b) I think it's really hard to identify specific interventions that would improve all decision-making - it's really hard to predict the long-run effects of any given reform (e.g., a new voting system) as the context changes. Accordingly, what feels most pressing to me is getting more clarity on specific measures that can be taken to reduce the biggest risks to humanity, and then looking specifically at which institutional changes would make the world better-positioned to evaluate and act on those types of measures. Hence my interest in AI strategy "nearcasting" and in AI safety standards.