AI Governance across Slow/Fast Takeoff and Easy/Hard Alignment spectra

Davidmanheim

It has been suggested that in a rapid enough takeoff scenario, governance would not be useful, because the transition to superintelligence would be too rapid for human actors - whether governments, corporations, or individuals - to respond to. This seems to imply that we only care about takeoff speed. And if that is the only relevant factor, the case for governance only applies if you believe slow takeoff is likely. Of course, it also matters how long we have until takeoff - but even so, I think this leaves a fair amount on the table in terms of what governance could do, and I want to try to make the case that even in that world, governance (still defined broadly¹) is important - though in different ways.

The Easy/Hard Spectrum

To make the argument, I will lay out three possibilities about AI alignment which are orthogonal to takeoff speed and timing: alignment-by-default, prosaic alignment, and provable alignment. These actually form something of a spectrum, with the three scenarios spaced along it. In any case, for each possibility, governance needs to accomplish very different things in order to be successful, according to the above definition - and the relationship with takeoff speeds seems important, but not fully determinative.

The first possibility, alignment-by-default, is that if we train systems via reinforcement learning or similar, then even without particular effort to solve alignment, all systems which are successful end up learning policies and goals close enough to human values that they are beneficial and influenceable. In the slower takeoff case, initially, governance looks a lot like human governance, making sure that actors, both human and AI, can cooperate and follow mutually understood and agreed upon rules. Later, and in the faster takeoff case, our efforts towards governance become irrelevant as the AI systems replace human structures, or improve them.

The second possibility, prosaic alignment, is that alignment of artificial intelligence systems is somewhat difficult, but achievable via approaches which can be developed. So some systems will be aligned, but without oversight, unaligned systems are possible or likely. In this case, the key task of governance is to ensure that all early HLMI/PASTA/AGI systems undergo robust alignment procedures. Prior to the emergence of such systems, many tasks will be useful for ensuring this outcome, including monitoring progress, developing standards, and building norms about safety. But as above, later and/or in the faster takeoff cases, governance becomes less relevant. Note, however, that this means more emphasis is needed on pre-emergence and early stage efforts, rather than eliminating the need for governance.

The final possibility is that the only way alignment can occur is via currently-impossible provable alignment. In this case, it may be that there are few potential ways to train safe AGI, and almost all earlier attempts are dangerous. Somewhat similar to the previous case, the key task is to prevent misaligned systems. In a fast takeoff case, the entirety of the usefulness of governance is prior to emergence, perhaps via intensive monitoring or limits on compute, while in a slow takeoff case, there is some chance that governance can prevent disaster while allowing work in AI, perhaps via some sort of policing, à la lsusr’s Bayeswatch.

Along the different spectra

There are now three different dimensions being discussed. The first is how long we have until takeoff begins, which determines how much time we have to solve the various problems. The second is difficulty of alignment, which I argued above determines the key task of governance, whether it is to prevent unaligned systems, or it is to ensure that systems are aligned. And lastly, there is the speed of takeoff, which determines how much time governance has to act once takeoff begins.

In this model, along the latter two dimensions, as either speed or difficulty increases, the relative emphasis on pre-AGI governance increases, and the usefulness of governance during the transition decreases. This leaves us with effectively a single dimension, albeit still one that is orthogonal to when takeoff occurs. And while there is certainly a class of interventions which are helpful towards one end of the spectrum but harmful on the other², there is also the real possibility that we can find approaches which are beneficial in both cases.

As a few small examples of what these might look like, regardless of where on the spectrum we are, governance can reduce risks by 1) monitoring compute usage and capabilities to enable response, 2) vastly improving computer security for AI labs which could prevent or slow at least some forms of takeoff, and 3) building norms around care taken in development, testing, and deployment of proto-AGI systems.

¹ Allan Dafoe has suggested that “AI governance concerns how humanity can best navigate the transition to a world with advanced AI systems.” This seems broadly correct, and to add to it, he has suggested it concerns “norms and institutions shaping how AI is built and deployed, as well as the policy and research efforts to make it go well.”

² This analysis implies that the vast majority of governance efforts matter in slow takeoff / relatively easy alignment worlds, but are irrelevant or in some cases even harmful in faster takeoff / harder alignment worlds. This is an issue, but the existence of such tradeoffs alone does not imply that these approaches should not be seriously considered or pursued.

Thanks to Allan Dafoe for very helpful feedback on an earlier version of this.

"three possibilities about AI alignment which are orthogonal to takeoff speed and timing"

I think "AI Alignment difficulty is orthogonal to takeoff speed/timing" is quite conceptually tricky to think through, but still isn't true. It's conceptually tricky because the real truth about 'alignment difficulty' and takeoff speed, whatever it is, is probably logically or physically necessary: there aren't really alternative outcomes there. But we have a lot of logical uncertainty and conceptual confusion, so it still looks like there are different possibilities. Still, I think they're correlated.

First off, takeoff speed and timing are correlated: if you think HLMI is sooner, you must think progress towards HLMI will be faster, which implies takeoff will also be faster.

The faster we expect takeoff to go, the more likely it is that alignment is also difficult. There are two reasons for this. One is practical: the faster takeoff is, the less time you have to solve the problem before unaligned competitors become a problem. But the second is about the intrinsic difficulty of alignment (which I think is what you're talking about here).

Much of the reason that alignment pessimists like Eliezer think prosaic alignment can't work is that they expect that when we reach a capability discontinuity / find the core of general intelligence / enter the regime where AI capabilities start generalizing much further than they did before, whatever we were using to ensure corrigibility will suddenly break on us, probably triggering deceptive alignment immediately with no intermediate phase.

The more gradual and continuous you expect this scaling up to be, the more confident you should be in prosaic alignment, or alignment by default. There are other variables at play, the two aren't in direct correlation, but they aren't orthogonal.

(Also, the whole idea of getting assistance from AI tools on alignment research is in the mix here as well. If there's a big capability discontinuity when we find the core of generality, that causes systems to generalize really far, and also breaks corrigibility, then plausibly but not necessarily, all the capabilities we need to do useful alignment research in time to avoid unaligned AI disasters are on the other side of that discontinuity, creating a chicken-and-egg problem.)

Another way of picking up on this fact is that many of the analogy arguments used for fast takeoff (for example, that human evolution gives us evidence for giant qualitative jumps in capability) also in very similar form are used to argue for difficult alignment (e.g. that when humans started ramping up in intelligence suddenly we also started ignoring the goals of our 'outer optimiser').

In the post, I wanted to distinguish between two things you're now combining: how hard alignment is, and how long we have. And yes, combining these, we get the issue of how hard it will be to solve alignment in the time frame we have until we need to solve it. But they are conceptually distinct.

And neither of these directly relates to takeoff speed, which in the current framing is something like the time frame from when we have systems that are near-human until they hit a capability discontinuity. You said "First off, takeoff speed and timing are correlated: if you think HLMI is sooner, you must think progress towards HLMI will be faster, which implies takeoff will also be faster." This last implication might be true, or might not. I agree that there are many worlds in which they are correlated, but there are plausible counter-examples. For instance, we may continue with fast progress and get to HLMI and a utopian freedom from almost all work, but then hit a brick wall on scaling deep learning, and have another AI winter until we figure out how to make actual AGI which can then scale to ASI - and that new approach could lead to either a slow or a fast takeoff. Or we may have progress slow to a crawl due to costs of scaling input and compute until we get to AGI, at which point self-improvement takeoff could be near-immediate, or could continue glacially.

And I agree with your claims about why Eliezer is pessimistic about prosaic alignment - but that's not why he's pessimistic about governance, which is a mostly unrelated pessimism.

Like I said in my first comment, the in-practice difficulty of alignment is obviously connected to timeline and takeoff speed.

But you're right that you're talking about the intrinsic difficulty of alignment vs. takeoff speed in this post, not the in-practice difficulty.

But those are also still correlated, for the reasons I gave - mainly that a discontinuity is an essential step in both Eliezer-style pessimism and fast takeoff views. I'm not sure how close this correlation is.

Do these views come apart in other possible worlds? I.e. could you believe in a discontinuity to a core of general intelligence but still think prosaic alignment can work?

I think that potentially you can - if you think there are still enough capabilities in pre-HLMI AI (pre-discontinuity) to help you do alignment research before dangerous HLMI shows up. But prosaic alignment seems to require more assumptions to be feasible assuming a discontinuity, such as that the discontinuity doesn't occur before you have all the important capabilities you need to do good alignment research.

I'm not sure I agree with the compatibility of discontinuity and prosaic alignment, though you make a reasonable case, but I do think there is compatibility between slower governance approaches and discontinuity, if it is far enough away.

It seems like time to start focusing resources on a portfolio of serious prosaic alignment approaches, as well as effective interdisciplinary management. In my inside view, the highest-marginal-impact interventions involve making multiple different things go right simultaneously for the first AGIs, which is not trivial, and the stakes are astronomical.

Little clear progress has been made on provable alignment after over a decade of trying. My inside view is that it got privileged attention because the first people to take the problem seriously happened to be highly abstract thinkers. Then they defined the scope and expectations of the field, alienating other perspectives and creating a self-reinforcing trapped prior.

First, I think it's ludicrous to say "Little clear progress has been made on provable alignment after over a decade of trying." The progress is actually quite amazing - yes, we're decades away from a solution to provable alignment, if one is possible at all, but not only has there been some really amazing and groundbreaking work coming out of MIRI, but you aren't paying attention if you don't see all of the contributions that work made to all of the questions which "prosaic alignment" is now trying to answer.

Second, "It seems like time to start focusing resources on a portfolio of serious prosaic alignment approaches," is correct, but several years too late, given that it's a majority of the work which is being done already.
