Manhattan project for aligned AI

Chris van Merwijk

One possible thing that I imagine might happen, conditional on an existential catastrophe not occurring, is a Manhattan project for aligned AGI. I don’t want to argue that this is particularly likely or desirable. The point of this post is to sketch the scenario, and briefly discuss some implications for what is needed from current research.

Imagine the following scenario: Top AI scientists come to take the existential risk of AGI seriously only late, and until then there has been no significant change in the effort put into AI safety relative to our current trajectory. At some point, AI scientists and relevant decision-makers recognize that AGI will be developed soon by one AI lab or another (within a few months or years), and that without explicit effort there is a large probability of catastrophic results. A project is started to develop AGI:

It has a budget in the tens or hundreds of billions of dollars (XX B$ or XXX B$).
Dozens of the top AI scientists are part of the project, and many more assistants. People you might recognize or know from top papers and AI labs join the project.
A fairly constrained set of concepts, theories and tools is available that gives a broad roadmap for building aligned AGI.
There is a consensus understanding among management and the research team that without this project, AGI will plausibly be developed relatively soon, and that without explicitly understanding how to build the system safely it will pose an existential risk.

It seems to me that it is useful to backchain from this scenario to see what is needed, assuming that this kind of alignment Manhattan project is indeed what should happen.

Firstly, my view is that if this Manhattan project were to start in intellectual conditions similar to today’s, there wouldn't be very many top AI scientists significantly motivated to work on the problem, and it would not be taken seriously. Even very large sums of money would not suffice, since there wouldn't be enough of a common understanding about what the problem is for it to work.

Secondly, it seems to me that there isn't enough of a roadmap for building aligned AGI for such a project to succeed in a short time-frame of months to years. I expect some people to disagree with this, but looking at current rates of progress in our understanding of AI safety, and given my model of the practical parallelizability of conceptual progress, I am skeptical that the problem can be solved in a few years even by a group of 40 highly motivated and well-financed top AI scientists. It is plausible that this will look different closer to the finish line, but I doubt it.

On this model, I have in mind basically two kinds of work that contribute to good outcomes. This is not a significant change relative to my prior view, but in my mind it constrains the motivation behind such work to some degree:

Research that makes the case for AGI x-risk clearer, and constrains how we believe the problem occurs, in order to make it eventually easier to convince top AI scientists that working in such an alignment Manhattan project is reasonable, and to make sure there is a team that's on the same page as to what the problem is.
Research that constrains the roadmap for building aligned AGI. I'm thinking mostly of conceptual/theoretical/empirical work that helps us converge to an approach that can then be developed/refined and scaled by a large effort over a short time period.

I suspect this mostly shouldn't change my general picture of what needs to be done, but it does shift my emphasis somewhat.

Research that makes the case for AGI x-risk clearer

I ended up going into detail on this, in the process of making an entry to the FLI's aspirational worldbuilding contest. So, it'll be posted in full about a month from now. But for now, I'll summarize:

We should prepare stuff in advance for identifying and directly manipulating the components of an AGI that engage in ruminative thought. This should be possible: there are certain structures of questions and answers that will reliably emerge, such as "what is the big blank blue thing at the top of the image" and "it's probably the sky". We won't know how to read or speak its mentalese at first, but we will be able to learn it by looking for known claims and going from there.
Once we have AGI, we should use this stuff to query the AGI's own internal beliefs about whether certain catastrophic outcomes would come about, under the condition that it had been given internet access.
If the queries return true, then we have clear evidence of the presence of immense danger. We have a Demonstration of Cataclysmic Trajectory. This is going to be much more likely to get the world to take notice and react, than the loads of abstract reasoning about fundamental patterns of rational agency or whatever, that we've offered them so far. (Normal people don't trust abstract reasoning, and they mostly shouldn't! It's tricksy!)
From there, we get national funding for a global collaboration on alignment, and a means to convince security-minded parts of the government to implement the pretty tough global security policies required, so that the alignment project no longer needs to solve the problem in 5 years and can instead take, say, 30.

(And then we solve the symbol grounding problem, and then we figure out value learning, and then we learn how best to aggregate the learned values, and then we'll have solved the alignment problem)

Here’s a question—if you were a researcher of atomic theory right before the Manhattan project began, would you have predicted it would be successful? Conditional on success, how long would you have expected it to take given the budget they had?

As I understand, theory of atomic bomb was considerably more advanced at the beginning of Manhattan project compared to our understanding of theory of aligned AGI.

To simplify somewhat, there were two unknown parameters: the critical mass of uranium-235 and the rate of uranium isotope separation. Given these two parameters, you could calculate how long it would take by simple division. Remember that Little Boy was not tested at all: the theory was that solid. Success was basically guaranteed if you had enough time, although success in 100 years would rightly have been considered failure.
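The "simple division" above can be sketched as follows. This is a minimal illustration, not a historical calculation: the function name and every numeric value are hypothetical placeholders, chosen only to show how a tenfold uncertainty in critical mass turns into a tenfold uncertainty in the timeline.

```python
def years_to_enough_material(critical_mass_kg: float,
                             separation_rate_kg_per_year: float) -> float:
    """Years needed to accumulate one critical mass at a constant separation rate."""
    if critical_mass_kg <= 0 or separation_rate_kg_per_year <= 0:
        raise ValueError("both parameters must be positive")
    return critical_mass_kg / separation_rate_kg_per_year


# Hypothetical placeholder values, NOT historical figures.
ASSUMED_RATE_KG_PER_YEAR = 10.0

# Two assumed critical-mass estimates that differ by one order of magnitude.
for assumed_mass_kg in (10.0, 100.0):
    years = years_to_enough_material(assumed_mass_kg, ASSUMED_RATE_KG_PER_YEAR)
    print(f"assumed critical mass {assumed_mass_kg:g} kg -> {years:g} years")
```

With the separation rate held fixed, the estimated time scales linearly with the critical mass, so the whole forecast hinges on pinning down those two parameters.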

What about the nuclear reactor, plutonium, and the implosion device? Those were gambles to speed things up, because they thought it would take too long. (They were right: the war in Europe ended first.) But the Manhattan project would have succeeded without them, in the sense of producing fission weapons.

Another thing they tried to speed things up was better isotope separation. Electromagnetic separation was well understood and basically worked as designed. They gambled on developing gaseous diffusion, and it ended up more efficient, but development took too long so it didn't shorten the timeline at all.

In retrospect, they should have gambled on centrifuges, which are the currently preferred method. What was missing was a clever innovation, not an advanced material or other things of that nature. The Manhattan project could have been finished a lot faster if only they had known about the Zippe centrifuge.

In fact, there is an alternate history novel based on this, The Berlin Project by Gregory Benford (recommended). The author's estimate, which seemed reasonable to me, is that centrifuges would have shortened the timeline by one year, finishing in 1944. As a result, as the title suggests, an atomic bomb is dropped on Berlin.

So, let me answer the question. I will define success as producing fission weapons before the end of war in Europe. (This is a reasonable interpretation of statements by scientists who worked on the Manhattan project.) By that definition, the real-world Manhattan project failed.

No one could predict anything before the necessary experiments were done to figure out the critical mass. Rough estimates varied by one order of magnitude, implying one to ten years. Once the critical mass was figured out, electromagnetic separation implied three years (1942 to 1945), which was felt to give about a 50% chance of success based on guesses about how the war would progress. They tried hard to speed things up and shorten the timeline, but they failed. Choosing centrifuges would have led to success in 1944, but there was no reasonable way to know that, and an unlucky choice was made.

Tangential comment regarding "I will define success as producing fission weapons before the end of war in Europe": I'm not sure this is the right criterion for success for the purpose of analogizing to AGI. It seems to me that "producing fission weapons before an Axis power does" is more appropriate.

And this seems overwhelmingly the case, yes: "theory of atomic bomb was considerably more advanced at the beginning of Manhattan project compared to our understanding of theory of aligned AGI"

I'm not sure I understand the motivation behind the question. How much of my modern knowledge am I supposed to throw away? Note that I am not in fact an atomic theorist with the 1942 state of knowledge of atomic theory, so it's hard to know what I'd think, but I can imagine assigning somewhere between 5% and 95% depending on how informed an atomic theorist I actually was and what it was actually like in 1942. Maybe I could give a better answer if you clarified the motivation behind the question?

I’m asking you to try to imagine yourself as an atomic theorist who has access to the state of knowledge of atomic theory in 1942. Obviously that can’t be done perfectly, but my thought was that by comparing what you would have predicted with what actually happened, some insight can be had about how “unknown unknowns” affect projects of that scale.
