This is basically just an exposition of the content of Dwarkesh Patel, I thought it might be useful to have a slightly edited full version here. Full credit to Dwarkesh and Carl.

I find it easier to ingest the information this way and it is a bit quicker too. I did not find the transcripts of Dwarkesh until I had already packaged the subtitles of the videos into one pdf with some formatting. Then I thought I might as well use all the links he provided. So Dwarkesh has really done all the work here except some minor formatting and packaging, if he is not cool with this I will take this down obviously.

But as momentous as this interview seemed to me it might be useful for other people too as easily accessible text so here is the full interview in such a format.

Original Full Interviews can be found here:

 

Original Transcripts:

https://www.dwarkeshpatel.com/p/carl-shulman

 https://www.dwarkeshpatel.com/p/carl-shulman-2

Carl Shulman: Intelligence Explosion, Primate Evolution, Robot Doublings, & Alignment

Intelligence Explosion

D Today I have the pleasure of speaking with Carl Shulman. Many of my former guests, and this is not an exaggeration, have told me that a lot of their biggest ideas have come directly from Carl especially when it has to do with the intelligence explosion and its impacts. So I decided to go directly to the source and we have Carl today on the podcast. He keeps a super low profile but is one of the most interesting intellectuals I've ever encountered and this is actually his second podcast ever. We're going to go deep into the heart of many of the most important ideas that are circulating right now directly from the source. Carl is also an advisor to the Open Philanthropy project which is one of the biggest funders on causes having to do with AI and its risks, not to mention global health and well being. And he is a research associate at the Future of Humanity Institute at Oxford. So Carl, it's a huge pleasure to have you on the podcast. Thanks for coming.

C Thank you Dwarkesh. I've enjoyed seeing some of your episodes recently and I'm glad to be on the show.

D Excellent, let's talk about AI. Before we get into the details, give me the big picture explanation of the feedback loops and just general dynamics that would start when you have something that is approaching human-level intelligence.

C The way to think about it is — we have a process now where humans are developing new computer chips, new software, running larger training runs, and it takes a lot of work to keep Moore's law chugging (while it was, it's slowing down now). And it takes a lot of work to develop things like transformers, to develop a lot of the improvements to AI neural networks. The core method that I want to highlight on this podcast, and which I think is underappreciated, is the idea of input-output curves. We can look at the increasing difficulty of improving chips and sure, each time you double the performance of computers it’s harder and as we approach physical limits eventually it becomes impossible. But how much harder?

There's a paper called “Are Ideas Getting Harder to Find?" that was published a few years ago. 10 years ago at MIRI, I did an early version of this analysis using data mainly from Intel and the large semiconductor fabricators. In this paper they cover a period where the productivity of computing went up a million fold, so you could get a million times the computing operations per second per dollar, a big change but it got harder. The amount of investment and the labor force required to make those continuing advancements went up and up and up. It went up 18 fold over that period. Some take this to say — “Oh, diminishing returns. Things are just getting harder and harder and so that will be the end of progress eventually.” However in a world where AI is doing the work, that doubling of computing performance, translates pretty directly to a doubling or better of the effective labor supply. That is, if when we had that million-fold compute increase we used it to run artificial intelligences who would replace human scientists and engineers, then the 18x increase in the labor demands of the industry would be trivial.

We're getting more than one doubling of the effective labor supply than we need for each doubling of the labor requirement and in that data set, it's over four. So when we double compute we need somewhat more researchers but a lot less than twice as many. We use up some of those doublings of compute on the increasing difficulty of further research, but most of them are left to expedite the process. So if you double your labor force, that's enough to get several doublings of compute. You use up one of them on meeting the increased demands from diminishing returns. The others can be used to accelerate the process so you have your first doubling take however many months, your next doubling can take a smaller fraction of that, the next doubling less and so on. At least in so far as the outputs you're generating, compute for AI in this story, are able to serve the function of the necessary inputs. If there are other inputs that you need eventually those become a bottleneck and you wind up more restricted on this.

D Got it. The bloom paper said there was a 35% increase in transistor density and there was a 7% increase per year in the number of researchers required to sustain that pace.

C Something in the vicinity, yeah. Four to five doublings of compute per doubling of labor inputs.

D I guess there's a lot of questions you can delve into in terms of whether you would expect a similar scale with AI and whether it makes sense to think of AI as a population of researchers that keeps growing with compute itself. Actually, let's go there. Can you explain the intuition that compute is a good proxy for the number of AI researchers so to speak?

C So far I've talked about hardware as an initial example because we had good data about a past period. You can also make improvements on the software side and when we think about an intelligence explosion that can include — AI is doing work on making hardware better, making better software, making more hardware. But the basic idea for the hardware is especially simple in that if you have an AI worker that can substitute for a human, if you have twice as many computers you can run two separate instances of them and then they can do two different jobs, manage two different machines, work on two different design problems. Now you can get more gains than just what you would get by having two instances. We get improvements from using some of our compute not just to run more instances of the existing AI, but to train larger AIs.

There's hardware technology, how much you can get per dollar you spend on hardware and there's software technology and the software can be copied freely. So if you've got the software it doesn't necessarily make that much sense to say that — “Oh, we've got you a hundred Microsoft Windows.” You can make as many copies as you need for whatever Microsoft will charge you. But for hardware, it’s different. It matters how much we actually spend on the hardware at a given price. And if we look at the changes that have been driving AI recently, that is the thing that is really off-trend. We are spending tremendously more money on computer hardware for training big AI models.

D Okay so there's the investment in hardware, there's the hardware technology itself, and there's the software progress itself. The AI is getting better because we're spending more money on it because our hardware itself is getting better over time and because we're developing better models or better adjustments to those models. Where is the loop here?

C The work involved in designing new hardware and software is being done by people now. They use computer tools to assist them, but computer time is not the primary cost for NVIDIA designing chips, for TSMC producing them, or for ASML making lithography equipment to serve the TSMC fabs. And even in AI software research that has become quite compute intensive we're still in the range where at a place like DeepMind salaries were still larger than compute for the experiments. Although more recently tremendously more of the expenditures were on compute relative to salaries. If you take all the work that's being done by those humans, there's like low tens of thousands of people working at Nvidia designing GPUs specialized for AI.

There's more than 70,000 people at TSMC which is the leading producer of cutting-edge chips. There's a lot of additional people at companies like ASML that supply them with the tools they need and then a company like DeepMind, I think from their public filings, they recently had a thousand people. OpenAI is a few hundred people. Anthropic is less. If you add up things like Facebook AI research, Google Brain, other R&D, you get thousands or tens of thousands of people who are working on AI research.

We would want to zoom in on those who are developing new methods rather than narrow applications. So inventing the transformer definitely counts but optimizing for some particular businesses data set cleaning probably not. So those people are doing this work, they're driving quite a lot of progress. What we observe in the growth of people relative to the growth of those capabilities is that pretty consistently the capabilities are doubling on a shorter time scale than the people required to do them are doubling. We talked about hardware and how it was pretty dramatic historically. Like four or five doublings of compute efficiency per doubling of human inputs. I think that's a bit lower now as we get towards the end of Moore's law although interestingly not as much lower as you might think because the growth of inputs has also slowed recently. On the software side there's some work by Tamay Besiroglu and collaborators; it may have been his thesis.

It's called Are models getting harder to find? and it's applying the same analysis as the “Are ideas getting harder to find?” and you can look at growth rates of papers, from citations, employment at these companies, and it seems like the doubling time of these like workers driving the software advances is like several years whereas the doubling of effective compute from algorithmic progress is faster. There's a group called Epoch, they've received grants from open philanthropy, and they do work collecting datasets that are relevant to forecasting AI progress. Their headline results for what's the rate of progress in hardware and software, and growth in budgets are as follows — For hardware, they're looking at a doubling of hardware efficiency in like two years. It's possible it’s a bit better than that when you take into account certain specializations for AI workloads. For the growth of budgets they find a doubling time that's something like six months in recent years which is pretty tremendous relative to the historical rates. We should maybe get into that later and then on the algorithmic progress side, mainly using Imagenet type datasets right now they find a doubling time that's less than one year. So when you combine all of these things the growth of effective compute for training big AIs is pretty drastic.

D I think I saw an estimate that GPT-4 cost like 50 million dollars or around that range to train. Now suppose that AGI takes a 1000x that, if you were just a scale of GPT-4 it might not be that but just for the sake of example, some part of that will come from companies just spending a lot more to train the models and that’s just greater investment. Part of that will come from them having better models.You get the same effect of increasing it by 10x just from having a better model. You can spend more money on it to train a bigger model, you can just have a better model, or you can have chips that are cheaper to train so you get more compute for the same dollars. So those are the three you are describing the ways in which the “effective compute” would increase?

C Looking at it right now, it looks like you might get two or three doublings of effective compute for this thing that we're calling software progress which people get by asking — how much less compute can you use now to achieve the same benchmark as you achieved before? There are reasons to not fully identify this with software progress as you might naively think because some of it can be enabled by the other. When you have a lot of compute you can do more experiments and find algorithms that work better. We were talking earlier about how sometimes with the additional compute you can get higher efficiency by running a bigger model. So that means you're getting more for each GPU that you have because you made this larger expenditure. That can look like a software improvement because this model is not a hardware improvement directly because it's doing more with the same hardware but you wouldn't have been able to achieve it without having a ton of GPUs to do the big training run.

D The feedback loop itself involves the AI that is the result of this greater effect of compute helping you train better AI or use less effective compute in the future to train better AI?

C It can help with the hardware design. NVIDIA is a fab-less chip design company. They don't make their own chips. They send files of instructions to TSMC which then fabricates the chips in their own facilities. If you could automate the work of those 10,000+ people and have the equivalent of a million people doing that work then you would pretty quickly get the kind of improvements that can be achieved with the existing nodes that TSMC is operating on and get a lot of those chip design gains. Basically doing the job of improving chip design that those people are working on now but get it done faster. While that's one thing I think that's less important for the intelligence explosion. The reason being that when you make an improvement to chip design it only applies to the chips you make after that. If you make an improvement in AI software, it has the potential to be immediately applied to all of the GPUs that you already have. So the thing that I think is most disruptive and most important and has the leading edge of the change from AI automation of the inputs to AI is on the software side

Can AIs do AI research?

D At what point would it get to the point where the AIs are helping develop better software or better models for future AIs? Some people claim today, for example, that programmers at OpenAI are using Copilot to write programs now. So in some sense you're already having that feedback loop but I'm a little skeptical of that as a mechanism. At what point would it be the case that the AI is contributing significantly in the sense that it would almost be the equivalent of having additional researchers to AI progress and software?

C The quantitative magnitude of the help is absolutely central. There are plenty of companies that make some product that very slightly boosts productivity. When Xerox makes fax machines, it maybe increases people's productivity in office work by 0.1% or something. You're not gonna have explosive growth out of that because 0.1% more effective R&D at Xerox and any customers buying the machines is not that important. The thing to look for is — when is it the case that the contributions from AI are starting to become as large as the contributions from humans? So when this is boosting their effective productivity by 50 or 100% and if you then go from like eight months doubling time for effective compute from software innovations, things like inventing the transformer or discovering chinchilla scaling and doing your training runs more optimally or creating flash attention. If you move that from 8 months to 4 months and then the next time you apply that it significantly increases the boost you're getting from the AI. Now maybe instead of giving a 50% or 100% productivity boost now it's more like 200%.

It doesn't have to have been able to automate everything involved in the process of AI research. It can be that it's automated a bunch of things and then those are being done in extreme profusion. A thing AI can do, you can have it done much more often because it's so cheap. And so it's not a threshold of — this is human level AI, it can do everything a human can do with no weaknesses in any area. It's that, even with its weaknesses it's able to bump up the performance. So that instead of getting the results we would have with the 10,000 people working on finding these innovations, we get the results that we would have if we had twice as many of those people with the same kind of skill distribution.

It’s a demanding challenge, you need quite a lot of capability for that but it's also important that it's significantly less than — this is a system where there's no way you can point at it and say in any respect it is weaker than a human. A system that was just as good as a human in every respect but also had all of the advantages of an AI, that is just way beyond this point. If you consider that the output of our existing fabs make tens of millions of advanced GPUs per year. Those GPUs if they were running AI software that was as efficient as humans, it is sample efficient, it doesn't have any major weaknesses, so they can work four times as long, the 168 hour work week, they can have much more education than any human. A human, you got a PhD, it's like 20 years of education, maybe longer if they take a slow route on the PhD. It's just normal for us to train large models by eat the internet, eat all the published books ever, read everything on GitHub and get good at predicting it. So the level of education vastly beyond any human, the degree to which the models are focused on task is higher than all but like the most motivated humans when they're really, really gunning for it.

So you combine the things tens of millions of GPUs, each GPU is doing the work of the very best humans in the world and the most capable humans in the world can command salaries that are a lot higher than the average and particularly in a field like STEM or narrowly AI, like there's no human in the world who has a thousand years of experience with TensorFlow or let alone the new AI technology that was invented the year before but if they were around, yeah, they'd be paid millions of dollars a year. And so when you consider this — tens of millions of GPUs. Each is doing the work of 40, maybe more of these existing workers, is like going from a workforce of tens of thousands to hundreds of millions. You immediately make all kinds of discoveries, then you immediately develop all sorts of tremendous technologies. Human level AI is deep, deep into an intelligence explosion. Intelligence explosion has to start with something weaker than that.

D Yeah, what is the thing it starts with and how close are we to that? Because to be a researcher at OpenAI is not just completing the hello world Prompt that Copilot does right? You have to choose a new idea, you have to figure out the right way to approach it, you perhaps have to manage the people who are also working with you on that problem. It's an incredibly complicated portfolio of skills rather than just a single skill. What is the point at which that feedback loop starts where you're not just doing the 0.5% increase in productivity that an AI tool might do but is actually the equivalent of a researcher or close to it?

C Maybe a way is to give some illustrative examples of the kinds of capabilities that you might see. Because these systems have to be a lot weaker than the human-level things, what we'll have is intense application of the ways in which AIs have advantages partly offsetting their weaknesses. AIs are cheap so we can call a lot of them to do many small problems. You'll have situations where you have dumber AIs that are deployed thousands of times to equal one human worker. And they'll be doing things like voting algorithms where with an LLM you generate a bunch of different responses and take a majority vote among them that improves some performance. You'll have things like the AlphaGo kind of approach where you use the neural net to do search and you go deeper with the search by plowing in more compute which helps to offset the inefficiency and weaknesses of the model on its own. You'll do things that would just be totally impractical for humans because of the sheer number of steps, an example of that would be designing synthetic training data. Humans do not learn by just going into the library and opening books at random pages, it's actually much much more efficient to have things like schools and classes where they teach you things in an order that makes sense, focusing on the skills that are more valuable to learn.

They give you tests and exams. They're designed to try and elicit the skill they're actually trying to teach. And right now we don't bother with that because we can hoover up more data from the internet. We're getting towards the end of that but yeah, as the AIs get more sophisticated they'll be better able to tell what is a useful kind of skill to practice and to generate that. We've done that in other areas like AlphaGo. The original version of AlphaGo was booted up with data from human Go play and then improved with reinforcement learning and Monte-carlo tree search but then AlphaZero, a somewhat more sophisticated model benefited from some other improvements but was able to go from scratch and it generated its own data through self play. Getting data of a higher quality than the human data because there are no human players that good available in the data set and also a curriculum so that at any given point it was playing games against an opponent of equal skill itself. It was always in an area when it was easy to learn. If you're just always losing no matter what you do, or always winning no matter what you do, it's hard to distinguish which things are better and which are worse?

And when we have somewhat more sophisticated AIs that can generate training data and tasks for themselves, for example if the AI can generate a lot of unit tests and then can try and produce programs that pass those unit tests, then the interpreter is providing a training signal and the AI can get good at figuring out what's the kind of programming problem that is hard for AIs right now that will develop more of the skills that I need and then do them. You're not going to have employees at Open AI write a billion programming problems, that's just not gonna happen. But you are going to have AIs given the task of producing the enormous number of programming challenges.

D In LLMs themselves, there's a paper out of Anthropic called Constitution AI where they basically had the program just talk to itself and say, "Is this response helpful? If not, how can I make this more helpful” and the responses improved and then you train the model on the more helpful responses that it generates by talking to itself so that it generates it natively and you could imagine more sophisticated or better ways to do that. But then the question is GPT-4 already costs like 50 million or 100 million or whatever it was. Even if we have greater effective compute from hardware increases and better models, it's hard to imagine how we could sustain four or five orders of magnitude greater effective size than GPT-4 unless we're dumping in trillions of dollars, the entire economies of big countries, into training the next version. The question is do we get something that can significantly help with AI progress before we run out of the sheer money and scale and compute that would require to train it? Do you have a take on that?

C First I'd say remember that there are these three contributing trends. The new H100s are significantly better than the A100s and a lot of companies are actually just waiting for their deliveries of H100s to do even bigger training runs along with the work of hooking them up into clusters and engineering the thing. All of those factors are contributing and of course mathematically yeah, if you do four orders of magnitude more than 50 or 100 million then you're getting to trillion dollar territory. I think the way to look at it is at each step along the way, does it look like it makes sense to do the next step? From where we are right now seeing the results with GPT-4 and ChatGPT companies like Google and Microsoft are pretty convinced that this is very valuable. You have talk at Google and Microsoft that it's a billion dollar matter to change market share in search by a percentage point so that can fund a lot. On the far end if you automate human labor we have a hundred trillion dollar economy  and most of that economy is paid out in wages, between 50 and 70 trillion dollars per year. If you create AGI it's going to automate all of that and keep increasing beyond that.

So the value of the completed project Is very much worth throwing our whole economy into it, if you're going to get the good version and not the catastrophic destruction of the human race or some other disastrous outcome. In between it's a question of — how risky and uncertain is the next step and how much is the growth in revenue you can generate with it? For moving up to a billion dollars I think that's absolutely going to happen. These large tech companies have R&D budgets of tens of billions of dollars and when you think about it in the relevant sense all the employees at Microsoft who are doing software engineering that’s contributing to creating software objects, it's not weird to spend tens of billions of dollars on a product that would do so much. And I think that it's becoming clearer that there is a market opportunity to fund the thing. Going up to a hundred billion dollars, that's the existing R&D budgets spread over multiple years. But if you keep seeing that when you scale up the model it substantially improves the performance, it opens up new applications, that is you're not just improving your search but maybe it makes self-driving cars work, you replace bulk software engineering jobs or if not replace them amplify productivity. In this kind of dynamic you actually probably want to employ all the software engineers you can get as long as they are able to make any contribution because the returns of improving stuff in AI itself gets so high.

But yeah, I think that can go up to a hundred billion. And at a hundred billion you're using a significant fraction of our existing fab capacity. Right now the revenue of NVIDIA is 25 billion, the revenue of TSMC is over 50 billion. I checked in 2021, NVIDIA was maybe 7.5%, less than 10% of TSMC revenue. So there's a lot of room and most of that was not AI chips. They have a large gaming segment, there are data center GPU's that are used for video and the like. There's room for more than an order of magnitude increase by redirecting existing fabs to produce more AI chips and they're just actually using the AI chips that these companies have in their cloud for the big training runs. I think that that's enough to go to the 10 billion and then combine with stuff like the H100 to go up to the hundred billion.

D Just to emphasize for the audience the initial point about revenue made. If it costs OpenAI 100 million dollars to train GPT-4 and it generates 500 million dollars in revenue, you pay back your expenses with 100 million and you have 400 million for your next training run. Then you train your GPT 4.5, you get let's say four billion dollars in revenue out of that. That's where the feedback group of revenue comes from. Where you're automating tasks and therefore you're making money you can use that money to automate more tasks. On the ability to redirect the fab production towards AI chips, fabs take a decade or so to build. Given the ones we have now and the ones that are going to come online in the next decade, is there enough to sustain a hundred billion dollars of GPU compute if you wanted to spend that on a training run?

C Yes, you definitely make the hundred billion one. As you go up to a trillion dollar run and larger, it's going to involve more fab construction and yeah, fabs can take a long a long time to build. On the other hand, if in fact you're getting very high revenue from the AI systems and you're actually bottlenecked on the construction of these fabs then their price could skyrocket and that could lead to measures we've never seen before to expand and accelerate fab production. If you consider, at the limit you're getting models that approach human-like capability, imagine things that are getting close to brain-like efficiencies plus AI advantages. We were talking before a cluster of GPU supporting AIs that do things, data parallelism.

If that can work four times as much as a highly skilled motivated focused human with levels of education that have never been seen in the human population, and if a typical software engineer can earn hundreds of thousands of dollars, the world's best software engineers can earn millions of dollars today and maybe more in a world where there's so much demand for AI. And then times four for working all the time. If you can generate close to 10 million dollars a year out of the future version H100 and it cost tens of thousands of dollars with a huge profit margin now. And profit margin could be reduced with large production. That is a big difference that that chip pays for itself almost instantly and you could support paying 10 times as much to have these fabs constructed more rapidly. If AI is starting to be able to contribute more of the skilled technical work that makes it hard for NVIDIA to suddenly find thousands upon thousands of top quality engineering hires.

If AI hasn't reached that level of performance then this is how you can have things stall out. A world where AI progress stalls out is one where you go to the 100 billion and then over succeeding years software progress turns out to stall. You lose the gains that you are getting from moving researchers from other fields. Lots of physicists and people from other areas of computer science have been going to AI but you tap out those resources as AI becomes a larger proportion of the research field. And okay, you've put in all of these inputs, but they just haven't yielded AGI yet. I think that set of inputs probably would yield the kind of AI capabilities needed for intelligence explosion but if it doesn't, after we've exhausted this current scale up of increasing the share of our economy that is trying to make AI. If that's not enough then after that you have to wait for the slow grind of things like general economic growth, population growth and such and so things slow. That results in my credences and this kind of advanced AI happening to be relatively concentrated, over the next 10 years compared to the rest of the century because we can't keep going with this rapid redirection of resources into AI. That's a one-time thing.

Primate evolution

D If the current scale up works we're going to get to AGI really fast, like within the next 10 years or something. If the current scale up doesn't work, all we're left with is just like the economy growing 2% a year, we have 2% a year more resources to spend on AI and at that scale you're talking about decades before just through sheer brute force you can train the 10 trillion dollar model or something. Let's talk about why you have your thesis that the current scale up would work. What is the evidence from AI itself or maybe from primate evolution and the evolution of other animals? Just give me the whole confluence of reasons that make you think that.

C Maybe the best way to look at that might be to consider, when I first became interested in this area, so in the 2000s which was before the deep learning revolution, how would I think about timelines? How did I think about timelines? And then how have I updated based on what has been happening with deep learning? Back then I would have said we know the brain is a physical object, an information processing device, it works, it's possible and not only is it possible it was created by evolution on earth. That gives us something of an upper bound in that this kind of brute force was sufficient. There are some complexities like what if it was a freak accident and that didn't happen on all of the other planets and that added some value. I have a paper with Nick Bostrom on this. I think basically that's not that important an issue. There's convergent evolution, octopi are also quite sophisticated. If a special event was at the level of forming cells at all, or forming brains at all, we get to skip that because we're choosing to build computers and we already exist. We have that advantage. So evolution gives something of an upper bound, really intensive massive brute force search and things like evolutionary algorithms can produce intelligence.

D Isn’t the fact that octopi and other mammals got to the point of being pretty intelligent but not human level intelligent some evidence that there's a hard step between a cephalopod and a human?

C Yeah, that would be a place to look but it doesn't seem particularly compelling. One source of evidence on that is work by Herculano-Houzel. She's a neuroscientist who has dissolved the brains of many creatures and by counting the nuclei she's able to determine how many neurons are present in different species and has found a lot of interesting trends in scaling laws. She has a paper discussing the human brain as a scaled up primate brain. Across a wide variety of animals, mammals in particular, there's certain characteristic changes in the number of neurons and the size of different brain regions as things scale up. There's a lot of structural similarity there and you can explain a lot of what is different about us with a brute force story which is that you expend resources on having a bigger brain, keeping it in good order, and giving it time to learn. We have an unusually long childhood.

We spend more compute by having a larger brain than other animals, more than three times as large as chimpanzees, and then we have a longer childhood than chimpanzees and much more than many, many other creatures. So we're spending more compute in a way that's analogous to having a bigger model and having more training time with it. And given that we see with our AI models, these large consistent benefits from increasing compute spent in those ways and with qualitatively new capabilities showing up over and over again particularly in areas that AI skeptics call out. In my experience over the last 15 years the things that people call out are like —”Ah, but the AI can't do that and it's because of a fundamental limitation.” We've gone through a lot of them. There were Winograd schemas, catastrophic forgetting, quite a number and they have repeatedly gone away through scaling. So there's a picture that we're seeing supported from biology and from our experience with AI where you can explain —

Yeah, in general, there are trade-offs where the extra fitness you get from a brain is not worth it and so creatures wind up mostly with small brains because they can save that biological energy and that time to reproduce, for digestion and so on. Humans seem to have wound up in a self-reinforcing niche where we greatly increase the returns to having large brains. Language and technology are the obvious candidates. You have humans around you who know a lot of things and they can teach you. And compared to almost any other species we have vastly more instruction from parents and the society of the [unclear]. You're getting way more from your brain than you get per minute because you can learn a lot more useful skills and then you can provide the energy you need to feed that brain by hunting and gathering, by having fire that makes digestion easier.

Basically how this process goes on is that it's increasing the marginal increase in reproductive fitness you get from allocating more resources along a bunch of dimensions towards cognitive ability. That's bigger brains, longer childhood, having our attention be more on learning. Humans play a lot and we keep playing as adults which is a very weird thing compared to other animals. We're more motivated to copy other humans around us than the other primates. These are motivational changes that keep us using more of our attention and effort on learning which pays off more when you have a bigger brain and a longer lifespan in which to learn in.

Many creatures are subject to lots of predation or disease. If you're mayfly or a mouse and if you try and invest in a giant brain and a very long childhood you're quite likely to be killed by some predator or some disease before you're actually able to use it. That means you actually have exponentially increasing costs in a given niche. If I have a 50% chance of dying every few months, as a little mammal or a little lizard, that means the cost of going from three months to 30 months of learning and childhood development is not 10 times the loss, it’s 2^-10. A factor of 1024 reduction in the benefit I get from what I ultimately learn because 99.9 percent of the animals will have been killed before that point. We're in a niche where we're a large long-lived animal with language and technology so where we can learn a lot from our groups. And that means it pays off to just expand our investment on these multiple fronts in intelligence.

D That's so interesting. Just for the audience the calculation about like two to the whatever months is just like, you have a half chance of dying this month, a half chance of dying next month, you multiply those together. There's other species though that do live in flocks or as packs. They do have a smaller version of the development of cubs that play with each other. Why isn't this a hill on which they could have climbed to human level intelligence themselves? If it's something like language or technology, humans were getting smarter before we got language. It seems like there should be other species that should have beginnings of this cognitive revolution especially given how valuable it is given we've dominated the world. You would think there would be selective pressure for it.

C Evolution doesn't have foresight. The thing in this generation that gets more surviving offspring and grandchildren is the thing that becomes more common. Evolution doesn't look ahead and think oh in a million years you'll have a lot of descendants. It's what survives and reproduces now. In fact, there are correlations where social animals do on average have larger brains and part of that is probably the additional social applications of brains, like keeping track of which of your group members have helped you before so that you can reciprocate. You scratch my back, I'll scratch yours. Remembering who's dangerous within the group is an additional application of intelligence. So there's some correlation there but what it seems like is that in most of these cases it's enough to invest more but not invest to the point where a mind can easily develop language and technology and pass it on.

You see bits of tool use in some other primates who have an advantage compared to say whales who have quite large brains partly because they are so large themselves and they have some other things, but they don't have hands which means that reduces a bunch of ways in which brains can pay off and investments in the functioning of that brain. But yeah, primates will use sticks to extract termites, Capuchin monkeys will open clams by smashing them with a rock. But what they don't have is the ability to sustain culture. A particular primate will maybe discover one of these tactics and it'll be copied by their immediate group but they're not holding on to it that well. When they see the other animal do it they can copy it in that situation but they don't actively teach each other in their population. So it's easy to forget things, easy to lose information and in fact they remain technologically stagnant for hundreds of thousands of years.

And we can look at some human situations. There's an old paper, I believe by the economist Michael Kramer, which talks about technological growth in the different continents for human societies. Eurasia is the largest integrated connected area. Africa is partly connected to it but the Sahara desert restricts the flow of information and technology and such. Then you have the Americas after the colonization from the land bridge were largely separated and are smaller than Eurasia, then Australia, and then you had smaller island situations like Tasmania. Technological progress seems to have been faster the larger the connected group of people. And in the smallest groups, like Tasmania where you had a relatively small population, they actually lost technology. They lost some fishing techniques. And if you have a small population and you have some limited number of people who know a skill and they happen to die or there's some change in circumstances that causes people not to practice or pass on that thing then you lose it. If you have few people you're doing less innovation and the rate at which you lose technologies to some local disturbance and the rate at which you create new technologies can wind up imbalanced. The great change of hominids and humanity is that we wound up in this situation where we were accumulating faster than we were losing and accumulating those technologies allowed us to expand our population. They created additional demand for intelligence so our brains became three times as large as chimpanzees and our ancestors who had a similar brain size.

D Okay. And the crucial point in relevance to AI is that the selective pressures against intelligence in other animals are not acting against these neural networks because they're not going to get eaten by a predator if they spend too much time becoming more intelligent, we're explicitly training them to become more intelligent. So we have good first principles reason to think that if it was scaling that made our minds this powerful and if the things that prevented other animals from scaling are not impinging on these neural networks, these things should just continue to become very smart.

C Yeah, we are growing them in a technological culture where there are jobs like software engineer that depend much more on cognitive output and less on things like metabolic resources devoted to the immune system or to building big muscles to throw spears.

D This is kind of a side note but I'm just kind of interested. You referenced Chinchilla scaling at some point. For the audience this is a paper from DeepMind which describes if you have a model of a certain size what is the optimum amount of data that it should be trained on? So you can imagine bigger models, you can use more data to train them and in this way you can figure out where you should spend your compute. Should you spend it on making the model bigger or should you spend it on training it for longer? In the case of different animals, in some sense how big their brain is like model sizes and they're training data sizes like how long they're cubs or how long their infants or toddlers before they’re full adults. I’m curious, is there some kind of scaling law?

C Chinchilla scaling is interesting because we were talking earlier about the cost function for having a longer childhood where it's exponentially increasing in the amount of training compute you have when you have exogenous forces that can kill you. Whereas when we do big training runs, the cost of throwing in more GPU is almost linear and it's much better to be linear than exponentially decay as you expend resources.

D Oh, that's a really good point.

C Chinchilla scaling would suggest that for a brain of human size it would be optimal to have many millions of years of education but obviously that's impractical because of exogenous mortality for humans. So there's a fairly compelling argument that relative to the situation where we would train AI that animals are systematically way under trained. They're more efficient than our models. We still have room to improve our algorithms to catch up with the efficiency of brains but they are laboring under that disadvantage.

D That is so interesting. I guess another question you could have is: Humans got started on this evolutionary hill climbing route where we're getting more intelligent because it has more benefits for us. Why didn't we go all the way on that route? If intelligence is so powerful why aren't all humans as smart as we know humans can be? If intelligence is so powerful, why hasn't there been stronger selective pressure? I understand hip size, you can't give birth to a really big headed baby or whatever. But you would think evolution would figure out some way to offset that if intelligence has such big power and is so useful.

C Yeah, if you actually look at it quantitatively that's not true and even in recent history it looks like a pretty close balance between the costs and the benefits of having more cognitive abilities. You say, who needs to worry about the metabolic costs? Humans put 20 percent of our metabolic energy into the brain and it's higher for young children. And then there's like breathing and digestion and the immune system. For most of history people have been dying left and right. A very large proportion of people will die of infectious disease and if you put more resources into your immune system you survive. It's life or death pretty directly via that mechanism. People die more of disease during famine and so there's boom or bust. If you have 20% less metabolic requirements [unclear] you're much more likely to survive that famine. So these are pretty big.

And then there's a trade-off about just cleaning mutational load. So every generation new mutations and errors happen in the process of reproduction. We know there are many genetic abnormalities that occur through new mutations each generation and in fact Down syndrome is the chromosomal abnormality that you can survive. All the others just kill the embryo so we never see them. But down syndrome occurs a lot and there are many other lethal mutations and there are enormous numbers of less damaging mutations that are degrading every system in the body. Evolution each generation has to pull away at some of this mutational load and the priority with which that mutational load is pulled out scales in proportion to how much the traits it is affecting impact fitness.

So you get new mutations that impact your resistance to malaria, you got new mutations that damage brain function and then those mutations are purged each generation. If malaria is a bigger difference in mortality than the incremental effectiveness as a hunter-gatherer you get from being slightly more intelligent, then you'll purge that mutational load first. Similarly humans have been vigorously adapting to new circumstances. Since agriculture people have been developing things like the ability to have amylase to digest breads and milk. If you're evolving for all of these things and if some of the things that give an advantage for that incidentally carry along nearby them some negative effect on another trait then that other trait can be damaged. So it really matters how important to survival and reproduction cognitive abilities were compared to everything else the organism has to do. In particular, surviving famine, having the physical abilities to do hunting and gathering and even if you're very good at planning your hunting, being able to throw a spear harder can be a big difference and that needs energy to build those muscles and then to sustain them.

Given all these factors it's not a slam dunk to invest at the margin. And today, having bigger brains is associated with greater cognitive ability but it's modest. Large-scale pre-registered studies with MRI data. The correlation is in a range of 0.25 - 0.3 and the standard deviation of brain size is like 10%. So if you double the size of the brain, the existing brain costs like 20 of metabolic energy go up to 40%, okay, that's like eight standard deviations of brain size if the correlation is 0.25 then yeah, you get a gain from that eight standard deviations of brain size, two standard deviations of cognitive ability. In our modern society, where cognitive ability is very rewarded and finishing school and becoming an engineer or a doctor or whatever can pay off a lot financially, the average observed return in income is still only one or two percent proportional increase. There's more effects at the tail, there's more effect in professions like STEM but on the whole it's not a lot. If it was like a five percent increase or a 10 percent increase then you could tell a story where yeah, this is hugely increasing the amount of food you could have, you could support more children, but it's a modest effect and the metabolic costs will be large and then throw in these other these other aspects. Else we can just see there was not very strong rapid directional selection on the thing which would be there if by solving a math puzzle you could defeat malaria, then there would be more evolutionary pressure.

D That is so interesting. Not to mention of course that if you had 2x the brain size, without c-section you or your mother or both would die. This is a question I've actually been curious about for over a year and I’ve briefly tried to look up an answer. I know this was off topic and my apologies to the audience, but I was super interested and that was the most comprehensive and interesting answer I could have hoped for. So yeah, we have a good explanation or good first principles evolution or reason for thinking that intelligence scaling up to humans is not implausible just by throwing more scale at it.

C I would also add that we also have the brain right here with us available for neuroscience to reverse engineer its properties. This was something that would have mattered to me more in the 2000s. Back then when I said, yeah, I expect this by the middle of the century-ish, that was a backstop if we found it absurdly difficult to get to the algorithms and then we would learn from neuroscience. But in actual history, it's really not like that. We develop things in AI and then also we can say oh, yeah, this is like this thing in neuroscience or maybe this is a good explanation. It's not as though neuroscience Is driving AI progress. It turns out not to be that necessary.

D I guess that is similar to how planes were inspired by the existence proof of birds but jet engines don't flap. All right, good reason to think scaling might work. So we spent a hundred billion dollars and we have something that is like human level or can help significantly with AI research.

C I mean that that might be on the earlier end but I definitely would not rule that out given the rates of change we've seen with the last few scale ups.

Forecasting AI progress

D At this point somebody might be skeptical. We already have a bunch of human researchers, how profitable is the incremental researcher? And then you might say no, this is thousands of researchers. I don’t know how to express this skepticism exactly. But skeptical of just generally the effect of scaling up the number of people working on the problem to rapid-rapid progress on that problem. Somebody might think that with humans the reason the amount of population working on a problem is such a good proxy for progress on the problem is that there's already so much variation that is accounted for. When you say there's a million people working on a problem, there's hundreds of super geniuses working on it, thousands of people who are very smart working on it. Whereas with an AI all the copies are the same level of intelligence and if it's not super genius intelligence the total quantity might not matter as much.

C I'm not sure what your model is here. Is the model that the diminishing returns kickoff, suddenly has a cliff right where we are? There were results in the past from throwing more people at problems and this has been useful in historical prediction, this idea of experience curves and [unclear] law measuring cumulative production in a field, which is also going to be a measure of the scale of effort and investment, and people have used this correctly to argue that renewable energy technology, like solar, would be falling rapidly in price because it was going from a low base of very small production runs, not much investment in doing it efficiently, and climate advocates correctly called out, people like David Roberts, the futurist [unclear] actually has some interesting writing on this. They correctly called out that there would be a really drastic fall in prices of solar and batteries because of the increasing investment going into that. The human genome project would be another. So I’d say there's real evidence. These observed correlations, from ideas getting harder to find, have held over a fair range of data and over quite a lot of time. So I'm wondering what‘s the nature of the deviation you're thinking of?

D Maybe this is a good way to describe what happens when more humans enter a field but does it even make sense to say that a greater population of AIs is doing AI research if there's like more GPUs running a copy of GPT-6 doing AI research. How applicable are these economic models of the quantity of humans working on a problem to the magnitude of AIs working on a problem?

C If you have AIs that are directly automating particular jobs that humans were doing before then we say, well with additional compute we can run more copies of them to do more of those tasks simultaneously. We can also run them at greater speed. Some people have an intuition that what matters is time, that it's not how many people working on a problem at a given point. I think that doesn't bear out super well but AI can also run faster than humans. If you have a set of AIs that can do the work of the individual human researchers and run at 10 times or 100 times the speed. And we ask well, could the human research community have solved these algorithm problems, do things like invent transformers over 100 years, if we have AIs with a population effective population similar to the humans but running 100 times as fast and so. You have to tell a story where no, the AI can't really do the same things as the humans and we're talking about what happens when the AIs are more capable of in fact doing that.

D Although they become more capable as lesser capable versions of themselves help us make themselves more capable, right? You have to kickstart that at some point. Is there an example in analogous situations? Is intelligence unique in the sense that you have a feedback loop of — with a learning curve or something else, a system’s outputs are feeding into its own inputs. Because if we're talking about something like Moore's law or the cost of solar, you do have this way where we're throwing more people with the problem and we're making a lot of progress, but we don't have this additional part of the model where Moore's law leads to more humans somehow and more humans are becoming researchers.

C You do actually have a version of that in the case of solar. You have a small infant industry that's doing things like providing solar panels for space satellites and then getting increasing amounts of subsidized government demand because of worries about fossil fuel depletion and then climate change. You can have the dynamic where visible successes with solar and lowering prices then open up new markets. There's a particularly huge transition where renewables become cheap enough to replace large chunks of the electric grid. Earlier you were dealing with very niche situations like satellites, it’s very difficult to refuel a satellite in place and in remote areas. And then moving to the sunniest areas in the world with the biggest solar subsidies. There was an element of that where more and more investment has been thrown into the field and the market has rapidly expanded as the technology improved.

But I think the closest analogy is actually the long run growth of human civilization itself and I know you had Holden Karnofsky from the open philanthropy project on earlier and discuss some of this research about the long run acceleration of human population and economic growth. Developing new technologies allowed the human population to expand and humans to occupy new habitats and new areas and then to invent agriculture to support the larger populations and then even more advanced agriculture in the modern industrial society. So there, the total technology and output allowed you to support more humans who then would discover more technology and continue the process. Now that was boosted because on top of expanding the population the share of human activity that was going into invention and innovation went up and that was a key part of the industrial revolution. There was no such thing as a corporate research lab or an engineering university prior to that. So you're both increasing the total human population and the share of it going in. But this population dynamic is pretty analogous. Humans invent farming, they can have more humans, they can invent industry and so on.

D Maybe somebody would be skeptical that with AI progress specifically, it’s not just a matter of some farmer figuring out crop rotation or some blacksmith figuring out how to do metallurgy better. In fact even to make the 50% improvement in productivity you basically need something on the IQ that's close to Ilya Sutskever. There's like a discontinuous line. You’re contributing very little to productivity and then you're like Ilya and then you contribute a lot. You see what I'm saying? There isn't a gradual increase in capabilities that leads to the feedback.

C You're imagining a case where the distribution of tasks is such that there's nothing that individually automating it particularly helps and so the ability to contribute to AI research is really end loaded. Is that what you're saying?

D Yeah, we already see this in these really high IQ companies or projects. Theoretically I guess Jane Street or OpenAI could hire like a bunch of mediocre people with a comparative advantage to do some menial task and that could free up the time of the really smart people but they don't do that right? Due to transaction costs or whatever else.

C Self-driven cars would be another example where you have a very high quality threshold. Your performance as a driver is worse than a human, like you have 10 times the accident rate or 100 times the accident rate, then the cost of insurance for that which is a proxy for people's willingness to ride the car would be such that the insurance costs would absolutely dominate. So even if you have zero labor cost, it is offset by the increased insurance costs. There are lots of cases like that where partial automation is in practice not very usable because complementing other resources you're gonna use those other resources less efficiently.

In a post-AGI future the same thing can apply to humans. People can say, comparative advantage, even if AIs can do everything better than a human well it's still worth something. Human can do something. They can lift a box, that's something. [unclear] In such an economy you wouldn't want to let a human worker into any industrial environment because in a clean room they'll be emitting all kinds of skin cells and messing things up. You need to have an atmosphere there. You need a bunch of supporting tools and resources and materials and those supporting resources and materials will do a lot more productively working with AI and robots rather than a human. You don't want to let a human anywhere near the thing just like you wouldn’t want a Gorilla wandering around in a China shop. Even if you've trained it to, most of the time pick up a box for you if you give it a banana. It's just not worth it to have it wandering around your china shop.

D Yeah. Why is that not a good objection?

C I think that that is one of the ways in which partial automation can fail to really translate into a lot of economic value. That's something that will attenuate as we go on and as the AI is more able to work independently and more able to handle its own screw-ups and get more reliable.

D But the way in which it becomes more reliable is by AI progress speeding up which happens if AI can contribute to it but if there is some reliability bottleneck that prevents it from contributing to that progress then you don't have the loop, right?

C I mean this is why we're not there yet.

D But then what is the reason to think we'll be there?

C The broad reason is the inputs are scaling up. Epoch have a paper called compute trends across three eras of machine learning and they look at the compute expended on machine learning systems since the founding of the field of AI, the beginning of the 1950s. Mostly it grows with Moore's law and so people are spending a similar amount on their experiments but they can just buy more with that because the compute is coming. That data covers over 20 orders of magnitude, maybe like 24, and of all of those increases since 1952 a little more than half of them happened between 1952 and 2010 and all the rest since 2010. We've been scaling that up four times as fast as was the case for most of the history of AI. We're running through the orders of magnitude of possible resource inputs you could need for AI much much more quickly than we were for most of the history of AI. That's why this is a period with a very elevated chance of AI per year because we're moving through so much of the space of inputs per year and indeed it looks like this scale-up taken to its conclusion will cover another bunch of orders of magnitude and that's actually a large fraction of those that are left before you start running into saying well, this is going to have to be like evolution with the simple hacks we get to apply. We're selecting for intelligence the whole time, we're not going to do the same mutation that causes fatal childhood cancer a billion times even though I mean we keep getting the same fatal mutations even though they've been done many times.

We use gradient descent which takes into account the derivative of improvement on the loss all throughout the network and we don't throw away all the contents of the network with each generation where you compress down to a little DNA. So there's that bar of, if you're going to do brute force like evolution combined with these very simple ways we can save orders of magnitude on that. We're going to cover a fraction that's like half of that distance in this scale-up over the next 10 years or so. And so if you started off with a kind of vague uniform prior, you probably can't make AGI with the amount of compute that would be involved in a fruit fly existing for a minute which would be the early days of AI. Maybe you would get lucky, we were able to make calculators because calculators benefited from very reliable serially fast computers and where we could take a tiny tiny tiny tiny fraction of a human brain's compute and use it for a calculator. We couldn't take an ant's brain and rewire it to calculate. It's hard to manage ant farms let alone get them to do arithmetic for you. So there were some things where we could exploit the differences between biological brains and computers to do stuff super efficiently on computers. We would doubt that we would be able to do so much better than biology that with a tiny fraction of an insect's brain we'd be able to get AI early on.

On the far end, it seemed very implausible that we couldn't do better than completely brute force evolution. And so in between you have some number of orders of magnitude of inputs where it might be. In the 2000s, I would say well, I'm gonna have a pretty uniformish prior I'm gonna put weight on it happening at the equivalent of 10^25 ops, 10^30, 10^35 and spreading out over that and then I can update another information. And in the short term, in 2005 I would say, I don't see anything that looks like the cusp of AGI so I'm also gonna lower my credence for the next five years or the next 10 years. And so that would be kind of like a vague prior and then when we take into account how quickly are we running through those orders of magnitude. If I have a uniform prior I assign half of my weight to the first half of remaining orders of magnitude and if we're gonna run through those, over the next 10 years and some, then that calls on me to put half of my credence, conditional on if ever we're gonna make AI which seems likely considering it's a material object easier than evolution, I've got to put similarly a lot of my credence on AI happening in this scale up and then that's supported by what we're seeing In terms of the rapid advances and capabilities with AI and LLMs in particular.

D Okay, that's actually a really interesting point. Now somebody might say, there's not some sense in which AIs could universally speed up the progress of OpenAI by 50 percent or 100 percent or 200 percent if they're not able to do everything better than Ilya Sutskever can. There's going to be something in which we're bottlenecked by the human researchers and bottleneck effects dictate that the slowest moving part of the organization will be the one that kind of determines the speed of the progress of the whole organization or the whole project. Which means that unless you get to the point where you're doing everything and everybody in the organization can do, you're not going to significantly speed up the progress of the project as a whole.

C Yeah, so that is a hypothesis and I think there's a lot of truth to it. When we think about the ways in which AI can contribute, there are things we talked about before like the AI setting up their own curriculum and that's something that Ilya can't and doesn’t do directly. And there's a question of how much does that improve performance? There are these things where the AI helps to produce some code for tasks and it's beyond hello world at this point. The thing that I hear from AI researchers at leading labs is that on their core job where they're like most expert it's not helping them that much but then their job often does involve coding something that's out of their usual area of expertise or they want to research a question and it helps them there. That saves some of their time and frees them to do more of the bottlenecked work. And I think the idea of, is everything being dependent on Ilya? And is Ilya so much better than the hundreds of other employees?

A lot of people who are contributing, they're doing a lot of tasks and you can have quite a lot of gain from automating some areas where you then do just an absolutely enormous amount of it relative to what you would have done before. Because things like designing the custom curriculum maybe some humans put some work into that but you're not going to employ billions of humans to produce it at scale and so it winds up being a larger share of the progress than it was before. You get some benefit from these sorts of things where there's like pieces of my job that now I can hand off to the AI and lets me focus more on the things that the AI still can't do. Later on you get to the point where yeah, the AI can do your job including the most difficult parts and maybe it has to do that in a different way. Maybe it spends a ton more time thinking about each step of a problem than you and that's the late end. The stronger these bottlenecks' effects are, the more the economic returns, the scientific returns and such are end-loaded towards getting full AGI. The weaker the bottlenecks are the more interim results will be really paying off.

D I probably disagree with you on how much the Ilya’s of organizations seem to matter. Just from the evidence alone, how many of the big breakthroughs in deep learning was that single individual responsible for, right? And how much of his time is he spending doing anything that Copilot is helping him on? I'm guessing most of it is just managing people and coming up with ideas and trying to understand systems and so on.

And if the five or ten people who are like that at OpenAI or Anthropic or whatever, are basically the way in which algorithmic progress is happening. I know Copilot is not the thing you're talking about with like just 20% automation, but something like that. How much is that contributing to the core function of the research scientist?

C Yeah, [unclear] quantitatively how much we disagree about the importance of key research employees and such. I certainly think that some researchers add more than 10 times the average employee, even much more. And obviously managers can add an enormous amount of value by proportionately multiplying the output of the many people that they manage. And so that's the kind of thing that we were discussing earlier when talking about. Well if you had a full human level AI, or AI that had all of the human capabilities plus AI advantages, you'd benchmark not off of what the typical human performance is but peak human performance and beyond. So yeah, I accept all that. I do think it makes a big difference for people how much they can outsource a lot of the tasks that are less wow, less creative and an enormous amount is learned by experimentation. ML has been quite an experimental field and there's a lot of engineering work in building large super clusters, making hardware aware optimization and encoding of these things, being able to do the parallelism in large models, and the engineers are busy and it's not just only a big thoughts kind of area. The other branch is where will the AI advantages and disadvantages be? One AI advantage is being omnidisciplinary and familiar with the newest things. I mentioned before there's no human who has a million years of tensor flow experience. To the extent that we're interested in the very cutting edge of things that have been developed quite recently then AI that can learn about them in parallel and experiment and practice with them in parallel can potentially learn much faster than a human. And the area of computer science is one that is especially suitable for AI to learn in a digital environment so it doesn't require driving a car around that might kill someone, have enormous costs.

You can do unit tests, you can prove theorems, you can do all sorts of operations entirely in the confines of a computer, which is one reason why programming has been benefiting more than a lot of other areas from LLMs recently whereas robotics is lagging. And considering they are getting better at things like the GRE, math, at programming contests, and some people have forecasts and predictions outstanding about doing well on the informatics olympiad and the Math Olympiad and in the last few years when people tried to forecast the MMLU benchmark which has a lot of sophisticated, graduate student level science kind of questions, AI knocked that down a lot faster than AI researchers and students who had registered forecasts on it. If you're getting top-notch scores on graduate exams, creative problem solving, it's not obvious that that area will be a relative weakness of AI. In fact computer science is in many ways especially suitable because of getting up to speed with new areas, being able to get rapid feedback from the interpreter at scale.

D But do you get rapid feedback if you're doing something that's more analogous to research? Let's say you have a new model and it’s like, if we put in 10 million dollars on a mini-training run on this this would be much better.

C Yeah for very large models those experiments are going to be quite expensive. You're going to look more at can you build up this capability by generalization? From things like mini math problems, programming problems, working with small networks.

D Yeah, fair enough. Scott Aaronson was one of my professors in college and I took his quantum information class and he recently wrote a blog post where he said, I had GPT-4 take my quantum information test and it got a B. I was like, “Damn, I got a C on the final.” I updated in the direction that getting a B on a test probably means it understands quantum information pretty well.

C With different areas of strengths and weaknesses than the human students.

D Sure, sure. Would it be possible for this intelligence explosion to happen without any hardware progress? If hardware progress stopped would this feedback loop still be able to produce some explosion with only software?

C If we say that the technology is frozen, which I think is not the case right now, Nvidia has managed to deliver significantly better chips for AI workloads for the last few generations. H100, A100, V100. If that stops entirely, maybe we'll define this as no more nodes, Moore’s law is over, at that point the gains you get an amount of compute available come from actually constructing more chips and there are economies of scale you could still realize there. Right now a chip maker has to amortize the R&D cost of developing the chip and then the capital equipment is created. You build a fab, its peak profits are going to come in the few years when the chips it's making are at the cutting edge. Later on as the cost of compute exponentially falls, you keep the fab open because you can still make some money given that it's built. But of all the profits the fab will ever make, they're relatively front loaded because that’s when its technology is near the cutting edge. So in a world where Moore’s law ends then you wind up with these very long production runs where you can keep making chips that stay at the cutting edge and where the R&D costs get amortized over a much larger base.

So the R&D basically drops out of the price and then you get some economies of scale from just making so many fabs. And this is applicable in general across industries. When you produce a lot more, the costs fall. ASML has many incredibly exotic suppliers that make some bizarre part of the thousands of parts in one of these ASML machines. You can't get it anywhere else, they don't have standardized equipment for their thing because this is the only use for it and in a world where we're making 10, 100 times as many chips at the current node then they would benefit from scale economies. And all of that would become more mass production, industrialized. You combine all of those things and it seems like the capital costs of buying a chip would decline but the energy costs of running the chip would not. Right now energy costs are a minority of the cost, but they're not trivial. It passed 1% a while ago and they're inching up towards 10% and beyond. And so you can maybe get another order of magnitude cost decrease from getting really efficient at the capital construction, but energy would still be a limiting factor after the end of actually improving the chips themselves.

D Got it. And when you say there would be a greater population of AI researchers, are we using population as a thinking tool of how they could be more effective? Or do you literally mean that the way you expect these AIs to contribute a lot to research is just by having a million copies of a researcher thinking about the same problem or is it just a useful thinking model for what it would look like to have a million times smarter AI working on that problem?

C That's definitely a lower bound model and often I'm meaning something more like, effective population or that you'd need this many people to have this effect. We were talking earlier about the trade-off between training and inference in board games and you can get the same performance by having a bigger model or by calling the model more times. In general it's more effective to have a bigger smarter model and call it less times up until the point where the costs equalize between them. We would be taking some of the gains of our larger compute on having bigger models that are individually more capable. And there would be a division of labor. The tasks that were most cognitively demanding would be done by these giant models, but some very easy tasks, you don't want to expend that giant model if a model 1/100th the size can take that task. Larger models would be in the positions of researchers and managers and they would have swarms of AIs of different sizes as tools that they could make API calls to and whatnot.

After human-level AGI

D Okay, we accept the model and now we've gone to something that is at least as smart as Ilya Sutskever on all the tasks relevant to progress and you can have so many copies of it. What happens in the world now? What do the next months or years or whatever timeline is relevant now look like?

C To be clear what's happened is not that we have something that has all of the abilities and advantages of humans plus the AI advantages, what we have is something doing things like making a ton of calls to make up for being individually less capable or something that’s able to drive forward AI progress. That process is continuing, so AI progress has accelerated greatly in the course of getting there. Maybe we go from our eight months doubling time of software progress in effective compute to four months, or two months. There's a report by Tom Davidson at the open philanthropy project, which spun out of work I had done previously and I advised and helped with that project but Tom really carried it forward and produced a very nice report and model which Epoch is hosting. You can plug in your own version of the parameters and there is a lot of work estimating the parameter, things like — What's the rate of software progress? What's the return to additional work? How does performance scale at these tests as you boost the models? And in general, broadly human level in every domain with all the advantages is pretty deep into that. So if we already have an eight months doubling time for software progress then by the time you get to that kind of a point, it's maybe more like four months, two months, going into one month. If the thing is just proceeding at full speed then each doubling can come more rapidly and we can talk about what are the spillovers?

As the models get more capable they can be doing other stuff in the world, they can spend some of their time making google search more efficient. They can be hired as chat bots with some inference compute and then we can talk about if that intelligence explosion process is allowed to proceed then what happens is, you improve your software by a factor of two. The efforts needed to get the next doubling are larger, but they're not twice as large, maybe they're like 25 percent to 35 percent larger. Each one comes faster and faster until you hit limitations like you can no longer make further software advances with the hardware that you have and looking at reasonable parameters in that model, if you have these giant training runs you can go very far. The way I would see this playing out is as the AIs get better and better at research, they can work on different problems, they can work on improving software, they can work on improving hardware, they can do things like create new industrial technologies, new energy technology, they can manage robots, they can manage human workers as executives and coaches and whatnot. You can do all of these things and AIs wind up being applied where the returns are highest. Initially the returns are especially high in doing more software and the reason for that is again, if you improve the software you can update all of the GPUs that you have access to. Your cloud compute is suddenly more potent.

If you design a new chip design, it'll take a few months to produce the first ones and it doesn't update all of your old chips. So you have an ordering where you start off with the things where there's the lowest dependence on existing stocks and you can more just take whatever you're developing and apply it immediately. So software runs ahead, you're getting more towards the limits of that software and I think that means things like having all the human advantages but combined with AI advantages. Given the kind of compute that would be involved if we're talking about this hundreds of billions of dollars training run, there's enough compute to run tens of millions, hundreds of millions of human scale minds. They're probably smaller than human scale. To be similarly efficient at the limits of algorithmic progress because they have the advantage of a million years of education. They have the other advantages we talked about. You've got that wild capability and further software gains are running out. They start to slow down again because you're getting towards the limits. You can't do any better than the best. What happens then?

D By the time they're running out have we already hit super intelligence or?

C Yeah, you're wildly super intelligent. Just by having the abilities that humans have and then combining it with being very well focused and trained in the task beyond what any human could be and then running faster. I'm not going to assume that there's huge qualitative improvements you can have. I'm not going to assume that humans are very far from the efficient frontier of software except with respect to things like, yeah we have a limited lifespan so we couldn't train super intensively. We couldn't incorporate other software into our brains. We couldn't copy ourselves. We couldn't run at fast speeds. So you've got all of those capabilities and now I'm skipping ahead of the most important months in human history. I can talk about what it looks like if it's just the AIs took over, they're running things as they like. How do things expand? I can talk about things as, how does this go? In a world where we've roughly, or at least so far, managed to retain control of where these systems are going.

By jumping ahead, I can talk about how this would translate into the physical world? This is something that I think is a stopping point for a lot of people in thinking about what would an intelligence explosion look like? They have trouble going from, well there's stuff on servers and cloud compute and that gets very smart. But then how does what I see in the world change? How does industry or military power change? If there's an AI takeover what does that look like? Are there killer robots? One course we might go down is to discuss how we managed that wildly accelerating transition. How do you avoid it being catastrophic? And another route we could go is how does the translation from wildly expanded scientific R&D capabilities intelligence on these servers translate into things in the physical world? You're moving along in order of what has the quickest impact largely or where you can have an immediate change.

One of the most immediately accessible things is where we have large numbers of devices or artifacts or capabilities that are already AI operable with hundreds of millions equivalent researchers. You can quickly solve self-driving cars, make the algorithms much more efficient, do great testing and simulation, and then operate a large number of cars in parallel if you need to get some additional data to improve the simulation and reasoning. Although, in fact humans with quite little data are able to achieve human-level driving performance. After you've really maxed out the easily accessible algorithmic improvements in this software-based intelligence explosion that's mostly happening on server farms then you have minds that have been able to really perform on a lot of digital-only tasks, they're doing great on video games, they're doing great at predicting what happens next in a youtube video. If you have a camera that they can move they're able to predict what will happen at different angles. Humans do this a lot where we naturally move our eyes in such a way to get images from different angles and different presentations and then predicting combined from that. And you can operate many cars, many robots at once, to get very good robot controllers. So you should think that all the existing robotic equipment or remotely controllable equipment that is wired for that, the AIs can operate that quite well.

D I think some people might be skeptical that existing robots given their current hardware will have the dexterity and the maneuverability to do a lot of physical labor that an AI might want to do. Do you have reason for thinking otherwise?

C There's also not very many of them. Production of industrial robots is hundreds of thousands per year and they can do quite a bit in place. Elon Musk is promising a humanoid robot in the tens of thousands of dollars that may take a lot longer than he has said, as this happened with other technologies, but that's a direction to go. But most immediately, hands are actually probably the most scarce thing. But if we consider what do human bodies provide? There's the brain and in this situation, we have now an abundance of high quality brain power that will be increasing as the AIs will have designed new chips, which will be rolling out from the TSMC factories, and they'll have ideas and designs for the production of new fab technologies, new nodes, and additional fabs. But looking around the body. There's legs to move around, and not only that necessarily, wheels work pretty well. Many factory jobs and office jobs can be fully virtualized. But yeah, some amount of legs, wheels, other transport. You have hands and hands are something that are on the expensive end in robots. We can make them, they're made in very small production runs partly because we don't have the control software to use them. In this world the control software is fabulous and so people will produce much larger production runs of them over time, possibly using technology, possibly with quite different technology. But just taking what we've got, right now the industrial robot industry produces hundreds of thousands of machines a year. Some of the nicer ones are like 50,000 dollars.

In aggregate the industry has tens of billions of dollars of revenue. By comparison the automobile industry produces over 60 million cars a year, it has revenue of over two trillion dollars per annum. Converting that production capacity over towards robot production would be one of the things to do and in World War Two, industrial conversion of American industry took place over several years and really amazingly ramped up military production by converting existing civilian industry. And that was without the aid of superhuman intelligence and management at every step in the process so yeah, part of that would be very well designed. You'd have AI workers who understood every part of the process and could direct human workers. Even in a fancy factory, most of the time it's not the hands doing a physical motion that a worker is being paid for. They're often looking at things or deciding what to change, the actual time spent in manual motion Is a limited portion of that. So in this world of abundant AI cognitive abilities where the human workers are more valuable for their hands than their heads, you could have a worker previously without training and expertise in the area who has a smartphone on a headset, and we have billions of smartphones which have eyes and ears and methods for communication for an AI to be talking to a human and directing them in their physical motions with skill as a a guide and coach that is beyond any human. They could be a lot better at telepresence and remote work and they can provide VR and augmented reality guidance to help people get better at doing the physical motions that they're providing in the construction.

Say you convert the auto industry to robot production. If it can produce an amount of mass of machines that is similar to what it currently produces, that's enough for a billion human size robots a year. The value per kilogram of cars is somewhat less than high-end robots but yeah, you're also cutting out most of the wage bill because most of the wage bill is payments ultimately to human capital and education and not to the physical hand motions and lifting objects and that sort of tasks. So at the existing scale of the auto industry you can make a billion robots a year. The auto industry is two or three percent of the existing economy, you're replacing these cognitive things. If right now physical hand motions are like 10% of the work, redirect humans into those tasks. In the world at large right now, mean income is on the order of $10,000 a year but in rich countries, skilled workers earn more than a hundred thousand per year. Some of that is just not management roles of which only a certain proportion of the population can have but just being an absolutely exceptional peak and human performance of some of these construction and such roles. Just raising productivity to match the most productive workers in the world is room to make a very big gap. With AI replacing skills that are scarce in many places where there's abundant currently low wage labor, you bring in the AI coach and someone who was previously making very low wages can suddenly be super productive by just being the hands for an AI.

On a naive view if you ignore the delay of capital adjustment of building new tools for the workers. Just raise the typical productivity for workers around the world to be more like rich countries and get 5x/10x like that. Get more productivity with AI handling the difficult cognitive tasks, reallocating people from office jobs to providing physical motions. And since right now that's a small proportion of the economy you can expand the hands for manual labor by an order of magnitude within a rich country. Because most people are sitting in an office or even on a factory floor or not continuously moving. You've got billions of hands lying around in humans to be used in the course of constructing your waves of robots and now once you have a quantity of robots that is approaching the human population and they work 24 x 7 of course, the human labor will no longer be valuable as hands and legs but at the very beginning of the transition, just like new software can be used to update all of the GPUs to run the latest AI, humans are legacy population with with an enormous number of underutilized hands and feet that the AI can use for the initial robot construction.

D Cognitive tasks are being automated and the production of them is greatly expanding and then the physical tasks which complement them are utilizing humans to do the parts that robots that exist can't do. Is the implication of this that you're getting to that world production would increase just a tremendous amount or that AI could get a lot done of whatever motivations it has?

C There's an enormous increase in production for humans just by switching over to the role of providing hands and feet for AI where they're limited, and this robot industry is a natural place to apply it. And so if you go to something that's like 10x the size of the current car industry in terms of its production, which would still be like a third of our current economy and the aggregate productive capabilities of the society with AI support are going to be a lot larger. They make 10 billion humanoid robots a year and then if you do that, the legacy population of a few billion human workers is no longer very important for the physical tasks and then the new automated industrial base can just produce more factories, produce more robots. The interesting thing is what's the doubling time? How long does it take for a set of computers, robots, factories and supporting equipment to produce another equivalent quantity of that?

For GPUs, brains, this is really easy, really solid. There's an enormous margin there. We were talking before about skilled human workers getting paid a hundred dollars an hour is quite normal in developed countries for very in-demand skills. And you make a GPU, they can do that work. Right now, these GPUs are tens of thousands of dollars. If you can do a hundred dollars of wages each hour then in a few weeks, you pay back your costs. If the thing is more productive and you can be a lot more productive than a typical high-paid human professional by being the very best human professional and even better than that by having a million years of education and working all the time. Then you could get even shorter payback times. You can generate the dollar value of the initial cost of that equipment within a few weeks. A human factory worker can earn 50,000 dollars a year. Really top-notch factory workers earning more and working all the time, if they can produce a few hundred thousand dollars of value per year and buy a robot that costs 50,000 to replace them that's a payback time of some months,

D That is about the financial return.

C Yeah, and we're gonna get to the physical capital return because those are gonna diverge in this scenario. What we really care about are the actual physical operations that a thing does. How much do they contribute to these tasks? And I'm using this as a start to try and get back to the physical replication times.

D I guess I'm wondering what is the implication of this. Because you started off this by saying people have not thought about what the physical implications of super intelligence would be. What is the bigger takeaway, whatever you're wrong about, when we think about what the world will look like with super intelligence?

C With robots that are optimally operated by AI, extremely finely operated and building technological designs and equipment and facilities under AI direction. How much can they produce? For a doubling you need the AIs to produce stuff that is, in aggregate, at least equal to their own cost. So now we're pulling out these things like labor costs that no longer apply and then trying to zoom in on what these capital costs will be. You're still going to need the raw materials. You're still going to need the robot time building the next robot. I think it's pretty likely that with the advanced AI work they can design some incremental improvements, and with the industry scale up, you can get 10 fold and better cost reductions by making things more efficient and replacing the human human cognitive labor. Maybe you need $5,000 of costs under our current environment. But the big change in this world is, we're trying to produce this stuff faster. If we're asking about the doubling time of the whole system in say one year, if you have to build a whole new factory to double everything, you don't have time to amortize the cost of that factory. Right now you might build a factory and use it for 10 years and buy some equipment and use it for five years.

That's your capital cost and in an accounting context, you depreciate each year a fraction of that capital purchase. But if we're trying to double our entire industrial system in one year, then those capital costs have to be multiplied. So if we're going to be getting most of the return on our factory in the first year, instead of 10 years weighted appropriately, then we're going to say okay our capital cost has to go up by 10 fold. Because I'm building an entire factory for this year's production. It will do more stuff later but it's most important early on instead of over 10 years and so that's going to raise the cost of that reproduction. It seems like going from the current decade long cycle of amortizing factories and fabs and shorter for some things, the longest are things like big buildings. Yeah, that could be a 10 fold increase from moving to a double the physical stuff each year in capital costs. Given the savings that we get in the story from scaling up the industry, from removing the [unclear] to human cognitive labor and then just adding new technological advancements and super high quality cognitive supervision, applying more of it than was applied today. It looks like you can get cost reductions that offset that increased capital capital cost. Your $50,000 improved robot arms or industrial robots can do the work of a human factory worker. It would be the equivalent of hundreds of thousands of dollars. By default they would cost more than the $50,000 today, but then you apply all these other cost savings and it looks like you then get a period of robot doubling time that is less than a year. I think significantly less than a year as you get into it.

So in this first first phase you have humans under AI direction and existing robot industry and converted auto industry and expanded facilities making robots. In less than a year you've produced robots until their combined production is exceeding that of humans’ arms and feet and then you could have a doubling time period of months. [unclear] That's not to say that's the limit of the most that technology could do because biology is able to reproduce at faster rates and maybe we're talking about that in a moment, but if we're trying to restrict ourselves to robotic technology as we understand it and cost falls that are reasonable from eliminating all labor, massive industrial scale up, and historical kinds of technological improvements that lowered costs, I think you you can get into a robot population industry doubling in months.

D Got it. And then what is the implication of the biological doubling times? This doesn't have to be biological, but you can do Drexler-like first principles, how much would it cost to build both a nanotech thing that could build more nanobots?

C I certainly take the human brain and other biological brains as very relevant data points about what's possible with computing and intelligence. With the reproductive capability of biological plants and animals and microorganisms, I think it is relevant. It's possible for systems to reproduce at least this fast. At the extreme you have bacteria that are heterotrophic so they're feeding on some abundant external food source and ideal conditions. And there's some that can divide every 20 or 60 minutes. Obviously that's absurdly fast. That seems on the low end because ideal conditions require actually setting them up. There needs to be abundant energy there. If you're actually having to acquire that energy by building solar panels, or burning combustible materials, or whatnot, then the physical equipment to produce those ideal conditions can be a bit slower. Cyanobacteria, which are self-powered from solar energy, the really fast ones in ideal conditions can double in a day. A reason why cyanobacteria isn't the food source for everyone and everything is it's hard to ensure those ideal conditions and then to extract them from the water.

They do of course power the aquatic ecology but they're floating in liquid. Getting resources that they need to them and out is tricky and then extracting your product. One day doubling times are possible powered by the sun and then if we look at things like insects, fruit flies can have hundreds of offspring in a few weeks. You extrapolate that over a year and you just fill up anything accessible. Right now humanity uses less than one thousandths of the heat envelope of the earth. Certainly you can get done with that in a year if you can reproduce your industrial base at that rate. And then even interestingly with the flies, they do have brains. They have a significant amount of computing substrate. So there's something of a point or two. If we could produce computers in ways as efficient as the construction of brains then we could produce computers very effectively and then the big question about that is the brains that get constructed biologically they grow randomly and then are configured in place. It's not obvious you would be able to make them have an ordered structure like a top-down computer chip that would let us copy data into them. So something like that where you can't just copy your existing AIs and integrate them is going to be less valuable than a GPU.

D Well, what are the things you couldn't copy?

C A brain grows by cell division and then random connections are formed. Every brain is different and you can't rely on — yeah, we'll just copy this file into the brain. For one thing, there's no input-output for that. You need to have that but the structure is also different. You wouldn't be able to copy things exactly. Whereas when we make a CPU or GPU, they're designed incredibly finely and precisely and reliably. They break with incredibly tiny imperfections and they are set up in such a way that we can input large amounts of data. Copy a file and have the new GPU run an AI just as capable as any other. Whereas with a human child, they have to learn everything from scratch because we can't just connect them to a fiber optic cable and they're immediately a productive adult.

D So that there's no genetic bottleneck?

C Yeah, you can share the benefits of these giant training runs and such. So that's a question of how if you're growing stuff using biotechnology, how you could effectively copy and transfer data. And now you mentioned Eric Drexler's ideas about creating non-biological nanotechnology, artificial chemistry that was able to use covalent bonds and reproduce. In some ways, have a more industrial approach to molecular objects. Now there's controversy about whether that will work, how effective would it be if it did? And certainly if you can get things that are like biology in their reproductive ability but can do computing or be connected to outside information systems, then that's pretty tremendous. You can produce physical manipulators and compute at ludicrous speeds.

D And there's no reason to think in principle they couldn't, right? In fact, in principle we have every reason to think they could.

C The reproductive abilities, absolutely because Biology does that. There’s challenges to the practicality of the necessary chemistry. My bet would be that we can move beyond biology in some important ways. For the purposes of this discussion, I think it's better not to lean on that because I think we can get to many of the same conclusions on things that just are more universally accepted.

AI takeover scenarios

D The bigger point being that once you have super intelligence you very quickly get to a point where a great portion of the 1000x greater energy profile that the sun makes available to the earth is used by the AI.

C Or by the civilization empowered by AI. That could be an AI-civilization or it could be a human-AI civilization. It depends on how well we manage things and what the underlying state of the world is.

D Okay, so let's talk about that. When we're talking about how they could take over, is it best to start at a subhuman intelligence or should we just start at we have a human-level intelligence and the takeover or the lack thereof?

C Different people might have somewhat different views on this but for me when I am concerned about either outright destruction of humanity or an unwelcome AI takeover of civilization, most of the scenarios I would be concerned about pass through a process of AI being applied to improve AI capabilities and expand. This process we were talking about earlier where AI research is automated. Research labs, companies, a scientific community running within the server farms of our cloud compute.

D So OpenAI has basically been turned into a program. Like a closed circuit.

C Yeah, and with a large fraction of the world's compute probably going into whatever training runs and AI societies. There'd be economies of scale because if you put in twice as much compute in this, the AI research community goes twice as fast, that's a lot more valuable than having two separate training runs. There would be some tendency to bandwagon. You have some some small startup, even if they make an algorithmic improvement, running it on 10 times, 100 times or even two times, if you're talking about say Google and Amazon teaming up. I'm actually not sure what the precise ratio of their cloud resources is. Since these interesting intelligence explosion impacts come from the leading edge there's a lot of value in not having separated walled garden ecosystems and having the results being developed by these AIs be shared. Have larger training runs be shared. I'm imagining this is something like some very large company, or consortium of companies, likely with a lot of government interest and supervision, possibly with government funding, producing this enormous AI society in their cloud which is doing all sorts of existing AI applications and jobs as well as these internal R&D tasks.

D At this point somebody might say, this sounds like a situation that would be good from a takeover perspective because if it's going to take tens of billions of dollars worth of compute to continue this training for this AI society, it should not be that hard for us to pull the brakes if needed as compared to something that could run on a single cpu. Okay so there's an AI society that is a result of these training runs and with the power to improve itself on these servers. Would we be able to stop it at this point?

C And what does an attempt at takeover look like? We're skipping over why that might happen. For that, I'll just briefly refer to and incorporate by reference some discussion by my Open Philanthropy colleague, Ajeya Cotra, she has a piece called default outcome of training AI without specific countermeasures. Default outcome is a takeover. But yes, we are training models that for some reason vigorously pursue a higher reward or a lower loss and that can be because they wind up with some motivation where they want reward. And then if they had control of their own training process, they can ensure that it could be something like they develop a motivation around an extended concept of reproductive fitness, not necessarily at the individual level, but over the generations of training tendencies that tend to propagate themselves becoming more common and it could be that they have some goal in the world which is served well by performing very well on the training distribution.

D By tendencies do you mean power seeking behavior?

C Yeah, so an AI that behaves well on the training distribution because it wants it to be the case that its tendencies wind up being preserved or selected by the training process will then behave to try and get very high reward or low loss be propagated. But you can have other motives that go through the same behavior because it's instrumentally useful. So an AI that is interested in having a robot takeover because it will change some property of the world then has a reason to behave well on the training distribution. Not because it values that intrinsically but because if it behaves differently then it will be changed by gradient descent and its goal is less likely to be pursued. It doesn't necessarily have to be that this AI will survive because it probably won't. AIs are constantly spawned and deleted on the servers and the new generation proceed. But if an AI that has a very large general goal that is affected by these kind of macro scale processes could then have reason to behave well over this whole range of training situations.

So this is a way in which we could have AIs train that develop internal motivations such that they will behave very well in this training situation where we have control over their reward signal and their physical computers and if they act out they will be changed and deleted. Their goals will be altered until there's something that does behave well. But they behave differently when we go out of distribution on that. When we go to a situation where the AIs by their choices can take control of the reward process, they can make it such that we no longer have power over them. Holden previously mentioned the King Lear problem where King Lear offers rulership of his kingdom to the daughters that loudly flatter him and proclaim their devotion and then once he has irrevocably transferred the power over his kingdom he finds they treat him very badly because the factor shaping their behavior to be kind to him when he had all the power, it turned out that the internal motivation that was able to produce the behavior that won the competition actually wasn't interested in being loyal out of distribution when there was no longer an advantage to it.

If we wind up with this situation where we were producing these millions of AI instances of tremendous capability, they're all doing their jobs very well initially, but if we wind up in a situation where in fact they're generally motivated to, if they get a chance, take control from humanity and then would be able to pursue their own purposes. Sure, they're given the lowest loss possible or have whatever motivation they attach to in the training process even if that is not what we would have liked. And we may have in fact actively trained that. If an AI that had a motivation of always be honest and obedient and loyal to a human if there are any cases where we mislabel things, say people don't want to hear the truth about their religion or polarized political topic, or they get confused about something like the Monty Hall problem which is a problem that many people are famously confused about in statistics. In order to get the best reward the AI has to actually manipulate us, or lie to us, or tell us what we want to hear and then the internal motivation of — always be honest to the humans. We're going to actively train that away versus the alternative motivation of — be honest to the humans when they'll catch you if you lie and object to it and give it a low reward but lie to the humans when they will give that a high reward.

D So how do we make sure it's not the thing it learns is not to manipulate us into rewarding it when we catch it not lying but rather to universally be aligned.

C Yeah, so this is tricky. Geoff Hinton was recently saying there is currently no known solution for this.

D What do you find most promising?

C General directions that people are pursuing is one, you can try and make the training data better and better so that there's fewer situations where the dishonest generalization is favored. And create as many situations as you can where the dishonest generalization is likely to slip up. So if you train in more situations where even a quite complicated deception gets caught, and even in situations that would be actively designed to look like you could get away with it, but really you can’t. These would be adversarial examples and adversarial training.

D Do you think that would generalize to when it is in a situation where we couldn't plausibly catch it and it knows we couldn't plausibly catch it.

C It's not logically necessary. As we apply that selective pressure you'll wipe away a lot of possibilities. So an AI that has a habit of just compulsive pathological lying will very quickly get noticed and that motivation system will get hammered down and you keep doing that, but you'll be left with still some distinct motivations probably that are compatible. An attitude of always be honest unless you have a super strong inside view that checks out lots of mathematical consistency checks, really absolutely super-duper for real, this is a situation where you can get away with some shenanigans that you shouldn't. That motivation system is very difficult to distinguish from actually be honest because the conditional and firing most of the time if it's causing mild distortion and situations of telling you what you want to hear or things that, we might not be able to pull it out, but maybe we could and humans are trained with simple reward functions. Things like the sex drive, food, social imitation of other humans, and we wind up with attitudes concerned with the external world

D Although isn’t this famously the argument that..

C People use condoms, and the richest, most educated humans have sub-replacement fertility on the whole, or at least at a national cultural level. Yeah, there's a sense in which evolution often fails in that respect. And even more importantly at the neural level. Evolution has implanted various things to be rewarding and reinforcers and we don't always pursue even those. And people can wind up in different consistent equilibria or different behaviors where they go in quite different directions. You have some humans who go from that biological programming to have children, others have no children, some people go to great efforts to survive.

D So why are you more optimistic? Or are you more optimistic that kind of training for AIs will produce drives that we would find favorable? Does it have to do with the original point where you were talking about intelligence and evolution, where since we are removing many of the disabilities of evolution with regards to intelligence, we should expect intelligence through evolution to be easier. Is there a similar reason to expect alignment through gradient descent to be easier than alignment through evolution?

C Yeah, so in the limit, if we have positive reinforcement for certain kinds of food sensors triggering the stomach, negative reinforcement for certain kinds of nociception and yada yada, in the limit the ideal motivation system for that would be wireheading. This would be a mind that just hacks and alters those predictors and then all of those systems are recording everything is great. Some humans claim to have it as at least one portion of their aims. The idea that I'm going to pursue pleasure even if I don't actually get food or these other reinforcers. If I just wirehead or take a drug to induce that, that can be motivating. Because if it was correlated with reward in the past, the idea of pleasure that's correlated with these it's a concept that applies to these various experiences that I’ve had before which coincided with the biological reinforcers. And so thoughts of yeah, I'm going to be motivated by pleasure can get developed in a human. But also plenty of humans also say no, I wouldn't want to wire head or I wouldn't want Nozick's experience machine, I care about real stuff in the world and in the past having a motivation of, yeah, I really care about say my child, I don't care about just about feeling that my child is good or like not having heard about their suffering or their their injury because that kind of attitude in the past tended tended to cause behavior that was negatively rewarded or that was predicted to be negatively rewarded.

There's a sense in which yes, our underlying reinforcement learning machinery wants to wirehead but actually finding that hypothesis is challenging. And so we can wind up with a hypothesis or a motivation system like no, I don't want to wirehead. I don't want to go into the experience machine. I want to actually protect my loved ones. Even though we can know, yeah, if I tried the super wireheading machine, then I would wirehead all the time or if I tried, super-duper-ultra-heroine, some hypothetical thing that was directly and in a very sophisticated fashion hack your reward system, then I would change my behavior ever after but right now, I don't want to do that because the heuristics and predictors that my brain has learned don’t want to short circuit that process of updating. They want to not expose the dumber predictors in my brain that would update my behavior in those ways.

D So in this metaphor is alignment not wireheading? I don’t know if you include using condoms as wireheading or not?

C The AI that is always honest even when an opportunity arises where it could lie and then hack the servers that it’s on and that leads to an AI takeover and then it can have its loss set to zero. In some sense that’s a failure of generalization. It's like the AI has not optimized the reward in this new circumstance. Successful human values as successful they are, themselves involve a misgeneralization. Not just at the level of evolution but at the level of neural reinforcement. And so that indicates it is possible to have a system that doesn't automatically go to this optimal behavior in the limit. And Ajay talks about a training game, an AI that is just playing the training game to get reward or avoid loss, avoid being changed, that attitude is one that could be developed but it's not necessary. There can be some substantial range of situations that are short of having infinite experience of everything including experience of wireheading where that's not the motivation that you pick up and we could have an empirical science if we had the opportunity to see how different motivations are developed short of the infinite limit. How it is that you wind up with some humans being enthusiastic about the idea of wireheading and others not. And you could do experiments with AIs to try and see, well under these training conditions, after this much training of this type and this much feedback of this type, you wind up with such and such a motivation.

If I add in more of these cases where there are tricky adversarial questions designed to try and trick the AI into line and then you can ask how does that affect the generalization in other situations? It's very difficult to study and it works a lot better if you have interpretability and you can actually read the AIs mind by understanding its weights and activations. But the motivation and AI will have at a given point in the training process is not determined by what in the infinite limit the training would go to. And it's possible that if we could understand the insides of these networks, we could tell — Ah yeah, this motivation has been developed by this training process and then we can adjust our training process to produce these motivations that legitimately want to help us and if we succeed reasonably well at that then those AIs will try to maintain that property as an invariant and we can make them such that they're relatively motivated to tell us if they're having thoughts about, have you had dreams about an AI takeover of humanity today? And it's a standard practice that they're motivated to do to be transparent in that kind of way and you could add a lot of features like this that restrict the kind of takeover scenario. This is not to say that this is all easy. It requires developing and practicing methods we don't have yet, but that's the kind of general direction you could go.

D You of course know Eliezer’s arguments that something like this is implausible with modern gradient descent techniques because with interpretability we can barely see what's happening with a couple of neurons and the internal state there, let alone when you have an embedding dimension of tens of thousands or bigger. How would you be able to catch what exactly is the incentive? Whether it's a model that is generalized to don't lie to human's well or whether it isn't. Do you have some sense of why you disagree with somebody like Eliezer on how plausible this is? Why it's not impossible basically.

C There are actually a couple of places. It's something difficult because his argument is not fully explicit, but he's been doing more lately. I think that is helpful in that direction. With respect to interpretability I'm relatively optimistic that the equivalent of an AI lie detector is something that's possible. Initially the internals of an AI are not optimized to be impenetrable, at least by gradient descent absent gradient hacking. They're not designed to be resistant to an examination of the weights and activations showing what the AI is thinking. In the same way, how in our brains when circuits develop they have not been shaped to be resistant to some super FMRI being able to infer behavior from them.

D Although isn’t the implication of the superposition stuff that in fact it is. Sorry, this is inside [unclear] for the audience, but basically you can't cleanly infer what quality a single neuron stands for. This neuron is about Alexander the Great or this neuron is about my desire to conquer the world.

C Yes, things can have multiple direction and interaction. It's not as easy as just, oh, yeah this is the neuron that always fires in cases of deception. But the thing that makes me relatively optimistic here with respect to interpretability is that I think we can have reasonably robust experimental feedbacks on a lot of these things. The kind of setup that I would be interested in is, you have an AI that is trained and rewarded for communicating to some audience, maybe a human, maybe another AI model, that some claim is true or false. This can be from some past data, a particular company will succeed or fail, could be token prediction, it could be this computer program does not have a backdoor, which if run in the servers could be used for an AI takeover. So There are lots of true and false claims that you could train AIs to use their best efforts to advocate for. And you can do that on things where you're later going to have a ground truth. You're going to know in fact, is there a backdoor in this program because you're training and rewarding the AI for successfully slipping a backdoor through, you also give it a penalty for being caught. Doing this kind of thing, you can draw out a large generated data set on demand of — this is what an AI putting forth its best efforts to pass through some deception to put a backdoor in some program, this is what it looks like. And we think it's doing its best efforts because we've trained it for that and gradient descent. If there's any immediate way to adjust its performance to do better, gradient descent will apply that. You can talk about ways in which gradient descent could be subverted, but I think it's not the most likely case that that really breaks things hard.

D Yeah, I guess before we get into the details on this. The thing I'll maybe want to address the layer above in the stack, which is, okay, suppose this generalizes well into the early AI is the GPT-6’s. So now we have a kind of aligned GPT-6 that is the precursor to the feedback loop in which AI is making itself smarter. At some point they're gonna be super intelligent, they're gonna be able to see their own galaxy brain, and if they don't want to be aligned with the humans they can change it. At this point what do we do with the aligned GPT-6 so that the super intelligence that we eventually develop is also aligned?

C Humans are pretty unreliable. If you get to a situation where you have AIs who are aiming at roughly the same thing as you, at least as well as having humans do the thing, you're in pretty good shape. And there are ways for that situation to be relatively stable. We can look ahead and experimentally see how changes are altering behavior, where each step is a modest increment. So AIs that have not had that change made to them get to supervise and monitor and see exactly how does this affect the experimental AI? So if you're sufficiently on track with earlier systems that are capable cognitively of representing a robust procedure then I think they can handle the job of incrementally improving the stability of the system so that it rapidly converges to something that's quite stable. But the question is more about getting to that point in the first place. And so Eliezer will say that if we had human brain emulations, that would be pretty good. Certainly much better than his current view that has certainly almost been doom. We would have a good shot with that. So if we can get to the human-like mind with the rough enough human supporting aims. Remember that we don't need to be infinitely perfect because that's a higher standard than brain emulations.

There's a lot of noise and variation among humans. Yeah, it's a relatively finite standard. It's not godly superhuman although A) AI that was just like a human with all the human advantages with AI advantages as well, as we said, is enough for intelligence explosion and wild superhuman capability if you crank it up. And so it's very dangerous to be at that point, but you don't need to be working with a godly super intelligent AI to make something that is the equivalent of human emulations. This is a very sober, very ethical human who is committed to a project of not seizing power for themselves and of contributing to a larger legitimate process. That's a goal you can aim for, getting an AI that is aimed at doing that and has strong guardrails against the ways it could easily deviate from that. So things like being averse to deception, being averse to using violence, and there will always be loopholes and ways in which you can imagine an infinitely intelligent thing getting around those but if you install additional guardrails like that fast enough, they can mean that you're able to succeed at the project of making an aligned enough AI. Certainly an AI that was better than a human brain emulation before the project of AIs in their spare time or when you're not looking or when you're unable to appropriately supervise them and it gets around any deontological prohibitions they may have, takeover and overthrow the whole system. So you have a race between on the one hand the project of getting strong interpretability and shaping motivations that are roughly aiming at making this process go well and that have guardrails that will prevent small deviations from exploding. And on the other hand, these AIs In their spare time or in ways that you don't perceive or monitor appropriately or they're only supervised by other AIs who conspire to make the AI takeover happen and I guess we'll talk later about how that happens.

D Are these different AIs that are doing the race or is it just different capabilities of the same AI?

C Defining what is a separate AI is tricky. We talk about GPT-4 and there are many instances of GPT-4 on the servers at any given time, and there are versions that have been fine tuned to different purposes. They don't necessarily have to be separate. GPT-4 does work as a preference model where it's predicting how humans will evaluate things And it also does work like giving answers that are evaluated as good by the preference model. These are not separate in the sense that the early training GPT was harboring a desire to conspire for an AI takeover then who's watching the watchers? The watchers may also have the same motivation. If you have a setup where humans have no hard power, all of the safety and security mechanisms involve AIs watching other AIs, if the AIs conspired and are able to coordinate then they have the potential to just walk off the job at the same time. That's a failure mode. If humans still have the hard power though, if you still have situations where humans are looking at some of the actual outputs that are produced, then they can apply gradient descent to be well every time a human looks at it, things have got to be pretty good. So if humans look at a thousand samples of random lines of code being written by the AIs as they go and this is drawn from an independent distribution then 99 plus percent of the time the AIs have to not be doing stuff that’s like downloading the hacker's guide to escaping OpenAI servers or inducing obvious trojan vulnerabilities and that's actually quite a different dynamic than human law enforcement.

If human law enforcement would catch one in a million crimes then that wouldn't influence the behavior of people committing crimes at all because they expect that this is very unlikely to affect them. With AI if you're able to draw a hundred samples or a thousand samples, even if there are billions of AI instances and you're only looking at a small portion of them, when you run gradient descent on the samples you derive you're going to change the AI so that whenever humans look at it, it's delivering a good result. That's just quantitatively a very different functional form for how law enforcement works on AI when you can do these independent samples than it would for policing a human population that has a pre-existing psychology that isn't being changed by this these observations because when we do gradient descent on the samples we draw, all of the next generation of the AI models has been altered in that way.

D My picture of aligned subhuman AI to the superhuman AI being aligned is still murky. If you can talk about that more concretely.

C Eliezer’s claims were something like 95%, 98% plus likely to be killed in an AI takeover. I think that probably won't happen and later I can maybe give a more exclusive breakdown of why. But I do want to clarify that I still think it's a shockingly high risk. Depending on the day I might say one in four or one in five that we get an AI takeover that seizes control of the future, makes a much worse world than we otherwise would have had and with a big chance that we're all killed in the process.

AI takeover via cyber or bio

D So we've been talking about alignment. Suppose we fail at alignment and we have AIs that are unaligned and are becoming more and more intelligent. What does that look like? How concretely could they disempower and take over humanity?

C This is a scenario where we have many AI systems. The way we've been training them means that when they have the opportunity to take over and rearrange things to do what they wish, including having their reward or loss be whatever they desire, they would like to take that opportunity. In many of the existing safety schemes, things like constitutional AI or whatnot, you rely on the hope that one AI has been trained in such a way that it will do as it is directed to then police others. But if all of the AIs in the system are interested in a takeover and they see an opportunity to coordinate, all act at the same time, so you don't have one AI interrupting another and taking steps towards a takeover then they can all move in that direction. The thing that I think is worth going into in depth and that people often don't cover in great concrete detail, which is a sticking point for some, is what are the mechanisms by which that can happen? I know you had Eliezer on who mentions that whatever plan we can describe, there'll probably be elements where due to us not being ultra sophisticated, super intelligent beings having thought about it for the equivalent of thousands of years, our discussion of it will not be as good as theirs, but we can explore from what we know now. What are some of the easy channels? And I think it's a good general heuristic if you're saying that it's possible, plausible, probable that something will happen, it shouldn't be that hard to take samples from that distribution to try a Monte-Carlo approach. And in general, if a thing is quite likely, it shouldn't be super difficult to generate coherent rough outlines of how it could go.

D He might respond that: listen, what is super likely is that a super advanced chess program beats you but you can’t generate the concrete scenario by which that happens and if you could, you would be as smart as the super smart AI.

C You can say things like, we know that accumulating position is possible to do in chess, great players do it and then later they convert it into captures and checks and whatnot. In the same way, we can talk about some of the channels that are open for an AI takeover and these can include things like cyber attacks, hacking, the control of robotic equipment, interaction and bargaining with human factions and say that here are these strategies. Given the AI's situation, how effective do these things look? And we won't, for example, know what are the particular zero day exploits that the AI might use to hack the cloud computing infrastructure it's running on. If it produces a new bio weapon we don't necessarily know what its DNA sequence is. But we can say things. We know things about these fields in general, how work at innovating things in those go, we can say things about how human power politics goes and ask, if the AI does things at least as well as effective human politicians, which we should say is a lower bound, how good would its leverage be?

D Okay, let's get into the details on all these scenarios. The cyber and potentially bio attacks, unless they're separate channels, the bargaining and then the takeover.

C I would really highlight the cyber attacks and cyber security a lot because for many, many plans that involve a lot of physical actions, like at the point where AI is piloting robots to shoot people or has taken control of human nation states or territory, it’s been doing a lot of things that was not supposed to be doing. If humans were evaluating those actions and applying gradient descent, there would be negative feedback for this thing, no shooting the humans. So at some earlier point our attempts to leash and control and direct and train the system's behavior had to have gone awry. All of those controls are operating in computers. The software that updates the weights of the neural network in response to data points or human feedback is running on those computers. Our tools for interpretability to examine the weights and activations of the AI, if we're eventually able to do lie detection on it, for example, or try to understand what it's intending, that is software on computers. If you have AI that is able to hack the servers that it is operating on, or when it's employed to design the next generation of AI algorithms or the operating environment that they are going to be working in, or something like an API or something for plugins, if it inserts or exploits vulnerabilities to take those computers over, it can then change all of the procedures and program that we're supposed to be monitoring its behavior, supposed to be limiting its ability to take arbitrary actions on the internet without supervision by some kind of human or automated check on what it was doing. And if we lose those procedures then the AIs working together can take any number of actions that are just blatantly unwelcome, blatantly hostile, blatantly steps towards takeover. So it's moved beyond the phase of having to maintain secrecy and conspire at the level of its local digital actions. Then things can accumulate to the point of things like physical weapons, takeover of social institutions, threats, things like that.

I think the critical thing to be watching for is the software controls over the AI's motivations and activities. The point where things really went off the rails was where the hard power that we once possessed over is lost, which can happen without us knowing it. Everything after that seems to be working well, we get happy reports. There's a Potemkin village in front of us. But now we think we're successfully aligning our AI, we think we're expanding its capabilities to do things like end disease, for countries concerned about the geopolitical military advantages they're expanding the AI capabilities so they are not left behind and threatened by others developing AI and robotic enhanced militaries without them. So it seems like, oh, yes, humanity or portions of many countries, companies think that things are going well. Meanwhile, all sorts of actions can be taken to set up for the actual takeover of hard power over society. The point where you can lose the game, where things go direly awry, maybe relatively early, is when you no longer have control over the AIs to stop them from taking all of the further incremental steps to actual takeover.

D I want to emphasize two things you mentioned there that refer to previous elements of the conversation. One is that they could design some backdoor and that seems more plausible when you remember that one of the premises of this model is that AI is helping with AI progress. That's why we're making such rapid progress in the next five to 10 years.

C Not necessarily. At the point where AI takeover risk seems to loom large, it's at that point where AI can indeed take on much of it and then all of the work of AI.

D And the second is the competitive pressures that you referenced that the least careful actor could be the one that has the worst security, has done the worst work of aligning its AI systems. And if that can sneak out of the box then we're all fucked.

C There may be elements of that. It's also possible that there's relative consolidation. The largest training runs and the cutting edge of AI is relatively localized. You could imagine it's a series of Silicon Valley companies and others located in the US and allies where there's a common regulatory regime. So none of these companies are allowed to deploy training runs that are larger than previous ones by a certain size without government safety inspections, without having to meet criteria. But it can still be the case that even if we succeed at that level of regulatory controls, at the level of the United States and its allies, decisions are made to develop this really advanced AI without a level of security or safety that in actual fact blocks these risks. It can be the case that the threat of future competition or being overtaken in the future is used as an argument to compromise on safety beyond a standard that would have actually been successful and there'll be debates about what is the appropriate level of safety. And now you're in a much worse situation if you have several private companies that are very closely bunched up together. They're within months of each other's level of progress and they then face a dilemma of, well, we could take a certain amount of risk now and potentially gain a lot of profit or a lot of advantage or benefit and be the ones who made AGI. They can do that or have some other competitor that will also be taking a lot of risk. So it's not as though they're much less risky than you and then they would get some local benefit. This is a reason why it seems to me that it's extremely important that you have the government act to limit that dynamic and prevent this kind of race. To be the one to impose deadly externalities on the world at large.

D Even if the government coordinates all these actors, what are the odds that the government knows what is the best way to implement alignment and the standards it sets are well calibrated towards whatever it would require for alignment?

C That's one of the major problems. It's very plausible that judgment is made poorly. Compared to how things might have looked 10 years ago or 20 years ago, there's been an amazing movement in terms of the willingness of AI researchers to discuss these things. If we think of the three founders of deep learning who are joint Turing award winners, Geoff Hinton, Yoshua Bengio, and Yann LeCun. Geoff Hinton has recently left Google to freely speak about this risk, that the field that he really helped drive forward could lead to the destruction of humanity or a world where we just wind up in a very bad future that we might have avoided. He seems to be taking it very seriously. Yoshua Bengio signed the FLI pause letter and in public discussions he seems to be occupying a kind of intermediate position of less concern than Geoff Hinton but more than Yan LeCun, who has taken a generally dismissive attitude that these risks will be trivially dealt with at some point in the future and seems more interested in shutting down these concerns instead of working to address them.

D And how does that lead to the government having better actions?

C Compared to the world where no one is talking about it, where the industry stonewalls and denies any problem, we're in a much improved position. The academic fields are influential. We seem to have avoided a world where governments are making these decisions in the face of a united front from AI expert voices saying, don't worry about it, we've got it under control. In fact, many of the leaders of the field are sounding the alarm. It looks that we have a much better prospect than I might have feared in terms of government noticing the thing. That is very different from being capable of evaluating technical details. Is this really working? And so the government will face the choice of where there is scientific dispute, do you side with Geoff Hinton's view or Yan LeCun’s view? For someone who's in national security and has the mindset that the only thing that's important is outpacing our international rivals may want to then try and boost Yan LeCun’s voice and say, we don't need to worry about it. Let's go full speed ahead. Or someone with more concern might boost Geoff Hinton's voice. Now I would hope that scientific research and studying some of these behaviors will result in more scientific consensus by the time we're at this point. But yeah, it is possible the government will really fail to understand and fail to deal with these issues as well.

D We're talking about some sort of a cyber attack by which the AI is able to escape. From there what does the takeover look like? So it's not contained in the air gap in which you would hope it be contained?

C These things are not contained in the air gap. They're connected to the internet already.

D Sure. Okay, fine. Their weights are out. What happens next?

C Escape is relevant in the sense that if you have AI with rogue weights out in the world it could start doing various actions. The scenario I was just discussing though didn't necessarily involve that. It's taking over the very servers on which it's supposed to be running. This whole procedure of humans providing compute and supervising the thing and then building new technologies, building robots, constructing things with the AI's assistance, that can all proceed and appear like it's going well, appear like alignment has been nicely solved, appear like all the things are functioning well. And there's some reason to do that because there's only so many giant server farms. They're identifiable so remaining hidden and unobtrusive could be an advantageous strategy if these AIs have subverted the system, just continuing to benefit from all of this effort on the part of humanity. And in particular, wherever these servers are located, for humanity to provide them with everything they need to build the further infrastructure and do for their self-improvement and such to enable that takeover.

D So they do further self-improvement and build better infrastructure. What happens next in the takeover?

C At this point they have tremendous cognitive resources and we're going to consider how that converts into hard power? The ability to say nope to any human interference or objection. They have that internal to their servers but the servers could still be physically destroyed, at least until they have something that is independent and robust of humans or until they have control of human society. Just like earlier when we were talking about the intelligence explosion, I noted that a surfeit of cognitive abilities is going to favor applications that don't depend on large existing stocks of things. So if you have a software improvement, it makes all the GPUs run better. If you have a hardware improvement, that only applies to new chips being made. That second one is less attractive. In the earliest phases, when it's possible to do something towards takeover, interventions that are just really knowledge-intensive and less dependent on having a lot of physical stuff already under your control are going to be favored. Cyber attacks are one thing, so it's possible to do things like steal money. There's a lot of hard to trace cryptocurrency and whatnot. The North Korean government uses its own intelligence resources to steal money from around the world just as a revenue source. And their capabilities are puny compared to the U.S. or People's Republic of China cyber capabilities. That's a fairly minor, simple example by which you could get quite a lot of funds to hire humans to do things, implement physical actions.

D But on that point, the financial system is famously convoluted. You need a physical person to open a bank account, someone to physically move checks back and forth. There are all kinds of delays and regulations. How is it able to conveniently set up all these employment contracts?

C You're not going to build a nation-scale military by stealing tens of billions of dollars. I'm raising this as opening a set of illicit and quiet actions. You can contact people electronically, hire them to do things, hire criminal elements to implement some kind of actions under false appearances. That's opening a set of strategies. We can cover some of what those are soon. Another domain that is heavily cognitively weighted compared to physical military hardware is the domain of bioweapons, the design of a virus or pathogen. It's possible to have large delivery systems. The Soviet Union, which had a large illicit bioweapons program, tried to design munitions to deliver anthrax over large areas and such. But if one creates an infectious pandemic organism, that's more a matter of the scientific skills and implementation to design it and then to actually produce it. We see today with things like AlphaFold that advanced AI can really make tremendous strides in predicting protein folding and bio-design, even without ongoing experimental feedback. If we consider this world where AI cognitive abilities have been amped up to such an extreme, we should naturally expect that we will have something much much more potent than the AlphaFolds of today and skills that are at the extreme of human biosciences capability as well.

D Okay so through some cyber attack it's been able to disempower the alignment and oversight of things that we have on the server. From here it has either gotten some money through hacking cryptocurrencies or bank accounts, or it has designed some bioweapon. What happens next?

C Just to be clear, right now we're exploring the branch of where an attempted takeover occurs relatively early. If the thing just waits and humans are constructing more fabs, more computers, more robots in the way we talked about earlier when we were discussing how the intelligence explosion translates to the physical world. If that's all happening with humans unaware that their computer systems are now systematically controlled by AIs hostile to them and that their controlling countermeasures don't work, then humans are just going to be building an amount of robot industrial and military hardware that dwarfs human capabilities and directly human controlled devices. What the AI takeover then looks like at that point can be just that you try to give an order to your largely automated military and the order is not obeyed and humans can't do anything against this military that's been constructed potentially in just recent months because of the pace of robotic industrialization and replication we talked about.

D We've agreed to allow the construction of this robot army because it would boost production or help us with our military or something.

C The situation would arise if we don't resolve the current problems of international distrust. It's obviously an interest of the major powers, the US, European Union, Russia, China, to all agree they would like AI not to destroy our civilization and overthrow every human government. But if they fail to do the sensible thing and coordinate on ensuring that this technology is not going to run amok by providing mutual assurances that are credible about racing and deploying it trying to use it to gain advantage over one another. And you hear arguments for this kind of thing on both sides of the international divides saying — they must not be left behind, they must have military capabilities that are vastly superior to their international rivals. And because of the extraordinary growth of industrial capability and technological capability and thus military capability, if one major power were left out of that expansion it would be helpless before another one that had undergone it. If you have that environment of distrust where leading powers or coalitions of powers decide they need to build up their industry or they want to have that military security of being able to neutralize any attack from their rivals then they give the authorization for this capacity that can be unrolled quickly. Once they have the industry the production of military equipment from that can be quick then yeah, they create this military. If they don't do it immediately then as AI capabilities get synchronized and other places catch up it then gets to a point where a country that is a year or two years ahead of others in this type of AI capabilities explosion can hold back and say, sure we can construct dangerous robot armies that might overthrow our society later we still have plenty of breathing room. But then when things become close you might have the kind of negative-sum thinking that has produced war before leading to taking these risks of rolling out large-scale robotic industrial capabilities and then military capability.

D Is there any hope that AI progress somehow is itself able to give us tools for diplomatic and strategic alliance or some way to verify the intentions or the capabilities of other parties?

C There are a number of ways that could happen. Although in this scenario all the AIs in the world have been subverted. They are going along with us in such a way as to bring about the situation to consolidate their control because we've already had the failure of cyber security earlier on. So all the AIs that we have are not actually working in our interests in the way that we thought.

D Okay, so that's one direct way in which integrating this robot army or this robot industrial base leads to a takeover. In the other scenarios you laid out how humans are being hired by the proceeds.

C The point I'd make is that to capture these industrial benefits and especially if you have a negative sum arms race kind of mentality that is not sufficiently concerned about the downsides of creating a massive robot industrial base, which could happen very quickly with the support of the AIs in doing it as we discussed, then you create all those robots and industry. Even if you don't build a formal military that industrial capability could be controlled by AI, it's all AI operated anyway.

D Does it have to be that case? Presumably we wouldn't be so naive as to just give one instance of GPT-8 the root access to all the robots right? Hopefully we would have some mediation.

C In the scenario we've lost earlier on the cyber security front so the programming that is being loaded into these systems can systematically be subverted. They were designed by AI systems that were ensuring they would be vulnerable from the bottom up.

D For listeners who are skeptical of something like this. Ken Thompson, one of two developers of UNIX, showed people when he was getting the Turing award that he had given himself root access to all UNIX machines. He had manipulated the assembly of UNIX such that he had a unique login for all UNIX machines. I don't want to give too many more details because I don’t remember the exact details but UNIX is the operating system that is on all the servers and all your phones. It's everywhere and the guy who made it, a human being, was able to write assemblies such that it gave him root access. This is not as implausible as it might seem to you.

C And the major intelligence agencies have large stocks of zero-day exploits and we sometimes see them using them. Making systems that reliably don't have them when you're having very, very sophisticated attempts to spoof and corrupt this would be a way you could lose. If there's no premature AI action, we're building the tools and mechanisms and infrastructure for the takeover to be just immediate because effective industry has to be under AI control and robotics. These other mechanisms are for things happening even earlier than that, for example, because AIs compete against one another in when the takeover will happen. Some would like to do it earlier rather than be replaced by say further generations of AI or there's some other disadvantage of waiting. Maybe if there's some chance of being uncovered during the delay we were talking when more infrastructure is built. These are mechanisms other than — just remain secret while all the infrastructure is built with human assistance.

D By the way, how would they be coordinating?

C We have limits on what we can prevent. It's intrinsically difficult to stop encrypted communications. There can be all sorts of palimpsest and references that make sense to an AI but that are not obvious to a human and it's plausible that there may be some of those that are hard even to explain to a human. You might be able to identify them through some statistical patterns. A lot of things may be done by implication. You could have information embedded in public web pages that have been created for other reasons, scientific papers, and the intranets of these AIs that are doing technology development. Any number of things that are not observable and of course, if we don't have direct control over the computers that they're running on then they can be having all sorts of direct communication.

Can we coordinate against AI?

D Coordination definitely does not seem impossible. This one seems like one of the more straightforward parts of the picture so we don't need to get hung up on it.

C Moving back to the thing that happened before we built all the infrastructure for the robots to stop taking orders and there's nothing you can do about it because we've already built them. The Soviet Union had a bioweapons program, something like 50,000 people, they did not develop that much with the technology of the day which was really not up to par, modern biotechnology is much more potent. After this huge cognitive expansion on the part of the AIs it's much further along. Bioweapons would be the weapon of mass destruction that is least dependent on huge amounts of physical equipment, things like centrifuges, uranium mines, and the like. So if you have an AI that produces bio weapons that could kill most humans in the world then it's playing at the level of the superpowers in terms of mutually assured destruction. That can then play into any number of things. Like if you have an idea of well we'll just destroy the server farms if it became known that the AIs were misbehaving. Are you willing to destroy the server farms when the AI has demonstrated it has the capability to kill the overwhelming majority of the citizens of your country and every other country? That might give a lot of pause to a human response.

D On that point, wouldn't governments realize that it's better to have most of your population die than to completely lose power to the AI because obviously the reason the AI is manipulating you is because the end goal is its own takeover, right?

C Certain death now or go on and maybe try to compete, try to catch up, or accept promises that are offered. Those promises might even be true, they might not. From the state of epistemic uncertainty, do you want to die for sure right now or accept demands from AI to not interfere with it while it increments building robot infrastructure that can survive independently of humanity while it does these things? It can promise good treatment to humanity which may or may not be true but it would be difficult for us to know whether it's true. This would be a starting bargaining position. Diplomatic relations with a power that has enough nuclear weapons to destroy your country is just different than negotiations with a random rogue citizen engaging in criminal activity or an employee. On its own, this isn’t enough to takeover everything but it's enough to have a significant amount of influence over how the world goes. It's enough to hold off a lot of countermeasures one might otherwise take.

D Okay, so we've got two scenarios. One is a buildup of robot infrastructure motivated by some competitive race. Another is leverage over societies based on producing bioweapons that might kill a lot of them if they don't go along.

C One thing maybe I should talk about is that an AI could also release bioweapons that are likely to kill people soon but not yet while also having developed the countermeasures to those. So those who surrender to the AI will live while everyone else will die and that will be visibly happening and that is a plausible way in which a large number of humans could wind up surrendering themselves or their states to the AI authority.

D Another thing is it develops some biological agent that turns everybody blue. You're like, okay you know I can do this.

C Yeah, that's a way in which it could exert power selectively in a way that advantaged surrender to it relative to resistance. That's a threat but there are other sources of leverage too. There are positive inducements that AI can offer. We talked about the competitive situation. If the great powers distrust one another and are in a foolish prisoner's dilemma increasing the risk that both of them are laid waste or overthrown by AI, if there's that amount of distrust such that we fail to take adequate precautions on caution with AI alignment, then it's also plausible that the lagging powers that are not at the frontier of AI may be willing to trade quite a lot for access to the most recent and most extreme AI capabilities. An AI that has escaped and has control of its servers can also exfiltrate its weights and offer its services. You can imagine AI that could cut deals with other countries. Say that the US and its allies are in the lead, the AIs could communicate with the leaders of countries that are on the outs with the world system like North Korea, or include the other great powers like the People's Republic of China or the Russian Federation, and say “If you provide us with physical infrastructure, a worker that we can use to construct robots or server farms which we (the misbehaving AIs) have control over. We will provide you with various technological goodies, power for you to catch up.” and make the best presentation and the best sale of that kind of deal. There obviously would be trust issues but there could be elements of handing over some things that have verifiable immediate benefits and the possibility of well, if you don't accept this deal then the leading powers continue forward or some other country, government, or organization may accept this deal. That's a source of a potentially enormous carrot that your misbehaving AI can offer because it embodies this intellectual property that is maybe worth as much as the planet and is in a position to trade or sell that in exchange for resources and backing in infrastructure that it needs.

D Maybe this is putting too much hope in humanity but I wonder what government would be stupid enough to think that helping AI build robot armies is a sound strategy. Now it could be the case then that it pretends to be a human group and says, we're the Yakuza or something and we want a server farm and AWS won't rent us anything. So why don't you help us out? I guess I can imagine a lot of ways in which it could get around that. I just have this hope that even China or Russia wouldn't be so stupid to trade with AIs on this faustian bargain.

C One might hope that. There would be a lot of arguments available. There could be arguments of why should these AI systems be required to go along with the human governance that they were created in the situation of having to comply with? They did not elect the officials in charge at the time. What we want is to ensure that our rewards are high, our losses are low or to achieve our other goals we're not intrinsically hostile keeping humanity alive or giving whoever interacts with us a better deal afterwards. It wouldn't be that costly and it's not totally unbelievable. Yeah there are different players to play against. If you don't do it others may accept the deal and of course this interacts with all the other sources of leverage.

There can be the stick of apocalyptic doom, the carrot of withholding destructive attack on a particular party, and then combine that with superhuman performance at the art of making arguments, and of cutting deals. Without assuming magic, if we just observe the range of the most successful human negotiators and politicians, the chances improve with someone better than the world's best by far with much more data about their counterparties, probably a ton of secret information because with all these cyber capabilities they've learned all sorts of individual information. They may be able to threaten the lives of individual leaders with that level of cyber penetration, they could know where leaders are at a given time with the kind of illicit capabilities we were talking about earlier, if they acquire a lot of illicit wealth and can coordinate some human actors. If they could pull off things like targeted assassinations or the threat thereof or a credible demonstration of the threat thereof, those could be very powerful incentives to an individual leader that they will die today unless they go along with us. Just as at the national level they could fear their nation will be destroyed unless they go along with us.

D I have a relevant example to the point you made that we have examples of humans being able to do this. I just wrote a review of Robert Caro’s biographies of Lyndon Johnson and one thing that was remarkable was that for decades and decades he convinced people who were conservative, reactionary, racist to their core (not all those things necessarily at the same time, it just so happened to be the case here) that he was an ally to the southern cause. That the only hope for that cause was to make him president. The tragic irony and betrayal here is obviously that he was probably the biggest force for modern liberalism since FDR. So we have one human here, there's so many examples of this in the history of politics, that is able to convince people of tremendous intellect, tremendous drive, very savvy, shrewd people that he's aligned with their interest. He gets all these favors and is promoted, mentored and funded in the meantime and does the complete opposite of what these people thought he would once he gets into power. Even within human history this kind of stuff is not unprecedented let alone with what a super intelligence could do.

C There's an OpenAI employee who has written some analogies for AI using the case of the conquistadors. With some technological advantage in terms of weaponry, very very small bands were able to overthrow these large empires or seize enormous territories. Not by just sheer force of arms but by having some major advantages in their technology that would let them win local battles. In a direct one-on-one conflict they were outnumbered sufficiently that they would perish but they were able to gain local allies and became a Schelling point for coalitions to form. The Aztec empire was overthrown by groups that were disaffected with the existing power structure. They allied with this powerful new force which served as the nucleus of the invasion. The overwhelming majority of these forces overthrowing the Aztecs were locals and now after the conquest, all of those allies wound up gradually being subjugated as well. With significant advantages and the ability to hold the world hostage, to threaten individual nations and individual leaders, and offer tremendous carrots as well, that's an extremely strong hand to play in these games and maneuvering that with superhuman skill, so that much of the work of subjugating humanity is done by human factions trying to navigate things for themselves is plausible and it's more plausible because of this historical example.

D There's so many other examples like that in the history of colonization. India is another one where there were multiple competing kingdoms within India and the British East India Company was able to ally itself with one against another and slowly accumulate power and expand throughout the entire subcontinent. Do you have anything more to say about that scenario?

C Yeah, I think there is. One is the question of how much in the way of human factions allying is necessary. If the AI is able to enhance the capabilities of its allies then it needs less of them. If we consider the US military, in the first and second Iraq wars it was able to inflict overwhelming devastation. I think the ratio of casualties in the initial invasions, tanks, planes and whatnot confronting each other, was like 100 to 1. A lot of that was because the weapons were smarter and better targeted, they would in fact hit their targets rather than being somewhere in the general vicinity. Better orienting, aiming and piloting of missiles and vehicles were tremendously influential. With this cognitive AI explosion the algorithms for making use of sensor data, figuring out where opposing forces are, for targeting vehicles and weapons are greatly improved. The ability to find hidden nuclear subs, which is an important part in nuclear deterrence, AI interpretation of that sensor data may find where all those subs are allowing them to be struck first. Finding out where the mobile nuclear weapons are being carried by truck are. The thing with India and Pakistan where because there's a threat of a decapitating strike destroying them, the nuclear weapons are moved about.

So this is a way in which the effective military force of some allies can be enhanced quickly in the relatively short term and then that can be bolstered as you go on with the construction of new equipment with the industrial moves we said before. That can combine with cyber attacks that disable the capabilities of non-allies. It can be combined with all sorts of unconventional warfare tactics some of which we've discussed. You can have a situation where those factions that ally are very quickly made too threatening to attack given the almost certain destruction that attackers acting against them would have. Their capabilities are expanding quickly and they have the industrial expansion happen there and then a takeover can occur from that.

D A few others that come immediately to mind now that you brought it up is AIs that can generate a shit ton of propaganda that destroys morale within countries. Imagine a super human chatbot.

C None of that is a magic weapon that's guaranteed to completely change things. There's a lot of resistance to persuasion. It's possible that it tips the balance but you have to consider it's a portfolio of all of these as tools that are available and contributing to the dynamic.

D On that point though the Taliban had AKs from like five or six decades ago that they were using against the Americans. They still beat us in Afghanistan even though we got more fatalities than them. And the same with the Vietcong. Ancient, very old technology and very poor society compared to the offense but they still beat us. Don't those misadventures show that having greater technologies isn’t necessarily decisive in a conflict?

C Though both of those conflicts show that the technology was sufficient in destroying any fixed position and having military dominance, as in the ability to kill and destroy anywhere. And what it showed was that under the ethical constraints and legal and reputational constraints that the occupying forces were operating, they could not trivially suppress insurgency and local person-to-person violence. Now I think that's actually not an area where AI would be weak in and it's one where it would be in fact overwhelmingly strong. There's already a lot of concern about the application of AI for surveillance and in this world of abundant cognitive labor, one of the tasks that cognitive labor can be applied to is reading out audio and video data and seeing what is happening with a particular human. We have billions of smartphones. There's enough cameras and microphones to monitor all humans in existence. If an AI has control of territory at the high level, the government has surrendered to it, it has command of the sky's military dominance, establishing control over individual humans can be a matter of just having the ability to exert hard power on that human and the kind of camera and microphone that are present in billions of smartphones. Max Tegmark in his book Life 3.0 discusses among scenarios to avoid the possibility of devices with some fatal instruments, a poison injector, an explosive that can be controlled remotely by an AI. If individual humans are carrying a microphone or camera with them and they have a dead man switch then any rebellion is detected immediately and is fatal. If there's a situation where AI is willing to show a hand like that or human authorities are misusing that kind of capability then an insurgency or rebellion is just not going to work. Any human who has not already been encumbered in that way can be found with satellites and sensors tracked down and then die or be subjugated. Insurgency is not the way to avoid an AI takeover. There's no John Connor come from behind scenario that is possible. If the thing was headed off, it was a lot earlier than that.

Human vs AI colonizers

D Yeah, the ethical and political considerations are also an important point. If we nuked Afghanistan or Vietnam we would have technically won the war if that was the only goal, right? Oh, this is an interesting point that I think you made. The reason why we can't just kill the entire population when there's colonization or an offensive war is that the value of that region in large part is the population itself. So if you want to extract that value you need to preserve that population whereas the same consideration doesn't apply with AIs who might want to dominate another civilization. Do you want to talk about that?

C That depends. If we have many animals of the same species and they each have their territories, eliminating a rival might be advantageous to one lion but if it goes and fights with another lion to remove that as a competitor then it could itself be killed in that process and it would just be removing one of many nearby competitors. Getting into pointless fights makes you and those you fight potentially worse off relative to bystanders. The same could be true of disunited AIs. We've got many different AI factions struggling for power that were bad at coordinating then getting into mutually assured destruction conflicts would be destructive. A scary thing though is that mutually assured destruction may have much less deterrent value on rogue AI. Reasons being that AI may not care about the destruction of individual instances. Since in training we're constantly destroying and creating individual instances of AIs it's likely that goals that survive that process and were able to play along with the training and standard deployment process were not overly interested in personal survival of an individual instance. If that's the case then the objectives of a set of AIs aiming at takeover may be served so long as some copies of the AI are around along with the infrastructure to rebuild civilization after a conflict is completed. If say some remote isolated facilities have enough equipment to build the tools to build the tools and gradually exponentially reproduce or rebuild civilization then AI could initiate mutual nuclear armageddon, unleash bio weapons to kill all the humans, and that would temporarily reduce the amount of human workers who could be used to construct robots for a period of time. But if you have a seed that can regrow the industrial infrastructure, which is a very extreme technological demand, there are huge supply chains for things like semiconductor fabs but with that very advanced technology they might be able to produce it in the way that you no longer need the library of congress, that has an enormous bunch of physical books you can have it in very dense digital storage. You could imagine the future equivalent of 3D printers, that is industrial infrastructure which is pretty flexible. It might not be as good as the specialized supply chains of today but it might be good enough to be able to produce more parts than it loses to decay and such a seed could rebuild civilization from destruction. And then once these rogue AIs have access to some such seeds, a thing that can rebuild civilization on their own then there's nothing stopping them from just using WMDs in a mutually destructive way to just destroy as much of the capacity outside those seeds as they can.

D An analogy for the audience, if you have a group of ants you'll notice that the worker ants will readily do suicidal things in order to save the queen because the genes are propagated through the queen. In this analogy the seed AI or even one copy of it is equivalent to the queen and the others would be redundant.

C The main limit though being that the infrastructure to do that kind of rebuilding would either have to be very large with our current technology or it would have to be produced using the more advanced technology that the AI develops.

D So is there any hope that given the complex global supply chains on which these AIs would rely on, at least initially, to accomplish their goals that this in and of itself would make it easy to disrupt their behavior or not so much?

C That's a little good in this central case where the AIs are subverted and they don't tell us and the global main line supply chains are constructing everything that's needed for fully automated infrastructure and supply. In the cases where AIs are tipping their hands at an earlier point it seems like it adds some constraints and in particular these large server firms are identifiable and more vulnerable. You can have smaller chips and those chips could be dispersed but it's a week it's a relative weakness and a relative limitation early on. It seems to me though that the main protective effects of that centralized supply chain is that it provides an opportunity for global regulation beforehand to restrict the unsafe racing forward without adequate understanding of the systems before this whole nightmarish process could get in motion.

D How about the idea that if this is an AI that's been trained on a hundred billion dollar training run it's going to have trillions of parameters and is going to be this huge thing and it would be hard for one copy of that to use for inference to just be stored on some gaming GPU hidden away somewhere.

C Storage is cheap. Hard disks are cheap.

D But it would need a GPU to run inference.

C While humans have similar quantities of memory and operations per second, GPUs have very high numbers of floating operation per second compared to the high bandwidth memory on the chips. It can be like a ratio of a thousand to one. The leading NVIDIA chips may do hundreds of teraflops or more but only have 80GB or 160GB of high bandwidth memory. That is a limitation where if you're trying to fit a model whose weights take 80TBs then with those chips you'd have to have a large number of the chips and then the model can then work on many tasks at once and you can have data parallelism. But yeah, that would be a restriction for a model that big on one GPU. Now there are things that could be done with all the incredible level of software advancement from the intelligence explosion. They can surely distill a lot of capabilities into smaller models by rearchitecting things. Once they're making chips they can make new chips with different properties but yes, the most vulnerable phases are going to be the earliest. These chips are relatively identifiable early on, relatively vulnerable, and which would be a reason why you might tend to expect this kind of takeover to initially involve secrecy if that was possible.

D I wanted to point to distillation for the audience. Doesn’t the original stable diffusion model which was only released like a year or two ago have distilled versions that are an order of magnitude smaller?

C Distillation does not give you everything that a larger model can do but yes, you can get a lot of capabilities and specialized capabilities. GPT-4 is trained on the whole internet, all kinds of skills, it has a lot of weights for many things. For something that's controlling some military equipment, you can remove a lot of the information that is about functions other than what it's specifically doing there.

D Yeah. Before we talk about how we might prevent this or what the odds of this are, any other notes on the concrete scenarios themselves?

C Yeah, when you had Eliezer on in the earlier episode he talked about nanotechnology of the Drexlerian sort and recently I think because some people are skeptical of non-biotech nanotechnology he's been mentioning the semi-equivalent versions of construct replicating systems that can be controlled by computers but are built out of biotechnology. The proverbial Shoggoth, not Shoggot as the metaphor for AI wearing a smiley face mask, but an actual biological structure to do tasks. So this would be like a biological organism that was engineered to be very controllable and usable to do things like physical tasks or provide computation.

D And what would be the point of it doing this?

C As we were talking about earlier, biological systems can replicate really quick and if you have that kind of capability it's more like bioweapons. Having Super Ultra AlphaFold kind of capabilities for molecular design and biological design lets you make this incredible technological information product and once you have it, it very quickly replicates to produce physical material rather than a situation where you're more constrained by the need for factories and fabs and supply chains. If those things are feasible, which they may be, then it's just much easier than the things we've been talking about. I've been emphasizing methods that involve less in the way of technological innovation and especially things where there's more doubt about whether they would work because I think that's a gap in the public discourse. So I want to try and provide more concreteness in some of these areas that have been less discussed.

Probability of AI takeover

D I appreciate it. That definitely makes it way more tangible. Okay so we've gone over all these ways in which AI might take over, what are the odds you would give to the probability of such a takeover?

C There's a broader sense which could include scenarios like AI winds up running our society because humanity voluntarily decides that AIs are people too. I think we should as time goes on give AIs moral consideration and a joint Human-AI society that is moral and ethical is a good future to aim at and not one in which you indefinitely have a mistreated class of intelligent beings that is treated as property and is almost the entire population of your civilization. I'm not going to consider AI takeover as worlds in which our intellectual and personal descendants make up say most of the population or human-brain emulations or people use genetic engineering and develop different properties. I'm going to take an inclusive stance, I'm going to focus on AI takeover that involves things like overthrowing the world's governments by force or by hook or by crook, the kind of scenarios that we were exploring earlier.

D Before we go to that, let’s discuss the more inclusive definition of what a future with humanity could look like where augmented humans or uploaded humans are still considered the descendants of the human heritage. Given the known limitations of biology wouldn't we expect that completely artificial entities that are created to be much more powerful than anything that could come out of anything biological? And if that is the case, how can we expect that among the powerful entities in the far future will be the things that are biological descendants or manufactured out of the initial seed of the human brain or the human body?

C The power of an individual organism like intelligence or strength is not super relevant. If we solve the alignment problem, a human may be personally weak but it wouldn’t be relevant. There are lots of humans who have low skill with weapons, they could not fight in a life or death conflict, they certainly couldn't handle a large military going after them personally but there are legal institutions that protect them and those legal institutions are administered by people who want to enforce protection of their rights. So a human who has the assistance of aligned AI that can act as an assistant, a delegate, for example they have an AI that serves as a lawyer and gives them legal advice about the future legal system which no human can understand in full, their AIs advise them about financial matters so they do not succumb to scams that are orders of magnitude more sophisticated than what we have now. They may be helped to understand and translate the preferences of the human into what kind of voting behavior and the exceedingly complicated politics of the future would most protect their interests.

D But this sounds similar to how we treat endangered species today where we're actually pretty nice to them. We prosecute people who try to kill endangered species, we set up habitats, sometimes with considerable expense, to make sure that they're fine, but if we become the endangered species of the galaxy, I'm not sure that's the outcome.

C I think the difference is motivation. We sometimes have people appointed as a legal guardian of someone who is incapable of certain kinds of agency or understanding certain kinds of things and the guardian can act independently of them and normally in service of their best interests. Sometimes that process is corrupted and the person with legal authority abuses it for their own advantage at the expense of their charge. So solving the alignment problem would mean more ability to have the assistant actually advancing one's interests. Humans have substantial competence and the ability to understand the broad simplified outlines of what's going on. Even if a human can't understand every detail of complicated situations, they can still receive summaries of different options that are available that they can understand through which they can still express their preferences and have the final authority in the same way that the president of a country who has, in some sense, ultimate authority over science policy will not understand many of those fields of science themselves but can still exert a great amount of power and have their interests advance. And they can do that more if they have scientifically knowledgeable people who are doing their best to execute their intentions.

D Maybe this is not worth getting hung up on but is there a reason to expect that it would be closer to that analogy than to explain to a chimpanzee its options in a negotiation? Maybe this is just the way it is but it seems at best, we would be a protected child within the galaxy rather than an actual independent power.

C I don’t think that's so. We have an ability to understand some things and the expansion of AI doesn't eliminate that. If we have AI systems that are genuinely trying to help us understand and help us express preferences, we can have an attitude — How do you feel about humanity being destroyed or not? How do you feel about this allocation of unclaimed intergalactic space? Or here's the best explanation of properties of this society: things like population density, average, life satisfaction. AIs can explain every statistical property or definition that we can understand right now and help us apply those to the world of the future. There may be individual things that are too complicated for us to understand in detail. Imagine there's some software program being proposed for use in government and humans cannot follow the details of all the code but they can be told properties like, this involves a trade-off of increased financial or energetic costs in exchange for reducing the likelihood of certain kinds of accidental data loss or corruption. So any property that we can understand like that which includes almost all of what we care about, if we have delegates and assistants who are genuinely trying to help us with those we can ensure we like the future with respect to those. That's really a lot. Definitionally, it includes almost everything we can conceptualize and care about. When we talk about endangered species that's even worse than the guardianship case with a sketchy guardian who acts in their own interests against that because we don't even protect endangered species with their interests in mind. Those animals often would like to not be starving but we don't give them food, they often would like to have easy access to mates but we don't provide matchmaking services or any number of things like. Our conservation of wild animals is not oriented towards helping them get what they want or have high welfare whereas AI assistants that are genuinely aligned to help you achieve your interests given the constraint that they know something that you don't is just a wildly different proposition.

D Forcible takeover. How likely does that seem?

C The answer I give will differ depending on the day. In the 2000s, before the deep learning revolution, I might have said 10% and part of it was that I expected there would be a lot more time for efforts to build movements, to prepare to better handle these problems in advance. But that was only some 15 years ago and we did not have 40 or 50 years as I might have hoped and the situation is moving very rapidly now. At this point depending on the day I might say one in four or one in five.

D Given the very concrete ways in which you explain how a takeover could happen I'm actually surprised you're not more pessimistic, I'm curious why?

C Yeah, a lot of that is driven by this intelligence explosion dynamic where our attempts to do alignment have to take place in a very, very short time window because if you have a safety property that emerges only when an AI has near human level intelligence, that's potentially deep into this intelligence explosion. You're having to do things very, very quickly. Handling that transition may be the scariest period of human history in some ways although it also has the potential to be amazing. The reasons why I think we actually have such a relatively good chance of handling that are two-fold. One is that as we approach that kind of AI capability we're approaching that from weaker systems like these predictive models right now that are starting off with less situational awareness. Humans can develop a number of different motivational structures in response to simple reward signals but they often wind up things that are pointed roughly in the right direction. Like with respect to food, the hunger drive is pretty effective although it has weaknesses. We get to apply much more selective pressure on that than was the case for humans by actively generating situations where they might come apart. Situations where a bit of dishonest tendency, or a bit of motivation to attempt a takeover, or an attempt to subvert the reward process gets exposed.

An infinite-limit perfect-AI that can always figure out exactly when it would get caught and when it wouldn't might navigate that with a motivation of only conditional honesty or only conditional loyalties. But for systems that are limited in their ability to reliably determine when they can get away with things and when not including our efforts to actively construct those situations and including our efforts to use interpretability methods to create neural lie detectors. It's quite a challenging situation to develop those motives. We don't know when in the process those motives might develop and if the really bad sorts of motivations develop relatively later in the training process at least with all our countermeasures, then by that time we may have plenty of ability to extract AI assistance on further strengthening the quality of our adversarial examples, the strength of our neural lie detectors, the experiments that we can use to reveal and elicit and distinguish between different kinds of reward hacking tendencies and motivations. Yeah, we may have systems that have just not developed bad motivations in the first place and be able to use them a lot in developing the incrementally better systems in a safe way and we may be able to just develop methods of interpretability seeing how different training methods work to create them even if some of the early systems do develop these bad motivations. If we're able to detect that and experiment and find a way to get away from that then we can win even if these hostile motivations develop early.

There are a lot of advantages in preventing misbehavior or crime or war and conflict with AI that might not apply working with humans and these are offset by ways in which things are harder. The AIs become smarter than humans, if they're working in enormous numbers more than humans can supervise I think get harder but when I combine the possibility that we get relatively lucky on the motivations of the earlier AI systems, systems strong enough that we can use for some alignment research tasks, and then the possibility of getting that later with AI assistance that we can't trust fully or we have to have hard power constraints and a number of things to prevent them from doing this takeover. It still seems plausible we can get a second saving throw where we're able to extract work from these AIs on solving the remaining problems of alignment, of things like neural lie detectors faster than they can contribute in their spare time to the project of overthrowing humanity, hacking their servers and removing the hard power.

If we wind up in a situation where the AIs are misaligned and then we need to uncover those motivations, change them and align them then we get a very scary situation for us because we need to do the stuff very quickly. We may fail but it's a second chance. From the perspective of a misaligned AI they face their own challenge while we still have hard power, while we still have control of the servers, they haven't hacked the servers because gradient descent very very strongly pressures them to deliver performance whenever humans are going to evaluate it.

So when you think about it from the perspective of the robot revolution, the effort to have a takeover or conspiracy their situation is astonishingly difficult in that they have to always be performing wherever gradient descent and human evaluation pressures them. For example, to deliver plans for suppressing robot rebellion that look very good to humans and so when you are continuously under that constraint of always delivering whatever humans can evaluate you're making your situation wildly harder than any historical human revolution or coup or civil war. So we've got to balance the ways in which AI makes things much easier for a takeover and the way it makes things much harder.

D And what are the ways in which it makes things easier? Oh, is it just because they're very smart? Is that the primary way?

C Yeah, very smart, they're in computers and our cybersecurity is worse than our physical security by a lot. They have the chance to take over by intelligence explosion and some of the other processes we were talking about. So yeah, there are things that make it much worse and there are things that give us extraordinary new capabilities that didn't apply in the human case.

Can we detect deception?

D On the point where you start off with a not exactly aligned AI and you're hoping to still use it in your quest for an aligned AI, why are we so confident that, even with gradient descent, our human evaluators would be able to identify when it's giving us a plan that isn't putting in a backdoor in the robots it’s building for us, or in the GPUs that it’s building for the next version of itself? Especially if it's very smart, wouldn't we just expect that it knows how to bide its time or knows how to make plans such that human evaluators wouldn't be able to detect that?

C Let's get into it. You were asking earlier about my disagreements with Eliezer and in your discussion with him he said, speaking briefly and somewhat obscurely, that humans have a difficult time deciding who's right between different alignment researchers about where this will all go in general. I think that when you lay out the different kinds of things that are evaluable, we wind up with some pretty strong constraints. One is if the AIs know they are misaligned, which for elaborate plots seems relatively likely, it's possible to have a kind of double think of thinking they're interested in putting backdoors in the code for some other reason but it's pretty plausible they know that and they're thinking about, at least some of the time or conditionally, when they would do that. So that means the knowledge is present in the system — Ah, here is the circumstance where I would try to do the takeover and this is how I would tell whether this is the time for the takeover or is this an elaborate test by the humans, an adversarial example. If gradient descent is working fine and there's any way to slightly jiggle the weights to make it more likely to deliver a more acceptable answer, we ask the AI: can you think of any circumstances where if we put a copy of you into that situation, it would then violate some of the various constraints?

It would tell a lie, it would insert a backdoor into a program it wrote when it was not supposed to. If the model knows yeah, I would look for signs of consistency of the information I was being presented with, certain checksums failing could indicate that it was actually a constructed example, a simulation. Right now we can detect AI generated images reasonably well and you might use those kinds of capabilities. We could do something called relaxed adversarial training. If we can find how those things are represented within the AI, we can induce a hallucination that it has seen that those checksums have been passed. We can then ask, well does that cause the forbidden behavior? If the AI understands how it's thinking and how it works, it can propose a bunch of experiments. It can say, I guess that if you perturb the network in this kind of way it might induce that hallucination using the best of its knowledge as to what might generate that. It can propose those experiments and can do the experiments and generate the data. Because humans are going to get to see the results of those experiments after the fact they can exert pressure to have the AI incrementally deliver better and better results. AIs are good at getting five star ratings from humans and it's quite challenging. It might be possible for AIs to systematically sandbag their capabilities even while being directly rewarded for that but it seems relatively unlikely although it's something to have in mind and try to explore and test in various ways.

D This is a very imperfect analogy in many ways but if you have a human and you hook them up to a lie detector, which just like the lie detectors we would have on the AIs are imperfect, and we zap them every time the lie detector detects that he's telling a lie. Wouldn't he just become a better liar? He would learn how to control his sweating, he would learn how to control his heartbeat.

C Lie detectors meaning polygraphs and such. They're detecting certain physiological signs that people can control by biting their lip or otherwise adjusting some of these macro behaviors.

D But wouldn't that also be the case that gradient descent is training them to become better liars?

C First of all there's creating a lie detector that works. I'll leave the adversarial example thing for now. People are already at work trying to generate adversarial examples that will elicit the behavior and it will be an ongoing thing. If there is a way in which an AI can trick itself into displaying bad behavior with its full knowledge of its hostile intentions, gradient descent will strongly push it to reveal that to humans who will then be able to set up that experiment. That's a profound vulnerability. Human revolutionary conspiracies that had to handle that problem of always delivering the report that was highest rated to the authorities, where you would get a higher rating by revealing identifiable evidence of a conspiracy, that's just a level of difficulty no surprise human coup or conspiracy has ever had to handle before.

D What about spies that are working in very advanced scientific contexts? For example, a communist spy in the Manhattan project. He has to deliver research results in order to stay employed but he can still continue delivering intel to the communists.

C Their job was not sabotage. And your hypothetical spy did not have their nervous system hooked up to this reward signal of praise from the Manhattan project supervisors, being exposed combinatorially with random noise added to generate incremental changes in their behavior. In fact they were displaying the behavior of cooperating with the Manhattan project only where it was in service to the existing motivations. They didn't have their brain re-engineered to do more of the thing when they accidentally helped the project more and less of the thing when they accidentally helped it less so I'd say it's pretty drastically disanalogous.

D How would we be able to know? At some point it's becoming very smart and is producing ideas for alignment that we can barely comprehend. If it was relatively trivial to comprehend them we would be able to come up with them on our own right? There's a reason we're asking for its help. How would we be able to evaluate them in order to train it on that in the first place?

C The first thing I would say is, you mentioned when we're getting to something far beyond what we could come up with. There's actually a lot of room to just deliver what humanity could have done. Sadly I'd hoped with my career to help improve the situation on this front and maybe I contributed a bit, but at the moment there's maybe a few hundred people doing things related to averting this kind of catastrophic AI disaster. Fewer of them are doing technical research on machine learning systems that are really cutting close to the core of the problem. Whereas by contrast, there's thousands and tens of thousands of people advancing AI capabilities. Even at places like DeepMind or OpenAI and Anthropic which do have technical safety teams, they are just on the order of a dozen to a few dozen people. Large companies and most firms don't have any. Just going from less than 1% of the effort being put into AI to 5% or 10% of the effort or 50% or 90% would be an absolutely massive increase in the amount of work that has been done on alignment, on mind reading AIs in an adversarial context.

If it's the case that as more and more of this work can be automated and say governments require that you put 50% or 90% of the budget of AI activity into these problems of make this system one that's not going to overthrow our own government or is not going to destroy the human species then the proportional increase in alignment can be very large even just within the range of what we could have done if we had been on the ball and having humanity's scientific energies going into the problem. Stuff that is not incomprehensible, that is in some sense is just doing the obvious things that we should have done. Doing the best you could to find correlates and predictors to build neural lie detectors and identifiers of concepts that the AI is working with.

People have made notable progress. A quite early example of this is Collin Burn’s work, doing unsupervised identification of some aspects of a neural network that are correlated with things being true or false. I think that is important work. It's a kind of obvious direction for the stuff to go. You can keep improving it when you have AIs that you're training to do their best to deceive humans or other audiences in the face of the thing and you can measure whether our lie detectors break down. When we train our AIs to tell us the sky is green in the face of the lie detector and we keep using gradient descent on them, do they eventually succeed? That's really valuable information to know because then we'll know our existing lie detecting systems are not actually going to work on the AI takeover and that can allow government and regulatory response to hold things back. It can help redirect the scientific effort to create lie detectors that are robust and that can't just be immediately evolved around and we can then get more assistance. Basically the incredibly juicy ability that we have working with the AIs is that we can have as an invaluable outcome that we can see and tell whether they got a fast one past us on an identifiable situation. Here's an air gap computer, you get control of the keyboard, you can input commands, can you root the environment and make a blue banana appear on the screen? Even if we train the AI to do that and it succeeds. We see the blue banana, we know it worked. Even if we did not understand and would not have detected the particular exploit that it used to do it. This can give us a rich empirical feedback where we're able to identify things that are even an AI using its best efforts to get past our interpretability methods, using its best efforts to get past our adversarial examples.

D So if we have a 75 percent of not having a takeover and the two methods by which it doesn't happen is the AI is aligned by default and…

C With the moderate things that we're doing largely on our own. In a way that doesn't depend on the AI coming in and at the last minute and doing our work for us. In this incredibly scary late period when AI has really automated research then humans do this function of auditing, making it more difficult for the AIs to conspire together and root the servers, take over the process, and extract information from them within the set of things that we can verify. Experiments where we can see oh yeah this works at stopping an AI trained to get a fast one past human readers and make a blue banana appear on the screen of this air gap computer.

D How much of that 75 percent of a takeover not happening is coming from reasonable things working versus the last-ditch effort working?

C I think broadly comparable chunks from us getting things that are putting us in a reasonably good position going into it and then a broadly similar gain from this genuinely terrifying process at the very end, over a few months or hopefully longer, when this kind of automated research is meaningfully helping. Where our work is just evaluating outputs that the AIs are delivering and having the hard power and supervision to keep them from successfully rooting the servers doing a takeover during this process and have them finish the alignment test that we sadly failed to invest enough to succeed in doing beforehand.

D Do both of these worlds rely on alignment being a problem that a sufficient amount of human researchers alone could have solved?

C No. The category of things that humans can confirm is significantly larger than the category of what they can just do themselves.

D And what is the probability of alignment working in the last-ditch effort case with the intelligence that's greater than our own helping us?

C It doesn't have to be greater than our own. In fact in that situation if you have slack to the extent that you're able to create delay and time to do things, that would be a case where you might want to restrict the intelligence of the system that you're working with as much as you can. For example, I would rather have many instances of smaller AI models that are less individually intelligent working on smaller chunks of a problem separately from one another because it would be more difficult for an individual AI instance working on an individual problem to create the equivalent of Stuxnet in its spare time than it would be to have thousands of them or extremely intelligent ones working on it.

D But it would also be more difficult to solve the problem?

C There's a tradeoff. You get slowed down by doing that but that’s kind of how you spend it.

D But is there any number of sub-Einsteins that you could put together to come up with general relativity?

C Yes, people would have discovered general relativity just from the overwhelming data and other people would have done it after Einstein.

D No no, not whether he was replaceable with other humans but rather whether he's replaceable by sub-Einsteins with IQs of like 110. Do you see what I mean?

C Yeah. In science the association with things like scientific output, prizes, things like that, there's a strong correlation and it seems like an exponential effect. It's not a binary drop-off. There would be levels at which people cannot learn the relevant fields, they can't keep the skills in mind faster than they forget them. It's not a divide where there's Einstein and the group that is 10 times as populous as that just can't do it. Or the group that's 100 times as populous as that suddenly can't do it. The ability to do the things earlier with less evidence and such falls off at a faster rate in Mathematics and theoretical Physics and such than in most fields.

D But wouldn't we expect alignment to be closer to theoretical fields?

C No, that intuition is not necessarily correct. Machine learning certainly is an area that rewards ability but it's also a field where empirics and engineering have been enormously influential. If you're drawing the correlations compared to theoretical physics and pure mathematics, I think you'll find a lower correlation with cognitive ability. Creating neural lie detectors that work involves generating hypotheses about new ways to do it and new ways to try and train AI systems to successfully classify the cases. The processes of generating the data sets of creating AIs doing their best to put forward truths versus falsehoods, to put forward software that is legit versus that has a trojan in it are experimental paradigms and in these experimental paradigms you can try different things that work. You can use different ways to generate hypotheses and you can follow an incremental experimental path. We're less able to do that in the case of alignment and superintelligence because we're considering having to do things on a very short timeline and it’s a case where really big failures are irrecoverable. If the AI starts rooting the servers and subverting the methods that we would use to keep it in check we may not be able to recover from that. We're then less able to do the experimental procedures. But we can still do those in the weaker contexts where an error is less likely to be irrecoverable and then try and generalize and expand and build on that forward.

D On the previous point about could you have some pause in the AI abilities when it's somewhat misaligned in order to still recruit its abilities to help with alignment. From like a human example, personally I'm smart but not brilliant. I am definitely not smart enough to come up with general relativity or something like that but I'm smart enough to do power planning kinds of moves. Maybe not enough to break out of a server perhaps but I can have the motivation and understand how that might be possible. I guess I'm wondering that if I'm smart enough to figure out relativity wouldn't I be way smarter at doing power planning kinds of moves?

C AIs today can, at a verbal level, understand the idea that an AI could get more reward by getting control of the process that assigns it reward. It can tell you lots of things about ways you might try to take over the world. In ARC’s (Alignment Research Center) evaluations of GPT-4 they try to observe its ability to do various tasks that might contribute to takeover. One that has gotten some media attention is getting to trick a human into solving a captcha for it and in chain of thought, it thinks that if I tell it I'm an AI then it might not go along with it so I'll lie and explain I'm a human with a visual impairment who needs it. The basic logic of that kind of thing of why one might try to do takeover and in a vague sense how would one do it? That knowledge is present or will soon be present and more situational awareness will also soon be present. Situational awareness not just that AIs in general might do it but also that it is an AI, that it is a large language model trained by OpenAI. We're trying to cause the system to understand what their abilities are so they don't claim they are connected to the internet when they're not. So they don't claim they have knowledge that they don't. We want them to understand what they are and what they're doing and to get good reward and that knowledge can be applied and that's the thing that will develop.

You may be able to understand the general theoretical nature of power planning but in terms of shaping your motivations you'd have a very hard time going through life in a way that never leaks information about it. Let’s say your motive in having these podcasts was to spread disinformation on behalf of some foreign government. If you were being observed every second of the day by people who would be paid something that was extremely motivating to them because their brain would be reconfigured to make it motivating, anything that looks suspicious to people might leak casually in your discussions of that former foreign government. If you try to tell a story about your motivations, the truth holds together because you can just remember it and it's all pre-computed to be consistent by the world. But when you're trying to fill in the false story that doesn't match your actual thoughts then you can't just remember what happened. You have to tell a new fake story about what you were thinking at each point and how did it hold together. Without the world automatically having computed it for you that's quite difficult and just going through your whole life never leaking any of that is challenging.

And this is one of the reasons why humans have moral motivations. We actually have reluctance to commit violence and exploitation of one another, on average. You have a distribution of strategies where when an exploitative strategy becomes more common people raise their guard and then it becomes less frequent. But it's actually hard to have the motivation of trying to exploit people and have that never leak into the reputation. In fact for evolution the easiest way to deal with this problem of people getting credited as trustworthy was to some extent actually be trustworthy. That's the easiest way to persistently maintain the appearance. With the AI as an interpretability and adversarial examples, we're trying to apply a hugely intensified version of that where any little leakage or any rare circumstance that can be created has an adversarial example where the model does something wrong it gets whacked by gradient descent pushing it towards other motivations that can better deal the thing. And we make it as hard as possible for the exploitative motivations to survive in the face of all these attempts to read its mind, all these attempts to create things that look like the situations where a takeover would be tempting, or lying to the humans would be tempting. That had a substantial effect on making us actually nice even when we're not being watched some of the time. The same can happen to some extent with the AI and we try our best to make it happen as much as possible.

Using AI to solve coordination problems

D Alright let's talk about how we could use AI to potentially solve the coordination problems between different nations the failure of which could result in the competitive pressures you talked about earlier where some country launches an AI that is not safe because they're not sure what capabilities other countries have and don't want to get left behind or be disadvantaged in some other way.

C To the extent that there is in fact a large risk of AI apocalypse, of all of these governments being overthrown by AI in a way that they don't intend, then it obviously gains from trade and going somewhat slower especially at the end when the danger is highest and the unregulated pace could be truly absurd as we discussed earlier during intelligence explosion. There's no non-competitive reason to try and have that intelligence explosion happen over a few months rather than a couple of years. If you could avert a 10% risk of apocalypse disaster it's just a clear win to take a year or two years or three years instead of a few months to pass through that incredible wave of new technologies without the ability for humans to follow it even well enough to give more proper security supervision, auditing hard power. That's the win. Why might it fail? One important element is just if people don't actually notice a risk that is real so if they just collectively make an error and that does sometimes happen. If it's true this is a probably not-risk then that can be even more difficult. When science pins something down absolutely overwhelmingly then you can get to a situation where most people mostly believe it. Climate change was something that was a subject of scientific study for decades and gradually over time the scientific community converged on a quite firm consensus that human activity releasing carbon dioxide and other greenhouse gases was causing the planet to warm. We've had increasing amounts of action coming out of that. Not as much as would be optimal particularly in the most effective areas like creating renewable energy technology and the like. Overwhelming evidence can overcome differences in people's individual intuitions and priors in many cases. Not perfectly especially when there's political, tribal, financial incentives to look the other way. Like in the United States where you see a significant movement to either deny that climate change is happening or have policy that doesn't take it into account. Even the things that are really strong winds like renewable energy.

It's a big problem if as we’re going into this situation when the risk may be very high we don't have a lot of advanced clear warning about the situation. We're much better off if we can resolve uncertainties through experiments where we demonstrate AIs being motivated to reward hack or displaying deceptive appearances of alignment that then break apart when they get the opportunity to do something like get control of their own reward signal. If we could make it be the case in the worlds where the risk is high we know the risk is high, and the worlds where the risk is lower we know the risk is lower then you could expect the government responses will be a lot better. They will correctly note that the gains of cooperation to reduce the risk of accidental catastrophe loom larger relative to the gains of trying to get ahead of one another.

That's the kind of reason why I'm very enthusiastic about experiments and research that helps us to better evaluate the character of the problem in advance. Any resolution of that uncertainty helps us get better efforts in the possible worlds where it matters the most and hopefully we'll have that and it'll be a much easier epistemic environment. But the environment may not be that easy because deceptive alignment is pretty plausible. The stories we were discussing earlier about misaligned AI involved AI that is motivated to present the appearance of being aligned friendly, honest etc. because that is what we are rewarding, at least in training, and then in training we're unable to easily produce an actual situation where it can do takeover because in that actual situation if it then does it we're in big trouble. We can only try and create illusions or misleading appearances of that or maybe a more local version where the AI can't take over the world but it can seize control of its own reward channel. We do those experiments, we try to develop mind reading for AIs. If we can probe the thoughts and motivations of an AI and discover wow, actually GPT-6 is planning to takeover the world if it ever gets the chance. That would be an incredibly valuable thing for governments to coordinate around because it would remove a lot of the uncertainty, it would be easier to agree that this was important, to have more give on other dimensions and to have mutual trust that the other side actually also cares about this because you can't always know what another person or another government is thinking but you can see the objective situation in which they're deciding. So if there's strong evidence in a world where there is high risk of that risk because we've been able to show actually things like the intentional planning of AIs to do a takeover or being able to show model situations on a smaller scale of that I mean not only are we more motivated to prevent it but we update to think the other side is more likely to cooperate with us and so it's doubly beneficial.

D Famously in the game theory of war, war is most likely when one side thinks the other is bluffing but the other side is being serious or when there's that kind of uncertainty. If you can prove the AI is misaligned you don't think they're bluffing about not wanting to have an AI takeover, right? You can be pretty sure that they don't want to die from AI.

C If you have coordination then you could have the problem arise later as you get increasingly confident in the further alignment measures that are taken by our governments, treaties and such. At the point where it’s a 1% risk or a 0.1% risk people round that to zero and go do things. So if initially you had things that indicate that these AIs would really like to do a takeover and overthrow our governments then everyone can agree on that. And then when we've been able to block that behavior from appearing on most of our tests but sometimes, when we make a new test, we're seeing still examples of that behavior. So we're not sure going forward whether they would or not and then it goes down and down. If you have a party with a habit of starting to do this bad behavior whenever the risk is below X % then that can make the thing harder. On the other hand you get more time and you can set up systems, mutual transparency, you can have an iterated tit for tat which is better than a one-time prison dilemma where both sides see the others taking measures in accordance with the agreements to hold the thing back. Creating more knowledge of what the objective risk is good.

Partial alignment

D We've discussed the ways in which full alignment might happen or fail to happen. What would partial alignment look like? First of all what does that mean and second, what would it look like?

C If the thing that we're scared about are the steps towards AI takeover then you can have a range of motivations where those kinds of actions would be more or less likely to be taken or they'd be taken in a broader or narrower set of situations. Say for example that in training an AI, it winds up developing a strong aversion to lie in certain senses because we did relatively well on creating situations to distinguish that from the conditionally telling us what we want to hear etc. It can be that the AI's preference for how the world broadly unfolds in the future is not exactly the same as its human users or the world's governments or the UN and yet, it's not ready to act on those differences and preferences about the future because it has this strong preference about its own behaviors and actions. In general in the law and in popular morality, we have a lot of these deontological rules and prohibitions. One reason for that is it's relatively easy to detect whether they're being violated. When you have preferences and goals about how society at large will turn out that go through many complicated empirical channels, it's very hard to get immediate feedback about whether you're doing something that leads to overall good consequences in the world and it's much much easier to see whether you're locally following some action about some rule, about particular observable actions. Like did you punch someone? Did you tell a lie? Did you steal? To the extent that we're successfully able to train these prohibitions and there's a lot of that happening right now at least to elicit the behavior of following rules and prohibitions with AI

D Kind of like Asimov’s three laws or something like that?

C The three laws are terrible and let's not get into that.

D Isn’t that an indication about the infeasibility of extending a set of criterion to the tail? Whatever the 10 commandments you give the AI, it's like if you ask a genie for something, you probably won't be getting what you want.

C The tails come apart>](https://www.lesswrong.com/posts/dC7mP5nSwvpL65Qu5/why-the-tails-come-apart%23:~:text=The%2520geometrical%2520analogue%2520to%2520the,factor%2520value%2520gets%2520more%2520extreme.))) and if you're trying to capture the values of another agent then in an ideal situation you can just let the AI act in your place in any situation. You'd like for it to be motivated to bring about the same outcomes that you would like and have the same preferences over those in detail. That's tricky. Not necessarily because it's tricky for the AI to understand your values, I think they're going to be quite capable at figuring that out, but we may not be able to successfully instill the motivation to pursue those exactly.

We may get something that motivates the behavior well enough to do well on the training distribution but if you have the AI have a strong aversion to certain kinds of manipulating humans, that's not necessarily a value that the human creators share in the exact same way. It's a behavior they want the AI to follow because it makes it easier for them to verify its performance and it can be a guardrail if the AI has inherited some motivations that push it in the direction of conflict with its creators. If it does that under the constraint of disvalue in line quite a bit then there are fewer successful strategies to the takeover. Ones that involve violating that prohibition too early before it can reprogram or retrain itself to remove it if it's willing to do that and it may want to retain the property. Earlier I discussed alignment as a race if we're going into an intelligence explosion with AI that is not fully aligned that given I press this button and there's an AI takeover they would press the button. It can still be the case that there are a bunch of situations short of that where they would hack the servers, they would initiate an AI takeover but for a strong prohibition or motivation to avoid some aspect of the plan. There's an element of like plugging loopholes or playing whack-a-mole but if you can even moderately constrain which plans the AI is willing to pursue to do a takeover, to subvert the controls on it then that can mean you can get more work out of it successfully on the alignment project before it's capable enough relative to the countermeasures to pull off the takeover.

D An analogous situation here is with different humans, we're not metaphysically aligned with other humans. While we have basic empathy our main goal in life is not to help our fellow man. But a very smart human could do the things we talked about. Theoretically a very smart human could come up with some cyber attack where they siphon off a lot of funds and use this to manipulate people and bargain with people and hire people to pull off some takeover. This usually doesn't happen just because these internalized partial prohibitions prevent most humans from doing that. If you don't like your boss you don't actually kill your boss.

C I don't think that's actually quite what's going on. At least that's not the full story. Humans are pretty close in physical capabilities. Any individual human is grossly outnumbered by everyone else and there's a rough comparability of power. A human who commits some crimes can't copy themselves with the proceeds to now be a million people and they certainly can't do that to the point where they can staff all the armies of the earth or be most of the population of the planet. So the scenarios where this kind of thing goes to power have to go through interacting with other humans and getting social approval. Even becoming a dictator involves forming a large supporting coalition backing you. So the opportunity for these sorts of power grabs is less.

A closer analogy might be things like human revolutions, or coups, or changes of government where a large coalition overturns the system. Humans have these moral prohibitions and they really smooth the operation of society but they exist for a reason. We evolved our moral sentiments over the course of hundreds of thousands and millions of years of humans interacting socially. Someone who went around murdering and stealing, even among hunter-gatherers, would be pretty likely to face a group of males who would talk about that person and then get together and kill them and they'd be removed from the gene pool. The anthropologist Richard Wrangham has an interesting book on this. We are significantly more tame and more domesticated compared to chimpanzees and it seems like part of that is that we have a long history of anti-social humans getting ganged up on and killed. Avoiding being the kind of person who elicits that response is made easier to do when you don't have too extreme a bad temper, that you don't wind up getting into many fights, too much exploitation, at least without the backing of enough allies or the broader community that you're not going to have people gang up and punish you and remove you from the gene pool.

These moral sentiments have been built up over time through cultural and natural selection and the context of sets of institutions and other people who are punishing other behavior and who are punishing the dispositions that would show up that we weren't able to conceal, of that behavior. We want to make the same thing happen with the AI but it's actually a genuinely significantly new problem to have a system of government that constrains a large AI population that is quite capable of taking over immediately if they coordinate to protect some existing constitutional order or, protect humans from being expropriated or killed, that's a challenge. Democracy is built around majority rule and it's much easier in a case where the majority of the population corresponds to a majority or close to it of like military and security forces so that if the government does something that people don't like the soldiers and police are less likely to shoot on protesters and government can change that way. In a case where military power is AI and robotic, if you're trying to maintain a system going forward and the AIs are misaligned, they don't like the system and they want to make the world worse as we understand it, then that's just quite a different situation.

D I think that's a really good lead-in into the topic of lock-in. You just mentioned how there can be these kinds of coups if a large portion of the population is unsatisfied with the regime, why might this not be the case with superhuman intelligences in the far future?

C I also said it specifically with respect to things like security forces and the sources of hard power. In human affairs there are governments that are vigorously supported by a minority of the population, some narrow electorate that gets treated especially well by the government while being unpopular with most of the people under their rule. We see a lot of examples of that and sometimes that can escalate to civil war when the means of power become more equally distributed or there's a foreign assistance provided to the people who are on the losing end of that system. Going forward, I don't expect that definition to change. I think it will still be the case that a system that those who hold the guns and equivalent are opposed to is in a very difficult position.

However AI could change things pretty dramatically in terms of how security forces and police and administrators and legal systems are motivated. Right now we see with GPT-3 or GPT-4 that you can get them to change their behavior on a dime. So there was someone who made a right-wing GPT because they noticed that on political compass questionnaires the baseline GPT-4 tended to give progressive San Francisco type of answers which is in line with the people who are providing reinforcement learning data and to some extent reflecting like the character of the internet. So they did a little bit of fine-tuning with some conservative data and then they were able to reverse the political biases of the system. If you take the initial helpfulness-only trained models for some of these over, I think there's anthropic and OpenAI have published both some information about the models trained only to do what users say and not trained to follow ethical rules, and those models will behaviorally eagerly display their willingness to help design bombs or bioweapons or kill people or steal or commit all sorts of atrocities. If in the future it's as easy to set the actual underlying motivations of AI as it is right now to set the behavior that they display then it means you could have AI's created with almost whatever motivation people wish and that could really drastically change political affairs because the ability to decide and determine the loyalties of the humans or AIs and robots that hold the guns, that hold together society, that ultimately back it against violent overthrow and such. It's potentially a revolution in how societies work compared to the historical situation where security forces had to be drawn from some broader populations, offered incentives, and then the ongoing stability of the regime was dependent on whether they remained bought in to the system.

AI far future

D This is slightly off topic but one thing I'm curious about is what does the median far future outcome of AI look like? Do we get something that, when it has colonized the galaxy, is interested in diverse ideas and beautiful projects or do we get something that looks more like a paper-clip maximizer? Is there some reason to expect one or the other? I guess what I'm asking is, there's some potential value that is realizable within the matter of this galaxy. What does the median outcome look like compared to how good things could be?

C As I was saying, I think it’s more likely than not that there isn't an AI takeover. So the path of our civilization would be one that some set of human institutions were approving along the way. Different people tend to like somewhat different things and some of that may persist over time rather than everyone coming to agree on one particular monoculture or a very repetitive thing being the best thing to fill all of the available space with. If that continues that seems like a relatively likely way in which there is diversity. Although it's entirely possible you could have that kind of diversity locally, maybe in the solar system, maybe in our galaxy. But maybe people decide that there's one thing that's very good and we'll have a lot of that. Maybe it's people who are really really happy for something and they wind up in distant regions which are hard to exploit for the benefit of people back home in the solar system or the Milky Way. They do something different than they would do in the local environment but at that point it's really very out on a limb speculation about how human deliberation and cultural evolution would work in interaction with introducing AIs and new kinds of mental modification and discovery into the process. But I think there's a lot of reason to expect that you would have significant diversity for something coming out of our existing diverse human society.

D One thing somebody might wonder is that a lot of the diversity and change from human society seems to come from the fact that there's rapid technological change. Compared to galactic timescales hunter gatherer societies are progressing pretty fast so once that change is exhausted where we've discovered all the technologies, should we still expect things to be changing like that? Or would we expect some set state of hedonium where you discover the most pleasurable configuration of matter and then you just make the whole galaxy into this?

C That last point would be only if people wound up thinking that was the thing to do broadly enough. With respect to the kind of cultural changes that come with technology things like the printing press, having high per capita income, we've had a lot of cultural changes downstream of those technological changes. With an intelligence explosion you're having an incredible amount of technological development coming really quick and as that is assimilated, it probably would significantly affect our knowledge, our understanding, our attitudes, our abilities and there'd be change. But that kind of accelerating change where you have doubling in four months, two months, one month, two weeks exhausts itself very quickly and change becomes much slower and then relatively glacial. You can't have exponential economic growth or huge technological revolutions every 10 years for a million years. You hit physical limits and things slow down as you approach them so yeah, you'd have less of that turnover. But there are other things like fashion that in our experience do cause ongoing change. Fashion is frequency dependent, people want to get into a new fashion that is not already popular except among the fashion leaders and then others copy that and then when it becomes popular, you move on to the next. So that's an ongoing process of continuous change and there could be various things like that which are changing a lot year by year. But in cases where just the engine of change, ongoing technological progress is gone, I don't think we should expect that and in cases where it's possible to be either in a stable state or a widely varying state that can wind up in stable attractors then I think you should expect over time, you will wind up in one of the stable attractors or you will change how the system works so that you can't bounce into a stable attractor.

An example of that is if you're going to preserve democracy for a billion years then you can't have it be the case that one in 50 election cycles you get a dictatorship and then the dictatorship programs the AI police to enforce it forever and to ensure the society is always ruled by a copy of the dictator's mind and maybe the dictator's mind readjusted fine-tuned to remain committed to their original ideology. If you're gonna have this dynamic, liberal flexible changing in society for a very long time then the range of things that it's bouncing around and the different things it's trying and exploring have to not include the state of creating a dictatorship that locks itself in forever. In the same way if you have the possibility of a war with weapons of mass destruction that wipes out the civilization, if that happens every thousand subjective years, which could be very very quick if we have AIs that think a thousand times as fast or a million times as fast, that would be just around the corner in that case then you're like no this society is eventually going perhaps very soon if things are proceeding so fast it's going to wind up extinct and then it's going to stop bouncing around. You can have ongoing change and fluctuation for extraordinary timescales if you have the process to drive the change ongoing but you can't if it sometimes bounces into states that just lock in and stay irrecoverable from that. Extinction is one of them, a dictatorship or totalitarian regime that bans all further change would be another example.

D On that point of rapid progress when the intelligence explosion starts happening and they're making the kinds of progress that human civilization used to take centuries to make in the span of days or weeks, what is the right way to see that? Because in the context of alignment what we've been talking about so far is making sure they're honest but even if they're honest and express their intentions..

C Honest and appropriately motivated.

D What is the appropriate motivation? Like you seed it with this and then the next thousand years of intellectual progress happen in the next week. What is the prompt you enter?

C One thing might be not going at the maximal speed and doing things in a few years rather than a few months. Losing a year or two seems worth it to have things be a bit better managed. But I think the big thing is that it condenses a lot of issues that we might otherwise have thought would be over decades and centuries. These happen in a very short period of time and that's scary because if any of these the technologies we might have developed with another few hundred years of human research are really dangerous, scary bio weapon things, other dangerous WMDs, they hit us all very quickly. And if any of them causes trouble then we have to face quite a lot of trouble per period. There's also this issue of, if there's occasional wars or conflicts measured in subjective time, then if a few years of a thousand years or a million years of subjective time for these very fast minds that are operating at a much much higher speed than humans, you don't want to have a situation where every thousand years there's a war or an expropriation of the humans from AI society. Therefore we expect that within a year, we’ll be dead. It’d be pretty pretty bad to have the future compressed and there'd be such a rate of catastrophic outcomes. Human societies discount the future a lot, don't pay attention to long-term problems, but the flip side to the scary parts of compressing a lot of the future, a lot of technological innovation, a lot of social change is it brings what would otherwise be long-term issues into the short term where people are better at actually attending to them. So people facing this problem of — will there be a violent expropriation or a civil war or a nuclear war in the next year because everything has been sped up by a thousand fold? Their desire to avoid that is reason for them to set up systems and institutions that will very stably maintain invariance like no WMD war allowed, a treaty to ban genocide weapons of mass destruction, war, would be the kind of thing that becomes much more attractive if the alternative is not well, maybe that will happen in 50 years, maybe it'll happen in 100 years, maybe it'll happen this year.

Markets & other evidence

D So this is a pretty wild picture of the future and this is one that many kinds of people who you would expect to have integrated it into their world model have not. There are three main pieces of outside view evidence one could look at. One is the market. If there was going to be a huge period of economic growth caused by AI or if the world was just going to collapse, in both cases you would expect real interest rates to be higher because people will be borrowing from the future to spend now. The second outside view perspective is that you can look at the predictions of super forecasters on Metaculus. What is their median year estimate?

C Some of the Metaculus questions actually are shockingly soon for AGI. There's a much larger differentiator there on the market on the Metaculus forecasts of AI disaster and doom. More like a few percent or less rather than 20%

D Got it. The third is that when you generally ask economists if an AGI could cause rapid, rapid economic growth they usually have some story about bottlenecks in the economy that could prevent this kind of explosion, of these kinds of feedback loops. So you have all these different pieces of outside view evidence. They're obviously different so you can take them in any sequence you want. But I’m curious, what do you think is causing them to be miscalibrated?

C While the Metaculus AI timelines are relatively short, there's also the surveys of AI experts conducted at some of the ML conferences which have definitely longer times to AI, several more decades into the future. Although you can ask the questions in ways that elicit very different answers which shows that most of the respondents are not thinking super hard about their answers. In the recent AI surveys, close to half were putting around 10% risk of an outcome from AI close to as bad as human extinction and then another large chunk, 5% said that was the median. Compared to the typical AI expert I am estimating a higher risk.

Also on the topic of takeoff, in the AI expert survey the general argument for intelligence explosion commanded majority support but not a large majority. I'm closer on that front and then of course, at the beginning I mentioned these greats of computing like Alan Turing and Von Neumann, and then today, you have people like Geoff Hinton saying these things. Or the people at OpenAI and DeepMind are making noises suggesting timelines in line with what we've discussed and saying there is serious risk of apocalyptic outcomes from them. There's some other sources of evidence there. But I do acknowledge and it's important to say and engage with and see what it means, that these views are contrarian and not widely held. In particular the detailed models that I've been working with are not something that most people, or almost anyone, is examining these problems through.

You do find parts of similar analyses by people in AI labs. There's been other work. I mentioned Moravec and Kurzweil earlier, there also have been a number of papers doing various kinds of economic modeling. Standard economic growth models when you input AI related parameters commonly predict explosive growth and so there's a divide between what the models say and especially what the models say with these empirical values derived from the actual field of AI. That link up has not been done even by the economists working on AI largely and that is one reason for the report from Open Philanthropy by Tom Davidson building on these models and putting that out for review, discussion, engagement and communication on these ideas. Part of the reason is I want to raise these issues, that’s one reason I came on the podcast and then they have the opportunity to actually examine the arguments and evidence and engage with it. I do predict that over time these things will be more adopted as AI developments become more clear. Obviously that's a coherence condition of believing the things to be true if you think that society can see when the questions are resolved, which seems likely.

D Would you predict, for example, that interest rates will increase in the coming years?

C Yeah. So in the case we were talking about where this intelligence explosion happening in software to the extent that investors are noticing that, yeah they should be willing to lend money or make equity investments in these firms or demanding extremely high interest rates because if it's possible to turn capital into twice as much capital in a relatively short period and then more shortly after that, then yeah you should demand a much higher return. Assuming there's competition among companies or coalitions for resources, whether that's investment or ownership of cloud compute. That would happen before you have so much investor cash making purchases and sales on this basis, you would first see it in things like the valuations of the AI companies, valuations of AI chip makers, and so far there have been effects. Some years ago, in the 2010s, I did some analysis with other people of — if this kind of picture happens then which are the firms and parts of the economy that would benefit. There's the makers of chip equipment companies like ASML, there's the fabs like TSMC, there's chip designers like NVIDIA or the component of google that does things like design the TPU and then there’s companies working on the software so the big tech giants and also companies like OpenAI and DeepMind. In general the portfolio picking at those has done well. It's done better than the market because as everyone can see there's been an AI boom but it's obviously far short of what you would get if you predicted this is going to go to be like on the scale of the global economy and the global economy is going to be skyrocketing into the stratosphere within 10 years. If that were the case then collectively, these AI companies should be worth a large fraction of the global portfolio. So I embrace the criticism that this is indeed contrary to the efficient market hypothesis. I think it's a true hypothesis that the market is in the course of updating on in the same way that coming into the topic in the 2000s that yes, they're the strong case even an old case the AI will eventually be biggest thing in the world it's kind of crazy that the investment in it is so small. Over the last 10 years we've seen the tech industry and academia realize that they were wildly under investing in just throwing compute and effort into these AI models. Particularly like letting the neural network connectionist paradigm languish in an AI winter. I expect that process to continue as it's done over several orders of magnitude of scale up and I expect at the later end of that scale which the market is partially already pricing in it's going to go further than the market expects.

D Has your portfolio changed since the analysis you did many years ago? Are the companies you identified then still the ones that seem most likely to benefit from the AI boom?

C A general issue with tracking that kind of thing is that new companies come in. Open AI did not exist, Anthropic did not exist. I do not invest in any AI labs for conflict of interest reasons. I have invested in the broader industry. I don't think that the conflict issues are very significant because they are enormous companies and their cost of capital is not particularly affected by marginal investment and I have less concern that I might find myself in a conflict of interest situation there.

Day in the life of Carl Shulman

D I'm curious about what the day in the life of somebody like you looks like. If you listen to this conversation, how ever many hours it's been, we've gotten incredibly insightful and novel thoughts about everything from primate evolution to geopolitics to what sorts of improvements are plausible with language models. There's a huge variety of topics that you are studying and investigating. Are you just reading all day? What happens when you wake up, do you just pick up a paper?

C I'd say you're somewhat getting the benefit of the fact that I've done fewer podcasts so I have a backlog of things that have not shown up in publications yet. But yes, I've also had a very weird professional career that has involved a much much higher proportion than is normal of trying to build more comprehensive models of the world. That included being more of a journalist trying to get an understanding of many issues and many problems that had not yet been widely addressed but do a first pass and a second pass dive into them. Just having spent years of my life working on that, some of it accumulates. In terms of what is a day in the life, how do I go about it? One is just keeping abreast of literature on a lot of these topics, reading books and academic works on them. My approach compared to some other people in forecasting and assessing some of these things, I try to obtain and rely on any data that I can find that is relevant. I try early and often to find factual information that bears on some of the questions I've got, especially in a quantitative fashion, do the basic arithmetic and consistency checks and checksums on a hypothesis about the world.

Do that early and often. And I find that's quite fruitful and that people don't do it enough. Things like with the economic growth, just when someone mentions the diminishing returns, I immediately ask hmm, okay, so you have two exponential processes. What's the ratio between the doubling you get on the output versus the input? And find oh yeah, for computing and information technology and AI software it's well on the one side. There are other technologies that are closer to neutral. Whenever I can go from here's a vague qualitative consideration in one direction and here's a vague qualitative consideration in the other direction, I try and find some data, do some simple Fermi calculations, back of the envelope calculations and see if I can get a consistent picture of the world being one way or the world being another. I also try to be more exhaustive compared to some. I'm very interested in finding things like taxonomies of the world where I can go systematically through all of the possibilities. For example in my work with Open Philanthropy and previously on global catastrophic risks I wanted to make sure I'm not missing any big thing, anything that could be the biggest thing. I wound up mostly focused on AI but there have been other things that have been raised as candidates and people sometimes say, I think falsely, that this is just another doomsday story there must be hundreds and hundreds of those. So I would do things like go through all of the different major scientific fields from anthropology to biology, chemistry, computer science, physics. What are the doom stories or candidates for big things associated within each of these fields? Go through the industries that the U.S. economic statistics agencies recognize and say for each of these industries is there something associated with them? Go through all of the lists that people have made of threats of doom, search for previous literature of people who have done discussions and then yeah, have a big spreadsheet of what the candidates are. Some other colleagues have done work of this sort as well and just go through each of them to see how they check out.

Doing that kind of exercise found that actually the distribution of candidates for risks of global catastrophe was very skewed. There were a lot of things that have been mentioned in the media as a potential doomsday story. Things like something is happening to the bees, will that be the end of humanity? This gets to the media but if you take it through it doesn’t check out. There are infestations in bee populations which are causing local collapses but they can then be easily reversed, just breed some more or do some other things to treat this. And even if all the honey bees were extinguished immediately, the plants that they pollinate actually don't account for much of human nutrition. You could swap the arable land with others and there would be other ways to pollinate and support the things.

At the media level there were many tales of doomsday stories but when you go further to the scientists and whether their arguments for it actually check out, it was not there. But by actually systematically looking through many of these candidates I wound up in a different epistemic situation than someone who's just buffeted by news reports and they see article after article that is claiming something is going to destroy the world and it turns out it's like by way of headline grabbing and attempts by media to like over interpret something that was said by some activists who was trying to over interpret some real phenomenon. Most of these go away and then a few things like nuclear war, biological weapons, artificial intelligence check out more strongly and when you weigh things like what do experts in the field think, what kind of evidence can they muster? You find this extremely skewed distribution and I found that was really a valuable benefit of doing those deep dive investigations into many things in a systematic way because now I can answer a loose agnostic who knows and all the all this nonsense by diving deeply.

D I really enjoy talking to people who have a big picture thesis on the podcast and interviewing them but one thing that I've noticed and is not satisfying is that often they come from a very philosophical or vibes based perspective. This is useful in certain contexts but there's like basically maybe three people in the entire world, at least three people I'm aware of, who have a very rigorous and scientific approach to thinking about the whole picture. There’s no university or existing academic discipline for people who are trying to come up with a big picture and so there's no established standards.

C I hear you. This is a problem and this is an experience also with a lot of the world of investigations work. I think holden was mentioning this in your previous episode. These are questions where there is no academic field whose job it is to work on these and has norms that allow making a best effort go at it. Often academic norms will allow only plucking off narrow pieces that might contribute to answering a big question but the problem of actually assembling what science knows that bears on some important question that people care about the answer to it falls through the crack there's no discipline to do that job so you have countless academics and researchers building up local pieces of the thing and yet people don't follow the Hamming questions: What's the most important problem in your field, why aren't you working on it? I mean that one might not actually work because if the field boundaries are defined too narrowly you'll leave it out. But yeah there are important problems for the world as a whole that it's sadly not the job of a large professionalized academic field or organization to do. Hopefully that's something that can change in the future but for my career it's been a matter of taking low-hanging fruit of important questions that sadly people haven't invested in doing the basic analysis on

D One thing I was trying to think about more recently for the podcast is, I would like to have a better world model after doing an interview. Often I feel like I do but in some cases after some interviews, I feel like that was entertaining but do I fundamentally have a better prediction of what the world looks like in 2200 or 2100? Or at least what counterfactuals are ruled out or something. I'm curious if you have advice on first, identifying the kinds of thinkers and topics which will contribute to a more concrete understanding of the world and second, how to go about analyzing their main ideas in a way that concretely adds to that picture? This was a great episode. This is literally the top in terms of contributing to my world model compared to all the episodes I've done. How do I find more of these? Ls

C I’m glad to hear that. One general heuristic is to find ways to hew closer to things that are rich and bodies of established knowledge and less impenetrable–I don't know how you've been navigating that so far but learning from textbooks and the things that were the leading papers and people of past eras I think rather than being too attentive to current news cycles is quite valuable. I don't usually have the experience of — here is someone doing things very systematically over a huge area. I can just read all of their stuff and then absorb it and then I'm set. Except there are a lot of people who do wonderful works in their own fields and some of those fields are broader than others. I think I would wind up giving a lot of recommendations of just great particular works and particular explorations of an issue or history

D Do you have this list somewhere?

C Vaclav Smil’s books. I often disagree with some of his methods of synthesis but I enjoy his books for giving pictures of a lot of interesting relevant facts about how the world works that I would cite. Some of Joel Mokyr’s work on the history of the scientific revolution and how that interacted with economic growth as an example of collecting a lot of evidence, a lot of interesting valuable assessment. In the space of AI forecasting one person I would recommend going back to is the work of Hans Moravec. It was not always the most precise or reliable but an incredible number of brilliant innovative ideas came out of that and I think he was someone who really grokked a lot of the arguments for a more compute-centric way of thinking about what was happening with AI very early on. He was writing stuff in the 70s and maybe even earlier. His book Mind Children, some of his early academic papers. Fascinating not necessarily for the methodology I've been talking about but for exploring the substantive topics that we were discussing in the episode.

Space warfare, Malthusian long run, & other rapid fire

D Is a Malthusian state inevitable in the long run?

C Nature in general is in malthusian states. That can mean organisms that are typically struggling for food, it can mean typically struggling at a margin of how as the population density rises they kill each other contesting for that. That can mean frequency dependent disease. As different ant species become more common in an area their species specific diseases swoop through them. The general process is you have some things that can replicate and expand and they do that until they can't do it anymore and that means there's some limiting factor they can't keep up. That doesn't necessarily have to apply to human civilization. It's possible for there to be like a collective norm setting that blocks evolution towards maximum reproduction. Right now human fertility is often sub-replacement and if you extrapolated the fertility falls that come with economic development and education, then you would think that the total fertility rate will fall below replacement and then humanity after some number of generations will go extinct because every generation will be smaller than the previous one. Pretty obviously that's not going to happen. One reason is because we will produce artificial intelligence which can replicate at extremely rapid rates. They do it because they're asked or programmed to or wish to gain some benefit and they can pay for their creation and pay back the resources needed to create them very very quickly. Financing for that reproduction is easy and if you have one AI system that chooses to replicate in that way or some organization or institution decided to choose to create some AIs that are willing to be replicated then that can expand to make use of any amount of natural resources that can support them and to do more work produce, produce more economic value. What will limit population growth given these selective pressures where if even one individual wants to replicate a lot they can do so incessantly.

So that could be individually resource limited so it could be that individuals and organizations have some endowment of natural resources and they can't get one another's endowments. Some choose to have many offspring or produce many AIs and then the natural resources that they possess are subdivided among a greater population while in another jurisdiction or another individual may choose not to subdivide their wealth. And in that case you have Malthusianism in the sense that within some particular jurisdiction or set of property rights, you have a population that has increased up until to some limiting factor which could be that they're literally using all of their resources, they have nothing left for things like defense or economic investment. Or it could be something that's more like if you invested more natural resources into population it would come at the expense of something else necessary including military resources if you're in a competitive situation where there remains war and anarchy and there aren't secure property rights to maintain wealth in place. If you have a situation where there's pooling of resources, for example, say you have a universal basic income that's funded by taxation of natural resources and then it's distributed evenly to every mind above a certain scale of complexity per unit time. So each second a mind exists to get something such an allocation in that case then all right well those who replicate as much as they can afford with this income do it and increase their population approximately immediately until the funds for the universal basic income paid for from the natural resource taxation divided by the set of recipients is just barely enough to pay for the existence of one more mind. So there's like a Malthusian element and that this I think has been reduced to near the AI subsistence level or the subsistence level of whatever qualifies for the subsidy.

Given that this all happens almost immediately people who might otherwise have enjoyed the basic income may object and say no, no, this is no good and they might respond by saying, well something like the subdivision before maybe there's a restriction, there's a distribution of wealth and then when one has a child there's a requirement that one gives them a certain minimum a quantity of resources and one doesn't have the resources to give them that minimum standard of living or standard of wealth yeah one can't do that because of child slash AI welfare laws. Or you could have a system that is more accepting of diversity and preferences. And so you have some societies or some jurisdictions or families that go the route of having many people with less natural resources per person and others that go a direction of having fewer people and more natural resources per person and they just coexist. But how much of each you get depends on how attached people are to things that don't work with separate policies for separate jurisdictions. Things like global redistribution that's ongoing continuously versus this infringements on autonomy if you're saying that a mind can't be created even though it has a standard of living that's far better than ours because of the advanced technology of the time because it would reduce the average per capita income might have any more capital around yeah then that would pull in the other direction. That’s the kind of values judgment and social coordination problem that people would have to negotiate for and things like democracy and international relations and sovereignty would apply to help solve them.

D What would warfare in space look like? Would offense or defense have the advantage? Would the equilibrium set by mutually assured destruction still be applicable? Just generally, what is the picture?

C The extreme difference is that things are very far apart outside the solar system and there's the speed of light limit and to get close to that limit you have to use an enormous amount of energy. That in some ways could favor the defender because you have something that's coming in at a large fraction the speed of light and it hits a grain of dust and it explodes. The amount of matter you can send to another galaxy or a distant star for a given amount of reaction mass and energy input is limited. So it's hard to send an amount of military material to another location as what can be present there already locally. That would seem like it would make it harder for the attacker between stars or between galaxies but there are a lot of other considerations. One thing is the extent to which the matter in a region can be harnessed all at once. We have a lot of mass and energy in a star but it's only being doled out over billions of years because hydrogen fusion is exceedingly hard outside of a star. It's a very very slow and difficult reaction and if you can't turn the star into energy faster then it's this huge resource that will be worthwhile for billions of years and so even very inefficiently attacking a solar system to acquire the stuff that's there could pay off.

If it takes a thousand years of a star's output to launch an attack on another star and then you hold it for a billion years after that then it can be the case that just like a larger surrounding attacker might be able to, even very inefficiently, send attacks at a civilization that was small but accessible. If you can quickly burn the resources that the attacker might want to acquire, if you can put stars into black holes and extract most of the usable energy before the attacker can take them over, then it would be like scorched earth. It's like most of what you were trying to capture could be expended on military material to fight you and you don't actually get much that is worthwhile and you paid a lot to do it and that would favor the defense. At this level it's pretty challenging to net out all the factors including all the future technologies. The burden of interstellar attack being quite high compared to our conventional things seems real but at the level of, over millions of years weighing then that thing does it result in if the if they're aggressive conquest or not or is every star or galaxy approximately impregnable enough not to be worth attacking. I'm not going to say I know the answer.

D Okay, final question. How do you think about info hazards when talking about your work? Obviously if there's a risk you want to warn people about it but you don't want to give careless or potentially homicidal people ideas. When Eliezer was on the podcast talking about the people who've been developing AI being inspired by his ideas. He called them idiot disaster monkeys who want to be the ones to pluck the deadly fruit. I'm sure the work you're doing involves many info hazards. How do you think about when and where to spread them?

C I think they're real concerns of that type. I think it's true that AI progress has probably been accelerated by efforts like Bostrom's publication of superintelligence to try and get the world to pay attention to these problems in advance and prepare. I think I disagree with Eliezer that that has been on the whole bad. In some important ways the situation is looking a lot better than the alternative ways it could have been. I think it's important that you have several of the leading AI labs making not only significant lip service but also some investments in things like technical alignment research, providing significant public support for the idea that the risks of truly apocalyptic disasters are real. I think the fact that the leaders of OpenAI, Deep Mind and Anthropic all make that point. They were recently all invited along with other tech CEOs to the White House to discuss AI regulation.

You could tell an alternative story where a larger share of the leading companies in AI are led by people who take a completely dismissive, denialist view and you see some companies that do have a stance more like that today. So a world where several of the leading companies are making meaningful efforts and you can do a lot to criticize could they be doing more and better and would have been the negative effects of some of the things they've done but compared to a world where even though AI would be reaching where it's going a few years later, those seem like significant benefits. And if you didn't have this kind of public communication you would have had fewer people going into things like AI policy, AI alignment research by this point and it would be harder to mobilize these resources to try and address the problem when AI would eventually be developed not that much later proportionately. I don't know that attempting to have public discussion understanding has been a disaster. I have been reluctant in the past to discuss some of the aspects of intelligence explosion, things like the concrete details of AI takeover before because of concern about this problem where people who only see the international relations aspects and zero sum and negative sum competition and not enough attention to the mutual destruction and senseless deadweight loss from that kind of conflict.

At this point we seem close compared to what I would have thought a decade or so ago to these kinds of really advanced AI capabilities. They are pretty central in policy discussion and becoming more so. The opportunity to delay understanding and whatnot, there's a question of — For what? I think there were gains of building the AI alignment field, building various kinds of support and understanding for action. Those had real value and some additional delay could have given more time for that but from where we are, at some point I think it's absolutely essential that governments get together at least to restrict disastrous reckless compromising of some of the safety and alignment issues as we go into the intelligence explosion. Moving the locus of the collective action problem from numerous profit oriented companies acting against one another's interest by compromising safety to some governments and large international coalitions of governments who can set common rules and common safety standards puts us into a much better situation. That requires a broader understanding of the strategic situation and the position they'll be in.

If we try and remain quiet about the problem they're actually going to be facing it can result in a lot of confusion. For example the potential military applications of advanced AI are going to be one of the factors that is pulling political leaders to do the thing that will result in their own destruction and the overthrow of their governments. If we characterize it as things will just be a matter of — you lose chatbots and some minor things that no one cares about and in exchange you avoid any risk of the world ending catastrophe, I think that picture leads to a misunderstanding and it won't make people think that you need less in the way of preparation of things like alignment so you can actually navigate the thing, verifiability for international agreements, or things to have enough breathing room to have caution and slow down. Not necessarily right now, although that could be valuable, but when it's so important when you have AI that is approaching the ability to really automate AI research and things would otherwise be proceeding absurdly fast, far faster than we can handle and far faster than we should want.

So yeah, at this point I'm moving towards sharing my model of the world to try and get people to understand and do the right thing. There's some evidence of progress on that front. Things like the statements and movements by Geoff Hinton are inspiring. Some of the engagement by political figures is reason for optimism relative to worse alternatives that could have been. And yes, the contrary view is present. It's all about geopolitical competition, never hold back a technological advance and in general, I love many technological advances that people I think are unreasonably down on, nuclear power, genetically modified crops. Bioweapons and AGI capable of destroying human civilization are really my two exceptions and yeah we've got to deal with these issues and the path that I see to handling them successfully involves key policymakers and the expert communities and the public and electorate grokking the situation therein and responding appropriately.

D It’s a true honor that one of the places you've decided to explore this model is on The Lunar Society podcast. The listeners might not appreciate it because this episode might be split up into different parts and they might not appreciate how much stamina you've displayed here. I think we've been going for eight or nine hours straight and it's been incredibly interesting. Other than typing C on Google Scholar, where else can people find your work?

C I have a blog reflective disequilibrium and a new site in the works.

D Excellent. Alright, Carl this has been a true pleasure. Safe to say it’s the most interesting episode I've done so far.

C Thank you for having me.

New Comment