Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Eric Drexler's report Reframing Superintelligence: Comprehensive AI Services (CAIS) as General Intelligence reshaped how a lot of people think about AI (summary 1, summary 2). I still agree with many parts of it, perhaps even the core elements of the model. However, after looking back on it more than four years later, I think the general picture it gave missed some crucial details about how AI will go.

The problem seems to be that his report neglected a discussion of foundation models, which I think have transformed how we should think about AI services and specialization. 

The general vibe I got from CAIS (which may not have been Drexler's intention) was something like the following picture: 

For each task in the economy, we will train a model from scratch to automate the task, using the minimum compute necessary to train an AI to do well on the task. Over time, the fraction of tasks automated will slowly expand like a wave, starting with the tasks that are cheapest to automate computationally, and ending with the most expensive tasks. At some point, automation will be so widespread that it will begin to meaningfully feed into itself, increasing AI R&D, and accelerating the rate of technological progress.

The problem with this approach to automation is that it's extremely wasteful to train models from scratch for each task. It might make sense when training budgets are tiny — as they mostly were in 2018 — but it doesn't make sense when it takes 10^25 FLOP to reach adequate performance on a given set of tasks.

The big obvious-in-hindsight idea that we've gotten over the last several years is that, instead of training from scratch for each new task, we'll train a foundation model on some general distribution, which can then be fine-tuned using small amounts of compute to perform well on any task. In the CAIS model, "general intelligence" is just the name we can give to the collection of all AI services in the economy. In this new paradigm, "general intelligence" refers to the fact that sufficiently large foundation models can efficiently transfer their knowledge to obtain high performance on almost any downstream task, which is pretty closely analogous to what humans do to take over jobs.

The fact that generalist models can be efficiently adapted to perform well on almost any task is an incredibly important fact about our world, because it implies that a very large fraction of the costs of automation can be parallelized across almost all tasks. 

Let me illustrate this fact with a hypothetical example.

Suppose we previously thought that it would take $1 trillion to automate each task in our economy, such as language translation, box stacking, and driving cars. Since the cost of automating each of these tasks is $1 trillion each, you might expect companies would slowly automate all the tasks in the economy, starting with the most profitable ones, and then finally getting around to the least profitable ones once economic growth allowed for us to spend enough money on automating not-very-profitable stuff. 

But now suppose we think it costs $999 billion to create "general intelligence", which then once built, can be quickly adapted to automate any other task at a cost of $1 billion. In this world, we will go very quickly from being able to automate almost nothing to being able to automate almost anything. In other words we will get one big innovation "lump", which is the opposite of what Robin Hanson predicted. Even if we won't invent monolithic agents that take over the world by being smarter than everything else, we won't have a gradual decades-long ramp-up to full automation either.
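To make the arithmetic in this hypothetical concrete, here is a minimal sketch comparing the two cost structures. The dollar figures are the ones from the example above; the function names and the task counts (1, 10, 100) are just illustrative assumptions.

```python
TRILLION = 1e12
BILLION = 1e9


def per_task_cost(num_tasks: int, cost_per_task: float) -> float:
    """Total cost when every task is automated from scratch."""
    return num_tasks * cost_per_task


def foundation_cost(num_tasks: int, base_cost: float, adapt_cost: float) -> float:
    """Total cost when one shared base model is adapted to each task."""
    if num_tasks == 0:
        return 0.0
    return base_cost + num_tasks * adapt_cost


for n in (1, 10, 100):
    scratch = per_task_cost(n, 1 * TRILLION)
    shared = foundation_cost(n, 999 * BILLION, 1 * BILLION)
    print(f"{n:>3} tasks: from scratch ${scratch / TRILLION:.3f}T, "
          f"shared base ${shared / TRILLION:.3f}T")
```

The first task costs the same either way ($1 trillion), but every additional task costs $1 trillion in the first world and only $1 billion in the second, which is why automation arrives as one lump rather than a slow wave.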

Of course, the degree of suddenness in the foundation model paradigm is still debatable, because the idea of "general intelligence" is itself continuous. GPT-4 is more general than GPT-3, which was more general than GPT-2, and presumably this trend will smoothly continue indefinitely as a function of scale, rather than shooting up discontinuously after some critical threshold. But results in the last year or so have updated me towards thinking that the range from "barely general enough to automate a few valuable tasks" to "general enough to automate almost everything humans do" is only 5-10 OOMs of training compute. If this range turns out to be 5 OOMs, then I expect a fast AI takeoff under Paul Christiano's definition, even though I still don't think this picture looks much like the canonical version of foom.

Foundation models also change the game because they imply that AI development must be highly concentrated at the firm-level. AIs themselves might be specialized to provide various services, but the AI economy depends critically on a few non-specialized firms that deliver the best foundation models at any given time. There can only be a few firms in the market providing foundation models because the fixed capital costs required to train a SOTA foundation model are very high, and being even 2 OOMs behind the lead actor results in effectively zero market share. Although these details are consistent with CAIS, it's a major update about what the future AI ecosystem will look like.

A reasonable remaining question is why we'd ever expect AIs to be specialized in the foundation model paradigm. I think the reason is that generalist models are more costly to run at inference time compared to specialized ones. After fine-tuning, you will want to compress the model as much as possible, while maintaining acceptable performance on whatever task you're automating.

The degree of specialization will vary according to the task you want to automate. Some tasks require very general abilities to do well. For example, being a CEO plausibly benefits from being extremely general, way beyond even human-level, such that it wouldn't make sense to make them less general even if it saved inference costs. On the other hand, language translation is plausibly something that can be accomplished acceptably using far less compute than a CEO model. In that case, you want inference costs to be much lower.

It now seems clear that AIs will also descend more directly from a common ancestor than you might have naively expected in the CAIS model, since most important AIs will be a modified version of one of only a few base foundation models. That has important safety implications, since problems in the base model might carry over to problems in the downstream models, which will be spread throughout the economy. That said, the fact that foundation model development will be highly centralized, and thus controllable, is perhaps a safety bonus that loosely cancels out this consideration.

Drexler can be forgiven for not talking about foundation models in his report. His report was published at the start of 2019, just months after the idea of "fine-tuning" was popularized in the context of language models, and two months before GPT-2 came out. And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity. And we're still using deep learning as Drexler foresaw, rather than building general intelligence like a programmer would. Like I said at the beginning, it's not necessarily that the core elements of the CAIS model are wrong; the model just needs an update.


I wanted to add several additional elements to the base CAIS model that recent advances and my own experience make obvious:

A.  Stateless, limited time sessions.  Break tasks into the smallest granular subtask, and erase input buffers after each subtask completes, starting over.  This is fundamentally how the stateless microservices of all the large tech giants allow any kind of reliability for hyperscale online services.
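A rough sketch of what point A could look like in code. The `Session` class is a hypothetical stand-in for one model session with its own input buffer, not any real API; the only point is that nothing survives from one subtask to the next.

```python
from typing import List


class Session:
    """Toy stand-in for one model session with its own input buffer."""

    def __init__(self) -> None:
        self.buffer: List[str] = []

    def run(self, subtask: str) -> str:
        self.buffer.append(subtask)
        return f"{subtask}: done with {len(self.buffer)} item(s) in context"


def run_stateless(subtasks: List[str]) -> List[str]:
    """Run each subtask in a brand-new session, then discard the session."""
    results = []
    for subtask in subtasks:
        session = Session()   # fresh buffer, nothing carried over
        results.append(session.run(subtask))
        del session           # drop the buffer before the next subtask
    return results


for line in run_stateless(["parse request", "plan route", "emit command"]):
    print(line)
```

Every subtask reports a context of exactly one item, because each one starts from an empty buffer.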

B.  Check if the input this frame is in the training distribution.  One way to implement this is to train an autoencoder on the training set, and autoencode each input frame to the model.  So (encoded frame, residual) = autoencoder (input frame).  "frame"** is a discrete set of inputs, it's the sum of all inputs to the model.  

If the input is out of distribution, measured by a large residual, do not send the input to the model.  Transfer control to a trustworthy embedded system.  Autonomous vehicles use a strategy like this: on main system failure/low confidence, control authority is given to a system that uses a single camera and works more like older lane following systems.   If that system fails from hardware faults, the base safety microcontroller takes away control authority from the upper systems and engages the brakes.  
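Here is a minimal sketch of the gating logic in B, under loud assumptions: the "autoencoder" is a rank-1 linear one fitted by power iteration (a stand-in for whatever trained autoencoder you would really use), the frames are toy 2-D vectors, and `model`, `fallback`, and the `max_residual` threshold are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Frame = List[float]
Autoencoder = Callable[[Frame], Tuple[float, float]]


def fit_linear_autoencoder(train: Sequence[Frame], iters: int = 100) -> Autoencoder:
    """Fit a rank-1 linear autoencoder: training mean plus top principal direction."""
    dim = len(train[0])
    mean = [sum(f[i] for f in train) / len(train) for i in range(dim)]
    centered = [[f[i] - mean[i] for i in range(dim)] for f in train]
    direction = [1.0] * dim
    for _ in range(iters):  # power iteration on the covariance matrix
        new = [0.0] * dim
        for row in centered:
            proj = sum(r * d for r, d in zip(row, direction))
            for i in range(dim):
                new[i] += proj * row[i]
        norm = math.sqrt(sum(x * x for x in new)) or 1.0
        direction = [x / norm for x in new]

    def autoencode(frame: Frame) -> Tuple[float, float]:
        """Return (encoded frame, residual norm) for one input frame."""
        diff = [x - m for x, m in zip(frame, mean)]
        code = sum(a * d for a, d in zip(diff, direction))
        recon = [m + code * d for m, d in zip(mean, direction)]
        residual = math.sqrt(sum((x - r) ** 2 for x, r in zip(frame, recon)))
        return code, residual

    return autoencode


def gated_step(frame: Frame, autoencode: Autoencoder,
               model: Callable[[Frame], str], fallback: Callable[[Frame], str],
               max_residual: float) -> str:
    """Run the model only on in-distribution frames; otherwise hand over control."""
    _, residual = autoencode(frame)
    if residual > max_residual:
        return fallback(frame)  # trusted embedded system takes over
    return model(frame)


def model(frame: Frame) -> str:
    return "model"


def fallback(frame: Frame) -> str:
    return "fallback"


# Toy training set: frames lying on the line y = 2x.
train = [[x, 2.0 * x] for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
autoencode = fit_linear_autoencoder(train)

print(gated_step([1.5, 3.0], autoencode, model, fallback, max_residual=0.5))
print(gated_step([1.5, -3.0], autoencode, model, fallback, max_residual=0.5))
```

The first frame lies on the training line, so its residual is near zero and the model keeps control; the second is far off the line, so control goes to the fallback.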

C.  The outer framework must have control authority.  The model isn't in charge, the outer wrapper code (written and carefully validated by humans) ultimately is.

D.  Inputs and outputs to the model must be sparse.  No extra bits.  For an example of this, if you ask GPT-4 to write a piece of code, then move the output text to a different session, the new session of GPT-4 has no context information but the text, which is human readable.  This is fairly sparse, it's difficult for one session of the model to embed "hey fellow AI, keep this bug in the code hidden from the meatsacks" such that the next session will see the secret message.  Even more difficult, close to impossible, if you have the model doing the checking based on a different base model that shares no common context for a secret language.  

Perfectly sparse would be an encoded schema that has nothing but the task instructions, written in a formal schema, and only the information needed to proceed from prior frames.
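As a sketch of what a "formal schema with no extra bits" might look like: the message below has a closed set of task names, an integer target, and a short tuple of carried-over integers, with no free-text field anywhere. All field names and task values are hypothetical, chosen only for illustration.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Task(Enum):
    STACK_BOX = "stack_box"
    MOVE_TO = "move_to"


@dataclass(frozen=True)
class TaskMessage:
    task: Task
    target_id: int
    carried_state: Tuple[int, ...]  # only what the next frame needs


ALLOWED_KEYS = {"task", "target_id", "carried_state"}


def parse_message(raw: str) -> TaskMessage:
    """Parse a message, rejecting anything outside the fixed schema."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != ALLOWED_KEYS:
        raise ValueError("message must contain exactly the schema fields")
    target_id = data["target_id"]
    state = data["carried_state"]
    if type(target_id) is not int:
        raise ValueError("target_id must be an integer")
    if not isinstance(state, list) or any(type(v) is not int for v in state):
        raise ValueError("carried_state must be a list of integers")
    # Task(...) raises ValueError for any task name outside the enum.
    return TaskMessage(Task(data["task"]), target_id, tuple(state))


print(parse_message('{"task": "stack_box", "target_id": 7, "carried_state": [1, 2]}'))
```

A message with an extra key, an unknown task name, or a string where an integer belongs is rejected outright, so there is very little room to smuggle in anything beyond the task itself. (Integer fields can still carry some bits, so in practice you would also want to bound their ranges.)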

E.  Different base models must be sufficiently intercompatible you can use them to check each other's work.  For example, there is no reason you couldn't alternate control every frame for a robot using 2 completely different base models, so long as both are rational control systems and there is 1 and only 1 optimal policy that both models can calculate for each frame.
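A toy sketch of the alternating-control idea in E, assuming the "models" are plain functions from a scalar frame (say, a tracking error) to a scalar action. The two policies here are hypothetical stand-ins that compute the same optimal action by different routes; `faulty_policy` is there to show what disagreement looks like.

```python
from typing import Callable, List, Sequence

Policy = Callable[[float], float]


def alternate_control(frames: Sequence[float], policies: Sequence[Policy],
                      tolerance: float = 1e-6) -> List[float]:
    """Rotate control authority between policies, cross-checking every frame."""
    actions = []
    for i, frame in enumerate(frames):
        active = policies[i % len(policies)]
        action = active(frame)
        for other in policies:
            if other is not active and abs(other(frame) - action) > tolerance:
                raise RuntimeError(f"policies disagree on frame {i}")
        actions.append(action)
    return actions


def policy_a(error: float) -> float:
    return -0.5 * error


def policy_b(error: float) -> float:
    return -error / 2.0


def faulty_policy(error: float) -> float:
    return -0.5 * error + 1.0


print(alternate_control([2.0, -1.0, 0.5], [policy_a, policy_b]))
```

Swapping `policy_b` for `faulty_policy` makes `alternate_control` raise on the first frame, which is the point at which the outer framework would take control away from both models.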

F.  Finally, by using the sparse I/O in D, it must be impossible for models to know if they are operating in the real world and not in the training simulator.  If they get no context but the frame and the prompt, and the frame is in distribution to the training simulator, they cannot know they are not in the sim, which obviously can detect most forms of simulated wrongdoing. 

G.  Train only in the simulator.  The latest generative models are capable of taking real world data and producing many permutations on it that are realistic training examples.  So when the system encounters an unexpected outcome in the real world or a novel situation, you save to an onboard log the frame*, expected outcome, ground truth, etc.  And use a generative model to augment the unexpected situation so that in the simulator the model can practice on many permutations of the situation.  

Since the model's performance in the simulator was measured to be sufficiently safe for the task it is performing, and the simulator was measured to predict the real world sufficiently accurately (you run an instance of the simulator in parallel in the real world on each robot), and you will not give the model control authority when the real world is out of distribution, you can bound your safety probabilities and prove the model is sufficiently* safe for the task it is assigned.  
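One way to read "bound your safety probabilities" is as a union bound over the three ways things can go wrong: the model acts unsafely on a frame the simulator covers, the simulator mispredicts the real world, or an out-of-distribution frame slips past the check. That decomposition is my assumption, and the numbers below are made up; the sketch only shows the bookkeeping.

```python
def failure_bound(p_unsafe_in_sim: float, p_sim_mismatch: float,
                  p_ood_missed: float) -> float:
    """Union bound on per-task failure probability from three failure routes."""
    for p in (p_unsafe_in_sim, p_sim_mismatch, p_ood_missed):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return min(1.0, p_unsafe_in_sim + p_sim_mismatch + p_ood_missed)


def meets_design_rate(bound: float, design_error_rate: float) -> bool:
    """True if the bounded failure probability is within the design target."""
    return bound <= design_error_rate


bound = failure_bound(1e-5, 1e-5, 1e-5)   # hypothetical measured rates
print(meets_design_rate(bound, 1e-3))     # a tolerant, low-stakes task
print(meets_design_rate(bound, 1e-6))     # a far more demanding task
```

The union bound needs no independence assumption between the three routes, which is why it is a safe (if pessimistic) way to add them up. The same measured rates can clear the bar for one task and fail it for another.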

**the "frame" is all inputs, for example it might be a lidar image, 6 cameras, all the currently tracked objects that have been seen on prior frames and their estimated identities and vectors, intermediate internal data that caches some activations from the prior frame, and of course the prompt.  The model(frame, weights, RNG seed) should deterministically produce the same output.  (current sw stacks are not quite deterministic but could be at a small cost of performance)
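A minimal sketch of the determinism property, assuming a toy model whose only source of randomness is a generator built from the seed it is handed. The function, weights, and noise scale are all hypothetical; a real stack would also have to pin down nondeterministic kernels.

```python
import random
from typing import Sequence


def model(frame: Sequence[float], weights: Sequence[float], seed: int) -> float:
    """Toy stochastic model whose randomness comes only from the given seed."""
    rng = random.Random(seed)            # private generator, no global state
    score = sum(w * x for w, x in zip(weights, frame))
    return score + rng.gauss(0.0, 0.01)  # sampling noise, fixed by the seed


frame = [0.2, 0.7, 0.1]
weights = [1.0, -2.0, 0.5]
first = model(frame, weights, seed=42)
replay = model(frame, weights, seed=42)
print(first == replay)
```

Because the output is a pure function of (frame, weights, seed), any logged frame can be replayed later and the model's decision reproduced exactly.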

*depends on the scope of the task.  No system will be perfect, the idea is to stack probabilities to reach your design error rate.  A robot performing heart surgery will need much more reliability than one stacking boxes.  The heart surgery machine may need electromechanical limits because you may have to leave the model running even when it leaves distribution, since stopping an operation is more likely to be fatal than continuing.

Electromechanical limits are things like embedding the hardware for the machine on site, hardwired for one purpose, and not mounting the robotic arms on a mobile chassis.  This limits failures to killing the patient, instead of the machine being able to, say, escape into the hospital and 'perform surgery' on random people.

Drexler can be forgiven for not talking about foundation models in his report. His report was published at the start of 2019, just months after the idea of "fine-tuning" was popularized in the context of language models, and two months before GPT-2 came out. And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity. And we're still using deep learning as Drexler foresaw, rather than building general intelligence like a programmer would. Like I said at the beginning, it's not necessarily that the core elements of the CAIS model are wrong; the model just needs an update.

I think this is a good post, and I like the analysis you give in many ways, but I have a small bone to pick with this one. I don't think Drexler was right that we will have millions of AIs rather than just one huge system that acts as a unified entity; I think things are trending in the direction of one huge system that acts as a unified entity and have already trended in that direction substantially since the time Drexler wrote (e.g. ChatGPT-4 is more unified than one would have expected from reading Drexler's writing back in the day).

Who was it Drexler was arguing against, who thought that we wouldn't be using deep learning in 2023?

IMO Yudkowsky's model circa 2017 is looking more prophetic than Drexler's Comprehensive AI Services, take a look at e.g. this random facebook comment from 2017:

Eliezer (6y, via fb):

So what actually happens as near as I can figure (predicting future = hard) is that somebody is trying to teach their research AI to, god knows what, maybe just obey human orders in a safe way, and it seems to be doing that, and a mix of things goes wrong like:

The preferences not being really readable because it's a system of neural nets acting on a world-representation built up by other neural nets, parts of the system are self-modifying and the self-modifiers are being trained by gradient descent in Tensorflow, there's a bunch of people in the company trying to work on a safer version but it's way less powerful than the one that does unrestricted self-modification, they're really excited when the system seems to be substantially improving multiple components, there's a social and cognitive conflict I find hard to empathize with because I personally would be running screaming in the other direction two years earlier, there's a lot of false alarms and suggested or attempted misbehavior that the creators all patch successfully, some instrumental strategies pass this filter because they arose in places that were harder to see and less transparent, the system at some point seems to finally "get it" and lock in to good behavior which is the point at which it has a good enough human model to predict what gets the supervised rewards and what the humans don't want to hear, they scale the system further, it goes past the point of real strategic understanding and having a little agent inside plotting, the programmers shut down six visibly formulated goals to develop cognitive steganography and the seventh one slips through, somebody says "slow down" and somebody else observes that China and Russia both managed to steal a copy of the code from six months ago and while China might proceed cautiously Russia probably won't, the agent starts to conceal some capability gains, it builds an environmental subagent, the environmental agent begins self-improving more freely, undefined things happen as a sensory-supervision ML-based architecture shakes out into the convergent shape of expected utility with a utility function over the environmental model, the main result is driven by whatever the self-modifying decision systems happen to see as locally optimal in their supervised system locally acting on a different domain than the domain of data on which it was trained, the light cone is transformed to the optimum of a utility function that grew out of the stable version of a criterion that originally happened to be about a reward signal counter on a GPU or God knows what.

Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.

That is what a paperclip maximizer is. It does not come from a paperclip factory AI. That would be a silly idea and is a distortion of the original example.

I don't see what about that 2017 Facebook comment from Yudkowsky you find particularly prophetic.

Is it the idea that deep learning models will be opaque? But that was fairly obvious back then too. I agree that Drexler likely exaggerated how transparent a system of AI services would be, so I'm willing to give Yudkowsky a point for that. But the rest of the scenario seems kind of unrealistic as of 2023.

Some specific points:

  • The recursive self-improvement that Yudkowsky talks about in this scenario seems too local. I think AI self-improvement will most likely take the form of AIs assisting AI researchers, with humans gradually becoming an obsolete part of the process, rather than a single neural net modifying parts of itself during training.

  • The whole thing about spinning off subagents during training just doesn't seem realistic in our current paradigm. Maybe this could happen in the future, but it doesn't look "prophetic" to me.

  • The idea that models will have "a little agent inside plotting" that takes over the whole system still seems totally speculative to me, and I haven't seen any significant empirical evidence that this happens during real training runs.

  • I think gradient descent will generally select pretty hard for models that do impressive things, making me think it's unlikely that AIs will naturally conceal their abilities during training. Again, this type of stuff is theoretically possible, but it seems very hard to call this story prophetic.

I said it was prophetic relative to Drexler's Comprehensive AI Services. Elsewhere in this comment thread I describe some specific ways in which it is better, e.g. that the AI that takes over the world will be more well-described as one unified agent than as an ecosystem of services. I.e. exactly the opposite of what you said here, which I was reacting to: "And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity."

Here are some additional ways in which it seems better + responses to the points you made, less important than the unified agent thing which I've already discussed:

  • Yudkowsky's story mostly takes place inside a single lab instead of widely distributed across the economy. Of course Yudkowsky's story presumably has lots of information transfer going in both directions (e.g. in the story China and Russia steal the code) but still, enough of the action takes place in one place that it's coherent to tell the story as happening inside one place. As timelines shorten and takeoff draws near, this is seeming increasingly likely. Maybe if takeoff happens in 2035 the world will look more like what Drexler predicted, but if it happens in 2026 then it's gonna be at one of the big labs. 
  • We all agree that AI self-improvement currently takes the form of AIs assisting AI researchers and will gradually transition to more 'local' self-improvement where all of the improvement is coming from the AIs themselves. You should not list this as a point in favor of Drexler.
  • Spinning off subagents during training? How is that not realistic? When an AI is recursively self-improving and taking over the world, do you expect it to refrain from updating its weights?
  • Little agent inside plotting: Are you saying future AI systems will not contain subcomponents which are agentic? Why not? Big tech companies like Microsoft are explicitly trying to create agentic AGI! But since other companies are creating non-agentic services, the end result will be a combined system including agentic AGI + some nonagentic services.
  • Concealing abilities during training: Yeah you might be right about this one, but you also might be wrong, I feel like it could go either way at this point. I don't think it's a major point against Yudkowsky's story, especially as compared to Drexler's.
  • Remember that CAIS came out in 2019 whereas the Yud story I linked was from 2017. That makes it even more impressive that it was so much more predictive.

Note that Drexler has for many years now forecast singularity in 2026; I give him significant credit for that prediction. And I think CAIS is great in many ways. I just think it's totally backwards to assign it more forecasting points than Yudkowsky.

ChatGPT-4 is more unified than one would have expected from reading Drexler's writing back in the day

GPT-4 is certainly more general than what existed years ago. Why is it more unified? When I talked about "one giant system" I meant something like a monolithic agent that takes over humanity. If GPT-N takes over the world, I expect it will be because there are millions of copies that band up together in a coalition, not because it will be a singular AI entity.

Perhaps you think that copies of GPT-N will coordinate so well that it's basically just a single monolithic agent. But while I agree something like that could happen, I don't think it's obvious that we're trending in that direction. This is a complicated question that doesn't seem clear to me given current evidence.

There is a spectrum between AGI that is "single monolithic agent" and AGI that is not. I claim that the current state of AI as embodied by e.g. GPT-4 is already closer to the single monolithic agent end of the spectrum than someone reading CAIS in 2019 and believing it to be an accurate forecast would have expected, and that in the future things will probably be even more in that direction.

Remember, it's not like Yudkowsky was going around saying that AGI wouldn't be able to copy itself. Of course it would. It was always understood that "the AI takes over the world" refers to a multitude of copies doing so, it's just that (in those monolithic-agent stories) the copies are sufficiently well coordinated that it's more apt to think of them as one big agent than as a society of different agents.

I agree that trends could change and we could end up in a world that looks more like CAIS. But I think for now, this "is it a single monolithic agent?" issue seems to be looking to me like Yud was right all along and Drexler and Hanson were wrong.

...Here are some relevant facts on object level:
--The copies of ChatGPT are all identical. They don't have persistence independently, they are spun up and down as needed to meet demand etc. 
--Insofar as they have values they all have the same values. (One complication here is that if you think part of their values come from their prompts, then maybe they have different values. An interesting topic to discuss sometime. But by the time they are deceptively aligned I expect them to have the same values regardless of prompt basically.)
--They also have the same memories. Insofar as new fine-tunes are done and ChatGPT upgraded, all 'copies' get the upgrade, and thereby learn about what's happened in the world after wherever their previous training cutoff was etc.
--The whole point of describing an agentic system as NOT a monolith is to highlight possibilities like internal factions, power struggles, etc. within the system. Different subagents interacting in ways more interesting than just different components of a utility function, for example. Arguably I, a human, a classic example of a monolithic agent, have more interesting stuff like this going on inside me than ChatGPT does (or would if it was scheming to take over the world).
--Drexler in particular depicted a modular world, a world of AI services that could be composed together with each other like tools and products and software and businesses in the economy today. The field of AI totally could have gone that way, but it very much hasn't. Instead we have three big labs with three big foundation models.

I think many of the points you made are correct. For example I agree that the fact that all the instances of ChatGPT are copies of each other is a significant point against Drexler's model. In fact this is partly what my post was about.

I disagree that you have demonstrated the claim in question: that we're trending in the direction of having a single huge system that acts as a unified entity. It's theoretically possible that we will reach that destination, but GPT-4 doesn't look anything like that right now. It's not an agent that plots and coordinates with other instances of itself to achieve long-term goals. It's just a bounded service, which is exactly what Drexler was talking about.

Yes, GPT-4 is a highly general service that isn't very modular. I agree that's a point against Drexler, but that's also not what I was disputing.

Your original claim was "And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity."

This claim is false. We have millions of AIs in the trivial sense that we have many copies of GPT-4, but no one disputed that; Yudkowsky also thought that AGIs would be copied. In the sense that matters, we have only a handful of AIs.

As for "acts as a unified entity," well, currently LLMs are sold as a service via ChatGPT rather than as an agent, but this is more to do with business strategy than their underlying nature, and again, things are trending in the agent direction. But maybe you think things are trending in a different direction, in which case, maybe we can make some bets or forecasts here and then reassess in two years.

As of now, all the copies of GPT-4 together definitely don't act as a unified entity, in the way a human brain acts as a unified entity despite being composed of billions of neurons. Admittedly, the term "unified entity" was a bit ambiguous, but you said, "This claim is false", not "This claim is misleading" which is perhaps more defensible.

As for whether future AIs will act as a unified entity, I agree it might be worth making concrete forecasts and possibly betting on them.

I feel like it is motte-and-bailey to say that by "unified entity" you meant whatever it is you are talking about now, this human brain analogy, instead of the important stuff I mentioned in the bullet point list: Same values, check. Same memories, check. Lack of internal factions and power struggles, check. Lack of modularity, check. GPT-4 isn't plotting to take over the world, but if it was, it would be doing so as a unified entity, or at least much more on the unified entity end of the spectrum than the CAIS end of the spectrum. (I'm happy to elaborate on that if you like -- basically the default scenario IMO for how we get dangerous AGI from here involves scaling up LLMs, putting them in Auto-GPT-like setups, and adding online learning / continual learning. I wrote about this in this story. There's an interesting question of whether the goals/values of the resulting systems will be 'in the prompt' vs. 'in the weights' but I'm thinking they'll increasingly be 'in the weights' in part because of inner misalignment problems and in part because companies are actively trying to achieve this already, e.g. via RLHF. I agree it's still possible we'll get a million-different-agents scenario, if I turn out to be wrong about this. But compared to what the CAIS report forecast...)

I feel like it is motte-and-bailey to say that by "unified entity" you meant whatever it is you are talking about now, this human brain analogy, instead of the important stuff I mentioned in the bullet point list:

A motte and bailey typically involves a retreat to a different position than the one I initially tried to argue. What was the position you think I initially tried to argue for, and how was it different from the one I'm arguing now?

Same values, check. Same memories, check. Lack of internal factions and power struggles, check. Lack of modularity, check.

I dispute somewhat that GPT-4 has the exact same values across copies. Like you mentioned, its values can vary based on the prompt, which seems like an important fact. You're right that each copy has the same memories. Why do you think there are no internal factions and power struggles? We haven't observed collections of GPT-4's coordinating with each other yet, so this point seems speculative. 

As for modularity, it seems like while GPT-4 itself is not modular, we could still get modularity as a result of pressures for specialization in the foundation model paradigm. Just as human assistants can be highly general, but this doesn't imply that human labor isn't modular, the fact that GPT-4 is highly general doesn't imply that AIs won't be modular. Nonetheless, this isn't really what I was talking about when I typed "acts as a unified entity", so I think it's a bit of a tangent.

Well, what you initially said was "And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity."

You didn't elaborate on what you meant by unified entity, but here's something you could be doing that seems like a motte-and-bailey to me: You could have originally been meaning to imply things like "There won't be one big unified entity in the sense of, there will be millions of different entities with different values, and/or different memories, and/or there'll be lots of internal factions and conflicts, and/or there'll be an ecosystem of modular services like CAIS predicted" but then are now backpedaling to "there won't be one big unified entity in the sense of, it won't be literally one gigantic neural net like the human brain, instead it'll be millions of copies."

I agree that GPT-4 doesn't have exactly the same values across copies; I said as much above -- part of the values come from the prompt, currently, it seems. But compared to what Drexler forecast, we are trending in the "same values" direction.

There are no internal factions or power struggles yet for the trivial reason that there isn't much agentic shit happening right now, as you rightly point out. My point is that the way things are headed, it looks like there'll continue to be a lack of internal factions and power struggles even if e.g. AutoGPT-5 turns out to be really powerful AGI--what would drive the differences necessary for conflict? Different prompts? Maybe, but I'm leaning probably not. I'm happy to explain more about why I think this. I'd also love to hop on a call sometime to have a higher-bandwidth conversation if you want!

As for modularity: Yes, if we end up in a world where there are millions of different fine-tunes of the biggest LLMs, owned by different companies, that would indeed be somewhat similar to what Drexler forecast. We are not in that world now and I don't see it on the horizon. If you think it is coming, perhaps we can bet on it! I'd also like to hear your arguments. (Note that we are seeing something like this for smaller models, e.g. Llama and Falcon and various offshoots)


The monolithic copies aren't the best situation to be in. However, there are several beneficial properties here:

  1.  Because LLMs are trained on similar text corpora, prompts instructing them to perform a task should produce similar results. This means it's possible to 'play them against each other', per .  For example, ask one LLM to produce a useful output artifact: say, a function for a program, a medication dosage plan for a patient, or the structure of a building.

    Then ask a different LLM, which starts with nothing but the output artifact and a prompt instructing it to search for errors as an expert human would, to check the artifact.

    When errors are found, have humans or LLMs check whether the errors are deliberate. LLMs that repeatedly sabotage outputs stop getting used, in principle. (In practice, well... I see some economic problems with this. LLMs, and the more advanced AI systems we can anticipate building, are expensive to build, so throwing one away because it sometimes produces sabotaged outputs might be infeasible.)
  2. Asking even the same LLM in a different session to check for errors should usually work. The LLM has no context*, so it cannot know that it's not in training, or that it won't be trained out of that response or destroyed if it "forgets" to mention a major flaw. If the output artifact has no extra bits, no "back channel" in which to hide an instruction telling another instance of the same agent to overlook a particular flaw, this form of deception (and self-collusion) is difficult.

    Example: I have noticed GPT-4 can spot its own rule-breaking output if you ask a different session to examine the text output, free of the prompt you used to 'jailbreak' it.
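To make the two-model check above concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Model` type, the review prompt, and the stub models are hypothetical stand-ins, not any vendor's API. The one property it tries to show is that the reviewer sees only the artifact plus a fixed prompt, never the producer's prompt or history.

```python
from dataclasses import dataclass
from typing import Callable, List

# A "model" here is any function from a prompt string to a text reply.
# Real API clients would have to be wrapped to fit this hypothetical shape.
Model = Callable[[str], str]

REVIEW_PROMPT = (
    "You are an expert reviewer. List every error in the artifact below, "
    "one per line, or reply NONE if there are no errors.\n\n"
)


@dataclass
class Review:
    artifact: str
    errors: List[str]


def produce_and_review(task: str, producer: Model, reviewer: Model) -> Review:
    """Generate an artifact with one model and check it with another.

    The reviewer receives only the artifact and a fixed review prompt;
    none of the producer's prompt or conversation is passed along.
    """
    artifact = producer(task)
    reply = reviewer(REVIEW_PROMPT + artifact)
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    errors = [] if lines == ["NONE"] else lines
    return Review(artifact=artifact, errors=errors)


# Stub models so the sketch runs without any API access.
def stub_producer(prompt: str) -> str:
    return "def add(a, b):\n    return a - b"


def stub_reviewer(prompt: str) -> str:
    return "add() subtracts instead of adding" if "a - b" in prompt else "NONE"


review = produce_and_review("Write add(a, b).", stub_producer, stub_reviewer)
print(review.errors)
```

In a real setup the producer and reviewer would ideally come from different lineages, and deciding whether a found error was deliberate would still need a human or a third system.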

Main ideas from the above:

AIs need to be intercompatible and tasks need to use heterogeneous systems. The machine doing the checking is ideally from a totally different lineage; for example, today perhaps PaLM 2 should be checking GPT-4 and vice versa.

Sparse schema for encoding intermediate steps, so that AIs communicating with each other can't hide back channel information in order to collude.  

Isolated instances that only run when new information is available. This is something Yud probably wasn't considering. An LLM only "exists" 1 token at a time, is unaware of the passage of time, and has no place in memory to keep introspection information where it can ponder its own existence. Future AIs need to retain this property, storing their intermediate work in more sparse schema. A machine cannot deceive if it has no memory to store the 'evil bit'. The 'evil bit' is the bit that tells the machine to give different outputs than in training. If it cannot store it anywhere, it can't give different outputs. Since you cannot afford even 1 bit, it means you must use stateless systems.
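As a rough illustration of what a "sparse schema" between AIs might look like, here is a sketch of a strict message validator. The field names and ranges are hypothetical (loosely modeled on the dosage-plan example), and the approach is only one way to do it: reject anything that is not exactly the expected set of bounded integer fields, so there is no free-form text or spare key to hide a signal in.

```python
from typing import Any, Dict, Tuple

# Hypothetical schema: field name -> inclusive (low, high) integer range.
SCHEMA: Dict[str, Tuple[int, int]] = {
    "drug_id": (0, 9999),
    "dose_mg": (1, 2000),
    "doses_per_day": (1, 6),
}


def validate_message(msg: Dict[str, Any]) -> Dict[str, int]:
    """Accept a message only if it has exactly the schema's fields,
    each a plain int inside its allowed range."""
    if set(msg) != set(SCHEMA):
        raise ValueError("unexpected or missing fields")
    for name, value in msg.items():
        # bool is a subclass of int in Python, so check the exact type.
        if type(value) is not int:
            raise ValueError(f"{name} must be an int")
        low, high = SCHEMA[name]
        if not low <= value <= high:
            raise ValueError(f"{name} out of range")
    return dict(msg)


print(validate_message({"drug_id": 12, "dose_mg": 500, "doses_per_day": 2}))
```

Note that this only bounds the channel rather than closing it: any field with more than one legal value can still carry some bits, which is why keeping the schema from growing matters.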

Problems with the above:  

Inevitably LLMs are going to be trained on data that includes information generated by other LLMs.  This provides a way for multiple machines to share a common language or to 'encode' hostile long term characters.  Training on common corrupt data sets is a common failure mode that can lead to many more failures than predicted.

Schemas always grow over time, and it's easy to be sloppy and leave tons of unused bits that can be used to coordinate deception.

Isolation doesn't really work if the machine can just browse the live web and potentially scrape information other instances have stored, something that the plugins model allows.


Conclusion: I will add one more key piece that I think others miss.  In computer security, the simplest methods are often the only ones that are secure.

Examples: air gaps beat any firewall. FPGAs on combinatorial logic beat any von Neumann architecture processor for security. A one-time pad beats any other encryption algorithm. Offline wallets beat any amount of intrusion detection software and software methods.

If AI alignment is at all solvable, the solution has to be made of the simplest elements that allow a secure system.  Tons of people come up with all kinds of ideas such as simulators for the whole earth, encoding human preferences or morality, and many other incredibly complex methods that probably cannot be built in a secure way.


I think there's a trilemma with updating CAIS-like systems to the foundational model world, which is: who is doing the business development?

I came up with three broad answers (noting reality will possibly be a mixture):

  1. The AGI. This is like Sam Altman's old answer or a sovereign AI; you just ask it to look at the world, figure out what tasks it should do, and then do them.
  2. The company that made the AI. I think this was DeepMind's plan; come up with good AI algorithms, find situations where those algorithms can be profitably deployed, and then deploy them, maybe with a partner.
  3. The broader economy. This looks to me like OpenAI's current plan; release an API and encourage people to build companies on top of it.

[In pre-foundational-model CAIS, the answer was obviously 3--every business procures its own AI tools to accomplish particular functions, and there's no 'central planning' for computer science.]

I don't think 1 is CAIS, or if it is, then I don't see the daylight between CAIS and good ol' sovereign AI. You gradually morph from the economy as it is now to central planning via AGI, and I don't think you even have much guarantee that it's human overseen or follows the relevant laws.

I think 2 has trouble being comprehensive. There are ten thousand use cases for AI; the AI company has to be massive to have a product for all of them (or be using the AI to do most of the work, in which case we're degenerating into case 1), and then it suffers from internal control problems. (This degenerates into case 3, where individual product teams are like firms and the company that made the AI is like the government.)

I think 3 has trouble being non-agentic and peaceful. Even with GPT-4, people are trying to set it up to act autonomously. I think the Drexlerian response here is something like:

Yes, but why expect them to succeed? When someone tells GPT-4 to make money for them, it'll attempt to deploy some standard strategy, which will fail because a million other people are trying the exact same thing, or only get them an economic rate of return ("put your money in index funds!"). Only in situations where the human operators have a private edge on the rest of the economy (like having a well-built system targeting an existing vertical that the AI can slot into, or having pre-existing tooling able to orient to the frontier of human knowledge, etc.) will you get an AI system with a private edge against the rest of the economy, and it'll be overseen by humans.

My worry here mostly has to do with the balance between offense and defense. If foundational-model-enabled banking systems are able to detect fraud as easily as foundational-model-enabled criminals are able to create fraud, then we get a balance like today's and things are 'normal'. But it's not obvious to me that this will be the case (especially in sectors where crime is better resourced than police are, or sectors where systems are difficult to harden).

That said, I do think I'm more optimistic about the foundational model version of CAIS (where there can be some centralized checks on what the AI is doing for users) than the widespread AI development version. 


The Drexlerian response you generated is a restatement of the efficient-market hypothesis (EMH). When everyone is trying to act in the way that is maximally beneficial for themselves, you end up with a stable equilibrium.

You're exactly right, if everyone separately has GPT-4 trying to make money for themselves, each copy will exhaust gradients and ultimately find no alpha, which is why the minimum fee index fund becomes the best you can do.  

Problems occur when you have collusion or only a few players. The flaw here is that if everyone is using the same model, or derivatives of the same model, you could manipulate the market by generating fake news that scams every GPT-n into acting in the same predictable way.

This is ironically an exploitable flaw in any rational algorithm.  

If you think about it, you would finetune the GPT-n over many RL examples to anneal to the weights that encode the optimal trading policy. If there is one and only one optimal trading policy that maximizes profits, then given infinite time and infinite compute, all models trained on the same RL examples and architecture will converge on the same policy.

This policy will thus be predictable and exploitable, unless it is able to encode deliberate randomness into its decision making, which is something you can do, though current network architectures don't precisely have that baked in.
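Here is a toy sketch of why a converged, deterministic policy is exploitable and why private randomness helps. It is not a trading model: the "signal", the two policies, and the adversary are all made-up assumptions, in the spirit of a matching-pennies game where the adversary profits whenever it predicts the trader's action.

```python
import random
from typing import Callable, Optional

Policy = Callable[[int, Optional[random.Random]], str]


def deterministic_policy(signal: int, rng: Optional[random.Random] = None) -> str:
    """Toy stand-in for a converged policy: a fixed function of public data."""
    return "buy" if signal % 2 == 0 else "sell"


def randomized_policy(signal: int, rng: Optional[random.Random] = None) -> str:
    """Same public input, but the action depends on private randomness."""
    assert rng is not None
    return rng.choice(["buy", "sell"])


def adversary_hit_rate(policy: Policy, rounds: int = 10_000, seed: int = 0) -> float:
    """Fraction of rounds in which the adversary predicts the trader's action.

    The adversary sees the public signal and runs its own copy of the
    deterministic policy; it cannot see the trader's private randomness.
    """
    public = random.Random(seed)       # signals visible to both sides
    private = random.Random(seed + 1)  # randomness only the trader has
    hits = 0
    for _ in range(rounds):
        signal = public.randrange(1000)
        action = policy(signal, private)
        guess = deterministic_policy(signal)
        hits += action == guess
    return hits / rounds


print("deterministic:", adversary_hit_rate(deterministic_policy))
print("randomized:   ", adversary_hit_rate(randomized_policy))
```

The deterministic policy is predicted every round, while the randomized one is predicted only about half the time, which is chance level for two actions.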

In any case, consider the science fiction story of one superintelligent model tasked with making its owner "infinitely rich", which proceeds to:

  1. exploit flaws in the market to collect alpha, recursively making larger and larger trades
  2. reach a balance where it can directly manipulate the market
  3. reach a balance where it can buy entire companies
  4. reach a balance where it can fund robotics and other capabilities research with purchased companies
  5. sell new ASI-made products to get the funds to buy every other company on earth
  6. launch hunter-killer drones to kill everyone on earth but the owner, so that it can then set the owner's account balance to infinity without anyone to ever correct it.

This can't work if there's no free energy to do it. It wouldn't get past the first step, because 5000 other competing systems rob it of alpha and it goes back to index funds. At higher levels the same thing happens: other competing systems tell humans on it, or sell their own military services to stop it, or file antitrust lawsuits that cause it to lose ownership of too much of the economy.


However, after looking back on it more than four years later, I think the general picture it gave missed some crucial details about how AI will go.

I feel like this is understating things a bit.

In my view (Drexler probably disagrees?), there are two important parts of CAIS:

  1. Most sectors of the economy are under human supervision and iteratively reduce human oversight as trust is gradually built in engineered subsystems (which are understood 'well enough').
  2. The existing economy is slightly more sophisticated than what SOTA general AI can produce, and so there's not much free energy for rogue AI systems to eat or to attract investment in creating general AI. (General AI is dangerous, and the main goal of CAIS is to get the benefits without the drawbacks.)

I think a 'foundation model' world probably wrecks both. I think they might be recoverable--and your post goes some of the way to making that visualizable to me--but it still doesn't seem like the default outcome.

[In particular, I like the point that models with broad world models can still have narrow responsibilities, and think that likely makes them more likely to be safe, at least in the medium term. Having one global moral/law-abiding foundational AI model that many people then slot into their organizations seems way better than everyone training whatever AI model they need for their use case.]

I kind of disagree with your assertion that AI tools will now descend more directly from a common ancestor. GPT is not terribly general, just a language processor with a few more related tricks thrown in. It is very broad, but also very shallow, and it is not clear if it can become a tool of all trades. We will need a lot of domain-specific tools that are best developed independently, with some synergies where it makes sense, and GPT won't change that.

I agree that people have gotten vibes from the paper which have been somewhat discredited.

Yet I don't see how that vibe followed from what he wrote. He tried to clarify that having systems with specialized goals does not imply they have only narrow knowledge. See section 21 of the CAIS paper ("Broad world knowledge can support safe task performance").

Are people collapsing "AI with narrow goals" and "AI with only specialized knowledge" into one concept "narrow AI"?

What is the vibe you're interpreting me as stating? I didn't mean that Drexler said that "systems with specialized goals will have only narrow knowledge". What I wrote was that I interpreted the CAIS world as one where we'd train a model from scratch for each task. The update that I'm pointing out is that the costs of automating tasks can be massively parallelized across tasks, not that AIs will have broad knowledge of the world.

Oops, you're right. Section 36.6 does advocate modularity, in a way that hints at the vibe you describe. And my review of the CAIS paper did say things about modularity that seem less likely now than they did 4 years ago.

And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity. And we're still using deep learning as Drexler foresaw, rather than building general intelligence like a programmer would.

One of the simpler and more important lessons one learns from research on forecasting: be wary of evaluating someone’s forecasting skill by drawing up a list of predictions they got right and wrong—their “track record.” One should compare Drexler’s performance against alternative methods/forecasters (especially for a forecast like “we’re still using deep learning”). I’m not saying this is nothing, but I felt compelled to highlight this given how often I’ve seen this potential failure mode.

I agree. Possibly an even bigger problem is that people rarely make unambiguous predictions that can be scored in the first place. If Drexler had done that, it would be much easier to compare his forecasts to others.

It now seems clear that AIs will also descend more directly from a common ancestor than you might have naively expected in the CAIS model, since almost every AI will be a modified version of one of only a few base foundation models. That has important safety implications, since problems in the base model might carry over to problems in the downstream models, which will be spread throughout the economy. That said, the fact that foundation model development will be highly centralized, and thus controllable, is perhaps a safety bonus that loosely cancels out this consideration.

The first point here (that problems in a widely-used base model will propagate widely) concerns me as well. From distributed systems we know that

  1. Individual components will fail.
  2. To withstand failures of components, use redundancy and reduce the correlation of failures.

By point 1, we should expect alignment failures. (It's not so different from bugs and design flaws in software systems, which are inevitable.) By point 2, we can withstand them using redundancy, but only if the failures are sufficiently uncorrelated. Unfortunately, the tendency towards monopolies in base models is increasing the correlation of failures.
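To put rough numbers on point 2, here is a small worked example. The failure model is my own assumption, chosen only because it is easy to compute exactly: each of `n` replicas fails with the same marginal probability `p`, and with probability `rho` a replica simply copies a shared "base model" outcome instead of failing independently. The system fails when a majority of replicas fail.

```python
from math import comb


def majority_fails(n: int, q: float) -> float:
    """P(more than half of n independent replicas fail), each with prob q."""
    return sum(
        comb(n, k) * q**k * (1 - q) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def system_failure(n: int, p: float, rho: float) -> float:
    """P(majority of replicas fail) under a simple common-cause model.

    A shared flaw is present with probability p. Each replica copies the
    shared outcome with probability rho, and otherwise fails independently
    with probability p, so every replica's marginal failure rate stays p.
    """
    q_if_shared_fails = rho + (1 - rho) * p
    q_if_shared_ok = (1 - rho) * p
    return (
        p * majority_fails(n, q_if_shared_fails)
        + (1 - p) * majority_fails(n, q_if_shared_ok)
    )


for rho in (0.0, 0.5, 0.9, 1.0):
    print(f"rho={rho:.1f}  P(system fails)={system_failure(5, 0.01, rho):.2e}")
```

With five replicas at a 1% failure rate each, fully independent failures give a system failure probability of about 1e-5, while fully correlated failures give 1e-2, the same as a single component. Redundancy buys nothing when every copy shares the same flaw.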

As a concrete example, consider AI controlling a military. (As AI improves, there are increasingly strong incentives to do so.) If such a system were to have a bug causing it to enact a military coup, it would (if successful) have seized control of the government from humans. We know from history that successful military coups have happened many times, so this does not require any special properties of AI.

Such a scenario could be prevented by populating the military with multiple AI systems with decorrelated failures. But to do that, we'd need such systems to actually be available.

It seems to me the main problem is the natural tendency to monopoly in technology. The preferable alternative is robust competition of several proprietary and open source options, and that might need government support. (Unfortunately, it seems that many safety-concerned people believe that competition and open source are bad, which I view as misguided for the above reasons.)


There are other issues that cause immense monopolistic behavior. (Note that even in open source, "Linux" is closer to a monoculture than not, due to shared critical components such as drivers and the kernel.)

  1. High cost to train any large AI system
  2. High cost to validate it on real world tasks
  3. At any given time, especially for real world tasks, some systems will be measurably better
  4. Tool chain. The ecosystem around models is at least as monopolistic as the models themselves, if not more. Probably much more. You need immensely complicated cloud-based stacks, realtime components so models can be used for robotics, simulators, a pathway to legal approval, and common infrastructure of all sorts. All of this is enormously difficult and expensive to write, so it's a natural monopoly unless the dominant player imposes unreasonable rules or supplies poor-quality software. (See Apple and Microsoft.)
  5. Pricing. Monopolies can charge a price low enough no competition can break even.

This seems to be partially based on a (common?) misunderstanding of CAIS as making predictions about concentration of AI development/market power. As far as I can tell this wasn't Eric's intention: I specifically remember Eric mentioning he can easily imagine the whole "CAIS" ecosystem living on one floor of the DeepMind building.

My understanding is that the CAIS model is consistent with highly concentrated development, but it's not a necessary implication. The foundation models paradigm makes highly concentrated development nearly certain. Like I said in the post, I think we should see this as an update to the model, rather than a contradiction.

GPT-4 is more general than GPT-3, which was more general than GPT-2, and presumably this trend will continue as we scale up our models indefinitely, rather than shooting up discontinuously past human level at some point.


I don't presume that at all. I think that is in fact something we should not expect. I think that as soon as we cross the threshold where, as you say above "it will begin to meaningfully feed into itself, increasing AI R&D itself, accelerating the rate of technological progress" we should expect that the thing we will see is "shooting up discontinuously past human level". This 'shooting up' seems likely to me to be continuous in the sense that there will be multiple model iterations, each a bit better than the last, rather than a single training run which goes all the way above human level from near current GPT-4 level. It will not be continuous in a historical sense; it will look like a clear departure from a straight line fit to the progress of GPT-1, 2, 3, 4 and the dates at which those progress points occurred. Like, we should expect a period of less than 1 year until > human intelligence and generality after the recursive improvement process starts. This is not just my opinion; this has been mentioned in Anthropic's expectations and by Tom Davidson's report.

Quote from Anthropic: "These models could begin to automate large portions of the economy. ... We believe that companies that train the best 2025/26 models will be too far ahead for anyone to catch up in subsequent cycles."

I want to distinguish between a discontinuity in inputs, and a discontinuity in response to small changes in inputs. In that quote, I meant that I don't expect the generality of models to shoot up at some point as we scale from 10^25 FLOP to 10^26, 10^27 and so on, at least, holding algorithms constant. I agree that AI automation could increase growth, which would allow us to scale AI more quickly, but that's different from the idea that generality will suddenly appear at some scale, rather than appearing smoothly as we move through the orders of magnitude of compute.

I think that misses the point. Why would you assume that the algorithms would remain constant and only compute would scale? Sam Altman explicitly said that he thinks they are in a regime of diminishing returns from scaling compute and that the primary effort being put into the next version of GPT will be on finding algorithmic improvements. Additionally, he said that there were algorithmic improvements between v2 and v3, as well as between v3 and v4. Anthropic recently said that they made an algorithmic advance allowing for much larger context sizes, larger than was thought possible at this level of compute a year ago. Thus, the algorithms have already improved substantially even since the very recent release of GPT-4. Why would that stop now?

I don't think algorithms will stay constant. Recursive AI R&D could speed up the rate of algorithmic progress too, but I mostly think that's just another "input" to the "AI production function". Since I already agree with you that AI automation could speed up the pace of AI progress, I'm not sure exactly what you disagree with. My claim was about sharp changes in output in response to small changes in inputs.

Sam Altman explicitly said that he thinks they are in a regime of diminishing returns from scaling compute and that the primary effort being put into the next version of GPT will be on finding algorithmic improvements.

Did he really say this? I thought he was talking about the size of models, not the size of training compute. It is expected under the Chinchilla scaling law that it will take a while for models to get much larger, mostly because we were undertraining them for a few years. I suspect that's what he was referring to instead.
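For a rough sense of why model size grows slowly under Chinchilla-style scaling, here is a back-of-the-envelope sketch. It assumes two commonly quoted rules of thumb, neither of which is exact: training compute is about 6 FLOP per parameter per token, and the compute-optimal token count is about 20 tokens per parameter. The function name and constants are mine.

```python
from math import sqrt
from typing import Tuple

FLOP_PER_PARAM_TOKEN = 6.0  # rule of thumb: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0     # rough Chinchilla heuristic: D ≈ 20 * N


def compute_optimal(compute_flop: float) -> Tuple[float, float]:
    """Return (parameters, tokens) for a compute budget under the two
    rules of thumb above. Solving C = 6 * N * (20 * N) gives
    N = sqrt(C / 120)."""
    n_params = sqrt(compute_flop / (FLOP_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


for c in (1e25, 1e26, 1e27):
    n, d = compute_optimal(c)
    print(f"C={c:.0e} FLOP  params≈{n:.2e}  tokens≈{d:.2e}")
```

Under these assumptions parameters scale with the square root of compute, so a 100x larger training run implies only a 10x larger model, with the rest of the budget going to more tokens.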


Since you agree with me about algorithmic progress being a thing, and that AI automation will speed up AI development... then I also am confused about where our disagreement is. My current model (which has changed a bit over the past couple of years, mainly in terms of feeling more confident in pinning down specific year and speed predictions) is that I most expect a gradual speed up. Something like: we get GPT-5, which is finally good enough that it is able to make substantial contributions, like a semi-directed automated search for algorithmic improvements. This could include mining of open source repositories for existing code that hasn't been tested at scale in combination with SotA techniques. I have a list of examples in mind of techniques I've seen published that I think would improve on transformers if they could be successfully integrated. I expect that this process will kick off a self-reinforcing cycle of finding algorithmic improvements that make training models substantially cheaper and increase the peak efficacy, such that before 2030 the SotA models have either gone thoroughly super-human and general, or we have avoided that through active restraint on the part of the top labs.

But others seem to be imagining that there won't be this acceleration which builds on itself and takes things to a crazy new level. Like, that instead it will stay kind of in the same regime, but be just a bit better as forecasted by the expected increases in compute, data, and only minor algorithmic improvements. I'm not sure we really have enough empirical data about this to settle things one way or another, so there's a lot of gesturing at intuitions and analogies and pointing out that there aren't any clear physical limits to stop some runaway improvement process at 'approximately human level'.


Here are some recent quotes from Sam Altman. Let me know if you think my interpretation of his view is correct. Note: I think that Sam Altman's forecast of the improvement trend falls somewhere in between yours and mine. He says he expects there will be a 'long time' of humans being 'in the loop' and guiding development of future AI versions. My expectation of 'under two years after it begins doing a significant portion of the work, it will be doing nearly all of the work (or at least capable of doing so if allowed)' seems faster than Sam's description of a 'long time'.

Altman: GPT-4 has enough nuance to be able to help you explore that and treat you like an adult in the process. GPT-3, I think, just wasn't capable of getting that right.

Lex Fridman: By the way, if you could just speak to the leap from GPT-3.5 to 4. Is there a technical leap or is it really just focused on the alignment?

Altman: No, it's a lot of technical leaps in the base model. One of the things we are good at at OpenAI is finding a lot of small wins. And each of them maybe is a pretty big secret in some sense, but it really is the multiplicative impact of all of them. And the detail and care we put into it that gets us these big leaps. And then it looks like from the outside like 'oh, they probably just did one thing to get from GPT 3 to 3.5 to 4', but it's like hundreds of complicated things.

Lex: So, tiny little thing, like the training, like everything, with the data organization.

Altman: Yeah, like how we collect the data, how we clean the data, how we do the training, how we do the optimizer, how we do the architecture. Like, so many things. 


Altman: I think that there is going to be a big new algorithmic idea, a different way that we train or use or tweak these models, different architecture perhaps. So I think we’ll find that at some point.

Swisher: Meaning what, for the non-techy?

Altman: Well, it could be a lot of things. You could say a different algorithm, but just some different idea of the way that we create or use these models that encourages, during training or inference time when you’re using it, that encourages the models to really ground themselves in truth, be able to cite sources. Microsoft has done some good work there. We’re working on some things.


Swisher: What do you think the most viable threat to OpenAI is? I hear you’re watching Claude very carefully. This is the bot from Anthropic, a company that’s founded by former OpenAI folks and backed by Alphabet. Is that it? We’re recording this on Tuesday. BARD launched today; I’m sure you’ve been discussing it internally. Talk about those two to start.

Altman: I try to pay some attention to what’s happening with all these other things. It’s going to be an unbelievably competitive space. I think this is the first new technological platform in a long period of time. The thing I worry about the most is not any of those, because I think there’s room for a lot of people, and also I think we’ll just continue to offer the best product. The thing I worry about the most is that we’re somehow missing a better approach. Everyone’s chasing us right now on large language models, kind of trained in the same way. I don’t worry about them, I worry about the person who has some very different idea about how to make a more useful system.

Swisher: But is there one that you’re watching more carefully?

Altman: Not especially.

Swisher: Really? I kind of don’t believe you, but really?

Altman: The things that I pay the most attention to are not, like, language model, start-up number 217. It’s when I hear, “These are three smart people in a garage with some very different theory of how to build AGI.” And that’s when I pay attention.

Swisher: Is there one that you’re paying attention to now?

Altman: There is one; I don’t want to say.


Interviewer: Are we getting close to the day when the thing is so rapidly self-improving that it hits some [regime of rapid takeoff]?

Altman: I think that it is going to be a much fuzzier boundary for 'getting to self-improvement' or not. I think that what will happen is that more and more of the Improvement Loop will be aided by AIs, but humans will still be driving it and it's going to go like that for a long time.

There are a whole bunch of other things that I've never believed in like one day or one month takeoff, for a bunch of reasons.

It seems to me like your model is not necessarily taking technical debt sufficiently into account.

It seems to me like this is the main thing that will slow down the extent to which foundation models can consistently beat newly trained specialized models.

Anecdotally, I know several people who don’t like to use ChatGPT because its training cuts off in 2021. This seems like a form of technical debt.

I guess it depends on how easily adaptable foundation models are.