Hi,
sorry for commenting without having read most of your post. I just started reading this and thought "isn't this exactly what the corrigibility agenda is/was about?", and in your "relation to other agendas" section you don't mention corrigibility, so I thought I'd just ask whether you're familiar with it and how your approach is different. (Though tbc, I could be totally misunderstanding, I didn't read far.)
Tbc I think further work on corrigibility is very valuable, but if you haven't looked into it much I'd suggest reading up on what other people wrote on it so far. (I'm not sure whether there are very good explainers, and sometimes people seem to get a wrong impression of what corrigibility is about. E.g. corrigibility has nothing to do with "corrigibly aligned" from the "Risks from Learned Optimization" paper. The shutdown problem is often misunderstood too. I would read and try to understand the stuff MIRI wrote about it. Possibly parts of this conversation might also be helpful, but yeah sry it's not written in a nice format that explains everything clearly.)
> we may be able to avoid this problem by:
> - not building unbounded, non-interruptible optimizers
> and, instead,
> - building some other, safer kind of AI that can be demonstrated to deliver enough value to make up for giving up on the business-as-usual kind of AI (which "we", though not necessarily its creators, expect might/would lead to the creation of unbounded, non-interruptible AI posing a catastrophic risk), along with the benefits it was expected to deliver.
This sounds to me like you're imagining that nobody building more powerful AIs is an option if we already got a lot of value from them (where I don't really know what level of capability you imagine concretely)? If the world were so reasonable, we wouldn't rush ahead with our abysmal understanding of AI anyways, because obviously the risks outweigh the benefits. Also, you don't just need to convince the leading labs, because progress will continue and soon enough many, many actors will be able to create unaligned powerful AI, and someone will.
I think the right framing of the bounded/corrigible agent agenda is aiming toward a pivotal act.
The short answer to "How is it different from corrigibility?" is something like: here we're thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
> This sounds to me like you're imagining that nobody building more powerful AIs is an option if we already got a lot of value from them (where I don't really know what level of capability you imagine concretely)? If the world were so reasonable, we wouldn't rush ahead with our abysmal understanding of AI anyways, because obviously the risks outweigh the benefits. Also, you don't just need to convince the leading labs, because progress will continue and soon enough many, many actors will be able to create unaligned powerful AI, and someone will.
The (revealed) perception of risks and benefits depends on many things, including what kind of AI is available/widespread/adopted. Perhaps we can tweak those parameters. (Not claiming that it's going to be easy.)
> I think the right framing of the bounded/corrigible agent agenda is aiming toward a pivotal act.
Something in this direction, yes.
> The short answer to "How is it different from corrigibility?" is something like: here we're thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
There's both "attempt to get coherent corrigibility" and "try to deploy corrigibility principles and keep it bounded enough to do a pivotal act". I think the latter approach is the main one MIRI imagines after having failed to find a simple coherent-description/utility-function for corrigibility. (Where here it would e.g. be ideal if the AI needs to only reason very well in a narrow domain without being able to reason well about general-domain problems like how to take over the world, though at our current level of understanding it seems hard to get the first without the second.)
EDIT: Actually the attempt to get coherent corrigibility also was aimed at bounded AI doing a pivotal act. But people were trying to formulate utility functions so that the AI can have a coherent shape which doesn't obviously break once large amounts of optimization power are applied (where decently large amounts are needed for doing a pivotal act.)
And I'd count "training for corrigible behavior/thought patterns in the hopes that the underlying optimization isn't powerful enough to break those patterns" also into that bucket, though yeah about that MIRI doesn't talk that much.
About getting coherent corrigibility: my and Joar's post on Updating Utility Functions makes some progress on a soft form of corrigibility.
(Work done at Convergence Analysis. Mateusz wrote the post and is responsible for most of the ideas with Justin helping to think it through. Thanks to Olga Babeeva for the feedback on this post.)
1. Motivation
Suppose the prospects of both (a) pausing or significantly slowing down AI progress and (b) solving the technical problems necessary to ensure that arbitrarily strong AI has good effects on humanity (in time, before we get such systems) look gloomy.[1] What options do we have left?
Adam Shimi presents a useful frame on the alignment problem in Abstracting The Hardness of Alignment: Unbounded Atomic Optimization.
If the problem is about some system (or a collection of systems) having an unbounded, non-interruptible impact,[2] can we handle it by ensuring that the impact of the systems in question is reliably bounded and interruptible (even if the systems are scheming against us, or trying to "liberate" themselves from the properties of boundedness and interruptibility)?
The core idea is to impose a multidimensional bound on the system's capabilities such that the system is:
- sufficiently capable along the dimensions corresponding to the capabilities we want it to have,
but
- sufficiently incapable along the dimensions corresponding to capabilities that would (in combination with the capabilities we want it to have) grant it the dangerous power of unbounded, non-interruptible optimization.[3]
In other words, if we can't solve the problem of aligning unbounded, non-interruptible optimizers, we may be able to avoid this problem by:
- not building unbounded, non-interruptible optimizers
and, instead,
- building some other, safer kind of AI that can be demonstrated to deliver enough value to make up for giving up on the business-as-usual kind of AI (which "we", though not necessarily its creators, expect might/would lead to the creation of unbounded, non-interruptible AI posing a catastrophic risk), along with the benefits it was expected to deliver.
Unfortunately, at the current moment, frontier AI labs are not taking actions sufficient to robustly mitigate the risks from unbounded, non-interruptible optimizers. In business-as-usual futures, they will keep pushing the capabilities they expect to be most profitable while investing in myopic safety efforts prescribed by the incrementalist metastrategy, sacrificing long-term safety from catastrophes for short-term, just-enough-safety-to-be-reliably-profitable (on some short-ish timescale). The biggest AI labs, such as OpenAI, Google DeepMind, and Anthropic, are not interested in systems that are safe by virtue of their boundedness. They think AGI/ASI (corresponding to unbounded, non-interruptible optimization[4]) is the best way to harness the benefits of superintelligence.[5][6]
However, if an alternative were shown to be viable, it might trigger a narrative bifurcation in which "race to AGI" is dethroned as the dominant frame and replaced with the idea of designing systems such that we can reliably reason about their possible effects, with a minimal sacrifice of short-term profits relative to the promise of business-as-usual AI R&D.
With this motivation in mind, we introduce the concept of Bounded AI (BAI), i.e. a kind of AI system that (for now, speaking loosely):
- is sufficiently capable within its intended domain of application to be useful there,
but
- is incapable of having large effects outside of that domain,
and
- within that domain of application, the system's effects obey certain constraints (to prevent large damage even within said domain).[7]
We contrast Bounded AI with Unbounded AI (UAI), i.e. a kind of AI system on the capabilities of which we can't put justified upper bounds that would exclude its ability to cause catastrophic outcomes.
We can think of UAI as unbounded and non-interruptible ("atomic") in the sense that Adam Shimi discusses in his UAO post.[8]
The next section develops the idea of Bounded AI in more detail, looking at it from several perspectives, and providing some examples to illustrate it. Section 3 relates Bounded AI to prior ideas and agendas. Section 4 lays out some open cruxes, questions, and potentially valuable further research directions.
2. Bounded AI
2.1 The Goldilocks Zone of Capabilities
An illustrative way to think about the kind of boundedness we are interested in is the "Goldilocks Zone of Capabilities".
Imagine a high-dimensional space of all AI capabilities, each dimension corresponding to some specific capability. Some combinations of capabilities generate the central AI catastrophic risks associated with unbounded, non-interruptible optimization. This is a region we don't want to enter; Bounded AI systems are those that remain below its "floor".
There is another important threshold — that of capability combinations that can deliver a significant portion of the benefits of superintelligence.
The Goldilocks Zone of Capabilities is a region "sandwiched" between these two thresholds. It contains systems that can deliver the benefits of superintelligence but do not constitute a source of central AI catastrophic risk: harm and loss of control due to unbounded non-interruptible optimization.
In a toy universe where there are only two relevant AI capabilities, this Goldilocks Zone might look like this:
Or perhaps like this:
The purpose of the Bounded AI concept is to allow us to aim at this zone more reliably. It relies on a hypothesis that we might call the "Goldilocks Zone of Capabilities Conjecture": It is possible to safely elicit the key benefits of superintelligence using AI systems that remain confined to the Goldilocks Zone of Capabilities. By assumption, the capability profiles of AI systems falling within this zone will be "spiked": high on some capabilities, relatively low on others, implying that each would only be useful in some domain(s) of application and not very useful in others.
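As a purely illustrative aid, here is a minimal sketch of the two-capability toy universe in code. The axis names, the numeric thresholds, and the shapes of the two boundaries are all invented for this example; nothing about the real capability space is claimed to be this simple.

```python
# A toy sketch of the two-capability universe described above. The axis names,
# the numeric thresholds, and the shapes of the two boundaries are invented
# for illustration only.

def delivers_benefits(domain_skill: float, general_agency: float) -> bool:
    """Hypothetical threshold for delivering 'benefits of superintelligence'."""
    return domain_skill + 0.3 * general_agency >= 1.0

def crosses_danger_floor(domain_skill: float, general_agency: float) -> bool:
    """Hypothetical floor of the catastrophic-risk region: here, unbounded,
    non-interruptible optimization requires high scores on *both* axes."""
    return domain_skill >= 1.5 and general_agency >= 1.2

def in_goldilocks_zone(domain_skill: float, general_agency: float) -> bool:
    """In the zone: capable enough to be useful, yet below the danger floor."""
    return (delivers_benefits(domain_skill, general_agency)
            and not crosses_danger_floor(domain_skill, general_agency))

if __name__ == "__main__":
    # A "spiked" profile: very strong in its domain, weak on general agency.
    print(in_goldilocks_zone(domain_skill=1.8, general_agency=0.2))  # True
    # Strong on both axes: falls into the danger region, so not in the zone.
    print(in_goldilocks_zone(domain_skill=1.8, general_agency=1.5))  # False
```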
To the extent that we are uncertain about whether the system's capabilities are below a satisfying upper bound, we want to add another layer of protection. Namely, we want to ensure that we will be able to interrupt the system in time if we notice that its effects are likely to lead to undesirable consequences. The assumption that, before the system's effects become unacceptable, they will first become noticeable, and that upon noticing them we will be able to prevent them, corresponds to some fragment of this space: probably something like an upper shell of the Goldilocks Zone of Capabilities, where we are, metaphorically speaking, playing with fire but can ensure that we have enough extinguishers in stock to prevent an accident from escalating into a conflagration. A lower bound on interruptibility thus follows from an upper bound on capabilities/effects.
This is how BAI, if adopted widely, may allow us to avoid catastrophic risks. To be adopted widely, it needs (among other things) to be shown to be sufficiently capable and safe, which, in general, is not the default. Therefore, the next two sections focus on ensuring more "mundane" safety.
2.2 Four quadrants of Bounded AI
To limit the undesirable effects of an AI system, we want to ensure that its effects (both within and outside of the domain) remain confined to a safe, bounded region and also that any escalation toward unacceptable disasters is interruptible by human operators.
We can think about it in terms of the following 2×2 matrix.
To each AI system (perhaps consisting of multiple components, such as one or more ML models, an "agent" scaffolding, RAG, "non-AI" filters being applied to the model's outputs, or a procedure involving some degree of human oversight) we can assign a "spec" consisting of a value[9] for each cell in the matrix. An AI system is a Bounded AI if its spec is sufficient to infer that its deployment is not going to cause unacceptable damage. Importantly, boundedness and interruptibility are related. The bigger the system's possible effects or the more uncertain we are about whether its effects encroach upon dangerous territory, the stronger assurance of its interruptibility we want to have (everything else being equal).
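To make the idea of a "spec" slightly more concrete, here is a minimal sketch of one possible representation of the 2×2 matrix. The field names, the `Cell` structure, and the acceptance rule are assumptions made for illustration, not a format proposed in this post (and, per the footnote, the cell values need not be scalars).

```python
# A minimal sketch of one possible representation of the "spec" for the 2×2
# matrix above. The field names, the Cell structure, and the acceptance rule
# are illustrative assumptions, not a format proposed in the post.
from dataclasses import dataclass

@dataclass
class Cell:
    bound: str        # e.g. "effects limited to a sandboxed lab pipeline"
    assurance: float  # how strongly we can justify that the bound holds (0..1)

@dataclass
class BoundedAISpec:
    # Rows: where the effects occur; columns: boundedness vs. interruptibility.
    in_domain_bounded: Cell
    in_domain_interruptible: Cell
    outside_domain_bounded: Cell
    outside_domain_interruptible: Cell

    def acceptable(self, min_assurance: float = 0.99) -> bool:
        """Toy acceptance rule: every cell must clear an assurance threshold.
        In practice, the required assurance would scale with the size of the
        possible effects and with how boundedness and interruptibility trade
        off against each other."""
        cells = (self.in_domain_bounded, self.in_domain_interruptible,
                 self.outside_domain_bounded, self.outside_domain_interruptible)
        return all(cell.assurance >= min_assurance for cell in cells)
```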
Whatever the domain is, we most likely have less control over what's going on in the "external world" than over what's going on in the domain itself. Therefore, we want to have stronger bounds on the effects outside the domain. For this to be the case, it should be effectively impossible for the inside-domain effects to "diffuse" outside of the domain. We can think about it in terms of a boundary between the domain and the external world or, alternatively, in terms of there not being an interface between them that would allow the effects to diffuse to the outside (depending on which frame is more appropriate for the particular case). Depending on the situation, this boundary or lack of interface may be the default or may need to be deliberately established.
The extent to which the effects need to be bounded and interruptible depends on the domain of application or, more specifically, the possibility of causing unacceptable damage in the domain.
Importantly, to build a Bounded AI for a domain, we need to know enough about that domain to ensure that the system's effects are sufficiently bounded and/or interruptible. This rules out domains with large downside risks where our understanding is still very poor, such as automating AI safety/alignment research. However, Bounded AI might be useful for automating some subtasks involved in that kind of research, as long as those subtasks remain in a subdomain that can be reliably overseen and/or verified.
2.3 High-risk domains of application
We say that a domain of application is high-risk if the capabilities required to deliver great benefits within that domain are hard to disentangle from the capabilities sufficient to cause unacceptable damage within that domain.
We use "unacceptable damage" vaguely meaning "damage we strongly disprefer", relying on the spirit, rather than the letter, with catastrophic risks being the most central example of unacceptable damage.[10]
The motivation for this concept should be obvious in the context of AI risk.
If the AI is capable of causing great damage but not of delivering great benefits, people wouldn't bother deploying it.[11]
If the AI is highly capable of causing great benefits as well as great damage, we are in a danger zone: the promise of the upsides may encourage unjustified optimism about the system's safety, and safety measures (whatever additional things are appended to the system to ensure it remains safe) will focus on addressing short-term/myopic risks, with fingers crossed that this will at least prepare us better for problems that are harder to address but less urgent, leaving us effectively unprepared for those harder problems when push comes to shove.
If the domain is high-risk, i.e. the development of beneficial capabilities will also (by default) bring along dangerous capabilities, then we are more likely to find ourselves in the danger zone.
High-risk domains are common. In many situations where automation could produce large benefits, it could also cause a lot of damage, and minimizing the latter while preserving the former is highly non-trivial. Examples include self-driving cars, software engineering for life-critical systems, medical research, psychotherapy, and, finally, using AI to solve problems that seem too difficult for us to solve without AI, especially when this involves the AI being capable of autonomous, non-overseeable action. In these domains, the possible upsides incentivize building highly capable AI, along with the possibility of downsides brought along by those capabilities.
Awareness of downsides creates incentives to safeguard against them. However, miscalibration about downsides, combined with economic incentives towards myopic safety (addressing short-term risks that are relevant to the time horizon an AI company is planning for, while neglecting long-term risks, including larger/catastrophic risks) and with the heavy-tailed distribution of downsides, can leave us inadequately prepared against the greatest downsides.
In a sense, this is similar to ensuring the safety of non-AI technology in high-risk domains. However, it is also critically different because it involves (semi-)intelligent systems (even if bounded in their intelligence by design) that we understand very imperfectly and to which we therefore need to apply correspondingly more caution.
2.4 Is it viable?
The viability of Bounded AI (as presented here) depends (among other things) on whether we can get an AI system whose profile is placed within the relevant parts of the capability space (i.e. the Goldilocks Zone) and, to the extent that we are sufficiently uncertain about its place, can be made interruptible. (This is a restatement of the "conjecture" from the previous section with more bells and whistles.)
This is a challenge on both the technical and the governance side. In this section, we cover some reasons why we think the technical challenge is likely tractable.
2.4.1 Boundedness
As of early 2025, we've had powerful and generally knowledgeable models — i.e. LLMs — for two years (i.e. since the release of GPT-4). So far, their impact on the economy has been limited. While they are helpful in a few domains on the margin (e.g. code generation), as Cole Wyeth writes, they haven't done anything important yet.
In the current regime, a lot of schlep/integration is often required to make a very smart and knowledgeable LLM useful for a non-trivial application that it wasn't trained to perform in.[12] This might change soon if AI models acquire capabilities sufficient to integrate themselves into pre-existing systems with minimal or close to non-existent involvement of human workers, or if some infrastructure is designed that minimizes the friction of integrating a new AI component.[13] However, at least for now, general knowledge and general intelligence (at least the LLM variant of general intelligence) do not jointly imply "general domain mastery" or a general ability to acquire any relevant capability.
While scaling makes models generally more performant on various tasks and benchmarks, there is also some dissociation between capabilities, as can be seen in the o1 line of models being worse than GPT-4o on some language understanding tasks.[14] This suggests that for a given effective model size, there are some trade-offs between competence in domains.[15]
The shape of LLM capabilities is different from that of human capabilities ("jagged frontier"). Surprising generalization failures are present even in seemingly advanced systems. For example, o1 competently answers PhD-level questions but can't play tic-tac-toe optimally, even if taught it in-context.[16] DeepResearch, on the other hand, can't add 4-digit numbers (a task that is trivial for GPT-4o)[17] and confabulates the answer when given a task that is too difficult or too far out of its training distribution.
Pre-training oriented for drug research doesn't need to involve knowledge of politics or society that would be useful for violating the boundary of the domain (e.g. via superhuman persuasion). Admittedly, imposing such restrictions might require some progress in AI governance.
Currently, we get some of this dissociation between general knowledge and general capability "for free". Even if this dissociation goes away in the future, we might still be able to deliberately shape the models' capabilities so that they remain within the Goldilocks Zone of Capabilities.
(On a more speculative note, in addition to reducing risks, there is also the possibility that a bounded AI design will leverage resources more efficiently and focus on what is essential for the task at hand.)
Finally, we already have some proof-of-concept AI systems that by default fall within the scope of Bounded AI and are delivering important benefits, e.g. AlphaFold. It's a powerful technology with the potential to remove some bottlenecks in biological research, and it is not going to cause an existential catastrophe on its own.
2.4.2 Interruptibility
Boundedness of a system makes it easier to interrupt its actions if necessary. Probably, in most situations, the bound allowing for reliable interruptibility is higher than the bound ensuring safety, giving us a sort of margin of error.
If we are deploying the system in a limited domain that we know relatively well, we should also be able to:
- notice the system's undesirable effects before they become unacceptable, and
- interrupt the system (or undo its effects) in time to prevent them from escalating.
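The sketch below illustrates this "notice in time, then interrupt" loop. The `system` interface (`start`, `stop`, `is_running`), the `measure_effects` metric, the `effect_limit` threshold, and the `rollback` hook are all hypothetical placeholders; in practice, each of them is where the real difficulty lives.

```python
# A minimal sketch of the "notice in time, then interrupt" loop. The `system`
# interface (start/stop/is_running), `measure_effects`, `effect_limit`, and
# `rollback` are hypothetical placeholders; each hides the actual hard part
# of ensuring interruptibility.
import time

def run_with_watchdog(system, measure_effects, effect_limit, rollback,
                      poll_seconds: float = 1.0) -> str:
    """Run `system` while periodically checking its observable effects.

    Key assumption (the one argued for above): the measured effects grow
    noticeably *before* any damage becomes unacceptable."""
    system.start()
    try:
        while system.is_running():
            if measure_effects(system) > effect_limit:
                system.stop()     # interrupt before escalation
                rollback(system)  # undo effects where possible
                return "interrupted"
            time.sleep(poll_seconds)
        return "completed"
    finally:
        if system.is_running():   # safety net if anything above raised
            system.stop()
```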
2.4.3 Properties of the model vs properties of the joint system
We need to distinguish between the properties of the model itself and the properties of the joint system.[18] The model on its own has a certain capability profile, but that profile may change when the model is incorporated into a bigger system. It might expand, e.g. if the model is placed in an "agent scaffolding". It might also shrink, e.g. if its chain-of-thought starts being monitored for certain kinds of undesirable behavior. It might expand and shrink at the same time, along different axes of the capability space.
It is the capabilities of the joint system that determine the effects that we care about. The ability to shape capabilities after training gives us some leverage on the capabilities of the joint system, in addition to training the model itself.
However, for this post-training capability shaping to work, the model needs to be sufficiently incapable along some dimensions of the capability space that it won't be able to circumvent/hack those measures. Therefore, these measures need to be appropriately matched with capability limitations.
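A minimal sketch of this distinction is below: the same model, wrapped differently, ends up with a different effective capability profile. The `model`, `tools`, and `cot_monitor` interfaces are hypothetical stand-ins, not a specific framework.

```python
# A minimal sketch of "properties of the model" vs. "properties of the joint
# system": the same model, wrapped differently, has a different effective
# capability profile. `model`, `tools`, and `cot_monitor` are hypothetical
# stand-ins, not a specific framework.

class JointSystem:
    def __init__(self, model, tools=None, cot_monitor=None):
        self.model = model
        self.tools = tools or {}        # scaffolding: expands capabilities
        self.cot_monitor = cot_monitor  # oversight: shrinks capabilities

    def act(self, task: str):
        # The bare model proposes a chain of thought and an action
        # (hypothetical interface with .tool_name and .arguments).
        thought, action = self.model.plan(task)
        # Post-training capability shaping: the monitor can veto behavior the
        # bare model is perfectly capable of producing...
        if self.cot_monitor is not None and not self.cot_monitor.approve(thought, action):
            return None  # suppressed by the joint system
        # ...while the scaffolding determines which proposals can actually
        # affect the world at all.
        tool = self.tools.get(action.tool_name)
        if tool is None:
            return None  # no scaffolding for this action: capability not realized
        return tool(action.arguments)
```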
2.5 Examples
Domain:
3. Relation to other ideas/agendas
Various ideas or agendas have been proposed that can be seen as special cases of Bounded AI or, at least, adjacent to it, pointing at the same region of safe capability configurations that we called the Goldilocks Zone of Capabilities. Here we list some examples.
Tool AI
The concept of Tool AI emerged in discussions around the Machine Intelligence Research Institute (MIRI; back then still called the Singularity Institute). MIRI expected that once the threshold of AGI is crossed (and perhaps even before then), the primary shape of AI will be that of a general agent (a cross-domain real-world consequentialist), acting autonomously and ruthlessly optimizing the world according to its objectives, whatever they may be. If this is the world we're going to find ourselves in, the only way to ensure the friendliness of that kind of AI would be to align its values with ours or, failing that, ensure its corrigibility.
MIRI's expectations conflicted with those of Holden Karnofsky. In his 2012 post, Holden proposed an alternative vision of AGI that he considered a more likely default endpoint of AI development, namely Tool AI. Holden's Tool AI doesn't take any "actions" on its own, in the sense that its behavior doesn't result in "big effects on the world unless mediated by a human acting and being aware of their action". Tool AI only responds to questions, performs computations, recommends actions for a human to take to fulfill their goal, &c. In short, its task is to make humans more informed and more capable of acting properly, but it is the human who needs to take action. Think Google Maps or a calculator, but more powerful in what kinds of answers it can provide, what problems it can solve (without "moving atoms" on its own), and what actions it can recommend.
Quoting from Holden's 2012 post Thoughts on the Singularity Institute (lightly edited for clarity):
[20]
Many objections were raised to the idea of Tool AI (see e.g. Eliezer and Gwern). The core problems of Tool AI can be summarized as follows:
The problem of internal incentives (instrumental convergence). Agency is often useful for accomplishing goals.[21] If the AI is sufficiently intelligent and capable, and it wants to accomplish the goals given to it by the user (as we expect to be the case past the AGI threshold), it will have strong incentives to self-modify to be more agent-like and less Tool-AI-like.
Although the idea of Bounded AI does not completely evade these problems, it does mitigate them to a large extent.[22] The core idea of BAI is to give the system "agentic" capabilities but restrict them to narrow domains, securing their benefits while ensuring that they remain bounded to those domains.
Recently, the concept has been brought back by Max Tegmark in the 2024 post The Hopium Wars: the AGI Entente Delusion.
Conjecture's cognitive emulations (CoEms)
Guaranteed Safe AI (GSAI)
Quoting the abstract:
One specific and particularly ambitious variant of GSAI is davidad's Open Agency Architecture.
Open Agency Architecture (OAA)
More specifically, the AI proposes policies (courses of action) that are then evaluated or tested in a comprehensive world model. These policies are then executed only if, in the model, certain desiderata are satisfied.
The model itself is created with help from a legion of near-AGI[23] LLMs. This, in itself, might open the door for scheming (or just deficient) LLMs to insert some vulnerabilities in the world model. However, the model itself is written in a verifiable, human-created language, and checked automatically using formal methods.[24]
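The control flow of that propose-verify-execute pattern can be sketched roughly as follows. All of the components (`propose_policy`, `world_model.simulate`, the `desiderata` predicates, `execute`) are placeholders for what, in the actual Open Agency Architecture, are formally specified and verified objects; this sketch only shows the gating logic.

```python
# A highly simplified sketch of the propose/verify/execute gating logic
# described above. `propose_policy`, `world_model.simulate`, the `desiderata`
# predicates, and `execute` are placeholders, not OAA's real machinery.

def open_agency_step(propose_policy, world_model, desiderata, execute):
    """One iteration: the AI proposes, the world model predicts, the
    desiderata gate execution."""
    policy = propose_policy()                        # AI proposes a course of action
    predicted_outcomes = world_model.simulate(policy)
    if all(check(predicted_outcomes) for check in desiderata):
        return execute(policy)                       # run only if every check passes
    return None                                      # otherwise the policy is rejected
```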
Comprehensive AI Services (CAIS)
Sumerian Containment
[25]
This is clearly a special case of Bounded AI. The domain of application is "a precise, narrow field of science (e.g. medicine design)". The boundary of the domain is constituted by the reporter AI, which only allows texts relevant to the domain of application to pass through. Moreover, even if the scientist AI breaches the boundary ("escapes"), its abilities will be very limited, and it will leave traces, thus making it more interruptible. Its knowledge of only the Sumerian language is a crucial capability constraint.
(We do not interpret this proposal as necessarily serious but it is provocative in a way that might prompt us to think in yet underexplored directions.)
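As a toy illustration of the reporter-AI boundary described above, consider a filter of the following shape, where `relevance_classifier` is a hypothetical model that scores how clearly a piece of text belongs to the narrow scientific domain.

```python
# A toy sketch of the "reporter AI" boundary: only text the reporter judges to
# be clearly about the narrow scientific domain is relayed from the contained
# scientist AI to human researchers. `relevance_classifier` is a hypothetical
# scoring model, not a concrete proposal.

def reporter_boundary(scientist_output: str, relevance_classifier,
                      threshold: float = 0.95):
    """Relay a message only if it is confidently on-topic; otherwise block it."""
    score = relevance_classifier(scientist_output)  # e.g. P(text is about medicine design)
    if score >= threshold:
        return scientist_output  # passed through the boundary
    return None                  # blocked (and, ideally, logged for review)
```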
4. Open questions, cruxes, and further directions
How can we put bounds on the system's capabilities?
Right now, we already have some capability limitations (the schlep/integration gap and generalization failures). It isn't clear how long this situation is going to last.
To what extent are unevenly shaped capabilities the default outcome of current training methods? Given that the o1 and o3 models are marginally worse on tasks that are not like math or coding, is this due to catastrophic forgetting (a phenomenon that had seemed to go away)?
Moreover, fine-tuning base models for chat makes it at least more difficult to elicit certain capabilities, and not because that was the intention. (See dynomight: here and here.)
In humans, there is the general intelligence factor g, suggesting some "common core of general intelligence" (to the extent that those properties of human intelligence can be extrapolated to artificial intelligence).
However, despite the g-factor, there is some detachment of general intelligence from narrow domains of cognitive ability in specific developmental disorders in humans, such as dysgraphia, dyslexia, and dyscalculia. These impair human performance in one cognitive domain but otherwise leave general intelligence intact.
Are specific developmental disorders a good analogy for "AI savantism"?
We can influence the incentives through governance.
For example, domain-specific "high-risk" applications can be required to have specifications including capability limitations, safety measures, and an ontology specification (covering, i.a., the domain, the boundary between the domain and the environment, and how the system will be robustly prevented from crossing that boundary).
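For illustration only, such a specification might look something like the sketch below. The structure and entries are invented examples of what a regulator could require; no existing standard is being quoted.

```python
# For illustration only: a sketch of what such a required specification might
# contain. The structure and entries are invented examples, not a reference
# to any existing regulation or standard.

example_high_risk_spec = {
    "domain": "candidate drug screening",
    "ontology": {
        "in_domain": ["molecular property prediction", "assay prioritization"],
        "environment": "everything else, incl. lab automation and publication",
        "boundary": "all outputs reviewed by human researchers before any physical action",
    },
    "capability_limitations": [
        "no training data on security, persuasion, or politics",
        "no network or tool access at inference time",
    ],
    "safety_measures": [
        "output filters for dual-use compound classes",
        "audit log of all queries and outputs",
        "kill switch controlled by an independent operator",
    ],
}
```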
One research direction would be to investigate the viability of safety measures and of methods for engineering and imposing appropriate capability limitations for applications of AI in particular high-risk domains, as this is a crucial factor determining the viability of this proposal.
How can we ensure interruptibility?
- AI Control.
- Chain-of-thought interpretability.
- "Standard cybersecurity".
- Information-theoretic boxing.
Some reasons this cluster of ideas might fail. (Beyond the obvious "we don't effectively coordinate on doing it".)
The "suppose" here is not rhetorical. We're not claiming that these two assumptions are correct. Our intent here is to find strategies that robustly decrease AI X-risk in worlds where they are, unfortunately, satisfied.
We are using "non-interruptible", rather than "atomic" because it makes inferring the intended meaning easier.
This, on its own, is insufficient to ensure that the system is robustly safe but it's a good starting point.
Non-interruptible in principle, though they probably mostly believe (or hope) that it's going to be interruptible in practice.
Framing this in terms of "benefits of superintelligence" was borrowed from Joe Carlsmith.
We use the term "superintelligence" quite broadly, encompassing (using Bostrom's terminology) not just superintelligence coming from higher "quality of thought" but also superintelligence coming from speed of cognition, number of thinkers, or capacity to intelligently access and integrate information that surpass what humans would be ever capable of.
Naturally, in order to deploy such systems responsibly, we need to have justifiably strong beliefs in each of these claims.
This roughly corresponds to how people use terms like "AGI" and "ASI" in these contexts but is more precise.
Not necessarily a scalar value.
The boundaries of this concept are even more nebulous, given that the way we apply it depends on our epistemic state, including the time horizon over which we are assessing the possible damage. Still, our epistemic state is what we have to work with.
Assuming they are aware of it and are not malicious.
One might argue that this is the "fault" of humans, not of AIs, but here it is irrelevant.
A component being a model or agent scaffolding.
See Table 14 in the o3-mini system card.
Very speculatively, this might be analogous to how any value in the general IQ score can be obtained by combining the scores from different domains of human intelligence.
Speculating again, this might be analogous to specific developmental disorders.
In this particular instance, it couldn't even keep track of the task. From the tweet:
We are assuming that we're staying within the modern ML-centric paradigm of AI where an "ML model" is a natural and meaningful unit of analysis.
For now at least, we are not discussing scenarios like diamondoid bacteria.
Notably, there are counterarguments to such a system not having any wants, see e.g. The Parable of Predict-O-Matic, The Solomonoff Prior is Malign, and Deep Deceptiveness. They are, however, beyond the scope of this post.
Perhaps even: agency is useful for accomplishing a vast majority of goals.
To the extent that it is viable, of course.
Or, in this post's terminology, near-UAI.
If it doesn't make sense, see the post: Davidad's Bold Plan for Alignment: An In-Depth Explanation.
Name made up by Mateusz, not the authors of the post.
By "stability", we mean: if we get a system that sits "comfortably" within the Goldilocks Zone, how likely is it that this system will move across the upper bound to catastrophic risk, due to causes that are either "internal" to the system (e.g. self-modification or, more mundanely, the system becoming more capable through learning), or "external" to it (e.g. the domain/context changes so that the capability bounds and interruptibility cease to apply or humans modify the system).
We do not expect this to scale up to AGI/ASI, but that is not our purpose here. Our purpose is to provide an alternative guiding principle for the trajectory of AI.