Hi,
sorry for commenting without having read most of your post. I just started reading this and thought "isn't this exactly what the corrigibility agenda is/was about?", and in your "relation to other agendas" section you don't mention corrigibility, so I thought I'd just ask whether you're familiar with it and how your approach is different. (Though tbc, I could be totally misunderstanding, I didn't read far.)
Tbc I think further work on corrigibility is very valuable, but if you haven't looked into it much I'd suggest reading up on what other people wrote on it so far. (I'm not sure whether there are very good explainers, and sometimes people seem to get a wrong impression of what corrigibility is about. E.g. corrigibility has nothing to do with "corrigibly aligned" from the "Risks from Learned Optimization" paper. The shutdown problem is often misunderstood too. I would read and try to understand the stuff MIRI wrote about it. Possibly parts of this conversation might also be helpful, but yeah sry it's not written in a nice format that explains everything clearly.)
> we may be able to avoid this problem by:
> - not building unbounded, non-interruptible optimizers
> and, instead,
> - building some other, safer kind of AI that can be demonstrated to deliver enough value to make up for giving up on the business-as-usual kind of AI (which "we", though not necessarily its creators, expect might/would lead to the creation of unbounded, non-interruptible AI posing a catastrophic risk), along with the benefits it was expected to deliver.
This sounds to me like you're imagining that nobody building more powerful AIs is an option if we already got a lot of value from them (where I don't really know what level of capability you imagine concretely)? If the world were so reasonable, we wouldn't rush ahead with our abysmal understanding of AI anyways, because obviously the risks outweigh the benefits. Also, you don't just need to convince the leading labs, because progress will continue and soon enough many, many actors will be able to create unaligned powerful AI, and someone will.
I think the right framing of the bounded/corrigible agent agenda is aiming toward a pivotal act.
The short answer to "How is it different from corrigibility?" is something like: here we're thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
> This sounds to me like you're imagining that nobody building more powerful AIs is an option if we already got a lot of value from them (where I don't really know what level of capability you imagine concretely)? If the world were so reasonable, we wouldn't rush ahead with our abysmal understanding of AI anyways, because obviously the risks outweigh the benefits. Also, you don't just need to convince the leading labs, because progress will continue and soon enough many, many actors will be able to create unaligned powerful AI, and someone will.
The (revealed) perception of risks and benefits depends on many things, including what kind of AI is available/widespread/adopted. Perhaps we can tweak those parameters. (Not claiming that it's going to be easy.)
> I think the right framing of the bounded/corrigible agent agenda is aiming toward a pivotal act.
Something in this direction, yes.
> The short answer to "How is it different from corrigibility?" is something like: here we're thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
There's both "attempt to get coherent corrigibility" and "try to deploy corrigibility principles and keep it bounded enough to do a pivotal act". I think the latter approach is the main one MIRI imagines after having failed to find a simple coherent-description/utility-function for corrigibility. (Where here it would e.g. be ideal if the AI needs to only reason very well in a narrow domain without being able to reason well about general-domain problems like how to take over the world, though at our current level of understanding it seems hard to get the first without the second.)
EDIT: Actually the attempt to get coherent corrigibility also was aimed at bounded AI doing a pivotal act. But people were trying to formulate utility functions so that the AI can have a coherent shape which doesn't obviously break once large amounts of optimization power are applied (where decently large amounts are needed for doing a pivotal act.)
And I'd count "training for corrigible behavior/thought patterns in the hopes that the underlying optimization isn't powerful enough to break those patterns" also into that bucket, though yeah about that MIRI doesn't talk that much.
About getting coherent corrigibility: my and Joar's post on Updating Utility Functions makes some progress on a soft form of corrigibility.
(Work done at Convergence Analysis. Mateusz wrote the post and is responsible for most of the ideas with Justin helping to think it through. Thanks to Olga Babeeva for the feedback on this post.)
1. Motivation
Suppose the prospects of both (a) pausing or significantly slowing down AI progress and (b) solving the technical problems necessary to ensure that arbitrarily strong AI has good effects on humanity (in time, before we get such systems) look gloomy.[1] What options do we have left?
Adam Shimi presents a useful frame on the alignment problem in Abstracting The Hardness of Alignment: Unbounded Atomic Optimization.
If the problem is about some system (or a collection of systems) having an unbounded, non-interruptible impact,[2] can we handle it by ensuring that the impact of the systems in question is reliably bounded and interruptible (even if the systems are scheming against us, or trying to "liberate" themselves from the properties of boundedness and interruptibility)?
The core idea is to impose a multidimensional bound on the system's capabilities such that the system is:
- sufficiently capable along the dimensions corresponding to the capabilities we want it to have,
but
- sufficiently incapable along the dimensions corresponding to capabilities that would (in combination with the capabilities we want it to have) grant it the dangerous power of unbounded, non-interruptible optimization.[3]
In other words, if we can't solve the problem of aligning unbounded, non-interruptible optimizers, we may be able to avoid this problem by:
- not building unbounded, non-interruptible optimizers
and, instead,
- building some other, safer kind of AI that can be demonstrated to deliver enough value to make up for giving up on the business-as-usual kind of AI (which "we", though not necessarily its creators, expect might/would lead to the creation of unbounded, non-interruptible AI posing a catastrophic risk), along with the benefits it was expected to deliver.
Unfortunately, at the current moment, frontier AI labs are not taking actions sufficient to robustly mitigate the risks from unbounded, non-interruptible optimizers. In business-as-usual futures, they will keep pushing the capabilities they expect to be most profitable while investing in myopic safety efforts prescribed by the incrementalist metastrategy, sacrificing long-term safety from catastrophes for short-term, just-enough-safety-to-be-reliably-profitable (on some short-ish timescale). The biggest AI labs, such as OpenAI, Google DeepMind, and Anthropic, are not interested in systems that are safe by virtue of their boundedness. They think AGI/ASI (corresponding to unbounded, non-interruptible optimization[4]) is the best way to harness the benefits of superintelligence.[5][6]
However, if an alternative were shown to be viable, it might trigger a narrative bifurcation in which "race to AGI" is dethroned as the dominant frame and replaced with the idea of designing systems such that we can reliably reason about their possible effects, with a minimal sacrifice of short-term profits relative to the promise of business-as-usual AI R&D.
With this motivation in mind, we introduce the concept of Bounded AI (BAI), i.e. a kind of AI system that (for now, speaking loosely):
- is sufficiently capable within its intended domain of application to be useful there,
but
- is incapable of having large effects outside of that domain,
and
- within that domain of application, the system's effects obey certain constraints (to prevent large damage even within said domain).[7]
We contrast Bounded AI with Unbounded AI (UAI), i.e. a kind of AI system on the capabilities of which we can't put justified upper bounds that would exclude its ability to cause catastrophic outcomes.
We can think of UAI as unbounded and non-interruptible ("atomic") in the sense that Adam Shimi discusses in his UAO post.[8]
The next section develops the idea of Bounded AI in more detail, looking at it from several perspectives, and providing some examples to illustrate it. Section 3 relates Bounded AI to prior ideas and agendas. Section 4 lays out some open cruxes, questions, and potentially valuable further research directions.
2. Bounded AI
2.1 The Goldilocks Zone of Capabilities
An illustrative way to think about the kind of boundedness we are interested in is the "Goldilocks Zone of Capabilities".
Imagine a high-dimensional space of all AI capabilities, each dimension corresponding to some specific capability. Some combinations of capabilities generate the central AI catastrophic risks associated with unbounded, non-interruptible optimization. This is a region we don't want to enter; Bounded AI systems are those that remain below its "floor".
There is another important threshold — that of capability combinations that can deliver a significant portion of the benefits of superintelligence.
The Goldilocks Zone of Capabilities is a region "sandwiched" between these two thresholds. It contains systems that can deliver the benefits of superintelligence but do not constitute a source of central AI catastrophic risk: harm and loss of control due to unbounded non-interruptible optimization.
In a toy universe where there are only two relevant AI capabilities, this Goldilocks Zone might look like this:
Or perhaps like this:
The purpose of the Bounded AI concept is to allow us to aim at this zone more reliably. It relies on a hypothesis that we might call the "Goldilocks Zone of Capabilities Conjecture": It is possible to safely elicit the key benefits of superintelligence using AI systems that remain confined to the Goldilocks Zone of Capabilities. By assumption, the capability profiles of AI systems falling within this zone will be "spiked": high on some capabilities, relatively low on others, implying that each would only be useful in some domain(s) of application and not very useful in others.
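As a purely illustrative aid, here is a minimal sketch of the two-capability toy universe in code. The axis names, the numeric thresholds, and the shapes of the two boundaries are all invented for this example; nothing about the real capability space is claimed to be this simple.

```python
# A toy sketch of the two-capability universe described above. The axis names,
# the numeric thresholds, and the shapes of the two boundaries are invented
# for illustration only.

def delivers_benefits(domain_skill: float, general_agency: float) -> bool:
    """Hypothetical threshold for delivering 'benefits of superintelligence'."""
    return domain_skill + 0.3 * general_agency >= 1.0

def crosses_danger_floor(domain_skill: float, general_agency: float) -> bool:
    """Hypothetical floor of the catastrophic-risk region: here, unbounded,
    non-interruptible optimization requires high scores on *both* axes."""
    return domain_skill >= 1.5 and general_agency >= 1.2

def in_goldilocks_zone(domain_skill: float, general_agency: float) -> bool:
    """In the zone: capable enough to be useful, yet below the danger floor."""
    return (delivers_benefits(domain_skill, general_agency)
            and not crosses_danger_floor(domain_skill, general_agency))

if __name__ == "__main__":
    # A "spiked" profile: very strong in its domain, weak on general agency.
    print(in_goldilocks_zone(domain_skill=1.8, general_agency=0.2))  # True
    # Strong on both axes: falls into the danger region, so not in the zone.
    print(in_goldilocks_zone(domain_skill=1.8, general_agency=1.5))  # False
```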
To the extent that we are uncertain about whether the system's capabilities are below a satisfying upper bound, we want to add another layer of protection. Namely, we want to ensure that we will be able to interrupt the system in time if we notice that its effects are likely to lead to undesirable consequences. The assumption that, before the system's effects become unacceptable, they will first become noticeable, and that upon noticing them we will be able to prevent them, corresponds to some fragment of this space: probably something like an upper shell of the Goldilocks Zone of Capabilities, where we are, metaphorically speaking, playing with fire but can ensure that we have enough extinguishers in stock to prevent an accident from escalating into a conflagration. A lower bound on interruptibility thus follows from an upper bound on capabilities/effects.
This is how BAI, if adopted widely, may allow us to avoid catastrophic risks. To be adopted widely, it needs (among other things) to be shown to be sufficiently capable and safe, which, in general, is not the default. Therefore, the next two sections focus on ensuring more "mundane" safety.
2.2 Four quadrants of Bounded AI
To limit the undesirable effects of an AI system, we want to ensure that its effects (both within and outside of the domain) remain confined to a safe, bounded region and also that any escalation toward unacceptable disasters is interruptible by human operators.
We can think about it in terms of the following 2×2 matrix.
To each AI system (perhaps consisting of multiple components, such as one or more ML models, an "agent" scaffolding, RAG, "non-AI" filters being applied to the model's outputs, or a procedure involving some degree of human oversight) we can assign a "spec" consisting of a value[9] for each cell in the matrix. An AI system is a Bounded AI if its spec is sufficient to infer that its deployment is not going to cause unacceptable damage. Importantly, boundedness and interruptibility are related. The bigger the system's possible effects or the more uncertain we are about whether its effects encroach upon dangerous territory, the stronger assurance of its interruptibility we want to have (everything else being equal).
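To make the idea of a "spec" slightly more concrete, here is a minimal sketch of one possible representation of the 2×2 matrix. The field names, the `Cell` structure, and the acceptance rule are assumptions made for illustration, not a format proposed in this post (and, per the footnote, the cell values need not be scalars).

```python
# A minimal sketch of one possible representation of the "spec" for the 2×2
# matrix above. The field names, the Cell structure, and the acceptance rule
# are illustrative assumptions, not a format proposed in the post.
from dataclasses import dataclass

@dataclass
class Cell:
    bound: str        # e.g. "effects limited to a sandboxed lab pipeline"
    assurance: float  # how strongly we can justify that the bound holds (0..1)

@dataclass
class BoundedAISpec:
    # Rows: where the effects occur; columns: boundedness vs. interruptibility.
    in_domain_bounded: Cell
    in_domain_interruptible: Cell
    outside_domain_bounded: Cell
    outside_domain_interruptible: Cell

    def acceptable(self, min_assurance: float = 0.99) -> bool:
        """Toy acceptance rule: every cell must clear an assurance threshold.
        In practice, the required assurance would scale with the size of the
        possible effects and with how boundedness and interruptibility trade
        off against each other."""
        cells = (self.in_domain_bounded, self.in_domain_interruptible,
                 self.outside_domain_bounded, self.outside_domain_interruptible)
        return all(cell.assurance >= min_assurance for cell in cells)
```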
Whatever the domain is, we most likely have less control over what's going on in the "external world" than over what's going on in the domain itself. Therefore, we want to have stronger bounds on the effects outside the domain. For this to be the case, it should be effectively impossible for the inside-domain effects to "diffuse" outside of the domain. We can think about it in terms of a boundary between the domain and the external world or, alternatively, in terms of there not being an interface between them that would allow the effects to diffuse to the outside (depending on which frame is more appropriate for the particular case). Depending on the situation, this boundary or lack of interface may be the default or may need to be deliberately established.
The extent to which the effects need to be bounded and interruptible depends on the domain of application or, more specifically, the possibility of causing unacceptable damage in the domain.
Importantly, to build a Bounded AI for a domain, we need to know enough about that domain to ensure that the system's effects are sufficiently bounded and/or interruptible. This rules out domains with large downside risks where our understanding is still very poor, such as automating AI safety/alignment research. However, Bounded AI might be useful for automating some subtasks involved in that kind of research, as long as those subtasks remain in a subdomain that can be reliably overseen and/or verified.
2.3 High-risk domains of application
We say that a domain of application is high-risk if the capabilities required to deliver great benefits within that domain are hard to disentangle from the capabilities sufficient to cause unacceptable damage within that domain.
We use "unacceptable damage" vaguely meaning "damage we strongly disprefer", relying on the spirit, rather than the letter, with catastrophic risks being the most central example of unacceptable damage.[10]
The motivation for this concept should be obvious in the context of AI risk.
If the AI is capable of causing great damage but not of delivering great benefits, people wouldn't bother deploying it.[11]
If the AI is highly capable of causing great benefits as well as great damage, we are in a danger zone: the promise of the upsides may encourage unjustified optimism about the system's safety, and safety measures (whatever additional things are appended to the system to ensure it remains safe) will focus on addressing short-term/myopic risks, with fingers crossed that this will at least prepare us better for problems that are harder to address but less urgent, leaving us effectively unprepared for those harder problems when push comes to shove.
If the domain is high-risk, i.e. the development of beneficial capabilities will also (by default) bring along dangerous capabilities, then we are more likely to find ourselves in the danger zone.
High-risk domains are common. In many situations where automation could produce large benefits, it could also cause a lot of damage, and minimizing the latter while preserving the former is highly non-trivial. Examples include self-driving cars, software engineering for life-critical systems, medical research, psychotherapy, and, finally, using AI to solve problems that seem too difficult for us to solve without AI, especially when this involves the AI being capable of autonomous, non-overseeable action. In these domains, the possible upsides incentivize building highly capable AI, along with the possibility of downsides brought along by those capabilities.
Awareness of downsides creates incentives to safeguard against them. However, miscalibration about downsides, combined with economic incentives towards myopic safety (addressing short-term risks that are relevant to the time horizon an AI company is planning for, while neglecting long-term risks, including larger/catastrophic risks) and with the heavy-tailed distribution of downsides, can leave us inadequately prepared against the greatest downsides.
In a sense, this is similar to ensuring the safety of non-AI technology in high-risk domains. However, it is also critically different because it involves (semi-)intelligent systems (even if bounded in their intelligence by design) that we understand very imperfectly and to which we therefore need to apply correspondingly more caution.
2.4 Is it viable?
The viability of Bounded AI (as presented here) depends (among other things) on whether we can get an AI system whose profile is placed within the relevant parts of the capability space (i.e. the Goldilocks Zone) and, to the extent that we are sufficiently uncertain about its place, can be made interruptible. (This is a restatement of the "conjecture" from the previous section with more bells and whistles.)
This is a challenge on both the technical and the governance side. In this section, we cover some reasons why we think the technical challenge is likely tractable.
2.4.1 Boundedness
As of early 2025, we've had powerful and generally knowledgeable models — i.e. LLMs — for two years (i.e. since the release of GPT-4). So far, their impact on the economy has been limited. While they are helpful in a few domains on the margin (e.g. code generation), as Cole Wyeth writes, they haven't done anything important yet.
In the current regime, a lot of schlep/integration is often required to make a very smart and knowledgeable LLM useful for a non-trivial application that it wasn't trained to perform in.[12] This might change soon if AI models acquire capabilities sufficient to integrate themselves into pre-existing systems with minimal or close to non-existent involvement of human workers, or if some infrastructure is designed that minimizes the friction of integrating a new AI component.[13] However, at least for now, general knowledge and general intelligence (at least the LLM variant of general intelligence) do not jointly imply "general domain mastery" or a general ability to acquire any relevant capability.
While scaling makes models generally more performant on various tasks and benchmarks, there is also some dissociation between capabilities, as can be seen in the o1 line of models being worse than GPT-4o on some language understanding tasks.[14] This suggests that for a given effective model size, there are some trade-offs between competence in domains.[15]
The shape of LLM capabilities is different from that of human capabilities ("jagged frontier"). Surprising generalization failures are present even in seemingly advanced systems. For example, o1 competently answers PhD-level questions but can't play tic-tac-toe optimally, even if taught it in-context.[16] DeepResearch, on the other hand, can't add 4-digit numbers (a task that is trivial for GPT-4o)[17] and confabulates the answer when given a task that is too difficult or too far out of its training distribution.
Pre-training oriented for drug research doesn't need to involve knowledge of politics or society that would be useful for violating the boundary of the domain (e.g. via superhuman persuasion). Admittedly, imposing such restrictions might require some progress in AI governance.
Currently, we get some of this dissociation between general knowledge and general capability "for free". Even if this dissociation goes away in the future, we might still be able to deliberately shape the models' capabilities so that they remain within the Goldilocks Zone of Capabilities.
(On a more speculative note, in addition to reducing risks, there is also the possibility that a bounded AI design will leverage resources more efficiently and focus on what is essential for the task at hand.)
Finally, we already have some proof-of-concept AI systems that by default fall within the scope of Bounded AI and are delivering important benefits, e.g. AlphaFold. It's a powerful technology with the potential to remove some bottlenecks in biological research, and it is not going to cause an existential catastrophe on its own.
2.4.2 Interruptibility
Boundedness of a system makes it easier to interrupt its actions if necessary. Probably, in most situations, the bound allowing for reliable interruptibility is higher than the bound ensuring safety, giving us a sort of margin of error.
If we are deploying the system in a limited domain that we know relatively well, we should also be able to:
- notice the system's undesirable effects before they become unacceptable, and
- interrupt the system (or undo its effects) in time to prevent them from escalating.
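The sketch below illustrates this "notice in time, then interrupt" loop. The `system` interface (`start`, `stop`, `is_running`), the `measure_effects` metric, the `effect_limit` threshold, and the `rollback` hook are all hypothetical placeholders; in practice, each of them is where the real difficulty lives.

```python
# A minimal sketch of the "notice in time, then interrupt" loop. The `system`
# interface (start/stop/is_running), `measure_effects`, `effect_limit`, and
# `rollback` are hypothetical placeholders; each hides the actual hard part
# of ensuring interruptibility.
import time

def run_with_watchdog(system, measure_effects, effect_limit, rollback,
                      poll_seconds: float = 1.0) -> str:
    """Run `system` while periodically checking its observable effects.

    Key assumption (the one argued for above): the measured effects grow
    noticeably *before* any damage becomes unacceptable."""
    system.start()
    try:
        while system.is_running():
            if measure_effects(system) > effect_limit:
                system.stop()     # interrupt before escalation
                rollback(system)  # undo effects where possible
                return "interrupted"
            time.sleep(poll_seconds)
        return "completed"
    finally:
        if system.is_running():   # safety net if anything above raised
            system.stop()
```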
2.4.3 Properties of the model vs properties of the joint system
We need to distinguish between the properties of the model itself and the properties of the joint system.[18] The model on its own has a certain capability profile, but that profile may change when the model is incorporated into a bigger system. It might expand, e.g. if the model is placed in an "agent scaffolding". It might also shrink, e.g. if its chain-of-thought starts being monitored for certain kinds of undesirable behavior. It might expand and shrink at the same time, along different axes of the capability space.
It is the capabilities of the joint system that determine the effects that we care about. The ability to shape capabilities after training gives us some leverage on the capabilities of the joint system, in addition to training the model itself.
However, for this post-training capability shaping to work, the model needs to be sufficiently incapable along some dimensions of the capability space that it won't be able to circumvent/hack those measures. Therefore, these measures need to be appropriately matched with capability limitations.
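A minimal sketch of this distinction is below: the same model, wrapped differently, ends up with a different effective capability profile. The `model`, `tools`, and `cot_monitor` interfaces are hypothetical stand-ins, not a specific framework.

```python
# A minimal sketch of "properties of the model" vs. "properties of the joint
# system": the same model, wrapped differently, has a different effective
# capability profile. `model`, `tools`, and `cot_monitor` are hypothetical
# stand-ins, not a specific framework.

class JointSystem:
    def __init__(self, model, tools=None, cot_monitor=None):
        self.model = model
        self.tools = tools or {}        # scaffolding: expands capabilities
        self.cot_monitor = cot_monitor  # oversight: shrinks capabilities

    def act(self, task: str):
        # The bare model proposes a chain of thought and an action
        # (hypothetical interface with .tool_name and .arguments).
        thought, action = self.model.plan(task)
        # Post-training capability shaping: the monitor can veto behavior the
        # bare model is perfectly capable of producing...
        if self.cot_monitor is not None and not self.cot_monitor.approve(thought, action):
            return None  # suppressed by the joint system
        # ...while the scaffolding determines which proposals can actually
        # affect the world at all.
        tool = self.tools.get(action.tool_name)
        if tool is None:
            return None  # no scaffolding for this action: capability not realized
        return tool(action.arguments)
```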
2.5 Examples
Domain:
3. Relation to other ideas/agendas
Various ideas or agendas have been proposed that can be seen as special cases of Bounded AI or, at least, adjacent to it, pointing at the same region of safe capability configurations that we called the Goldilocks Zone of Capabilities. Here we list some examples.
Tool AI
The concept of Tool AI emerged in discussions around the Machine Intelligence Research Institute (MIRI; back then still called the Singularity Institute). MIRI expected that once the threshold of AGI is crossed (and perhaps even before then), the primary shape of AI will be that of a general agent (a cross-domain real-world consequentialist), acting autonomously and ruthlessly optimizing the world according to its objectives, whatever they may be. If this is the world we're going to find ourselves in, the only way to ensure the friendliness of that kind of AI would be to align its values with ours or, failing that, ensure its corrigibility.
MIRI's expectations conflicted with those of Holden Karnofsky. In his 2012 post, Holden proposed an alternative vision of AGI that he considered a more likely default endpoint of AI development, namely Tool AI. Holden's Tool AI doesn't take any "actions" on its own, in the sense that its behavior doesn't result in "big effects on the world unless mediated by a human acting and being aware of their action". Tool AI only responds to questions, performs computations, recommends actions for a human to take to fulfill their goal, &c. In short, its task is to make humans more informed and more capable of acting properly, but it is the human who needs to take action. Think Google Maps or a calculator, but more powerful in what kinds of answers it can provide, what problems it can solve (without "moving atoms" on its own), and what actions it can recommend.
Quoting from Holden's 2012 post Thoughts on the Singularity Institute (lightly edited for clarity):
[20]
Many objections were raised to the idea of Tool AI (see e.g. Eliezer and Gwern). The core problems of Tool AI can be summarized as follows:
The problem of internal incentives (instrumental convergence). Agency is often useful for accomplishing goals.[21] If the AI is sufficiently intelligent and capable, and it wants to accomplish the goals given to it by the user (as we expect to be the case past the AGI threshold), it will have strong incentives to self-modify to be more agent-like and less Tool-AI-like.
Although the idea of Bounded AI does not completely evade these problems, it does mitigate them to a large extent.[22] The core idea of BAI is to give the system "agentic" capabilities but restrict them to narrow domains, securing their benefits while ensuring that they remain bounded to those domains.
Recently, the concept has been brought back by Max Tegmark in the 2024 post The Hopium Wars: the AGI Entente Delusion.
Conjecture's cognitive emulations (CoEms)
Guaranteed Safe AI (GSAI)
Quoting the abstract:
One specific and particularly ambitious variant of GSAI is davidad's Open Agency Architecture.
Open Agency Architecture (OAA)
More specifically, the AI proposes policies (courses of action) that are then evaluated or tested in a comprehensive world model. These policies are then executed only if, in the model, certain desiderata are satisfied.
The model itself is created with help from a legion of near-AGI[23] LLMs. This, in itself, might open the door for scheming (or just deficient) LLMs to insert some vulnerabilities in the world model. However, the model itself is written in a verifiable, human-created language, and checked automatically using formal methods.[24]
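The control flow of that propose-verify-execute pattern can be sketched roughly as follows. All of the components (`propose_policy`, `world_model.simulate`, the `desiderata` predicates, `execute`) are placeholders for what, in the actual Open Agency Architecture, are formally specified and verified objects; this sketch only shows the gating logic.

```python
# A highly simplified sketch of the propose/verify/execute gating logic
# described above. `propose_policy`, `world_model.simulate`, the `desiderata`
# predicates, and `execute` are placeholders, not OAA's real machinery.

def open_agency_step(propose_policy, world_model, desiderata, execute):
    """One iteration: the AI proposes, the world model predicts, the
    desiderata gate execution."""
    policy = propose_policy()                        # AI proposes a course of action
    predicted_outcomes = world_model.simulate(policy)
    if all(check(predicted_outcomes) for check in desiderata):
        return execute(policy)                       # run only if every check passes
    return None                                      # otherwise the policy is rejected
```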
Comprehensive AI Services (CAIS)
Sumerian Containment
[25]
This is clearly a special case of Bounded AI. The domain of application is "a precise, narrow field of science (e.g. medicine design)". The boundary of the domain is constituted by the reporter AI, which only allows texts relevant to the domain of application to pass through. Moreover, even if the scientist AI breaches the boundary ("escapes"), its abilities will be very limited, and it will leave traces, thus making it more interruptible. Its knowledge of only the Sumerian language is a crucial capability constraint.
(We do not interpret this proposal as necessarily serious but it is provocative in a way that might prompt us to think in yet underexplored directions.)
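As a toy illustration of the reporter-AI boundary described above, consider a filter of the following shape, where `relevance_classifier` is a hypothetical model that scores how clearly a piece of text belongs to the narrow scientific domain.

```python
# A toy sketch of the "reporter AI" boundary: only text the reporter judges to
# be clearly about the narrow scientific domain is relayed from the contained
# scientist AI to human researchers. `relevance_classifier` is a hypothetical
# scoring model, not a concrete proposal.

def reporter_boundary(scientist_output: str, relevance_classifier,
                      threshold: float = 0.95):
    """Relay a message only if it is confidently on-topic; otherwise block it."""
    score = relevance_classifier(scientist_output)  # e.g. P(text is about medicine design)
    if score >= threshold:
        return scientist_output  # passed through the boundary
    return None                  # blocked (and, ideally, logged for review)
```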
4. Open questions, cruxes, and further directions
How can we put bounds on the system's capabilities?
Right now, we already have some capability limitations (the schlep/integration gap and generalization failures). It isn't clear how long this situation is going to last.
To what extent are unevenly shaped capabilities the default outcome of current training methods? Given that the o1 and o3 models are marginally worse on tasks that are not like math or coding, is this due to catastrophic forgetting (a phenomenon that had seemed to go away)?
Moreover, fine-tuning base models for chat makes it at least more difficult to elicit certain capabilities, and not because that was the intention. (See dynomight: here and here.)
In humans, there is the general intelligence factor g, suggesting some "common core of general intelligence" (to the extent that those properties of human intelligence can be extrapolated to artificial intelligence).
However, despite the g-factor, there is some detachment of general intelligence from narrow domains of cognitive ability in specific developmental disorders in humans, such as dysgraphia, dyslexia, and dyscalculia. These impair human performance in one cognitive domain but otherwise leave general intelligence intact.
Are specific developmental disorders a good analogy for "AI savantism"?
We can influence the incentives through governance.
For example, domain-specific "high-risk" applications can be required to have specifications including capability limitations, safety measures, and an ontology specification (covering, i.a., the domain, the boundary between the domain and the environment, and how the system will be robustly prevented from crossing that boundary).
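For illustration only, such a specification might look something like the sketch below. The structure and entries are invented examples of what a regulator could require; no existing standard is being quoted.

```python
# For illustration only: a sketch of what such a required specification might
# contain. The structure and entries are invented examples, not a reference
# to any existing regulation or standard.

example_high_risk_spec = {
    "domain": "candidate drug screening",
    "ontology": {
        "in_domain": ["molecular property prediction", "assay prioritization"],
        "environment": "everything else, incl. lab automation and publication",
        "boundary": "all outputs reviewed by human researchers before any physical action",
    },
    "capability_limitations": [
        "no training data on security, persuasion, or politics",
        "no network or tool access at inference time",
    ],
    "safety_measures": [
        "output filters for dual-use compound classes",
        "audit log of all queries and outputs",
        "kill switch controlled by an independent operator",
    ],
}
```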
One research direction would be to investigate the viability of safety measures and of methods for engineering and imposing appropriate capability limitations for applications of AI in particular high-risk domains, as this is a crucial factor determining the viability of this proposal.
How can we ensure interruptibility?
- AI Control.
- Chain-of-thought interpretability.
- "Standard cybersecurity".
- Information-theoretic boxing.
Some reasons this cluster of ideas might fail. (Beyond the obvious "we don't effectively coordinate on doing it".)
The "suppose" here is not rhetorical. We're not claiming that these two assumptions are correct. Our intent here is to find strategies that robustly decrease AI X-risk in worlds where they are, unfortunately, satisfied.
We are using "non-interruptible", rather than "atomic" because it makes inferring the intended meaning easier.
This, on its own, is insufficient to ensure that the system is robustly safe but it's a good starting point.
Non-interruptible in principle, though they probably mostly believe (or hope) that it's going to be interruptible in practice.
Framing this in terms of "benefits of superintelligence" was borrowed from Joe Carlsmith.
We use the term "superintelligence" quite broadly, encompassing (using Bostrom's terminology) not just superintelligence coming from higher "quality of thought" but also superintelligence coming from speed of cognition, number of thinkers, or capacity to intelligently access and integrate information that surpass what humans would be ever capable of.
Naturally, in order to deploy such systems responsibly, we need to have justifiably strong beliefs in each of these claims.
This roughly corresponds to how people use terms like "AGI" and "ASI" in these contexts but is more precise.
Not necessarily a scalar value.
The boundaries of this concept are even more nebulous, given that the way we apply it depends on our epistemic state, including the time horizon over which we are assessing the possible damage. Still, our epistemic state is what we have to work with.
Assuming they are aware of it and are not malicious.
One might argue that this is the "fault" of humans, not of AIs, but here it is irrelevant.
A component being a model or agent scaffolding.
See Table 14 in the o3-mini system card.
Very speculatively, this might be analogous to how any value in the general IQ score can be obtained by combining the scores from different domains of human intelligence.
Speculating again, this might be analogous to specific developmental disorders.
In this particular instance, it couldn't even keep track of the task. From the tweet:
We are assuming that we're staying within the modern ML-centric paradigm of AI where an "ML model" is a natural and meaningful unit of analysis.
For now at least, we are not discussing scenarios like diamondoid bacteria.
Notably, there are counterarguments to such a system not having any wants, see e.g. The Parable of Predict-O-Matic, The Solomonoff Prior is Malign, and Deep Deceptiveness. They are, however, beyond the scope of this post.
Perhaps even: agency is useful for accomplishing a vast majority of goals.
To the extent that it is viable, of course.
Or, in this post's terminology, near-UAI.
If it doesn't make sense, see the post: Davidad's Bold Plan for Alignment: An In-Depth Explanation.
Name made up by Mateusz, not the authors of the post.
By "stability", we mean: if we get a system that sits "comfortably" within the Goldilocks Zone, how likely is it that this system will move across the upper bound to catastrophic risk, due to causes that are either "internal" to the system (e.g. self-modification or, more mundanely, the system becoming more capable through learning), or "external" to it (e.g. the domain/context changes so that the capability bounds and interruptibility cease to apply or humans modify the system).
We do not expect this to scale up to AGI/ASI, but that is not our purpose here. Our purpose is to provide an alternative guiding principle for the trajectory of AI.