I previously said:

I see little hope of a good agreement to pause AI development unless leading AI researchers agree that a pause is needed, and help write the rules. Even with that kind of expert help, there's a large risk that the rules will be ineffective and cause arbitrary collateral damage.

Yoshua Bengio has a reputation that makes him one of the best people to turn to for such guidance. He has now suggested restrictions on AI development that are targeted specifically at agenty AI.

If turned into a clear guideline, that would be a much more desirable method of slowing the development of dangerous AI. Alas, Bengio seems to admit that he isn't yet able to provide that clarity.

Clarifying Risky Agentiness

What do we want to limit via a restriction on agentiness? I'll start by imagining what an omniscient standards authority would want, and later examine how feasible the restrictions are.

Drexler's CAIS paper outlines an approach that would produce superintelligence without ceding much agency to AIs. Careful adherence to his guidelines would produce systems that are powerful and fairly safe. Yet he sounds pessimistic about defining a clear line that would distinguish excessively agenty systems from safe ones.

The key factor here is the distinction between narrow and broad scope of goals. Appropriately narrow goals cause systems to focus on limited time periods and limited aspects of reality. E.g. a translator can care only about outputting a translation of its current text input, and not care about improving its ability to do future translations.

Simple Deep Learning systems allow a clear distinction between training and inference, with inference having a narrow short-term goal: applying existing abilities to produce an immediate output.

Alas, anything that provides memory between successive inferences blurs that distinction, making it hard to analyze the extent to which longer term goals are creeping into the system. ChatGPT's value depends on having it know what tokens it has previously generated. That amounts to giving it memory that could enable longer-term goals. So I see no easy way to preserve an easily articulated distinction between short-term and long-term goals.

TurnTrout's impact regularization ideas provide another path to limiting the scope of AI goals: preserve attainable utility, and minimize impact. His Conservative Agency via Attainable Utility Preservation describes an AUP penalty which, if strong enough relative to an AI's primary goals, will minimize the extent to which the AI instrumentally converges on broader-than-intended power-seeking goals.

He suggests penalizing impact as much as possible, adjusting such penalties to be as high as is consistent with the relevant demands for increasing capability.
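A minimal Python sketch may make that trade-off concrete. It is a simplified illustration of an AUP-style penalty, not the paper's implementation: the function names, the `lam` and `scale` parameters, and all the numbers are assumptions chosen for illustration. The penalty sums, over a set of auxiliary goals, how much an action changes the AI's ability to attain each goal compared with doing nothing.

```python
from typing import Sequence


def aup_penalty(q_action: Sequence[float], q_noop: Sequence[float]) -> float:
    """Total absolute change in attainable utility across auxiliary goals.

    q_action[i] / q_noop[i] are (hypothetical) estimates of how well the
    agent could achieve auxiliary goal i after taking the action versus
    after doing nothing.
    """
    if len(q_action) != len(q_noop):
        raise ValueError("need one value per auxiliary goal in both inputs")
    return sum(abs(a - n) for a, n in zip(q_action, q_noop))


def aup_reward(primary_reward: float,
               q_action: Sequence[float],
               q_noop: Sequence[float],
               lam: float,
               scale: float = 1.0) -> float:
    """Primary reward minus the impact penalty, weighted by lam / scale."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return primary_reward - (lam / scale) * aup_penalty(q_action, q_noop)


# Illustrative (made-up) numbers for three auxiliary goals.
q_noop = [1.0, 2.0, 0.5]    # attainable utility if the agent does nothing
q_grab = [3.0, 4.0, 2.5]    # a power-seeking action raises all of them
q_mild = [1.25, 2.0, 0.5]   # a low-impact action barely changes them

# With no penalty the power-seeking action wins on primary reward alone;
# with a strong enough penalty the low-impact action is preferred.
no_penalty = (aup_reward(2.0, q_grab, q_noop, lam=0.0),
              aup_reward(1.0, q_mild, q_noop, lam=0.0))
with_penalty = (aup_reward(2.0, q_grab, q_noop, lam=0.5),
                aup_reward(1.0, q_mild, q_noop, lam=0.5))
```

Raising `lam` corresponds to penalizing impact more heavily. The hard part, which the sketch skips entirely, is estimating the attainable-utility values in the first place.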

I expect an omniscient authority could use this approach to ensure that AIs retain a fairly safe tool-like focus on a pretty narrow understanding of the goals that they're given.

This fits poorly within the framework of a stereotypical regulatory authority. Any realistic attempt at measuring attainable utility, or desires for increased AI capability, would become dominated by arbitrary guessing. I also expect problems with detecting whether AI developers are implementing it as intended.

In sum, this looks like a great approach if all AI developers earnestly aim to implement it responsibly. I'm confused as to whether it has much value in a less responsible world.

Restricting Compute

A majority of serious suggestions for slowing AI development involve limiting how much compute can be used in training any AI. It's attractive because we can imagine simple rules that might only need to constrain a few companies with the biggest AI budgets.

We ought to be cautious about relying on what's easy to measure, rather than what best describes risks. I previously wrote:

It's far from obvious whether such a limit would slow capability growth much.

One plausible scenario is that it would mainly cause systems to be developed in a more modular way. That might make us a bit safer by pushing development more toward what Drexler recommends. Or it might fool most people into thinking there's a pause, while capabilities grow at 95% of the pace they would otherwise have grown at.

I'm growing more confident that limits on training compute would cause some sort of slowdown in the rate at which AIs become more powerful.

I now support modest limits on how fast training compute can be increased, provided that such proposals are accompanied by caveats to the effect that this can't function as much more than a band-aid.
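As a rough illustration of what such a rule could look like, here is a minimal sketch of a cap on training compute that rises by a fixed percentage each year. Everything specific in it is an assumption made up for the example: the baseline of 1e25 FLOP, the 50% annual growth rate, and the function names. It is not drawn from any actual proposal.

```python
def allowed_training_flop(baseline_flop: float,
                          annual_growth: float,
                          years_elapsed: float) -> float:
    """Compute cap that grows geometrically from a baseline.

    annual_growth is a fraction, e.g. 0.5 means the cap rises 50% per year.
    """
    if baseline_flop <= 0 or annual_growth < 0 or years_elapsed < 0:
        raise ValueError("baseline must be positive; growth and years non-negative")
    return baseline_flop * (1.0 + annual_growth) ** years_elapsed


def run_is_within_limit(planned_flop: float,
                        baseline_flop: float,
                        annual_growth: float,
                        years_elapsed: float) -> bool:
    """True if a planned training run fits under the current cap."""
    cap = allowed_training_flop(baseline_flop, annual_growth, years_elapsed)
    return planned_flop <= cap


# Hypothetical numbers: a 1e25 FLOP baseline, growing 50% per year,
# checked two years after the rule takes effect.
cap_after_two_years = allowed_training_flop(1e25, 0.5, 2)
small_run_ok = run_is_within_limit(2e25, 1e25, 0.5, 2)
large_run_ok = run_is_within_limit(3e25, 1e25, 0.5, 2)
```

The rule is easy to state and to check, which is its appeal. It measures only training compute, so it does nothing about capability gains from more modular development.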

Enforcing Standards

I'll only say a little here about whether a pause/slowdown would be obeyed.

The feasibility of widespread obedience likely depends on an AI-induced accident that's as scary as the worst features of Hiroshima and COVID combined. With a lesser scare, or no scare, any pause will likely be too weakly enforced to matter much.

I estimate the probability of a well-enforced worldwide pause at around 5 to 10%. That sounds discouraging. But it shouldn't surprise us to notice that most actions have very little chance of saving or dooming the world. In most plausible futures, this blog post won't matter. I'm focusing my attention on futures where a pause will have important effects.

I don't expect to find any single approach to AI that "solves" alignment. Rather, I expect there are many small things we can do to slightly improve our odds. It's plausible enough that we're close to solving alignment that small improvements in our odds are worth pursuing when we can't find big ones.

Ideas about Evaluation

There's no simple way to write standards so that developers will be completely clear on whether and how they need to comply.

There are likely some adequately shared intuitions about which current systems are sufficiently AGI-like to need their risks evaluated. But as soon as billions of dollars start riding on these decisions, there will be unpleasant disputes as to which systems need to be checked for compliance.

One idea that comes to mind is to have a standardized procedure for asking GPT-5 (or something of that power) to evaluate any new system. The basic idea is that the developer needs to show all the relevant code to a specially configured GPT, and then ask an exact set of questions that are designed to, say, evaluate how much the candidate system cares about events weeks or years in the future.

It should also include some measure of how broad the AI's scope is. I don't want an AI that's specialized for predicting bond prices 10 years into the future to be considered riskier than an AI that cares only about maximizing a company's current quarter profits. I'm very unclear on how to ask GPT about this scope.


Advantages:

  • It should yield fast results for safe systems, so it would burden clearly safe systems less than most complex standards do. Most standards require an authority to confirm compliance, and the resulting human delays are costly for developers waiting on a decision.
  • It requires that some authority explicitly endorse the power of an AI. If regulators see their jobs being replaced by software, that will increase the political concerns about AI. This will help politicians decide that it's safe to say that there are bigger concerns than, say, hate speech.
  • It limits the risk that standards will be used to entrench incumbents.


Disadvantages:

  • It will likely often give inconclusive results. I assume the evaluating models will be trained to say "I don't know" more readily than ChatGPT does.
  • It's hard to predict whether GPT-5 (or whatever) is competent enough to detect risks. It wouldn't be too surprising if it could be readily fooled into approving dangerous software.

I'll estimate a 25% chance that AIs become competent enough to support a valuable version of this before it's too late to benefit from a slowdown in AI progress.

DeepMind's work on Discovering Agents clarifies how to distinguish an agent from a non-agent. I like this summary from the gears to ascension:

Agency is a property of pulling the future back in time; it's when a system selects actions by conditioning on the future. Agency is when any object ... takes the shape of the future before the future does and thereby steers the future.

But DeepMind's approach isn't of much direct use for evaluating compliance with a standard. They seem to need costly experiments on fully trained systems, whereas I see a need for fairly cheap decisions to be available at the start of training. Not to mention that banning all agents would be drastic overkill - allowing myopic agents seems pretty desirable. I still want to commend DeepMind for clarifying our thoughts about what an agent is.

Concluding Thoughts

Now is not quite the right time to expect competent restrictions on AI capabilities.

The situation is unstable. It seems moderately urgent to think more clearly about what kinds of restrictions would be desirable and effective.

I'm not smart enough to provide a clear proposal for how to buy more time. I hope this post nudges people to move toward slightly better guesses.

I won't be surprised if some sort of global restrictions are enacted in a few years. I have very little idea whether they'll be wise.
