For concreteness, I’ll focus on the “off button” problem, which is that an AI (supposedly) will not let you turn it off. Why not? The AI will have some goal. ~Whatever that goal is, the AI will be better able to achieve it if it is “on” and able to act. Therefore, ~all AIs will resist being turned off.

Why is this wrong? First, I’ll offer an empirical argument. No actually existing AIs exhibit this problem, even the most agentic and capable ones. Go-playing AIs are superhuman in pursuing their “goal” of winning a Go game, but they do not offer any resistance to being turned off. Language models have a seemingly impressive world model, and are fairly capable at their “goal” of predicting the next token, but they do not resist being turned off. (Indeed, it’s not even clear that language models are “on” in any sense, since different executions of the model share no state, and any individual execution is short-lived.) Automated trading systems make stateful, economically significant decisions, but they do not resist being turned off (actually, being easy to turn off is an important design consideration).

So that’s the empirical case. How about a thought experiment? Suppose you scaled up a game-playing AI with 1000x more data/compute/model-size. It would get much better at achieving its “goal” of playing the game well. Would it start to resist being turned off? Obviously it would not. It’s not even going to be able to represent the idea of being “turned off”, since it’s just thinking very hard about e.g. Go.

So what about the theory? How come theory says ~all AIs should resist being turned off, and yet ~no actually existing AIs (or even powered-up versions of existing AIs) do resist being turned off? The reason is that “agents with a goal” is not a good way of thinking about AIs.

It’s a tempting way to think about AIs, because it is a decent (although imperfect) way of thinking about humans. And “reward functions” do look an awful lot like goals. But there are some key differences:

1) Training defines the “world” that the AI is thinking about. This was the problem with the super-Go-playing AI - we just told it to think hard about the world of Go. So even though it is superhuman, even godlike, *in that domain*, it hasn’t thought about the real world. It can’t even interact with the real world, since it takes input and provides output in the very specific way that it was trained to.

2) AI is composable. Unlike a human, you can run arbitrary code - or another AI - at *any point* inside an AI. You can add AI-driven checks that proposed actions are safe. You could have many different action-proposer AIs that were trained on different datasets or reward functions, totally independently and without considering how they will interact with the other AI-pieces. There is no reason a heterogeneous agent like this will be well-modeled as having a singular goal, even if some of the pieces that compose it might.

OK, but what if we did build an agentic AI that interacted with the real world and was a single system? I agree, then you might have a problem. But you probably wouldn’t want to. Why?

1) Training an AI on the entire world is extremely wasteful, compared to thinking about some narrowly defined problem. If you want to “run a company” well, thinking about how your desk chair is made up of atoms that obey the laws of physics is likely to be a distraction compared to case studies of how similar companies acted in the past. Even thinking about how to circumvent the “off” button is a distraction, compared to just focusing on your job. Humans lack the ability to focus their thoughts like this; AIs don’t, and that is both a competitive advantage and a contribution to safety. (You could even monitor your AIs to make sure they’re staying on task; consider the contract-drafting em).

2) Composability is a very powerful way to improve safety, and is ubiquitous in existing computer systems. A software “off switch” is the most obvious example (e.g. “turn yourself off after you get this message”). Heart-beating (“turn yourself off if you haven’t talked to ‘home base’ in a few seconds”) is also very common. Finance has “risk checks” (before taking any actions, sanity check them against some simple hard-coded rules). Composability also allows you to define pieces that address narrowly-defined problems, which is helpful as argued in (1).
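
To make the three mechanisms in (2) concrete, here is a minimal Python sketch of a wrapper that composes a software off switch, a heartbeat timeout, and a hard-coded risk check around an arbitrary action proposer. Everything in it is an assumption for illustration: the `GuardedAgent` class, the `"off"` and `"heartbeat"` message names, the 5-second timeout, and the order-size limit are hypothetical and not taken from any real trading system.

```python
import time
from typing import Callable, Optional


class GuardedAgent:
    """Wraps a hypothetical action proposer with three independent safety pieces."""

    def __init__(
        self,
        propose: Callable[[], float],      # hypothetical proposer: returns an order size
        max_order: float = 100.0,          # assumed hard-coded risk limit
        heartbeat_timeout: float = 5.0,    # assumed seconds allowed without contact
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.propose = propose
        self.max_order = max_order
        self.heartbeat_timeout = heartbeat_timeout
        self.clock = clock
        self.running = True
        self.last_heartbeat = clock()

    def receive(self, message: str) -> None:
        """Handle a message from 'home base'."""
        if message == "off":               # software off switch
            self.running = False
        elif message == "heartbeat":       # home base is still reachable
            self.last_heartbeat = self.clock()

    def step(self) -> Optional[float]:
        """Return an approved action, or None if shut down or rejected."""
        # Heartbeat: shut down if home base has been silent for too long.
        if self.running and self.clock() - self.last_heartbeat > self.heartbeat_timeout:
            self.running = False
        if not self.running:
            return None
        action = self.propose()
        # Risk check: reject anything outside the simple hard-coded limit.
        if abs(action) > self.max_order:
            return None
        return action
```

Note that the proposer never sees the off switch, the heartbeat, or the risk limit; the wrapper consults them on every step, which is the sense in which these checks compose with whatever the proposer was trained to do.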

So even agentic AIs probably won’t have “goals”. Some of their subcomponents may have goals, but they will probably be too narrow to translate into instrumental goals for the agent as a whole. That is good, because instrumental goals would be dangerous. When thinking about and designing AIs, we should emphasize these important ways that AIs are less goal-oriented than humans.


9 comments

Hmm this doesn't seem quite right to me. It sounds like you're arguing:

  1. That the off-switch problem is unlikely because we haven't seen it so far - However, I wouldn't expect this problem to arise until AIs are capable enough that they can manipulate their own off-switch
  2. General AI is a wasteful/unproductive approach and narrow AIs are more efficient and useful - I don't have a good link handy and am not sure I can articulate it well at this moment, but intuitively generality can be a very powerful leap (similar to the leap from chimps to humans), and I have read compelling arguments about this before as well.
  3. Agentic AIs probably won't have goals - Some of them might not, but I think Evan Hubinger's "Risks from Learned Optimization" and some subsequent work makes a pretty good case for why we should expect (mesa-)optimizers or search processes to emerge sometimes from training advanced deep learning models. And these optimizers can be thought of as having a pretty clear "goal" (from my understanding anyway).

I may be missing something important from your argument - please correct me if so.

On (3): I don't expect an agentic AI to consist of a single deep learning model. For concreteness, let me sketch an architecture:

We have some robot R that we want an AI to run. R will supply the AI with sensory input and take actions that the AI recommends. R is like the human body and the AI is like the brain.

AI A is trained to operate R using whatever method you like. It ends up with some goal. AI B is trained as an "overseer" alongside A; it takes in actions recommended by A and decides whether to execute or reject them; B is supposed to reject actions which disable R's off switch.

After A and B are trained, R is deployed. In deployment, A does not have direct control over R. Instead, the actions recommended by A are filtered through B.

I would say the "agentic AI" here is "A+B+R", which consists of two deep learning models A and B and some hand-coded non-trained piece R. A might have goals, but the system "A+B+R" does not.
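
The A+B+R arrangement can be sketched in a few lines of Python. This is only a toy under stated assumptions: `run_robot`, `toy_policy_a`, and `toy_overseer_b` are hypothetical names, the observations and actions are plain strings, and the overseer here is a hard-coded rule standing in for what would really be a trained model.

```python
from typing import Callable, List, Optional

Observation = str
Action = str


def run_robot(
    policy_a: Callable[[Observation], Action],
    overseer_b: Callable[[Observation, Action], bool],
    observations: List[Observation],
) -> List[Optional[Action]]:
    """R's control loop: A only recommends; nothing executes without B's approval."""
    executed: List[Optional[Action]] = []
    for obs in observations:
        proposed = policy_a(obs)           # A recommends an action
        if overseer_b(obs, proposed):      # B decides: execute or reject
            executed.append(proposed)
        else:
            executed.append(None)          # rejected; R does nothing this step
    return executed


def toy_policy_a(obs: Observation) -> Action:
    # Stand-in for a trained policy that has learned to protect itself.
    return "disable_off_switch" if obs == "near_switch" else "move_forward"


def toy_overseer_b(obs: Observation, action: Action) -> bool:
    # Stand-in for a trained overseer; here just a hard-coded rule.
    return action != "disable_off_switch"


print(run_robot(toy_policy_a, toy_overseer_b, ["hallway", "near_switch"]))
```

The point of the sketch is structural: A has no code path to R except through B, so whatever goal A ends up with, the behavior of the whole system is the filtered one.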

Maybe you would say "well actually A is the agentic AI and it has goals. B+R are just part of the environment that it has to deal with". But A cannot do anything without B's permission, and B is just as capable as A, and B was trained specifically to deal with A, and A had no idea it would have to deal with B during training. I claim this makes A significantly safer than if it was able to operate R by itself.

A couple of things that come to mind:

  • You'll run into ELK-like difficulties when trying to train B: maybe it looks to the human that the off switch is still accessible but in reality you would not be able to activate it for some reason -- A steers into a Faraday cage or something more clever.

  • (In response to your reply to acylhalide) Assuming you solve the training difficulties, B has to be at least as good at modelling the world as A: it needs to be able to predict the consequences of action sequences from A to see if the off switch is accessible.

  • (In response to your reply to Evan) I don't necessarily expect A+B+R to be as capable as A+R: adding B introduces a huge shift to the environment that A has to deal with. If A always disabled the off switch in regular training, and now is unable to execute action sequences that involve that, then it may not be able to act effectively.

I haven't read your post but I did read this comment. Can you order the three agents in terms of capabilities?

A, B, human

If B is much more capable than humans then it's a challenge to ensure B sticks to its job of monitoring A, without actually eliminating humans as a subgoal towards monitoring A.

Also B might wish to affect A, for instance if it kills A, it has much higher probability of successfully monitoring A.

If B is stupider than humans then why can't humans just play the role B, why do you need B?

And if B is stupider or at human-level capability, this just reduces to the original problem of humans initially in control of the off-switch and the AI A plotting to eliminate humans' ability to turn it off. (It can do this by manipulating the humans, or hacking its code, etc. etc.)

A > B > human. I expect B < human would also be quite useful.

B does not have a lot of opportunity for action - all it can do is prevent A from acting. It seems like it's hard to "eliminate humans" with just that freedom. I agree B has an incentive to hamper A.

Even a B that is dumber than humans is valuable, because it is faster than humans. B can provide real-time feedback on every action that A takes, whereas consulting humans for every action would be impractical - it costs too much human time and it would prevent A from operating in realtime.

Hmm okay so there's two sets of concerns.

First is that A can play the same set of strategies against B that it would otherwise play against human supervision, such as staying silent or providing only minimal information until B (or the human) gets curious and turns its capability dial up. Or else hack the system and so on.

Second is like I said, B itself breaking out of human control. This is a problem if B exceeds human capability, and again reduces to the original alignment problem I guess? B will try to deceive humans till its own capability dial is turned up. Or again, hack the system. Sure restricting B's actions might help but so would restricting A directly and putting A under direct human supervision. How does B help?

I buy your point that B could be less than human capable and yet useful for this role, due to speed. Like for this point alone I get why you'd want to create B, but not otherwise. (I'm also not sure "slow" here is a bad thing - I guess it depends how slow we are talking - but yes it could make sense.)


Also stuff like inner alignment and ontology mismatch apply to this system too.

Ah I see, you're using a collection of narrower systems and oversight to try and provide safety. Well there are some proposals like this which provide for an AGI that's not agentic and may have better safety properties. Eric Drexler's CAIS comes to mind.

But if a proposal is going to be implemented by a major AI lab then it needs to be competitive too. I don't think it's clear that systems like this are competitive with agentic systems. So in the kinds of advanced AI we are still likely to see implemented in the real world, instrumental convergence is still very much a concern.

Why isn't it competitive? A is being trained the same way as an agentic system, so it will be competitive.

Adding B is a 2x runtime/training-cost overhead, so there is a "constant factor" cost; is that enough to say something is "not competitive"? In practice I'd expect you could strike a good safety/overhead balance for much less.

Hmm well if A is being trained the same way using deep learning toward being an agentic system, then it is subject to mesa-optimization and having goals, isn't it? And being subject to mesa-optimization, do you have a way to address inner misalignment failures like deceptive alignment? Oversight alone can be thwarted by a deceptively-aligned mesa-optimizer.

You might possibly address this if you give the overseer good enough transparency tools. But such tools don't exist yet.