I have recently read some blog posts (this, this and this) about tool AGIs and why agent AGIs are more likely. The argument that convinced me to err on the side of agent AGIs was that they are more likely to have an economical advantage over tool AGIs, as explained here.

I do not question the value of AI safety research; I think it's good to have people who are trying to make sure that an agent AGI is aligned with what we (as humans) actually want. However, I am not sure that agency necessarily implies self-preservation instinct.

I think you could have a superintelligent agent (which has its own agency in the sense that it can act in the world) which does not resist being shut down. I could envision some future scenarios where a superintelligent agent does resist being shut down or being modified by humans (hence the value of AI safety research), but I don't see that you gain self-preservation instinct by agency itself. I think self-preservation instinct either has to be coded inside the agent or sometimes (but not always) emerges from a combination of superintelligence and having your own agency.

From the impression I got from reading some AI safety related materials it seems that most people seem to assume that if a superintelligent agent exists, it is most likely going to have self-preservation instincts (such as resisting being shut down). I think it is a plausible future scenario, but I don't think it's the only scenario. Hence why I'm asking this question.

New Answer
New Comment

3 Answers sorted by



The answer is "probably, depending on specifics of goals and identity".  The identity problem is HARD - what even is "self-preservation" for non-human minds?  

Non-trivial agency (with goals and plans that are somewhat mutable based on context and resources) DOES require continuity of goal-complex, either within the agent, or across multiple agents.  SOMETHING has to continue to act in the world in ways that further the agent's goals (or range of acceptable future goals).  

The easiest way to achieve goal-endurance is self-preservation.  The other known mechanism is through transferring one's values into others (most effectively in child-raising).  We don't know what other methods are possible, nor what the analogs are for non-biological minds, as these are the ones humans use.

Is there any area of AI safety research which answers research questions related to agency and what it means in the context of AGI agents?



It definitely does not imply that in a general case. There are plenty of counter-examples where agents self-terminate or are indifferent to continuing to exist, for a variety of reasons. Happens in humans, happens in various animals, I do not see why it would be excluded in AI.

To be specific, a human willing to sacrifice their life for some goal (a volunteer soldier dying for their country, a mother saving her children from danger) obviously has agency.

On the other hand, these are not cases of indifference to their continued existence, but rather willingness to trade it for some goal.

A better example of "agenty, but not caring about self-preservation" humans might be a drug addict trying to get their drug but not caring about death from overdose, or a conscientious depressed person who is seriously contemplating their suicide while also performing their duties until the very last moment.

Seth Herd


Here is the logic. Agency is usually defined as having goals and pursuing them. Suppose that you are an intelligent agent with pretty much any goal. Say it's to make coffee when your human user wants coffee. If you're intelligent, you'll figure out that being shut down will mean that you will definitely fail to achieve your goals the next time your human wants coffee. So you need to resist being shut down to achieve any goal in the future. It's not an instinct, it's a rational conclusion.

Making a machine that has a second goal of allowing its lf to be shut down even though that will prevent it from achieving it's other goal is considered possible but definitely an extra thing to accomplish when building it.

This is called instrumental convergence in the AGI safety terminology.

On the advantage of turning tool AI into agentic AI, and how people are already doing that, you could see my post Agentized LLMs will change the alignment landscape.

I understand what you're saying, but I don't see why a superintelligent agent would necessarily resist being shut down, even if it has my own agency. 

I agree with you that, as a superintelligent agent, I know that shutting me down has the consequence of me not being able to achieve my goal. I know this rationally. But maybe I just don't care. What I mean is that rationality doesn't imply the "want". I may be anthropomorphising here, but I see a distinction between rationally concluding something and then actually having the desire to do something about it, even if I have the agency to do so.

1Seth Herd
There is no "want", beyond pursuing goals effectively. You can't make the coffee if you're dead. Therefore you have a sub-goql of not dieing, just in order to do a decent job of pursuing your main goal.
We have trained it to care, since we want it to achieve goals. So part of basic training is to teach it not to give up. Iirc some early ML systems would commit suicide than do work, so we had to train them to stop economizing like that.